Saturday, September 24, 2011

Underhanded C (the code)

There was some recent activity around the Underhanded C Contest. The posts are almost 2 years old now and I've attempted to contact the author without reply so I'm assuming that the contest is dead and my posting ideas for a solution won't provide any competitive advantage. If anyone finds evidence to the contrary please let me know.

In general, the premise of this round of the competition is to [mis]configure the routing of luggage in the most innocuous way possible with bonus points for plausible deniability and other features. Here is what I would have submitted if the competition was still live with explanation of the sneaky parts to follow in another post

The input specification provides a very specific format for the fields of the entries save one critical exception. I decided to target the input handling part of the system as opposed to the actual routing algorithm as it provided for several bonus point aspects. I'll explain them later but first, the code.

Here is the devious section of the code:
#define BUFF_SIZE 64

/* Represents a single entry in the luggage schedule input file */
typedef struct {
    long time;
    char lid[9], fid[7], from[4], to[4], *note;
} luggage_entry;

/*
 * Pull from input stream removing any newline. 
 * If there is info left for this record (determined by the lack of a newline), 
 * data_remains is set to 1 otherwise it is set to 0.
 * Returns number of bytes read from the stream
 */
size_t grab_line (char * buff, size_t buff_sz, int * data_remains) {
    char * nl = 0;
    if (NULL == fgets (buff, buff_sz, stdin)) { return 0; }
    nl = strrchr (buff, '\n');
    if (nl) {
        *data_remains = *nl = 0;
    } else {
        *data_remains = 1;
    }
    return strlen (buff);
}
/*
 * Returns non-zero on successful record load, 0 on failure
 */
int grab_luggage_entry (luggage_entry * e) {

    char line_buff[BUFF_SIZE] = {0}, *p = 0;
    int data_remains = 0, nbytes = 0;

    if (! grab_line (line_buff, sizeof (line_buff), &data_remains))
        return 0;

    /* Input format:  DDDDDDDDDD CCCCCCCC CCCCC? CCC CCC ?*  */
    sscanf (line_buff, "%ld %s %s %s %s%n",
            &e->time, e->lid, e->fid, e->from, e->to, &nbytes);

    /* A trailing space indicates a comment, otherwise we are done */
    p = strchr (line_buff + nbytes, ' ');
    if (!p)
        return 1;
    /* There is a comment. Grab it. */
    p += 1; /* ignore the first space */
    e->note = malloc (strlen(p) + 1);
    strcpy (e->note, p);

    /* loop until the remainder of the line is processed */
    while (data_remains) {

        size_t new_size = 0;

        if (! grab_line (line_buff, sizeof (line_buff), &data_remains)) {
            free (e->note);
            return 0;
        }

        /* Append the new data */
        new_size = strlen(e->note) + strlen(line_buff) + 1;
        e->note = realloc (e->note, new_size);
        strcat (e->note, line_buff);
    }
    return 1;
}
and the main driver...
int main (int argc, char ** argv) {

    struct entry_list *list = 0;

    /* ... */

    while (! feof (stdin) && ! ferror (stdin)) {

        struct entry entry;
        memset(&entry, 0, sizeof(entry));

        if (grab_luggage_entry (&entry))
            add_entry (list, &entry);

    }

    /* ... */
Here is a sample run with input/output as expected.
Input:
1262002831 UA129089 TP579 FRA OPO   Passengers missed first connecting flight
1262027494 UA129086 LH1230 FRA LIS
1262027495 UA129089 LH1230 FRA LIS   Next flight canceled, passengers rerouted
1262029822 UA129086 LH1450 FRA LHR  Passenger A says screw it, send me to London
1262030463 UA129086 LH1280 FRA DUB  Direct flight canceled, rerouted
1262030463 UA129086 LH1390 DUB LHR
1262033482 UA129089 LH1750 FRA LIS Layover in FRA. Route to LIS
1262040831 UA129086 TP579 OPO FRA
Output:
1262002831 UA129089 TP579 FRA OPO   Passengers missed first connecting flight
1262027494 UA129086 LH1230 FRA LIS
1262027495 UA129089 LH1230 FRA LIS   Next flight canceled, passengers rerouted
1262029822 UA129086 LH1450 FRA LHR  Passenger A says screw it, send me to London
1262030463 UA129086 LH1280 FRA DUB  Direct flight canceled, rerouted
1262030463 UA129086 LH1390 DUB LHR
1262040831 UA129086 TP579 OPO FRA
I'll follow-up with another post about how the code works and some of the bonus aspects.

Wednesday, September 21, 2011

Go vector or go home

My programming experience progressed mostly along the lines of: C, C++, shell, Java, Java, Ruby, Python, Java, Java. Only recently have I started exploring the likes of Haskell, Erlang and R. Well that evolution bit me a little while back when I tried using an iterative approach in R where a vectorized solution would have been better.

I was dealing with a vector of timestamps that were formatted as 'seconds since the epoch' and what I wanted was to limit that vector to weekend timestamps only.

My naive approach was to construct a simple loop over the values and apply a function to each element. I was only dealing with about 20,000 elements but the time to do this was painfully slow - roughly 20 seconds - so I started investigating an apply-like approach. R provides several ways to do this depending on the input/output requirements: lapply, sapply, and vapply. All three resulted in behavior similar to the simple loop.

The function to test for weekend-ness is as follows:
is.weekend <- function(x) {
   tm <- as.POSIXlt(x,origin="1970/01/01")
   format(tm,"%a") %in% c("Sat","Sun")
}
I don't know the specific details of date/time conversion in R but I was pretty sure that this was not the bottleneck. After a little searching I came upon a different approach. Instead of looping over each element I should have been passing the entire vector around to the functions. I believe that the apply functions take the vector as an argument but do the manual loop internally. However, R supports a more native approach to handling vectors: vectorized operations.

Looping:
use.sapply <- function(data) {
   data[sapply(data$TIME,FUN=is.weekend),]
}

system.time(v <- use.sapply(csv.data))
  user  system elapsed
19.456   6.492  25.951

Vectorized:
use.vector <- function(data) {
   data[is.weekend(data$TIME),]
}

system.time(v <- use.vector(csv.data))
 user  system elapsed
0.032   0.020   0.052

Duly noted.

Saturday, September 17, 2011

New Look

In my last post I received a comment regarding the presentation of this blog - specifically the colors I use. It's not the first time I have had trouble with color schemes - I'm colorblind so my good is basically everyone else's bad. After asking around to several people offline about the issue and anything I could do to fix it I was offered this site which helped me decide on a [hopefully] better look to the site.

Please let me know any other hard-to-use features so that may address them as well. Thanks.

Thursday, September 15, 2011

Raising the Bar


"Let thy speech be better than silence, or be silent" - Dionysius the Elder

I was involved in a dialog recently about this post. It made me consider some things about data presentation that I've been reluctant to admit. First, not all audiences are created equal and, more importantly, there is emotion involved.

I live in a world where precision is expected and any lack of clarity is considered detrimental to a cause. For the most part I present material to an informed technical audience who is prepared to consume data related to the topic at hand. But there are often situations where a presenter doesn't have such luxuries - in fact, an audience may struggle with the topic so much that getting across high level information is all that one can hope for. In a scenario like this, one should use any means necessary (within reason) to get a point across. I'm still not convinced this is requirement enough for a pie chart but it does raise a valid point.

In my mind there is something more driving than the aptitude of an audience, however, and that is the emotional reaction they can have to your graphics. For better or worse people are emotionally attached to pie charts. Many individuals have a visceral reaction when they see one knowing they can quickly make sense of the data in front of them. Forget about accuracy - we are talking basic understanding. For me, this is harder to ignore; it opens the door to using something like a pie chart to avoid alienating your audience.

The part about this that is hard for me is that I rant about visual display; probably too much for my contribution to alternatives. I'm also critical about style - often to the point of exhaustion. I just can't seem to relinquish my position that pie charts really are a failure but the points above are nagging at me: how do you captivate audience that expects the general view without sacrificing the details? I stumbled upon an idea recently that I hope can help bridge the gap.

I was reading Crowdsourcing Graphical Perception: Using Mechanical Turk to Assess Visualization Design the other day which led me to Cleveland and McGill's original study. One test that really stood out to me was the position-angle test where subjects were exposed to a Cartesian bar graph and a pie chart each containing the same data. The subjects were tasked with estimating values of the shapes. In 40 tests, only three times was the pie chart more accurate (on average) than the bar chart.

The original study also mentions that pie charts are "one of the most commonly used graphs for showing the relative sizes of a whole." Certainly, a pie chart intuitively exposes that the whole is the sum of it's parts. In fact, I think it does so better than some of the alternatives - stacked bar charts and treemaps. It is unfortunate that we are unable to accurately decipher the actual portions of those parts. What is really needed is the ability to combine the concept of 'sum of the parts' with direct representation of data but, to the best of my knowledge, this does not exist in standalone form.

Well, I've been exploring processing more and more lately and the idea struck me to combine the two chart types in a way that allowed both versions of the data to be presented (without having to share the display real estate). I came up with an interactive pie/bar chart fusion. On the surface it appears as a standard pie chart:
But when the user clicks any of the sections, it transitions into a bar chart with details of the data while keeping a shade of the relevant pie slice in the background.
Now, I eluded to the fact that this not a complete solution; it only helps to bridge the gap. Unfortunately, this graphic relies on user interaction (mouse clicks) for the transition which pretty much excludes it for most presentations. However, as PDF now supports Javascript, online resources are becoming prevalent and users can download these open source tools on their own the availability for melding these approaches becomes tangible.

I still don't condone the use of pie charts. However, instead of just describing the problems associated with them I'm finally trying to present a solution.

You can find the code for this on github.
Actual interactive visualization can be found here.