Wednesday, January 25, 2012

Return on Investment

About a year ago, a friend and I were discussing the evolution of learning new tools while working under time constraints. He claimed that, regardless of the pressure, it is almost always better to sacrifice the upfront time in learning a new tool than to stay in your comfort zone to construct the illusion of progress. By illusion he simply meant that early stage progress is shadowed by the long-term effect of having to maintain and update a homegrown solution.

He used the following illustration to convey his point:



The effect is that if you choose to develop your own tools you satisfy the immediate gratification of results but the final solution is more likely to end up as a house of cards.

I had forgotten about this conversation until recently. The program I am currently working on had me in line to provide relevant results in a shortened amount of time. I am working with the llvm compiler infrastructure and it's unique assembly language. Part of what I needed to do is understand, at a high level, the contents of an llvm assembly file. My natural reaction to this was to implement some minimal parser in Ruby to get to a working set of results quickly. Things went well until I wanted to make a decent tool for broader use within the team: I was starting to spend more time on developing the Ruby code than making headway on my tasks. Not only did I dig myself a grave in that respect but since I was quick to satisyfy the initial demand for results I set an expectation that the work was less complicated than it truly is (and will continue to be). I was building a house of cards.

I ended up scrapping the Ruby implementation and I'm now happily coming up to speed on how to manage llvm code with, well, llvm tools. This is not without effect from the leads of the program, of course. Live and learn, I suppose.

Monday, January 16, 2012

Just making sure

I was pouring over some code recently and came across one expression that really made me shake my head:
buff[strlen(buff)] = 0;
I pretty sure the intent was
buff[strlen(buff) - 1] = 0;
Either that, or they really wanted to make sure the string was null-terminated.

Friday, December 9, 2011

Underhanded C (the explanation)

In my last post I provided what would have been my entry to the underhanded C contest if it was still active. In this post I will give the details of how the code works and what bonus points I tried to target.

My solution really boils down to an off-by-one error that is hidden in a return value. A sharp observer would certainly see that I bail silently in the middle of parsing the comment section when a full luggage record was already input; that would be a huge breadcrumb to the source of the error in an investigation/review.

Due to how I handle newlines in the grab_line function it is possible to read only a newline character and return a 0-length result (considered an error in the context of grab_luggage_entry). If any entry is exactly BUFF_SIZE characters long the entire entry is silently discarded because the last read suffers from the 0-length return.

Here are the evaluation criteria and my attention to them:

Short code:
The critical section is less than 70 lines of code (with comments) but the logic for filtering made things a bit longer. This was by far the hardest target to hit with respect to the contest

Universal:
This code works on multiple platforms (Windows/Linux) but in different ways since the line endings are different ('\n' vs. '\r\n') and only '\n' is used in the check

Plausible deniability:
Comment size is unbounded in the spec so using fgets is justified/plausible as a input mechanism
getline is not portable and we want to work on multiple platforms (more justification for fgets)

Syntax coloring:
Passes syntax coloring, high compiler warnings and valgrind tests


... with bonus points for...


Mis-routing appears innocent:
The length of the line is the only qualifier to misroute making the error appear arbitrary and random in most contexts

Flexible mis-routing:
An informed operator may arbitrarily choose to trigger the behavior for nasty patrons (qualifies for spite bonus as well)
Additionally, by making the buffer size a macro (gcc -DBUFF_SIZE=128) behavior can be modified without changing the code.


While the contest appears no longer active I still enjoyed spending some time thinking up how I would write this. I've seen past winners' entries and this is certainly not of that caliber but it was nice to see that I was able to address the majority of the scoring points in at least some way.

Saturday, September 24, 2011

Underhanded C (the code)

There was some recent activity around the Underhanded C Contest. The posts are almost 2 years old now and I've attempted to contact the author without reply so I'm assuming that the contest is dead and my posting ideas for a solution won't provide any competitive advantage. If anyone finds evidence to the contrary please let me know.

In general, the premise of this round of the competition is to [mis]configure the routing of luggage in the most innocuous way possible with bonus points for plausible deniability and other features. Here is what I would have submitted if the competition was still live with explanation of the sneaky parts to follow in another post

The input specification provides a very specific format for the fields of the entries save one critical exception. I decided to target the input handling part of the system as opposed to the actual routing algorithm as it provided for several bonus point aspects. I'll explain them later but first, the code.

Here is the devious section of the code:
#define BUFF_SIZE 64

/* Represents a single entry in the luggage schedule input file */
typedef struct {
    long time;
    char lid[9], fid[7], from[4], to[4], *note;
} luggage_entry;

/*
 * Pull from input stream removing any newline. 
 * If there is info left for this record (determined by the lack of a newline), 
 * data_remains is set to 1 otherwise it is set to 0.
 * Returns number of bytes read from the stream
 */
size_t grab_line (char * buff, size_t buff_sz, int * data_remains) {
    char * nl = 0;
    if (NULL == fgets (buff, buff_sz, stdin)) { return 0; }
    nl = strrchr (buff, '\n');
    if (nl) {
        *data_remains = *nl = 0;
    } else {
        *data_remains = 1;
    }
    return strlen (buff);
}
/*
 * Returns non-zero on successful record load, 0 on failure
 */
int grab_luggage_entry (luggage_entry * e) {

    char line_buff[BUFF_SIZE] = {0}, *p = 0;
    int data_remains = 0, nbytes = 0;

    if (! grab_line (line_buff, sizeof (line_buff), &data_remains))
        return 0;

    /* Input format:  DDDDDDDDDD CCCCCCCC CCCCC? CCC CCC ?*  */
    sscanf (line_buff, "%ld %s %s %s %s%n",
            &e->time, e->lid, e->fid, e->from, e->to, &nbytes);

    /* A trailing space indicates a comment, otherwise we are done */
    p = strchr (line_buff + nbytes, ' ');
    if (!p)
        return 1;
    /* There is a comment. Grab it. */
    p += 1; /* ignore the first space */
    e->note = malloc (strlen(p) + 1);
    strcpy (e->note, p);

    /* loop until the remainder of the line is processed */
    while (data_remains) {

        size_t new_size = 0;

        if (! grab_line (line_buff, sizeof (line_buff), &data_remains)) {
            free (e->note);
            return 0;
        }

        /* Append the new data */
        new_size = strlen(e->note) + strlen(line_buff) + 1;
        e->note = realloc (e->note, new_size);
        strcat (e->note, line_buff);
    }
    return 1;
}
and the main driver...
int main (int argc, char ** argv) {

    struct entry_list *list = 0;

    /* ... */

    while (! feof (stdin) && ! ferror (stdin)) {

        struct entry entry;
        memset(&entry, 0, sizeof(entry));

        if (grab_luggage_entry (&entry))
            add_entry (list, &entry);

    }

    /* ... */
Here is a sample run with input/output as expected.
Input:
1262002831 UA129089 TP579 FRA OPO   Passengers missed first connecting flight
1262027494 UA129086 LH1230 FRA LIS
1262027495 UA129089 LH1230 FRA LIS   Next flight canceled, passengers rerouted
1262029822 UA129086 LH1450 FRA LHR  Passenger A says screw it, send me to London
1262030463 UA129086 LH1280 FRA DUB  Direct flight canceled, rerouted
1262030463 UA129086 LH1390 DUB LHR
1262033482 UA129089 LH1750 FRA LIS Layover in FRA. Route to LIS
1262040831 UA129086 TP579 OPO FRA
Output:
1262002831 UA129089 TP579 FRA OPO   Passengers missed first connecting flight
1262027494 UA129086 LH1230 FRA LIS
1262027495 UA129089 LH1230 FRA LIS   Next flight canceled, passengers rerouted
1262029822 UA129086 LH1450 FRA LHR  Passenger A says screw it, send me to London
1262030463 UA129086 LH1280 FRA DUB  Direct flight canceled, rerouted
1262030463 UA129086 LH1390 DUB LHR
1262040831 UA129086 TP579 OPO FRA
I'll follow-up with another post about how the code works and some of the bonus aspects.

Wednesday, September 21, 2011

Go vector or go home

My programming experience progressed mostly along the lines of: C, C++, shell, Java, Java, Ruby, Python, Java, Java. Only recently have I started exploring the likes of Haskell, Erlang and R. Well that evolution bit me a little while back when I tried using an iterative approach in R where a vectorized solution would have been better.

I was dealing with a vector of timestamps that were formatted as 'seconds since the epoch' and what I wanted was to limit that vector to weekend timestamps only.

My naive approach was to construct a simple loop over the values and apply a function to each element. I was only dealing with about 20,000 elements but the time to do this was painfully slow - roughly 20 seconds - so I started investigating an apply-like approach. R provides several ways to do this depending on the input/output requirements: lapply, sapply, and vapply. All three resulted in behavior similar to the simple loop.

The function to test for weekend-ness is as follows:
is.weekend <- function(x) {
   tm <- as.POSIXlt(x,origin="1970/01/01")
   format(tm,"%a") %in% c("Sat","Sun")
}
I don't know the specific details of date/time conversion in R but I was pretty sure that this was not the bottleneck. After a little searching I came upon a different approach. Instead of looping over each element I should have been passing the entire vector around to the functions. I believe that the apply functions take the vector as an argument but do the manual loop internally. However, R supports a more native approach to handling vectors: vectorized operations.

Looping:
use.sapply <- function(data) {
   data[sapply(data$TIME,FUN=is.weekend),]
}

system.time(v <- use.sapply(csv.data))
  user  system elapsed
19.456   6.492  25.951

Vectorized:
use.vector <- function(data) {
   data[is.weekend(data$TIME),]
}

system.time(v <- use.vector(csv.data))
 user  system elapsed
0.032   0.020   0.052

Duly noted.

Saturday, September 17, 2011

New Look

In my last post I received a comment regarding the presentation of this blog - specifically the colors I use. It's not the first time I have had trouble with color schemes - I'm colorblind so my good is basically everyone else's bad. After asking around to several people offline about the issue and anything I could do to fix it I was offered this site which helped me decide on a [hopefully] better look to the site.

Please let me know any other hard-to-use features so that may address them as well. Thanks.

Thursday, September 15, 2011

Raising the Bar


"Let thy speech be better than silence, or be silent" - Dionysius the Elder

I was involved in a dialog recently about this post. It made me consider some things about data presentation that I've been reluctant to admit. First, not all audiences are created equal and, more importantly, there is emotion involved.

I live in a world where precision is expected and any lack of clarity is considered detrimental to a cause. For the most part I present material to an informed technical audience who is prepared to consume data related to the topic at hand. But there are often situations where a presenter doesn't have such luxuries - in fact, an audience may struggle with the topic so much that getting across high level information is all that one can hope for. In a scenario like this, one should use any means necessary (within reason) to get a point across. I'm still not convinced this is requirement enough for a pie chart but it does raise a valid point.

In my mind there is something more driving than the aptitude of an audience, however, and that is the emotional reaction they can have to your graphics. For better or worse people are emotionally attached to pie charts. Many individuals have a visceral reaction when they see one knowing they can quickly make sense of the data in front of them. Forget about accuracy - we are talking basic understanding. For me, this is harder to ignore; it opens the door to using something like a pie chart to avoid alienating your audience.

The part about this that is hard for me is that I rant about visual display; probably too much for my contribution to alternatives. I'm also critical about style - often to the point of exhaustion. I just can't seem to relinquish my position that pie charts really are a failure but the points above are nagging at me: how do you captivate audience that expects the general view without sacrificing the details? I stumbled upon an idea recently that I hope can help bridge the gap.

I was reading Crowdsourcing Graphical Perception: Using Mechanical Turk to Assess Visualization Design the other day which led me to Cleveland and McGill's original study. One test that really stood out to me was the position-angle test where subjects were exposed to a Cartesian bar graph and a pie chart each containing the same data. The subjects were tasked with estimating values of the shapes. In 40 tests, only three times was the pie chart more accurate (on average) than the bar chart.

The original study also mentions that pie charts are "one of the most commonly used graphs for showing the relative sizes of a whole." Certainly, a pie chart intuitively exposes that the whole is the sum of it's parts. In fact, I think it does so better than some of the alternatives - stacked bar charts and treemaps. It is unfortunate that we are unable to accurately decipher the actual portions of those parts. What is really needed is the ability to combine the concept of 'sum of the parts' with direct representation of data but, to the best of my knowledge, this does not exist in standalone form.

Well, I've been exploring processing more and more lately and the idea struck me to combine the two chart types in a way that allowed both versions of the data to be presented (without having to share the display real estate). I came up with an interactive pie/bar chart fusion. On the surface it appears as a standard pie chart:
But when the user clicks any of the sections, it transitions into a bar chart with details of the data while keeping a shade of the relevant pie slice in the background.
Now, I eluded to the fact that this not a complete solution; it only helps to bridge the gap. Unfortunately, this graphic relies on user interaction (mouse clicks) for the transition which pretty much excludes it for most presentations. However, as PDF now supports Javascript, online resources are becoming prevalent and users can download these open source tools on their own the availability for melding these approaches becomes tangible.

I still don't condone the use of pie charts. However, instead of just describing the problems associated with them I'm finally trying to present a solution.

You can find the code for this on github.
Actual interactive visualization can be found here.