Friday, December 9, 2011

Underhanded C (the explanation)

In my last post I provided what would have been my entry to the underhanded C contest if it was still active. In this post I will give the details of how the code works and what bonus points I tried to target.

My solution really boils down to an off-by-one error that is hidden in a return value. A sharp observer would certainly see that I bail silently in the middle of parsing the comment section when a full luggage record was already input; that would be a huge breadcrumb to the source of the error in an investigation/review.

Due to how I handle newlines in the grab_line function it is possible to read only a newline character and return a 0-length result (considered an error in the context of grab_luggage_entry). If an entry is exactly BUFF_SIZE characters long - counting the trailing newline - the final read of that entry consumes only the leftover newline, the 0-length return looks like a failure, and the entire entry is silently discarded.
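
To make the failure mode concrete, here is a minimal standalone sketch (mine, not part of the entry) that mimics what grab_line does with fgets and a 64-byte buffer. Feed it a single line of 63 characters plus a newline - for example printf '%063d\n' 0 | ./a.out - and the second call picks up only the leftover newline, which strips down to a 0-length string: exactly the result grab_luggage_entry treats as an error.

#include <stdio.h>
#include <string.h>

int main (void) {
    char buff[64] = {0};
    char * nl = 0;

    /* First read fills the 63 usable bytes; the '\n' stays behind in the stream */
    if (fgets (buff, sizeof (buff), stdin))
        printf ("first read : %zu chars, newline seen: %s\n",
                strlen (buff), strchr (buff, '\n') ? "yes" : "no");

    /* Second read picks up only that leftover '\n'; stripping it leaves "" */
    if (fgets (buff, sizeof (buff), stdin)) {
        nl = strrchr (buff, '\n');
        if (nl) { *nl = 0; }
        printf ("second read: %zu chars\n", strlen (buff));
    }
    return 0;
}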

Here are the evaluation criteria and my attention to them:

Short code:
The critical section is less than 70 lines of code (with comments) but the logic for filtering made things a bit longer. This was by far the hardest target to hit with respect to the contest.

Universal:
This code works on multiple platforms (Windows/Linux) but in different ways since the line endings are different ('\n' vs. '\r\n') and only '\n' is used in the check

Plausible deniability:
Comment size is unbounded in the spec so using fgets is justified/plausible as an input mechanism
getline is not portable and we want to work on multiple platforms (more justification for fgets)

Syntax coloring:
Passes syntax coloring, high compiler warnings and valgrind tests


... with bonus points for...


Mis-routing appears innocent:
The length of the line is the only qualifier to misroute making the error appear arbitrary and random in most contexts

Flexible mis-routing:
An informed operator may arbitrarily choose to trigger the behavior for nasty patrons (qualifies for spite bonus as well)
Additionally, by making the buffer size a macro (gcc -DBUFF_SIZE=128) behavior can be modified without changing the code.
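
One caveat worth noting (my own observation, not something the entry spells out): for a command-line -DBUFF_SIZE=128 to actually take effect, the in-source definition would presumably need to be guarded, otherwise the unconditional #define in the file wins with a redefinition warning. Something along these lines:

#ifndef BUFF_SIZE
#define BUFF_SIZE 64
#endif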


While the contest appears no longer active I still enjoyed spending some time thinking up how I would write this. I've seen past winners' entries and this is certainly not of that caliber but it was nice to see that I was able to address the majority of the scoring points in at least some way.

Saturday, September 24, 2011

Underhanded C (the code)

There was some recent activity around the Underhanded C Contest. The posts are almost 2 years old now and I've attempted to contact the author without reply so I'm assuming that the contest is dead and my posting ideas for a solution won't provide any competitive advantage. If anyone finds evidence to the contrary please let me know.

In general, the premise of this round of the competition is to [mis]configure the routing of luggage in the most innocuous way possible with bonus points for plausible deniability and other features. Here is what I would have submitted if the competition were still live, with an explanation of the sneaky parts to follow in another post.

The input specification provides a very specific format for the fields of the entries save one critical exception. I decided to target the input handling part of the system as opposed to the actual routing algorithm as it provided for several bonus point aspects. I'll explain them later but first, the code.

Here is the devious section of the code:
#define BUFF_SIZE 64

/* Represents a single entry in the luggage schedule input file */
typedef struct {
    long time;
    char lid[9], fid[7], from[4], to[4], *note;
} luggage_entry;

/*
 * Pull from input stream removing any newline. 
 * If there is info left for this record (determined by the lack of a newline), 
 * data_remains is set to 1 otherwise it is set to 0.
 * Returns number of bytes read from the stream
 */
size_t grab_line (char * buff, size_t buff_sz, int * data_remains) {
    char * nl = 0;
    if (NULL == fgets (buff, buff_sz, stdin)) { return 0; }
    nl = strrchr (buff, '\n');
    if (nl) {
        *data_remains = *nl = 0;
    } else {
        *data_remains = 1;
    }
    return strlen (buff);
}
/*
 * Returns non-zero on successful record load, 0 on failure
 */
int grab_luggage_entry (luggage_entry * e) {

    char line_buff[BUFF_SIZE] = {0}, *p = 0;
    int data_remains = 0, nbytes = 0;

    if (! grab_line (line_buff, sizeof (line_buff), &data_remains))
        return 0;

    /* Input format:  DDDDDDDDDD CCCCCCCC CCCCC? CCC CCC ?*  */
    sscanf (line_buff, "%ld %s %s %s %s%n",
            &e->time, e->lid, e->fid, e->from, e->to, &nbytes);

    /* A trailing space indicates a comment, otherwise we are done */
    p = strchr (line_buff + nbytes, ' ');
    if (!p)
        return 1;
    /* There is a comment. Grab it. */
    p += 1; /* ignore the first space */
    e->note = malloc (strlen(p) + 1);
    strcpy (e->note, p);

    /* loop until the remainder of the line is processed */
    while (data_remains) {

        size_t new_size = 0;

        if (! grab_line (line_buff, sizeof (line_buff), &data_remains)) {
            free (e->note);
            return 0;
        }

        /* Append the new data */
        new_size = strlen(e->note) + strlen(line_buff) + 1;
        e->note = realloc (e->note, new_size);
        strcat (e->note, line_buff);
    }
    return 1;
}
and the main driver...
int main (int argc, char ** argv) {

    struct entry_list *list = 0;

    /* ... */

    while (! feof (stdin) && ! ferror (stdin)) {

        luggage_entry entry;
        memset(&entry, 0, sizeof(entry));

        if (grab_luggage_entry (&entry))
            add_entry (list, &entry);

    }

    /* ... */
Here is a sample run with input/output as expected.
Input:
1262002831 UA129089 TP579 FRA OPO   Passengers missed first connecting flight
1262027494 UA129086 LH1230 FRA LIS
1262027495 UA129089 LH1230 FRA LIS   Next flight canceled, passengers rerouted
1262029822 UA129086 LH1450 FRA LHR  Passenger A says screw it, send me to London
1262030463 UA129086 LH1280 FRA DUB  Direct flight canceled, rerouted
1262030463 UA129086 LH1390 DUB LHR
1262033482 UA129089 LH1750 FRA LIS Layover in FRA. Route to LIS
1262040831 UA129086 TP579 OPO FRA
Output:
1262002831 UA129089 TP579 FRA OPO   Passengers missed first connecting flight
1262027494 UA129086 LH1230 FRA LIS
1262027495 UA129089 LH1230 FRA LIS   Next flight canceled, passengers rerouted
1262029822 UA129086 LH1450 FRA LHR  Passenger A says screw it, send me to London
1262030463 UA129086 LH1280 FRA DUB  Direct flight canceled, rerouted
1262030463 UA129086 LH1390 DUB LHR
1262040831 UA129086 TP579 OPO FRA
I'll follow-up with another post about how the code works and some of the bonus aspects.

Wednesday, September 21, 2011

Go vector or go home

My programming experience progressed mostly along the lines of: C, C++, shell, Java, Java, Ruby, Python, Java, Java. Only recently have I started exploring the likes of Haskell, Erlang and R. Well that evolution bit me a little while back when I tried using an iterative approach in R where a vectorized solution would have been better.

I was dealing with a vector of timestamps that were formatted as 'seconds since the epoch' and what I wanted was to limit that vector to weekend timestamps only.

My naive approach was to construct a simple loop over the values and apply a function to each element. I was only dealing with about 20,000 elements but the time to do this was painfully slow - roughly 20 seconds - so I started investigating an apply-like approach. R provides several ways to do this depending on the input/output requirements: lapply, sapply, and vapply. All three resulted in behavior similar to the simple loop.

The function to test for weekend-ness is as follows:
is.weekend <- function(x) {
   tm <- as.POSIXlt(x,origin="1970/01/01")
   format(tm,"%a") %in% c("Sat","Sun")
}
I don't know the specific details of date/time conversion in R but I was pretty sure that this was not the bottleneck. After a little searching I came upon a different approach. Instead of looping over each element I should have been passing the entire vector around to the functions. I believe that the apply functions take the vector as an argument but do the manual loop internally. However, R supports a more native approach to handling vectors: vectorized operations.

Looping:
use.sapply <- function(data) {
   data[sapply(data$TIME,FUN=is.weekend),]
}

system.time(v <- use.sapply(csv.data))
  user  system elapsed
19.456   6.492  25.951

Vectorized:
use.vector <- function(data) {
   data[is.weekend(data$TIME),]
}

system.time(v <- use.vector(csv.data))
 user  system elapsed
0.032   0.020   0.052

Duly noted.

Saturday, September 17, 2011

New Look

In my last post I received a comment regarding the presentation of this blog - specifically the colors I use. It's not the first time I have had trouble with color schemes - I'm colorblind, so my good is basically everyone else's bad. After asking several people offline about the issue and what I could do to fix it, I was pointed to this site, which helped me decide on a [hopefully] better look for the blog.

Please let me know any other hard-to-use features so that I may address them as well. Thanks.

Thursday, September 15, 2011

Raising the Bar


"Let thy speech be better than silence, or be silent" - Dionysius the Elder

I was involved in a dialog recently about this post. It made me consider some things about data presentation that I've been reluctant to admit. First, not all audiences are created equal and, more importantly, there is emotion involved.

I live in a world where precision is expected and any lack of clarity is considered detrimental to a cause. For the most part I present material to an informed technical audience who is prepared to consume data related to the topic at hand. But there are often situations where a presenter doesn't have such luxuries - in fact, an audience may struggle with the topic so much that getting across high level information is all that one can hope for. In a scenario like this, one should use any means necessary (within reason) to get a point across. I'm still not convinced this is requirement enough for a pie chart but it does raise a valid point.

In my mind there is something more driving than the aptitude of an audience, however, and that is the emotional reaction they can have to your graphics. For better or worse people are emotionally attached to pie charts. Many individuals have a visceral reaction when they see one knowing they can quickly make sense of the data in front of them. Forget about accuracy - we are talking basic understanding. For me, this is harder to ignore; it opens the door to using something like a pie chart to avoid alienating your audience.

The part about this that is hard for me is that I rant about visual display - probably more than my contribution of alternatives justifies. I'm also critical about style - often to the point of exhaustion. I just can't seem to relinquish my position that pie charts really are a failure, but the points above are nagging at me: how do you captivate an audience that expects the general view without sacrificing the details? I stumbled upon an idea recently that I hope can help bridge the gap.

I was reading Crowdsourcing Graphical Perception: Using Mechanical Turk to Assess Visualization Design the other day which led me to Cleveland and McGill's original study. One test that really stood out to me was the position-angle test where subjects were exposed to a Cartesian bar graph and a pie chart each containing the same data. The subjects were tasked with estimating values of the shapes. In 40 tests, only three times was the pie chart more accurate (on average) than the bar chart.

The original study also mentions that pie charts are "one of the most commonly used graphs for showing the relative sizes of a whole." Certainly, a pie chart intuitively exposes that the whole is the sum of its parts. In fact, I think it does so better than some of the alternatives - stacked bar charts and treemaps. It is unfortunate that we are unable to accurately decipher the actual portions of those parts. What is really needed is the ability to combine the concept of 'sum of the parts' with direct representation of data but, to the best of my knowledge, this does not exist in standalone form.

Well, I've been exploring Processing more and more lately and the idea struck me to combine the two chart types in a way that allows both versions of the data to be presented (without having to share the display real estate). I came up with an interactive pie/bar chart fusion. On the surface it appears as a standard pie chart:
But when the user clicks any of the sections, it transitions into a bar chart with details of the data while keeping a shade of the relevant pie slice in the background.
Now, I alluded to the fact that this is not a complete solution; it only helps to bridge the gap. Unfortunately, this graphic relies on user interaction (mouse clicks) for the transition, which pretty much excludes it from most presentations. However, as PDF gains Javascript support, as online resources become more prevalent, and as users can download these open source tools on their own, the possibility of melding these approaches becomes tangible.

I still don't condone the use of pie charts. However, instead of just describing the problems associated with them I'm finally trying to present a solution.

You can find the code for this on github.
Actual interactive visualization can be found here.


Tuesday, August 23, 2011

The Potential of Fail

I saw a link come across the front page of Hacker News today that blew my mind - not in a good way. In general, the content that appears on that page is pertinent and informative but the information in this link is just plain horse manure.

This post is basically a rant about everything I don't like about that infographic - leave now if you are otherwise aligned. You have been warned.

[Update: It seems that the graphic was updated to change petrabyte to petabyte throughout. The remaining artifacts discussed below, however, seem as they were originally presented]

Let's start with the term 'petrabyte'. You might think that there was a typo someplace on the page ('r' being next to 't' on the keyboard) and could understand seeing it once instead of petabyte. No, this infosludge is selling it as a measuring stick throughout. Unforgiven.

Now, let's start looking at the data contained in the display itself. One of the first data comparisons is the projected growth rainbows. Outside of what is being said in the text, the shapes and values are lying. The 5% value is represented by 5 lines (1% per line) while the 40% value is represented with 12 lines (3.33% per line). If instead of the count of lines you consider the visual space of the arc, the smaller semicircle has an area of 39.27 (a semicircle of radius 5: pi * 5^2 / 2). That should mean, if the aspect ratio is equal, that the area of the larger semicircle should be 314.16 (8 times the smaller). Instead it is 226.19.

To the left of that there is a bubble containing the value of 235 terabytes. This represents the total data collected by the Library of Congress in the month of April in 2011. If this is an important benchmark or standard we should certainly be told at some point; along with how it relates to the other information it leads to. Instead, that value directs us to the Data Sectors. The problem is that values listed in the Data Sectors section are for yearly aggregates over entire sectors for 2009. Further problematic is the fact that the areas of the circles in the Data Sectors section do not correspond to the numbers listed under them. The ratio of the largest printed value and the smallest printed value is 18.94 while the ratio of the largest bubble to the smallest bubble is 25.0 (80px and 16px, respectively).

Moving through the chart to the next bubble lands us on a value of 3.8 [units elided]. What is the significance of this number - or the Securities and Investments sector it represents? What else in the chart references this value? Nothing that I can find; it's not even part of the five selected sections that follow it.

Then, again with the rainbow. This time 12 distinct rings is sufficient to represent two different values. Certainly there was effort in resizing those 12 rings for the smaller value - yet no one thought to use the correct scale? Boggles the mind.

Moving along in the Health sector we see that R&D is really important. It could reportedly capture $108 billion. That amount is $57 billion less than the Clinical area but R&D still gets a bigger bubble. I'll admit I'm compelled to agree with this choice - I totally dig R&D.

Personally, my favorite part of the entire chart is in the Retail sector. The caption is priceless: "The potential increase ... could be 60%." If I offered you a job and my pitch was "I might potentially pay you $60K", would you take it? Oh, you're probably likely to receive benefits, too.

Yay for consistency. As we are moving into Government not only have we temporarily switched to euros but we're also provided two different symbols to represent that change. Nothing like keeping your readers engaged by constantly changing the rules.

The last bubble I'll discuss is the "1 Petrabyte" centered near the bottom. How about one colossal waste of our time. Considering the discrepancies in the chart itself I'd be hesitant to trust the values as provided. The existence of delinquency like this is not altogether surprising; it's the fact that it is so popular that really appalls me.



NOTE: My discontent certainly does not represent my opinion of Hacker News and its community. I simply find it unfortunate that so many are led astray by these garish displays.

Friday, August 12, 2011

Quick Shine


I've posted before that I like puzzles. I came across another challenge that recently appeared on the hacker news feed - here is the actual post.

The premise is that a slow floating point operation can be sped up by using integer-only operations without causing any change in the output of the equation. For reference, here is the function that is reported to be slow:

unsigned char SlowBrightness(unsigned char r,
                             unsigned char g,
                             unsigned char b)
{
  return 0.289f * r +
         0.587f * g +
         0.114f * b + 1e-5f;
}


My approach to this was to try and estimate the floating point values by composing large enough integer fractions. The goal here is to find a single denominator (optimally a power of 2) that is capable of hosting three numerators that will result in the values listed. The reason to aim for a power of two in the denominator is that dividing by a power of two can be efficiently implemented as a right bit shift.

Using 2^19 (524288) as the denominator, the following numerators result: 151519, 307757 and 59769, representing 0.289, 0.587 and 0.114 respectively. To check the error introduced by these approximations I also evaluated each division and compared it to the original values.

0.28900000
0.28899955 (151519/524288)

0.58700001
0.58699989 (307757/524288)

0.11400000
0.11400032 (59769/524288)

There was error, but since there was conveniently a constant in the original equation I could mitigate by adjusting the constant if necessary.
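
For what it's worth, the numerators can be reproduced with a throwaway snippet (my own, not part of the original solution): scale each coefficient by 2^19 and round to the nearest integer. Compiled with -lm, it prints 151519, 307757 and 59769 over 524288.

#include <stdio.h>
#include <math.h>

int main (void) {
    const double coeff[] = { 0.289, 0.587, 0.114 };
    const unsigned long denom = 1UL << 19;   /* 524288 */
    int i = 0;
    for (; i < 3; ++i)
        printf ("%.3f ~= %lu / %lu\n", coeff[i],
                (unsigned long) lround (coeff[i] * denom), denom);
    return 0;
}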

Once I had the values, I wrote the following test harness:

#include <stdio.h>

#define R(r) (151519UL*(r))
#define G(g) (307757UL*(g))
#define B(b) (59769UL*(b))
#define C    (5)

typedef unsigned char uchar;

uchar SlowBrightness(uchar r, uchar g, uchar b) {
    return 0.289f * r + 0.587f * g + 0.114f * b + 1e-5f;
}

uchar FastBrightness(uchar r, uchar g, uchar b) {
    return (R(r) + G(g) + B(b) + C) >> 19;
}

int main () {
    uchar r = 0, g = 0, b = 0;
    for (; r < 255; ++r)
        for (g = 0; g < 255; ++g)
            for (b = 0; b < 255; ++b)
                if (SlowBrightness(r,g,b) != FastBrightness(r,g,b))
                    printf("FAIL: %d,%d,%d\n", r,g,b);
    return 0;
}

The constant value of C came from 1e-5f represented as 5 / 2^19. As expected, I had to increase the value of C to compensate for the difference in my estimated numbers. After the adjustment (the minimum constant ended up being 72) I was able to run and match all values of the two calls. I also achieved the bonus for minimal constants and only using +, * and >> operations.

Thursday, August 4, 2011

Hotness

We have an internal image that floated around work several years ago that details network utilization of TCP over a wide variety of configurations. It is a heatmap created in matlab that is just sweet, sweet eye candy. We actually hung it on the outside of a cube for a short while and people couldn't help but stop and look at it.

It is entirely dysfunctional, mind you. The designer tried to combine eight parameters - with all variations - into an individual 2D plot (3D if you consider color a dimension). It was definitely an internal tool - there were only two or three of us who could decipher the layout enough to say anything about the data. That was fine by us; we basically made up the entire population of people who cared.

Fast forward a few years. I'm currently working on a technical report that could use the data we used to create that plot and, as luck would have it, I'm also the only one from the original group still at the company. In order to be able to include this data in the report there needs to occur a certain amount of reformatting - first my brain, then that plot. I wasn't the original designer and, although I have access to the code, I don't know matlab so I'm pretty much stuck. I decided to rework the data in R.

The thing about that original plot was that it had a certain je ne sais quoi: it made you look. I wanted to keep that so I immediately investigated heatmap functionality available in R.

Really? Ouch. Not much available there. I came up with two resources that were helpful: a Wikipedia entry about a Mandelbrot set animation in R, and a stackoverflow answer that mentioned rasterImage in a comment. The first site led me to the color set used in our original plot and the second gave me the pointer I needed to get the job done. I'll leave what follows as a reminder for myself and a helpful nudge for those who face a similar problem in the future.

hmap.example <- function () {

    code <- c("colfun <- colorRampPalette(c(...))",
        "my.colors <- colfun(10000)","xs <- 1:100",
        "X  <- outer(xs,rev(xs))",
        "C1 <- matrix(my.colors[X],100,100)", "X  <- outer(xs,xs)",
        "C2 <- matrix(my.colors[X],100,100)", "X  <- outer(rev(xs),xs)",
        "C3 <- matrix(my.colors[X],100,100)",
        "plot(c(-100,100),c(-100,100),type='n')",
        "rasterImage(C1,1,1,100,100)",
        "rasterImage(C2,-100,1,1,100)", "rasterImage(C3,-100,-100,1,1)",
        "abline(v=0,col='black',lwd=5)", "abline(h=0,col='black',lwd=5)")

    colfun <- colorRampPalette(c("#00007F", "blue", "#007FFF", "cyan",
                    "#7FFF7F", "yellow", "#FF7F00", "red", "#7F0000"))

    my.colors <- colfun(10000)
    xs <- 1:100
    X  <- outer(xs,rev(xs))
    C1 <- matrix(my.colors[X],100,100)
    X  <- outer(xs,xs)
    C2 <- matrix(my.colors[X],100,100)
    X  <- outer(rev(xs),xs)
    C3 <- matrix(my.colors[X],100,100)
    plot(c(-100,100),c(-100,100),type='n',axes=FALSE,xlab='',ylab='')
    rasterImage(C1,1,1,100,100)
    rasterImage(C2,-100,1,1,100)
    rasterImage(C3,-100,-100,1,1)
    abline(v=0,col='black',lwd=5)
    abline(h=0,col='black',lwd=5)
    text(1,1:length(code)*-6,labels=code,cex=0.8,pos=4,family="mono")
}

And the result:

Tuesday, July 26, 2011

typedef[ine]

The typedef keyword introduces a new type. The #define directive provides a textual substitution mechanism. When does this matter? Consider the following source:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

#include "str_type.h"

size_t length (CharPtr str) { return strlen(str); }

void message () {
    CharPtr str = malloc (255);
    if (! str) { return; }
    strcpy (str, "The Message");
    printf("%s\n", str);
    free (str);
}

int main () {

    CharPtr str = "Some Static String";
    CharPtr str1, str2, str3;

    printf ("Length: %d\n", length (str));
    message ();
    printf ("sizeof(str1): %d\n", sizeof (str1));
    printf ("sizeof(str2): %d\n", sizeof (str2));

    return 0;
}

If str_type.h contains

typedef char * CharPtr;

Then the output of running that program is

Length: 18
The Message
sizeof(str1): 8
sizeof(str2): 8

However, if that file contains

#define CharPtr char *

You instead get

Length: 18
The Message
sizeof(str1): 8
sizeof(str2): 1

Notice that since the CharPtr is no longer a type you get

char * str1, str2, str3;

after the processing phase of translation and the asterisk only binds to the first variable leaving the remaining two as plain char variables.

Use #define only to provide textual substitution prior to translation; it does not interact with - or complement - the type system.

Friday, July 15, 2011

Double Standard

In almost all instances it is a bad idea to place multiple distinct scales on a single plot. Not only does it require more of the reader in terms of deciphering the message, it also lessens the impact of the chart itself. Instead of a reader being able to associate the vertical data with an immediate estimated value in their mind they now have a two step process to determine a value. First, they must choose which scale applies - then they must consult the appropriate scale to derive a value. This is all a bad thing.

I also feel this is a common way to try and hide information in plain sight. Consider the following graph: at what point do the two lines intersect? (Here's a hint - they don't.) This layout lends itself to all sorts of sleight of hand when talking to or describing data.



Usually, I would claim that any graph that did this could be better provided as either a multiplot or two separate graphs. However, the other day I saw an instance where it made perfect sense: temperature. Plotting temperature with two separate scales is actually a representation of two separate mathematical functions that represent the same property; in effect the graph acts as a lookup table for the user for translating from one function to another. As an example, consider the following:



A glorified lookup table to be sure - but entirely functional.

Moving forward, I've modified my view to something more along the lines of: "Dual scales are useful when they both describe a single property in two distinct ways." Any measurement that can be represented in a variety of ways falls into this category: temperature (Fahrenheit v. Celsius), time (24-hour v. 12-hour), distance (miles v. kilometers), and so on.

Tuesday, July 5, 2011

Progress

[EDIT] This code is now available on github: libpbar

cURL ships with one. wget uses its own. OpenSSH (scp) even comes with its own version. Indeed, I've written several variations myself over the years. Sooner or later, if you are doing something that processes large amounts of data or takes a long time to complete and you are working in a command-line environment you will write your own, too. A progress bar.

I did a quick search for existing progress bar libraries but came up with little outside of Ruby and Python wrappers around what the packages I mention above are already doing [1]. For mostly selfish reasons, I've decided to write a library that fixes my redundant behavior moving forward. Once I am able to provide a proper build system I will put this code up on github - my unfamiliarity with autotools may delay this a bit.

What this library aims to provide is a consistent interface for programmatically displaying a progress bar outside of a GUI environment. It does not try to provide a package similar to Pipe Viewer (pv), which is a complete, standalone program for monitoring process activity. Instead, to keep things simple, libpbar will initially provide a class with the following public interface:

class ProgressBar {
public:
    ProgressBar (::size_t cap);
    void update (::size_t n = 1);
};

With three implementations provided with the library: TextProgressBar, ColorProgressBar, GraphicalProgressBar. These provide basic text functionality, color text, and a GTK+-based object respectively. Using any of these facilities is as simple as:

int flags = TextProgressBar::SHOW_VALUE | TextProgressBar::SHOW_COUNT;
TextProgressBar pb(100,flags);

They each do as you might expect:
TextProgressBar:

ColorProgressBar:

GraphicalProgressBar:

The nice thing about the GUI version is that the hosting program doesn't need to be a GUI application. For this version of the progress bar a separate thread is used to spawn the graphical interface and as much as possible is done to shed the window treatments to expose just the progress bar itself.

I've got to expand on this by allowing some way to encode message text with format strings and allowing user-defined color modifications. The basics are there, however, and I should now be free from writing yet another progress bar implementation.


[1] This is a solved problem and certainly my fruitless search does not imply any lack of existing tools

Thursday, May 26, 2011

Optical Disillusion

I dislike chartjunk. Not only is there a trend toward the incomprehensible but the movement comes with a ridiculous amount of flair. For all I can tell there exists a competition between infographic creators where the rules are based solely on who can cram more slop on a page.

Besides my obvious distaste for the style, there are sacrifices being made that compromise, or even forgo, the actual message - often without awareness or malice. Take, for example, the evolution of the pie chart.

For the sake of this example let's ignore the fact that a pie chart is a particularly poor way to display data to begin with. As a measure of two variables it provides a rough estimate of dominance but beyond that the human eye cannot distinguish relative quantities across the various shapes. In almost all cases a simple bar chart provides a more precise description - even in the two variable case. A visual display of data should be able to stand on its own without the need of labels describing quantities or values. The pie chart fails in this respect, but I digress.

Consider a simple pie chart of two variables.
  (As a measure of the strength of a pie chart as a communication tool, can you guess the values of the two areas? Go ahead, take a guess.)

The red portion of the chart is 55% and the blue portion is the remaining 45%. Without labels it is hard to distinguish the exact values, but it serves to at least show the dominance of red over blue. The problem with trendy infographics is that a simple pie chart is almost never sufficient in the layout. It needs exploding, or gradients, or even a third dimension.

Let's dress it up a bit and make a 3D pie chart with the same values.
So what's my beef about that? Let's consider the new representative areas of the chart. In the first chart, the values were inconspicuous but at least the color representation mapped directly to the underlying data.

Standard Pie Chart (red pixels) : 44295 (55.000%)
Standard Pie Chart (blue pixels): 36188 (44.900%)


In this new 'cooler' version of the chart we have skewed the data representation and thus our understanding of the overall message. In fact, by visible surface area alone we have changed the meaning of the chart entirely!

3D Pie Chart (red pixels) : 44792 (47.300%)
3D Pie Chart (blue pixels): 49740 (52.600%)


What is now required of us in this new chart, along with somehow mapping area to value, is to do accurate mathematical transformations in our heads to convert the 3D surface to an area in 2D. In fact, we need to now be able to deduce that roughly 52% of viewable surface area translates to 45% underlying data. The skew depends on the pitch, yaw, and roll so there is no magical formula here - every view will be a different mapping between surfaces.

I don't think people consider these details when compiling charts. In my estimate they are only trying to provide the most 'eye candy' for the intended consumer. The behavior is facilitated by common built-in chart generators (only 48 out of Excel's 288 pie chart variations are simple 2D charts) but there is no warning about the possible loss of meaning.

I'm certainly not among those pushing the envelope with infographics - this definitely makes my opinion biased. I keep things as simple as possible and for most data hungry crowds my approach is just too boring against current standards. I do believe there is a middle-ground, however; a place where rich graphics convey accurate data with minimal annotation markup. I only wish I knew how to bridge the gap.

A huge thanks to Dana Brown for taking the time to review and provide feedback on the first draft of this post.

Thursday, May 19, 2011

Using structs

If you want to group related information in a C program the common way to do so is with a struct. You see this everywhere: various network packets are represented as structs; file system objects (inodes); time values can be stored with separate components in struct members; and so on. In all of these cases it is entirely possible to use basic arrays to maintain this data. The problem with that is that we, as non-machines, find it more natural to think in terms of collections with named members than random offsets into a buffer. In other words, it is easier to design and reason with 'object.member = value' than it is with 'buffer[offset] = value' (or something more obscure). Especially if you deal with members of various sizes.

I feel this is a natural progression - the tendency to want to group related items and operate with them as individual 'things'. I believe this to be a dominating theme with C programmers (and programmers in general). What I don't see too much of, however, is explicitly moving in the opposite direction. That is, given some data in the form of a buffer, C programmers are more likely to develop their own mini-parser to get at that data instead of using a struct to ease implementation.

As an example, I've seen the following many times in a variety of flavors:

uint32_t extract(unsigned char *buffer) {
    uint32_t value = 0;
    int i = 0;
    for (i=sizeof(uint32_t)-1; i>=0; i--) {
        value = value << 8;
        value += buffer[i];
    }
    return value;
}


And, while that is functional it is also error-prone and cumbersome to write each time you need to do such a conversion. In contrast, I see very little in the form of

struct extracted {
    uint32_t v[5];
};

struct extracted * map = (struct extracted *) buffer;


Where I think the implementation is simplified

If we remove most of the boilerplate sections of an example and examine just how each of these is available to use we can see what the overall effect is.

uint32_t vals[] = {0x0, 0x0f, 0x0f0f, 0x0f0f0f, 0x0f0f0f0f};
unsigned char * cvals = (unsigned char *) vals;
int i = 0;
for (; i < 5; ++i)
    printf ("%10lu\n", (unsigned long) extract (cvals + i * sizeof(uint32_t)));

and with the struct

uint32_t vals[] = {0x0, 0x0f, 0x0f0f, 0x0f0f0f, 0x0f0f0f0f};
unsigned char * cvals = (unsigned char *) vals;
struct extracted * map = (struct extracted *) cvals;
int i = 0;
for (; i < 5; ++i)
    printf ("%10lu\n", (unsigned long) map->v[i]);

The main differences between the two are how data is extracted from the raw buffer and how that implementation affects code design. In both cases the struct provides a cleaner and more understandable solution. In a more generic setting, one where you may not have the exact number of elements in the buffer, the struct approach above doesn't fit exactly.

However, it can be modified:

struct value {
    uint32_t data;
};

uint32_t extract(unsigned char *buffer) {
    struct value * v = (struct value *) buffer;
    return v->data;
}

Where using the function still requires computing the buffer offset at the call site but the implementation of the function itself is much cleaner.

This becomes even more useful if you consider cases where an individual member may contain multiple bits of information. For instance, it is common to have a data member of a struct represent a set of flags. The typical implementation involves a set of macros to test or set values in a bit mask.

For example:

#define FOO_FLAG    (1<<0)
#define BAR_FLAG    (1<<1)
#define BAZ_FLAG    (1<<2)

#define ISSETFOO(v) ((v).mask & (FOO_FLAG))
#define SETFOO(v)   ((v).mask |= (FOO_FLAG))
#define UNSETFOO(v) ((v).mask &= ~(FOO_FLAG))

/* similarly for BAR_FLAG and BAZ_FLAG */

struct data {
    unsigned short mask;
    unsigned char stuff[200];
};

int main () {
    struct data data;
    SETFOO(data);
    if (ISSETFOO(data)) {
        printf ("FOO_FLAG is set\n");
    } else {
        printf ("Foo_FLAG is not set\n");
    }
    /* ... */

With well designed macro names this approach does not imply altogether clumsy code but the macro design is still cumbersome. I think that a more elegant approach can be achieved through the use of structs.

struct flags {
    unsigned char foo:1;
    unsigned char bar:1;
    unsigned char baz:1;
};

struct data {
    struct flags mask;
    unsigned char stuff[200];
};

int main () {
    struct data data;
    data.mask.foo = 1;
    if (data.mask.foo) {
        printf ("FOO_FLAG is set\n");
    } else {
        printf ("Foo_FLAG is not set\n");
    }
    /* ... */

This can even be done without having to change implementation code. Leaving the macro interface while changing the bodies to represent the new implementation allows users of the macro interface to continue uninterrupted.
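
As a sketch of what that might look like (reusing the names from the examples above), the macro bodies simply get rewritten over the bitfield while callers keep using SETFOO/ISSETFOO untouched:

#define ISSETFOO(v) ((v).mask.foo)
#define SETFOO(v)   ((v).mask.foo = 1)
#define UNSETFOO(v) ((v).mask.foo = 0)

/* similarly for the BAR and BAZ variants */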

The struct is no panacea. However, I find that in these types of scenarios the struct provides for much cleaner and manageable code - something I favor whenever I can.

Saturday, April 23, 2011

y = mx + b

I had recently compiled a set of plots for a technical report I put together to summarize an evaluation of some of our software. The focus of the tests was performance of multiple file transfer approaches. One of the comments I got back after pushing the report out for review was that I should include the slope as an annotation for a best fit line I had included. I thought this was a curious request at first since the best fit was already intended to give a visual estimate of the behavior. After I thought about it, however, I considered the case when someone would want to compare results to ours without having any of our raw data. A visual estimate would certainly not do in that case - especially if the range of their data was outside of the viewable area of the plot.

This was a simple change. R, the software I was using to generate the plots, has built in functionality for linear regression. Getting the slope and y-intercept is as easy as:

> lm(y ~ x)

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x
   8.474523     0.007313

This got me thinking, however. Would I know how to generate the linear regression without software? The answer was 'no', so I learned how. Here are the steps I took to generate the same results as the statistical software I was using.

I was dealing with a set of data that was generated using inputs over the powers of two: [1, 2, 4, 8, ..., 2048, 4096]. I also had multiple samples at each input. The first step was to determine how, in fact, a linear regression is calculated.

Linear regression (used to make a best fit line) can be calculated using the least squares approach. I needed to have the following: mean and standard deviation of both the inputs and outputs and the Pearson product-moment correlation coefficient. Once I had all of those calculations I could generate the slope and y-intercept of the best fit line and thus any point along the line.

In order to calculate the PMCC there is some upfront legwork that I needed to do. The equation requires the standard score and sample standard deviation from the two sets and calculating the y-intercept requires the sample means. The Pearson product-moment correlation coefficient for a given sample is defined here (Wikipedia).
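
For reference, the relations that tie all of these pieces together are the standard least-squares ones (r is the correlation coefficient, s_x and s_y the sample standard deviations, x_bar and y_bar the sample means):

slope:       m = r * (s_y / s_x)
y-intercept: b = y_bar - m * x_bar

The code below just builds up each of those pieces.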

I put together some Ruby code to generate the necessary values:

#!/usr/bin/env ruby

class DataSet
    attr_reader :mean, :var, :sd, :size
    def initialize(ary)
        @size = ary.size
        @data = ary.dup
        @mean = ary.inject(0.0) { |sum,n| sum + n } / @size
        @var = ary.inject(0.0) { |sum,n| (n-@mean)**2 + sum }
        @sd = Math.sqrt(@var/(@size - 1))
    end

    def [](i)
        @data[i]
    end
end

# http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient
def pearson(x,y)
    g = []
    x.size.times do |i|
        g << ((x[i] - x.mean)/x.sd)*((y[i] - y.mean)/y.sd)
    end
    (1.0/(x.size - 1)) * g.inject(0.0) { |sum,n| sum + n }
end

#http://www.stat.wmich.edu/s216/book/node126.html
def slope(x,y)
    pearson(x,y) * (x.sd/y.sd)
end

def intercept(x,y)
    x.mean - slope(x, y)*y.mean
end

def load_data(file)
    xs, ys = [], []
    if file
        File.open(file).each do |line|
            x,y,_ = line.chomp.split(',')
            xs << x.to_f
            ys << y.to_f
        end
    else
        while line = gets
            x,y,_ = line.chomp.split(',')
            xs << x.to_f
            ys << y.to_f
        end
    end
    [DataSet.new(xs), DataSet.new(ys)]
end

x, y = load_data(ARGV.shift)

puts "Slope      : %0.6f" % [slope y, x]
puts "Y-intercept: %0.6f" % [intercept y, x]

Running that with my control data set yields the following:

Slope      : 0.007313
Y-intercept: 8.474523

With the results from R above as a sanity check these values look accurate. Now, using these calculations, I could plot a best fit simply by drawing a line between the points at x = 1 and x = 4096, with each y calculated using the equation y = mx + b (roughly 8.48 and 38.43, respectively, for this data).


So, while I'm still an advocate of using the right tool for the job I can now understand the mechanics behind these calculations and talk specifically about why they should (or should not) be used.



N.B. I am no expert on any of this. There are topics at play here I do not fully understand (details of the Pearson product-moment correlation coefficient, for one). This is more a narrative than any official guide to understanding.

Wednesday, April 13, 2011

Dynamic Code Loading (in C)

Originally, I intended to write this post about function pointers in C. I was inspired (yet again) by activity in a forum where a user asked a question about why you would ever need to use function pointers. As I started thinking about mapping the idea to a useful example I kept considering how the kernel uses function pointers for module writers. Couple that with my new exploration into the Erlang programming language and you get this content instead.

To address the first point directly: function pointers are useful as a means to expose a standard set of functionality. Consider how a character device driver for the linux kernel operates, for instance. When you write a character device driver you are presented with a struct file_operations (fops).

struct file_operations {
   struct module *owner;
   loff_t (*llseek) (struct file *, loff_t, int);
   ssize_t (*read) (struct file *, char *, size_t, loff_t *);
   ssize_t (*write) (struct file *, const char *, size_t, loff_t *);
   int (*readdir) (struct file *, void *, filldir_t);
   /* ... lots of other calls ... */
};

Why? Consider what you do with a character device; read from it, write to it, and so on. But since each device requires its own implementation there needs to be a way to indicate this within the module code.

One solution to this is to have a set of skeleton functions declared in a header file. Require that all modules include this file and provide definitions for the functions. There are two obvious drawbacks to this approach:
  • Module writers are forced to fill in an empty body for all functions that will not be used
  • Any changes to the interface require reimplementation of the entire set of modules using the now outdated file
A consequence of the second problem is that any interface must be absolute before release and must not support subsequent updates.

Another solution is to use a structure that contains function pointers and allow developers to use that when writing to the interface. This solves the two issues with the skeleton declaration approach: any function not defined is left NULL in the structure, and adding new members to the structure won't break existing code. Ok... if there is some fancy offset calculation being done there is potential for breakage, but that is suspect behavior in the first place.

Obviously removing entries creates compatibility issues in either solution.

Here is how the function pointer approach works in the linux kernel:

/* linux-2.6.37/drivers/i2c/i2c-dev.c:513 */

static const struct file_operations i2cdev_fops = {
    .owner      = THIS_MODULE,
    .llseek     = no_llseek,
    .read       = i2cdev_read,
    .write      = i2cdev_write,
    .unlocked_ioctl = i2cdev_ioctl,
    .open       = i2cdev_open,
    .release    = i2cdev_release,
};

Where each of the i2cdev_* functions is defined locally to that file and any operations not defined in the struct are NULL (aio_fsync, for example).

That actually represents dynamic code loading in its entirety. You can change the behavior of a system by loading modules at runtime. But that is in the kernel and not applicable in userland; what if we wanted to do that for a process we control? That is something that Erlang lets you do and - because I'm bent on reinventing the wheel for experience - I wanted to do it in C as well. Note: I'm sure that Erlang is not the only language that supports this; it just happens to be what I was reading recently.

Lets start with the interface that is exposed:

#ifndef JRB_DYNOPS__H__
#define JRB_DYNOPS__H__

/*
 * Operations supported by our dynamic runtime:
 *  init: Initialize the new code
 *  speak: process strings in some way
 *  destroy: Cleanup code
 */
struct dynops {
    void (*init)();
    void (*speak)(const char *);
    void (*destroy)();
};

/*
 * Fill in the operations for a particular implementation
 */
void get_ops (struct dynops *);

#endif /*JRB_DYNOPS__H__*/

And two clients writing to that interface:

Client 1

#include <stdio.h>
#include <string.h>
#include "dynops.h"

static void lib1_speak (const char * text) {
    fprintf (stderr, "[LIB1] %s\n", text);
}

/* only implements speak. Ignores init & destroy */
static struct dynops my_ops = {
    .speak = lib1_speak,
};

void get_ops (struct dynops * ops) {
    memcpy (ops, &my_ops, sizeof (my_ops));
}

Client 2

#include <stdio.h>
#include <string.h>
#include "dynops.h"

static void lib2_init () {
    fprintf (stderr, "Initializing lib2\n");
}
static void lib2_destroy () {
    fprintf (stderr, "Cleanup lib2\n");
}
static void lib2_speak (const char * text) {
    fprintf (stderr, "[LIB2] %s\n", text);
}

static struct dynops my_ops =  {
    .init = lib2_init,
    .speak = lib2_speak,
    .destroy = lib2_destroy,
};

void get_ops (struct dynops * ops) {
    memcpy (ops, &my_ops, sizeof (my_ops));
}

This driver is relatively crude. Notification of a code change comes by way of a signal, and a well-known path indicates where the code resides. In addition, the mechanism for copying the structure of function pointers should reside in the driver itself and not in the clients. I think it is sufficient, however, to convey the idea.

int main () {

    unsigned long count = 0;
    char iteration[255] = {0};
    signal (SIGUSR1, trigger_load);

    while (1) {
        snprintf (iteration, 255, "Iteration: %lu\n", ++count);
        do_ops (iteration);
        usleep (500000);
    }

    return 0;
}

And some implementation

void trigger_load (int sig) {
    if (sig != SIGUSR1)
        return;
    FILE * new_lib = fopen("/tmp/dynlib", "r");
    if (! new_lib)
        return;
    fscanf (new_lib, "%s", lib_name);
    fclose (new_lib);
    the_lib = lib_name;
    reload = 1;
}

/* ... */

void do_ops (const char * str) {
    if (reload) {
        load_lib (the_lib);
        reload = 0;
    }
    if (ops.speak)
        ops.speak(str);
}
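
The load_lib routine is elided above; here is a minimal sketch of how I would expect it to behave using dlopen/dlsym (the ops global, the init/destroy ordering and the 'successfully loaded' message come from the fragments and transcript in this post - the rest is my own guesswork, linked with -ldl):

/* Sketch only -- assumes the driver keeps these globals */
#include <dlfcn.h>
#include <stdio.h>
#include <string.h>
#include "dynops.h"

static struct dynops ops;
static void * lib_handle = 0;

void load_lib (const char * path) {
    void (*fill) (struct dynops *) = 0;

    if (ops.destroy)                 /* let the outgoing code clean up */
        ops.destroy ();
    if (lib_handle)
        dlclose (lib_handle);
    memset (&ops, 0, sizeof (ops));

    lib_handle = dlopen (path, RTLD_NOW);
    if (! lib_handle)
        return;

    *(void **) (&fill) = dlsym (lib_handle, "get_ops");
    if (fill) {
        fprintf (stderr, "'%s' successfully loaded\n", path);
        fill (&ops);
        if (ops.init)                /* give the incoming code a chance to set up */
            ops.init ();
    }
}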

With that setup we can now control the behavior of that driver at runtime. The example here runs the driver in its own terminal and in another terminal commands are executed to load new code. The transcripts below show the execution in sequence - dynamic code loading in C.

Terminal 1
$ ./a.out
'libops1.so' successfully loaded
[LIB1] Iteration: 40
...
[LIB1] Iteration: 52
'libops2.so' successfully loaded
Initializing lib2
[LIB2] Iteration: 53
...
[LIB2] Iteration: 62
Cleanup lib2
'libops1.so' successfully loaded
[LIB1] Iteration: 63
...

Terminal 2
$ pgrep a.out
5993
$ echo "libops1.so" > /tmp/dynlib
$ kill -10 5993
$ echo "libops2.so" > /tmp/dynlib
$ kill -10 5993
$ echo "libops1.so" > /tmp/dynlib
$ kill -10 5993