Thursday, May 26, 2011

Optical Disillusion

I dislike chartjunk. Not only is there a trend toward the incomprehensible but the movement comes with a ridiculous amount of flair. For all I can tell there exists a competition between infographic creators where the rules are based solely on who can cram more slop on a page.

Besides my obvious distaste for the style, there are sacrifices being made that compromise, or even forgo, the actual message - often without awareness or malice. Take, for example, the evolution of the pie chart.

For the sake of this example let's ignore the fact that a pie chart is a particularly poor way to display data to begin with. With two variables it provides a rough estimate of dominance, but beyond that the human eye cannot distinguish relative quantities across the various shapes. In almost all cases a simple bar chart provides a more precise description - even in the two variable case. A visual display of data should be able to stand on its own without the need of labels describing quantities or values. The pie chart fails in this respect, but I digress.

Consider a simple pie chart of two variables.
  (As a measure of the strength of a pie chart as a communication tool, can you guess the values of the two areas? Go ahead, take a guess.)

The red portion of the chart is 55% and the blue portion is the remaining 45%. Without labels the exact values are hard to make out, but the chart does at least show the dominance of red over blue. The problem with trendy infographics is that a simple pie chart is almost never considered sufficient for the layout. It needs exploding, or gradients, or even a third dimension.

Let's dress it up a bit and make a 3D pie chart with the same values.
So what's my beef about that? Let's consider the new representative areas of the chart. In the first chart, the values were inconspicuous but at least the color representation mapped directly to the underlying data.

Standard Pie Chart (red pixels) : 44295 (55.04%)
Standard Pie Chart (blue pixels): 36188 (44.96%)


In this new 'cooler' version of the chart we have skewed the data representation and thus our understanding of the overall message. In fact, by visible surface area alone we have changed the meaning of the chart entirely!

3D Pie Chart (red pixels) : 44792 (47.38%)
3D Pie Chart (blue pixels): 49740 (52.62%)
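
The percentages here are nothing more than each color's pixel count divided by the total colored pixels. As a minimal sketch of that arithmetic (pixel_share is a made-up helper name, not part of whatever tool produced the counts):

```c
/* Hypothetical helper: percentage of the chart's colored pixels
 * that belong to one slice, given the counts for both slices. */
double pixel_share(unsigned long slice, unsigned long other) {
    return 100.0 * (double) slice / (double) (slice + other);
}
```

Fed the counts above, pixel_share(44295, 36188) comes out a little over 55 for the flat chart's red slice, while pixel_share(44792, 49740) comes out a little over 47 for the same slice in the 3D chart.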


What is now required of us in this new chart, along with somehow mapping area to value, is to do accurate mathematical transformations in our heads to convert the 3D surface to an area in 2D. In fact, we need to now be able to deduce that roughly 52% of viewable surface area translates to 45% underlying data. The skew depends on the pitch, yaw, and roll so there is no magical formula here - every view will be a different mapping between surfaces.

I don't think people consider these details when compiling charts. In my estimate they are only trying to provide the most 'eye candy' for the intended consumer. The behavior is facilitated by common built-in chart generators (only 48 out of Excel's 288 pie chart variations are simple 2D charts) but there is no warning about the possible loss of meaning.

I'm certainly not among those pushing the envelope with infographics - this definitely makes my opinion biased. I keep things as simple as possible and for most data hungry crowds my approach is just too boring against current standards. I do believe there is a middle-ground, however; a place where rich graphics convey accurate data with minimal annotation markup. I only wish I knew how to bridge the gap.

A huge thanks to Dana Brown for taking the time to review and provide feedback on the first draft of this post.

Thursday, May 19, 2011

Using structs

If you want to group related information in a C program the common way to do so is with a struct. You see this everywhere: various network packets are represented as structs; file system objects (inodes); time values can be stored with separate components in struct members; and so on. In all of these cases it is entirely possible to use basic arrays to maintain this data. The problem with that is that we, as non-machines, find it more natural to think in terms of collections with named members than random offsets into a buffer. In other words, it is easier to design and reason with 'object.member = value' than it is with 'buffer[offset] = value' (or something more obscure). Especially if you deal with members of various sizes.
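
To make the contrast concrete, here is a small sketch of the same update written both ways. The three-byte time-of-day layout, the offset names, and the function names are all invented for illustration.

```c
#include <stdint.h>

/* Hypothetical layout, for illustration only: a time of day in three bytes. */
enum { OFF_HOURS = 0, OFF_MINUTES = 1, OFF_SECONDS = 2, TIME_BYTES = 3 };

struct time_of_day {
    uint8_t hours;
    uint8_t minutes;
    uint8_t seconds;
};

/* Array style: the reader has to know what each offset means. */
void set_noon_by_offset(unsigned char *buffer) {
    buffer[OFF_HOURS]   = 12;
    buffer[OFF_MINUTES] = 0;
    buffer[OFF_SECONDS] = 0;
}

/* Struct style: the member names carry the meaning. */
void set_noon_by_member(struct time_of_day *t) {
    t->hours   = 12;
    t->minutes = 0;
    t->seconds = 0;
}
```

Both functions store the same three values; the second simply reads closer to how we think about the data.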

I feel this is a natural progression - the tendency to want to group related items and operate with them as individual 'things'. I believe this to be a dominating theme with C programmers (and programmers in general). What I don't see too much of, however, is explicitly moving in the opposite direction. That is, given some data in the form of a buffer, C programmers are more likely to develop their own mini-parser to get at that data instead of using a struct to ease implementation.

As an example, I've seen the following many times in a variety of flavors:

/* Assemble a uint32_t from four bytes, least significant byte first. */
uint32_t extract(const unsigned char *buffer) {
    uint32_t value = 0;
    int i;
    for (i = sizeof(uint32_t) - 1; i >= 0; i--) {
        value = value << 8;
        value += buffer[i];
    }
    return value;
}


And, while that is functional it is also error-prone and cumbersome to write each time you need to do such a conversion. In contrast, I see very little in the form of

struct extracted {
    uint32_t v[5];
};

struct extracted * map = (struct extracted *) buffer;


where I think the implementation is simpler.

If we remove most of the boilerplate sections of an example and examine just how each of these is available to use we can see what the overall effect is.

uint32_t vals[] = {0x0, 0x0f, 0x0f0f, 0x0f0f0f, 0x0f0f0f0f};
unsigned char * cvals = (unsigned char *) vals;
for (int i = 0; i < 5; ++i)
    printf ("%10lu\n", (unsigned long) extract (cvals + i * sizeof(uint32_t)));

and with the struct

uint32_t vals[] = {0x0, 0x0f, 0x0f0f, 0x0f0f0f, 0x0f0f0f0f};
unsigned char * cvals = (unsigned char *) vals;
struct extracted * map = (struct extracted *) cvals;
for (int i = 0; i < 5; ++i)
    printf ("%10lu\n", (unsigned long) map->v[i]);

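The main differences between the two are how data is extracted from the raw buffer and how that implementation affects code design. On both counts the struct provides a cleaner and more understandable solution, with two caveats: the struct reads each value in the host's byte order (the loop always decodes the bytes as little-endian), and it assumes the buffer is suitably aligned for a uint32_t. In a more generic setting, one where you may not know the exact number of elements in the buffer, the struct approach above doesn't fit exactly.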

However, it can be modified:

struct value {
    uint32_t data;
};

uint32_t extract(unsigned char *buffer) {
    struct value * v = (struct value *) buffer;
    return v->data;
}

Calling the function still requires the buffer offset, but the implementation is much cleaner.
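
One thing to watch with pointing a struct directly at a byte buffer is that the buffer may not be aligned the way the struct's members expect. A possible variation, sketched here under the assumption that struct value has no padding, copies the bytes into a local struct instead. The name extract_copy is made up for this example.

```c
#include <stdint.h>
#include <string.h>

struct value {
    uint32_t data;
};

/* Copy the bytes into a local struct rather than pointing the struct at
 * the buffer, so the buffer may start at any address. The value is still
 * read in the host's byte order, just as it would be through a pointer. */
uint32_t extract_copy(const unsigned char *buffer) {
    struct value v;
    memcpy(&v, buffer, sizeof v);
    return v.data;
}
```

The caller's side does not change: it still passes the buffer plus an offset, and the function body stays a couple of lines long.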

This becomes even more useful if you consider cases where an individual member may contain multiple bits of information. For instance, it is common to have a data member of a struct represent a set of flags. The typical implementation involves a set of macros to test or set values in a bit mask.

For example:

#define FOO_FLAG    (1<<0)
#define BAR_FLAG    (1<<1)
#define BAZ_FLAG    (1<<2)

#define ISSETFOO(v) ((v).mask & (FOO_FLAG))
#define SETFOO(v)   ((v).mask |= (FOO_FLAG))
#define UNSETFOO(v) ((v).mask &= ~(FOO_FLAG))

/* similarly for BAR_FLAG and BAZ_FLAG */

struct data {
    unsigned short mask;
    unsigned char stuff[200];
};

int main () {
    struct data data = {0};
    SETFOO(data);
    if (ISSETFOO(data)) {
        printf ("FOO_FLAG is set\n");
    } else {
        printf ("FOO_FLAG is not set\n");
    }
    /* ... */

With well-designed macro names this approach need not produce clumsy code, but the macro design is still cumbersome. I think that a more elegant approach can be achieved through the use of structs.

struct flags {
    unsigned char foo:1;
    unsigned char bar:1;
    unsigned char baz:1;
};

struct data {
    struct flags mask;
    unsigned char stuff[200];
};

int main () {
    struct data data = {0};
    data.mask.foo = 1;
    if (data.mask.foo) {
        printf ("FOO_FLAG is set\n");
    } else {
        printf ("FOO_FLAG is not set\n");
    }
    /* ... */

This can even be done without having to change implementation code. Leaving the macro interface while changing the bodies to represent the new implementation allows users of the macro interface to continue uninterrupted.
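
As a rough sketch of what that could look like, the FOO macros might be rewritten over the bit-field members as shown here. This is one possible arrangement, not the only one. It declares the bit-fields as unsigned int, the type the C standard guarantees for bit-fields, where the struct above uses unsigned char.

```c
struct flags {
    unsigned int foo:1;
    unsigned int bar:1;
    unsigned int baz:1;
};

struct data {
    struct flags mask;
    unsigned char stuff[200];
};

/* Same macro names as before; only the bodies know about the bit-fields. */
#define ISSETFOO(v) ((v).mask.foo)
#define SETFOO(v)   ((v).mask.foo = 1)
#define UNSETFOO(v) ((v).mask.foo = 0)

/* similarly for bar and baz */
```

Code written against SETFOO(data) and ISSETFOO(data) compiles unchanged. Only code that touched the mask member directly as an integer would need attention.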

The struct is no panacea. However, I find that in these types of scenarios the struct provides for much cleaner and more manageable code - something I favor whenever I can.