Tuesday, August 23, 2011

The Potential of Fail

I saw a link come across the front page of Hacker News today that blew my mind - not in a good way. In general, the content that appears on that page is pertinent and informative, but the information in this link is just plain horse manure.

This post is basically a rant about everything I don't like about that infographic - leave now if you are otherwise aligned. You have been warned.

[Update: It seems that the graphic was updated to change petrabyte to petabyte throughout. The rest of the artifacts discussed below, however, appear as they were originally presented]

Let's start with the term 'petrabyte'. You might think that there was a typo someplace on the page ('r' being next to 't' on the keyboard) and could understand seeing it once instead of petabyte. No, this infosludge is selling it as a measuring stick throughout. Unforgiven.

Now, let's look at the data contained in the display itself. One of the first data comparisons is the pair of projected growth rainbows. Setting aside what the text says, the shapes and values themselves are lying. The 5% value is represented by 5 lines (1% per line) while the 40% value is represented with 12 lines (3.33% per line). If, instead of counting lines, you consider the visual area of the arc, the smaller semicircle (radius 5) has an area of 39.27. If the aspect ratio is equal, the larger semicircle should then have an area of 314.16 (8 times the smaller). Instead it is 226.19.
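
For reference, here is the arithmetic, taking the ring counts (5 and 12) as the radii of the two semicircles:

area of the small semicircle:     π × 5² / 2  ≈  39.27
expected area of the large one:   8 × 39.27   ≈ 314.16
actual area of the large one:     π × 12² / 2 ≈ 226.19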

To the left of that there is a bubble containing the value of 235 terabytes. This represents the total data collected by the Library of Congress in the month of April 2011. If this is an important benchmark or standard we should certainly be told at some point, along with how it relates to the other information it leads to. Instead, that value directs us to the Data Sectors section. The problem is that the values listed there are yearly aggregates over entire sectors for 2009. Further, the areas of the circles in the Data Sectors section do not correspond to the numbers printed under them. The ratio of the largest printed value to the smallest is 18.94, while the ratio of the largest bubble's area to the smallest's is 25.0 (diameters of 80px and 16px, and (80/16)² = 25).

Moving through the chart to the next bubble lands us on a value of 3.8 [units elided]. What is the significance of this number - or the Securities and Investments sector it represents? What else in the chart references this value? Nothing that I can find; it's not even part of the five selected sections that follow it.

Then, again with the rainbow. This time 12 distinct rings are sufficient to represent two different values. Certainly there was effort in resizing those 12 rings for the smaller value - yet no one thought to use the correct scale? Boggles the mind.

Moving along in the Health sector we see that R&D is really important. It could reportedly capture $108 billion. That amount is $57 billion less than the Clinical area but R&D still gets a bigger bubble. I'll admit I'm compelled to agree with this choice - I totally dig R&D.

Personally, my favorite part of the entire chart is in the Retail sector. The caption is priceless: "The potential increase ... could be 60%." If I offered you a job and my pitch was "I might potentially pay you $60K", would you take it? Oh, and you'd probably receive benefits, too.

Yay for consistency. As we move into Government, not only have we temporarily switched to euros but we're also given two different symbols to represent that change. Nothing like keeping your readers engaged by constantly changing the rules.

The last bubble I'll discuss is the "1 Petrabyte" centered near the bottom. How about one colossal waste of our time. Considering the discrepancies in the chart itself I'd be hesitant to trust the values as provided. The existence of delinquency like this is not altogether surprising; it's the fact that it is so popular that really appalls me.



NOTE: My discontent certainly does not represent my opinion of Hacker News and its community. I simply find it unfortunate that so many are led astray by these garish displays.

Friday, August 12, 2011

Quick Shine


I've posted before that I like puzzles. I came across another challenge that recently appeared on the Hacker News feed - here is the actual post.

The premise is that a slow floating point computation can be sped up using integer-only operations without causing any change in its output. For reference, here is the function that is reported to be slow:

unsigned char SlowBrightness(unsigned char r,
                             unsigned char g,
                             unsigned char b)
{
  return 0.289f * r +
         0.587f * g +
         0.114f * b + 1e-5f;
}


My approach to this was to approximate the floating point values with large enough integer fractions. The goal is to find a single denominator (optimally a power of 2) capable of hosting three numerators that produce the values listed. The reason to aim for a power of two in the denominator is that dividing by a power of two can be implemented efficiently as a right bit shift.
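
As a rough sketch of that derivation - not necessarily the exact code I used at the time - rounding each coefficient against a candidate power-of-two denominator looks something like this:

#include <stdio.h>
#include <math.h>

/* Sketch: derive integer numerators for a given power-of-two denominator
 * by rounding coefficient * 2^shift, then print the reconstructed values. */
int main(void)
{
    const float coeff[] = { 0.289f, 0.587f, 0.114f };
    const int shift = 19;                      /* candidate denominator: 2^19 */
    const unsigned long denom = 1UL << shift;

    for (int i = 0; i < 3; ++i) {
        unsigned long num = (unsigned long)lroundf(coeff[i] * denom);
        printf("%.8f ~ %lu/%lu = %.8f\n",
               coeff[i], num, denom, (double)num / denom);
    }
    return 0;
}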

At 2^19 (524288), the following values result as numerators: 151519, 307757 and 59769, representing 0.289, 0.587 and 0.114 respectively. To check the error introduced by these approximations I also evaluated each division and compared it to the original values.

0.28900000
0.28899955 (151519/524288)

0.58700001
0.58699989 (307757/524288)

0.11400000
0.11400032 (59769/524288)

There was error, but since the original equation conveniently contains an additive constant I could mitigate it by adjusting that constant if necessary.

Once I had the values, I wrote the following test harness:

#include <stdio.h>

#define R(r) (151519UL*(r))  /* ~0.289 * 2^19 */
#define G(g) (307757UL*(g))  /* ~0.587 * 2^19 */
#define B(b) (59769UL*(b))   /* ~0.114 * 2^19 */
#define C    (5)             /* 1e-5f * 2^19, rounded; raised later (see below) */

typedef unsigned char uchar;

uchar SlowBrightness(uchar r, uchar g, uchar b) {
    return 0.289f * r + 0.587f * g + 0.114f * b + 1e-5f;
}

uchar FastBrightness(uchar r, uchar g, uchar b) {
    return (R(r) + G(g) + B(b) + C) >> 19;
}

int main () {
    /* exhaustive check over every 8-bit RGB combination */
    int r, g, b;
    for (r = 0; r <= 255; ++r)
        for (g = 0; g <= 255; ++g)
            for (b = 0; b <= 255; ++b)
                if (SlowBrightness(r,g,b) != FastBrightness(r,g,b))
                    printf("FAIL: %d,%d,%d\n", r,g,b);
    return 0;
}

The constant value of C came from 1e-5f represented as 5 / 2^19. As expected, I had to increase the value of C to compensate for the error in my estimated numbers. After the adjustment (the minimum constant ended up being 72) the two functions matched for every input. I also achieved the bonus for minimal constants and only using +, * and >> operations.
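
If you're curious how a minimum like that can be found, a brute-force search is good enough since it only has to run once. Here is a sketch (a standalone copy of the two functions above, with the constant made a parameter - not necessarily how I arrived at it originally):

#include <stdio.h>

typedef unsigned char uchar;

static uchar slow(uchar r, uchar g, uchar b) {
    return 0.289f * r + 0.587f * g + 0.114f * b + 1e-5f;
}

static uchar fast(uchar r, uchar g, uchar b, unsigned long c) {
    return (151519UL * r + 307757UL * g + 59769UL * b + c) >> 19;
}

int main(void) {
    /* walk the candidate constants in order; the first one that survives
     * the exhaustive check is the minimum */
    for (unsigned long c = 0; c < (1UL << 19); ++c) {
        int ok = 1;
        for (int r = 0; ok && r <= 255; ++r)
            for (int g = 0; ok && g <= 255; ++g)
                for (int b = 0; ok && b <= 255; ++b)
                    if (slow(r, g, b) != fast(r, g, b, c))
                        ok = 0;
        if (ok) {
            printf("minimum constant: %lu\n", c);
            return 0;
        }
    }
    printf("no constant works for this denominator\n");
    return 1;
}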

Thursday, August 4, 2011

Hotness

We have an internal image that floated around work several years ago that details network utilization of TCP over a wide variety of configurations. It is a heatmap created in MATLAB that is just sweet, sweet eye candy. We actually hung it on the outside of a cube for a short while and people couldn't help but stop and look at it.

It is entirely dysfunctional, mind you. The designer tried to combine eight parameters - with all their variations - into a single 2D plot (3D if you consider color a dimension). It was definitely an internal tool - there were only two or three of us who could decipher the layout enough to say anything about the data. That was fine by us; we basically made up the entire population of people who cared.

Fast forward a few years. I'm currently working on a technical report that could use the data behind that plot and, as luck would have it, I'm also the only one from the original group still at the company. Including this data in the report requires a certain amount of reformatting - first of my brain, then of that plot. I wasn't the original designer and, although I have access to the code, I don't know MATLAB, so I'm pretty much stuck. I decided to rework the data in R.

The thing about that original plot was that it had a certain je ne sais quoi: it made you look. I wanted to keep that so I immediately investigated heatmap functionality available in R.

Really? Ouch. Not much available there. I came up with two resources that were helpful: a Wikipedia entry about a Mandelbrot set animation in R, and a stackoverflow answer that mentioned rasterImage in a comment. The first led me to the color set used in our original plot and the second gave me the pointer I needed to get the job done. I'll leave what follows as a reminder for myself and a helpful nudge for those who face a similar problem in the future.

hmap.example <- function () {

    # the plotted commands, kept as strings so they can be printed into the
    # unused quadrant of the plot as a self-documenting legend
    code <- c("colfun <- colorRampPalette(c(...))",
        "my.colors <- colfun(10000)","xs <- 1:100",
        "X  <- outer(xs,rev(xs))",
        "C1 <- matrix(my.colors[X],100,100)", "X  <- outer(xs,xs)",
        "C2 <- matrix(my.colors[X],100,100)", "X  <- outer(rev(xs),xs)",
        "C3 <- matrix(my.colors[X],100,100)",
        "plot(c(-100,100),c(-100,100),type='n')",
        "rasterImage(C1,1,1,100,100)",
        "rasterImage(C2,-100,1,1,100)", "rasterImage(C3,-100,-100,1,1)",
        "abline(v=0,col='black',lwd=5)", "abline(h=0,col='black',lwd=5)")

    # color ramp used in the original plot (dark blue through cyan, yellow and red)
    colfun <- colorRampPalette(c("#00007F", "blue", "#007FFF", "cyan",
                    "#7FFF7F", "yellow", "#FF7F00", "red", "#7F0000"))
    my.colors <- colfun(10000)

    # outer products of 1:100 give values 1..10000, used directly as indices
    # into the 10000-entry color vector; each matrix fills one quadrant
    xs <- 1:100
    X  <- outer(xs,rev(xs))
    C1 <- matrix(my.colors[X],100,100)
    X  <- outer(xs,xs)
    C2 <- matrix(my.colors[X],100,100)
    X  <- outer(rev(xs),xs)
    C3 <- matrix(my.colors[X],100,100)

    # empty canvas spanning all four quadrants, then paint a raster into
    # three of them and draw heavy axes through the origin
    plot(c(-100,100),c(-100,100),type='n',axes=FALSE,xlab='',ylab='')
    rasterImage(C1,1,1,100,100)
    rasterImage(C2,-100,1,1,100)
    rasterImage(C3,-100,-100,1,1)
    abline(v=0,col='black',lwd=5)
    abline(h=0,col='black',lwd=5)

    # list the code itself in the empty lower-right quadrant
    text(1,1:length(code)*-6,labels=code,cex=0.8,pos=4,family="mono")
}

And the result: