Friday, July 19, 2013

Column-Major Confusion

As a programmer coming from a language like C I am used to understanding multidimensional arrays in the following way:
int matrix[3][3] = {
    { 1, 2, 3 },
    { 4, 5, 6 },
    { 7, 8, 9 }
};

Understandably, this is a bit of a setback when trying to grok how a language such as R handles matrices. The example above reads as 'row-major', but R fills matrices 'column-major' by default. Note that I'm not describing memory layout here - which, as it happens, matches the presentation in both languages - I'm referring to how the matrices are presented to the programmer. To complete the example, here is what I get when I hand the same nine values to R in the same order:
> matrix(1:9,ncol=3)
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
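
For readers who want the row-by-row reading back, `matrix()` accepts a `byrow` argument. The following is a minimal sketch; the variable names `by_col` and `by_row` are made up for illustration.

```r
# Default: values fill down each column (column-major presentation).
by_col <- matrix(1:9, ncol = 3)

# byrow = TRUE fills across each row, matching the C initializer above.
by_row <- matrix(1:9, ncol = 3, byrow = TRUE)

print(by_row)
#      [,1] [,2] [,3]
# [1,]    1    2    3
# [2,]    4    5    6
# [3,]    7    8    9

# The two results are transposes of each other.
print(all(by_row == t(by_col)))
```

The default stays column-major either way; `byrow` only changes how the input vector is laid into the matrix.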

There is no problem with this per se. In fact, I'd imagine that R programmers who come to something like C feel a similar unease at the paradigm shift. What I'm finding is that this fact is glossed over in some texts offering an introduction to R, and it happens in non-obvious ways. In particular, I found this example today:
> matrix(c(1,0,0,0,1,0,0,0,1),ncol=3)
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    1    0
[3,]    0    0    1

The identity matrix is a particularly bad choice in this regard as it gives no indication of the true layout being used. It is probably good for any tutorial using matrices to cover an obvious simple case first to set the stage before moving directly to something like the identity matrix (or any other symmetric matrix for that matter).
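
As a sketch of what such a stage-setting case might look like (my own choice of example, not taken from any particular tutorial), a small non-square matrix of distinct values makes the fill order hard to miss, while the identity matrix comes out the same under either order:

```r
# Six distinct values in a 2 x 3 matrix: the fill order is visible at a glance.
m <- matrix(1:6, nrow = 2)
print(m)
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6

# The identity matrix is symmetric, so filling by column or by row
# produces exactly the same result and hides the difference.
vals <- c(1, 0, 0, 0, 1, 0, 0, 0, 1)
same <- identical(matrix(vals, ncol = 3),
                  matrix(vals, ncol = 3, byrow = TRUE))
print(same)
```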

Saturday, July 13, 2013

10 points: 10 plots

As an exercise in expanding my ability to display data I challenged myself to present 10 data points in 10 ways that were as distinct as possible. The idea was simple: use 10 random data points; minimize the axis and other ancillary information so as to focus on the data as much as possible; and try to minimize the overlap between each of the approaches.
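
The setup described above can be sketched in a few lines of base R. The seed, the use of `rnorm()`, and the name `pts` are assumptions for illustration; the data behind the original plots is not given.

```r
# Assumption: stand-in data; the original points and seed are unknown.
set.seed(42)
pts <- rnorm(10)

# Draw to a null device so the sketch runs without a display;
# drop this line (and dev.off) to see the plot interactively.
pdf(NULL)

# Scatterplot with axes and labels suppressed to keep the focus on the points.
plot(pts, axes = FALSE, xlab = "", ylab = "", pch = 19)

invisible(dev.off())
```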

Initially, I expected this would be a trivial task - something that would take a single sitting and a little bit of thought. A few attempts later, I kept circling back to the same handful of ideas while realizing just how many approaches I had not considered. What follows is a collection of the results of that exercise, with explanation where necessary.

1 - Standard Cartesian (scatterplot)

2 - Derivative Cartesian: uses labels instead of points to eliminate the need for tick marks on the x-axis.

3 - Impulses. Mixing numbers and characters on the x-axis tick marks is questionable; they could just as well have been labels at the top of each impulse.

4 - Sorted derivative Cartesian

5 - Boxplot

6 - Barplot

7 - Radial. Points are interpreted as radians and placed starting from 0 radians.

8 - Heatmap

9 - Cumulative Sum

10 - Financial/Intensity: Positive values are blue, negative are red. Absolute values define the radius of the circle used.
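
A few of these approaches can be approximated in base R. This is a rough sketch on stand-in data, not the code behind the original plots; the radial mapping onto a unit circle and the use of `cex` for circle size are my own guesses at one reasonable reading of the descriptions.

```r
set.seed(42)
pts <- rnorm(10)  # assumption: stand-in data

pdf(NULL)  # null device; drop this line to view the plots

# 4 - Sorted: same points, ordered by value.
plot(sort(pts), axes = FALSE, xlab = "", ylab = "", pch = 19)

# 7 - Radial: treat each value as an angle in radians on a unit circle.
plot(cos(pts), sin(pts), asp = 1, axes = FALSE, xlab = "", ylab = "", pch = 19)

# 9 - Cumulative sum of the points, drawn as a line.
running <- cumsum(pts)
plot(running, type = "l", axes = FALSE, xlab = "", ylab = "")

# 10 - Intensity: sign picks the colour, absolute value scales the circle.
cols  <- ifelse(pts >= 0, "blue", "red")
sizes <- 4 * abs(pts) / max(abs(pts))
plot(seq_along(pts), rep(0, length(pts)), cex = sizes, col = cols,
     pch = 19, axes = FALSE, xlab = "", ylab = "")

invisible(dev.off())
```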

I considered others, such as a LOESS fit, but they either needed the points to accompany them (to show what was being fitted), which made them too close to the Cartesian plot, or were too complex for just 10 points.

It was interesting to see how difficult it turned out to be to stretch 10 points into 10 distinct presentation approaches.