Saturday, April 23, 2011

y = mx + b

I recently compiled a set of plots for a technical report summarizing an evaluation of some of our software. The tests focused on the performance of multiple file transfer approaches. One of the comments I got back after sending the report out for review was that I should include the slope as an annotation for a best fit line I had included. I thought this was a curious request at first, since the best fit line was already intended to give a visual estimate of the behavior. After thinking about it, however, I considered the case where someone would want to compare their results to ours without having any of our raw data. A visual estimate would certainly not do in that case, especially if the range of their data fell outside the viewable area of the plot.

This was a simple change. R, the software I was using to generate the plots, has built-in functionality for linear regression. Getting the slope and y-intercept is as easy as:

> lm(y ~ x)

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x
   8.474523     0.007313

This got me thinking, however: would I know how to generate the linear regression without software? The answer was 'no', so I learned how. Here are the steps I took to reproduce the results of the statistical software I was using.

I was dealing with a set of data whose inputs ranged over the powers of two: [1, 2, 4, 8, ..., 2048, 4096], with multiple samples at each input. The first step was to determine how, in fact, a linear regression is calculated.
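As an aside, that input set is easy to generate in Ruby (the language I used for the rest of this exercise):

```ruby
# powers of two from 2**0 through 2**12, i.e. 1, 2, 4, ..., 4096
xs = (0..12).map { |i| 2**i }
```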

Linear regression (used to make a best fit line) can be calculated using the least squares approach. I needed the following: the mean and standard deviation of both the inputs and the outputs, and the Pearson product-moment correlation coefficient. Once I had all of those values I could generate the slope and y-intercept of the best fit line, and thus any point along the line.

In order to calculate the PMCC there is some upfront legwork that I needed to do. The equation requires the standard score and sample standard deviation of the two sets, and calculating the y-intercept requires the sample means. The Pearson product-moment correlation coefficient for a given sample is defined here (Wikipedia).
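For reference, and as I understand them, the formulas work out to the following (x̄ and ȳ are the sample means, s_x and s_y the sample standard deviations, n the number of samples):

```
r = (1 / (n - 1)) * sum over i of [ ((x_i - x̄) / s_x) * ((y_i - ȳ) / s_y) ]

m = r * (s_y / s_x)       <- slope
b = ȳ - m * x̄             <- y-intercept
```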

I put together some Ruby code to generate the necessary values:

#!/usr/bin/env ruby

class DataSet
    attr_reader :mean, :var, :sd, :size
    def initialize(ary)
        @size = ary.size
        @data = ary.dup
        @mean = ary.inject(0.0) { |sum, n| sum + n } / @size
        # sample variance: sum of squared deviations over (n - 1)
        @var = ary.inject(0.0) { |sum, n| sum + (n - @mean)**2 } / (@size - 1)
        @sd = Math.sqrt(@var)
    end

    def [](i)
        @data[i]
    end
end

# http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient
def pearson(x,y)
    g = []
    x.size.times do |i|
        g << ((x[i] - x.mean)/x.sd)*((y[i] - y.mean)/y.sd)
    end
    (1.0/(x.size - 1)) * g.inject(0.0) { |sum,n| sum + n }
end

# http://www.stat.wmich.edu/s216/book/node126.html
def slope(x, y)
    # m = r * (sd of outputs / sd of inputs)
    pearson(x, y) * (y.sd / x.sd)
end

def intercept(x, y)
    # b = mean of outputs - slope * mean of inputs
    y.mean - slope(x, y) * x.mean
end

def load_data(file)
    xs, ys = [], []
    input = file ? File.foreach(file) : $stdin
    input.each do |line|
        x, y, _ = line.chomp.split(',')
        xs << x.to_f
        ys << y.to_f
    end
    [DataSet.new(xs), DataSet.new(ys)]
end

x, y = load_data(ARGV.shift)

puts "Slope      : %0.6f" % slope(x, y)
puts "Y-intercept: %0.6f" % intercept(x, y)

Running that with my control data set yields the following:

Slope      : 0.007313
Y-intercept: 8.474523

With the results from R above as a sanity check, these values look accurate. Now, using these calculations, I could plot a best fit simply by drawing a line from (1, y1) to (4096, y2), where each y value is calculated using the equation y = mx + b.
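As a quick sketch, plugging the slope and intercept from my data set into that equation gives the two endpoints of the line:

```ruby
m = 0.007313   # slope from the regression above
b = 8.474523   # y-intercept from the regression above
fit = ->(x) { m * x + b }

left  = fit.call(1)     # ~8.481836
right = fit.call(4096)  # ~38.428571
puts "(1, %0.6f) to (4096, %0.6f)" % [left, right]
```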


So, while I'm still an advocate of using the right tool for the job I can now understand the mechanics behind these calculations and talk specifically about why they should (or should not) be used.



N.B. I am no expert on any of this. There are topics at play here I do not fully understand (details of the Pearson product-moment correlation coefficient, for one). This is more a narrative than any official guide to understanding.
