This was a simple change. R, the software I was using to generate the plots, has built in functionality for linear regression. Getting the slope and y-intercept is as easy as:
> lm(y ~ x)
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept) x
8.474523 0.007313
This got me thinking, however. Would I know how to generate the linear regression without software? The answer was 'no', so I learned how. Here are the steps I took to generate the same results as the statistical software I was using.
I was dealing with a set of data that was generated using inputs over the powers of two: [1, 2, 4, 8, ..., 2048, 4096]. I also had multiple samples at each input. The first step was to determine how, in fact, a linear regression is calculated.
Linear regression (used to make a best fit line) can be calculated using the least squares approach. I needed to have the following: mean and standard deviation of both the inputs and outputs and the Pearson product-moment correlation coefficient. Once I had all of those calculations I could generate the slope and y-intercept of the best fit line and thus any point along the line.
In order to calculate the PMCC there is some upfront legwork that I needed to do. The equation requires the standard score and sample standard deviation from the two sets and calculating the y-intercept requires the sample means. The Pearson product-moment correlation coefficient for a given sample is defined here (Wikipedia).
I put together some Ruby code to generate the necessary values:
#!/usr/bin/env ruby
class DataSet
attr_reader :mean, :var, :sd, :size
def initialize(ary)
@size = ary.size
@data = ary.dup
@mean = ary.inject(0.0) { |sum,n| sum + n } / @size
@var = ary.inject(0.0) { |sum,n| (n-@mean)**2 + sum }
@sd = Math.sqrt(@var/(@size - 1))
end
def [](i)
@data[i]
end
end
# http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient
def pearson(x,y)
g = []
x.size.times do |i|
g << ((x[i] - x.mean)/x.sd)*((y[i] - y.mean)/y.sd)
end
(1.0/(x.size - 1)) * g.inject(0.0) { |sum,n| sum + n }
end
#http://www.stat.wmich.edu/s216/book/node126.html
def slope(x,y)
pearson(x,y) * (x.sd/y.sd)
end
def intercept(x,y)
x.mean - slope(x, y)*y.mean
end
def load_data(file)
xs, ys = [], []
if file
File.open(file).each do |line|
x,y,_ = line.chomp.split(',')
xs << x.to_f
ys << y.to_f
end
else
while line = gets
x,y,_ = line.chomp.split(',')
xs << x.to_f
ys << y.to_f
end
end
[DataSet.new(xs), DataSet.new(ys)]
end
x, y = load_data(ARGV.shift)
puts "Slope : %0.6f" % [slope y, x]
puts "Y-intercept: %0.6f" % [intercept y, x]
Running that with my control data set yields the following:
Slope : 0.007313
Y-intercept: 8.474523
With the results from R above as a sanity check these values look accurate. Now, using these calculations, I could plot a best fit simply by drawing a line from (1,y) to (4096,y) where y is calculated using the equation: y = mx + b.
So, while I'm still an advocate of using the right tool for the job I can now understand the mechanics behind these calculations and talk specifically about why they should (or should not) be used.
N.B. I am no expert on any of this. There are topics at play here I do not fully understand (details of the Pearson product-moment correlation coefficient, for one). This is more a narrative than any official guide to understanding.