## Monday, December 1, 2014

### Spurious correlations, three ways

I stumbled across a collection of spurious correlations a while back.

Here is one of them that struck me, in part because the data set links sour cream consumption per capita and motorcycle riders killed in non-collision transport accidents, but also because it is presented in an odd sort of way.

 Headline: Sour cream consumption linked to greater risk of non-collision transport death by motorcycle.

What an interesting way to represent the relationship between two variables. They have used the linking variable (time) not merely to identify which pairs of data points belong together, but used it to form the x-axis. So they are really displaying an overlapping pair of time-series graphs. That's a strange choice, given this is an obvious place to use a scatter plot. But it is stranger still that they decided to fit what appear to be spline curves to each series in an apparent attempt to make the correlation stand out more clearly.

The authors are interested in establishing the correlation between this unlikely duo, and to that end they have reported the correlation coefficient to be about r = 0.916. But correlation coefficients are more commonly represented graphically using a line of best fit on a scatter plot... so I did that:

Created with Desmos

I should have labeled the axes, but suffice it to say this shows the correlation quite nicely. Higher x-values (annual sour cream consumption per capita) appear to be associated with higher annual rates of motorcycle deaths (y-axis).

We can see the same correlation coefficient found earlier: r = 0.916. It seems they did use a linear regression after all. This just affirms that it was a strange decision to represent the bivariate data using a pair of time series graphs with spline interpolations.
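For anyone who wants to check an r value by hand, here is a minimal Python sketch of Pearson's correlation coefficient, which is the r a linear regression reports. The function name `pearson_r` and the toy numbers are placeholders for illustration, not the sour cream data; the real paired values would need to be swapped in to try to reproduce r = 0.916.

```python
import math


def pearson_r(xs, ys):
    """Return Pearson's correlation coefficient for paired samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")

    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Sums of squared deviations and of cross-products of deviations.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when one variable is constant")

    return sxy / math.sqrt(sxx * syy)


# Toy, perfectly linear values (NOT the sour cream data), just to
# show the call; a perfect positive linear relationship gives r = 1.
toy_x = [1.0, 2.0, 3.0, 4.0]
toy_y = [2.0, 4.0, 6.0, 8.0]
r_toy = pearson_r(toy_x, toy_y)
print(f"r = {r_toy:.3f}")
```

Note that time never enters the calculation: only the pairing of the values matters, which is why a scatter plot is the natural picture for it.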

But if we're into making strange analysis decisions today, let's take it a step further.

We could transform our numerical data into categorical data by setting a benchmark for "high" and "low" consumption rates and motorcycle death rates. That way, we could construct a two-way table and perform a chi-square frequency analysis!

(Psst! Lines of best fit and two-way tables are both introduced in grade 8 these days. The best fit / linear regression concept fits nicely with the emphasis on linear equations in 8th grade, and the two-way table seems to be included so teachers can answer the troublesome question, "But Teacher, what do we do if we have bivariate categorical data?")

So... we need a two-way table. What cut-scores shall we use to form high/low categories? How about:

Low sour cream consumption is anything less than... 7 pints per person per year. (That seems about right... doesn't it?)

And low incidence of motorcycle traffic deaths not involving collisions? Judging from the available data, let's call anything below 50 deaths per year a good year.

There are 3 years where the sour cream consumption was below 7 pints/person. All 3 of those years had low accident rates. Oooh, looking promising!

That leaves 7 years where the sour cream consumption was above 7 pints/person. Of those years, 5 had high accident rates and 2 had low accident rates.

Now... create the table in GeoGebra and do a bit of ChiSquare (pronounced "kai square") magic!

Since the p-value of 0.0384 is below the usual 5% threshold, we are led to conclude that the association is statistically significant!
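The same test can be sketched without GeoGebra. The snippet below is a minimal Python version, assuming a plain Pearson chi-square test with no continuity correction (that is an assumption about how the number above was produced). The function name `chi_square_2x2` is made up for illustration, and the counts come straight from the tallies above. With one degree of freedom, the p-value can be written using the complementary error function.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table.

    No continuity correction is applied (assumed here, not confirmed).
    Returns (chi2_statistic, p_value) with 1 degree of freedom.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Rows: low / high sour cream consumption.
# Columns: low / high motorcycle death years.
table = [[3, 0],
         [2, 5]]

chi2, p_value = chi_square_2x2(table)
print(f"chi-square = {chi2:.4f}, p = {p_value:.4f}")
```

Run as written, this should give a chi-square statistic of about 4.29 and a p-value of about 0.038, in line with the figure quoted above.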

Well, that does it. We have little choice but to conclude that sour cream consumption at an annual rate greater than 7 pints per person is associated with an increased risk of dying from a motorcycle incident not involving a collision. And I'd hardly call that spurious...

It is, perhaps, a bit specious...