I wrote this as an answer to a question on CrossValidated, but then realized that the question was about something completely different.

I often discard lengthy posts, but I ended up being too happy with this one to do nothing with it. I learned something in putting it together, and I figured it might be helpful to someone else in the future. So here it is: how to retrieve and interpret the linear regression line between random variables \(X\) and \(Y\), using only a sample covariance matrix and the sample means.

Bivariate covariance implies linear regression

Suppose you have a 2-by-2 sample covariance matrix

\[ \operatorname{Cov}\left(X,Y\right)=\begin{bmatrix} s_X^2 & s_{XY} \\ s_{XY} & s_Y^2 \end{bmatrix} \]

and you would like to calculate the implied linear OLS regression line. Good news: this is actually a straightforward computation! However, you do have to decide which variable is the dependent one. Let’s pick one arbitrarily and say \(Y\) is the dependent variable. You can always change your mind and repeat the calculations the other way, but do be aware that the formulas here are asymmetric.

So now we have a model \(Y = \alpha + \beta X + \varepsilon\) with \(\varepsilon \sim \mathcal N \left(0, \sigma^2\right)\). Let \(a\) and \(b\) be the sample estimates of \(\alpha\) and \(\beta\), respectively. It’s not hard to derive that

\[ b = \frac{s_{XY}}{s_X^2} \]

and

\[ a = \bar y - b \bar x \]

where \(\bar x\) and \(\bar y\) are the sample means of \(x\) and \(y\) respectively. This gives us the fitted regression line. Again, you can “flip” the line by swapping \(X\) and \(Y\) in these equations.
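
To make this concrete, here is a minimal sketch in Python (assuming NumPy is available; the simulated data and variable names are mine, purely for illustration):

```python
import numpy as np

# Simulate a sample, only so we have something to summarize
rng = np.random.default_rng(0)
x = rng.normal(10, 2, size=500)
y = 3 + 1.5 * x + rng.normal(0, 1, size=500)

# Pretend all we were handed is the sample covariance matrix and the sample means
cov = np.cov(x, y)           # [[s_X^2, s_XY], [s_XY, s_Y^2]]
x_bar, y_bar = x.mean(), y.mean()

# Regression of Y on X, recovered from those summaries alone
b = cov[0, 1] / cov[0, 0]    # b = s_XY / s_X^2
a = y_bar - b * x_bar        # a = y_bar - b * x_bar

print(a, b)                  # should land near the true intercept 3 and slope 1.5
```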

Interpretation

The formula for \(b\) is straightforward to interpret. If we assume that the relationship between \(X\) and \(Y\) is a straight line, the regression line gets steeper as the covariance between \(X\) and \(Y\) increases relative to the amount of variation inherent in \(X\).

As for \(a\), some manipulation yields (assuming \(N\) observations)

\[ \begin{aligned} a &= \frac{1}{N}\sum_{i=1}^N y_i - \frac{s_{XY}}{s_X^2} \cdot \frac{1}{N}\sum_{i=1}^N x_i \\ &= \frac{1}{N}\left( \sum_{i=1}^N y_i - \frac{s_{XY}}{s_X^2} \sum_{i=1}^N x_i \right) \\ &= \frac{1}{N}\left( \sum_{i=1}^N y_i + \sum_{i=1}^N \left( - \frac{s_{XY}}{s_X^2} x_i \right) \right) \\ &= \frac{1}{N} \sum_{i=1}^N \left( y_i - \frac{s_{XY}}{s_X^2} x_i \right) \\ &= \frac{1}{N} \sum_{i=1}^N \left( y_i - b x_i \right) \end{aligned} \]

That is, \(a\) is just the average difference between \(y\) and \(bx\).
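
Continuing the little sketch above (still just an illustration), this reading is easy to confirm numerically:

```python
# The intercept is exactly the mean of y_i - b * x_i
print(np.allclose(a, np.mean(y - b * x)))   # True
```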

Let’s follow the math a little further. Let \(r_{XY}\) denote the sample correlation of \(X\) and \(Y\), for which we know \(r_{XY} = \frac{s_{XY}}{s_X s_Y}\). Then

\[ b = \frac{s_{XY}}{s_X^2} = \frac{s_{XY}}{s_Xs_X} = \frac{s_{XY}}{s_Xs_X} \frac{s_Y}{s_Y} = \frac{s_{XY}}{s_X s_Y} \frac{s_Y}{s_X} = r_{XY} \frac{s_Y}{s_X} \]

This re-expression of \(b\) can be seen as a kind of “decomposition” of the relationship between \(X\) and \(Y\). You can think of the ratio \(\frac{s_Y}{s_X}\) as the amount of variation inherent in \(Y\) relative to the amount of variation inherent in \(X\), and of \(r_{XY}\) as the strength of the linear association between them. The slope \(b\) thus depends on the scale of \(Y\) relative to the scale of \(X\) (in “units of \(Y\) per unit of \(X\)”), moderated by a unitless “linearity” parameter, which happens to be the correlation coefficient.
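
Continuing the same hypothetical example, the decomposition checks out numerically as well:

```python
# Slope computed two ways: from the covariance, and as r_XY * s_Y / s_X
s_x, s_y = np.sqrt(cov[0, 0]), np.sqrt(cov[1, 1])
r_xy = cov[0, 1] / (s_x * s_y)
print(np.allclose(b, r_xy * s_y / s_x))     # True
```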

Conclusion

Covariance and correlation are powerful tools, but they can be hard for new students to interpret despite their mathematically straightforward definitions. Linear regression, meanwhile, is an intuitive concept that students can usually wrap their heads around quickly, and one they are probably already familiar with from Microsoft Excel. Demonstrating the connection between the two concepts lets students ground their understanding of one in the other.

Many students of applied statistics don’t give enough credit to the mathematics underlying the techniques they use. I was one of those students for a long time, and my understanding of my statistical tools suffered for it. Math is obviously a great tool for theoretical work, but it is also an extremely powerful tool in applied work. Without a sense of the underlying mathematics, you can only use standard statistical procedures as they are implemented in your statistical package. But when you know the math, you are free to manipulate, modify, and extend your statistical toolbox.

With this post, I demonstrated some intuition about covariance that is not often taught in intro statistics classes. I did so in a restricted setting that let me formally derive each step with straightforward algebra, in the hope that an explicit construction would help readers appreciate the value of knowing the mathematics underlying statistical techniques. Hopefully it helped you!