R-Squared, r, and Simple Linear Regression

Linear regression is perhaps the simplest and most common statistical model (yet a very powerful one), and it fits a model of the form:

\[\mathbf{y}=\mathbf{X}\beta+\epsilon\]

This model makes some assumptions about the predictor variables, response variables, and their relationship:

  1. Weak exogeneity - predictor variables are assumed to be error-free.

  2. Linearity - the response variable is a linear combination of the predictor variables.

  3. Constant variance (homoscedasticity) - the errors of the response variable have the same variance regardless of the values of the predictor variables. This is likely violated in many applications, especially when the data are nonstationary.

  4. Independence of errors - the errors of the response variables are assumed to be uncorrelated with each other.

  5. Lack of multicollinearity - the predictor variables should not be highly correlated with each other. Regularization can be used to combat this (see the sketch after this list). On the other hand, when there is a lot of data, the collinearity problem is less significant.
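
To illustrate point 5, here is a minimal ridge-regularization sketch in MATLAB; the data, coefficients, and lambda value are made up for illustration, and this is just one common way to stabilize a fit with nearly collinear predictors:

% Ridge sketch: penalize large coefficients to stabilize the fit
% when predictors are nearly collinear. lambda is an arbitrary strength.
n  = 100;
x1 = randn(n,1);
x2 = x1 + 0.01*randn(n,1);                     % nearly collinear with x1
X  = [x1, x2];
y  = x1 + 2*x2 + 0.1*randn(n,1);               % simulated response

lambda     = 0.1;
beta_ols   = (X'*X) \ (X'*y);                  % unstable: X'*X is ill-conditioned
beta_ridge = (X'*X + lambda*eye(2)) \ (X'*y);  % regularized (ridge) solution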

The most common linear regression is fitted with the ordinary least-squares (OLS) method, which minimizes the sum of the squared differences between the observed responses in the dataset and those predicted by the model.
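
In MATLAB, an OLS fit can be obtained with the backslash operator; a minimal sketch on simulated data (the design matrix and coefficients below are arbitrary):

% OLS sketch: the backslash operator returns the least-squares solution,
% i.e. the beta that minimizes sum((y - X*beta).^2).
X        = [ones(100,1), randn(100,2)];        % design matrix with an intercept column
y        = X*[1; 2; -0.5] + 0.1*randn(100,1);  % simulated responses
beta_hat = X \ y;                              % OLS estimate of the coefficients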

Two very common metrics associated with OLS linear regression are the coefficient of determination \(R^2\) and the correlation coefficient, specifically the Pearson correlation coefficient \(r\).

The first measures how much of the variance is explained by the regression, and is calculated as \(1-\frac{SS_{res}}{SS_{tot}}\). The second measures the linear dependence between the response and predictor variables. \(r\) takes values between -1 and 1, while \(R^2\) is at most 1 (and lies between 0 and 1 for an OLS fit with an intercept). Say we fit a linear regression and want to check how good that model is: we can get the predicted values \(\mathbf{\hat y}=\mathbf{X}\hat\beta\) and check the \(R^2\) and \(r\) values between \(\mathbf{y}\) and \(\mathbf{\hat y}\). When a linear regression is fit using OLS, \(R^2=r^2\).
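
Continuing the hypothetical OLS sketch above, this relationship can be checked numerically (variable names reuse those from that sketch):

% Continuing the OLS sketch: compare R^2 with the squared correlation.
y_hat  = X * beta_hat;                 % fitted (predicted) values
SS_res = sum((y - y_hat).^2);          % residual sum of squares
SS_tot = sum((y - mean(y)).^2);        % total sum of squares
R2     = 1 - SS_res/SS_tot;
r2     = corr(y, y_hat)^2;
% R2 and r2 agree (up to floating-point error) for an OLS fit with an intercept.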

However, if two vectors are linearly related but one is not an OLS fit of the other, they can have a high \(r\) but a negative \(R^2\). In MATLAB:

>> a = rand(100,1);
>> b = rand(100,1);
>> corr(a,b)

ans =

    0.0537

>> compute_R_square(a,b)

ans =

   -0.7875

>> c = a + rand(100,1);
>> corr(a,c)

ans =

    0.7010

>> compute_R_square(a,c)

ans =

   -2.8239

Even though a and c above are clearly linearly related, their \(R^2\) value is negative. This is because the \(1-\frac{SS_{res}}{SS_{tot}}\) definition assumes c was generated by an OLS regression fit to a.
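
For reference, compute_R_square used above is not a MATLAB built-in; here is a minimal sketch of such a helper, implementing the \(1-\frac{SS_{res}}{SS_{tot}}\) definition (the argument order and body are assumptions, not necessarily the exact implementation used above):

function R2 = compute_R_square(y, y_pred)
% Coefficient of determination: 1 - SS_res/SS_tot.
% y      : observed values
% y_pred : predicted values (need not come from an OLS fit of y,
%          in which case R2 can be negative)
SS_res = sum((y - y_pred).^2);
SS_tot = sum((y - mean(y)).^2);
R2     = 1 - SS_res/SS_tot;
end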

According to Wikipedia,

Important cases where the computational definition of R2 can yield negative values, depending on the definition used, arise where the predictions that are being compared to the corresponding outcomes have not been derived from a model-fitting procedure using those data, and where linear regression is conducted without including an intercept. Additionally, negative values of R2 may occur when fitting non-linear functions to data.[5] In cases where negative values arise, the mean of the data provides a better fit to the outcomes than do the fitted function values, according to this particular criterion

This point really tripped me up when I was checking my decoder’s performance. Briefly, I fit a Wiener filter to training data, apply it to testing data, and then obtain the \(R^2\) and \(r\) between the predicted and actual responses of the testing data. I expected \(R^2=r^2\). However, \(R^2\) was negative and its magnitude was not related to \(r\) at all, precisely for this reason.

For evaluating decoder performance, then, we can instead:

  1. Fit a regression between \(\mathbf{y}\) and \(\mathbf{y_{pred}}\), then check the \(R^2\) of that regression. Note that when there is a lot of data, the fit will be statistically significant regardless of how weak the correlation between them is. So while the value of \(R^2\) is useful, its p-value is not as useful.

  2. Correlation coefficient \(r\). However, it is susceptible to outliers when there is not much data. This has been my go-to metric for evaluating neural tuning (as accepted in the field) and decoder performance.

  3. Signal-to-noise ratio (SNR). As Li et al., 2009 pointed out, \(r\) is scale- and translation-invariant; this means the vectors \([1,2,3,4,5]\) and \([2,3,4,5,6]\) have \(r=1\), which does not capture the offset between them.

SNR is calculated as \(\frac{var}{mse}\) (often converted to dB), where \(var\) is the sample variance of \(y\) and \(mse\) is the mean squared error of \(y_{pred}\) with respect to \(y\). This measures how well \(y_{pred}\) ‘tracks’ \(y\), as desired.
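
A minimal MATLAB sketch of this SNR computation, using the offset vectors from the example above (MATLAB's default sample variance is assumed here):

% SNR sketch: ratio of the variance of y to the mean squared error of y_pred.
y      = [1 2 3 4 5]';             % observed
y_pred = [2 3 4 5 6]';             % predicted: offset by 1, so corr(y, y_pred) = 1
mse    = mean((y - y_pred).^2);    % mean squared error of the prediction
snr_db = 10*log10(var(y) / mse);   % SNR in dB; the constant offset lowers the SNR
                                   % even though r = 1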