Regression Residuals and the Correlation Coefficient
What is the relationship between the coefficient of determination \(R^2\), a common metric for regression performance, and the Pearson correlation coefficient \(r\)? Why can both serve as measures of correlation, and what is the underlying connection between them?
1 The Relationship Between the Two
In machine learning, the sum of squared residuals (SSR) is often used to evaluate the performance of a regression model, while the Pearson correlation coefficient is often used to measure the linear correlation between two variables.
Sum of squared residuals:
\[ SSR = \sum_{i=1}^{n}\left(y_i - \hat{y}_i \right)^2 \tag{1}\]
Pearson correlation coefficient:
\[ r=\frac{\sum_{i=1}^{n}\left(X_{i}-\bar{X}\right)\left(Y_{i}-\bar{Y}\right)}{\sqrt{\sum_{i=1}^{n}\left(X_{i}-\bar{X}\right)^{2}} \sqrt{\sum_{i=1}^{n}\left(Y_{i}-\bar{Y}\right)^{2}}} \tag{2}\]
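As a sanity check, Eq. (2) can be implemented directly and compared against NumPy's built-in correlation matrix (a minimal sketch; the data and the helper name `pearson_r` are illustrative):

```python
import numpy as np

def pearson_r(x, y):
    """Pearson r per Eq. (2): centered cross-products over the
    product of the root sums of squares."""
    xd = x - x.mean()
    yd = y - y.mean()
    return np.sum(xd * yd) / np.sqrt(np.sum(xd**2) * np.sum(yd**2))

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x + rng.normal(size=100)

# Matches NumPy's built-in correlation coefficient.
assert np.isclose(pearson_r(x, y), np.corrcoef(x, y)[0, 1])
```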
So how exactly are these two related? The conclusion first:
For a least-squares fit of a simple linear regression,
\[ r(x, y) = \pm\sqrt{R^2} \tag{3}\]
A corresponding relationship, \(\rho(y, \hat{y}) = \sqrt{R^2}\), also holds for polynomial least-squares fits; the proof is in Sec. 2.3.
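The claim in Eq. (3) can be checked numerically for a linear least-squares fit (a sketch using NumPy on synthetic data; `np.polyfit` performs the least-squares fit):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=200)
y = 1.5 - 0.8 * x + rng.normal(scale=0.5, size=200)

# Least-squares linear fit and its predictions.
b1, b0 = np.polyfit(x, y, deg=1)
y_hat = b0 + b1 * x

# R^2 = 1 - SSE/SST for a fit with an intercept.
sse = np.sum((y - y_hat) ** 2)
sst = np.sum((y - y.mean()) ** 2)
r_squared = 1.0 - sse / sst

# rho(y, y_hat) is the positive root; r(x, y) carries the slope's sign.
assert np.isclose(np.corrcoef(y, y_hat)[0, 1], np.sqrt(r_squared))
assert np.isclose(abs(np.corrcoef(x, y)[0, 1]), np.sqrt(r_squared))
```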
2 Proof of the Relationship
2.1 Linear Regression and Least Squares
Linear regression model:
\[ y = \beta_0 + \beta_1 x + \epsilon \tag{4}\]
where \(\hat{y} = \beta_0 + \beta_1 x\). Least squares minimizes the sum of squared residuals (SSR):
\[ {SSR} = \sum_{i=1}^{n}(\epsilon_i)^2=\sum_{i=1}^{n}(y_i - \hat{y_i})^2 = \sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)^2 \tag{5}\]
Taking the partial derivatives and setting them to zero:
\[ \begin{array}{l} \frac{\partial {SSR}}{\partial \beta_{0}}=\sum_{i=1}^{n} 2\left(y_{i}-\beta_{0}-\beta_{1} x_{i}\right)(-1)=0 \\ \frac{\partial {SSR}}{\partial \beta_{1}}=\sum_{i=1}^{n} 2\left(y_{i}-\beta_{0}-\beta_{1} x_{i}\right)\left(-x_{i}\right)=0 \end{array} \tag{6}\]
then:
\[ \begin{array}{c} \sum_{i=1}^{n}\left(y_{i}-\beta_{0}-\beta_{1} x_{i}\right)=\sum_{i=1}^{n}\left(y_{i}-\hat{y}_{i}\right)=0 \\ \sum_{i=1}^{n}\left(y_{i}-\beta_{0}-\beta_{1} x_{i}\right) x_{i}=\sum_{i=1}^{n}\left(y_{i}-\hat{y}_{i}\right) x_{i}=0 \end{array} \tag{7}\]
From the first of these equations:
\[ \overline{\hat{y}}=\frac{\sum_{i=1}^{n} \hat{y}_{i}}{n}=\frac{\sum_{i=1}^{n} y_{i}}{n}=\bar{y} \tag{8}\]
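The two residual identities in Eq. (7) and the mean equality in Eq. (8) can be verified numerically (a sketch on synthetic data; `np.polyfit` performs the least-squares fit):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=50)
y = 0.3 + 2.0 * x + rng.normal(size=50)

b1, b0 = np.polyfit(x, y, deg=1)
resid = y - (b0 + b1 * x)

# Eq. (7): the normal equations force both sums to zero.
assert np.isclose(resid.sum(), 0.0, atol=1e-8)
assert np.isclose((resid * x).sum(), 0.0, atol=1e-8)
# Eq. (8): hence the fitted values have the same mean as y.
assert np.isclose((b0 + b1 * x).mean(), y.mean())
```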
2.2 The Correlation Coefficient
\[ \begin{aligned} \rho\left(y, \hat{y}\right) & =\frac{\operatorname{cov}\left(y, \hat{y}\right)}{\sqrt{\operatorname{var}\left(y\right) \operatorname{var}\left(\hat{y}\right)}} \\ & =\frac{\sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)\left(\hat{y}_{i}-\bar{y}\right)}{\sqrt{\sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2} \sum_{i=1}^{n}\left(\hat{y}_{i}-\bar{y}\right)^{2}}} \\ & =\frac{\sum_{i=1}^{n}\left(y_{i}-\hat{y}_{i}+\hat{y}_{i}-\bar{y}\right)\left(\hat{y}_{i}-\bar{y}\right)}{\sqrt{\sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2} \sum_{i=1}^{n}\left(\hat{y}_{i}-\bar{y}\right)^{2}}} \\ & =\frac{\sum_{i=1}^{n}\left(y_{i}-\hat{y}_{i}\right)\left(\hat{y}_{i}-\bar{y}\right)+\sum_{i=1}^{n}\left(\hat{y}_{i}-\bar{y}\right)^{2}}{\sqrt{\sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2} \sum_{i=1}^{n}\left(\hat{y}_{i}-\bar{y}\right)^{2}}} \\ & =\frac{0+\sum_{i=1}^{n}\left(\hat{y}_{i}-\bar{y}\right)^{2}}{\sqrt{\sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2} \sum_{i=1}^{n}\left(\hat{y}_{i}-\bar{y}\right)^{2}}} \\ & =\sqrt{\frac{\sum_{i=1}^{n}\left(\hat{y}_{i}-\bar{y}\right)^{2}}{\sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2}}} \\ & =\sqrt{R^{2}} \end{aligned} \tag{9}\]
where:
\[ \begin{aligned} \sum_{i=1}^{n}\left(y_{i}-\hat{y}_{i}\right)\left(\hat{y}_{i}-\bar{y}\right) & =\sum_{i=1}^{n}\left(y_{i}-\hat{y}_{i}\right)\left(\beta_{0}+\beta_{1} x_{i}-\bar{y}\right) \\ & =\left(\beta_{0}-\bar{y}\right) \sum_{i=1}^{n}\left(y_{i}-\hat{y}_{i}\right)+\beta_{1} \sum_{i=1}^{n}\left(y_{i}-\hat{y}_{i}\right) x_{i} \\ & =0 \end{aligned} \tag{10}\]
2.3 Coefficient of Determination and Pearson Correlation for Quadratic Regression
For quadratic regression, \(\hat{y} = \beta_0 + \beta_1 x + \beta_2 x^2\), the least-squares sum of squared residuals is:
\[ SSR=\sum_{i=1}^{n}\left(e_{i}\right)^{2}=\sum_{i=1}^{n}\left(y_{i}-\hat{y}_{i}\right)^{2}=\sum_{i=1}^{n}\left(y_{i}-\beta_{0}-\beta_{1} x_{i}-\beta_{2} x_{i}^{2}\right)^{2} \tag{11}\]
Setting the partial derivatives of SSR with respect to the parameters to zero gives the optimal parameters:
\[ \begin{array}{l} \frac{\partial S S R}{\partial \beta_{0}}=\sum_{i=1}^{n} 2\left(y_{i}-\beta_{0}-\beta_{1} x_{i}-\beta_{2} x_{i}^{2}\right)(-1)=0 \\ \frac{\partial S S R}{\partial \beta_{1}}=\sum_{i=1}^{n} 2\left(y_{i}-\beta_{0}-\beta_{1} x_{i}-\beta_{2} x_{i}^{2}\right)\left(-x_{i}\right)=0 \\ \frac{\partial S S R}{\partial \beta_{2}}=\sum_{i=1}^{n} 2\left(y_{i}-\beta_{0}-\beta_{1} x_{i}-\beta_{2} x_{i}^{2}\right)\left(-x_{i}^{2}\right)=0 \end{array} \tag{12}\]
which gives:
\[ \begin{array}{c} \sum_{i=1}^{n}\left(y_{i}-\beta_{0}-\beta_{1} x_{i}-\beta_{2} x_{i}^{2}\right)=\sum_{i=1}^{n}\left(y_{i}-\hat{y}_{i}\right)=0 \\ \sum_{i=1}^{n}\left(y_{i}-\beta_{0}-\beta_{1} x_{i}-\beta_{2} x_{i}^{2}\right) x_{i}=\sum_{i=1}^{n}\left(y_{i}-\hat{y}_{i}\right) x_{i}=0 \\ \sum_{i=1}^{n}\left(y_{i}-\beta_{0}-\beta_{1} x_{i}-\beta_{2} x_{i}^{2}\right) x_{i}^{2}=\sum_{i=1}^{n}\left(y_{i}-\hat{y}_{i}\right) x_{i}^{2}=0 \end{array} \tag{13}\]
From the equations above, it follows that:
\[ \bar{\hat{y}} = \bar{y} \tag{14}\]
Applying the formula for the correlation coefficient:
\[ \begin{aligned} \rho(y, \hat{y}) & =\frac{\operatorname{cov}\left(y, \hat{y}\right)}{\sqrt{\operatorname{var}\left(y\right) \operatorname{var}\left(\hat{y}\right)}} \\ & =\frac{\sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)\left(\hat{y}_{i}-\bar{y}\right)}{\sqrt{\sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2} \sum_{i=1}^{n}\left(\hat{y}_{i}-\bar{y}\right)^{2}}} \\ & =\frac{\sum_{i=1}^{n}\left(y_{i}-\hat{y}_{i}+\hat{y}_{i}-\bar{y}\right)\left(\hat{y}_{i}-\bar{y}\right)}{\sqrt{\sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2} \sum_{i=1}^{n}\left(\hat{y}_{i}-\bar{y}\right)^{2}}} \\ & =\frac{\sum_{i=1}^{n}\left(y_{i}-\hat{y}_{i}\right)\left(\hat{y}_{i}-\bar{y}\right)+\sum_{i=1}^{n}\left(\hat{y}_{i}-\bar{y}\right)^{2}}{\sqrt{\sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2} \sum_{i=1}^{n}\left(\hat{y}_{i}-\bar{y}\right)^{2}}} \\ & =\frac{0+\sum_{i=1}^{n}\left(\hat{y}_{i}-\bar{y}\right)^{2}}{\sqrt{\sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2} \sum_{i=1}^{n}\left(\hat{y}_{i}-\bar{y}\right)^{2}}} \\ & =\sqrt{\frac{\sum_{i=1}^{n}\left(\hat{y}_{i}-\bar{y}\right)^{2}}{\sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2}}} \\ & =\sqrt{R^{2}} \end{aligned} \tag{15}\]
where:
\[ \begin{aligned} \sum_{i=1}^{n}\left(y_{i}-\hat{y}_{i}\right)\left(\hat{y}_{i}-\bar{y}\right) & =\sum_{i=1}^{n}\left(y_{i}-\hat{y}_{i}\right)\left(\beta_{0}+\beta_{1} x_{i}+\beta_{2} x_{i}^{2}-\bar{y}\right) \\ & =\left(\beta_{0}-\bar{y}\right) \sum_{i=1}^{n}\left(y_{i}-\hat{y}_{i}\right)+\beta_{1} \sum_{i=1}^{n}\left(y_{i}-\hat{y}_{i}\right) x_{i}+\beta_{2} \sum_{i=1}^{n}\left(y_{i}-\hat{y}_{i}\right) x_{i}^{2} \\ & =0 \end{aligned} \tag{16}\]
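The quadratic case can be checked the same way: the residuals of a degree-2 least-squares fit are orthogonal to \(1\), \(x\), and \(x^2\) (Eq. 13), and \(\rho(y, \hat{y}) = \sqrt{R^2}\) (Eq. 15). A sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-2, 2, size=150)
y = 1.0 - x + 0.5 * x**2 + rng.normal(scale=0.3, size=150)

# Quadratic least-squares fit (linear in the parameters beta_0..beta_2).
b2, b1, b0 = np.polyfit(x, y, deg=2)
y_hat = b0 + b1 * x + b2 * x**2
resid = y - y_hat

# Eq. (13): residuals are orthogonal to 1, x, and x^2.
for basis in (np.ones_like(x), x, x**2):
    assert np.isclose((resid * basis).sum(), 0.0, atol=1e-7)

# Eq. (15): the correlation between y and y_hat equals sqrt(R^2).
r_squared = np.sum((y_hat - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)
assert np.isclose(np.corrcoef(y, y_hat)[0, 1], np.sqrt(r_squared))
```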
In summary: for any least-squares fit whose model includes an intercept and is linear in its parameters, we obtain \(\rho(y, \hat{y}) = \sqrt{R^2}\); in the simple linear case this also means \(r(x, y) = \pm\sqrt{R^2}\), with the sign of the slope. The argument extends to polynomials of any degree, and since a smooth nonlinear function can be approximated by its Taylor polynomial, the relationship carries over, at least approximately, to such nonlinear fits as well.
2.4 Easily Confused Formulas
- Sum of squared errors, SSE: \(\sum_{i=1}^{n}\left(y_i - \hat{y}_i \right)^2\)
- Sum of squares of the model (regression), SSM: \(\sum_{i=1}^{n}\left(\hat{y}_i - \bar{y} \right)^2\)
- Sum of squares total, SST: \(\sum_{i=1}^{n}\left(y_i - \bar{y} \right)^2\)
Note that the SSR defined earlier (sum of squared residuals) is the same quantity as SSE here. The three satisfy \(SST = SSE + SSM\), proved as follows:
\[ \begin{aligned} SST &= \sum_{i=1}^{n}\left(y_i - \bar{y} \right)^2\\ &= \sum_{i=1}^{n}\left((y_i - \hat{y}_i) + (\hat{y}_i - \bar{y}) \right)^2\\ &= \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 + \sum_{i=1}^{n}\left(\hat{y}_i - \bar{y} \right)^2 + \underbrace{2\sum_{i=1}^{n}\left(y_{i}-\hat{y}_{i}\right)\left(\hat{y}_{i}-\bar{y}\right)}_{=\,0}\\ &= SSE + SSM \end{aligned} \tag{17}\]
The cross term vanishes, as shown in Eq. 10.
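The decomposition can be confirmed numerically for a least-squares fit with an intercept (a sketch; the data is synthetic):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=80)
y = 2.0 + x + rng.normal(size=80)

b1, b0 = np.polyfit(x, y, deg=1)
y_hat = b0 + b1 * x

sse = np.sum((y - y_hat) ** 2)         # error (residual) sum of squares
ssm = np.sum((y_hat - y.mean()) ** 2)  # model (regression) sum of squares
sst = np.sum((y - y.mean()) ** 2)      # total sum of squares

# Eq. (17): the cross term vanishes, so SST = SSE + SSM.
assert np.isclose(sst, sse + ssm)
```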