
Tuesday, July 5, 2016

The Effect of Collinearity on Linear Regression and How to Remove It

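In a multiple regression model, when two predictors are collinear they tend to move together, which makes it hard to separate the individual effect of either predictor on the response. A contour plot of the RSS makes this visible.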

Contour plots of the RSS (residual sum of squares) associated with different possible coefficient estimates for the regression of the response on two predictors. Each ellipse represents a set of coefficients that correspond to the same RSS, with the ellipses nearest to the center taking on the lowest values of RSS. The black dots and associated dashed lines represent the coefficient estimates that result in the smallest possible RSS; in other words, these are the least squares estimates.

[Figure: RSS contour plots, left and right panels]

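Compared with the left panel, the right panel shows two predictors with a clear linear relationship between them. Many different pairs of coefficient values give almost the same RSS, so the least squares estimates are unstable: a small change in the data can shift them substantially. This also shows up in the fitted model: the standard errors and p-values of the coefficients are noticeably higher for the right panel than for the left, which confirms the uncertainty.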
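The inflation of the standard errors can be reproduced with a small simulation. The following is only a sketch under stated assumptions: the data are synthetic, the true coefficients (1, 2, 3) and the correlation levels 0.1 and 0.99 are arbitrary choices, and `simulate` and `ols_std_errors` are hypothetical helper names, not something from the original figure.

```python
import numpy as np

def ols_std_errors(X, y):
    """Fit OLS with an intercept; return (coefficients, standard errors)."""
    n = X.shape[0]
    A = np.column_stack([np.ones(n), X])        # prepend intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    dof = n - A.shape[1]                        # residual degrees of freedom
    sigma2 = resid @ resid / dof                # estimate of the noise variance
    cov = sigma2 * np.linalg.inv(A.T @ A)       # covariance of the estimates
    return beta, np.sqrt(np.diag(cov))

def simulate(rho, n=200, seed=0):
    """Two predictors with population correlation rho (synthetic data)."""
    rng = np.random.default_rng(seed)
    x1 = rng.normal(size=n)
    x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal(size=n)
    y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)
    return np.column_stack([x1, x2]), y

for rho in (0.1, 0.99):
    X, y = simulate(rho)
    beta, se = ols_std_errors(X, y)
    print(f"rho={rho}: coef={beta[1:].round(2)}, std err={se[1:].round(3)}")
```

With the nearly collinear pair, the standard errors of both slope estimates should come out several times larger than in the weakly correlated case, even though the noise level is the same.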

An effective way to detect this kind of linear dependence is the VIF statistic, defined as follows:
A better way to assess multicollinearity is to compute the variance inflation factor (VIF). The VIF is the ratio of the variance of $\hat{\beta}_j$ when fitting the full model to the variance of $\hat{\beta}_j$ if fit on its own:

$$\mathrm{VIF}(\hat{\beta}_j) = \frac{1}{1 - R^2_{X_j \mid X_{-j}}}$$

where $R^2_{X_j \mid X_{-j}}$ is the $R^2$ from a regression of $X_j$ onto all of the other predictors. The smallest possible value for VIF is 1, which indicates the complete absence of collinearity. Typically in practice there is a small amount of collinearity among the predictors. As a rule of thumb, a VIF value that exceeds 5 or 10 indicates a problematic amount of collinearity.
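The formula can be applied directly: regress each predictor on all the others, take the resulting $R^2$, and compute $1/(1-R^2)$. Below is a minimal NumPy sketch of that procedure. The function name `vif` and the synthetic predictors `x1`, `x2`, `x3` are assumptions made for illustration only.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (n_samples x n_predictors)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        # Regress X_j on an intercept plus all the other predictors
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid @ resid / np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

# Synthetic example: x2 is nearly a copy of x1, x3 is unrelated to both
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)
x3 = rng.normal(size=n)
print(vif(np.column_stack([x1, x2, x3])))
```

In this setup the first two predictors should show VIF values far above the 5 to 10 range, while the third stays close to 1. The sketch does not guard against an exactly collinear column, where $R^2 = 1$ and the ratio is undefined.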

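Once predictors with very large VIF values are found, consider dropping one of them and then checking again whether the regression behaves normally.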
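One possible way to automate the drop-and-recheck step is sketched below. Several choices here are assumptions rather than part of the original advice: the predictor with the largest VIF is the one removed, the cutoff of 10 is taken from the rule of thumb, and `drop_high_vif` is a hypothetical helper. The VIFs are computed as the diagonal of the inverse correlation matrix, which is equivalent to $1/(1-R^2)$ for each predictor.

```python
import numpy as np

def vif(X):
    """VIFs as the diagonal of the inverse correlation matrix of the predictors."""
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))

def drop_high_vif(X, names, threshold=10.0):
    """Repeatedly drop the predictor with the largest VIF until all are <= threshold."""
    X = np.asarray(X, dtype=float)
    names = list(names)
    while X.shape[1] > 1:                 # VIF needs at least two predictors
        v = vif(X)
        worst = int(np.argmax(v))
        if v[worst] <= threshold:
            break
        X = np.delete(X, worst, axis=1)   # remove the offending column
        names.pop(worst)
    return X, names

# Synthetic example: x1 and x2 are nearly identical, x3 is independent
rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)
x3 = rng.normal(size=n)
X_kept, kept = drop_high_vif(np.column_stack([x1, x2, x3]), ["x1", "x2", "x3"])
print(kept, vif(X_kept))
```

Which member of a collinear pair to keep is usually a modeling decision (interpretability, data quality, cost of measurement), so the automatic choice here is only a starting point. After dropping, the model should be refit and its standard errors inspected again.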
