Thursday, July 7, 2016

Non-parametric regression with KNN (K-Nearest Neighbors)

KNN Regression method
Given a value for K and a prediction point $x_0$, KNN regression first identifies the K training observations that are closest to $x_0$, represented by $N_0$. It then estimates $f(x_0)$ using the average of all the training responses in $N_0$. In other words, $$\hat{f}(x_0)=\frac{1}{K}\sum_{x_i \in N_0}y_i.$$
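The averaging rule above can be sketched in a few lines of plain Python. This is a minimal illustration assuming a single predictor and absolute distance; the function name `knn_regress` and the toy data are made up for the example.

```python
def knn_regress(x_train, y_train, x0, k):
    """Predict f(x0) as the mean response of the k nearest training points."""
    # Sort training indices by distance to x0 (1-D predictor assumed here).
    order = sorted(range(len(x_train)), key=lambda i: abs(x_train[i] - x0))
    neighbors = order[:k]  # the index set N_0
    return sum(y_train[i] for i in neighbors) / k


# Hypothetical toy data: y = x^2 observed at five points.
x_train = [1.0, 2.0, 3.0, 4.0, 5.0]
y_train = [1.0, 4.0, 9.0, 16.0, 25.0]
prediction = knn_regress(x_train, y_train, x0=2.9, k=3)
print(prediction)  # average of 4, 9 and 16
```

With K=1 the prediction is simply the response of the single closest point, so a small K gives a rough, step-like fit and a larger K gives a smoother one.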

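KNN does not fix a parametric form in advance and then estimate its parameters. Instead it takes each point in turn, finds the K training observations nearest to that point, and averages their responses so as to track the response as closely as possible. Repeating this for every point yields a fitted curve or surface. In the example considered in these notes, KNN with $K \ge 4$ fit better than linear regression.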

Practical experience with KNN versus linear regression is mainly judged by the dimension of the predictors:
When p=1 or p=2, KNN outperforms linear regression. But for p=3 the results are mixed, and for p ≥ 4 linear regression is superior to KNN. In fact, the increase in dimension has only caused a small deterioration in the linear regression test set MSE, but it has caused more than a ten-fold increase in the MSE for KNN. This decrease in performance as the dimension increases is a common problem for KNN, and results from the fact that in higher dimensions there is effectively a reduction in sample size. In this data set there are 100 training observations; when p=1, this provides enough information to accurately estimate $f(X)$. However, spreading 100 observations over p=20 dimensions results in a phenomenon in which a given observation has no nearby neighbors - this is the so-called curse of dimensionality. That is, the K observations that are nearest to a given test observation $x_0$ may be very far away from $x_0$ in p-dimensional space when p is large, leading to a very poor prediction of $f(x_0)$ and hence a poor KNN fit. As a general rule, parametric methods will tend to outperform non-parametric approaches when there is a small number of observations per predictor.
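To make the "no nearby neighbors" effect concrete, here is a small simulation sketch. It assumes the predictors are drawn uniformly from the unit cube, which is an assumption of this example and not the data set described above; the function name and defaults are illustrative.

```python
import math
import random


def mean_knn_distance(p, n_train=100, k=3, n_test=50, seed=0):
    """Average distance from a test point to its k-th nearest training point
    when all points are uniform on the unit cube in p dimensions."""
    rng = random.Random(seed)
    train = [[rng.random() for _ in range(p)] for _ in range(n_train)]
    total = 0.0
    for _ in range(n_test):
        x0 = [rng.random() for _ in range(p)]
        dists = sorted(math.dist(x0, x) for x in train)
        total += dists[k - 1]  # distance to the k-th nearest neighbor
    return total / n_test


for p in (1, 2, 20):
    print(p, round(mean_knn_distance(p), 3))
```

With the same 100 training observations, the distance to the third nearest neighbor grows sharply as p goes from 1 to 20, so the "neighbors" being averaged are no longer local.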

Tuesday, July 5, 2016

The effect of collinearity on linear regression and how to remove it

In a multiple regression model, when two predictors are collinear they tend to move together, so it is hard to separate the effect of either one on the response. A contour plot of the RSS makes this visible.

The contour plot of the RSS (Residual Sum of Squares) associated with different possible coefficient estimates for the regression of the response on two predictors. Each ellipse represents a set of coefficients that correspond to the same RSS, with ellipses nearest to the center taking on the lowest values of RSS. The black dots and associated dashed lines represent the coefficient estimates that result in the smallest possible RSS -- in other words, these are the least squares estimates.

[Figure: RSS contour plots for two regressions, left and right panels]

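Compared with the left panel, the right panel shows clear collinearity between the two predictors: the contours form a long, narrow valley, so a wide range of coefficient pairs gives nearly the same RSS, and a small change in the data can move the least squares estimates a long way. The same effect shows up in the standard errors and p-values of the coefficients, which are noticeably larger in the right panel, confirming the extra uncertainty.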

An effective way to detect this kind of collinearity is the VIF, computed as follows:
A better way to assess multicollinearity is to compute the variance inflation factor (VIF). The VIF is the ratio of the variance of $\hat{\beta}_j$ when fitting the full model divided by the variance of $\hat{\beta}_j$ if fit on its own. $$VIF(\hat{\beta}_j) = \frac{1}{1-R^2_{X_j \vert X_{-j}}},$$ where $R^2_{X_j \vert X_{-j}}$ is the $R^2$ from a regression of $X_j$ onto all of the other predictors. The smallest possible value for VIF is 1, which indicates the complete absence of collinearity. Typically in practice there is a small amount of collinearity among the predictors. As a rule of thumb, a VIF value that exceeds 5 or 10 indicates a problematic amount of collinearity.
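As a small worked case of the VIF formula, the sketch below handles a model with exactly two predictors. In that special case the $R^2$ from regressing $X_1$ on $X_2$ equals their squared Pearson correlation, so no matrix algebra is needed. The function names and data are illustrative assumptions; with more than two predictors a full regression of each $X_j$ on the others would be required.

```python
import math


def pearson_r(a, b):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    s_ab = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    s_aa = sum((u - mean_a) ** 2 for u in a)
    s_bb = sum((v - mean_b) ** 2 for v in b)
    return s_ab / math.sqrt(s_aa * s_bb)


def vif_two_predictors(x1, x2):
    """VIF shared by both predictors in a two-predictor model.

    With only two predictors, R^2 of X1 on X2 is the squared correlation,
    so VIF = 1 / (1 - r^2).
    """
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)


# Hypothetical data: correlation 0.8 between the two predictors.
vif_correlated = vif_two_predictors([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
# Hypothetical data: uncorrelated predictors give the minimum VIF of 1.
vif_orthogonal = vif_two_predictors([-1, 0, 1], [1, -2, 1])
print(vif_correlated, vif_orthogonal)
```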

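Once a predictor turns out to have a very large VIF, consider dropping one of the collinear predictors and then checking again whether the regression behaves normally.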

Finding outliers with studentized residuals and high leverage points

An outlier is a point for which $y_i$ is far from the value predicted by the model. Outliers can arise for a variety of reasons, such as incorrect recording of an observation during data collection.

To address the outlier problem, instead of plotting the residuals, we can plot the studentized residuals, computed by dividing each residual $e_i$ by its estimated standard error. Observations whose studentized residuals are greater than 3 in absolute value are possible outliers.
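The rule of thumb above can be tried out with a short sketch for simple linear regression. It assumes the internally studentized variant, where each residual is divided by $RSE\sqrt{1-h_i}$; other software may use a leave-one-out variant instead. The data, with one injected outlier, are invented for the example.

```python
import math


def studentized_residuals(x, y):
    """Internally studentized residuals for a simple linear regression."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    intercept = y_bar - slope * x_bar
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    rse = math.sqrt(sum(e ** 2 for e in resid) / (n - 2))
    # Leverage h_i; the estimated standard error of e_i is RSE * sqrt(1 - h_i).
    lev = [1 / n + (xi - x_bar) ** 2 / s_xx for xi in x]
    return [e / (rse * math.sqrt(1 - h)) for e, h in zip(resid, lev)]


# Hypothetical data: a clean line with small alternating noise.
x = list(range(20))
y = [1 + 2 * xi + (0.1 if xi % 2 == 0 else -0.1) for xi in x]
y[10] += 5.0  # inject one artificial outlier
t = studentized_residuals(x, y)
flagged = [i for i, ti in enumerate(t) if abs(ti) > 3]
print(flagged)
```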


After removing outliers, check whether the RSE and $R^2$ of the model improve significantly.

We can also look at how much leverage an outlier has to judge how strongly it affects the least squares fit (the red line): the higher the leverage, the larger its effect on the least squares line.

$$h_i=\frac{1}{n}+\frac{(x_i-\bar{x})^2}{\sum^n_{i'=1}(x_{i'}-\bar{x})^2}, \quad \frac{1}{n} \le h_i \le 1$$

The average leverage over all the observations is always equal to $(p+1)/n$. If an observation's $h_i$ greatly exceeds $(p+1)/n$, we should suspect that it is a high leverage point. For example:
Left: observation 41 is a high leverage point, while 20 is not. The red line is the fit to all the data, and the blue line is the fit with observation 41 removed. Center: the red observation is not unusual in terms of its $X_1$ value or its $X_2$ value, but still falls outside the bulk of the data, and hence has high leverage. Right: observation 41 has high leverage and a high residual.
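The leverage statistic for a single predictor is easy to compute directly. The sketch below uses made-up data with one extreme $x$ value, and flags points whose leverage exceeds twice the average $(p+1)/n$; that factor of two is a cutoff chosen for this illustration, not one stated in these notes.

```python
def leverage(x):
    """Leverage h_i for each observation in a simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    return [1 / n + (xi - x_bar) ** 2 / s_xx for xi in x]


# Hypothetical predictor values; the last one sits far from the rest.
x = [1.0, 2.0, 3.0, 4.0, 20.0]
h = leverage(x)
average = 2 / len(x)  # (p + 1) / n with p = 1
# Assumed cutoff for this example: more than twice the average leverage.
high = [i for i, hi in enumerate(h) if hi > 2 * average]
print(h, high)
```

The leverages sum to $p+1$, which is why their average is $(p+1)/n$.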

The uncorrelated-errors requirement in linear regression

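Linear regression usually assumes that the error terms are uncorrelated. Once the errors are correlated, the fitted result can no longer be trusted, and this situation arises most often with time series data. For such observations, always plot the residuals against time to see whether correlation over time has distorted the fitted line.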

Plots of residuals from simulated time series data sets generated with differing levels of correlation $\rho$ between error terms for adjacent time points.
In the three plots above, $\rho$ denotes the correlation between the errors.
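A numeric companion to a residual-versus-time plot is the lag-1 correlation of the residuals. The sketch below simulates errors from an AR(1) process, $e_t = \rho e_{t-1} + \text{noise}$, which is an assumption of this example (the notes do not say how the plotted series were generated), and then estimates the correlation between adjacent errors.

```python
import random


def simulate_ar1(rho, n=1000, seed=1):
    """Errors following e_t = rho * e_{t-1} + noise (an AR(1) process)."""
    rng = random.Random(seed)
    errors = [rng.gauss(0, 1)]
    for _ in range(n - 1):
        errors.append(rho * errors[-1] + rng.gauss(0, 1))
    return errors


def lag1_correlation(e):
    """Sample correlation between each error and the next one in time."""
    mean = sum(e) / len(e)
    num = sum((a - mean) * (b - mean) for a, b in zip(e[:-1], e[1:]))
    den = sum((a - mean) ** 2 for a in e)
    return num / den


for rho in (0.0, 0.5, 0.9):
    print(rho, round(lag1_correlation(simulate_ar1(rho)), 2))
```

A lag-1 correlation near zero is what uncorrelated errors should produce; values well away from zero correspond to the visible tracking between adjacent residuals in the plots.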

A second requirement is that the error terms have a constant variance, $Var(\epsilon_i)=\sigma^2$. Non-constant variance in the errors, or heteroscedasticity, shows up as a funnel shape in the residual plot. In that case we can transform the response $Y$ with a concave function such as $\log Y$ or $\sqrt{Y}$, which shrinks the larger responses more and flattens the funnel.
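Why a log transform can flatten a funnel is easiest to see when the errors are multiplicative. The sketch below assumes $Y = \mu e^{\epsilon}$ with the same noise draws reused at every mean level $\mu$; this setup and the names in it are assumptions made for the illustration.

```python
import math
import random
import statistics


def spread_by_level(levels, eps):
    """Return (sd of Y, sd of log Y) at each mean level, for Y = mu * exp(eps)."""
    out = []
    for mu in levels:
        y = [mu * math.exp(e) for e in eps]  # multiplicative errors
        sd_raw = statistics.stdev(y)
        sd_log = statistics.stdev([math.log(v) for v in y])
        out.append((sd_raw, sd_log))
    return out


rng = random.Random(0)
eps = [rng.gauss(0, 0.3) for _ in range(1000)]  # shared noise draws
results = spread_by_level((1.0, 10.0, 100.0), eps)
for sd_raw, sd_log in results:
    print(round(sd_raw, 3), round(sd_log, 3))
```

On the raw scale the spread grows in proportion to the mean (the funnel), while on the log scale it is the same at every level.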


Using a residual plot to judge the fit

$$\text{Residual}_i = e_i = y_i - \hat{y}_i$$

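For a fitted regression, plotting the residuals shows how well the model fits the observations. The closer the trend through the residuals stays to the horizontal line abline(0,0), the better; a strongly curved trend indicates a poor fit.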


The figure shows that the fit on the right is better than the one on the left.
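A tiny example of the curved pattern described above: fitting a straight line to a response that is actually quadratic. The data and the helper `fit_line` are made up for the illustration.

```python
def fit_line(x, y):
    """Least squares intercept and slope for a simple linear regression."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    return y_bar - slope * x_bar, slope


x = [-3, -2, -1, 0, 1, 2, 3]
y = [xi ** 2 for xi in x]  # truly quadratic response
intercept, slope = fit_line(x, y)
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
print(residuals)  # → [5.0, 0.0, -3.0, -4.0, -3.0, 0.0, 5.0]
```

The residuals are positive at both ends and negative in the middle, a U shape that signals the linear model is missing curvature.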

Synergy effect or interaction effect in linear regression

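Consider a linear regression of the form $$Y=\beta_0 + \beta_1X_1 + \beta_2X_2 + \epsilon.$$ In practice $X_1$ and $X_2$ may trade off against each other, for example when a fixed budget is split between the two variables. The equation above does not capture this relationship between the two predictors, so we add an interaction term: $$Y=\beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_1X_2 + \epsilon.$$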
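One way to read the interaction model is to group the terms that multiply $X_1$. The symbol $\tilde{\beta}_1$ is introduced here only for this rearrangement:

$$Y=\beta_0 + (\beta_1 + \beta_3X_2)X_1 + \beta_2X_2 + \epsilon = \beta_0 + \tilde{\beta}_1X_1 + \beta_2X_2 + \epsilon, \qquad \tilde{\beta}_1 = \beta_1 + \beta_3X_2.$$

The effect of a one-unit increase in $X_1$ on $Y$ is therefore $\beta_1 + \beta_3X_2$, which is no longer constant but changes with the level of $X_2$. When $\beta_3 = 0$ the expression reduces to the additive model.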

The hierarchical principle states that if we include an interaction in a model, we should also include the main effects, even if the p-values associated with their coefficients are not significant.

When an interaction term is introduced, the main effects $X_1$ and $X_2$ should be kept even if the p-values of their coefficients are large.

Monday, July 4, 2016

The difference between a confidence interval and a prediction interval

We use a confidence interval to quantify the uncertainty surrounding the average sales over a large number of cities. For example, given that $100,000 is spent on TV advertising and $20,000 is spent on radio advertising in each city, the 95% confidence interval is [10,985, 11,528]. We interpret this to mean that 95% of intervals of this form will contain the true value of f(X). On the other hand, a prediction interval can be used to quantify the uncertainty surrounding sales for a particular city. Given that $100,000 is spent on TV advertising and $20,000 is spent on radio advertising in that city, the 95% prediction interval is [7,930, 14,580]. We interpret this to mean that 95% of intervals of this form will contain the true value of Y for this city. Note that both intervals are centered at 11,256, but that the prediction interval is substantially wider than the confidence interval, reflecting the increased uncertainty about sales for a given city in comparison to the average sales over many locations.

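A confidence interval concerns the average response that the model gives for all observations with the same predictor values, whereas a prediction interval concerns the response of one particular observation. The prediction interval is usually wider because it also includes the irreducible error.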
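For simple linear regression the two intervals differ only by an extra "1 +" under the square root, which is the irreducible error term. The sketch below computes both half-widths on made-up data (not the advertising data quoted above); the critical value `t_crit` is passed in by hand as an assumption of the example.

```python
import math


def interval_half_widths(x, y, x0, t_crit):
    """Half-widths of the confidence and prediction intervals at x0
    for a simple linear regression of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    intercept = y_bar - slope * x_bar
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    rse = math.sqrt(rss / (n - 2))
    shared = 1 / n + (x0 - x_bar) ** 2 / s_xx
    se_mean = rse * math.sqrt(shared)      # uncertainty in the average response
    se_pred = rse * math.sqrt(1 + shared)  # adds the irreducible error variance
    return t_crit * se_mean, t_crit * se_pred


# Hypothetical data for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
# t_crit = 3.182 is the 97.5% quantile of a t distribution with n - 2 = 3
# degrees of freedom, hard-coded here to avoid an extra dependency.
ci_half, pi_half = interval_half_widths(x, y, x0=3.0, t_crit=3.182)
print(ci_half, pi_half)
```

Both intervals are centered at the same fitted value; only the half-width changes.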

Using $R^2$ and RSE to judge whether a predictor is useful in linear regression

$$R^2 = Cor(Y, \hat{Y})^2$$
$$RSE = \sqrt{\frac{1}{n-p-1}RSS}$$

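The closer $R^2$ is to 1, the larger the share of the variation in the response that the predictors explain. So when fitting a linear regression we aim for $R^2 \to 1$, and this can go hand in hand with variable selection.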

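For example, suppose $R^2=0.89719$ with $X_1, X_2$ as predictors of $Y$, and $R^2=0.8972$ after adding $X_3$. The increase in $R^2$ from the third predictor is tiny, so we can conclude that $X_1, X_2$ already explain the response well and $X_3$ explains very little; it can be left out of the regression model.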

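RSE = Residual Standard Error. In linear regression we look for a small RSE. In the example above, $RSE = 1.681$ with $X_1, X_2$ as predictors and $RSE=1.686$ after adding $X_3$. Adding the third predictor increases the RSE, which confirms the conclusion above: $X_3$ should not be added to the linear regression model.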
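Both statistics can be computed from the residuals of a fitted model. The sketch below does this for a simple linear regression ($p=1$) on made-up data, and also checks numerically that $1 - RSS/TSS$ agrees with the squared correlation between $Y$ and $\hat{Y}$. The function name and data are illustrative, not the $X_1, X_2, X_3$ example above.

```python
import math


def fit_summary(x, y):
    """Return (R^2, squared Cor(y, y_hat), RSE) for a simple linear regression."""
    n, p = len(x), 1
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    intercept = y_bar - slope * x_bar
    fitted = [intercept + slope * xi for xi in x]
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    tss = sum((yi - y_bar) ** 2 for yi in y)
    # Squared correlation between observed and fitted values.
    f_bar = sum(fitted) / n
    s_yf = sum((yi - y_bar) * (fi - f_bar) for yi, fi in zip(y, fitted))
    s_ff = sum((fi - f_bar) ** 2 for fi in fitted)
    cor_sq = s_yf ** 2 / (tss * s_ff)
    return 1 - rss / tss, cor_sq, math.sqrt(rss / (n - p - 1))


# Hypothetical data for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
r2, cor_sq, rse = fit_summary(x, y)
print(round(r2, 4), round(cor_sq, 4), round(rse, 4))
```

Because the RSE divides by $n-p-1$, adding a predictor that barely reduces the RSS can raise the RSE, which is the pattern seen with $X_3$.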