
Thursday, July 7, 2016

Non-parametric Regression with KNN (K-Nearest Neighbors)

KNN Regression method
Given a value for K and a prediction point x_0, KNN regression first identifies the K training observations that are closest to x_0, represented by N_0. It then estimates f(x_0) using the average of all the training responses in N_0. In other words, \hat{f}(x_0)=\frac{1}{K}\sum_{x_i \in N_0}y_i.
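As an illustration, here is a minimal sketch of this estimator in Python. The function name knn_regress, the toy sine data, and the use of Euclidean distance are illustrative assumptions, not part of the original text.

```python
# Minimal sketch of the KNN regression estimate at a single query point x0.
import numpy as np

def knn_regress(X_train, y_train, x0, K):
    """Average the responses of the K training points closest to x0."""
    dists = np.linalg.norm(X_train - x0, axis=1)   # Euclidean distances to x0
    nearest = np.argsort(dists)[:K]                # indices of the K nearest neighbors (N_0)
    return y_train[nearest].mean()                 # hat{f}(x0) = mean of their responses

# Toy 1-D example (made-up data)
rng = np.random.default_rng(0)
X_train = rng.uniform(-3, 3, size=(100, 1))
y_train = np.sin(X_train[:, 0]) + rng.normal(scale=0.3, size=100)
print(knn_regress(X_train, y_train, x0=np.array([0.5]), K=5))
```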

KNN does not fix a parametric form in advance and then estimate its parameters. Instead, for each query point it finds the K nearest training observations and averages their responses, so that the fit follows the response as closely as possible at that point. Repeating this point by point produces a fitted curve or surface. With K \ge 4, this approach can fit better than linear regression.
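A rough sketch of this comparison, assuming scikit-learn and NumPy are available; the toy sine data, noise level, and the values of K tried are made up for illustration:

```python
# Fit KNN regression and linear regression on the same non-linear toy data
# and compare their test MSE.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)

def make_data(n):
    X = rng.uniform(-3, 3, size=(n, 1))
    y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=n)
    return X, y

X_train, y_train = make_data(100)
X_test, y_test = make_data(500)

for K in (1, 4, 9):
    knn = KNeighborsRegressor(n_neighbors=K).fit(X_train, y_train)
    print(f"KNN (K={K}) test MSE: {mean_squared_error(y_test, knn.predict(X_test)):.3f}")

lin = LinearRegression().fit(X_train, y_train)
print(f"Linear regression test MSE: {mean_squared_error(y_test, lin.predict(X_test)):.3f}")
```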

Practical guidance on choosing between KNN and linear regression is mainly framed in terms of the predictor dimension p:
When p=1 or p=2, KNN outperforms linear regression. But for p=3 the results are mixed, and for p \ge 4 linear regression is superior to KNN. In fact, the increase in dimension has only caused a small deterioration in the linear regression test set MSE, but it has caused more than a ten-fold increase in the MSE for KNN. This decrease in performance as the dimension increases is a common problem for KNN, and results from the fact that in higher dimensions there is effectively a reduction in sample size. In this data set there are 100 training observations; when p=1, this provides enough information to accurately estimate f(X). However, spreading 100 observations over p=20 dimensions results in a phenomenon in which a given observation has no nearby neighbors - this is the so-called curse of dimensionality. That is, the K observations that are nearest to a given test observation x_0 may be very far away from x_0 in p-dimensional space when p is large, leading to a very poor prediction of f(x_0) and hence a poor KNN fit. As a general rule, parametric methods will tend to outperform non-parametric approaches when there is a small number of observations per predictor.
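This effect can be reproduced with a small simulation. The data-generating process below (y depends only on the first predictor, the remaining p-1 predictors are pure noise), the noise level, and the choice K=9 are assumptions for illustration, not the data set from the text; only the 100 training observations match the description above.

```python
# Rough curse-of-dimensionality experiment: test MSE of KNN vs. linear
# regression as the number of predictors p grows, with n_train = 100.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)

def test_mse(p, n_train=100, n_test=1000, K=9):
    X_tr = rng.uniform(-1, 1, size=(n_train, p))
    X_te = rng.uniform(-1, 1, size=(n_test, p))
    f = lambda X: X[:, 0]                      # true f(X) depends only on the first predictor
    y_tr = f(X_tr) + rng.normal(scale=0.2, size=n_train)
    y_te = f(X_te) + rng.normal(scale=0.2, size=n_test)
    knn = KNeighborsRegressor(n_neighbors=K).fit(X_tr, y_tr)
    lin = LinearRegression().fit(X_tr, y_tr)
    return (mean_squared_error(y_te, knn.predict(X_te)),
            mean_squared_error(y_te, lin.predict(X_te)))

for p in (1, 2, 3, 4, 10, 20):
    knn_mse, lin_mse = test_mse(p)
    print(f"p={p:2d}  KNN MSE={knn_mse:.3f}  linear MSE={lin_mse:.3f}")
```

In this setup the linear model's test MSE stays roughly flat as p grows, while the KNN test MSE deteriorates sharply, matching the behavior described above.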
