
Thursday, July 7, 2016

Non-parametric Regression with KNN (K-Nearest Neighbors)

KNN Regression method
Given a value for K and a prediction point x_0, KNN regression first identifies the K training observations that are closest to x_0, represented by N_0. It then estimates f(x_0) using the average of all the training responses in N_0. In other words, \hat{f}(x_0)=\frac{1}{K}\sum_{x_i \in N_0}y_i.
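As an illustration, here is a minimal sketch of this estimator in Python. The function name knn_regress, the toy sine data, and the use of Euclidean distance are illustrative assumptions, not part of the original text.

```python
# Minimal sketch of the KNN regression estimate at a single query point x0.
import numpy as np

def knn_regress(X_train, y_train, x0, K):
    """Average the responses of the K training points closest to x0."""
    dists = np.linalg.norm(X_train - x0, axis=1)   # Euclidean distances to x0
    nearest = np.argsort(dists)[:K]                # indices of the K nearest neighbors (N_0)
    return y_train[nearest].mean()                 # hat{f}(x0) = mean of their responses

# Toy 1-D example (made-up data)
rng = np.random.default_rng(0)
X_train = rng.uniform(-3, 3, size=(100, 1))
y_train = np.sin(X_train[:, 0]) + rng.normal(scale=0.3, size=100)
print(knn_regress(X_train, y_train, x0=np.array([0.5]), K=5))
```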

KNN does not fix a parametric form in advance and then estimate its parameters. Instead, for each query point it finds the K nearest training observations and averages their responses, so that the fit follows the response as closely as possible at that point. Repeating this point by point produces a fitted curve or surface. With K \ge 4, this approach can fit better than linear regression.
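A rough sketch of this comparison, assuming scikit-learn and NumPy are available; the toy sine data, noise level, and the values of K tried are made up for illustration:

```python
# Fit KNN regression and linear regression on the same non-linear toy data
# and compare their test MSE.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)

def make_data(n):
    X = rng.uniform(-3, 3, size=(n, 1))
    y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=n)
    return X, y

X_train, y_train = make_data(100)
X_test, y_test = make_data(500)

for K in (1, 4, 9):
    knn = KNeighborsRegressor(n_neighbors=K).fit(X_train, y_train)
    print(f"KNN (K={K}) test MSE: {mean_squared_error(y_test, knn.predict(X_test)):.3f}")

lin = LinearRegression().fit(X_train, y_train)
print(f"Linear regression test MSE: {mean_squared_error(y_test, lin.predict(X_test)):.3f}")
```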

Practical guidance on choosing between KNN and linear regression is mainly framed in terms of the predictor dimension p:
When p=1 or p=2, KNN outperforms linear regression. But for p=3 the results are mixed, and for p \ge 4 linear regression is superior to KNN. In fact, the increase in dimension has only caused a small deterioration in the linear regression test set MSE, but it has caused more than a ten-fold increase in the MSE for KNN. This decrease in performance as the dimension increases is a common problem for KNN, and results from the fact that in higher dimensions there is effectively a reduction in sample size. In this data set there are 100 training observations; when p=1, this provides enough information to accurately estimate f(X). However, spreading 100 observations over p=20 dimensions results in a phenomenon in which a given observation has no nearby neighbors - this is the so-called curse of dimensionality. That is, the K observations that are nearest to a given test observation x_0 may be very far away from x_0 in p-dimensional space when p is large, leading to a very poor prediction of f(x_0) and hence a poor KNN fit. As a general rule, parametric methods will tend to outperform non-parametric approaches when there is a small number of observations per predictor.
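This effect can be reproduced with a small simulation. The data-generating process below (y depends only on the first predictor, the remaining p-1 predictors are pure noise), the noise level, and the choice K=9 are assumptions for illustration, not the data set from the text; only the 100 training observations match the description above.

```python
# Rough curse-of-dimensionality experiment: test MSE of KNN vs. linear
# regression as the number of predictors p grows, with n_train = 100.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)

def test_mse(p, n_train=100, n_test=1000, K=9):
    X_tr = rng.uniform(-1, 1, size=(n_train, p))
    X_te = rng.uniform(-1, 1, size=(n_test, p))
    f = lambda X: X[:, 0]                      # true f(X) depends only on the first predictor
    y_tr = f(X_tr) + rng.normal(scale=0.2, size=n_train)
    y_te = f(X_te) + rng.normal(scale=0.2, size=n_test)
    knn = KNeighborsRegressor(n_neighbors=K).fit(X_tr, y_tr)
    lin = LinearRegression().fit(X_tr, y_tr)
    return (mean_squared_error(y_te, knn.predict(X_te)),
            mean_squared_error(y_te, lin.predict(X_te)))

for p in (1, 2, 3, 4, 10, 20):
    knn_mse, lin_mse = test_mse(p)
    print(f"p={p:2d}  KNN MSE={knn_mse:.3f}  linear MSE={lin_mse:.3f}")
```

In this setup the linear model's test MSE stays roughly flat as p grows, while the KNN test MSE deteriorates sharply, matching the behavior described above.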
