Question

I am studying machine learning and I am focusing on K-nearest neighbors. I have understood the algorithm, but I still have a doubt, which is about how to choose K, the number of neighbors.

I have seen that choosing $K=1$ leads to a fragmented decision boundary, while if I choose a larger value of $K$, I obtain a smoother boundary:

[Figure: decision boundaries for $K=1$ (fragmented) vs. a larger $K$ (smoother)]

I think I have understood why this happens.

But I have also studied that if $K=1$ we overfit, while if we increase $K$ the overfitting decreases, but why?

So, why does overfitting decrease if we choose a large K in K-nearest neighbors?


Solution

Overfitting is "The production of an analysis which corresponds too closely or exactly to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably." (Oxford dictionary)

When you fit an ML model, you use a dataset that you assume is a sample of the real statistical distribution you want to model. This means that the dataset does not fully reflect the real distribution and will contain anomalies, exceptions, or some randomness. An overfitted model sticks too closely to the particularities of this dataset and has too much variance, instead of learning statistical patterns that would smooth out these particularities and bring the model closer to the real distribution.

In the case of KNN, K controls the size of the neighborhood used to model the local statistical properties. A very small value of K makes the model more sensitive to local anomalies and exceptions, giving too much weight to these particular points. Conversely, too large a value of K makes the model ignore the local structure of the distribution you are trying to learn, and produces an underfitted model.
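You can see this trade-off directly by comparing train and test accuracy across values of K. The following is a minimal sketch, assuming scikit-learn and a synthetic two-moons dataset (the dataset, K values, and split are illustrative choices, not from the original post): a very small K gives near-perfect training accuracy with a larger train/test gap, while a very large K degrades both.

``` python
# Illustrative sketch (assumed setup): train/test accuracy of KNN as K grows.
# Small K -> fits the training set almost perfectly but generalizes worse (overfitting);
# very large K -> both accuracies drop toward the majority-class rate (underfitting).
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

for k in [1, 5, 25, 100, 200]:
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(f"K={k:>3}  train acc={model.score(X_train, y_train):.2f}  "
          f"test acc={model.score(X_test, y_test):.2f}")
```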

Let's take an example to illustrate this:

In a given city, you want to predict whether a household has a low or high income; you have data about most households, but not all. You decide to use KNN and predict the income category of the households with no data according to the majority category among the K closest neighbors. This is likely to work well because housing prices tend to create low-income and high-income neighborhoods.

If you use a small K, say K=1 (you predict based on the single closest neighbor), you might end up with this kind of prediction:

  • In a low-income neighborhood, you wrongly predict one household to have a high income because its closest neighbor has a high income. Indeed, this neighbor refused to move out of his childhood house even though he could afford to live anywhere else.
  • In a high-income neighborhood, you wrongly predict one household to have a low income because its closest neighbor has a low income. Indeed, although he could not afford the rent in that neighborhood, the house has belonged to his family forever.

The model overfits on these particular cases. Now if you increase the number of neighbors you compare to (i.e., increase K), you will get a majority of either low- or high-income households in the neighborhood and correctly classify these examples.

Finally, if you increase K too much, you will get an underfitted model. In our example, if we increase K indefinitely we end up taking every household in the city as a neighbor, and basically always predict whichever category is the majority in the whole city, ignoring local particularities. A small sketch of this example follows below.
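Here is a minimal sketch of the household example, with a hypothetical one-dimensional "street position" feature and made-up labels (none of these numbers come from the original answer). One high-income outlier sits inside a low-income block; K=1 copies the outlier's label, while a larger K recovers the neighborhood majority.

``` python
# Hypothetical data: low-income block around positions 0-5 with one high-income
# outlier at 2.1, high-income block around positions 10-14.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

positions = np.array([[0.0], [1.0], [2.0], [2.1], [3.0], [4.0], [5.0],
                      [10.0], [11.0], [12.0], [13.0], [14.0]])
income = np.array(["low", "low", "low", "high", "low", "low", "low",
                   "high", "high", "high", "high", "high"])

new_household = [[2.2]]  # unknown household right next to the outlier

for k in [1, 5]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(positions, income)
    print(f"K={k}: predicted income = {knn.predict(new_household)[0]}")
# K=1 -> "high" (copies the single outlier); K=5 -> "low" (neighborhood majority)
```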

Licensed under: CC-BY-SA with attribution