Question

I am just a beginner in Data Science. Could you help me solve this K-NN exercise?

Ex.1

The table below provides a training data set containing six observations, three predictors, and one qualitative response variable:

| x1 | x2 | x3 | y     |
|----|----|----|-------|
| 0  | 3  | 0  | Red   |
| 2  | 0  | 0  | Red   |
| 0  | 1  | 3  | Red   |
| 0  | 1  | 2  | Green |
| −1 | 0  | 1  | Green |
| 1  | 1  | 1  | Red   |

Suppose we wish to use this dataset to make a prediction for Y when x1 = x2 = x3 = 0 using K-nearest neighbors.

1) With the Euclidean distance, what is the prediction with K = 1 and with K = 3 for the test point (0, 0, 0)?

2) If the Bayes decision boundary in this problem is highly non-linear, then would we expect the best value for K to be large or small? Why?

Solution

An algorithm implementing KNN for classification tasks goes as follows (a short code sketch appears after the list):

  1. Compute the distance (here, the Euclidean distance) from the test point in question to every point in the training data.
  2. Using the distances from step 1, sort the training points in ascending order of distance from the test point (smallest distance first).
  3. Take the first K training points from the sorted list in step 2 and collect the class assigned to each.
  4. Assign to the test point the class that appears most often among the classes collected in step 3.
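
To make the four steps concrete, here is a minimal Python sketch of the procedure. The helper names (e.g. `knn_predict`) are my own rather than from any particular library, and ties in the vote go to whichever class appears first among the neighbors.

```python
from collections import Counter
from math import sqrt

def euclidean(p, q):
    """Euclidean distance between two points given as equal-length tuples."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_predict(train_points, train_labels, test_point, k):
    """Classify test_point by majority vote among its k nearest training points."""
    # Step 1: distance from the test point to every training point.
    distances = [euclidean(p, test_point) for p in train_points]
    # Step 2: indices of training points sorted by distance (smallest first).
    order = sorted(range(len(train_points)), key=lambda i: distances[i])
    # Step 3: labels of the first k training points in that order.
    nearest_labels = [train_labels[i] for i in order[:k]]
    # Step 4: the most common label among them.
    return Counter(nearest_labels).most_common(1)[0][0]
```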

In this case, since we are interested in the Euclidean distance, you can use the formula

$$ d(p, p') = \sqrt{(x_{1}- x_{1}')^{2} + (x_{2}- x_{2}')^{2} + (x_{3}- x_{3}')^{2}} $$

to calculate the distance between a training point $p = (x_{1}, x_{2}, x_{3})$ and the test point in question $p' = (x_{1}', x_{2}', x_{3}')$.
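
For example, plugging in the training point $p = (-1, 0, 1)$ from the table and the test point $p' = (0, 0, 0)$ gives

$$ d\big((-1, 0, 1),\,(0, 0, 0)\big) = \sqrt{(-1-0)^{2} + (0-0)^{2} + (1-0)^{2}} = \sqrt{2} \approx 1.41. $$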

With the Euclidean distance, what is the prediction with K = 1 and with K = 3 for the test point (0, 0, 0)?

For the K = 1 case, if my algebra serves me correctly, the training point with the smallest distance to the test point $p' = (0,0,0)$ is $p = (-1, 0, 1)$, at distance $\sqrt{2}$. Assuming that calculation is correct, we assign the class of $p$, namely "Green", to the test point $p'$. The K = 3 case can be approached the same way: calculate all distances from the training points to $p'$, keep the training points with the 3 smallest distances, and assign the class that occurs most often among those 3 points. I'll leave that part to you.
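
If you would like to check your arithmetic, the `knn_predict` sketch above can be run on the exercise data (the tuples below are just the rows of the table); with k = 1 it reproduces the "Green" prediction, and changing k to 3 lets you verify your own answer to the rest of question 1.

```python
# Training data taken directly from the exercise table.
X = [(0, 3, 0), (2, 0, 0), (0, 1, 3), (0, 1, 2), (-1, 0, 1), (1, 1, 1)]
y = ["Red", "Red", "Red", "Green", "Green", "Red"]

print(knn_predict(X, y, (0, 0, 0), k=1))  # -> Green
# Try k=3 yourself before printing it, to confirm your own calculation.
```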

If the Bayes decision boundary in this problem is highly non-linear, then would we expect the best value for K to be large or small? Why?

Some intuition can be gained by thinking about the K = 1 case. Each training point then has a neighborhood around it such that any test point falling in that region is classified the same way as that training point. By definition, this neighborhood consists of the points that are closer to the training point in question than to any other training point. These neighborhoods are known as Voronoi cells, and together they form a Voronoi tessellation (see Wikipedia for some pictures).

The decision boundaries for KNN with K = 1 are made up of edges of these Voronoi cells, and the key observation is that chains of such edges can approximate highly nonlinear curves (try making your own dataset and drawing its Voronoi cells to see this). As K grows, the prediction at each point becomes a majority vote over a larger neighborhood, so the fitted boundary becomes smoother. To answer the question, investigate how this ability to approximate nonlinear curves changes as K is made larger, and relate your findings to the presence of a highly nonlinear Bayes decision boundary.
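
If you want to explore this numerically rather than geometrically, one rough (and entirely optional) experiment is to fit KNN with a small and a large K on a toy dataset whose true boundary is curved, for example with scikit-learn's `make_moons`. The dataset size, noise level, and K values below are arbitrary choices, and the exact scores will vary.

```python
# Optional experiment: compare small vs. large K on a dataset with a nonlinear boundary.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=500, noise=0.25, random_state=0)  # curved class boundary
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for k in (1, 5, 100):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(f"K={k}: test accuracy {model.score(X_test, y_test):.3f}")
# The expectation (worth checking, not guaranteed for every random seed): small K can
# follow the curved boundary closely, while a very large K smooths it towards
# something nearly linear and loses accuracy on this kind of data.
```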
