K Nearest Neighbors classification Special Case with Identical Points

https://stackoverflow.com/questions/18681120

28-06-2022
|

Question

The question is about KNN algorithm for classification - the class labels of training samples are discrete.

Suppose that the training set has n points that are identical to the new pattern which we are about to classify, that is the distances from these points to new observation are zero (or <epsilon). It may happen that these identical training points have different class labels. Now suppose that n < K and there are some other training points which are the part of nearest neighbors collection but have non-zero distances to the new observation. How do we assign the class label to new point in this case?

There are few possibilities such as:

consider all K (or more if there are ties with the worst nearest neighbor) neighbors and do majority voting
ignore the neighbors with non-zero distances if there are "clones" of the new point in training data and take the majority vote only over the clones
same as 2. but assign the class with the highest prior probability in the training data (among clones)
...

Any ideas? (references would be appreciated as well)

Solution

Each of proposed methods will work in some problems, and in some they won't. In general, there is no need to actually think about such border cases and simply use the default behaviour (option "1" from your question). In fact, if border cases of any classification algorithm becomes the problem it is a signal of at least one of:

bad problem definition,
bad data representation,
bad data preprocessing,
bad model used.

From the theoretical point of view nothing changes if some points are exactly in the place of your training data. The only difference would be, if you have consistent training set (in the sense, that duplicates with different labels do not occur in the training data) and 100% correct (each label is a perfect labeling fot this point), then it would be reasonable to add an if clausule that answers according to the label of the point. But in reallity it is rarely the case.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow