Question

I am trying to classify some emails into two groups, announcements ("call for") and discussions ("discussion"), using k-nearest-neighbour classification. I suppose this could be done using

knn(train, test, cl, k = 1, l = 0, prob = FALSE, use.all = TRUE)

I already have the document-term matrix mails. I have no idea how to construct the train and test matrices and the cl factor from this document-term matrix. I can't find any good examples, and I don't understand the one at http://stat.ethz.ch/R-manual/R-devel/library/class/html/knn.html. Can anyone point me in the right direction?

Update

The whole TermDocumentMatrix is located at dl.dropboxusercontent.com/u/20641416/data


Solution

Well, I cannot solve your problem directly, since I have no sample data. However, I can clarify the example in the documentation, so you can start off with an idea of what's going on.

  • train is the "benchmark" data, for which the classification is already known. It serves as the reference set against which new observations are compared when making predictions.

  • cl contains the correct class labels for the training dataset.

Here the built-in dataset iris3 is used to simulate "known data". The train dataset is taken so that there is an equal number of each species (s - setosa, c - versicolor, v - virginica).

train <- rbind(iris3[1:25,,1], iris3[1:25,,2], iris3[1:25,,3]) 
cl <- factor(c(rep("s",25), rep("c",25), rep("v",25)))
  • test is the dataset you are trying to classify. Each of its rows is compared against the training data, and a prediction is generated for it.

The same dataset is used to construct the test data. Of course, we know the true classification here, but we pretend that we do not: knn never sees it. We keep the true labels (cl.test) only so that we can evaluate the predictions afterwards.

test <- rbind(iris3[26:50,,1], iris3[26:50,,2], iris3[26:50,,3])
cl.test <- cl

Finally, we are ready to proceed. Here's a vector of predictions for the test dataset. With prob = TRUE, we also see how "confident" the algorithm is about each case:

pr.test <- knn(train, test, cl, k = 3, prob = TRUE)
pr.test
 [1] s s s s s s s s s s s s s s s s s s s s s s s s s c c v c c c c c v c c c c c c c c c c
[45] c c c c c c v c c v v v v v c v v v v c v v v v v v v v v v v
attr(,"prob")
 [1] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
 [9] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
[17] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
[25] 1.0000000 1.0000000 1.0000000 0.6666667 1.0000000 1.0000000 1.0000000 1.0000000
[33] 1.0000000 0.6666667 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
[41] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
[49] 1.0000000 1.0000000 1.0000000 0.6666667 0.7500000 1.0000000 1.0000000 1.0000000
[57] 1.0000000 1.0000000 0.5000000 1.0000000 1.0000000 1.0000000 1.0000000 0.6666667
[65] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 0.6666667
[73] 1.0000000 1.0000000 0.6666667
Levels: c s v

We can now estimate how accurate our model is.

sum(pr.test==cl.test)/length(cl.test)

This gives 70 correct out of 75, i.e. about 93% accuracy.
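Returning to your emails: since I don't have your actual data, here is only a sketch of how train, test, and cl could be carved out of a document-term matrix. The mails matrix below, its column names, the row split, and the labels are all invented for illustration:

```r
library(class)

# Invented stand-in for a document-term matrix: rows = emails, columns = term counts
mails <- matrix(c(3, 0, 1, 0,   # emails 1-4: labels already known
                  2, 1, 0, 0,
                  0, 3, 0, 2,
                  1, 2, 0, 3,
                  2, 0, 1, 1,   # emails 5-6: to be classified
                  0, 2, 1, 2),
                nrow = 6, byrow = TRUE,
                dimnames = list(NULL, c("call", "discussion", "papers", "thread")))

train <- mails[1:4, ]    # rows with known labels
cl    <- factor(c("announce", "announce", "discuss", "discuss"))
test  <- mails[5:6, ]    # rows to classify
pred  <- knn(train, test, cl, k = 1)
pred  # with these made-up counts: announce discuss
```

With a real TermDocumentMatrix from the tm package you would first convert and transpose it so that documents are rows (e.g. m <- as.matrix(t(tdm))), and build cl from the labels of the emails you have already sorted by hand.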

Refer to the statistical literature for more details on how knn works. For your problem, consider cross-validation to tune the model (in particular, the choice of k).
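The class package also provides knn.cv(), which runs leave-one-out cross-validation on the training set. A minimal sketch of using it to choose k, reusing the iris3 training data from above (the 1..10 range is my arbitrary choice):

```r
library(class)

# Training data as in the example above
train <- rbind(iris3[1:25,,1], iris3[1:25,,2], iris3[1:25,,3])
cl <- factor(c(rep("s",25), rep("c",25), rep("v",25)))

# Leave-one-out cross-validated accuracy for each candidate k
accs <- sapply(1:10, function(k) mean(knn.cv(train, cl, k = k) == cl))
best_k <- which.max(accs)
```

You would then pass best_k as the k argument when calling knn() on your real test data.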

Licensed under: CC-BY-SA with attribution