Question

I have a text file that contains data in the LIBSVM format, i.e. each line looks like this:

165475 0:246870 1124384:2 342593:7 1141651:1 297582:1 1186846:1 17725:1 656602:1 463304:1 766612:1 573309:1 290046:1 748198:1 216665:1 950594:2 909004:1 29008:1 105623:1 5018:5 806027:1 1125729:1 757846:1 1023921:2 612980:1 120767:1 51340:1 108172:5 674420:2

where the first term (165475) is the label of the sample, followed by feature:weight pairs. The file contains a large number of such samples.

My question: given that these samples are used in a text classification problem, if I were to write my own code for k-nearest neighbors on this data, how do I measure the distance between two samples? How do the weights of each feature contribute to the distance?

I am currently using Python but am open to code in any language as long as I can understand the logic. Any help would be appreciated. Thanks in advance!


Solution

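Every pair is of the form index:value, which gives a very simple sparse vector for each sample. The weight (i.e. value) is simply the component of this vector along the corresponding dimension (i.e. index); any index that is not listed for a sample is implicitly zero. The distance between two samples is then an ordinary vector distance, such as Euclidean distance, so each weight contributes through the difference between the two samples' values in that dimension.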
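As a minimal sketch of that idea in Python, the snippet below parses one line into a dictionary mapping index to value and computes two distances between such dictionaries. The function names (`parse_libsvm_line`, `euclidean_distance`, `cosine_distance`) are hypothetical, chosen only for illustration. Cosine distance is not part of the answer above; it is included as an assumption because it is a common alternative for text data, where document length can otherwise dominate the Euclidean distance.

```python
import math


def parse_libsvm_line(line):
    """Split one LIBSVM line into (label, {index: value})."""
    parts = line.split()
    label = parts[0]  # kept as a string; convert if numeric labels are needed
    features = {}
    for pair in parts[1:]:
        index, value = pair.split(":")
        features[int(index)] = float(value)
    return label, features


def euclidean_distance(a, b):
    """Euclidean distance between two sparse vectors stored as dicts."""
    # Indices missing from a sample are treated as zero.
    keys = a.keys() | b.keys()
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))


def cosine_distance(a, b):
    """1 - cosine similarity; only shared indices contribute to the dot product."""
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # assumption: an all-zero vector is maximally distant
    return 1.0 - dot / (norm_a * norm_b)
```

For k-nearest neighbors, one would compute the chosen distance from the query sample to every training sample, keep the k smallest, and take a majority vote over their labels.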

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow