Question

Here is my word vector:

google
test
stackoverflow
yahoo

I have assigned a value to each of these words as follows:

google : 1
test : 2
stackoverflow : 3
yahoo : 4

Here are some sample users and their words:

user1   google, test, stackoverflow
user2   test, google
user3   test, yahoo
user4   stackoverflow, yahoo
user5   stackoverflow, google
user6

To cater for users who do not have a value contained in the word vector, I assign '0'.

Based on this, the users correspond to:

user1   1, 2, 3
user2   2, 1, 0
user3   2, 4, 0
user4   3, 4, 0
user5   3, 1, 0
user6   0, 0, 0

I am unsure if these are the correct values, or even whether this is the correct approach for assigning values to each word in the word vector, so that I can apply 'Euclidean distance' and 'correlation'. I'm basing this on a snippet from the book 'Programming Collective Intelligence':

"Collecting Preferences The first thing you need is a way to represent different people and their preferences. If you were building a shopping site, you might use a value of 1 to indicate that someone had bought an item in the past and a value of 0 to indicate that they had not. "

For my dataset I do not have preference values, so I am just using a unique numerical value to represent whether or not a user has a word from the word vector.

Are these the correct values to set for my word vector? How should I determine what these values should be?


Solution

To make distance and similarity metrics work out, you need one column per word in the vocabulary, then fill each column with a boolean zero or one according to whether the corresponding word occurs in the sample. E.g.

                                 G   T   SO  Y!
google, test, stackoverflow  =>  1,  1,  1,  0
test, google                 =>  1,  1,  0,  0
stackoverflow, yahoo         =>  0,  0,  1,  1

etc.
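As a sketch, the encoding can be done with a small helper function (`encode` is an illustrative name, not from the answer):

```python
# One column per vocabulary word; 1 if the user has the word, else 0.
vocab = ["google", "test", "stackoverflow", "yahoo"]

def encode(words):
    return [1 if w in words else 0 for w in vocab]

user1 = encode(["google", "test", "stackoverflow"])  # [1, 1, 1, 0]
user2 = encode(["test", "google"])                   # [1, 1, 0, 0]
user4 = encode(["stackoverflow", "yahoo"])           # [0, 0, 1, 1]
user6 = encode([])                                   # [0, 0, 0, 0]
```

Note that every user vector has the same length (the vocabulary size), regardless of how many words the user has, so "empty" users simply become all-zero vectors.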

The squared Euclidean distance between the first two vectors is now

(1 - 1)² + (1 - 1)² + (1 - 0)² + (0 - 0)² = 1

which makes intuitive sense as the vectors differ in exactly one position. Similarly, the squared distance between the final two vectors is four, which is the maximal squared distance in this space.
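The arithmetic above can be checked directly with a few lines of Python (a minimal sketch, using plain lists rather than any particular library):

```python
# Squared Euclidean distance between two equal-length vectors.
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

v1 = [1, 1, 1, 0]  # google, test, stackoverflow
v2 = [1, 1, 0, 0]  # test, google
v3 = [0, 0, 1, 1]  # stackoverflow, yahoo

squared_distance(v1, v2)  # 1: the vectors differ in one position
squared_distance(v2, v3)  # 4: the vectors differ in all four positions
```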

This encoding is an extension of the "one-hot" or "one-of-K" coding, and it's a staple of machine learning on text (although few textbooks care to spell it out).
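The same binary vectors also work for the correlation measure mentioned in the question; here is a sketch of Pearson correlation computed from the standard formula (plain Python, no external libraries):

```python
import math

# Pearson correlation coefficient between two equal-length vectors.
def pearson(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

pearson([1, 1, 1, 0], [1, 1, 0, 0])  # ≈ 0.577
```

Note that Pearson correlation is undefined for a constant vector (such as an all-zero user), since its variance is zero; such users need special handling.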

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow