Question

I have two questions:

  1. I've made a vector from a document by counting how many times each word appears in it. Is this the right way to build the vector, or do I need to do something else as well?

  2. Using the above method, I've created vectors for 16 documents, and the vectors have different sizes. Now I want to apply cosine similarity to find out how similar the documents are to each other. The problem is computing the dot product of two vectors, because they have different sizes. How would I do this?

Solution

  1. Sounds reasonable, as long as it means you have a list/map/dict/hash of (word, count) pairs as your vector representation.

  2. Treat every word that does not occur in a vector as having a count of zero, without actually storing those zeros anywhere. You can then compute the dot product of two such vectors with the following algorithm (pseudocode):

    algorithm dot_product(a : WordVector, b : WordVector):
        dot = 0
        for word, x in a do
            y = lookup(word, b)    // must return 0 if word does not occur in b
            dot += x * y
        return dot
    

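    The lookup can be implemented any way you like, as long as it returns 0 for a word that is missing from the vector. For speed, I'd use hash tables as the vector representation (e.g. Python's dict), so that each lookup takes constant time on average.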
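
As a rough illustration of both points, here is a minimal Python sketch, assuming simple lowercase tokenization with a regular expression and raw word counts as the vector entries. The function names `word_vector` and `cosine_similarity`, the tokenization rule, and the choice to return 0.0 for an empty document are assumptions made for this example, not part of the answer above.

```python
import math
import re
from collections import Counter


def word_vector(text):
    """Build a sparse (word -> count) vector from a document string.

    Assumption: words are runs of letters, digits, or apostrophes,
    compared case-insensitively.
    """
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def dot_product(a, b):
    """Sparse dot product: words missing from either vector contribute 0."""
    # Iterate over the smaller vector so fewer lookups are needed.
    if len(a) > len(b):
        a, b = b, a
    return sum(count * b.get(word, 0) for word, count in a.items())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    if norm_a == 0 or norm_b == 0:
        # Assumed convention: an empty document is similar to nothing.
        return 0.0
    return dot_product(a, b) / (norm_a * norm_b)


doc1 = word_vector("the cat sat on the mat")
doc2 = word_vector("the dog sat on the log")
print(dot_product(doc1, doc2))        # shared words: the, sat, on
print(cosine_similarity(doc1, doc2))  # close to 0.75 for these two sentences
```

Because each vector stores only the words that actually occur, two documents with different vocabularies can be compared directly. For 16 documents, calling `cosine_similarity` on every pair gives the full similarity matrix.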

Licensed under: CC-BY-SA with attribution