Question

So I'm making a python class which calculates the tfidf weight of each word in a document. Now in my dataset I have 50 documents. In these documents many words intersect, thus having multiple same word features but with different tfidf weight. So the question is how do I sum up all the weights into one singular weight?

Was it helpful?

Solution

First, let's get some terminology clear. A term is a word-like unit in a corpus. A token is a term at a particular location in a particular document. There can be multiple tokens that use the same term. For example, in my answer, there are many tokens that use the term "the". But there is only one term for "the".

I think you are a little bit confused. TF-IDF style weighting functions specify how to make a per term score out of the term's token frequency in a document and the background token document frequency in the corpus for each term in a document. TF-IDF converts a document into a mapping of terms to weights. So more tokens sharing the same term in a document will increase the corresponding weight for the term, but there will only be one weight per term. There is no separate score for tokens sharing a term inside the doc.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top