Question

I would like to calculate tf-idf values for the words in a document. I've drafted an equation that should yield the tf-idf value on the left-hand side. Is it correct?

Tf-idf for DOCUMENT:

tf-idf(WORD) = occurrences(WORD,DOCUMENT) / number-of-words(DOCUMENT) * log10 ( documents(ALL) / ( 1 + documents(WORD, ALL) ) )
  • occurrences(WORD,DOCUMENT): number of occurrences of WORD in DOCUMENT
  • number-of-words(DOCUMENT): number of words in DOCUMENT
  • documents(ALL): number of documents in the database
  • documents(WORD, ALL): number of documents in the database which contain WORD

It would be great if you could help me. Thank you very much in advance!


Solution

According to the Wikipedia article, it is correct. You might want to use 1 + documents(WORD, ALL) instead of just documents(WORD, ALL), as the Wikipedia article suggests, since this avoids a division by zero when WORD does not appear in any document.

TF-IDF on Wikipedia
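
For illustration, the formula from the question could be sketched in Python roughly as follows. This is only a minimal sketch under some assumptions that are not in the original post: the function name tf_idf is made up, documents are assumed to be already tokenized into lists of words, and the tiny corpus is invented sample data.

```python
import math

def tf_idf(word, document, corpus):
    """tf-idf of `word` in `document`, following the formula in the question.

    Assumptions (not from the original post): `document` is a list of
    tokens and `corpus` is a list of such token lists.
    """
    if not document:
        return 0.0  # an empty document has no term frequencies

    # occurrences(WORD, DOCUMENT) / number-of-words(DOCUMENT)
    tf = document.count(word) / len(document)

    # documents(WORD, ALL): number of documents that contain the word
    docs_with_word = sum(1 for doc in corpus if word in doc)

    # log10(documents(ALL) / (1 + documents(WORD, ALL)))
    idf = math.log10(len(corpus) / (1 + docs_with_word))

    return tf * idf

# Invented sample data, tokenized by whitespace
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "dogs and cats are pets".split(),
    "the bird sang".split(),
]

print(tf_idf("mat", corpus[0], corpus))  # rare word: highest score of the three
print(tf_idf("cat", corpus[0], corpus))  # appears in two documents: lower score
print(tf_idf("the", corpus[0], corpus))  # appears in three of four documents: 0.0
```

One thing to be aware of with the 1 + documents(WORD, ALL) variant: a word that occurs in every document makes the argument of the logarithm smaller than 1, so its tf-idf value becomes slightly negative.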

Licensed under: CC-BY-SA with attribution