Tf-idf: Is this approach correct?
Question
I would like to calculate the tf-idf weight of a word in a document. I've drafted an equation that should give the tf-idf value on the left side. Is this correct?
Tf-idf for DOCUMENT:
tf-idf(WORD) = occurrences(WORD,DOCUMENT) / number-of-words(DOCUMENT) * log10 ( documents(ALL) / ( 1 + documents(WORD, ALL) ) )
occurrences(WORD,DOCUMENT): number of occurrences of WORD in DOCUMENT
number-of-words(DOCUMENT): number of words in DOCUMENT
documents(ALL): number of documents in the database
documents(WORD, ALL): number of documents in the database which contain WORD
It would be great if you could help me. Thank you very much in advance!
Solution
According to the Wikipedia article, this is correct. As the article suggests, use 1 + documents(WORD, ALL) in the denominator rather than just documents(WORD, ALL); this avoids a division by zero when WORD does not appear in any document.
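As a rough illustration, the formula from the question could be sketched in Python like this. The function name tf_idf, the list-of-tokens representation, and the tiny example corpus are assumptions made for this sketch; they are not part of the original question.

```python
import math

def tf_idf(word, document, corpus):
    """tf-idf of `word` in `document`, following the formula in the question.

    Assumption: `document` is a list of tokens and `corpus` is a list of
    such token lists (the "database" of documents).
    """
    if not document:
        return 0.0  # avoid dividing by zero for an empty document
    # occurrences(WORD, DOCUMENT) / number-of-words(DOCUMENT)
    tf = document.count(word) / len(document)
    # documents(WORD, ALL): documents that contain the word at least once
    docs_with_word = sum(1 for doc in corpus if word in doc)
    # log10( documents(ALL) / (1 + documents(WORD, ALL)) )
    idf = math.log10(len(corpus) / (1 + docs_with_word))
    return tf * idf

# Hypothetical example corpus, whitespace-tokenized
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "birds sing at dawn".split(),
    "rain fell all night".split(),
]
print(tf_idf("cat", corpus[0], corpus))
print(tf_idf("the", corpus[0], corpus))
```

One thing to keep in mind with the 1 + documents(WORD, ALL) variant: a word that occurs in every document gets a slightly negative value, because the argument of the logarithm drops below 1.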
Licensed under: CC-BY-SA with attribution