Frequencies of Lucene unigrams and bigrams
01-10-2019
Question
I am storing n-grams up to length 3 in a Lucene index. When I read the index and compute scores for terms and n-grams, I get results like this:
TERM             FREQUENCY  TFIDF
minority         25         16.512926
minority report  24         16.179296
report           27         13.559037
cruise           12         11.440491
tom cruise       7          8.737819
Take "tom cruise" as an example: as a bigram it occurs 7 times, while "cruise" occurs 12 times in total, so "cruise" occurs on its own only 5 times. I don't want this duplicated frequency, because it makes "cruise" alone score higher than "tom cruise", which is misleading, since most occurrences of "cruise" are inside the bigram.
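The subtraction described above (12 occurrences of "cruise" minus the 7 inside "tom cruise" leaves 5) can be sketched in Python. This is only an illustration of the arithmetic, not a Lucene feature: the function name `standalone_counts` is made up here, the counts are copied from the table, and it assumes only unigrams and bigrams are involved (trigrams would need extra care to avoid subtracting twice).

```python
from collections import Counter

# Raw frequencies copied from the table in the question.
raw = Counter({
    "minority": 25,
    "minority report": 24,
    "report": 27,
    "cruise": 12,
    "tom cruise": 7,
})

def standalone_counts(counts):
    """Subtract each bigram's frequency from the words it contains.

    Hypothetical helper: handles bigrams only, and only adjusts
    unigrams that are actually present in `counts`.
    """
    adjusted = Counter(counts)
    for term, freq in counts.items():
        words = term.split()
        if len(words) != 2:
            continue  # unigrams (and anything longer) are left alone
        for word in set(words):
            if word in adjusted:
                adjusted[word] -= freq
    return adjusted

alone = standalone_counts(raw)
for term in raw:
    print(term, raw[term], alone[term])
```

With these numbers "cruise" drops from 12 to 5 while "tom cruise" keeps its 7, so the bigram is no longer outranked by raw frequency alone.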
Sorry if I explained this badly. I don't know what this type of scoring is called; if someone knows the proper technical terms, please edit.
Thank you
Solution
I believe I answered a similar question you asked a while ago. IIUC, you want the more important terms to stand out, and you feel that "tom cruise" is more important than "cruise".
This looks like a problem in your model of the data. TFIDF does not seem suited to what you want. You can try building a language model instead, as described in Peter Norvig's chapter "Natural Language Corpus Data" in "Beautiful Data".
The gist is:
- Calculate a probability for each unigram, bigram and trigram (you will need smoothing or back-off, as explained in the chapter).
- Choose your terms by probability rather than TFIDF.
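The two steps above could look roughly like the following Python sketch. It is a minimal illustration under stated assumptions, not the method from Norvig's chapter: it uses simple add-alpha smoothing as a stand-in for whatever smoothing or back-off you choose, the names `ngrams` and `smoothed_probabilities` are hypothetical, and the token list is a toy example rather than data read from a Lucene index.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all n-grams of length n as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def smoothed_probabilities(tokens, max_n=3, alpha=1.0):
    """Estimate a probability for every observed 1..max_n-gram.

    Add-alpha smoothing: (count + alpha) / (total + alpha * possible),
    where `possible` is the number of n-grams that could be formed
    from the observed vocabulary.
    """
    vocab_size = len(set(tokens))
    probs = {}
    for n in range(1, max_n + 1):
        counts = Counter(ngrams(tokens, n))
        total = sum(counts.values())
        possible = vocab_size ** n
        for gram, count in counts.items():
            probs[gram] = (count + alpha) / (total + alpha * possible)
    return probs

# Toy input; in practice the tokens would come from your documents.
tokens = ("tom cruise stars in minority report and "
          "tom cruise likes minority report").split()

probs = smoothed_probabilities(tokens)

# Choose terms by probability rather than TFIDF.
ranked = sorted(probs, key=probs.get, reverse=True)
for term in ranked[:5]:
    print(term, round(probs[term], 4))
```

Note that raw probabilities still favor short n-grams, so in practice you would compare an n-gram's probability against what its parts predict, or against a background corpus, before picking terms.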
The paper "A Language Model Approach to Keyphrase Extraction" appears to take a similar approach. Some alternatives are Kea (which uses TFIDF as one feature among several) and Peter Turney's keyphrase extraction work.