Question

I am storing n-grams up to length 3 in a Lucene index. When I read the index and compute scores for the terms and n-grams, I get results like this:

TERM               FREQUENCY    TFIDF
minority           25           16.512926
minority report    24           16.179296
report             27           13.559037
cruise             12           11.440491
tom cruise         7            8.737819

Take "tom cruise" as an example: as a bigram it occurs 7 times, while "cruise" occurs 12 times in total, so "cruise" occurs on its own only 5 times. I don't want the frequency to be counted twice like this, because "cruise" ends up scoring higher than "tom cruise", which is misleading: most of its occurrences are contained inside the bigram.
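To make the arithmetic concrete, here is a minimal Python sketch of the adjustment I have in mind. It is only an illustration, not Lucene code: the function name `standalone_counts` is made up, it only handles bigrams against unigrams, and it clamps at zero because a word in the middle of two overlapping bigrams would otherwise be subtracted twice.

```python
from collections import Counter

def standalone_counts(counts):
    """Subtract bigram frequencies from the unigrams they contain.

    Hypothetical helper: `counts` maps a term or bigram (space-separated)
    to its raw frequency. Trigrams are left untouched in this sketch.
    """
    adjusted = Counter(counts)
    for ngram, freq in counts.items():
        words = ngram.split()
        if len(words) != 2:
            continue  # only bigrams are used for the adjustment
        for word in words:
            if word in counts:
                # Clamp at zero: overlapping bigrams can over-subtract.
                adjusted[word] = max(0, adjusted[word] - freq)
    return adjusted

raw = {
    "minority": 25,
    "minority report": 24,
    "report": 27,
    "cruise": 12,
    "tom cruise": 7,
}
adjusted = standalone_counts(raw)
print(adjusted["cruise"])  # 12 - 7 = 5
```

With the numbers from the table, "cruise" drops to 5, "minority" to 1 and "report" to 3, while the bigram counts stay as they are.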

Sorry if I explain this badly; I don't know what this type of scoring is called. If someone knows the correct technical terms, please edit.

Thank you


Solution

I believe I answered a similar question you asked a while ago. IIUC, you want the more important terms to stand out, and you feel that "tom cruise" is more important than "cruise".

This looks like a problem in your model of the data. TFIDF seems to be wrong for what you want. You can try building a language model, as described in Peter Norvig's "Beautiful Data" chapter.

The gist is:

  • Calculate a probability for each unigram, bigram and trigram (you will need smoothing or back-off, as explained in the chapter).
  • Choose your terms by probability rather than TFIDF.
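
The two steps above can be sketched roughly as follows. This is not the code from Norvig's chapter; it is a simplified illustration that assumes add-k (Laplace) smoothing applied separately to each n-gram order, with one extra slot reserved for unseen n-grams. The names `ngram_counts`, `smoothed_probability` and `rank_terms` are made up for this example.

```python
from collections import Counter

def ngram_counts(tokens, max_n=3):
    """Count all n-grams of order 1..max_n in a token list."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[n][" ".join(tokens[i:i + n])] += 1
    return counts

def smoothed_probability(ngram, counts, k=1.0):
    """Add-k smoothed probability of an n-gram among n-grams of its order."""
    table = counts[len(ngram.split())]
    total = sum(table.values())
    types = len(table) + 1  # assumption: one extra slot for unseen n-grams
    return (table[ngram] + k) / (total + k * types)

def rank_terms(counts, k=1.0):
    """Rank every observed n-gram by smoothed probability, highest first."""
    scored = [
        (ngram, smoothed_probability(ngram, counts, k))
        for table in counts.values()
        for ngram in table
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

tokens = "tom cruise stars in minority report and tom cruise runs".split()
counts = ngram_counts(tokens)
ranking = rank_terms(counts)
for ngram, prob in ranking[:5]:
    print(f"{ngram:20s} {prob:.4f}")
```

Because each order is normalised on its own, a frequent bigram such as "tom cruise" is compared against other bigrams rather than being penalised for competing with its own unigrams. A real system would more likely use the back-off scheme from the chapter and estimate the counts from a large corpus.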

The paper "A Language Model Approach to Keyphrase Extraction" seems to do something similar. Some alternatives are Kea (which uses TFIDF as one feature among several) and Peter Turney's keyphrase extraction work.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow