Question

I have a Lucene index of a large text file (a corpus). For some n-grams, I need to find a list of similar words (a co-occurrence list).

For example, I have the unigram "table" with a term frequency of 1500, and I need a co-occurrence list like the one below, with the co-occurrence counts and the measured co-occurrence strength:

WORD       FREQ         Dice(Jaccard) coefficient
brown      1286         0.3
break      729          0.2

Solution

Search for brown and break.

Lucene will return only the documents that contain both terms, if you set the parameters right (that is, if both terms are marked as required, as in an AND query).
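
To show how the counts from such a query could be turned into the table from the question, here is a minimal sketch in plain Python. It does not use Lucene: a small in-memory inverted index stands in for it, and the names build_index and cooccurrence as well as the sample documents are made up for illustration. It assumes co-occurrence is counted per document (both words appear in the same document), which may differ from the windowing the question has in mind. In Lucene the same three numbers would come from the document frequency of each term and the hit count of the AND query.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercase token to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def cooccurrence(index, target, candidates):
    """Return (word, co-occurrence count, Dice, Jaccard) rows, sorted by Dice."""
    target_docs = index.get(target, set())
    rows = []
    for word in candidates:
        word_docs = index.get(word, set())
        # Documents containing both words: the result of an AND query.
        both = len(target_docs & word_docs)
        total = len(target_docs) + len(word_docs)
        # Dice = 2 * |A and B| / (|A| + |B|)
        dice = 2 * both / total if total else 0.0
        # Jaccard = |A and B| / |A or B|
        union = total - both
        jaccard = both / union if union else 0.0
        rows.append((word, both, dice, jaccard))
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Hypothetical toy corpus, one string per document.
docs = [
    "brown table in the hall",
    "table break",
    "brown table",
    "brown fox",
]
index = build_index(docs)
rows = cooccurrence(index, "table", ["brown", "break"])

for word, freq, dice, jaccard in rows:
    print(f"{word:<10} {freq:<6} {dice:.2f} {jaccard:.2f}")
```

Dice and Jaccard are different measures that rank pairs in the same order, so either one can fill the last column as long as it is used consistently.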

Licensed under: CC-BY-SA with attribution