Question

I am trying to analyze a large corpus of documents stored in one huge file (3.5 GB, 300K lines, 300K documents, one document per line). In this process I am using Lucene for indexing and LingPipe for preprocessing.

The problem is that I want to get rid of very rare words in the documents. For example, if a word appears in fewer than MinDF documents in the corpus (the huge file), I want to remove it.

I can try to do it with Lucene: compute the document frequency of every distinct term, sort the terms in ascending order of DF, collect those whose DF is lower than MinDF, then go over the huge file again and remove those terms line by line.
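To make the first step concrete, collecting the terms whose DF is below MinDF could look roughly like the sketch below. This is only a sketch: it assumes the Lucene 3.x API, where IndexReader.terms() returns a TermEnum, and the index directory, field name, and MinDF value are placeholders supplied by the caller.

```java
import java.io.File;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.store.FSDirectory;

public class RareTermCollector {

    // Collect all terms of the given field whose document frequency is below minDf.
    public static Set<String> collectRareTerms(File indexDir, String field, int minDf)
            throws Exception {
        IndexReader reader = IndexReader.open(FSDirectory.open(indexDir));
        Set<String> rareTerms = new HashSet<String>();
        TermEnum terms = reader.terms();
        try {
            while (terms.next()) {
                Term term = terms.term();
                // Keep only terms of the target field with DF below the threshold.
                if (field.equals(term.field()) && terms.docFreq() < minDf) {
                    rareTerms.add(term.text());
                }
            }
        } finally {
            terms.close();
            reader.close();
        }
        return rareTerms;
    }
}
```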

This process will be insanely slow. Does anybody know of any quicker way to do this using Java?

Regards


Solution

First create a temp index, then use the information in it to produce the final index. Call IndexReader.terms(), iterate over the resulting TermEnum, and read docFreq() for each term. Accumulate all low-frequency terms, then feed that set into an analyzer that extends StopwordAnalyzerBase when you create the final index.
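A minimal sketch of such an analyzer, assuming the Lucene 3.x analysis API (StopwordAnalyzerBase, StandardTokenizer, LowerCaseFilter, StopFilter); the stop set is the collection of low-frequency terms gathered from the temp index:

```java
import java.io.Reader;
import java.util.Set;

import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.StopwordAnalyzerBase;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

// Analyzer that drops every term whose document frequency was below MinDF
// in the temp index; the collected rare-term set is used as the stop set.
public final class MinDfAnalyzer extends StopwordAnalyzerBase {

    public MinDfAnalyzer(Version matchVersion, Set<?> rareTerms) {
        super(matchVersion, rareTerms);
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer source = new StandardTokenizer(matchVersion, reader);
        TokenStream result = new LowerCaseFilter(matchVersion, source);
        // StopFilter removes the low-frequency terms before they reach the index.
        result = new StopFilter(matchVersion, result, stopwords);
        return new TokenStreamComponents(source, result);
    }
}
```

Pass an instance of this analyzer to the IndexWriter (for example via IndexWriterConfig) when building the final index; the rare terms are then dropped at indexing time instead of by rewriting the 3.5 GB file.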

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow