Question

I have 100,000+ text documents. I'd like to find a way to answer this (somewhat ambiguous) question:

For a given subset of the documents, what are the n most frequent words, relative to the full set of documents?

I'd like to present trends, e.g. a word cloud showing something like "these are the topics that are especially hot in the given date range". (Yes, I know this is an oversimplification: words != topics, etc.)

It seems I could calculate something like tf-idf values for all words across all documents and then do some number crunching, but I don't want to reinvent any wheels here.
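To make that concrete, the kind of calculation I have in mind is roughly this (a plain-Java sketch, all names made up, tokenization left out): score each word by how over-represented it is in the subset compared with the whole collection.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Rough sketch of the "number crunching" I have in mind.
// Idea: relative frequency of a word in the subset divided by its
// relative frequency in the full collection; > 1 means "hotter" in the subset.
public class RelativeFrequency {

    // Count word occurrences in a collection of already-tokenized documents.
    static Map<String, Long> countWords(List<List<String>> tokenizedDocs) {
        Map<String, Long> counts = new HashMap<>();
        for (List<String> doc : tokenizedDocs) {
            for (String word : doc) {
                counts.merge(word, 1L, Long::sum);
            }
        }
        return counts;
    }

    static Map<String, Double> overRepresentation(List<List<String>> subset,
                                                  List<List<String>> fullCorpus) {
        Map<String, Long> subsetCounts = countWords(subset);
        Map<String, Long> corpusCounts = countWords(fullCorpus);
        long subsetTotal = subsetCounts.values().stream().mapToLong(Long::longValue).sum();
        long corpusTotal = corpusCounts.values().stream().mapToLong(Long::longValue).sum();

        Map<String, Double> scores = new HashMap<>();
        subsetCounts.forEach((word, count) -> {
            double subsetFreq = (double) count / subsetTotal;
            double corpusFreq = (double) corpusCounts.getOrDefault(word, count) / corpusTotal;
            scores.put(word, subsetFreq / corpusFreq);
        });
        return scores;
    }
}
```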

I'm planning on possibly using Lucene or Solr for indexing the documents. Would they help me answer this question, and if so, how? Or would you recommend some other tools in addition or instead?


Solution

This should work: Lucene's HighFreqTerms class (in contrib-misc), which extracts the top n most frequent terms (by document frequency) from an existing index: http://lucene.apache.org/java/3_1_0/api/contrib-misc/org/apache/lucene/misc/HighFreqTerms.html
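If you want to call it from code rather than as a command-line tool, recent Lucene versions expose a static helper. The exact API has moved around between releases (in 3.x it is mostly a CLI utility), so treat this as a sketch and check the javadoc for your version; the index path and field name below are placeholders.

```java
import java.nio.file.Paths;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.misc.HighFreqTerms;
import org.apache.lucene.misc.TermStats;
import org.apache.lucene.store.FSDirectory;

// Sketch only: verify the getHighFreqTerms signature against the Lucene
// release you actually use, since it differs between major versions.
public class TopTerms {
    public static void main(String[] args) throws Exception {
        try (DirectoryReader reader =
                     DirectoryReader.open(FSDirectory.open(Paths.get("/path/to/index")))) {
            // Top 50 terms of the "contents" field, ordered by document frequency.
            TermStats[] stats = HighFreqTerms.getHighFreqTerms(
                    reader, 50, "contents", new HighFreqTerms.DocFreqComparator());
            for (TermStats t : stats) {
                System.out.println(t.termtext.utf8ToString() + " docFreq=" + t.docFreq);
            }
        }
    }
}
```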

This StackOverflow question also covers getting term frequencies out of Lucene more generally.

If you weren't already planning to use Lucene, the operation you are describing is the classic introductory problem for Hadoop: the "word count" example.
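For reference, the standard Hadoop MapReduce word count skeleton looks roughly like this; the input/output paths are passed as arguments, and the whitespace tokenization is only a placeholder for whatever analysis you actually want.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Emits (word, 1) for every token in every input line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Sums the per-word counts emitted by the mappers.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```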

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow