Question

I have 100 GB of documents. I would like to characterize the collection and get a general sense of which topics are prevalent.

The documents are plain text.

I have considered using a tool like Google Desktop to search, but the dataset is too large to guess what to search for, and it would be too time consuming to perform enough searches to cover the entire set.

Are there any freely available tools that will cluster a large dataset of documents?

Are there any such tools that can visualize such clusters?

Solution

For a basic NLP approach, you could represent each document as a vector of word frequencies (or TF-IDF weights), then group the document vectors with a clustering algorithm such as k-means or a Bayesian topic model. (Supervised methods such as SVMs only apply if you already have labeled examples.)
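A minimal sketch of that pipeline, assuming Python with scikit-learn (the answer names no specific library) and a toy in-memory corpus; at 100 GB you would stream documents from disk and likely switch to MiniBatchKMeans:

    # Sketch only: TF-IDF word-frequency vectors clustered with k-means.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    # Toy stand-in for the real corpus (assumption: documents fit in memory).
    docs = [
        "stock markets fell sharply on inflation fears",
        "central bank raises interest rates again",
        "the team won the championship in overtime",
        "star striker signs record transfer deal",
    ]

    # Represent each document as a TF-IDF weighted word-frequency vector.
    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform(docs)

    # Cluster the document vectors with k-means.
    km = KMeans(n_clusters=2, n_init=10, random_state=0)
    labels = km.fit_predict(X)

    # Print the highest-weighted terms per cluster as a rough topic summary.
    terms = vectorizer.get_feature_names_out()
    for i, center in enumerate(km.cluster_centers_):
        top = center.argsort()[::-1][:4]
        print("cluster %d: %s" % (i, ", ".join(terms[j] for j in top)))

The top terms per cluster give a quick sense of what each group of documents is about, which is often enough for the "general sense of topics" the question asks for.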

For related answers, see this somewhat similar SO question.

OTHER TIPS

You need to look into tools that do natural language processing. Using statistical methods, you can quite reliably determine the language of a document (see http://en.wikipedia.org/wiki/N-gram) and its domain of discourse (see http://en.wikipedia.org/wiki/Support_vector_machine). The tools linked from those Wikipedia articles are a good starting point.
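To make the n-gram idea concrete, here is a minimal sketch of character-trigram language identification; the two reference profiles are toy examples built from single sentences (a real detector trains its profiles on large corpora), and all names here are illustrative, not from the answer:

    from collections import Counter

    def char_ngrams(text, n=3):
        # Frequency profile of character n-grams, whitespace-normalized.
        text = " ".join(text.lower().split())
        return Counter(text[i:i + n] for i in range(len(text) - n + 1))

    def cosine(a, b):
        # Cosine similarity between two n-gram frequency profiles.
        dot = sum(a[g] * b[g] for g in set(a) & set(b))
        na = sum(v * v for v in a.values()) ** 0.5
        nb = sum(v * v for v in b.values()) ** 0.5
        return dot / (na * nb) if na and nb else 0.0

    # Toy reference profiles; real detectors build these from large corpora.
    profiles = {
        "english": char_ngrams("the quick brown fox jumps over the lazy dog"),
        "german": char_ngrams("der schnelle braune fuchs springt ueber den faulen hund"),
    }

    doc_profile = char_ngrams("the dog jumps over the fox")
    print(max(profiles, key=lambda lang: cosine(doc_profile, profiles[lang])))  # english

The same profile-and-compare approach extends to classifying the domain of discourse, though for that task supervised classifiers such as SVMs trained on labeled documents are the more usual choice.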
