Question

Let's imagine we can build a statistics table of how often each word is used in some English text or book, and that we can gather these statistics for every text/book in a library. What is the simplest way to compare these statistics with each other? How can we find groups/clusters of texts with statistically very similar lexicons?


Solution

First, you'd need to normalize the lexicons (i.e., ensure that both have the same vocabulary).
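
As a minimal sketch of that normalization step (the function name is mine, not from any library): align the two count tables on the union of their vocabularies and turn each into a relative-frequency vector.

    from collections import Counter

    def to_prob_vectors(counts_a, counts_b):
        """Align two word-count tables on the union of their vocabularies
        and convert each to a relative-frequency (probability) vector."""
        vocab = sorted(set(counts_a) | set(counts_b))
        total_a = sum(counts_a.values())
        total_b = sum(counts_b.values())
        p = [counts_a.get(w, 0) / total_a for w in vocab]
        q = [counts_b.get(w, 0) / total_b for w in vocab]
        return vocab, p, q

    # Toy example: word counts for two tiny "texts"
    counts_a = Counter("the cat sat on the mat".split())
    counts_b = Counter("the dog sat on the log".split())
    vocab, p, q = to_prob_vectors(counts_a, counts_b)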

Then you could use a similarity metric such as the Hellinger distance or cosine similarity to compare the two lexicons.
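
Both metrics are only a few lines over the aligned vectors from the sketch above (spelled out here for illustration rather than pulled from a library):

    import math

    def hellinger(p, q):
        """Hellinger distance between two probability vectors;
        0 means identical distributions, 1 means disjoint vocabularies."""
        s = sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))
        return math.sqrt(s) / math.sqrt(2)

    def cosine_similarity(p, q):
        """Cosine of the angle between two frequency vectors;
        1 means the texts use words in identical proportions."""
        dot = sum(pi * qi for pi, qi in zip(p, q))
        norm_p = math.sqrt(sum(pi * pi for pi in p))
        norm_q = math.sqrt(sum(qi * qi for qi in q))
        return dot / (norm_p * norm_q)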

It may also be a good idea to look into machine learning packages such as Weka.
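
Weka is a Java toolkit; purely as an illustration of the clustering step the question asks about, here is a sketch in Python using SciPy's hierarchical clustering on a pairwise Hellinger-distance matrix (the helper name and the 0.3 cutoff are arbitrary choices of mine, not part of any answer above):

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage
    from scipy.spatial.distance import squareform

    def cluster_texts(prob_vectors, cutoff=0.3):
        """Group texts whose word-frequency distributions are close.
        prob_vectors: list of probability vectors over one shared vocabulary."""
        n = len(prob_vectors)
        dist = np.zeros((n, n))
        for i in range(n):
            for j in range(i + 1, n):
                d = hellinger(prob_vectors[i], prob_vectors[j])  # from the sketch above
                dist[i, j] = dist[j, i] = d
        # linkage() expects the condensed (upper-triangular) form of the matrix
        tree = linkage(squareform(dist), method="average")
        # Cut the dendrogram at `cutoff` to get one cluster label per text
        return fcluster(tree, t=cutoff, criterion="distance")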

This book is an excellent source for machine learning and you may find it useful.

OTHER TIPS

I would start by seeing what Lucene (http://lucene.apache.org/java/docs/index.html) has to offer. After that you will need to use a machine learning method; see http://en.wikipedia.org/wiki/Information_retrieval for background.

You might consider the Kullback-Leibler (KL) divergence. Note that KL is not symmetric, so for pairwise comparison you may want a symmetrized variant. For reference, see page 18 (Chapter 2) of Cover and Thomas, Elements of Information Theory.
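
A minimal sketch over the probability vectors from the first answer (the eps smoothing is my own workaround for words that appear in one text but not the other, which would otherwise make KL infinite):

    import math

    def kl_divergence(p, q, eps=1e-12):
        """D(P || Q) = sum_i p_i * log(p_i / q_i), in nats.
        The eps floor keeps the log finite when a word of P is absent from Q."""
        return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

    def symmetric_kl(p, q):
        """Symmetrized KL, usable as a rough pairwise dissimilarity."""
        return kl_divergence(p, q) + kl_divergence(q, p)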

Licensed under: CC-BY-SA with attribution