Question

Let's imagine we can build a statistics table of how often each word is used in some English text or book, and that we can gather these statistics for every text/book in a library. What is the simplest way to compare these statistics with each other? How can we find groups/clusters of texts with statistically very similar lexicons?


Solution

First, you'd need to normalize the lexicons (i.e., ensure that both have the same vocabulary).
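
As a minimal sketch of that normalization step (the function name is mine, not from any library): align the two count tables on the union of their vocabularies and turn each into a relative-frequency vector.

    from collections import Counter

    def to_prob_vectors(counts_a, counts_b):
        """Align two word-count tables on the union of their vocabularies
        and convert each to a relative-frequency (probability) vector."""
        vocab = sorted(set(counts_a) | set(counts_b))
        total_a = sum(counts_a.values())
        total_b = sum(counts_b.values())
        p = [counts_a.get(w, 0) / total_a for w in vocab]
        q = [counts_b.get(w, 0) / total_b for w in vocab]
        return vocab, p, q

    # Toy example: word counts for two tiny "texts"
    counts_a = Counter("the cat sat on the mat".split())
    counts_b = Counter("the dog sat on the log".split())
    vocab, p, q = to_prob_vectors(counts_a, counts_b)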

Then you could use a similarity metric such as the Hellinger distance or cosine similarity to compare the two lexicons.
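
Both metrics are only a few lines over the aligned vectors from the sketch above (spelled out here for illustration rather than pulled from a library):

    import math

    def hellinger(p, q):
        """Hellinger distance between two probability vectors;
        0 means identical distributions, 1 means disjoint vocabularies."""
        s = sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))
        return math.sqrt(s) / math.sqrt(2)

    def cosine_similarity(p, q):
        """Cosine of the angle between two frequency vectors;
        1 means the texts use words in identical proportions."""
        dot = sum(pi * qi for pi, qi in zip(p, q))
        norm_p = math.sqrt(sum(pi * pi for pi in p))
        norm_q = math.sqrt(sum(qi * qi for qi in q))
        return dot / (norm_p * norm_q)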

It may also be a good idea to look into machine learning packages such as Weka.
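
Weka is a Java toolkit; purely as an illustration of the clustering step the question asks about, here is a sketch in Python using SciPy's hierarchical clustering on a pairwise Hellinger-distance matrix (the helper name and the 0.3 cutoff are arbitrary choices of mine, not part of any answer above):

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage
    from scipy.spatial.distance import squareform

    def cluster_texts(prob_vectors, cutoff=0.3):
        """Group texts whose word-frequency distributions are close.
        prob_vectors: list of probability vectors over one shared vocabulary."""
        n = len(prob_vectors)
        dist = np.zeros((n, n))
        for i in range(n):
            for j in range(i + 1, n):
                d = hellinger(prob_vectors[i], prob_vectors[j])  # from the sketch above
                dist[i, j] = dist[j, i] = d
        # linkage() expects the condensed (upper-triangular) form of the matrix
        tree = linkage(squareform(dist), method="average")
        # Cut the dendrogram at `cutoff` to get one cluster label per text
        return fcluster(tree, t=cutoff, criterion="distance")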

This book is an excellent source for machine learning and you may find it useful.

OTHER TIPS

I would start by seeing what Lucene (http://lucene.apache.org/java/docs/index.html) has to offer. After that you will need to use a machine learning method; see http://en.wikipedia.org/wiki/Information_retrieval for background.

You might consider the Kullback-Leibler (KL) divergence. Note that KL is not symmetric, so for pairwise comparison you may want a symmetrized variant. For reference, see page 18 (Chapter 2) of Cover and Thomas, Elements of Information Theory.
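
A minimal sketch over the probability vectors from the first answer (the eps smoothing is my own workaround for words that appear in one text but not the other, which would otherwise make KL infinite):

    import math

    def kl_divergence(p, q, eps=1e-12):
        """D(P || Q) = sum_i p_i * log(p_i / q_i), in nats.
        The eps floor keeps the log finite when a word of P is absent from Q."""
        return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

    def symmetric_kl(p, q):
        """Symmetrized KL, usable as a rough pairwise dissimilarity."""
        return kl_divergence(p, q) + kl_divergence(q, p)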

Licensed under: CC-BY-SA with attribution