English texts lexicon comparison
22-10-2019
Question
Suppose we build a table of statistics recording how often each word is used in an English text or book, and we gather these statistics for every text/book in a library. What is the simplest way to compare these statistics with each other? How can we find groups/clusters of texts whose lexicons are statistically very similar?
Solution
First, you'd need to normalize the lexicons (i.e., ensure that both lexicons are expressed over the same vocabulary, with a count of zero for any word a text does not use).
Then you could use a similarity metric such as the Hellinger distance or the cosine similarity to compare the two lexicons.
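As a rough illustration of those two steps, here is a minimal Python sketch. The helper names (`word_frequencies`, `align`, and so on) are made up for this example, and splitting on whitespace is only a stand-in for real tokenization.

```python
import math
from collections import Counter

def word_frequencies(text):
    """Relative frequency of each lowercased, whitespace-separated word."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

def align(freq_a, freq_b):
    """Put both lexicons on one shared vocabulary (missing words get 0)."""
    vocab = sorted(set(freq_a) | set(freq_b))
    p = [freq_a.get(w, 0.0) for w in vocab]
    q = [freq_b.get(w, 0.0) for w in vocab]
    return p, q

def cosine_similarity(p, q):
    """1.0 for identical direction, 0.0 when no words are shared."""
    dot = sum(x * y for x, y in zip(p, q))
    norm = math.sqrt(sum(x * x for x in p)) * math.sqrt(sum(y * y for y in q))
    return dot / norm if norm else 0.0

def hellinger_distance(p, q):
    """0.0 for identical distributions, 1.0 for disjoint ones."""
    squared = sum((math.sqrt(x) - math.sqrt(y)) ** 2 for x, y in zip(p, q))
    return math.sqrt(squared / 2)

# Compare two toy "texts".
p, q = align(word_frequencies("the cat sat on the mat"),
             word_frequencies("the dog sat on the log"))
print(cosine_similarity(p, q), hellinger_distance(p, q))
```

Computing one of these scores for every pair of texts gives a similarity matrix, which is the usual input to a clustering method.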
It may also be a good idea to look into machine learning packages such as Weka.
This book is an excellent source for machine learning and you may find it useful.
Other tips
I would start by seeing what Lucene (http://lucene.apache.org/java/docs/index.html ) has to offer. After that you will need to use a machine learning method; for background, look at http://en.wikipedia.org/wiki/Information_retrieval.
You might consider the Kullback-Leibler distance (also called KL divergence; note that it is not symmetric). For reference, see page 18 of Cover and Thomas.
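A minimal sketch of how that could be computed in Python follows. The function names and the add-one smoothing are assumptions of this example, not something taken from the reference. Smoothing is used because KL divergence is undefined when the second text assigns zero probability to a word the first text uses.

```python
import math
from collections import Counter

def smoothed_distribution(counts, vocab, alpha=1.0):
    """Word probabilities over `vocab` with additive (add-alpha) smoothing."""
    total = sum(counts.get(w, 0) for w in vocab) + alpha * len(vocab)
    return [(counts.get(w, 0) + alpha) / total for w in vocab]

def kl_divergence(p, q):
    """D(p || q) in nats; assumes q has no zero entries where p is positive."""
    return sum(x * math.log(x / y) for x, y in zip(p, q) if x > 0)

def symmetric_kl(p, q):
    """Sum of both directions, one simple way to get a symmetric score."""
    return kl_divergence(p, q) + kl_divergence(q, p)

counts_a = Counter("the cat sat on the mat".split())
counts_b = Counter("the dog sat on the log".split())
vocab = sorted(set(counts_a) | set(counts_b))

p = smoothed_distribution(counts_a, vocab)
q = smoothed_distribution(counts_b, vocab)
print(kl_divergence(p, q), symmetric_kl(p, q))
```

Lower values mean more similar lexicons. The symmetric variant is easier to feed into a clustering routine that expects a symmetric dissimilarity.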