Question

I'm trying to determine the importance of a word within a given set of random words. For instance, I would like to know that "accident" is the most important word in the set "man, woman, accident". A naive solution is to get the WordNet depth for each word and compute each word's importance from the dissimilarity in depths. That approach is quite time consuming, since it requires n(n-1) pairwise calculations to produce the final importance scores. Is there a better way to handle this scenario?


Solution

The usual approach is that the less common a word is, the more important it is.

First, choose a corpus that represents your problem domain, then run a word frequency count over it. You could skip these two steps and use a pre-made list instead, e.g. http://en.wiktionary.org/wiki/Wiktionary:Frequency_lists or http://en.wiktionary.org/wiki/Wiktionary:Frequency_lists/PG/2006/04/1-10000. However, computing word frequencies is one of the easier things to do in Python/NLTK; see the sketch below.
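
For instance, here is a minimal sketch of such a frequency count. It uses NLTK's Brown corpus as a stand-in for a domain corpus; the corpus choice and the rel_freq helper are illustrative assumptions, not part of the original answer:

    import nltk
    from nltk.corpus import brown
    from collections import Counter

    # Stand-in domain corpus; swap in text from your own problem domain.
    nltk.download("brown", quiet=True)

    words = [w.lower() for w in brown.words() if w.isalpha()]
    freq = Counter(words)
    total = sum(freq.values())

    def rel_freq(word):
        # Relative frequency of a word in the corpus; 0.0 if unseen.
        return freq[word.lower()] / total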

The third step is to look up the frequency of each of your input words; the one with the lowest frequency is the most salient. Or, if this feeds into another step and a real-valued score is useful, tf-idf gives you one. A sketch of the lookup follows.
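
Building on the freq counter from the sketch above (again, an assumption of mine rather than code from the answer), the third step reduces to a single min over the input words:

    def most_salient(candidates, freq):
        # Unseen words get count 0 and so win automatically,
        # matching the intuition that rare words are salient.
        return min(candidates, key=lambda w: freq[w.lower()])

    print(most_salient(["man", "woman", "accident"], freq))
    # -> 'accident', assuming it is the rarest of the three in the corpus

Note this is a single pass over the input words, so it avoids the n(n-1) pairwise comparisons of the WordNet-depth approach.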

You might want to normalize/stem words first; whether to do so depends on your application. If you do, make sure you apply it both at the generation stage (normalize your corpus) and at the usage stage (normalize your inputs), as in the sketch below.
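
A sketch of a shared normalization step, using NLTK's Porter stemmer (the normalize helper is hypothetical), applied identically on both sides:

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()

    def normalize(word):
        # Same lowercasing + stemming on both sides keeps counts aligned.
        return stemmer.stem(word.lower())

    # e.g. build freq over normalize(w) for corpus words,
    # then look up normalize(w) for each input word.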

Here are some examples, using frequency counts from the Word Usage Trends box at http://www.collinsdictionary.com/dictionary/english/man:

word         relative frequency
man          0.0289
woman        0.0149
walk         0.0064
shot         0.0049
accident     0.0048

Luckily, those numbers match up with the correct answers you gave: accident and shot.

Licensed under: CC-BY-SA with attribution