Question

I have about 5000 terms in a table and I want to group them into categories that make sense.

For example some terms are:

Nissan

Ford

Arrested

Jeep

Court

The result should be that Nissan, Ford, Jeep get grouped into one category and that Arrested and Court are in another category. I looked at the Stanford Classifier NLP. Am I right to assume that this is the right one to choose to do this for me?


Solution

I would suggest NLTK if there weren't so many proper nouns, which WordNet covers only sparsely. You could use the semantic similarity measures from WordNet as features and try to cluster the words. Here's a discussion about how to do that.
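A minimal sketch of the clustering step. The similarity table here is a hand-made toy stand-in; with NLTK installed you would replace `sim()` with a WordNet measure such as `wordnet.wup_similarity` on the terms' synsets (and the threshold would need tuning):

```python
# Greedy single-link grouping of terms by pairwise similarity.
# TOY_SIM is a stand-in for a real WordNet similarity function.
TOY_SIM = {
    frozenset({"nissan", "ford"}): 0.90,
    frozenset({"nissan", "jeep"}): 0.85,
    frozenset({"ford", "jeep"}): 0.88,
    frozenset({"arrested", "court"}): 0.80,
}

def sim(a, b):
    """Toy pairwise similarity; unknown pairs are treated as dissimilar."""
    return TOY_SIM.get(frozenset({a, b}), 0.1)

def cluster(terms, threshold=0.5):
    """Put each term into the first group containing a similar-enough member."""
    groups = []
    for term in terms:
        for group in groups:
            if any(sim(term, member) >= threshold for member in group):
                group.append(term)
                break
        else:
            groups.append([term])
    return groups

print(cluster(["nissan", "ford", "arrested", "jeep", "court"]))
# → [['nissan', 'ford', 'jeep'], ['arrested', 'court']]
```

Greedy single-link grouping is just one simple choice; with a full pairwise similarity matrix you could equally feed the distances into an off-the-shelf clusterer.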

To use the Stanford Classifier, you need to know in advance how many buckets (classes) of words you want. Besides, I believe it is designed for classifying documents rather than individual words.

OTHER TIPS

That's an interesting problem that the word2vec model released by Google may help with.

In a nutshell, each word is represented by an N-dimensional vector generated by the model. Google provides a pretrained model that maps each word to a 300-dimensional vector, trained on roughly 100 billion words of Google News data.

The interesting thing is that there are semantics encoded in these vectors. Suppose you have the vectors for the words King, Man, and Woman. A simple expression (King - Man) + Woman will yield a vector that is exceedingly close to the vector for Queen.

Similarity between words is determined via a distance calculation (cosine distance is the default, but you can apply your own metric to the vectors).
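The analogy and the cosine calculation can be illustrated together. The 3-D vectors below are hand-crafted toys built so the "royalty" and "gender" directions line up; real word2vec vectors are 300-dimensional and learned, not constructed like this:

```python
import math

# Toy 3-D "embeddings": dimensions roughly (royalty, male, female).
VEC = {
    "king":  [0.9, 0.8, 0.1],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "queen": [0.9, 0.1, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Compute (king - man) + woman component-wise.
target = [k - m + w for k, m, w in zip(VEC["king"], VEC["man"], VEC["woman"])]

# Find the vocabulary word whose vector is most similar to the result.
nearest = max(VEC, key=lambda word: cosine(VEC[word], target))
print(nearest)  # → queen
```

With a real model you would do the same arithmetic on the learned vectors and search the full vocabulary for the nearest neighbour.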

For your example, the distance between Jeep and Ford would be much smaller than the distance between Jeep and Arrested, so you could use these distances to group terms 'logically'.
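A sketch of that comparison, again on toy 2-D stand-in vectors. With gensim you would instead load the pretrained Google News vectors (e.g. `KeyedVectors.load_word2vec_format(...)`) and look the terms up there:

```python
import math

# Toy 2-D stand-ins for real 300-D word2vec vectors:
# dimension 0 ≈ "vehicle-ness", dimension 1 ≈ "legal-ness".
VEC = {
    "jeep":     [0.9, 0.1],
    "ford":     [0.8, 0.2],
    "arrested": [0.1, 0.9],
}

def cos_dist(u, v):
    """Cosine distance = 1 - cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

# Jeep should be far closer to Ford than to Arrested.
print(cos_dist(VEC["jeep"], VEC["ford"]) < cos_dist(VEC["jeep"], VEC["arrested"]))
# → True
```

Grouping then reduces to thresholding these distances or handing them to a standard clustering algorithm.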

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow