Question

I am trying to compare the semantics of two phrases. In Python I am using nltk and difflib: first I remove the stop words from the phrases, then I normalise the remaining words with WordNetLemmatizer and PorterStemmer, and finally I compare what is left with difflib's SequenceMatcher. I still think there must be a much better way than using difflib. Any suggestions or proposals? Is there a library that uses WordNet to compare phrases? Are the steps I am taking correct?
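For reference, the pipeline described above can be sketched with the standard library alone; the tiny stop-word set here is just a stand-in for nltk's stopwords corpus, and the lemmatizing/stemming step is omitted:

```python
# Sketch of the question's approach: normalise two phrases, then compare the
# token lists with difflib's SequenceMatcher. The STOP_WORDS set is a
# placeholder for nltk's full stop-word list.
from difflib import SequenceMatcher

STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "in"}

def normalise(phrase):
    # Lower-case, tokenise on whitespace, drop stop words.
    # (A WordNetLemmatizer/PorterStemmer pass would go here.)
    return [w for w in phrase.lower().split() if w not in STOP_WORDS]

def surface_similarity(p1, p2):
    # Ratio of matching tokens between the two normalised sequences.
    return SequenceMatcher(None, normalise(p1), normalise(p2)).ratio()

print(surface_similarity("The dog is barking", "A dog barks"))
```

Note that without stemming, "barking" and "barks" do not match at all, which is exactly why the question adds a normalisation step; but even with it, this measures surface overlap, not meaning.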


Solution

In short, no, you cannot do this sort of semantics with NLTK alone. Using WordNet directly will not work either, because most sentences contain words that are not in the database. The current way to approximate sentential semantics is with distributional techniques (word-space models).

If you are a Python programmer, scikit-learn and Gensim give you this functionality through Latent Semantic Analysis (LSA, also called LSI) and Latent Dirichlet Allocation (LDA). See the answers to this previous question. In Java, I would suggest the excellent S-Space package.
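As a minimal sketch of the LSA route with scikit-learn (assuming it is installed; the toy corpus and the two-component projection are illustrative values, not recommendations):

```python
# LSA-style sentence similarity: TF-IDF term-document matrix, truncated SVD
# into a low-dimensional latent space, then cosine similarity between rows.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "the cat sat on the mat",
    "a cat lay on the rug",
    "stock prices fell sharply today",
    "the market dropped this morning",
]

tfidf = TfidfVectorizer().fit_transform(corpus)     # sparse term-document matrix
lsa = TruncatedSVD(n_components=2, random_state=0)  # project onto 2 latent dimensions
vectors = lsa.fit_transform(tfidf)

sims = cosine_similarity(vectors)                   # pairwise sentence similarities
print(sims[0, 1], sims[0, 2])
```

In practice the corpus must be far larger for the latent dimensions to capture anything topic-like; Gensim's `LsiModel` and `LdaModel` offer the same idea with streaming corpora.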

However, most models will give you a strictly word-based representation. Combining the semantics of words into larger structures is much more difficult, unless you assume that phrases and sentences are bags of words (thereby missing the difference between, e.g., Mary loves Kate and Kate loves Mary).
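The bag-of-words limitation is easy to see concretely: once word order is discarded, the two readings collapse into the same representation.

```python
# Under a bag-of-words model, "Mary loves Kate" and "Kate loves Mary"
# produce identical word-count vectors, so no downstream similarity
# measure can tell them apart.
from collections import Counter

def bag_of_words(sentence):
    return Counter(sentence.lower().split())

a = bag_of_words("Mary loves Kate")
b = bag_of_words("Kate loves Mary")
print(a == b)  # True: the two sentences are indistinguishable
```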

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow