Antonym search for expanding search terms

https://datascience.stackexchange.com/questions/15891

16-10-2019

Question

I have some code that identifies ngrams of interest in a small bit of text, then searches for each ngram in a larger bit of text to provide a snippet of the term used in context. One ngram might be "unordered numerical set". Right now, if I can't find "unordered numerical set", I start trimming the string from the left to see if "numerical set" is available in the larger text.

What I would like to do before I trim from the left is to see if any of the words have an antonym, like "ordered" for "unordered". This is because my ngram might be defined in one way but used in the opposite.
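The lookup order described above could be sketched roughly as follows. This is only an illustration: `find_with_fallback` and the `antonyms` table are hypothetical names, and the table is hand-written here because finding a real source for it is exactly what the question asks about.

```python
import re


def find_with_fallback(ngram, text, antonyms):
    """Return the first variant of `ngram` found in `text`, or None.

    Order tried: the ngram itself, then variants with one word swapped for
    an antonym, then the same again after dropping the leftmost word.
    """
    words = ngram.lower().split()
    haystack = text.lower()
    while words:
        candidates = [" ".join(words)]
        for i, word in enumerate(words):
            for opposite in antonyms.get(word, []):
                candidates.append(" ".join(words[:i] + [opposite] + words[i + 1:]))
        for candidate in candidates:
            # Word boundaries stop "ordered" from matching inside "unordered".
            if re.search(r"\b" + re.escape(candidate) + r"\b", haystack):
                return candidate
        words = words[1:]  # trim from the left and retry
    return None


# Hypothetical lookup table; in practice it would come from a real resource.
antonyms = {"unordered": ["ordered"], "ordered": ["unordered"]}
text = "An ordered numerical set keeps its elements sorted."
print(find_with_fallback("unordered numerical set", text, antonyms))
```

With the sample text, the antonym variant "ordered numerical set" is found before any trimming happens.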

Is there a known list or other way of finding out if there is a highly correlated antonym I might try for a set of words?

For instance, I think I could write some regexps that look for prefixes like "un" or "dis" and check whether the word that results from removing the prefix is a valid English word. This seems likely to be already solved, so before I try to create anything, I want to find out what might already exist.

I am currently using Python with Gensim, NLTK, and Word2Vec for the rest of my processing, if it matters.
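A minimal sketch of that prefix idea, assuming a set of valid English words is available. The `vocabulary` below is a toy stand-in for a real word list, and `NEGATING_PREFIXES` and `prefix_antonym` are illustrative names, not part of any library.

```python
import re

# Negating prefixes to try; an illustrative, incomplete list.
NEGATING_PREFIXES = ("un", "dis", "non", "in", "im", "ir", "il")
# Match a prefix only when at least three characters remain after it.
PREFIX_RE = re.compile("^(?:" + "|".join(NEGATING_PREFIXES) + r")(?=\w{3,}$)")


def prefix_antonym(word, vocabulary):
    """Return `word` without its negating prefix if the result is a known word, else None."""
    stripped, count = PREFIX_RE.subn("", word.lower())
    if count and stripped in vocabulary:
        return stripped
    return None


# Toy vocabulary standing in for a real English word list (assumption).
vocabulary = {"ordered", "connected", "under", "uncle"}
print(prefix_antonym("unordered", vocabulary))     # ordered
print(prefix_antonym("disconnected", vocabulary))  # connected
print(prefix_antonym("uncle", vocabulary))         # None
```

The vocabulary check filters out cases like "uncle", but it cannot catch pairs such as "inflammable" and "flammable", where both forms are valid words and yet mean the same thing.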

Solution

That's an interesting problem.

In my opinion this is the most comprehensive way, provided speed is not a problem (otherwise you can pull all the words once and build a dictionary or database of your own).

You can use this API: https://wordsapiv1.p.mashape.com/words/love/antonyms (More info about this API at this link)

As the URL shows, you can restrict the results to antonyms with this API.

You can use requests to make API calls.

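import requests

# Endpoint from the link above; the service may also require an API key header.
url = "https://wordsapiv1.p.mashape.com/words/love/antonyms"
response = requests.get(url)
result = response.json()  # decode the JSON body of the response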

Then search for the antonyms from the result. If you are getting a big list of antonyms, use only the top n results for your searching.
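As a rough sketch of the "top n" step: the response shape used below (a dict with an `antonyms` list) is an assumption about what the API returns, and `top_antonyms` is a hypothetical helper.

```python
def top_antonyms(result, n=3):
    """Return at most `n` antonyms from a parsed API response."""
    # Assumed response shape: {"word": "...", "antonyms": ["...", ...]}
    return list(result.get("antonyms", []))[:n]


# Hand-written sample for illustration, not a real API response.
sample = {"word": "love", "antonyms": ["hate"]}
print(top_antonyms(sample))
```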

Although Word2Vec (W2V) gives the words most commonly used in the same context as the keyword, it is hard to tell which of them is the antonym of the keyword.

Other tips

Word2Vec can be used to find a word that relates to another word in the same way an example pair does (for instance: x is to happy as bad is to good). You could use that to generate antonym candidates on the fly.

It might not be as accurate as a precompiled list, but it will cover almost every candidate. In fact, Word2Vec could also help you find other likely candidates (not just antonyms) for words given a context.
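With Gensim, the analogy query could look roughly like this. `analogy_antonyms` is a hypothetical helper; it only assumes that `vectors` offers a `most_similar(positive=..., negative=..., topn=...)` method, as Gensim's `KeyedVectors` (for example `model.wv` of a trained Word2Vec model) does.

```python
def analogy_antonyms(vectors, word, example=("good", "bad"), topn=5):
    """Rank words x such that x is to `word` as example[1] is to example[0]."""
    base, opposite = example
    # Vector arithmetic behind the query: word - base + opposite
    hits = vectors.most_similar(positive=[word, opposite], negative=[base], topn=topn)
    return [candidate for candidate, _score in hits]


# Usage, assuming `model` is a trained gensim Word2Vec model:
# analogy_antonyms(model.wv, "happy")
```

The quality of the candidates depends heavily on the training corpus and on the example pair, so it may be worth trying several example pairs and keeping the candidates they agree on.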

You can start with small lists of antonyms like this one. Maybe you can get a comprehensive list that will take care of most of the cases you are interested in. For now, let's assume that you don't have such a list and discuss algorithmic ways to build one.

As you wrote regarding prefixes like "un" and "dis", you can use rules based on morphology too. Such rules are likely to have high precision (pairs that obey the morphological rule will be antonyms) but low recall (you will miss many pairs). We should use these rules to grow our dataset of antonyms.

Next, we should take a dataset of texts such as Wikipedia, the Wall Street Journal, or even the Google ngrams dataset. Antonyms will tend to appear in the same context but not together. For example, people will write about "ordered list" and "unordered list" but not about "ordered unordered list".

A proper association for this purpose is: words that appear at a distance smaller than X, more than Y times, and whose joint probability is higher than expected by Z. You can use the dataset of antonyms we already have to find proper values for the parameters above.
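A small sketch of using such a rule to collect seed pairs from a vocabulary. The prefix list and the word list here are illustrative assumptions, and `morphological_pairs` is a hypothetical helper.

```python
def morphological_pairs(vocabulary, prefixes=("un", "dis", "non")):
    """Return sorted (word, prefixed_word) pairs where both forms are in the vocabulary."""
    vocab = set(vocabulary)
    pairs = set()
    for word in vocab:
        for prefix in prefixes:
            if prefix + word in vocab:
                pairs.add((word, prefix + word))
    return sorted(pairs)


# Toy vocabulary; a real run would use the words seen in your corpus.
vocabulary = ["ordered", "unordered", "agree", "disagree", "list", "set"]
print(morphological_pairs(vocabulary))
# [('agree', 'disagree'), ('ordered', 'unordered')]
```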
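One possible reading of that association measure, as a sketch. Here `max_distance`, `min_count` and `min_lift` play the roles of X, Y and Z, and "higher than expected by Z" is interpreted as a ratio (lift) against a rough independence estimate; both the interpretation and the names are assumptions.

```python
from collections import Counter
from itertools import combinations


def associations(sentences, max_distance=3, min_count=1, min_lift=1.5):
    """Return {frozenset({a, b}): lift} for word pairs passing all three thresholds."""
    word_counts = Counter()
    pair_counts = Counter()
    total_words = 0
    for tokens in sentences:
        total_words += len(tokens)
        word_counts.update(tokens)
        for i, j in combinations(range(len(tokens)), 2):
            # Count unordered pairs of distinct words closer than max_distance.
            if j - i < max_distance and tokens[i] != tokens[j]:
                pair_counts[frozenset((tokens[i], tokens[j]))] += 1
    total_pairs = sum(pair_counts.values())
    result = {}
    for pair, count in pair_counts.items():
        if count <= min_count:
            continue
        a, b = tuple(pair)
        # Rough estimate of the joint probability expected under independence.
        expected = (word_counts[a] / total_words) * (word_counts[b] / total_words)
        lift = (count / total_pairs) / expected
        if lift > min_lift:
            result[pair] = lift
    return result


sentences = [
    ["ordered", "list", "of", "items"],
    ["unordered", "list", "of", "items"],
    ["ordered", "list", "of", "numbers"],
    ["unordered", "list", "of", "numbers"],
]
scores = associations(sentences)
print(frozenset(("ordered", "list")) in scores)       # True
print(frozenset(("ordered", "unordered")) in scores)  # False
```

In the toy corpus "ordered" and "unordered" are each associated with "list" but never with each other, which is the pattern the next step looks for.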

Once you have calculated the association level among words, antonyms will be associated with the same words but not with each other. Note that here you should expect high recall but lower precision, since synonyms also tend to have such relations.

The ready-made lists are the lowest-effort, lowest-benefit option, and the association method takes considerable effort, but there is a middle way. Many dictionaries, like Wiktionary, have an antonyms section. You can scrape them and build such a list.

Such a list would be of great use, and I was quite surprised that there is no common resource of this kind. If you build one and are willing to share it, that would be very helpful.
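The selection step could be sketched like this, assuming the associations have already been reduced to a mapping from each word to the set of words it is associated with. The mapping below is hand-written for illustration, and `shared_context_pairs` is a hypothetical helper.

```python
from itertools import combinations


def shared_context_pairs(associated, min_shared=2):
    """Return (a, b, shared) for words that share associates but are not associated with each other."""
    found = []
    for a, b in combinations(sorted(associated), 2):
        if b in associated[a] or a in associated[b]:
            continue  # the two words co-occur, so they are unlikely to be opposites
        shared = associated[a] & associated[b]
        if len(shared) >= min_shared:
            found.append((a, b, sorted(shared)))
    return found


# Hand-written association sets (assumption); real ones would come from corpus statistics.
associated = {
    "ordered": {"list", "set"},
    "unordered": {"list", "set"},
    "list": {"ordered", "unordered", "item"},
    "set": {"ordered", "unordered", "element"},
}
for a, b, shared in shared_context_pairs(associated):
    print(a, b, shared)
# list set ['ordered', 'unordered']
# ordered unordered ['list', 'set']
```

The output illustrates the precision problem mentioned above: "ordered" and "unordered" are found, but so are "list" and "set", which are related words rather than antonyms.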

Licensed under: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange