Question

If I want to construct a word embedding by predicting a target word given context words, is it better to remove stop words or keep them?

the quick brown fox jumped over the lazy dog

or

quick brown fox jumped lazy dog

As a human, I feel like keeping the stop words makes it easier to understand even though they are superfluous.

So what about for a Neural Network?


Solution

In general, stop words can be omitted, since they carry little useful information about the content of your sentence or document.

The intuition is that stop words are the most common words in a language and occur in nearly every document regardless of topic. Because they appear everywhere, they carry almost no information that hints at what a particular document is about.
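To make the comparison concrete, here is a minimal sketch (assuming the gensim library; the corpus and stop-word list are hypothetical toys for illustration) that trains CBOW embeddings once with stop words kept and once with them removed. Note that removing them effectively widens the context window, since the filler words between content words are gone:

```python
# Sketch: train CBOW word2vec with and without stop words (gensim 4.x).
# The corpus and STOP_WORDS below are toy examples, not a recommendation;
# in practice you would use a real corpus and a full stop-word list.
from gensim.models import Word2Vec

STOP_WORDS = {"the", "a", "an", "over", "and", "of", "in"}

corpus = [
    "the quick brown fox jumped over the lazy dog".split(),
    "the lazy dog slept in the sun".split(),
]

# Variant 1: keep stop words.
model_with = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=0)

# Variant 2: strip stop words first; content words now fall
# inside each other's context window more often.
filtered = [[w for w in sent if w not in STOP_WORDS] for sent in corpus]
model_without = Word2Vec(filtered, vector_size=50, window=2, min_count=1, sg=0)

# On a corpus this small the similarities are meaningless, but the
# call shows how you would compare the two variants.
print(model_with.wv.most_similar("fox", topn=3))
print(model_without.wv.most_similar("fox", topn=3))
```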

Other tips

It's not mandatory. Removing stop words sometimes helps and sometimes doesn't, so you should try both and compare.

A case for keeping stop words: they provide context for the user's intent. That is why contextual models like BERT keep all stop words: words such as the negation terms (not, nor, never), which are usually treated as stop words, carry information the model needs, as the sketch below illustrates.
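Here is a tiny, self-contained illustration of the hazard (the stop-word list is again a hypothetical toy): naively filtering negation words can flip the apparent meaning of a sentence before the model ever sees it.

```python
# Naive stop-word filtering that discards a negation word.
# STOP_WORDS is a toy list for illustration only.
STOP_WORDS = {"this", "is", "not", "a", "the"}

sentence = "this movie is not good"
filtered = [w for w in sentence.split() if w not in STOP_WORDS]
print(" ".join(filtered))  # -> "movie good": the negation is gone
```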

According to this paper:

Surprisingly, the stopwords received as much attention as non-stop words, but removing them has no effect in MRR performances.

Licensed under: CC-BY-SA with attribution