Question

I am trying to remove stop words before performing topic modeling. I noticed that some negation words (not, nor, never, none, etc.) are usually considered stop words. For example, NLTK, spaCy and sklearn all include "not" in their stop word lists. However, if we remove "not" from the second sentence below, it loses its meaning and becomes identical to the first, which would make topic modeling or sentiment analysis inaccurate.

1). StackOverflow is helpful      => StackOverflow helpful
2). StackOverflow is not helpful  => StackOverflow helpful

Can anyone please explain why these negation words are typically considered to be stop words?

Solution

Stop words are usually thought of as "the most common words in a language". However, other definitions based on different tasks are possible.

It makes sense to consider 'not' a stop word if your task is based on word frequencies (e.g. tf–idf analysis for document classification), because a word that occurs in almost every document does little to distinguish one document from another.
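As a rough illustration of the frequency argument, here is a minimal sketch that computes a plain inverse document frequency, log(N / df), on a made-up four-document corpus. The corpus and the `idf` helper are assumptions for illustration only; real libraries typically use smoothed variants of this formula.

```python
import math

# Toy corpus (hypothetical); each document is a list of lowercase tokens.
docs = [
    "the service is not helpful".split(),
    "the docs are not clear".split(),
    "the api is not stable".split(),
    "the forum is helpful".split(),
]

def idf(term, docs):
    """Plain inverse document frequency: log(N / df), no smoothing."""
    df = sum(1 for doc in docs if term in doc)  # number of documents containing the term
    return math.log(len(docs) / df)

# Terms that appear in more documents get a lower weight.
for term in ("the", "not", "helpful", "stable"):
    print(f"{term:8s} idf = {idf(term, docs):.3f}")
```

In this toy corpus "not" appears in three of the four documents, so it scores lower than "helpful" or "stable" and contributes little to telling the documents apart.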

If you're concerned with the context of the text (e.g. sentiment analysis), it might make sense to treat negation words differently. Negation changes the so-called valence of a text, that is, whether it reads as positive or negative. This needs to be treated carefully and is usually not trivial. One example would be the Twitter negation corpus. An explanation of the approach is given in this paper.
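One simple way to treat negation words differently is to take them out of the stop list and, optionally, fold each one into the token that follows it. The sketch below uses only the standard library. The tiny `STOP_WORDS` set, the `NEGATIONS` set and all function names are hypothetical, and the `NOT_` prefix is a common heuristic, not the method of the paper mentioned above.

```python
import re

# Hypothetical, deliberately tiny stop list; real lists (NLTK, spaCy, sklearn) are much longer.
STOP_WORDS = {"is", "are", "the", "a", "an", "not", "nor", "never", "none", "no"}
NEGATIONS = {"not", "nor", "never", "none", "no"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())

def remove_stop_words(tokens, keep_negations=True):
    """Drop stop words, optionally keeping the negation words."""
    stop = STOP_WORDS - NEGATIONS if keep_negations else STOP_WORDS
    return [t for t in tokens if t not in stop]

def mark_negation(tokens):
    """Fold each negation word into the next token as a NOT_ prefix.

    A negation with no following token is simply dropped.
    """
    out, negate = [], False
    for t in tokens:
        if t in NEGATIONS:
            negate = True
        elif negate:
            out.append("NOT_" + t)
            negate = False
        else:
            out.append(t)
    return out

tokens = tokenize("StackOverflow is not helpful")
print(remove_stop_words(tokens, keep_negations=False))  # negation lost
print(remove_stop_words(tokens))                        # negation kept
print(mark_negation(remove_stop_words(tokens)))         # negation attached to the next word
```

With the negation attached, "helpful" and "NOT_helpful" become distinct features, so the two sentences from the question no longer collapse into the same bag of words. The heuristic only covers the word immediately after the negation, which is one reason the problem is not trivial in practice.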

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange