Question

I am trying to remove stopwords from a string of text:

from nltk.corpus import stopwords
text = 'hello bye the the hi'
text = ' '.join([word for word in text.split() if word not in (stopwords.words('english'))])

I am processing about 6 million such strings, so speed is important. Profiling my code shows that the slowest part is the lines above. Is there a better way to do this? I'm thinking of using something like regex's re.sub, but I don't know how to write the pattern for a set of words. Can someone give me a hand? I'm also happy to hear about other, possibly faster, methods.

Note: I tried someone's suggestion of wrapping stopwords.words('english') with set(), but that made no difference.

Thank you.


Solution

Try caching the stopwords object, as shown below. Constructing this each time you call the function seems to be the bottleneck.

    from nltk.corpus import stopwords

    # Build the stopword list once, at module load time.
    cachedStopWords = stopwords.words("english")

    def testFuncOld():
        # Rebuilds the stopword list on every call (the bottleneck).
        text = 'hello bye the the hi'
        text = ' '.join([word for word in text.split() if word not in stopwords.words("english")])

    def testFuncNew():
        # Reuses the cached stopword list.
        text = 'hello bye the the hi'
        text = ' '.join([word for word in text.split() if word not in cachedStopWords])

    if __name__ == "__main__":
        for i in range(10000):
            testFuncOld()
            testFuncNew()

I ran this through the profiler: python -m cProfile -s cumulative test.py. The relevant lines are posted below.

    ncalls  cumtime  filename:lineno(function)
     10000    7.723  words.py:7(testFuncOld)
     10000    0.140  words.py:11(testFuncNew)

So, caching the stopwords list gives roughly a 55x speedup (7.723 s vs. 0.140 s cumulative time).
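A further note beyond the original answer: the asker found that wrapping stopwords.words('english') in set() made no difference, but that is because the set was still being rebuilt for every string. Once construction is hoisted out, a cached set should beat a cached list, since list membership tests are O(n). A minimal sketch (remove_stopwords and cached_stop_words are names of my own choosing):

    from nltk.corpus import stopwords

    # Cache once AND convert to a set: construction happens a single time,
    # and membership tests become O(1) instead of O(n).
    cached_stop_words = set(stopwords.words("english"))

    def remove_stopwords(text):
        return ' '.join([word for word in text.split()
                         if word not in cached_stop_words])

    print(remove_stopwords('hello bye the the hi'))  # -> hello bye hi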

Other Tips

Use a regexp to remove all words that match the stopword list:

import re
from nltk.corpus import stopwords

text = 'hello bye the the hi'
pattern = re.compile(r'\b(' + r'|'.join(stopwords.words('english')) + r')\b\s*')
text = pattern.sub('', text)

This will probably be way faster than looping yourself, especially for large input strings.

If the last word in the text gets deleted this way, you may be left with trailing whitespace; I propose handling this separately, as shown below.
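For example, a trailing strip() after the substitution takes care of it; a small sketch reusing the pattern compiled above:

text = 'hello bye the the hi the'
text = pattern.sub('', text).strip()  # strip() drops the leftover trailing space
print(repr(text))  # 'hello bye hi'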

Sorry for the late reply; this may prove useful for new users.

  • Create a dictionary of stopwords using the collections library.
  • Use that dictionary for very fast lookups (O(1) time) rather than searching a list (O(n) in the number of stopwords).

    from collections import Counter
    from nltk.corpus import stopwords

    text = 'hello bye the the hi'
    stop_words = stopwords.words('english')
    stopwords_dict = Counter(stop_words)  # Counter is a dict subclass: O(1) lookups
    text = ' '.join([word for word in text.split() if word not in stopwords_dict])
    

First, you're creating the stop-word list for each string. Create it once; a set would indeed be great here.

forbidden_words = set(stopwords.words('english'))

Next, get rid of the [] inside join and use a generator expression instead (though be aware that str.join materializes its argument internally, so in CPython the list comprehension is often just as fast in practice).

Replace

' '.join([x for x in ['a', 'b', 'c']])

with

' '.join(x for x in ['a', 'b', 'c'])

The next thing to deal with would be making .split() yield values instead of returning a list; I believe a regex would be a good replacement here. See this thread for why str.split() is actually fast.

Lastly, do such a job in parallel (removing stop words from 6 million strings); that is a whole different topic, but see the sketch below.
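For completeness, here is a minimal sketch of that parallel approach using multiprocessing; the names are my own, and the worker count and chunksize would need tuning for real data:

import multiprocessing
from nltk.corpus import stopwords

# Each worker process re-imports this module, so the set is built once per worker.
STOP_WORDS = set(stopwords.words('english'))

def remove_stop_words(text):
    return ' '.join(w for w in text.split() if w not in STOP_WORDS)

if __name__ == '__main__':
    texts = ['hello bye the the hi'] * 1000  # stand-in for the 6 million strings
    with multiprocessing.Pool() as pool:
        cleaned = pool.map(remove_stop_words, texts, chunksize=100)
    print(cleaned[0])  # -> hello bye hi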

Try avoiding the loop and instead using a regex to remove the stopwords:

import re
from nltk.corpus import stopwords

cachedStopWords = stopwords.words("english")
text = 'hello bye the the hi'
pattern = re.compile(r'\b(' + r'|'.join(cachedStopWords) + r')\b\s*')
text = pattern.sub('', text)
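One caveat worth adding: the NLTK stopword list is lowercase, so this pattern misses capitalized occurrences such as 'The'. If case-insensitive matching is acceptable for your data, compiling with re.IGNORECASE handles it:

pattern = re.compile(r'\b(' + r'|'.join(cachedStopWords) + r')\b\s*', re.IGNORECASE)
print(pattern.sub('', 'Hello bye The the hi'))  # -> Hello bye hi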

Using just a regular dict seems to be the fastest solution by far, surpassing even the Counter solution by about 10%.

from nltk.corpus import stopwords

# Only the dict keys matter here; membership checks on a dict are O(1).
stopwords_dict = {word: 1 for word in stopwords.words("english")}
text = 'hello bye the the hi'
text = " ".join([word for word in text.split() if word not in stopwords_dict])

Tested using the cProfile profiler.

You can find the test code used here: https://gist.github.com/maxandron/3c276924242e7d29d9cf980da0a8a682

EDIT:

On top of that, if we replace the list comprehension with a plain loop, we get another ~20% increase in performance:

from nltk.corpus import stopwords

stopwords_dict = {word: 1 for word in stopwords.words("english")}
text = 'hello bye the the hi'

new = ""
for word in text.split():
    if word not in stopwords_dict:
        new += word + " "  # append a separator so words don't run together
text = new.strip()         # drop the trailing space
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow