NLTK Stopword List

https://stackoverflow.com/questions/22763224

24-06-2023

Question

I have the code below and I am trying to apply a stop word list to a list of words. However, the results still show words such as "a" and "the", which I thought this process would remove. Any ideas about what has gone wrong would be great.

import nltk
from nltk.corpus import stopwords

word_list = open("xxx.y.txt", "r")
filtered_words = [w for w in word_list if not w in stopwords.words('english')]
print filtered_words

Solution

A few things of note.

1. If you are going to check membership against a list over and over, use a set instead of a list. A set lookup takes constant time on average, whereas a list is scanned element by element.
2. stopwords.words('english') returns a list of lowercase stop words. It is quite likely that your source has capital letters in it and is not matching for that reason.
3. You aren't reading the file properly: you are iterating over the file object, which yields whole lines (trailing newline included), not a list of the words split on whitespace.
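
To illustrate the last two points, here is a minimal sketch of what the original list comprehension actually compares. It assumes a tiny hand-written stop set in place of stopwords.words('english') and uses io.StringIO as a stand-in for the opened file; the names stops, fake_file, lines and unfiltered are made up for this example.

```python
import io

# Hypothetical stand-in for stopwords.words('english'); the real list is much longer.
stops = {"a", "the", "is", "on"}

# io.StringIO can be iterated line by line, just like an open text file.
fake_file = io.StringIO("The cat sat on a mat\nthe\n")

# Iterating over a file-like object yields whole lines, newline included.
lines = list(fake_file)

# Each "word" tested here is really a full line, so nothing matches the stop set.
unfiltered = [w for w in lines if w not in stops]
```

Even the second line, which contains only "the", survives the filter, because the string being tested is "the" followed by a newline character.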

Putting it all together:

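from nltk.corpus import stopwords

# Build the set once so each membership test is fast.
stops = set(stopwords.words('english'))

# Iterating over the file yields lines, so split each line into words.
with open("xxx.y.txt", "r") as word_file:
    for line in word_file:
        for w in line.split():
            # Lowercase before comparing, since the stop list is lowercase.
            if w.lower() not in stops:
                print(w)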
Licensed under: CC-BY-SA with attribution