Question

I have to write a script that gives me all content words in descending order of frequency. I need the 10 most frequent content words, so I not only need a list of the 10 most frequent words in my corpus, I also need to filter out any function words (and, or, any punctuation...). What I have so far is the following:

fileids = corpus.fileids()
text = corpus.words(fileids)
wlist = []
ftable = nltk.FreqDist(text)
wlist.append(ftable.keys())

This gives me a very neat list of all words in descending order of frequency, but how do I filter the function words out?

Thank you.

Solution

You want to filter out a set of words (stopwords). Taking the core idea from this SO answer:

You need to add a couple of lines to your code. Just after

fileids = corpus.fileids()
text = corpus.words(fileids)

add the following lines, which build a list of stopwords and filter them out of your text:

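# get the English stopwords (the NLTK 'stopwords' corpus must be downloaded first)
# a set makes the membership test below much faster than a list
stp = set(nltk.corpus.stopwords.words('english'))

# from your text, keep only the words NOT in stp
# (the stopword list is lowercase, so compare in lowercase)
filtered_text = [w for w in text if w.lower() not in stp]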
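The question also mentions punctuation, which the stopword list does not cover. A minimal sketch of one way to drop both at once, assuming a hypothetical helper `content_words` and a tiny hand-written stopword list standing in for NLTK's (neither is part of the original answer):

```python
def content_words(tokens, stopwords):
    """Keep alphabetic tokens whose lowercase form is not a stopword."""
    stop = set(stopwords)  # set lookup avoids rescanning a list for every token
    return [t for t in tokens if t.isalpha() and t.lower() not in stop]


# Tiny stand-in for nltk.corpus.stopwords.words('english') (assumption)
demo_stopwords = ["the", "and", "or", "a", "on"]
demo_tokens = ["The", "cat", "sat", "on", "the", "mat", ",", "and", "purred", "."]
filtered_demo = content_words(demo_tokens, demo_stopwords)
```

Note that `str.isalpha()` also discards numbers and tokens such as "don't" or "well-known", so a looser test may suit some corpora better.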

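Then continue as before, building the frequency table from the filtered list:

ftable = nltk.FreqDist(filtered_text)
# in NLTK 3, keys() is no longer sorted by frequency, so ask for the top 10 explicitly
wlist = [w for w, count in ftable.most_common(10)]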
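The question asks for the ten most frequent words specifically. In NLTK 3, `FreqDist` subclasses `collections.Counter`, so `most_common` is available on it. A minimal sketch of that step using only the standard library, with a hypothetical helper `top_words` and a made-up token list for illustration:

```python
from collections import Counter


def top_words(tokens, n=10):
    """Return the n most frequent tokens as (word, count) pairs, most frequent first."""
    return Counter(tokens).most_common(n)


# Made-up, already-filtered tokens (assumption, not real corpus data)
demo_filtered = ["cat", "mat", "cat", "dog", "cat", "mat"]
top_two = top_words(demo_filtered, n=2)
```

With a real corpus you would pass the filtered word list instead and leave `n` at its default of 10.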

Hope that helps.

Licensed under: CC-BY-SA with attribution