Fdist and top 10 function words

https://stackoverflow.com/questions/14487659

17-01-2022

Question

I have to write a script that lists all content words in descending order of frequency. I need the 10 most frequent content words, so besides ranking the words in my corpus by frequency, I also need to filter out any function words (and, or, any punctuation...). What I have so far is the following:

fileids = corpus.fileids()
text = corpus.words(fileids)
wlist = []
ftable = nltk.FreqDist(text)
wlist.append(ftable.keys())

This gives me a very neat list of all words in descending order of frequency, but how do I filter the function words out?

Thank you.

Solution

You want to filter out a set of words (stopwords). Taking the core idea from this SO answer:

You need to introduce a couple of lines into your code: Just after

fileids=corpus.fileids ()
text=corpus.words(fileids)

add the following lines, which create a list of stopwords and filter them out of your text:

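# Get NLTK's English stopword list as a set for fast lookups
# (run nltk.download('stopwords') once if the corpus is missing)
stp = set(nltk.corpus.stopwords.words('english'))

# Keep only alphabetic tokens whose lowercase form is NOT in stp
# (the stopword list is all lowercase; isalpha() also drops punctuation)
filtered_text = [w for w in text if w.isalpha() and w.lower() not in stp]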

Now, continue as you would

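# Count the remaining words and take the 10 most frequent (word, count) pairs.
# In NLTK 3, FreqDist.keys() is no longer sorted by frequency, so use most_common().
ftable = nltk.FreqDist(filtered_text)
wlist = ftable.most_common(10)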
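
If you want to check the logic without downloading any corpus, here is a minimal sketch of the same idea using collections.Counter, which nltk.FreqDist subclasses in NLTK 3. The token list, the tiny stopword set, and the helper name top_content_words are all made up for illustration; in a real script they would be replaced by corpus.words(fileids) and nltk.corpus.stopwords.words('english').

```python
from collections import Counter

# Hypothetical stand-ins for corpus.words(fileids) and the NLTK stopword list.
tokens = ["The", "cat", "and", "the", "dog", ",", "the", "cat",
          "or", "a", "bird", ".", "Cat"]
stopwords = {"the", "and", "or", "a"}


def top_content_words(words, stop, n=10):
    """Return the n most frequent (word, count) pairs,
    ignoring stopwords and non-alphabetic tokens such as punctuation."""
    content = [
        w.lower()                      # fold case so "Cat" and "cat" count together
        for w in words
        if w.isalpha() and w.lower() not in stop
    ]
    return Counter(content).most_common(n)


top = top_content_words(tokens, stopwords)
print(top)
```

Lowercasing before counting is a choice, not a requirement: it merges "Cat" and "cat" into one entry, which is usually what you want for a frequency list but loses the distinction for proper nouns.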

Hope that helps.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow