Replace
if len(words2) >=7:
long_tokens.append(words2)
with:
long_tokens += [w for w in words2 if len(w) >= 7]
Explanation: what you were doing was you were appending all the words (tokens) produced by corpus.words(fileids)
if the number of words was at least 7 (so I suppose always for your corpus). What you really wanted to do was to filter out the words shorter than 7 characters from the tokens set and append the remaining long words to long_tokens
.
Your function should return the result - tokens having 7 characters or more. I assume the way you create and deal with CategorizedPlaintextCorpusReader
is OK:
loc="C:\Users\Dell\Desktop\CORPUS"
Corpus= CategorizedPlaintextCorpusReader(loc,'(?!\.svn).*\.txt, cat_pattern=r'(Shakespeare|Milton)/.*)
def long_words(corpus = Corpus):
long_tokens=[]
for cat in corpus.categories():
fileids = corpus.fileids(categories=cat)
words = corpus.words(fileids)
long_tokens += [w for w in set(words) if len(w) >= 7]
return set(long_tokens)
print "\n".join(long_words())
Here is an answer to the question you asked in the comments:
for loc in ['cat1','cat2']:
print len(long_words(corpus=CategorizedPlaintextCorpusReader(loc,'(?!\.svn).*\.txt, cat_pattern=r'(Shakespeare|Milton)/.*)), 'words over 7 in', loc