Question

I'm taking my first steps writing code to do linguistic analysis of texts. I use Python and the NLTK library. The problem is that the actual counting of words takes up close to 100% of my CPU (Core i5, 8 GB RAM, 2014 MacBook Air) and had run for 14 hours before I shut the process down. How can I speed up the looping and counting?

I have created a corpus in NLTK out of three Swedish UTF-8, tab-separated files: Swe_Newspapers.txt, Swe_Blogs.txt, and Swe_Twitter.txt. It works fine:

import nltk
my_corpus = nltk.corpus.CategorizedPlaintextCorpusReader(".", r"Swe_.*", cat_pattern=r"Swe_(\w+)\.txt")
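A quick sanity check (assuming the three files sit in the current working directory) confirms that the categories are read off the file names via the cat_pattern capture group:

print(my_corpus.categories())
# expected: ['Blogs', 'Newspapers', 'Twitter']
print(my_corpus.fileids(categories=["Blogs"]))
# expected: ['Swe_Blogs.txt']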

Then I've loaded a text file with one word per line into NLTK. That also works fine.

my_wordlist = nltk.corpus.WordListCorpusReader("/Users/mos/Documents/", "wordlist.txt")

The text file I want to analyse (Swe_Blogs.txt) has this structure, and parses fine:

Wordpress.com   2010/12/08  3   1,4,11  osv osv osv …
bloggagratis.se 2010/02/02  3   0   Jag är utled på plogade vägar, matte är lika utled hon.
wordpress.com   2010/03/10  3   0   1 kruka Sallad, riven
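For reference, the text itself is the fifth tab-separated field; a plain-Python sketch for pulling it out of one file (assuming the UTF-8 encoding described above, and at least five columns per line) could look like this:

import codecs

with codecs.open("Swe_Blogs.txt", encoding="utf-8") as f:
    for line in f:
        fields = line.rstrip("\n").split("\t")
        text = fields[4]  # the fifth column holds the actual blog text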

EDIT: The suggestion to produce the counter in one pass, as below, does not work when the category is passed positionally, but can be fixed:

counter = collections.Counter(word for word in my_corpus.words("Blogs") if word in my_wordlist)

This produces the error:

IOError                                   Traceback (most recent call last)
<ipython-input-41-1868952ba9b1> in <module>()
----> 1 counter = collections.Counter(word for word in my_corpus.words("Blogs") if word in my_wordlist)

/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/corpus/reader/plaintext.pyc in words(self, fileids, categories)
    182     def words(self, fileids=None, categories=None):
    183         return PlaintextCorpusReader.words(
--> 184             self, self._resolve(fileids, categories))
    185     def sents(self, fileids=None, categories=None):
    186         return PlaintextCorpusReader.sents(

/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/corpus/reader/plaintext.pyc in words(self, fileids, sourced)
     89                                            encoding=enc)
     90                            for (path, enc, fileid)
---> 91                            in self.abspaths(fileids, True, True)])
     92
     93

/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/corpus/reader/api.pyc in abspaths(self, fileids, include_encoding, include_fileid)
    165             fileids = [fileids]
    166
--> 167         paths = [self._root.join(f) for f in fileids]
    168
    169         if include_encoding and include_fileid:

/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/data.pyc in join(self, fileid)
    174     def join(self, fileid):
    175         path = os.path.join(self._path, *fileid.split('/'))
--> 176         return FileSystemPathPointer(path)
    177
    178     def __repr__(self):

/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/data.pyc in __init__(self, path)
    152         path = os.path.abspath(path)
    153         if not os.path.exists(path):
--> 154             raise IOError('No such file or directory: %r' % path)
    155         self._path = path

IOError: No such file or directory: '/Users/mos/Documents/Blogs'

A fix is to pass the category with the categories keyword and assign my_corpus.words(categories=["Blogs"]) to a variable:

blogs_text = my_corpus.words(categories=["Blogs"])
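The reason is that the first positional argument of words() is fileids, so my_corpus.words("Blogs") makes NLTK look for a file literally named Blogs (hence the path /Users/mos/Documents/Blogs in the IOError above); passing the category with the categories keyword lets the reader resolve it to Swe_Blogs.txt.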

It's when I try to count all occurrences of each word (about 20K words) in the wordlist within the blogs in the corpus (115.7 MB) that my computer gets a little tired. How can I speed up the following code? It seems to work, with no error messages, but it takes more than 14 hours to execute.

import collections
counter = collections.Counter()

for word in my_corpus.words(categories="Blogs"):
    for token in my_wordlist.words():
        if token == word:
            counter[token]+=1
        else:
            continue

Any help to improve my coding skills is much appreciated!

Solution

It seems like your double loop could be improved:

for word in my_corpus.words(categories="Blogs"):
    for token in my_wordlist.words():
        if token == word:
            counter[token] += 1

This would be much faster as:

words = set(my_wordlist.words())  # call once, make a set for fast membership checks
for word in my_corpus.words(categories="Blogs"):
    if word in words:
        counter[word] += 1

This takes you from doing len(my_wordlist.words()) * len(my_corpus.words(...)) operations to roughly len(my_wordlist.words()) + len(my_corpus.words(...)) operations, since building the set is O(n) and checking whether a word is in the set is O(1) on average. With a 20K-word list, that replaces up to 20,000 string comparisons per corpus token with a single hash lookup.
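If you want to see the gap for yourself, here is a small self-contained micro-benchmark (a sketch using synthetic 20,000-entry data, not the actual corpora):

import timeit

setup = "words_list = ['word%d' % i for i in range(20000)]; words_set = set(words_list)"

# Worst case for the list: the sought word is last, so every membership
# test scans all 20,000 entries.
print(timeit.timeit("'word19999' in words_list", setup=setup, number=1000))

# The set hashes the word and checks one bucket, regardless of size.
print(timeit.timeit("'word19999' in words_set", setup=setup, number=1000))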

You can also build the Counter directly from an iterable, as Two-Bit Alchemist points out:

counter = collections.Counter(word for word in my_corpus.words(categories="Blogs")
                              if word in words)
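Either way, once the Counter is built, Counter.most_common reads off the top hits directly, for example:

for word, freq in counter.most_common(20):
    print("%s\t%d" % (word, freq))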

OTHER TIPS

You already got good answers on how to count words properly in Python. The problem is that it will still be quite slow. If you are just exploring the corpora, a chain of UNIX tools gives you a much quicker result. Assuming that your text is tokenized, something like this gives you the 100 most frequent tokens, in descending order of frequency:

cut -f 5 Swe_Blogs.txt | tr ' ' '\n' | sort | uniq -c | sort -nr | head -n 100
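Each stage does one job: cut -f 5 keeps the fifth field (tab is cut's default delimiter, and the short option also works with the BSD cut that ships with OS X), tr ' ' '\n' puts one token per line, sort | uniq -c collapses duplicate tokens into counts, and sort -nr | head -n 100 keeps the 100 largest counts.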