The problem begins right here:
file(filename).read()
This reads the entire file into memory as a single string. If you instead process the file line by line (or chunk by chunk), you won't run into the memory problem:
with open(filename) as f:
    for line in f:
        ...  # process one line at a time; only the current line is held in memory
You could also benefit from using a collections.Counter to count the frequency of words.
In [1]: import collections
In [2]: freq = collections.Counter()
In [3]: line = 'Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod'
In [4]: freq.update(line.split())
In [5]: freq
Out[5]: Counter({'ipsum': 1, 'amet,': 1, 'do': 1, 'sit': 1, 'eiusmod': 1, 'consectetur': 1, 'sed': 1, 'elit,': 1, 'dolor': 1, 'Lorem': 1, 'adipisicing': 1})
And to count some more words,
In [6]: freq.update(line.split())
In [7]: freq
Out[7]: Counter({'ipsum': 2, 'amet,': 2, 'do': 2, 'sit': 2, 'eiusmod': 2, 'consectetur': 2, 'sed': 2, 'elit,': 2, 'dolor': 2, 'Lorem': 2, 'adipisicing': 2})
A collections.Counter is a subclass of dict, so you can use it in ways you are already familiar with. In addition, it has some useful methods for counting, such as most_common.
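Putting the pieces together, here is a minimal sketch of the whole task (the file path is a placeholder, and splitting on whitespace is an assumption; note it leaves punctuation such as the trailing commas above attached to the words):

import collections

filename = 'lorem.txt'  # placeholder path; substitute your own file

freq = collections.Counter()
with open(filename) as f:
    for line in f:  # one line in memory at a time
        freq.update(line.split())

# the ten most frequent words, as (word, count) pairs
print(freq.most_common(10))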