Question

I have a small program that uses NLTK to get the frequency distribution of a rather large dataset. The problem is that after a few million words it starts to eat up all the RAM on my system. Here are what I believe to be the relevant lines of code:

import nltk

freq_distribution = nltk.FreqDist(filtered_words)               # get the frequency distribution of all the words
top_words = freq_distribution.keys()[:10]                       # get the top used words
bottom_words = freq_distribution.keys()[-10:]                   # get the least used words

There must be a way to write the key/value store to disk; I'm just not sure how. I'm trying to stay away from a document store like MongoDB and keep it purely Pythonic. If anyone has any suggestions, I would appreciate it.


Solution

By coincidence, I had the same problem in the past month. I was trying to use NLTK and FreqDist to create n-gram frequency tables from large datasets (e.g. the English Wikipedia and Gutenberg datasets). My 8GB machine could store a unigram model in memory, but not a bigram one.

My solution was to use BerkeleyDB, which stores a key/value database on disk but also keeps an in-memory table cache for speed. Writing every count straight to the database is VERY slow for frequency distributions, so I also created my own sub-tables in memory using FreqDist and then periodically saved them to BerkeleyDB (typically every 1,000 or so input files). This greatly reduces the number of BerkeleyDB writes because it removes a lot of duplicates - e.g. "the" in a unigram model is written once instead of many hundreds of thousands of times. I wrote it up here:

https://www.winwaed.com/blog/2012/05/17/using-berkeleydb-to-create-a-large-n-gram-table/
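For what it's worth, here is a minimal sketch of that periodic-flush idea. It uses Python's standard dbm module as a stand-in for BerkeleyDB (the bsddb3 package exposes a similar dict-like interface); the file name, flush interval, and helper names are placeholders for illustration, not part of the original write-up.

import dbm
import nltk

def flush_to_disk(fd, db):
    # Merge the in-memory FreqDist into the on-disk counts, then clear it.
    for word, count in fd.items():
        key = word.encode('utf-8')
        old = int(db[key]) if key in db else 0
        db[key] = str(old + count).encode('utf-8')
    fd.clear()

def build_counts(paths, db_path='freqs.db', flush_every=1000):
    fd = nltk.FreqDist()
    with dbm.open(db_path, 'c') as db:
        for i, path in enumerate(paths, 1):
            with open(path, encoding='utf-8') as f:
                fd.update(f.read().split())
            if i % flush_every == 0:    # periodic flush keeps memory bounded
                flush_to_disk(fd, db)
        flush_to_disk(fd, db)           # write out whatever is left

Only the small FreqDist lives in memory between flushes; the full table stays on disk.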

The problem with using pickle is that you have to store the entire distribution in memory. The only way to stay purely Pythonic is to write your own implementation, with its own key/value disk database and probably your own in-memory cache. Using BerkeleyDB is a lot easier, and more efficient!

Other tips

I've used the json module to store large dictionaries (or other data structures) in these kinds of situations. I think pickle or cPickle may be more efficient, unless you want to store the data in human-readable form (often useful for NLP).

Here's how I do it:

import json
d = {'key': 'val'}
with open('file.txt', 'w') as f:
    json.dump(d, f)

Then, to retrieve it:

with open('file.txt', 'r') as f:
    d = json.load(f)
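
If human readability doesn't matter, a pickle version is just as short (a FreqDist is a Counter subclass, so it pickles fine); the file name here is only an example:

import pickle

with open('freqs.pkl', 'wb') as f:
    pickle.dump(freq_distribution, f)    # serialize the whole distribution to disk

with open('freqs.pkl', 'rb') as f:
    freq_distribution = pickle.load(f)   # load it back into memory later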