Build Dictionary without Loading All Texts

https://stackoverflow.com/questions/19474333

01-07-2022
Question

I am new to Python and Gensim. I am currently working through one of the tutorials on gensim (http://radimrehurek.com/gensim/tut1.html). I have two questions about this line of code:

# collect statistics about all tokens
>>> dictionary = corpora.Dictionary(line.lower().split() for line in open('mycorpus.txt'))

1) Is the file mycorpus.txt fully loaded into memory before the Dictionary starts to get built? The tutorial explicitly says no:

Similarly, to construct the dictionary without loading all texts into memory

but when I monitor RAM usage in my Activity Monitor, the Python process hits 1 gig for a 3 gig file (I killed the process midway). This is strange, as I assumed the dictionary for my 3 gig text file would be MUCH smaller. Can someone clarify this point for me?

2) How can I recode this line so that I can do stuff between each line read? I want to print to screen to see the progress. Here is my attempt:

i = 1

for line in f:
    if i % 1000 == 0:
        print i
    dictionary = corpora.Dictionary([line.lower().split()])
    i += 1

This doesn't work because dictionary is being reinitialized for every line.

I realize these are very n00b questions - appreciate your help and patience.

Answer

1) No. The tutorial passes a generator expression to the Dictionary constructor, and that generator yields only one tokenized line at a time. Apart from the buffering Python does internally when reading a file, only about one line is held in memory at any moment.
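The one-line-at-a-time behaviour can be checked with a small stdlib-only sketch. Here `consume` is a hypothetical stand-in for any consumer that iterates over the documents it is given (it is not gensim code), and the in-memory `io.StringIO` object stands in for the corpus file. The log records when each line is read and when it is consumed.

```python
import io

def tokenized_lines(f, log):
    """Yield one token list per line, recording when each line is read."""
    for line in f:
        log.append("read")
        yield line.lower().split()

def consume(docs, log):
    """Hypothetical consumer: collects the unique tokens it is handed."""
    vocab = set()
    for tokens in docs:
        log.append("consumed")
        vocab.update(tokens)
    return vocab

log = []
f = io.StringIO("Human machine interface\nGraph of trees\nhuman graph\n")
vocab = consume(tokenized_lines(f, log), log)

# Reads and consumption interleave: no line is read before the
# previous one has been handed to the consumer.
print(log)
```

If the whole file were loaded first, all the "read" entries would come before the first "consumed" entry.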

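The dictionary itself still grows while it is being built, though. It keeps every distinct token it has seen, so the finished dictionary can take a substantial amount of memory, possibly approaching the size of the original file if the text contains a very large number of distinct tokens.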
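To get a rough feel for how large the dictionary might become, it can help to compare the number of distinct tokens with the total number of tokens. The following is a minimal stdlib-only sketch, not part of gensim; `vocabulary_stats` is a hypothetical helper name, and the tokenization simply mirrors the `line.lower().split()` used in the question.

```python
import io

def vocabulary_stats(lines):
    """Return (total_tokens, distinct_tokens) for an iterable of text lines."""
    total = 0
    distinct = set()
    for line in lines:
        tokens = line.lower().split()
        total += len(tokens)
        distinct.update(tokens)
    return total, len(distinct)

# A tiny in-memory sample standing in for the real corpus file.
sample = io.StringIO("the cat sat\nThe cat ran\nthe dog sat\n")
total, distinct = vocabulary_stats(sample)
print(total, distinct)
```

Running this over a slice of the real file (for example the first few hundred thousand lines) shows whether the distinct-token count levels off or keeps climbing with the input.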

2) To do something between line reads, you can write a new generator that performs your action and then yields each line unchanged, just as the file object did before:

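from gensim import corpora

def generator(f):
    # Pass each line through unchanged, printing the line index every 1000 lines.
    for i, line in enumerate(f):
        if i % 1000 == 0:
            print(i)
        yield line

with open('mycorpus.txt') as f:
    dictionary = corpora.Dictionary(line.lower().split() for line in generator(f))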
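An alternative that stays closer to the loop in the question is to create the dictionary once, before the loop, and add one document per iteration. gensim's Dictionary exposes an `add_documents` method for this. The sketch below is written so it runs without gensim installed: `TinyDictionary` is a hypothetical stand-in that only borrows the method name, and `build_with_progress` is an illustrative helper, not a gensim function.

```python
import io

class TinyDictionary:
    """Hypothetical stand-in with the same add_documents method name as gensim's Dictionary."""

    def __init__(self):
        self.token2id = {}

    def add_documents(self, documents):
        for tokens in documents:
            for token in tokens:
                # Assign the next free integer id to each unseen token.
                self.token2id.setdefault(token, len(self.token2id))

def build_with_progress(lines, dictionary, every=1000):
    """Add lines to the dictionary one at a time, printing progress every `every` lines."""
    count = 0
    for count, line in enumerate(lines, start=1):
        if count % every == 0:
            print(count)
        # A single line is wrapped in a list because add_documents expects
        # an iterable of documents, each a list of tokens.
        dictionary.add_documents([line.lower().split()])
    return count

corpus = io.StringIO("a b c\nb c d\nd e\n")
dictionary = TinyDictionary()
n_lines = build_with_progress(corpus, dictionary, every=2)
```

With gensim available, passing `corpora.Dictionary()` and an open file object in place of the stand-ins should work the same way, since the dictionary is created once and only updated inside the loop.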

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow