Question

I can save a serialized corpus into foobar.mm, but when I try to load it, I get an UnpicklingError. Loading the dictionary works fine, though. Does anyone know how to resolve this, and why it occurs?

>>> from gensim import corpora
>>> docs = ["this is a foo bar", "you are a foo"]
>>> texts = [[i for i in doc.lower().split()] for doc in docs]
>>> print texts
[['this', 'is', 'a', 'foo', 'bar'], ['you', 'are', 'a', 'foo']]

>>> dictionary = corpora.Dictionary(texts)
>>> dictionary.save('foobar.dic')
>>> print dictionary
Dictionary(7 unique tokens)
>>> corpora.Dictionary.load('foobar.dic')
<gensim.corpora.dictionary.Dictionary object at 0x329f910>

>>> corpus = [dictionary.doc2bow(text) for text in texts]
>>> corpora.MmCorpus.serialize('foobar.mm', corpus)
>>> corpus = corpora.MmCorpus.load('foobar.mm')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/gensim-0.8.6-py2.7.egg/gensim/utils.py", line 166, in load
    return unpickle(fname)
  File "/usr/local/lib/python2.7/dist-packages/gensim-0.8.6-py2.7.egg/gensim/utils.py", line 492, in unpickle
    return cPickle.load(open(fname, 'rb'))
cPickle.UnpicklingError: invalid load key, '%'.

Solution

See the documentation at http://radimrehurek.com/gensim/tut1.html#corpus-formats

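You are storing the corpus in MatrixMarket format (a plain-text format) with serialize() and then trying to read it back through the binary save/load interface, which uses pickle. The '%' in the error message is the first character of the MatrixMarket header, which pickle cannot interpret. The dictionary loads fine because it was written with save(), the pickle-based counterpart of load().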
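The mismatch can be reproduced without gensim. The following is a minimal sketch, assuming Python 3 and a hand-written one-entry MatrixMarket file as a hypothetical stand-in for foobar.mm; it only illustrates that pickle rejects a text file whose first byte is '%'.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for foobar.mm: a MatrixMarket file is plain text
# whose first line starts with "%%MatrixMarket".
mm_text = (
    "%%MatrixMarket matrix coordinate real general\n"
    "1 1 1\n"
    "1 1 1\n"
)

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "foobar.mm")
    with open(path, "w") as f:
        f.write(mm_text)

    # Reading the text file as if it were a pickle fails immediately,
    # because '%' is not a valid pickle opcode.
    try:
        with open(path, "rb") as f:
            pickle.load(f)
        error_message = None
    except pickle.UnpicklingError as exc:
        error_message = str(exc)

print(error_message)
```

The file itself is intact; it is simply being read with the wrong loader.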

To load a serialized MatrixMarket corpus, pass the file name to the MmCorpus constructor instead of calling load():

>>> corpus = corpora.MmCorpus('foobar.mm')

OTHER TIPS

Since gensim's corpora module is using pickle here, as the stack trace reveals, you will only be able to store data of certain types. For more details, see "What can be pickled and unpickled?" in the Python docs.

If this does not apply (i.e. if what you want to pickle and unpickle should be picklable), I fear you might have found a bug in the pickle module. In that case, upgrading to a newer Python version may solve your issue.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow