
I'm having trouble pickling a vectorizer after I customize it.

from sklearn.feature_extraction.text import TfidfVectorizer
import pickle
tfidf_vectorizer = TfidfVectorizer(analyzer=str.split)
with open('test.pkl', 'wb') as f:
    pickle.dump(tfidf_vectorizer, f)

This results in "TypeError: can't pickle method_descriptor objects".

However, if I don't customize the Analyzer, it pickles fine. Any ideas on how I can get around this problem? I need to persist the vectorizer if I'm going to use it more widely.

By the way, I've found that using a simple string split as the analyzer and pre-processing the corpus to remove non-vocabulary and stop words is essential for decent run speed. Otherwise, most of the vectorizer run time is spent in "". The same goes for HashingVectorizer.

This is related to the question "Persisting data in sklearn". (By the way, sklearn.externals.joblib.dump doesn't help either.)




This is not so much a scikit-learn problem as a general Python problem. str.split is a built-in method descriptor, and pickle cannot serialize it on its own:

>>> pickle.dumps(str.split)
Traceback (most recent call last):
  File "<ipython-input-7-7d3648c78b22>", line 1, in <module>
  File "/usr/lib/python2.7/", line 1374, in dumps
    Pickler(file, protocol).dump(obj)
  File "/usr/lib/python2.7/", line 224, in dump
  File "/usr/lib/python2.7/", line 306, in save
    rv = reduce(self.proto)
  File "/usr/lib/python2.7/", line 70, in _reduce_ex
    raise TypeError, "can't pickle %s objects" % base.__name__
TypeError: can't pickle method_descriptor objects

The solution is to use a picklable analyzer, such as a plain function defined at module level:

>>> def split(s):
...     return s.split()
>>> pickle.dumps(split)
>>> tfidf_vectorizer = TfidfVectorizer(analyzer=split)
>>> type(pickle.dumps(tfidf_vectorizer))
<type 'str'>
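
For what it's worth, the same rule can be seen without scikit-learn at all. Below is a minimal sketch, assuming Python 3 and using only the standard library. TinyVectorizer, split_on_whitespace and can_pickle are hypothetical names made up for this illustration; they are not part of scikit-learn. pickle stores a function by its importable name, so a module-level def round-trips while a lambda does not.

```python
import pickle


def split_on_whitespace(text):
    """Module-level function: pickle can store it by its importable name."""
    return text.split()


class TinyVectorizer:
    """Hypothetical stand-in for a vectorizer that keeps a custom analyzer."""

    def __init__(self, analyzer):
        self.analyzer = analyzer

    def tokenize(self, text):
        return self.analyzer(text)


def can_pickle(obj):
    """Return True if pickle.dumps succeeds, False if pickle rejects the object."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
    return True


named = TinyVectorizer(analyzer=split_on_whitespace)
anonymous = TinyVectorizer(analyzer=lambda text: text.split())

# Round-trip the picklable version through bytes and use the restored copy.
restored = pickle.loads(pickle.dumps(named))
tokens = restored.tokenize("persist the vectorizer")
```

The object holding the lambda fails for the same kind of reason as the original vectorizer: one attribute is a callable that pickle has no way to look up again by name. If the pickle is loaded in another process, the module that defines the analyzer function has to be importable there as well.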
Licensed under: CC-BY-SA with attribution