Question

I'm trying to write a function to be executed in several IPython engines. The function takes a pandas Series as an argument. Each element of the Series is a string, and the whole Series constitutes a corpus for TF-IDF computation.

After reading the IPython parallel documentation and some tutorials, it seemed quite straightforward to do, and I came up with the following:

import pandas as pd
from IPython.parallel import Client


def calculemus(corpus):
    from sklearn.feature_extraction.text import TfidfVectorizer

    vectorizer = TfidfVectorizer(min_df=1, stop_words='english')

    return vectorizer.fit_transform(corpus)


review = pd.read_csv('review.csv')['text']
review = review.fillna('')

client = Client()

r = client[-1].apply(calculemus, review).get()

But I got this error instead:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/xxx/site-packages/IPython/zmq/serialize.pyc in unpack_apply_message(bufs, g, copy)
    154                     sa.data = m.bytes
    155 
--> 156     args = uncanSequence(map(unserialize, sargs), g)
    157     kwargs = {}
    158     for k in sorted(skwargs.iterkeys()):
/xxx/site-packages/IPython/utils/newserialized.pyc in unserialize(serialized)
    175 
    176 def unserialize(serialized):
--> 177     return UnSerializeIt(serialized).getObject()
/xxx/site-packages/IPython/utils/newserialized.pyc in getObject(self)
    159                 buf = self.serialized.getData()
    160                 if isinstance(buf, (bytes, buffer, memoryview)):
--> 161                     result = numpy.frombuffer(buf, dtype = self.serialized.metadata['dtype'])
    162                 else:
    163                     raise TypeError("Expected bytes or buffer/memoryview, but got %r"%type(buf))
ValueError: cannot create an OBJECT array from memory buffer

I'm not sure what the problem is. Could someone enlighten me on this?


UPDATE

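Apparently the error means exactly what it says: an object-dtype array cannot be rebuilt from a raw memory buffer. If I convert the Series to a plain string array first:

import numpy as np
r = client[-1].apply(calculemus, np.array(review, dtype=str)).get()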

it kinda works.

So the next question is, is this a feature or a limitation of IPython?


Solution

This is a bug in IPython 0.13 that should be fixed in master. IPython has a special case for serializing numpy arrays that avoids copying data, and it is triggered by an isinstance(obj, numpy.ndarray) check. That check is too broad: isinstance also matches subclasses, which includes pandas objects. Pandas objects (and array subclasses in general) should not be treated the same way as plain arrays, because their metadata is lost and reconstruction on the other side will often fail.
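The failure mode can be reproduced with NumPy alone. The sketch below is not IPython's serializer; TaggedArray is a hypothetical ndarray subclass standing in for a pandas Series of that era (recent pandas versions no longer subclass ndarray). It shows that isinstance matches the subclass, that a raw-buffer round trip drops the subclass metadata, and that an object-dtype array cannot be rebuilt from a buffer at all.

```python
import numpy as np


class TaggedArray(np.ndarray):
    """Hypothetical ndarray subclass carrying extra metadata (illustration only)."""

    def __new__(cls, data, tag=None):
        obj = np.asarray(data).view(cls)
        obj.tag = tag
        return obj

    def __array_finalize__(self, obj):
        # Called for views and slices; propagate the tag when there is one.
        self.tag = getattr(obj, "tag", None)


arr = TaggedArray([1.0, 2.0, 3.0], tag="corpus")

# isinstance matches subclasses, so a check like this also catches TaggedArray.
is_subclass_match = isinstance(arr, np.ndarray)
# An exact type check would tell the subclass apart from a plain array.
is_exact_match = type(arr) is np.ndarray

# Shipping only the raw bytes and dtype loses the subclass and its metadata.
raw = arr.tobytes()
rebuilt = np.frombuffer(raw, dtype=arr.dtype)
lost_tag = not hasattr(rebuilt, "tag")

# Object arrays hold pointers to Python objects, so NumPy refuses to build
# one from a memory buffer; this is the ValueError seen in the traceback.
try:
    np.frombuffer(b"\x00" * 16, dtype=object)
    object_buffer_error = None
except ValueError as exc:
    object_buffer_error = exc
```

A Series of strings has object dtype, so once it passed the isinstance check it ended up in the last case.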

PS:

r = client[-1].apply(calculemus, np.array(review, dtype=str)).get()

is equivalent to

r = client[-1].apply_sync(calculemus, np.array(review, dtype=str))
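The np.array(review, dtype=str) workaround most likely succeeds because a fixed-width string dtype stores its characters inline, so the bytes alone are enough to rebuild the array. A minimal sketch with NumPy only, using made-up sample texts rather than the real review.csv data:

```python
import numpy as np

# Made-up stand-in for the 'text' column; object dtype, as in a Series of strings.
texts = np.array(["good food", "", "slow service"], dtype=object)

# Converting to a fixed-width string dtype copies the characters into the array.
fixed = np.array(texts, dtype=str)

# A raw-buffer round trip now works, because no Python object pointers are involved.
raw = fixed.tobytes()
restored = np.frombuffer(raw, dtype=fixed.dtype)
```

The conversion copies the data and pads every element to the length of the longest string, which may matter for a large corpus.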
Licensed under: CC-BY-SA with attribution