Question

I am seeing strange behavior on an IPython cluster. The calculations finish, but many results never reach the client, and the engines sit idle after finishing their first calculation.

I suspect something is wrong with zmq, because from time to time I see the following error:

  File "/data/misc/nano/python/env_stable/lib/python2.7/site-packages/IPython/parallel/client/asyncresult.py", line 118, in get
    if not self.ready():
  File "/data/misc/nano/python/env_stable/lib/python2.7/site-packages/IPython/parallel/client/asyncresult.py", line 132, in ready
    self.wait(0)
  File "/data/misc/nano/python/env_stable/lib/python2.7/site-packages/IPython/parallel/client/asyncresult.py", line 142, in wait
    self._ready = self._client.wait(self.msg_ids, timeout)
  File "/data/misc/nano/python/env_stable/lib/python2.7/site-packages/IPython/parallel/client/client.py", line 1058, in wait
    self.spin()
  File "/data/misc/nano/python/env_stable/lib/python2.7/site-packages/IPython/parallel/client/client.py", line 1015, in spin
    self._flush_results(self._task_socket)
  File "/data/misc/nano/python/env_stable/lib/python2.7/site-packages/IPython/parallel/client/client.py", line 814, in _flush_results
    idents,msg = self.session.recv(sock, mode=zmq.NOBLOCK)
  File "/data/misc/nano/python/env_stable/lib/python2.7/site-packages/IPython/zmq/session.py", line 642, in recv
    idents, msg_list = self.feed_identities(msg_list, copy)
  File "/data/misc/nano/python/env_stable/lib/python2.7/site-packages/IPython/zmq/session.py", line 673, in feed_identities
    idx = msg_list.index(DELIM)
ValueError: '<IDS|MSG>' is not in list
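
For readers unfamiliar with this error, here is a minimal sketch of the mechanism behind the last traceback frame. It is not IPython's implementation; only the `msg_list.index(DELIM)` call and the delimiter value are taken from the traceback, and the function names `split_identities` and `try_split` are hypothetical. It shows that a multipart message arriving without the delimiter frame makes `list.index` raise exactly this `ValueError`.

```python
# Delimiter value as it appears in the error message above.
DELIM = b"<IDS|MSG>"


def split_identities(msg_list):
    """Split a multipart message into (identities, payload) at DELIM.

    Hypothetical helper mirroring the single line shown in the traceback.
    """
    idx = msg_list.index(DELIM)  # raises ValueError if DELIM is absent
    return msg_list[:idx], msg_list[idx + 1:]


def try_split(msg_list):
    """Return the split message, or None if the delimiter frame is missing."""
    try:
        return split_identities(msg_list)
    except ValueError:
        return None
```

A well-formed message such as `[b"engine-1", DELIM, b"sig", b"header"]` splits cleanly, while a list of frames that lacks the delimiter is reported as malformed instead of raising.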

Additionally, the IPython.zmq test suite reports two errors:

======================================================================
ERROR: test_send (IPython.zmq.tests.test_session.TestSession)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/clusterdata/python/env_stable/lib/python2.7/site-packages/IPython/zmq/tests/test_session.py", line 76, in test_send
    socket = MockSocket(zmq.Context.instance(),zmq.PAIR)
  File "/clusterdata/python/env_stable/lib/python2.7/site-packages/IPython/zmq/tests/test_session.py", line 34, in __init__
    self.data = []
  File "/clusterdata/python/env_stable/lib/python2.7/site-packages/zmq/sugar/attrsettr.py", line 38, in __setattr__
    self.__class__.__name__, upper_key)
AttributeError: MockSocket has no such option: DATA

======================================================================
ERROR: test_send (IPython.zmq.tests.test_session.TestSession)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/clusterdata/python/env_stable/lib/python2.7/site-packages/zmq/tests/__init__.py", line 108, in tearDown
    raise RuntimeError("context could not terminate, open sockets likely remain in test")
RuntimeError: context could not terminate, open sockets likely remain in test

----------------------------------------------------------------------

I use pyzmq 13.0.0 (installed by pip) with zeromq 3.2.2, which was compiled by the pyzmq setup. I use IPython 0.13.1 and Python 2.7.3.

Any suggestions as to what this could be? If not, how could I find out more about why these errors occur?

Update: It turns out the slowdown was caused by a long task queue in ipcontroller, which was taking 100% CPU and lagging badly. That is a separate issue, but I would still appreciate feedback on the above.


Solution

Answered by @minrk in the comments: the ZMQ errors were unimportant. The performance problem was due to scheduling and was solved by setting TaskScheduler.hwm=0.
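
As a sketch of how that setting might be applied, assuming the controller reads an `ipcontroller_config.py` file from the active profile directory (the file location is an assumption; only the `TaskScheduler.hwm=0` setting comes from the answer):

```python
# ipcontroller_config.py (assumed to live in the profile directory,
# e.g. profile_default). This is a config fragment, not a standalone script:
# get_config() is provided by IPython when it loads the file.
c = get_config()

# hwm is the high water mark: the maximum number of outstanding tasks
# the task scheduler assigns to each engine. 0 means no limit.
c.TaskScheduler.hwm = 0
```

The same option can usually also be passed on the command line when starting the controller, as `ipcontroller --TaskScheduler.hwm=0`.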

Licensed under: CC-BY-SA with attribution