App crashing after too many missed heartbeats

https://stackoverflow.com/questions/13816435

07-12-2021
Question

I have an app that distributes load across a number of workers. So far all workers run on the same VM; I have not needed to scale up yet. My problem is that roughly every 3-4 days a worker crashes with the error message below, which, as far as I can tell, means there was no contact between the client and the RabbitMQ server for 1200 seconds.

Traceback (most recent call last):
  File "/var/www/vhosts/niklas/workers/builder.py", line 170, in <module>
    BuildWorker().main()
  File "/var/www/vhosts/niklas/lib/worker.py", line 29, in main
    self.msgs.ch.start_consuming()
  File "/usr/local/lib/python2.6/dist-packages/pika/adapters/blocking_connection.py", line 722, in start_consuming
    self.connection.process_data_events()
  File "/usr/local/lib/python2.6/dist-packages/pika/adapters/blocking_connection.py", line 93, in process_data_events
    self.process_timeouts()
  File "/usr/local/lib/python2.6/dist-packages/pika/adapters/blocking_connection.py", line 99, in process_timeouts
    self._call_timeout_method(self._timeouts.pop(timeout_id))
  File "/usr/local/lib/python2.6/dist-packages/pika/adapters/blocking_connection.py", line 164, in _call_timeout_method
    timeout_value['method']()
  File "/usr/local/lib/python2.6/dist-packages/pika/heartbeat.py", line 85, in send_and_check
    return self._close_connection()
  File "/usr/local/lib/python2.6/dist-packages/pika/heartbeat.py", line 106, in _close_connection
    HeartbeatChecker._STALE_CONNECTION % duration)
  File "/usr/local/lib/python2.6/dist-packages/pika/adapters/blocking_connection.py", line 75, in close
    self.process_data_events()
  File "/usr/local/lib/python2.6/dist-packages/pika/adapters/blocking_connection.py", line 91, in process_data_events
    self._handle_timeout()
  File "/usr/local/lib/python2.6/dist-packages/pika/adapters/blocking_connection.py", line 198, in _handle_timeout
    self._on_connection_closed(None, True)
  File "/usr/local/lib/python2.6/dist-packages/pika/adapters/blocking_connection.py", line 235, in _on_connection_closed
    raise exceptions.AMQPConnectionError(*self.closing)
pika.exceptions.AMQPConnectionError: (320, 'Too Many Missed Heartbeats, No reply in 1200 seconds')

My question is: what could cause this? It only happens to about one out of three workers; the others run fine without any error message or warning (again, all workers and rabbitmq-server are on the same VM). I'm using the standard method in the Python library pika, start_consuming(), to retrieve new requests. The code is way too big to attach here and, judging by the error message, the problem seems to lie outside my code or to be a system issue.

I'm using:

Python Pika 0.9.8
RabbitMQ 3.0.0
Debian 6.0
All workers are started inside screen
VM hosted at Linode, 512MB memory

Solution

We experienced a similar problem due to a bug (#236) in pika 0.9.8.

https://github.com/pika/pika/pull/236

This should be fixed in 0.9.9, or it can be resolved by patching your pika library with the source code attached to the linked issue on GitHub.

(Pika was closing the connection after 2 cumulative missed heartbeats rather than 2 consecutive ones.)
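
To illustrate the difference between cumulative and consecutive counting, here is a minimal sketch. It is a simplified model, not pika's actual heartbeat code: the class MissedHeartbeatCounter, the function should_close and the reset_on_traffic flag are hypothetical names made up for this example, and the threshold of 2 is taken from the description above.

```python
class MissedHeartbeatCounter(object):
    """Hypothetical model of a missed-heartbeat check (not pika's real class)."""

    MAX_MISSED = 2  # threshold taken from the description above

    def __init__(self, reset_on_traffic):
        # reset_on_traffic=False models the cumulative (buggy) behaviour,
        # reset_on_traffic=True models the consecutive (intended) behaviour.
        self.reset_on_traffic = reset_on_traffic
        self.missed = 0

    def check(self, received_data):
        """Process one heartbeat interval; return True if the connection should close."""
        if received_data:
            if self.reset_on_traffic:
                self.missed = 0  # traffic seen, so forget earlier misses
        else:
            self.missed += 1
        return self.missed >= self.MAX_MISSED


def should_close(intervals, reset_on_traffic):
    """Return True if any interval in the sequence would trigger a close."""
    counter = MissedHeartbeatCounter(reset_on_traffic)
    return any(counter.check(received) for received in intervals)


# Two isolated misses, far apart, with healthy traffic in between.
history = [True, False, True, True, False, True]

closes_when_cumulative = should_close(history, reset_on_traffic=False)
closes_when_consecutive = should_close(history, reset_on_traffic=True)
```

With cumulative counting, two isolated misses days apart are enough to reach the threshold, which would match a worker dying every few days on an otherwise healthy connection. With consecutive counting the same history never closes the connection.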

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow