Question

I have a Python daemon running in production. It uses between 7 and 120 threads. Recently, the smallest instance (7 threads) started to hang, while none of the other instances has ever shown this problem. Attaching strace to the Python process shows that all threads are blocked in futex(FUTEX_WAIT_PRIVATE), so they are probably waiting to acquire a lock.

How would you debug such a problem?

Note that this is a production system running from flash memory, so disk writes are constrained, too.
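One low-impact way to see which Python line each thread is actually blocked on, without writing anything to flash, is to dump every thread's stack to stderr on demand. The sketch below assumes CPython on a POSIX system; format_all_thread_stacks and install_stack_dump_handler are hypothetical helper names, and sys._current_frames is a CPython-specific, underscore-prefixed function. A Python-level signal handler only runs when the main thread gets to execute Python code, so it will not fire while another thread holds the GIL (which turned out to be the case here). On Python 3.3+, faulthandler.register(signal.SIGUSR1, all_threads=True) dumps the tracebacks from a C-level handler instead and does not have that limitation.

```python
import signal
import sys
import threading
import traceback


def format_all_thread_stacks():
    """Return the current Python stack of every thread as a single string."""
    # Map thread identifiers to thread names so the dump is readable.
    names = {t.ident: t.name for t in threading.enumerate()}
    chunks = []
    # sys._current_frames() (CPython only) maps thread id -> topmost frame.
    for thread_id, frame in sys._current_frames().items():
        name = names.get(thread_id, "<unknown>")
        chunks.append("--- Thread %s (%d) ---\n" % (name, thread_id))
        chunks.extend(traceback.format_stack(frame))
    return "".join(chunks)


def install_stack_dump_handler(signum=None):
    """Print all thread stacks to stderr whenever the process gets signum.

    Defaults to SIGUSR1 (POSIX only). Must be called from the main thread.
    Nothing is written to disk.
    """
    if signum is None:
        signum = signal.SIGUSR1

    def handler(received_signum, frame):
        sys.stderr.write(format_all_thread_stacks())
        sys.stderr.flush()

    signal.signal(signum, handler)
```

Call install_stack_dump_handler() once at daemon startup; when an instance hangs, run kill -USR1 <pid> and read the dump from wherever the daemon's stderr goes. A thread stuck in an import, a lock acquire or a join shows up with that line at the bottom of its stack.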


Solution

The observation was slightly incorrect. One thread was not calling futex; it was swapping while holding the GIL, so every other thread was stuck waiting for the GIL (which also shows up as FUTEX_WAIT_PRIVATE in strace). Because the machine in question is low-end hardware, the swapping took so long that it looked like a deadlock. The underlying problem is a memory leak. :-(
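To narrow down a leak like this, one option is to compare two tracemalloc snapshots taken around a unit of work and look at which source lines grew the most. This is a minimal sketch assuming Python 3.4+ (tracemalloc does not exist in older versions); top_allocation_growth is a hypothetical helper, and workload stands for whatever hypothetical callable runs one iteration of the daemon's main loop. Tracing adds memory and CPU overhead, which matters on a machine that is already short on memory, so it is best run briefly or on a test box. The results are returned in memory rather than written to disk.

```python
import tracemalloc


def top_allocation_growth(workload, limit=5):
    """Run workload() under tracemalloc and return the largest allocation diffs.

    Each item is a tracemalloc.StatisticDiff grouped by source line.
    """
    tracemalloc.start()
    try:
        before = tracemalloc.take_snapshot()
        workload()  # hypothetical: one iteration of the daemon's work
        after = tracemalloc.take_snapshot()
    finally:
        tracemalloc.stop()
    # compare_to() sorts by the absolute size difference, biggest first.
    return after.compare_to(before, "lineno")[:limit]
```

For example, for stat in top_allocation_growth(handle_one_request): print(stat) (handle_one_request being hypothetical) prints each line's size and count together with their growth. A line whose size keeps growing across repeated runs is a good leak candidate.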

OTHER TIPS

Dear Helmut, I have the same problem with one thread hanging in FUTEX_WAIT_PRIVATE.

It seems you have solved the issue. Can you share more information about the solution?

Update:

I finally found the cause of the lock (at least in my case): Python's import lock.

Consider the following situation:

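file1.py:

import file2

file2.py:

import threading

def Go():
    import some_module  # blocks: the main thread still holds the import lock
    ...

# Runs at import time, i.e. while "import file2" in file1.py holds the lock.
thread2 = threading.Thread(target=Go, name="thread2")
thread2.start()
thread2.join()  # waits for Go(), which waits for the import lock: deadlock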

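Here the import in Go() hangs: the main thread holds the import lock (taken by "import file2" in file1.py) and will not release it until Go() finishes, while Go() cannot finish until it gets the lock. In strace, the blocked threads show up as hanging in FUTEX_WAIT_PRIVATE. This describes Python 2, which has a single global import lock; since Python 3.3 import locks are per module, so the same pattern only hangs when the thread imports a module that is still being imported, for example file2 itself.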

To work around this, move the code that runs during the import of file2 into a Do() function, and call it after the import has finished:

import file2

file2.Do()

Licensed under: CC-BY-SA with attribution