Question

I run a fairly popular browser-based web game, running under Apache (worker) and mod_perl. During peak times, when the server is handling about 4200 requests per minute, once every 3-15 minutes or so an Apache process will hang.

I have established that these processes get stuck in a "FUTEX_WAIT" state, and don't appear to be doing anything: they don't consume CPU or grow larger in RAM. But it's a serious problem because they just sit there, occupying RAM.

My current solution is a cron job that culls Apache processes stuck in futex_wait_queue_me. But that's not great, because users who happen to be waiting on a response from the hung Apache processes receive errors (500: server closed connection without sending data back).
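
For illustration, a check of this kind could be sketched as follows. This is not the script actually in use: the function name, the `comm_names` values (`apache2`, `httpd`) and the `proc_root` parameter are assumptions made for the example. It only reports candidate PIDs by reading `/proc/<pid>/comm` and `/proc/<pid>/wchan`; it does not kill anything.

```python
import os

def find_stuck_pids(proc_root="/proc", comm_names=("apache2", "httpd"),
                    wchan="futex_wait_queue_me"):
    """Return PIDs whose command name matches and whose main thread sleeps in `wchan`."""
    stuck = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "sys"
        base = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(base, "comm")) as f:
                comm = f.read().strip()
            with open(os.path.join(base, "wchan")) as f:
                chan = f.read().strip()
        except OSError:
            continue  # process exited or is not readable
        if comm in comm_names and chan == wchan:
            stuck.append(int(entry))
    return sorted(stuck)

# Example: print(find_stuck_pids())
```

A real cull would probably want to see the same PID in this state on two consecutive runs before killing it, since a single sample may catch a process that is only waiting briefly.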

I have been unable to reproduce the problem on my development machine, and can't figure out how to proceed with troubleshooting. I would love to know: How can I diagnose this further?

Edit: I have found that the problem occurs following a burst in traffic, when Apache spawns some more worker processes, then tries to cull them afterward. This is how that looks when it works normally, from the child's point of view:

$ sudo strace -p 21764
Process 21764 attached - interrupt to quit
read(5, "!", 1)                         = 1
tgkill(21764, 21791, SIGHUP)            = 0
tgkill(21764, 21791, SIG_0)             = 0
select(0, NULL, NULL, NULL, {0, 500000}) = ? ERESTARTNOHAND (To be restarted)
--- SIGTERM (Terminated) @ 0 (0) ---
rt_sigreturn(0xf)                       = -1 EINTR (Interrupted system call)
munmap(0x7f9905750000, 8392704)         = 0
munmap(0x7f98f8736000, 8392704)         = 0
[...]
madvise(0x7f98e4021000, 73728, MADV_DONTNEED) = 0
exit_group(0)                           = ?
Process 21764 detached

... but occasionally it goes like this:

$ sudo strace -p 24133
Process 24133 attached - interrupt to quit
read(5, "!", 1)                         = 1
tgkill(24133, 24164, SIGHUP)            = 0
tgkill(24133, 24164, SIG_0)             = 0
--- SIGTERM (Terminated) @ 0 (0) ---
rt_sigreturn(0xf)                       = 0
select(0, NULL, NULL, NULL, {0, 500000}) = 0 (Timeout)
tgkill(24133, 24140, SIGUSR1)           = 0
futex(0x7f9904f4e9d0, FUTEX_WAIT, 24140, NULL

... and proceeds no further.
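
The value passed to that last futex call, 24140, is the ID of the thread that was just sent SIGUSR1, which is consistent with the main thread waiting for that thread to exit. One way to see what each thread of the hung process is doing is to walk `/proc/<pid>/task`. The sketch below is only an illustration; `thread_report` and its `proc_root` parameter are made-up names.

```python
import os

def thread_report(pid, proc_root="/proc"):
    """Return (tid, state, wchan) for every thread of `pid`, sorted by tid."""
    task_dir = os.path.join(proc_root, str(pid), "task")
    report = []
    for tid in os.listdir(task_dir):
        state, chan = "?", "?"
        try:
            # The "State:" line looks like "State:\tS (sleeping)".
            with open(os.path.join(task_dir, tid, "status")) as f:
                for line in f:
                    if line.startswith("State:"):
                        state = line.split(":", 1)[1].strip()
                        break
            # wchan names the kernel function the thread is sleeping in.
            with open(os.path.join(task_dir, tid, "wchan")) as f:
                chan = f.read().strip() or "-"
        except OSError:
            pass  # thread exited while we were reading
        report.append((int(tid), state, chan))
    return sorted(report)

# Example: for row in thread_report(24133): print(row)
```

This shows only the kernel side. For the user-space stack of the thread that refuses to exit, attaching gdb to the process and running `thread apply all bt` is the usual next step.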

I don't know how to debug this any further.


Solution

This was due to a bug in mod_perl, since fixed, documented here:

http://www.gossamer-threads.com/lists/modperl/dev/104026

OTHER TIPS

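Pick the lowest-traffic time and start Apache under strace on the live machine, so you can track down the cause of the error. For one blogger, the solution boiled down to replacing /dev/random with a device node that has the same major/minor numbers as /dev/urandom, so that reads from it never block: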

rm /dev/random 
mknod -m 644 /dev/random c 1 9 
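
Before trying that, it may be worth checking which device /dev/random currently is. On Linux the blocking random device is character device 1,8 and urandom is 1,9, so the mknod line above makes /dev/random behave like urandom. A small sketch of such a check, with `classify_device` and `inspect_device` as made-up helper names:

```python
import os
import stat

# Linux character-device numbers for the kernel random devices.
KNOWN = {(1, 8): "random", (1, 9): "urandom"}

def classify_device(major, minor):
    """Map a (major, minor) pair to 'random', 'urandom' or 'other'."""
    return KNOWN.get((major, minor), "other")

def inspect_device(path="/dev/random"):
    """Report which kernel device `path` refers to."""
    st = os.stat(path)
    if not stat.S_ISCHR(st.st_mode):
        return "not a character device"
    return classify_device(os.major(st.st_rdev), os.minor(st.st_rdev))

# Example: print(inspect_device("/dev/random"))
```

If this already reports urandom, the replacement has been done and blocking reads from /dev/random are unlikely to be the cause.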

You can avoid the "500: server closed connection without sending data back" errors by using a reverse-proxy setup: when the front-end Apache detects a timeout without data, it forwards the client's request to a different mod_perl child.

That way, instead of the client getting a 500, the request just takes an extra 5 seconds (don't ask me for a how-to; see the mod_perl/Apache guide :)

Licensed under: CC-BY-SA with attribution