Question

I'm chasing a Heisenbug in a Linux x64 process. (Attaching to the process with a debugger or strace makes the problem never occur.) I've been able to put in an infinite loop that the code enters when it detects the fault, and then attach with gdb, but that only shows me that a file descriptor (fd) that should be working is no longer valid. I really want a history of the fd, hence trying strace, but of course the problem won't repro with strace attached.

Other factors indicate that the problem with gdb/strace is timing. I've tried running strace with -etrace=desc or even -eraw=open and outputting to a ramdisk to see if that would reduce the strace overhead in the right way to trigger the problem, but no success. I tried running strace+, but it is a good order of magnitude slower than strace.

The process I'm attaching to is partly a commercial binary that I don't have source access to, and partly code I preload into the process space, so printf-everywhere isn't 100% possible.

Do you have suggestions for how to trace the fd history?

Update: added note about strace+


Solution

I solved the tracing problem by:

  1. Preloading wrapper stub functions around the relevant system calls: open(), close() and poll().
  2. Logging the relevant information to a file created on a ramdisk.
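
The two steps above can be sketched roughly as follows. This is not the code I actually used, just a minimal illustration, assuming glibc on Linux and a tmpfs mounted at /dev/shm. The file name fdtrace.c, the FDTRACE_LOG environment variable, the default log path and the record format are all made up for the example. It only wraps open(), close() and poll(); a real target may create or destroy fds through other calls (openat(), open64(), socket(), dup(), pipe(), ...) that would need wrappers of their own.

```c
/* fdtrace.c: hypothetical LD_PRELOAD shim that logs open/close/poll. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

static int (*real_open)(const char *, int, ...);
static int (*real_close)(int);
static int (*real_poll)(struct pollfd *, nfds_t, int);
static int log_fd = -1;

/* Resolve the real libc functions and open the log file. Runs at load time;
 * the wrappers also call it in case something runs before the constructor. */
__attribute__((constructor)) static void fdtrace_init(void)
{
    if (real_open)
        return;
    real_close = (int (*)(int))dlsym(RTLD_NEXT, "close");
    real_poll = (int (*)(struct pollfd *, nfds_t, int))dlsym(RTLD_NEXT, "poll");
    real_open = (int (*)(const char *, int, ...))dlsym(RTLD_NEXT, "open");

    const char *path = getenv("FDTRACE_LOG");   /* assumed variable name */
    if (!path)
        path = "/dev/shm/fdtrace.log";          /* assumed ramdisk location */
    log_fd = real_open(path, O_WRONLY | O_CREAT | O_APPEND | O_CLOEXEC, 0600);
}

/* Format one record and emit it with a single write() so that records from
 * different threads do not interleave. errno is preserved for the caller. */
static void log_event(const char *fmt, ...)
{
    if (log_fd < 0)
        return;
    int saved_errno = errno;
    char buf[256];
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    int len = snprintf(buf, sizeof buf, "%lld.%09ld tid=%ld ",
                       (long long)ts.tv_sec, ts.tv_nsec,
                       (long)syscall(SYS_gettid));
    va_list ap;
    va_start(ap, fmt);
    int m = vsnprintf(buf + len, sizeof buf - (size_t)len, fmt, ap);
    va_end(ap);
    if (m > 0)
        len += m;
    if (len >= (int)sizeof buf) {               /* record was truncated */
        len = (int)sizeof buf - 1;
        buf[len - 1] = '\n';
    }
    if (write(log_fd, buf, (size_t)len) < 0) {
        /* nothing sensible to do if logging itself fails */
    }
    errno = saved_errno;
}

int open(const char *path, int flags, ...)
{
    mode_t mode = 0;
    /* The mode argument is only present for O_CREAT and O_TMPFILE. */
    if ((flags & O_CREAT) || (flags & O_TMPFILE) == O_TMPFILE) {
        va_list ap;
        va_start(ap, flags);
        mode = (mode_t)va_arg(ap, int);
        va_end(ap);
    }
    fdtrace_init();
    int fd = real_open(path, flags, mode);
    log_event("open(\"%s\", 0x%x) = %d errno=%d\n",
              path, flags, fd, fd < 0 ? errno : 0);
    return fd;
}

int close(int fd)
{
    fdtrace_init();
    int rc = real_close(fd);
    log_event("close(%d) = %d errno=%d\n", fd, rc, rc < 0 ? errno : 0);
    return rc;
}

int poll(struct pollfd *fds, nfds_t nfds, int timeout)
{
    fdtrace_init();
    int rc = real_poll(fds, nfds, timeout);
    log_event("poll(%p, %lu, %d) = %d errno=%d\n", (void *)fds,
              (unsigned long)nfds, timeout, rc, rc < 0 ? errno : 0);
    return rc;
}
```

Something like gcc -shared -fPIC -o fdtrace.so fdtrace.c -ldl should build it, and it is loaded by starting the target with LD_PRELOAD=./fdtrace.so set. Because each record is one short write() to tmpfs, the overhead should be much smaller than strace's, though it can still shift the timing enough to matter.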

(The actual issue was a race, with the kernel's poll() trying to access pollfd memory and returning EFAULT.)

Licensed under: CC-BY-SA with attribution