Methods/Tools for solving a Mystery Segfault while running on condor

https://stackoverflow.com/questions/3681755

02-10-2019
|

Question

I'm writing a C application which is run across a compute cluster (using condor). I've tried many methods to reveal the offending code but to no avail.

Clues:

On Average when I run the code on 15 machines for 2 days, I get two or three segfaults (signal 11).
When I run the code locally I do not get a segfault. I ran it for nearly 3 weeks on my home machine.

Attempts:

I ran the code in valGrind for four days locally with no memory errors.
I captured the segfault signal by defining my own signal handler so that I can output some of the program state.
Now when a segfault happens I can print out the current stack using backtrace.
I can print out variable values.
I created a variable which is set to the current line number.
Have also tried commenting chunks of the code out, hoping that if the problem goes away I will discover the segfault.

Sadly the line number outputted is fairly random. I'm not entirely sure what I can do with the stacktrace. Am I correct in assuming that it only records the address of the function in which the segfault occurs?

Suspicions:

I suspect that the check pointing system which condor uses to move jobs across machines is more sensitive to memory corruption and this is why I don't see it locally.
That indices are being corrupted by the bug, and that these indices are causing the segfault. This would explain the fact that the segfaults are occurring on fairly random line numbers.

UPDATE

Researching this some more I've found the following links:

LibSegFault - a library for automatically catching and printing state data about segfaults.
Stack unwinding (stack trace) with GCC tutorial on catching segfaults and get the line numbers of the offending instructions.

UPDATE 2

Greg suggested looking at the condor log and to 'correlate the segfaults to when condor restarts the executable from a checkpoint'. Looking at the logs the segfaults all occur immediately after a restart. All of the failures appear to occur when a job switches from one type of machine to another type.

UPDATE 3

The segfault was being caused by differences between hosts, by setting the 'requiremets' field in the condor submit file to problem completely disappeared.

One can set individual machines:

requirements = machine == "hostname1" || machine == "hostname2"

or an entire class of machines:

requirements = classOfMachinesName

See requirements example here

Solution

if you can, compile with debugging, and run under gdb. alternatively, get core dumped and load that into debugger.

mpich has built-in debugger, or you can buy commercial parallel debugger.

Then you can step through the code to see what happening in debugger

http://nmi.cs.wisc.edu/node/1610

http://nmi.cs.wisc.edu/node/1611

OTHER TIPS

Can you create a core dump when your segfault happens? You can then debug this dump to try to figure out the state of the code when it crashed.

Look at what instruction caused the fault. Was it even a valid instruction or are you trying to execute data? If valid, what memory is it trying to access? Where did this pointer come from. You need to narrow down the location of your fault (stack corruption, heap corruption, uninitialized pointer, accessing invalid memory). If it's a corruption, see if if there's any tell-tale data in the corrupted area (pointers to symbols, data that looks like something in your structures, ...). Your memory allocator may already have built in features to debug some corruption (see MALLOC_CHECK_ on Linux or MallocGuardEdges on Mac OS). A common case for these is using memory that has been free()'d, so logging your malloc() / free() pairs might help.

If you have used the condor_compile tool to relink your code with the condor checkpointing code, it does a few things differently than a normal link. Most importantly, it statically links your code, and uses it's own malloc. Another big difference is that condor will then run it on a foreign machine, where the environment may be different enough from what you expect to cause problems.

The executable generated by condor_compile is runnable as a standalone binary outside of the condor system. If you run the binary emitted from condor_compile locally, outside of condor, do you still see the segfaults?

If it doesn't, can you correlate the segfaults to when condor restarts the executable from a checkpoint (the user log will tell you when this happens).

You've tried most of what I'd think of. The only other thing I'd suggest is start adding a lot of logging code and hope you can narrow down where the error is happening.

The one thing you do not say is how much flexibility you have to solve the problem. Can you, for example, have the system come to a halt and just run your application? Also how important are these crashes to solve?

I am assuming that for the most part you do. This may require a lot of resources.

The short term step is to put tons of "asserts" ( semi handwritten ) of each variable to make sure it hasn't changed when you don't want it to. This can ccontinue to work as you go through the long term process.

Long term-- try running it on a cluster of two ( maybe your home computer and a VM ). Do you still see the segfaults. If not increase the cluster size until you start seeing segfaults.

Run it on a minimum configuration ( to get segfaults ) and record all your inputs till a crash. Automate running the system with the inputs that you recorded, tweaking them until you can consistent get a crash with minimal input.

At that point look around. If you still can't find the bug, then you will have to ask again with some extra data you gathered with those runs.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow