Question

Guys, could you please recommend a tool for spotting memory corruption on a production multithreaded server built with C++ and running under Linux x86_64? I'm currently facing the following problem: every few hours my server crashes with a segfault, and the core dump shows that the error happens inside malloc/calloc, which is definitely a sign of memory being corrupted somewhere.

Actually I have already tried some tools without much luck. Here is my experience so far:

  • Valgrind is a great (I'd even say the best) tool, but it slows the server down far too much, making it unusable in production. I tried it on a staging server and it really helped me find some memory-related issues, but even after fixing them I still get crashes on the production server. I ran my staging server under Valgrind for several hours and still couldn't spot any serious errors.

  • ElectricFence is said to be a real memory hog, but I couldn't even get it working properly. It segfaults almost immediately on the staging server, in random, weird places where Valgrind didn't show any issues at all. Maybe ElectricFence doesn't handle threading well? I have no idea.

  • DUMA - same story as ElectricFence, only worse. While EF at least produced core dumps with readable backtraces, DUMA shows me only "?????" (and yes, the server is definitely built with the -g flag).

  • dmalloc - I configured the server to use it instead of the standard malloc routines; however, it hangs after a few minutes. Attaching gdb to the process reveals that it's stuck somewhere inside dmalloc :(

I'm slowly going crazy and simply don't know what to try next. Still on my list are mtrace and mpatrol, but maybe someone has a better idea?

I'd greatly appreciate any help on this issue.

Update: I managed to find the source of the bug. However, I found it on the staging server, not the production one, using Helgrind/DRD/TSan - there was a data race between several threads which resulted in memory corruption. The key was to use proper Valgrind suppressions, since these tools showed too many false positives. I still don't really know how this could have been discovered on the production server without a significant slowdown...


Solution 3

Folks, I managed to find the source of the bug. However, I found it on the staging server, using Helgrind/DRD/TSan - there was a data race between several threads which resulted in memory corruption. The key was to use proper Valgrind suppressions, since these tools showed too many false positives. I still don't really know how this could have been discovered on the production server without a significant slowdown...
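For anyone hitting the same symptom (crashes deep inside malloc/free), here is a minimal sketch of the kind of race that can corrupt the heap. This is not my actual server code, just an illustration of the pattern: two threads grow the same std::vector without a lock, so its internal reallocations (free + malloc of the backing storage) race with each other.

```cpp
#include <thread>
#include <vector>

std::vector<int> shared_log;          // shared between threads, no mutex

void worker()
{
    for (int i = 0; i < 100000; ++i)
        shared_log.push_back(i);      // data race: concurrent reallocation
}

int main()
{
    std::thread a(worker);
    std::thread b(worker);
    a.join();
    b.join();
}
```

A race like this is exactly what the race detectors flag on a staging box: `valgrind --tool=helgrind ./race`, `valgrind --tool=drd ./race`, or rebuild with `g++ -g -O1 -fsanitize=thread race.cpp` and run it normally.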

OTHER TIPS

Yes, C/C++ memory corruption problems are tough. I have also used Valgrind several times; sometimes it revealed the problem and sometimes it didn't.

While examining Valgrind output, don't be too quick to dismiss its results. Sometimes, after considerable time spent elsewhere, you'll realize that Valgrind gave you the clue in the first place and you ignored it.

Another piece of advice is to compare the code changes against the previous known-stable release, which is easy if you use some sort of version control system (e.g. svn). Examine all the memory-related calls (e.g. memcpy, memset, sprintf, new, delete/delete[]).
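To illustrate what to look for, here is a hypothetical example of that kind of bug (none of these names come from the asker's code): a formatted string longer than the buffer it is printed into, so sprintf tramples the neighbouring heap chunk, and the crash only shows up later, inside malloc/free, far from the real culprit.

```cpp
#include <cstdio>
#include <cstdlib>

int main()
{
    char* name  = static_cast<char*>(std::malloc(16));
    char* other = static_cast<char*>(std::malloc(32));

    // 48 bytes (including the terminating NUL) written into a 16-byte block:
    // the overrun spills into the allocator's bookkeeping for the next chunk.
    std::sprintf(name, "%s", "this string is much longer than sixteen bytes!!");

    std::free(other);   // glibc often aborts here or later, not at the sprintf
    std::free(name);
}
```

Using snprintf with the buffer size, or sizing the buffer from strlen()+1, avoids this whole class of bug.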

Compile your program with gcc 4.1 and the -fstack-protector-all switch. If the memory corruption is caused by stack smashing, this should be able to detect it. You might need to play with some of the additional SSP parameters.
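As a quick illustration (toy code, not from the question), a stack overrun like the one below is turned by -fstack-protector-all into an immediate "stack smashing detected" abort at the end of the offending function, instead of a mysterious crash somewhere else:

```cpp
#include <cstring>

void copy_request(const char* src)
{
    char buf[16];
    std::strcpy(buf, src);   // overruns buf whenever src is longer than 15 chars
}

int main()
{
    copy_request("this request string is definitely longer than sixteen bytes");
}
```

Built with `g++ -g -fstack-protector-all smash.cpp`, the canary check fires when copy_request returns, pointing straight at the guilty function.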

Have you tried -fmudflap? (See the GCC documentation for the available options.)

You can try IBM Purify, but I'm afraid it is not open source.

Google Perftools, which is open source, may be of help; see the heap checker documentation.
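If you go that route, the usual setup (as I understand it - double-check the gperftools documentation for your version) is to link against tcmalloc and enable the checker through an environment variable. Note that the heap checker targets leaks rather than corruption, so it may or may not help with this particular crash:

```cpp
#include <cstdlib>

int main()
{
    void* leaked = std::malloc(1024);   // deliberately never freed
    (void)leaked;
    return 0;                           // the checker reports the leak at clean exit
}
```

Assumed usage: `g++ -g leak.cpp -o leak -ltcmalloc`, then `HEAPCHECK=normal ./leak`.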

Try this one: http://www.hexco.de/rmdebug/ - I used it extensively. It has a low impact on performance (it mostly increases RAM usage), but the allocation algorithm stays the same. It has always proven enough to find any allocation bug: the program crashes as soon as the bug occurs, and it leaves a detailed log.

I'm not sure if it would have caught your particular bug, but the MALLOC_CHECK_ environment variable (malloc man page) turns on additional checking in the default Linux malloc implementation, and typically doesn't have a significant runtime cost.
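As a rough sketch (toy code, not the asker's server): with MALLOC_CHECK_ set, glibc adds extra bookkeeping around each allocation, so even a one-byte overrun that the normal allocator would usually tolerate silently gets reported at the free() call:

```cpp
#include <cstdlib>
#include <cstring>

int main()
{
    char* p = static_cast<char*>(std::malloc(16));
    std::memset(p, 'A', 17);   // one-byte heap overrun
    std::free(p);              // with MALLOC_CHECK_ set, glibc complains here
}
```

Run the unmodified binary with e.g. `MALLOC_CHECK_=3 ./server` (3 means print a diagnostic and abort), which makes it one of the few options cheap enough to leave enabled in production.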
