Having hard time tracking memory corruption - when running with Valgrind runs correctly with no errors

Question 1

With > 1 workers, I get: a. usually heap-corruption error, b.double-free. When running with valgrind -v with > 1 workers the program completes successfully

Based on these symptoms, your program almost certainly has a synchronization problem. It looks as though heap memory is being shared between threads, so whenever a data race occurs, the heap gets corrupted or the same block is freed twice.

You also mention that the program completes successfully when run under valgrind -v. Running under Valgrind changes the timing, so this indicates that the problem depends on the order in which the threads execute. Bugs like this are among the hardest to find. Remember too that dynamic tools give no warning until the program actually executes something wrong: the bug may be present in the program, but whether a tool catches it depends on the sequence of execution in that particular run.

Having said that, I don't think there is a shortcut for finding such bugs in a big program. However, I strongly suspect that a data race is leading to the memory corruption and double free, so you may want to use Helgrind to look for data races and other threading problems.

Question 2

Now that I don't get any errors from valgrind to start with, what can I do to find the memory corruption problem in this complex and big application?

Well let me describe to you what I did to find memory leaks in Microsoft's implementation of JavaScript back in the 1990s.

First I ensured that in the debug version of my program, as many memory allocations as possible were being routed to the same helper methods. That is, I redefined malloc, new, etc, to all be synonyms for an allocator that I wrote myself.

That allocator was just a thin shell around an operating system virtual heap memory allocator, but it had some extra smarts. It allocated extra memory at the beginning and end of the block and filled that with sentinel values, a threadsafe count of the number of allocations so far, and a threadsafe doubly-linked list of all allocations. The "free" routine would verify that the sentinel values on both sides were still intact; if not, then there's a memory corruption somewhere. It would unlink the block from the linked list and free it.

At any point I could ask the memory manager for a list of all the outstanding blocks in memory in the order they had been allocated. Any items left in the list when the DLL was unloaded were memory leaks.

Those tools enabled me to find memory leaks and memory corruptions in real time very easily.

Question 3

Use a core dump. This is mostly useful for double-free and "glibc detected" type errors.

Compile your program with gcc's -g option so that it includes debug information.

ulimit -a

This shows the current resource limits, including the maximum core file size.

ulimit -c unlimited

This removes the limit on the core file size.

Now run your program. When it crashes, a file named "core" will be generated in your current directory (on some systems the name or location is configured differently).

Then analyze it with GDB as follows:

gdb ./yourprogram core

(gdb) bt

The backtrace shows where the program crashed.

If you run into any difficulty, let me know.