Distributed software debug with gdb

Question 1

You can try to do it like this:

First start nodes with gdbserver on remote hosts. It is even possible to start it without a program to debug, if you start it with --multi flag. When server is in multi mode, you can control it from your local session, I mean that you can make it start a program you want to debug. Then, start multiple inferiors in your gdb session

gdb> add-inferior -copies <number of servers>

switch them to a remote target and connect them to remote servers

gdb> inferior 1
gdb> target extended-remote host:port // use extended to switch gdbserver to multi mode
// start a program if gdbserver was started in multi mode
gdb> inferior 2
...

Now you have them all attached to one gdb session. The problem is that, AFAIK, it is not much better than to start multiple gdb's from different console tabs. On the other hand you can write some scripts or auto tests this way. See the gdb tutorial: server and inferiors.

Question 2

I don't believe there is one, simple, answer to debugging "many remote applications". Yes, you can attach to a process on another machine, and step through it in GDB. But it's quite awkward to debug a large number of interdependent processes, especially when the problem is complicated.

I believe a good set of logging capabilities in the code, supplemented with additional logs for specific debugging as needed, is more likely to give you a good/fast result.

Another option might be to run the processes on one machine, rather than on multiple machines. Perhaps even use threads within one process, to simulate the behaviour of multiple machines, simplifying the debugging process. Of course, this doesn't prevent bugs that appear ONLY when you run 20 processes on 20 different machines. But the basic idea is to reduce the number of those bugs to a minimum, and debug most things in a "simpler environment".

Aggressive use of defensive programming paradigms, such as liberal use of assert is clearly a good idea (perhaps with a macro to turn it off for the production runs, but make sure that you don't just leave error paths completely unchecked - it is MUCH harder to detect that the reason something crashes is that a memory allocation failed than to track down where that NULL pointer came from some 20 function calls away from a failed allocation.