Question

Very strange bug, perhaps someone will see something I'm missing.

I have a C++ program which forks off a bash shell, and then passes commands to it.

Periodically, the commands will contain nonsense and the bash process will hang. I detect this using sem_timedwait, and then run a little function like this:

if (kill(*bash_pid, SIGKILL)) {
    cerr << "Error sending SIGKILL to the bash process!" << endl;
    exit(1); 
} else {
    // collect exit status
    long counter = 0;
    do {
        pid = waitpid(*bash_pid, &status, WNOHANG);
        if (pid == 0) { // status not available yet
            sleep(1);
        }
        if(counter++ > 5){
            cerr << "ERROR: Bash child process ignored SIGKILL >5 sec!" << endl;
        }
    } while (pid != *bash_pid && pid != -1);
    if(pid == -1){
        cerr << "Failed to clean up zombie bash process!" << endl;
        exit(1);
    }

    // re-initialize the bash process
    *bash_pid = init_bash();
}

Assuming I understand the workings of waitpid correctly, this should first send SIGKILL to the shell and then essentially sit in a polling loop, trying to reap the resulting process. Eventually it succeeds, and a new bash process is started with init_bash().

At least, that's what should happen. Instead, the child process's exit status is never collected, and it continues to exist as a zombie process. In spite of this, the parent does exit the loop and manages to restart the bash process, and continues with normal execution. Eventually too many zombies are generated and the system runs out of pids.

Additionally:

  • Fork is called in exactly one place in the program, inside init_bash.
  • Checks prevent init_bash from being called except once at the program's start and after a call to the function above.

Thoughts?

Solution

The articles I read indicate that a zombie process arises when a child process exits but the parent never collects the child's exit status.
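To make the zombie mechanism concrete, here is a minimal sketch, assuming a POSIX system. The helper names spawn_exiting_child and reap_exit_code are made up for illustration and are not from the question's program. Between the child's _exit and the parent's waitpid the child is a zombie; once waitpid returns its pid, the process table entry is released.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cerrno>

// Hypothetical helper: fork a child that exits immediately with `code`.
// Until the parent waits for it, the exited child remains a zombie.
pid_t spawn_exiting_child(int code) {
    pid_t pid = fork();
    if (pid == 0) {
        _exit(code);  // child: terminate without running parent's atexit handlers
    }
    return pid;  // parent: child's pid, or -1 if fork failed
}

// Hypothetical helper: block until `child` has been reaped.
// Returns the child's exit code, or -1 on error or abnormal termination.
int reap_exit_code(pid_t child) {
    int status = 0;
    pid_t r;
    do {
        r = waitpid(child, &status, 0);  // blocking wait, retried on EINTR
    } while (r == -1 && errno == EINTR);
    if (r != child) {
        return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A second waitpid on the same pid should then fail with ECHILD, which is one way to confirm that the child really was reaped.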

This article provides several ways to kill a zombie process from the command line. One technique is to use signals other than SIGKILL, for instance SIGTERM.
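Applied to the parent program rather than the command line, trying SIGTERM first and escalating to SIGKILL might look like the sketch below. It assumes a POSIX system; terminate_child, reap_with_timeout and the 100 ms poll interval are illustrative choices, not anything from the question's code.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <signal.h>
#include <unistd.h>
#include <cerrno>

// Poll for the child's exit for up to max_polls * 100 ms.
// Returns true once the child has been reaped into *status.
static bool reap_with_timeout(pid_t child, int max_polls, int* status) {
    for (int i = 0; i < max_polls; ++i) {
        pid_t r = waitpid(child, status, WNOHANG);
        if (r == child) return true;                   // reaped
        if (r == -1 && errno != EINTR) return false;   // e.g. ECHILD
        usleep(100 * 1000);                            // not exited yet
    }
    return false;
}

// Hypothetical helper: ask politely with SIGTERM, then escalate to SIGKILL.
// Returns the signal that terminated the child, 0 if it exited normally,
// or -1 on error.
int terminate_child(pid_t child, int max_polls) {
    int status = 0;
    if (kill(child, SIGTERM) == -1) return -1;
    if (!reap_with_timeout(child, max_polls, &status)) {
        if (kill(child, SIGKILL) == -1) return -1;
        pid_t r;
        do {
            r = waitpid(child, &status, 0);  // SIGKILL cannot be ignored
        } while (r == -1 && errno == EINTR);
        if (r != child) return -1;
    }
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

Whichever signal ends the child, the parent still has to call waitpid for that specific pid; changing the signal alone does not prevent a zombie.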

This article has an answer which suggests SIGKILL should not be used.

One of the techniques is to kill the parent, which also gets rid of its zombie children, since they are then adopted and reaped by init. The author indicates that some child processes appear to remain as zombies until the OS is restarted.

You do not mention the mechanism used to communicate the commands to the child process. However, one option may be to turn the child process loose by disconnecting it from its parent, similar to the way a child of a terminal process can be disconnected from the terminal session. That way the child becomes its own process and, if there is a problem, can exit without becoming a zombie.
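One common way to do this kind of detaching is the double-fork idiom, sketched below under the assumption of a POSIX system. The name spawn_detached is hypothetical, and this is only one possible approach: the parent reaps a short-lived intermediate child, while the real worker is orphaned and later reaped by init (or a subreaper) instead of the parent.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cerrno>

// Hypothetical helper: run `path` detached from the caller via a double fork.
// Returns 0 if the intermediate child was created and reaped, -1 otherwise.
int spawn_detached(const char* path, char* const argv[]) {
    pid_t mid = fork();
    if (mid == -1) return -1;
    if (mid == 0) {
        // Intermediate child: new session, fork the real worker, then exit.
        setsid();
        pid_t worker = fork();
        if (worker == -1) _exit(1);
        if (worker == 0) {
            execv(path, argv);
            _exit(127);  // only reached if exec failed
        }
        _exit(0);  // worker is now orphaned and will be adopted by init
    }
    // Parent: the intermediate child exits right away, so this wait is short.
    int status = 0;
    pid_t r;
    do {
        r = waitpid(mid, &status, 0);
    } while (r == -1 && errno == EINTR);
    if (r != mid) return -1;
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

The trade-off is that the parent can no longer waitpid for the worker or learn its exit status, so a hang would have to be handled some other way, for example through the pipe or semaphore already used to talk to the shell.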

Licensed under: CC-BY-SA with attribution