Question

I'm having a problem with child processes hanging onto a socket after exec(). This process 1) reads UDP packets and 2) kills/starts other processes; it monitors those other processes via the UDP packets they send.

This runs on Windows, Linux, and AIX. I have not experienced any issues on AIX, only on Linux. (The Windows code is significantly different, so I won't go into details about that.)

I am setting the FD_CLOEXEC flag on the returned descriptor, via fcntl(), immediately after creating it. This must run on Red Hat EL 4-6, so using O_CLOEXEC at creation is not an option (the kernels in RHEL 4/5 do not support it).

For maintenance, the monitoring process may need to be restarted, and when I attempt to restart it, I occasionally find that one of the child processes is still bound to the socket, which prevents the restarted monitoring process from binding to it. [Normally this wouldn't be an issue (the user would see that the restart failed and take appropriate action), however the monitor itself is monitored via a different mechanism (to avoid a SPOF), and an automated restart of the monitoring process may fail if one of its child processes is holding onto the socket. This can lead to more Bad Things happening downstream.]

I have gone so far as to add code between the fork() and exec() calls to explicitly close the socket (with an associated shutdown) in the child process, and to synchronize fork() and read() via a pthread_mutex so that I am not reading from the socket when a fork occurs.
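Roughly, the fork/exec path now looks like this (a minimal sketch with illustrative names: sock_mutex, spawn_child, and close_socket_in_child are stand-ins, not my real identifiers, and close_socket_in_child wraps the shutdown/close retry logic shown further down):

#include <pthread.h>
#include <sys/types.h>
#include <unistd.h>

/* Sketch only: names are illustrative and error handling is trimmed. */
extern pthread_mutex_t sock_mutex;            /* guards reads on the UDP socket */
extern int s;                                 /* the monitoring UDP socket */
extern void close_socket_in_child( int fd );  /* shutdown/close retries, see below */

static pid_t spawn_child( char *const argv[] )
{
    pid_t pid;

    pthread_mutex_lock( &sock_mutex );   /* no read() in flight while forking */
    pid = fork();
    if ( pid == 0 ) {
        /* child: FD_CLOEXEC should already cover this, but close explicitly anyway */
        close_socket_in_child( s );
        execvp( argv[0], argv );
        _exit( 127 );                    /* exec failed */
    }
    pthread_mutex_unlock( &sock_mutex ); /* parent resumes reading */
    return pid;
}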

The socket is created with

s = socket( AF_INET, SOCK_DGRAM, IPPROTO_UDP )

and no other options. Immediately after creation, I call fcntl() to set FD_CLOEXEC. The process is still single-threaded at this point, so there is no race condition (in theory) before the flag is set.
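Concretely, the creation and flag-setting amount to something like this (a sketch with a made-up wrapper name; the real code has more error handling):

#include <fcntl.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Create the monitoring UDP socket and mark it close-on-exec right away.
 * SOCK_CLOEXEC / O_CLOEXEC are not available on the older RHEL kernels,
 * hence the separate fcntl() call. */
static int make_monitor_socket( void )
{
    int s;
    int flags;

    s = socket( AF_INET, SOCK_DGRAM, IPPROTO_UDP );
    if ( s == -1 )
        return -1;

    flags = fcntl( s, F_GETFD, 0 );
    if ( flags == -1 || fcntl( s, F_SETFD, flags | FD_CLOEXEC ) == -1 ) {
        close( s );
        return -1;
    }
    return s;
}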

The bind is done next, while still single-threaded. It binds to the first IPv4 address matching "localhost" as returned by getaddrinfo (probably unnecessary, but it uses an underlying utility function to simplify the call to bind).
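That utility function does roughly the following (a sketch; bind_localhost and its signature are made up for illustration):

#include <netdb.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Bind s to the first IPv4/UDP address getaddrinfo() returns for "localhost"
 * on the given port; illustrative stand-in for the real helper. */
static int bind_localhost( int s, const char *port )
{
    struct addrinfo hints, *res;
    int rc;

    memset( &hints, 0, sizeof( hints ) );
    hints.ai_family   = AF_INET;       /* IPv4 only */
    hints.ai_socktype = SOCK_DGRAM;
    hints.ai_protocol = IPPROTO_UDP;

    rc = getaddrinfo( "localhost", port, &hints, &res );
    if ( rc != 0 || res == NULL )
        return -1;

    rc = bind( s, res->ai_addr, res->ai_addrlen );
    freeaddrinfo( res );
    return rc;
}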

The close logic in the child process after the fork (none of which should be necessary because of the FD_CLOEXEC) is:

char retryClose = 1;
int eno = 0;
int retries = 20;

if ( shutdown( s, SHUT_RDWR ) ) {
    /* Failed to shutdown. Wait and try again */
    my_sleep( 3000 ); /* sleep using select(0,NULL,NULL,NULL, timeval) */
    shutdown( s, SHUT_RDWR );
    /* not much else can be done... */
}
while ( retryClose && ( close( s ) == -1 ) )
{
    /* save error number */
    eno = errno;
    /* check specific error */
    switch ( eno ) {
        case EIO:
            /* terminate loop if retries have expired; otherwise sleep for a while and try again */
            if ( --retries <= 0 ) {
                retryClose = 0;
            }
            else {
                my_sleep( 50 );
            }
            break;
        case EINTR:
            /* interrupted: loop around and retry the close */
            break;
        case EBADF:
        default:
            /* descriptor already gone, or an unexpected error: stop retrying */
            retryClose = 0;
            break;
    } /* switch ( eno ) */
}
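(The my_sleep() referenced above is just a millisecond sleep built on select(), roughly:)

#include <sys/select.h>

/* Millisecond sleep via select(); a sketch of the helper, not the exact code. */
static void my_sleep( long msec )
{
    struct timeval tv;

    tv.tv_sec  = msec / 1000;
    tv.tv_usec = ( msec % 1000 ) * 1000;
    select( 0, NULL, NULL, NULL, &tv );
}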

So, I'm setting the FD_CLOEXEC flag, and explicitly closing the fd prior to the exec() call.

Am I missing anything? Is there anything I can do to ensure that the child process really doesn't hang onto the socket?


OTHER TIPS

Turns out, it wasn't the fork/exec that was causing the problem.

The server process could be restarted several times after starting all of the child processes without any problems, but occasionally, when the server died, one of the child processes would actually grab the server socket.

Switching from using connect()/send() in the client to just sendto() seems to have resolved the problem.
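In other words, the child now leaves its reporting socket unconnected and addresses every packet explicitly, along these lines (illustrative names only; report_status and the sockaddr handling are not the actual client code):

#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Before: connect() once, then send() on the connected UDP socket.
 * After:  leave the socket unconnected and pass the server address
 *         to sendto() on every packet. */
static ssize_t report_status( int cs, const struct sockaddr_in *server,
                              const void *buf, size_t len )
{
    /* was: connect( cs, (const struct sockaddr *)server, sizeof( *server ) );
     *      return send( cs, buf, len, 0 ); */
    return sendto( cs, buf, len, 0,
                   (const struct sockaddr *)server, sizeof( *server ) );
}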

Licensed under: CC-BY-SA with attribution