Race condition when using dup2

Question 1

There is an explanation in fs/file.c, do_dup2():

/*
 * We need to detect attempts to do dup2() over allocated but still
 * not finished descriptor.  NB: OpenBSD avoids that at the price of
 * extra work in their equivalent of fget() - they insert struct
 * file immediately after grabbing descriptor, mark it larval if
 * more work (e.g. actual opening) is needed and make sure that
 * fget() treats larval files as absent.  Potentially interesting,
 * but while extra work in fget() is trivial, locking implications
 * and amount of surgery on open()-related paths in VFS are not.
 * FreeBSD fails with -EBADF in the same situation, NetBSD "solution"
 * deadlocks in rather amusing ways, AFAICS.  All of that is out of
 * scope of POSIX or SUS, since neither considers shared descriptor
 * tables and this condition does not arise without those.
 */
fdt = files_fdtable(files);
tofree = fdt->fd[fd];
if (!tofree && fd_is_open(fd, fdt))
    goto Ebusy;

Looks like EBUSY is returned when the descriptor to be freed is in some kind of incomplete state when it's still being opened (fd_is_open but not present in fdtable).

EDIT (more info and do want bounty)

In order to understand how !tofree && fd_is_open(fd, fdt) can happen, let's see how files are opened. Here a simplified version of sys_open :

long do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode)
{
    /* ... irrelevant stuff */
    /* allocate the fd, uses a lock */
    fd = get_unused_fd_flags(flags);
    /* HERE the race condition can arise if another thread calls dup2 on fd */
    /* do the real VFS stuff for this fd, also uses a lock */
    fd_install(fd, f);
    /* ... irrelevant stuff again */
    return fd;
}

Basically two very important things happen: a file descriptor is allocated and only then it is actually opened by the VFS. These two operations modify the fdt of the process. They both use a lock, so nothing bad is to expect inside those two calls.

In order to memorize which fds have been allocated a bit vector called open_fds is used by the fdt. After get_unused_fd_flags(), the fd has been allocated and the corresponding bit set in open_fds. The lock on the fdt has been released, but the real VFS job hasn't been done yet.

At this precise moment, another thread (or another process in the case of shared fdt) can call dup2 which will not block because the locks have been released. If the dup2 took its normal path here, the fd would be replaced, but fd_install would be still run for the old file. Hence the check and return of Ebusy.

I found additional info on this race condition in the comments of fd_install() which confirms my explanation:

/* The VFS is full of places where we drop the files lock between
 * setting the open_fds bitmap and installing the file in the file
 * array.  At any such point, we are vulnerable to a dup2() race
 * installing a file in the array before us.  We need to detect this and
 * fput() the struct file we are about to overwrite in this case.
 *
 * It should never happen - if we allow dup2() do it, _really_ bad things
 * will follow. */

Question 2

I'm not entirely aware of the choices Linux made, but the comment from the Linux kernel in the other answer points to stuff I've worked on in OpenBSD 13 years ago, so here my attempt at remembering what the hell was going on.

Because of the way open is implemented, it first allocates a file descriptor, then it actually tries to finish the opening operation with the file descriptor table unlocked. One reason might be that we don't actually want to cause the side effects of open (the simplest would be changing atime on the file, but for example opening devices can have much more severe side effects) if it fails because we're out of file descriptors. The same applies to all other operations that allocate file descriptors, when you read the text below just substitute open with "any system call that allocates file descriptors". I don't remember if this is mandated by POSIX or just The Way Things Have Always Been Done.

open can allocate memory, go down to the file system and do a bunch of things that are potentially blocking for a long time. In the worst case for filesystems like fuse it might even go back up to userland. For that reason (and others) we don't actually want to hold the file descriptor table locked during the whole open operation. Locks inside the kernel are quite bad to hold while sleeping, doubly so if the completion of the locked operation might require interaction with userland[1].

The problem happens when someone calls open in one thread (or a process that shares the same file descriptor table), it allocates a file descriptor and hasn't finished it yet while at the same time another thread does a dup2 pointing to the same file descriptor that open just got. Since an unfinished file descriptor is still invalid (for example read and write will return EBADF when you try to use it) we can't actually close it just yet.

In OpenBSD this is solved by keeping track of allocated, but not yet open file descriptors with complex reference counting. Most operations will just pretend like the file descriptor isn't there (but it isn't allocatable either) and will just return EBADF. But for dup2 we can't pretend it isn't there, because it is. The end result is that if two threads concurrently call open and dup2, open will actually perform a full open operation on the file, but since dup2 won the race for the file descriptor, the last thing open does is to decrement the reference count on the file it just allocated and close it again. Meanwhile dup2 won the race and pretended to close the file descriptor that open got (which it actually didn't do it was actually open that did it). It doesn't really matter which behavior the kernel chooses since in both cases this is a race that will lead to unexpected behavior for either open or dup2. At best, Linux returning EBUSY is just shrinking the window for a race, but the race is still there, there's nothing preventing the dup2 call to happen just as open is returning in the other thread and replace the file descriptor before the caller of open has a chance to use it.

The error in your question will most likely happen when you hit this race. To avoid it do not dup2 to a file descriptor you don't know the state of unless you are sure that there is no one else that will be accessing the file descriptor table at the same time. And the only way to be sure is to be the only thread running (file descriptors are opened behind your back by libraries all the time) or knowing exactly what file descriptor you're overwriting. The reason dup2 over an unallocated file descriptor is allowed in the first place is that it's a common idiom to close fds 0, 1 and 2 and dup2 /dev/null into them.

On the other hand, not closing file descriptors before dup2 will lose the error return from close. I wouldn't worry about that though, since the errors from close are stupid and shouldn't be there in the first place: Handling C Read Only File Close Errors For another example of unexpected behavior of threads and how file descriptors behave strangely because of what I've been talking about here see this question: Socket descriptor not getting released on doing 'close ()' for a multi-threaded UDP client

Here's some example code to trigger this:

#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <err.h>
#include <pthread.h>

static void *
do_bad_things(void *v)
{
    int *ip = v;
    int fd;

    sleep(2);   /* pretend this is proper synchronization. */

    if ((fd = open("/dev/null", O_RDONLY)) == -1)
        err(1, "open 2");

    if (dup2(fd, *ip))
        warn("dup2");

    return NULL;
}

int
main(int argc, char **argv)
{
    pthread_t t;
    int fd;

    /* This will be our next fd. */
    if ((fd = open("/dev/null", O_RDONLY)) == -1)
        err(1, "open");
    close(fd);

    if (mkfifo("xxx", 0644))
        err(1, "mkfifo");

    if (pthread_create(&t, NULL, do_bad_things, &fd))
        err(1, "pthread_create");

    if (open("xxx", O_RDONLY) == -1)
        err(1, "open fifo");

    return 0;
}

A FIFO is the standard method to cause open to block for as long as you wish. As expected, this works silently on OpenBSD and on Linux dup2 returns EBUSY. On MacOS for some reason it kills the shell where I did "echo foo > xxx", while a normal program that just opens it for writing works fine, I have no idea why.

[1] An anecdote here. I've been involved in writing a fuse-like filesystem used for an AFS implementation. One bug we had was that we held a file object lock while calling into the userland. The locking protocol for directory entry lookups requires you to hold the directory lock, then look up the directory entry, lock the object under that directory entry and then release the directory lock. Since we held file object lock, some other process came in and tried to look up the file, which led to that process to sleep for the file lock while still holding the directory lock. Another process came in, tried to look up the directory, and ended up holding the lock of the parent directory. Long story short, we ended up with a chain of locks held until we reached the root directory. Meanwhile the filesystem daemon was still talking to the server over the network. For some reason the network operation failed and the filesystem daemon needed to log an error message. To do that it had to read some locale database. And to do that it needed to open a file using the full path. But since the root directory was locked by someone else, the daemon waited for that lock. And we had a deadlock chain 8 locks long. That's why the kernel often performs complex contortionist gymnastics to avoid holding locks during long operations, especially filesystem operations.