Question

Linux's "man close" warns (SVr4, 4.3BSD, POSIX.1-2001):

Not checking the return value of close() is a common but nevertheless serious programming error. It is quite possible that errors on a previous write(2) operation are first reported at the final close(). Not checking the return value when closing the file may lead to silent loss of data. This can especially be observed with NFS and with disk quota.

I can believe that this error is common (at least in applications; I'm no kernel hacker). But how serious is it, today or at any point in the past three decades? In particular:

Is there a simple, reproducible example of such silent loss of data? Even a contrived one like sending SIGKILL during close()?

If such an example exists, can the data loss be handled more gracefully than just

printf("Sorry, dude, you lost some data.\n"); ?


Solution

[H]ow serious is it, today or at any point in the past three decades?

Typical applications process data: they consume some input and produce a result. So there are two general cases where close() may return an error: when closing an input (read-only?) file, and when closing a file that was just generated or modified.

The known situations where close() returns an error are specific to writing/flushing data to permanent storage. In particular, it is common for an operating system to cache data locally before actually writing it to permanent storage (at close(), fsync(), or fdatasync()); this is very common with remote filesystems, and is the reason why NFS is mentioned in the man page.
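As an illustration of that pattern, here is a minimal C sketch (assuming an ordinary POSIX environment; the save_data() helper and its error messages are mine, not something from the original answer) showing how a deferred write error can be caught at fsync()/close() time instead of being silently lost:

    /* Minimal sketch: write a buffer to a file and report errors that may
     * only surface when cached data is flushed at fsync()/close(). */
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int save_data(const char *path, const void *buf, size_t len)
    {
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd == -1) {
            fprintf(stderr, "open %s: %s\n", path, strerror(errno));
            return -1;
        }

        const char *p = buf;
        while (len > 0) {
            ssize_t n = write(fd, p, len);
            if (n == -1) {
                if (errno == EINTR)
                    continue;        /* interrupted; retry the write */
                fprintf(stderr, "write %s: %s\n", path, strerror(errno));
                close(fd);
                return -1;
            }
            p += n;
            len -= (size_t)n;
        }

        /* Ask for cached data to reach permanent storage; deferred write
         * errors often show up here rather than at write(). */
        if (fsync(fd) == -1) {
            fprintf(stderr, "fsync %s: %s\n", path, strerror(errno));
            close(fd);
            return -1;
        }

        /* The last chance for the kernel to report a problem. */
        if (close(fd) == -1) {
            fprintf(stderr, "close %s: %s\n", path, strerror(errno));
            return -1;
        }
        return 0;
    }

On many local filesystems close() itself rarely fails and fsync() is where the deferred error appears; on NFS and similar remote filesystems it can be the other way around, so checking both costs little and covers both cases.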

I have never encountered an error while closing a read-only input file. All the cases I can think of where it might happen in real life using any of the common filesystems are ones where there is a catastrophic failure, something like kernel data structure corruption. If that happens, I think the close() error cannot be the only sign that something is terribly wrong.

When writing to a file on a remote filesystem, close()-time errors are woefully common if the local network is prone to glitches or just drops a lot of packets. As an end user, I want my applications to tell me if there was an error when writing to a file. Usually the connection to the remote filesystem is broken altogether, and the fact that writing a new file failed is the first indicator the user gets.

If you don't check the close() return value, the application will lie to the user. It will indicate (by the lack of an error message, if not otherwise) that the file was written correctly, when in fact it wasn't and the application was told so; the application just ignored the indication. If the user is like me, they'll be very unhappy with the application.

The question is, how important is user data to you? Most current application programmers don't care at all. Basile Starynkevitch (in a comment to the original question) is absolutely right; checking for close() errors is not something most programmers bother to do.

I believe that attitude is reprehensible: a cavalier disregard for user data.

It is natural, though, because the users have no clear indication as to which application corrupted their data. In my experience the end users end up blaming the OS, hardware, open source or free software in general, or the local IT support; so, there is no pressure, social or otherwise, for a programmer to care. Because only programmers are aware of details such as this, and most programmers don't care, there is no pressure to change the status quo.

(I know saying the above will make a lot of programmers hate my guts, but at least I'm being honest. The typical response I get for pointing out things such as this is that it is such a rare occurrence that it would be a waste of resources to check for it. That is likely true, but I for one am willing to spend more CPU cycles, and pay a few percent more to the programmers, if it means my machine actually works more predictably and tells me when it has lost the plot, rather than silently corrupting my data.)

Is there a simple, reproducible example of such silent loss of data?

I know of three approaches:

  1. Use a USB stick, and yank it out after the final write() but before the close(). Unfortunately, most USB sticks have hardware that is not designed to survive that, so you may end up bricking the stick. Depending on the filesystem, your kernel may also panic, because most filesystems are written with the assumption that this will never ever happen.

  2. Set up an NFS server, and simulate intermittent packet loss by using iptables to drop all packets between the NFS server and the client. The exact scenario depends on the server, the client, the mount options, and the versions used. A test bed should be relatively easy to set up using two or three virtual machines, however; a minimal client-side reproducer for this approach is sketched after this list.

  3. Use a custom filesystem to simulate a write error at close() time. Current kernels do not let you force-unmount tmpfs or loopback mounts, only NFS mounts; otherwise this would be easy to simulate by force-unmounting the filesystem after the final write but prior to the close(). (Current kernels simply deny the umount if there are open files on that filesystem.) For application testing, creating a variant of tmpfs that returns an error at close() if the file mode indicates it is desirable (for example, other-writable but not other-readable or other-executable, i.e. -??????-w-) would be quite easy and safe. It would not actually corrupt the data, but it would make it easy to check how the application behaves if the kernel reports (the risk of) data corruption at close time.
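For approach 2, a minimal client-side reproducer might look like the following C sketch. It assumes an NFS-mounted path given on the command line, and that you cut the traffic to the NFS server (for example with iptables) during the pause; depending on the mount options (hard vs. soft, timeo/retrans), close() may hang rather than return an error:

    /* Sketch of a close()-error reproducer intended for an NFS mount.
     * Run it against a file on the mount, then drop the NFS traffic
     * (e.g. with iptables) while it sleeps, and watch what close() says. */
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(int argc, char *argv[])
    {
        if (argc != 2) {
            fprintf(stderr, "Usage: %s /nfs/mount/testfile\n", argv[0]);
            return 1;
        }

        int fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd == -1) {
            fprintf(stderr, "open: %s\n", strerror(errno));
            return 1;
        }

        char buf[65536];
        memset(buf, 'x', sizeof buf);
        /* A full write loop is omitted for brevity; this is a sketch. */
        if (write(fd, buf, sizeof buf) == -1) {
            fprintf(stderr, "write: %s\n", strerror(errno));
            return 1;
        }

        /* The data is most likely only cached locally at this point. */
        fprintf(stderr, "write() succeeded; drop the NFS traffic now...\n");
        sleep(30);

        if (close(fd) == -1) {
            fprintf(stderr, "close: %s\n", strerror(errno)); /* often EIO */
            return 1;
        }
        fprintf(stderr, "close() reported no error\n");
        return 0;
    }

Compile it with something like cc -o close-repro close-repro.c. The point to observe is that write() succeeds while close() fails, which is exactly the silent-loss scenario if the return value were ignored.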

OTHER TIPS

Calling POSIX's close() may lead to errno being set to:

  1. EBADF: Bad file number
  2. EINTR: Interrupted system call
  3. EIO: I/O error (from POSIX Specification Issue 6 on)

Different errors indicate different issues:

  1. EBADF indicates a programming error, as the program should have kept track of which file/socket descriptors are still open. I'd consider testing for this error a quality management action.

  2. EINTR seems to be the most difficult to handle, as it is not clear whether the file/socket descriptor passed is still valid after the function returns or not (under Linux it probably is not: http://lkml.org/lkml/2002/7/17/165). If you observe this error, you should perhaps review the program's way of handling signals.

  3. EIO is expected to appear only under special conditions, as mentioned in the man pages. However, precisely because of this, one should track this error: if it occurs, something most likely went really wrong.

All in all, each of these errors has at least one good reason to be caught, so just do it! ;-)

Possible specific reactions:

  1. In terms of stability, ignoring an EBADF might be acceptable; however, the error should not happen at all. As stated above, fix your code, as the program does not seem to really know what it is doing.

  2. Observing an EINTR could indicate that signals are running wild. This is not nice. Definitely go for the root cause. As it is unclear whether the descriptor got closed or not, go for a system restart as soon as possible.

  3. Running into an EIO could definitely indicate a serious failure in the hardware*1 involved. However, before the strongly recommended shutdown of the system, it might be worth simply retrying the operation, although the same concern applies as for EINTR: it is uncertain whether the descriptor really got closed or not. In case it did get closed, it is a bad idea to close it again, as it might already be in use by another thread. Go for shutdown and hardware*1 replacement as soon as possible. A close() wrapper reflecting these reactions is sketched below.


*1 Hardware is to be understood in a broader sense here: an NFS server acts as a disk, so the EIO could simply be due to a misconfigured server or network, or whatever else is involved in the NFS connection.
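To condense the reactions above into code, here is a minimal sketch of a close() wrapper (my own illustration, assuming a single-threaded caller; the drastic reactions such as a system restart or hardware replacement are left to the operator, the wrapper only reports):

    /* Sketch of a close() wrapper reflecting the reactions above.
     * Per the Linux discussion linked earlier, the descriptor is treated
     * as gone after EINTR, so it is never closed a second time. */
    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int careful_close(int fd)
    {
        if (close(fd) == 0)
            return 0;

        switch (errno) {
        case EBADF:
            /* Descriptor bookkeeping bug: fd was never open or was already
             * closed.  Ignoring it may be tolerable, but the code needs fixing. */
            fprintf(stderr, "close(%d): EBADF - descriptor bookkeeping bug\n", fd);
            return -1;

        case EINTR:
            /* Under Linux the descriptor is most likely released anyway, so
             * do not retry the close; review the program's signal handling. */
            fprintf(stderr, "close(%d): EINTR - check signal handling\n", fd);
            return -1;

        case EIO:
        default:
            /* Deferred write error: data may not have reached storage. */
            fprintf(stderr, "close(%d): %s - possible data loss\n",
                    fd, strerror(errno));
            return -1;
        }
    }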

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow