Random corruption in file created/updated from shell script on a singular client to NFS mount

Question 1

It turns out this was actually another issue altogether, apparently to do with an out-of-date page being written to disk.

A bug fix was supplied to the linux-nfs implementation: http://www.spinics.net/lists/linux-nfs/msg41357.html

Question 2

Well, the emails corresponding to the corrupt job_info files should tell you what was in MSG (which will probably be business as usual). You may want to check how NFS is being run: there's a remote possibility that you are running NFS over UDP without checksums. That could explain some corrupt data. I also hear that UDP/TCP checksums are not strong enough and the data can still end up corrupt -- maybe you are hitting such a problem (I have seen corrupt packets slipping through a network stack at least once before, and I'm quite sure some checksumming was going on). Presumably the MSG goes out as a single packet and there might be something about it that makes checksum conflicts with the garbage you see more likely. Of course it could also be an NFS bug (client or server), a server-side filesystem bug, busted piece of RAM... possibilities are almost endless here (although I see how the fact that it's always MSG that gets corrupted makes some of those quite unlikely). The problem might be related to seeking (which happens during the append). You could also have a bug somewhere else in the system, causing multiple clients to open the same job_info file, making it a jumble.

Question 3

You can also try using different file for 'time' output and then merge them together with job_info at the end of script. That may help to isolate problem further.

Shell opens 'job_info' file for writing, outputs MSG and then shall close its file descriptor before launching main job. 'time' program opens same file for append as stream and I suspect the seek over NFS is not done correctly which may cause that garbage. Can't explain why, but normally this shall not happen (and is not happening). Such rare occasions may point to some race condition somewhere, can be caused by out of sequence packet delivery (due to network latency spike) or retransmits which causes duplicate data, or a bug somewhere. At first look I would suspect some bug, but that bug may be triggered by some network behavior, e.g. unusually large delay or spike of packet loss.

File access between different processes are serialized by kernel, but for additional safeguard may be worth adding some artificial delays - sleep timers between outputs for example.

Network is not transparent, especially a large one. There can be WAN optimization devices which are known to cause application issues sometimes. CIFS and NFS are good candidates for optimization over WAN with local caching of filesystem operations. Might be worth looking for recent changes with network admins..

Another thing to try, although can be difficult due to rare occurrences is capture of interesting NFS sessions via tcpdump or wireshark. In really tough cases we do simultaneous capturing on both client and server side and then compare the protocol logic to prove that network is or is not working correctly. That's a whole topic in itself, requires thorough preparation and luck but usually a last resort of desperate troubleshooting :)