Question

On a Linux cluster I run many (N > 10^6) independent computations. Each computation takes only a few minutes and the output is a handful of lines. When N was small I was able to store each result in a separate file to be parsed later. With large N however, I find that I am wasting storage space (for the file creation) and simple commands like ls require extra care due to internal limits of bash: -bash: /bin/ls: Argument list too long.

Each computation is required to run through a qsub scheduling algorithm so I am unable to create a master program which simply aggregates the output data to a single file. The simple solution of appending to a single fails when two programs finish at the same time and interleave their output. I have no admin access to the cluster, so installing a system-wide database is not an option.

How can I collate the output data from embarrassingly parallel computation before it gets unmanageable?

Was it helpful?

Solution

1) As you say, it's not ls which is failing; it's the shell which does glob expansion before starting up ls. You can fix that problem easily enough by using something like

find . -type f -name 'GLOB' | xargs UTILITY

eg.:

find . -type f -name '*.dat' | xargs ls -l

You might want to sort the output, since find (for efficiency) doesn't sort the filenames (usually). There are many other options to find (like setting directory recursion depth, filtering in more complicated ways, etc.) and to xargs (maximum number of arguments for each invocation, parallel execution, etc.). Read the man pages for details.

2) I don't know how you are creating the individual files, so it's a bit hard to provide specific solutions, but here are a couple of ideas:

  1. If you get to create the files yourself, and you can delay the file creation until the end of the job (say, by buffering output), and the files are stored on a filesystem which supports advisory locking or some other locking mechanism like atomic linking, then you can multiplex various jobs into a single file by locking it before spewing the output, and then unlocking. But that's a lot of requirements. In a cluster you might well be able to do that with a single file for all the jobs running on a single host, but then again you might not.

  2. Again, if you get to create the files yourself, you can atomically write each line to a shared file. (Even NFS supports atomic writes but it doesn't support atomic append, see below.) You'd need to prepend a unique job identifier to each line so that you can demultiplex it. However, this won't work if you're using some automatic mechanism such as "my job writes to stdout and then the scheduling framework copies it to a file", which is sadly common. (In essence, this suggestion is pretty similar to the MapReduce strategy. Maybe that's available to you?)

  3. Failing everything else, maybe you can just use sub-directories. A few thousand directories of a thousand files each is a lot more manageable than a single directory with a few million files.

Good luck.

Edit As requested, some more details on 2.2:

You need to use Posix I/O functions for this, because, afaik, the C library does not provide atomic write. In Posix, the write function always writes atomically, provided that you specify O_APPEND when you open the file. (Actually, it writes atomically in any case, but if you don't specify O_APPEND then each process retains it's own position into the file, so they will end up overwriting each other.)

So what you need to do is:

  1. At the beginning of the program, open a file with options O_WRONLY|O_CREATE|O_APPEND. (Contrary to what I said earlier, this is not guaranteed to work on NFS, because NFS may not handle O_APPEND properly. Newer versions of NFS could theoretically handle append-only files, but they probably don't. Some thoughts about this a bit later.) You probably don't want to always use the same file, so put a random number somewhere into its name so that your various jobs have a variety of alternatives. O_CREAT is always atomic, afaik, even with crappy NFS implementations.

  2. For each output line, sprintf the line to an internal buffer, putting a unique id at the beginning. (Your job must have some sort of unique id; just use that.) [If you're paranoid, start the line with some kind of record separator, followed by the number of bytes in the remaining line -- you'll have to put this value in after formatting -- so the line will look something like ^0274:xx3A7B29992A04:<274 bytes>\n, where ^ is hex 01 or some such.]

  3. write the entire line to the file. Check the return code and the number of bytes written. If the write fails, try again. If the write was short, hopefully you followed the "if you're paranoid" instructions above, also just try again.

Really, you shouldn't get short writes, but you never know. Writing the length is pretty simple; demultiplexing is a bit more complicated, but you could cross that bridge when you need to :)

The problem with using NFS is a bit more annoying. As with 2.1, the simplest solution is to try to write the file locally, or use some cluster filesystem which properly supports append. (NFSv4 allows you to ask for only "append" permissions and not "write" permissions, which would cause the server to reject the write if some other process already managed to write to the offset you were about to use. In that case, you'd need to seek to the end of the file and try the write again, until eventually it succeeds. However, I have the impression that this feature is not actually implemented. I could be wrong.)

If the filesystem doesn't support append, you'll have another option: decide on a line length, and always write that number of bytes. (Obviously, it's easier if the selected fixed line length is longer than the longest possible line, but it's possible to write multiple fixed-length lines as long as they have a sequence number.) You'll need to guarantee that each job writes at different offsets, which you can do by dividing the job's job number into a file number and an interleave number, and write all the lines for a particular job at its interleave modulo the number of interleaves, into a file whose name includes the file number. (This is easiest if the jobs are numbered sequentially.) It's OK to write beyond the end of the file, since unix filesystems will -- or at least, should -- either insert NULs or create discontiguous files (which waste less space, but depend on the blocksize of the file).

Another way to handle filesystems which don't support append but do support advisory byte-range locking (NFSv4 supports this) is to use the fixed-line-length idea, as above, but obtaining a lock on the range about to be written before writing it. Use a non-blocking lock, and if the lock cannot be obtained, try again at the next line-offset multiple. If the lock can be obtained, read the file at that offset to verify that it doesn't have data before writing it; then release the lock.

Hope that helps.

OTHER TIPS

If you are only concerned by space:

parallel --header : --tag computation {foo} {bar} {baz} ::: foo 1 2 ::: bar I II ::: baz . .. | pbzip2 > out.bz2

or shorter:

parallel --tag computation ::: 1 2 ::: I II ::: . .. | pbzip2 > out.bz2

GNU Parallel ensures output is not mixed.

If you are concerned with finding a subset of the results, then look at --results.

Watch the intro videos to learn more: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1

Another possibility would be to use N files, with N greater or equal to the number of nodes in the cluster, and assign the files to your computations in a round-robin fashion. This should avoid concurrent writes to any of the files, provided you have a reasonnable guarantee on the order of execution of your computations.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top