Question

I followed this blog entry to parallelize sort by splitting a large file, sorting and merging. The steps are:

  1. split -l5000000 data.tsv '_tmp'
  2. ls -1 _tmp* | while read FILE; do sort $FILE -o $FILE & done
  3. sort -m _tmp* -o data.tsv.sorted

Between step 2 and 3, one must wait until the sorting step has finished. I assumed that wait without any arguments would be the right thing, since according to the man page, if wait is called without arguments all currently active child processes are waited for.

However, when I try this in the shell (i.e. executing steps 1 and 2, and then wait), wait returns immediately, although top shows the sort processes are still running.

Ultimately I want to use this to speed up a script, so it's not a one-time thing I could do manually in the shell.

I know sort has had a --parallel option since version 8; however, the cluster I am running this on has an older version installed, and I am also curious how to solve this issue.


Solution

Here's a simple test case reproducing your problem:

true | { sleep 10 & }
wait
echo "This echoes immediately"

The problem is that the pipe creates a subshell, and the forked processes are part of that subshell. The solution is to wait in that subshell instead of your main parent shell:

true | { sleep 10 & wait; }
echo "This waits"
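
To see the difference rather than take it on faith, the two variants can be timed. This is only a sketch: it relies on bash's built-in SECONDS counter, uses a shorter sleep than the example above, and the variable names parent_wait and subshell_wait are made up for illustration.

```shell
#!/usr/bin/env bash
# Variant 1: wait runs in the parent shell, which has no children of its own
# to wait for, because sleep was started by the subshell of the pipeline.
SECONDS=0
true | { sleep 2 & }
wait
parent_wait=$SECONDS      # expected to stay close to 0

# Variant 2: wait runs inside the subshell that started sleep.
SECONDS=0
true | { sleep 2 & wait; }
subshell_wait=$SECONDS    # expected to be at least 2

echo "wait in parent took ${parent_wait}s, wait in subshell took ${subshell_wait}s"
```

Note the semicolon before the closing brace: bash only recognizes } as the end of the group when it appears at the start of a command.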

Translated back into your code, this means:

ls -1 _tmp* | { while read -r FILE; do sort "$FILE" -o "$FILE" & done; wait; }
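
Put together as a script, the three steps from the question might look like the sketch below. The temporary directory, the six-line sample file, and the chunk size of 2 are assumptions made only so the example runs on its own; the question uses 5000000 lines per chunk.

```shell
#!/usr/bin/env bash
# Sketch: split a file, sort the chunks in parallel, then merge them.
workdir=$(mktemp -d)      # scratch directory so the demo does not touch real data
cd "$workdir"
printf '%s\n' delta alpha charlie bravo echo foxtrot > data.tsv

# Step 1: split into fixed-size chunks named _tmpaa, _tmpab, ...
split -l 2 data.tsv _tmp

# Step 2: the background sorts are children of the subshell on the right-hand
# side of the pipe, so wait has to run inside that same subshell.
ls -1 _tmp* | { while read -r FILE; do sort "$FILE" -o "$FILE" & done; wait; }

# Step 3: merge the chunks, which are all sorted by now.
sort -m _tmp* -o data.tsv.sorted
cat data.tsv.sorted
```

Because the pipeline in step 2 only returns once the subshell has finished waiting, step 3 never starts while a chunk is still being sorted.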

Other suggestions

From the bash man page:

Each command in a pipeline is executed as a separate process (i.e., in a subshell).

So when you pipe to while, a subshell is created, and everything else in step 2, including all the background sort processes, runs within that subshell. When the while loop finishes, the subshell exits, and wait then runs in the parent shell, which has no child processes to wait for. You can avoid the pipeline by using process substitution, which keeps the loop and its background jobs in the current shell, so a wait issued after the loop can see them:

while read -r FILE; do
    sort "$FILE" -o "$FILE" &
done < <(ls -1 _tmp*)
wait
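
Since the file list comes from a glob anyway, a plain for loop is another way to avoid the pipe: no subshell is involved, so the background jobs are children of the current shell and a bare wait sees them. This is a sketch of that alternative, not something taken from the answers above; the temporary directory and the two small chunk files are stand-ins for the real output of split.

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)      # scratch directory for the demo files
cd "$workdir"
printf '%s\n' c a b > _tmpaa
printf '%s\n' f e d > _tmpab

# The glob is expanded by the current shell, so no pipe and no ls are needed.
for FILE in _tmp*; do
    sort "$FILE" -o "$FILE" &
done
wait                      # blocks until every background sort has exited

sort -m _tmp* -o data.tsv.sorted
```

Letting the shell expand the glob also sidesteps the usual problems with parsing ls output, such as file names that contain whitespace.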
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow