Question

I am trying to parallelise the use of a perl script. The input and output arguments for the perl script take the filenames from a directory. This is straightforward with gnu parallel.

ls dir | parallel script.pl --input {} --output {.}.out

However, an additional argument in the script requires that I get the value of the first row and last row, second column, from each file... something like

ls dir | parallel script.pl --input {} --output {.}.out --otherargs range:{1}-{2}

where the {1} and {2} are derived from a previous/simultaneous use of awk or sed to get these values with, for example

awk 'NR==1 {print $2}; END {print $2}' 

But where do I put (how can I put) this awk like step in the "workflow" to allow the perl script to use it?

Looking at

Change text in argument for xargs (or GNU Parallel)

would the right approach simply be to do this?

ls | parallel script.pl --input {} --output {.}.out --otherargs range:{1}-{2} :::: <(awk 'NR==1 {print $2}') <(awk 'END {print $2}')

Thank you.


Solution

It is not really clear what you want. If this is not it, please give us a full example of the input and the wanted output.

ls | parallel script.pl --input {3} --output {3.}.out --otherargs range:{1}-{2} :::: <(ls | awk 'NR==1 {print $2}') <(ls | awk 'END {print $2}') -

or:

parallel script.pl --input {3} --output {3.}.out --otherargs range:{1}-{2} :::: <(ls | awk 'NR==1 {print $2}') <(ls | awk 'END {print $2}') <(ls)

Walk through the tutorial (http://www.gnu.org/software/parallel/parallel_tutorial.html). Your command line will love you for it.

OTHER TIPS

This could be the solution you need:

#!/bin/bash
readarray -t LIST < <(ls)
FIRST=${LIST[0]}; LAST=${LIST[@]:(-1)}
printf '%s\n' "${LIST[@]}" | parallel script.pl --input {} --output {.}.out --otherargs "range:${FIRST}-${LAST}"

Run it as bash script.sh. You may also need to sort the listing, for example by replacing <(ls) with <(ls | sort). The same concept applies if $FIRST and $LAST come from a different source.

A similar concept using a temporary file:

ls > temp
FIRST=$(awk 'NR==1 {print $2}' temp)
LAST=$(awk 'END {print $2}' temp)
parallel script.pl --input {} --output {.}.out --otherargs "range:${FIRST}-${LAST}" < temp

Also, I think this is what you really need with your Awk commands:

{ read -r FIRST; read -r LAST; } < <(awk 'NR==1{print $2;next}{t=$2};END{print t}' temp)
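As a rough illustration of the single-pass idea, here is a minimal sketch that loads both values into FIRST and LAST. The function name load_range and the sample data are made up for this example. It feeds awk's two output lines to read through a here-document rather than process substitution, so it should also run under plain sh. Unlike the command above, it reports the same value twice for a one-line file.

```shell
#!/bin/sh
# Sketch only: load_range and the sample data below are hypothetical.
# load_range FILE sets FIRST and LAST to column 2 of the first and last lines.
load_range() {
    # One awk pass prints two lines: column 2 of line 1, then column 2 of
    # the final line (identical values when the file has a single line).
    out=$(awk 'NR==1 {first=$2} {last=$2} END {print first; print last}' "$1")
    { read -r FIRST; read -r LAST; } <<EOF
$out
EOF
}

# Demo with a throwaway three-line, two-column file.
sample=$(mktemp)
printf '%s\n' 'scaffold_1 100' 'scaffold_1 250' 'scaffold_1 975' > "$sample"
load_range "$sample"
echo "range:${FIRST}-${LAST}"   # prints range:100-975
rm -f "$sample"
```

Once set, the two variables can be placed in the --otherargs string in the same way as in the earlier snippets.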

My own solution was a bash script run under GNU parallel, though Ole's answer above is more elegant (a GNU parallel one-liner). The script collects the relevant values from each file and passes them to the Perl script.

Here is the bash script

#!/bin/bash
sample=$1
describer=$(echo "${sample}" | sed 's/\.sync$//') # removes the .sync suffix
a=$(awk 'NR==1 {print $2}' "${sample}") # second column of the first row
b=$(awk 'END {print $2}' "${sample}")   # second column of the last row

perl script.pl --input "${describer}.sync" --output "${describer}.genepop" \
    --argument "scaffold_1:${a}-${b}"

Followed by

ls | parallel bash bash.script.sh

This makes collecting the variables from each file part and parcel of that file's analysis.
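The same per-file idea might also be kept on a single command line by wrapping the awk step in a small shell function. This is only a sketch: col2_range is a hypothetical helper name, and the commented-out parallel call assumes bash (for export -f) and the script.pl options shown in the question.

```shell
#!/bin/bash
# Sketch only: col2_range is a hypothetical helper name.
# Prints "FIRST-LAST", where FIRST and LAST are column 2 of the first and
# last lines of the file given as $1. Prints nothing for an empty file.
col2_range() {
    awk 'NR==1 {first=$2} {last=$2} END {if (NR) print first "-" last}' "$1"
}

# Possible use with GNU parallel (assumption: jobs are run by bash, so the
# exported function is visible). The single quotes delay $(...) until each
# job runs, so the range is computed from that job's own input file:
#   export -f col2_range
#   ls | parallel 'script.pl --input {} --output {.}.out --otherargs range:$(col2_range {})'
```

Because the function reads each file in one awk pass, it avoids the two separate awk calls per file used in the wrapper script.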

Thanks for the motivating insight, konsolebox. I should have paid attention to my own old post too:

Storing text and numeric variable from file to use in perl script

Licensed under: CC-BY-SA with attribution