Question

I have two columns in a file, and I want to automate summing both values per row

for example

read write
5    6
read write
10   2
read write
23   44

I then want to sum the "read" and "write" of each row. After summing, I find the maximum sum and put that value in a file. I feel like I have to use grep -v to get rid of the column header that precedes each row, which, as noted in the answers, makes the code inefficient since I'm grepping the entire file just to read a single line.

I currently have this in a bash script (within a for loop where $x is the file name) to sum the columns line by line

lines=`grep -v READ $x|wc -l | awk '{print $1}'`
line_num=1
arr_num=0


while [ $line_num -le $lines ]
do

    arr[$arr_num]=`grep -v READ $x |  sed $line_num'q;d' | awk '{print $2 + $3}'`
    echo $line_num
    line_num=$[$line_num+1]
    arr_num=$[$arr_num+1]

done

However, the file to be summed has 270,000+ rows. The script has been running for a few hours now, and it is nowhere near finished. Is there a more efficient way to write this so that it does not take so long?


Solution

Use awk instead and take advantage of the modulus operator:

awk '!(NR%2){print $1+$2}' infile
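
If the end goal is to write only the maximum sum to a file, the same one-liner can be extended to track the maximum as it goes; a minimal sketch, assuming infile/outfile are placeholder names and the sums are never negative:

awk '!(NR%2) && $1+$2 > max { max = $1+$2 } END { print max }' infile > outfile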

Other tips

awk is probably faster, but the idiomatic way to do this is something like:

while read -a line; do      # read each line one-by-one, into an array
                            # use arithmetic expansion to add col 1 and 2
    echo "$(( ${line[0]} + ${line[1]} ))"
done < <(grep -v READ input.txt)

Note that the input file is only read once (by grep), and the number of externally forked programs is kept to a minimum (just grep, called only once for the whole input file); the rest of the commands are bash builtins.

The <( ) process substitution is used in case variables set inside the while loop are needed after the loop; otherwise a | pipe could be used.
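
If you also want the maximum written out, the loop can carry it along; a minimal sketch, assuming non-negative sums and max.txt as a placeholder output name:

max=0
while read -a line; do
    sum=$(( ${line[0]} + ${line[1]} ))   # add the two columns
    (( sum > max )) && max=$sum          # keep the largest sum seen so far
done < <(grep -iv read input.txt)        # -i drops the header lines whatever their case
echo "$max" > max.txt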

Your question is pretty verbose, yet your goal is not clear. The way I read it, your numbers are on every second line, and you want only to find the maximum sum. Given that:

awk '
    NR%2 == 1 {next} 
    NR == 2 {max = $1+$2; next} 
    $1+$2 > max {max = $1+$2}
    END {print max}
' filename
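
To drop this into the question's for loop and end up with the max in a file, something along these lines would work (the *.txt glob and the $x.max output name are only illustrative):

for x in *.txt; do                     # $x is the file name, as in the question
    awk 'NR%2 == 1 {next}
         NR == 2 {max = $1+$2; next}
         $1+$2 > max {max = $1+$2}
         END {print max}' "$x" > "$x.max"
done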

You could also use a pipeline with tools that implicitly loop over the input like so:

grep -v read INFILE | tr -s ' ' + | bc | sort -rn | head -1 > OUTFILE

This assumes there are spaces between your read and write data values.
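
To see what the middle stages do, take one data row from the example: tr squeezes the run of spaces into a single +, and bc then evaluates the resulting expression:

echo '10   2' | tr -s ' ' +         # prints 10+2
echo '10   2' | tr -s ' ' + | bc    # prints 12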

Why not run:

awk 'NR==1 { print "sum" } /^read/ { next } { print $1 + $2 }' "$x"

You can afford to run it on the file while the other script is still running. It'll be complete in a few seconds at most (prediction). When you're confident it's right, you can kill the other process.

You can use Perl or Python instead of awk if you prefer.
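
For example, you could dump the per-row sums into a scratch file (sums.txt is an illustrative name) and compare it against the slow script's output once that finishes:

awk 'NR==1 { print "sum" } /^read/ { next } { print $1 + $2 }' "$x" > sums.txt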

Your code runs grep, sed and awk once for every line of the input file, re-reading and re-filtering the whole file each time; that's damnably expensive. And it isn't even writing the data to a file; it is building an array in Bash's memory that will still need to be printed to the output file later.

Assuming that it's always one 'header' row followed by one 'data' row:

awk '
  BEGIN{ max = 0 }
  {
    if( NR%2 == 0 ){
      sum = $1 + $2;
      if( sum > max ) { max = sum }
    }
  }
  END{ print max }' input.txt
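
One small caveat: BEGIN{ max = 0 } assumes the sums are never negative, which is fine for read/write counts; if that could ever change, seed max from the first data row instead, as the earlier answer does with NR == 2 { max = $1+$2; next }.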

Or simply trim out all lines that do not conform to what you want:

grep '^[0-9]\+\s\+[0-9]\+$' input.txt | awk '
  BEGIN{ max = 0 }
  {
    sum = $1 + $2;
    if( sum > max ) { max = sum }
  }
  END{ print max }'
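
Note that \s in a basic regular expression is a GNU grep extension; [[:space:]] is the portable equivalent if this needs to run with other grep implementations.
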
Licensed under: CC-BY-SA with attribution