Question

I just ran these two commands on a file having around 250 million records.

awk '{if(substr($0,472,1)=="9") print $0}' < file1.txt >> file2.txt

and

nawk '{if(substr($0,472,1)=="9") print $0}' < file1.txt >> file2.txt

The record length is 482 bytes. The awk command gave the correct number of records in file2.txt, i.e. 60 million, but the nawk command gives only 4.2 million.

I am confused and would like to know if someone has come across an issue like this. Why exactly is this simple command treated differently internally? Is there a buffer that can only hold up to a certain number of bytes when using nawk?

I would appreciate it if someone could shed some light on this.

My OS details are

SunOS <hostname> 5.10 Generic_147148-26 i86pc i386 i86pc

Solution

The difference probably lies in nawk's record-size limit: one of the records (lines) in your input file has likely exceeded it.

This crucial line can be found in awk.h:

#define RECSIZE (8 * 1024)  /* sets limit on records, fields, etc., etc. */
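If a record longer than 8192 bytes is the culprit, you can confirm it by measuring the longest line in the input. A minimal sketch (file1.txt stands in for your actual input file):

```shell
# Print the length of the longest record in the file. If this exceeds
# 8192 bytes (RECSIZE), nawk's fixed-size record buffer is a plausible
# explanation for the dropped records.
awk '{ if (length($0) > max) max = length($0) } END { print max+0 }' file1.txt
```

A stated record length of 482 should make every line safely under the limit, so a larger number here would indicate stray long lines (e.g. missing newlines) in the data.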

OTHER TIPS

Your command can be reduced to just this:

awk 'substr($0,472,1)==9'
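This works because in awk a pattern with no action defaults to printing the record, so the explicit if/print is redundant. A quick check of the equivalence on tiny synthetic data (sample.txt and the column offset 4 are made up for illustration):

```shell
# Both forms print exactly the lines whose 4th character is "9";
# the second relies on awk's default action of printing $0.
printf 'abc9\nabcd\nxyz9\n' > sample.txt
awk '{ if (substr($0, 4, 1) == "9") print $0 }' sample.txt
awk 'substr($0, 4, 1) == 9' sample.txt
```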

On Solaris (which you are on) when you run awk by default you are running old, broken awk (/usr/bin/awk) so I suspect that nawk is the one producing the correct result.

Run /usr/xpg4/bin/awk with the same script/arguments and see which of your other results its output agrees with.

Also check whether your input file was created on Windows: run dos2unix on it and see if its size changes, and if so, re-run your awk commands on the modified file. A file created on Windows will contain control-M (carriage return) characters, which could be causing chaos.
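If you would rather not modify the file before checking, you can count CRLF-terminated lines directly. A sketch, again using file1.txt as a stand-in for your input:

```shell
# Count lines ending in a carriage return (\r). A non-zero count
# suggests Windows (CRLF) line endings and that dos2unix is worth running.
awk '/\r$/ { crlf++ } END { print crlf+0 }' file1.txt
```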

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow