Question

Please help me speed up the formatting command below, which is taking a long time. The input file is delimited by **, with 22 million rows and 87 columns. The output needs only two columns, substr($3,0,15) and substr($4,3,10), separated by a comma.

time zcat hlr*.gz | awk -F"**" '{OFS=","; print substr($3,0,15),substr($4,3,10)}' >Op_Formatted.csv

When I run the above command on Linux (per uname), it takes about 5 hours 20 minutes:

real    319m48.471s
user    313m49.924s
sys     1m32.803s

whereas on CYGWIN_NT-6.1 (per uname) it takes only about 17 minutes:

real    16m52.823s
user    17m35.485s
sys     0m6.986s

Sample Input:

2**000001**804421890831817F**819200000068FFFF**00** 0** 21- 10** 72- 1** 90- 32** 51- 1** 54- 1** 55- 1** 126- 5** 141- 44** 143- 1** 140- 58** 105- 0** 106- 0** 121- 4** 147- 1** 152- 1** 34- 0** 33- 4** 9- 1** 10- 1** 38- 1** 110- 1** 2- 1** 4- 1** 5- 1** 6- 1** 8- 1** 43- 1** 44- 1** 45- 1** 46- 1** 85- 0** 86- 4** 42- 0** 47- 0** 48- 0** 49- 0** 112- 1**9607500248789478**
2**000002**804421812449266F**819200000227FFFF**00** 0** 21- 10** 72- 1** 90- 32** 51- 1** 54- 1** 55- 1** 126- 5** 141- 44** 143- 1** 140- 5** 105- 0** 106- 0** 121- 4** 147- 1** 152- 1** 34- 0** 33- 7** 9- 1** 10- 1** 38- 1** 110- 1** 2- 1** 4- 1** 5- 1** 6- 1** 8- 1** 43- 1** 44- 1** 45- 1** 46- 1** 85- 0** 86- 4** 42- 0** 47- 0** 48- 0** 49- 0** 112- 1**4592140525164919**
2**000003**804421830628518F**819200000312FFFF**00** 0** 21- 10** 72- 1** 90- 35** 51- 1** 54- 1** 55- 1** 126- 5** 141- 44** 140- 58** 105- 0** 106- 0** 121- 4** 147- 1** 152- 1** 34- 0** 33- 4** 9- 1** 10- 1** 38- 1** 110- 1** 2- 1** 4- 1** 5- 1** 6- 1** 8- 1** 43- 1** 44- 1** 45- 1** 46- 1** 85- 0** 86- 4** 42- 0** 47- 0** 48- 0** 49- 0** 112- 1**6570980506503001**

Sample Output:

804421890831817,9200000068
804421812449266,9200000227
804421830628518,9200000312

Solution

Check whether your Linux environment has a memory or disk I/O problem. The command runs fine in my environment.

Here are some suggestions.

First, set OFS outside the main block. In your command, OFS is assigned again for every input line.

zcat hlr*.gz | awk '{print substr($3,0,15),substr($4,3,10)}' FS="**" OFS="," >Op_Formatted.csv
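The same idea can also be sketched with both separators set once in a BEGIN block. Two things here are my assumptions rather than part of the original command: the delimiter is written as the bracket expression [*][*], because a multi-character FS is treated as a regular expression and this makes it unambiguous that two literal asterisks are meant; and substr starts at 1, because awk strings are 1-indexed and awk implementations do not all agree on what a start index of 0 means. The function name extract_fields is hypothetical.

```shell
# extract_fields: read **-separated records on stdin and print
# the two wanted substrings as one comma-separated line per record.
extract_fields() {
  awk 'BEGIN { FS = "[*][*]"; OFS = "," }   # separators are set once, before any input
       { print substr($3, 1, 15), substr($4, 3, 10) }'
}

# Usage with the files from the question:
# zcat hlr*.gz | extract_fields > Op_Formatted.csv
```

On the first sample line this should print 804421890831817,9200000068, matching the sample output in the question.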

If the field positions are the same on every line, skip field splitting and try this:

zcat hlr*.gz | awk '{print substr($0,12,15) "," substr($0,32,10)}' >Op_Formatted.csv
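The fixed offsets 12 and 32 only hold if the first three fields always have the same widths as in the sample input (1, 6 and 16 characters). A rough way to check that before relying on the offsets is sketched here. The function name count_misaligned is hypothetical, and the widths are read off the three sample lines, not from any specification of the file.

```shell
# count_misaligned: print how many records have a first, second or third
# field whose width differs from the sample input (1, 6 and 16 characters).
count_misaligned() {
  awk 'BEGIN { FS = "[*][*]" }
       length($1) != 1 || length($2) != 6 || length($3) != 16 { bad++ }
       END { print bad + 0 }'
}

# Usage with the files from the question:
# zcat hlr*.gz | count_misaligned    # 0 means every record matches the sample widths
```

If the count is not 0, the field-based command is the safer choice, since it does not depend on column positions.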

Test with the first command on a 3000-line file:

real    0m0.297s
user    0m0.249s
sys     0m0.046s

Test with second command:

real    0m0.078s
user    0m0.077s
sys     0m0.030s
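To repeat a comparison like this without the real data, a synthetic file can be generated. This is only a sketch: make_sample is a hypothetical helper, and it fills in just the first few fields in the shape of the sample input, so absolute timings will differ from those on the real 87-column file.

```shell
# make_sample N: print N synthetic records shaped like the sample input
# (first four fields filled in, followed by a short placeholder tail).
make_sample() {
  awk -v n="$1" 'BEGIN {
    for (i = 1; i <= n; i++)
      printf "2**%06d**804421%09dF**8192%08dFFFF**00** 0**\n", i, i, i
  }'
}

# Usage:
# make_sample 3000 > sample.txt
# time awk '{print substr($3,0,15),substr($4,3,10)}' FS="**" OFS="," sample.txt > /dev/null
# time awk '{print substr($0,12,15) "," substr($0,32,10)}' sample.txt > /dev/null
```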
Licensed under: CC-BY-SA with attribution