Question

I have a log file that runs into gigabytes in size, which I parse into a CSV file used for processing and data analysis. While creating the CSV file I want the date to be in a particular format.

Input file:

Apr 22 23:08:26 a,x,y
Apr 22 23:08:26 b,y,z
Apr 22 23:08:26 c,s,s

Output file:

20140422,23:08:26,a,x,y
20140422,23:08:26,b,y,z
20140422,23:08:26,c,s,s

Currently I'm doing this using the following awk statement - but it takes hours to run through a file which is more than a gigabyte in size.

awk 'BEGIN { OFS = "," } {getDate="date -f \"%b %d %H:%M:%S\" \""$1" "$2" "$3"\" \"+%Y%m%d\",\"%H:%M:%S\""
while ( ( getDate | getline date ) > 0 ) { }
close(getDate);
print date,$4}' inputFile

Could this be further optimized? Is awk the right tool to use here?

Solution 2

Running the date command once per line for millions of lines is going to be painfully slow. Anything that avoids that is going to be faster. One answer has suggested sed — that has many merits; another suggested Perl — ditto.

Using awk, you could look at:

awk 'BEGIN { m["Jan"] = "01"; m["Feb"] = "02"; m["Mar"] = "03";
             m["Apr"] = "04"; m["May"] = "05"; m["Jun"] = "06";
             m["Jul"] = "07"; m["Aug"] = "08"; m["Sep"] = "09";
             m["Oct"] = "10"; m["Nov"] = "11"; m["Dec"] = "12";
           }
           {
             printf "2014%s%02d,%s,", m[$1], $2, $3;
             pad=""
             for (i = 4; i <= NF; i++) { printf("%s%s", pad, $i); pad = " " }
             printf "\n"
           }
    ' log-file

If you have GNU awk, it has time manipulation functions built in, though frankly treating the date information as strings and numbers, as shown, is just as effective.

Given an input log file like this:

Apr 22 23:08:26 a,x,y
Apr 22 23:08:26 b,y,z
Apr 22 23:08:26 c,s,s
Jan 31 00:19:50 c,info with spaces,some more info
Feb  2 00:20:41 c,info with spaces,some more info
Mar 13 00:31:32 c,info with spaces,some more info
May  5 00:42:23 c,info with spaces,some more info
Jun 16 00:53:14 c,info with spaces,some more info
Jul 27 00:04:05 c,info with spaces,some more info
Aug  8 00:15:56 c,info with spaces,some more info
Sep 29 00:26:47 c,info with spaces,some more info
Oct 30 00:37:38 c,info with spaces,some more info
Nov 12 00:49:29 c,info with spaces,some more info
Dec 22 00:50:10 c,info with spaces,some more info

It generates output like this:

20140422,23:08:26,a,x,y
20140422,23:08:26,b,y,z
20140422,23:08:26,c,s,s
20140131,00:19:50,c,info with spaces,some more info
20140202,00:20:41,c,info with spaces,some more info
20140313,00:31:32,c,info with spaces,some more info
20140505,00:42:23,c,info with spaces,some more info
20140616,00:53:14,c,info with spaces,some more info
20140727,00:04:05,c,info with spaces,some more info
20140808,00:15:56,c,info with spaces,some more info
20140929,00:26:47,c,info with spaces,some more info
20141030,00:37:38,c,info with spaces,some more info
20141112,00:49:29,c,info with spaces,some more info
20141222,00:50:10,c,info with spaces,some more info
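
The loop over fields 4 to NF in the script above rejoins the message with single spaces, so runs of whitespace inside the message would be collapsed. As a minimal sketch of one alternative, the variant below strips the three timestamp fields from a copy of $0 and prints the remainder untouched. The function name convert_keep_rest is made up for illustration, the year is still hard-coded as 2014, and the fields are assumed to be separated by spaces only:

```shell
# Hypothetical helper: reads log lines on stdin, writes CSV lines to stdout.
convert_keep_rest() {
    awk 'BEGIN {
             # Build the month lookup table from a single string.
             n = split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", name, " ")
             for (i = 1; i <= n; i++) m[name[i]] = sprintf("%02d", i)
         }
         {
             rest = $0
             # Remove "Mon DD HH:MM:SS " from the front; the message is kept verbatim.
             sub(/^ *[^ ]+ +[^ ]+ +[^ ]+ +/, "", rest)
             printf "2014%s%02d,%s,%s\n", m[$1], $2, $3, rest
         }'
}

# Example run on a line whose message contains repeated spaces.
printf 'Feb  2 00:20:41 c,info  with   spaces,more\n' | convert_keep_rest
```

Whether this is faster than the field loop on a given awk implementation would need measuring; the main difference is that the message text is not altered.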

OTHER TIPS

You could try sed (assuming every entry is always from this year):

sed -e 's/\(:[0-9]\{2\}\) /\1,/
# Reset the flag set by the substitution above, so that each t below reacts only to its own month substitution.
t month
:month
s/^Jan \([0-9]*\) /201401\1,/;t
s/^Feb \([0-9]*\) /201402\1,/;t
s/^Mar \([0-9]*\) /201403\1,/;t
s/^Apr \([0-9]*\) /201404\1,/;t
s/^May \([0-9]*\) /201405\1,/;t
s/^Jun \([0-9]*\) /201406\1,/;t
s/^Jul \([0-9]*\) /201407\1,/;t
s/^Aug \([0-9]*\) /201408\1,/;t
s/^Sep \([0-9]*\) /201409\1,/;t
s/^Oct \([0-9]*\) /201410\1,/;t
s/^Nov \([0-9]*\) /201411\1,/;t
s/^Dec \([0-9]*\) /201412\1,/' YourFile

The t command is an optimization: once a month substitution has succeeded, sed branches to the end of the script instead of testing the remaining months on the same line. For pure performance you could also remove the lines for months that never occur (if your log only covers one or two months, there is no need to test the others).
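
One detail of t is worth knowing when adapting the script: it branches if any substitution has succeeded since the input line was read (or since the last t branch was taken), not only the substitution immediately before it. The following is a minimal sketch with made-up function names and toy patterns, showing that effect and the usual reset through a branch to a label on the next line:

```shell
# Without a reset: the successful s/a/A/ makes the later t skip s/b/B/.
t_no_reset() {
    sed -e 's/a/A/
s/x/X/;t
s/b/B/'
}

# With a reset: "t r" consumes the flag, so the later t only sees s/x/X/.
t_with_reset() {
    sed -e 's/a/A/
t r
:r
s/x/X/;t
s/b/B/'
}

printf 'ab\n' | t_no_reset     # prints Ab
printf 'ab\n' | t_with_reset   # prints AB
```

The same pattern applies whenever an unconditional substitution runs before a chain of alternatives that each end in t.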

I know you didn't tag the question with Perl, and perhaps it isn't an option, but personally I would consider using it. You could do something like this:

#!/usr/bin/env perl

use strict;
use warnings;

use Time::Piece;

{
    open my $in, "<", "logfile" or die "couldn't open logfile: $!";
    open my $out, ">", "new_logfile" or die "couldn't open new_logfile: $!";

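    while (my $line = <$in>) {
        chomp $line;
        # Split into at most four fields so that spaces inside the message are preserved.
        my @cols = split ' ', $line, 4;
        my $t = Time::Piece->strptime("$cols[0] $cols[1] 2014", "%b %e %Y");
        # Date, time and message, joined with commas.
        print $out join(",", $t->strftime("%Y%m%d"), @cols[2,3]), "\n";
    }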
}

This uses the core Time::Piece module to parse the dates in your log file and convert them to the format you require. Doing everything inside Perl, without spawning an external command for each line, is likely to be a lot faster than what you have currently. I hard-coded the year 2014 because I'm not sure where else it would come from.
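
The answers here hard-code 2014 because the log lines carry no year. One possible way around that is to pass the year in from the shell with awk's -v option. This is only a sketch: it assumes every line in a given file belongs to the same known year, and the function name to_csv and its argument are hypothetical.

```shell
# Hypothetical helper: to_csv YEAR < logfile > csvfile
to_csv() {
    awk -v year="$1" '
        BEGIN {
            n = split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", name, " ")
            for (i = 1; i <= n; i++) m[name[i]] = sprintf("%02d", i)
        }
        {
            printf "%s%s%02d,%s,", year, m[$1], $2, $3
            # Rejoin the message fields with single spaces, as the first awk answer does.
            pad = ""
            for (i = 4; i <= NF; i++) { printf "%s%s", pad, $i; pad = " " }
            printf "\n"
        }'
}

# Example: stamp the lines with the current year reported by date.
printf 'Apr 22 23:08:26 a,x,y\n' | to_csv "$(date +%Y)"
```

A log that spans a New Year boundary would need extra logic (for example, bumping the year when the month goes from Dec back to Jan), which this sketch does not attempt.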

Here's one way using awk. Run like:

awk -f script.awk input.txt

Contents of script.awk:

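BEGIN {
    OFS = ","
}

{
    # Each abbreviation starts at offset 1, 4, 7, ... so (i - 1) / 3 + 1 is the month number.
    i = index("JanFebMarAprMayJunJulAugSepOctNovDec", $1)
    m = sprintf("%02d", ((i - 1) / 3) + 1)

    # Zero-pad the day; $4 is the message, assumed to contain no spaces.
    print "2014" m sprintf("%02d", $2), $3, $4
}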

Results:

20140422,23:08:26,a,x,y
20140422,23:08:26,b,y,z
20140422,23:08:26,c,s,s
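
The month lookup in script.awk works because each three-letter abbreviation starts at position 1, 4, 7, and so on in the concatenated string, so (i - 1) / 3 + 1 recovers the month number. As a small sketch of that arithmetic, the hypothetical helper month_number below prints the result for a given name. The validity check is an addition of this sketch, not part of the answer: without it, a string that is not a month name (index returns 0) or that matches across two names would still produce a number.

```shell
# Hypothetical helper: prints the two-digit month number for an abbreviation.
month_number() {
    awk -v name="$1" 'BEGIN {
        i = index("JanFebMarAprMayJunJulAugSepOctNovDec", name)
        # A valid abbreviation is 3 characters long and starts at offset 1, 4, 7, ...
        if (length(name) != 3 || i == 0 || (i - 1) % 3 != 0) {
            print "??"
            exit
        }
        printf "%02d\n", (i - 1) / 3 + 1
    }'
}

month_number Jan   # index 1  -> 01
month_number Apr   # index 10 -> 04
month_number Dec   # index 34 -> 12
month_number anF   # index 2, not a month name -> ??
```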
Licensed under: CC-BY-SA with attribution