Efficiently changing the date format of an existing log file

Question 1

Running the date command once per line means forking a new process for each of millions of lines, which is going to be painfully slow. Anything that avoids that will be faster. One answer has suggested sed, which has many merits; another suggested Perl, and the same goes for that.

Using awk, you could look at:

awk 'BEGIN { m["Jan"] = "01"; m["Feb"] = "02"; m["Mar"] = "03";
             m["Apr"] = "04"; m["May"] = "05"; m["Jun"] = "06";
             m["Jul"] = "07"; m["Aug"] = "08"; m["Sep"] = "09";
             m["Oct"] = "10"; m["Nov"] = "11"; m["Dec"] = "12";
           }
           {
             printf "2014%s%02d,%s,", m[$1], $2, $3;
             pad=""
             for (i = 4; i <= NF; i++) { printf("%s%s", pad, $i); pad = " " }
             printf "\n"
           }
    ' log-file

If you have GNU awk, it has time manipulation functions built in, though frankly treating the date information as strings and numbers, as shown, is just as effective.

Given an input log file like this:

Apr 22 23:08:26 a,x,y
Apr 22 23:08:26 b,y,z
Apr 22 23:08:26 c,s,s
Jan 31 00:19:50 c,info with spaces,some more info
Feb  2 00:20:41 c,info with spaces,some more info
Mar 13 00:31:32 c,info with spaces,some more info
May  5 00:42:23 c,info with spaces,some more info
Jun 16 00:53:14 c,info with spaces,some more info
Jul 27 00:04:05 c,info with spaces,some more info
Aug  8 00:15:56 c,info with spaces,some more info
Sep 29 00:26:47 c,info with spaces,some more info
Oct 30 00:37:38 c,info with spaces,some more info
Nov 12 00:49:29 c,info with spaces,some more info
Dec 22 00:50:10 c,info with spaces,some more info

It generates output like this:

20140422,23:08:26,a,x,y
20140422,23:08:26,b,y,z
20140422,23:08:26,c,s,s
20140131,00:19:50,c,info with spaces,some more info
20140202,00:20:41,c,info with spaces,some more info
20140313,00:31:32,c,info with spaces,some more info
20140505,00:42:23,c,info with spaces,some more info
20140616,00:53:14,c,info with spaces,some more info
20140727,00:04:05,c,info with spaces,some more info
20140808,00:15:56,c,info with spaces,some more info
20140929,00:26:47,c,info with spaces,some more info
20141030,00:37:38,c,info with spaces,some more info
20141112,00:49:29,c,info with spaces,some more info
20141222,00:50:10,c,info with spaces,some more info
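
For comparison, here is a minimal sketch of the GNU awk route mentioned above. It assumes gawk is installed (mktime() and strftime() are gawk extensions, not POSIX awk), that the year is 2014, and the function name gawk_convert is only illustrative. It builds a timestamp from the fields and formats it back out, which is more machinery than the string approach needs for this job.

```shell
# Requires GNU awk: mktime() and strftime() are gawk extensions.
# Reads log lines on stdin, writes the converted lines on stdout.
gawk_convert() {
    gawk 'BEGIN { months = "JanFebMarAprMayJunJulAugSepOctNovDec" }
    {
        mon = (index(months, $1) + 2) / 3          # Jan -> 1 ... Dec -> 12
        split($3, hms, ":")
        # mktime() takes "YYYY MM DD HH MM SS" and returns seconds since the epoch
        ts = mktime(sprintf("2014 %d %d %d %d %d", mon, $2, hms[1], hms[2], hms[3]))
        # Rejoin the message fields, which may contain spaces
        rest = $4
        for (i = 5; i <= NF; i++) rest = rest " " $i
        print strftime("%Y%m%d,%H:%M:%S", ts) "," rest
    }'
}
```

It would be run as gawk_convert < log-file. Both mktime() and strftime() work in local time here, so the round trip leaves the clock time unchanged.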

Question 2

You could try sed, assuming every entry is from this year (2014 is hard-coded):

sed -e 's/^\([A-Z][a-z][a-z]\)  *\([0-9]\) /\1 0\2 /
s/^\([A-Z][a-z][a-z] [0-9][0-9]\) \([0-9:]*\) /\1,\2,/
# Clear the "substitution made" flag so each t below reacts only to its own month.
t months
:months
s/^Jan /201401/;t
s/^Feb /201402/;t
s/^Mar /201403/;t
s/^Apr /201404/;t
s/^May /201405/;t
s/^Jun /201406/;t
s/^Jul /201407/;t
s/^Aug /201408/;t
s/^Sep /201409/;t
s/^Oct /201410/;t
s/^Nov /201411/;t
s/^Dec /201412/' YourFile

The t command is an optimization: once a month substitution has succeeded, sed skips to the end of the script, so the remaining months are not tested on that line. For pure performance you could also remove the lines for months that never occur (if your log only covers one or two months, there is no need to test the others).
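
As a sketch of that pruning, assume a log known to contain only April and May entries from 2014 (the function name convert_apr_may is just illustrative). With only two month tests left, the t shortcut buys very little, so it is left out here; single-digit days are zero-padded first so that "May  5" comes out as 05.

```shell
# Reads log lines on stdin, writes converted lines on stdout.
# Assumes every line starts with "Apr" or "May"; other months pass through
# with commas added but the month name left in place.
convert_apr_may() {
    sed -e 's/^\([A-Z][a-z][a-z]\)  *\([0-9]\) /\1 0\2 /
        s/^\([A-Z][a-z][a-z] [0-9][0-9]\) \([0-9:]*\) /\1,\2,/
        s/^Apr /201404/
        s/^May /201405/'
}
```

It would be run as convert_apr_may < YourFile.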

Question 3

I know you didn't tag the question with perl, so perhaps it isn't an option, but personally I would consider using it. You could do something like this:

#!/usr/bin/env perl

use strict;
use warnings;

use Time::Piece;

{
    open my $in, "<", "logfile" or die "couldn't open logfile: $!";
    open my $out, ">", "new_logfile" or die "couldn't open new_logfile: $!";

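    while (my $line = <$in>) {
        chomp $line;
        # Split into month, day, time and the rest; the limit of 4 keeps
        # any spaces inside the message intact.
        my ($mon, $day, $time, $rest) = split ' ', $line, 4;
        my $t = Time::Piece->strptime("$mon $day 2014", "%b %e %Y");
        print $out join(",", $t->strftime("%Y%m%d"), $time, $rest), "\n";
    }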
}

This uses the core Time::Piece module to parse the dates in your log file and convert them to the format you require. Using perl without calling any external commands is likely to be a lot faster than what you have currently. I hard-coded the year 2014 because I'm not sure where else it would come from.
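
If the caller does know the year, it can be passed in rather than hard-coded. The following is only a sketch of that idea, written as a shell function around POSIX awk rather than in Perl; the name convert_log and its single YEAR argument are assumptions made for illustration.

```shell
# convert_log YEAR: reads log lines on stdin, writes CSV lines on stdout.
convert_log() {
    awk -v year="$1" '
        BEGIN {
            # Map month abbreviations to their numbers: num["Jan"] = 1, ...
            n = split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", name, " ")
            for (i = 1; i <= n; i++) num[name[i]] = i
        }
        {
            # Rejoin the message fields, which may contain spaces
            rest = $4
            for (i = 5; i <= NF; i++) rest = rest " " $i
            printf "%04d%02d%02d,%s,%s\n", year, num[$1], $2, $3, rest
        }'
}
```

For example, convert_log 2013 < logfile > new_logfile would stamp every entry with 2013. A log that runs from December into January would still need the year worked out per line, which this sketch does not attempt.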

Question 4

Here's one way using awk. Run like:

awk -f script.awk input.txt

Contents of script.awk:

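BEGIN {

    OFS=","
}

{
    i = index("JanFebMarAprMayJunJulAugSepOctNovDec", $1)

    m = sprintf("%02d", ((i - 1) / 3) + 1)
    d = sprintf("%02d", $2)    # zero-pad single-digit days

    # Rejoin fields 4..NF so that spaces inside the message are kept
    rest = $4
    for (n = 5; n <= NF; n++) rest = rest " " $n

    print "2014" m d, $3, rest
}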

Results:

20140422,23:08:26,a,x,y
20140422,23:08:26,b,y,z
20140422,23:08:26,c,s,s
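
The index arithmetic can be checked in isolation. In this small sketch (the function name month_number is made up for the example), each abbreviation starts at position 1, 4, 7, ... in the lookup string, so (i - 1) / 3 + 1 gives 1 to 12. A name that is not found gives i = 0, and the %02d conversion truncates the resulting fraction to 00.

```shell
# month_number ABBREV: prints the two-digit month number, or 00 if not found.
month_number() {
    awk -v mon="$1" 'BEGIN {
        i = index("JanFebMarAprMayJunJulAugSepOctNovDec", mon)
        printf "%02d\n", (i - 1) / 3 + 1
    }'
}
```

Note that index() does substring matching, so this relies on the first field always being a well-formed three-letter abbreviation.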