Question

I'm trying to count the number of occurrences of a particular string from a bunch of .gz logfiles on an hourly basis. Each logfile statement starts with the following time format:

2013-11-21;09:07:23.433.

For example, to be clearer: find the count of occurrences of the string "abc" between 8am and 9am, then between 9am and 10am, and so on. Any ideas on how to do it?

Solution

Since you only need counts, you can zcat the files, grep for the word you're looking for, extract the date and hour from each matching line, and tally the results with sort | uniq -c. The following would probably suffice:

zcat *.gz | grep <word> | grep -oP "^\d{4}-\d{2}-\d{2};\d{2}" | sort | uniq -c

The above command finds the lines in your logfiles that contain the <word> you're looking for, extracts the date and hour from each of them, and counts the occurrences per hour. Note that it counts matching lines, so a line that contains the word twice is counted once. In case you don't want to take days/months/years into account, you may use:

zcat *.gz | grep <word> | grep -oP "^\d{4}-\d{2}-\d{2};\K\d{2}" | sort | uniq -c

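The \K added in the grep expression is a PCRE (Perl Compatible Regular Expressions) escape that drops everything matched so far from the reported match, so only the two hour digits are printed. The effect is similar to a look-behind.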
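Some grep builds do not support -P. Since the timestamp has a fixed width, the same grouping can be sketched without PCRE by cutting the first 13 characters (YYYY-MM-DD;HH) from each matching line. The function name count_per_hour is made up for illustration, and the sketch assumes every line starts with the timestamp, with no leading whitespace.

```shell
# Hypothetical helper: count lines containing a word, grouped by "date;hour".
# Assumes each line begins with a timestamp like 2013-11-21;09:07:23.433,
# so characters 1-13 are exactly "YYYY-MM-DD;HH".
count_per_hour() {
    word=$1
    shift
    # gzip -dc does the same job as zcat.
    # grep -F treats the word as a fixed string rather than a regex.
    gzip -dc "$@" | grep -F -- "$word" | cut -c1-13 | sort | uniq -c
}

# Example call: count_per_hour abc *.gz
```

Using cut -c12-13 instead of cut -c1-13 would group by hour alone, ignoring the date.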

OTHER TIPS

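Try this, which counts the lines containing abc that were logged between 08:00 and 09:59 on 2013-11-21 (use 08 instead of 0[89] to cover only 8am to 9am):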

zgrep -c '2013-11-21;0[89]:.*abc' file.gz
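To get one count per hour with this style of pattern, the match can be repeated for each hour of the day. The following is only a sketch: hourly_counts is a hypothetical helper name, the day and word are passed as arguments, and the word is interpreted as a regular expression. It decompresses the files once per hour, so for large logs a single-pass approach is likely to be faster.

```shell
# Hypothetical helper: print one "HH count" line for each hour of a given day.
# Arguments: day (YYYY-MM-DD), word (a grep regex), then the .gz files.
hourly_counts() {
    day=$1
    word=$2
    shift 2
    for h in 00 01 02 03 04 05 06 07 08 09 10 11 \
             12 13 14 15 16 17 18 19 20 21 22 23; do
        # grep -c prints the number of matching lines (0 if none).
        # It exits non-zero on 0 matches, so "|| true" keeps "set -e" scripts alive.
        n=$(gzip -dc "$@" | grep -c -- "^$day;$h:.*$word" || true)
        printf '%s %s\n' "$h" "$n"
    done
}

# Example call: hourly_counts 2013-11-21 abc *.gz
```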

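Or awk (gawk on Linux) will work:

zcat *.gz | awk -F'[.;:]' '/abc/ {arr[$2]++} END {for (i in arr) print i, arr[i]}'

The /abc/ pattern restricts the count to lines that contain the string, and $2 is the hour field. Inside a bracket expression the dot is already literal, so it is left unescaped; some awks, notably gawk, print a warning when it is written as \. there.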
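A variant of the awk approach that labels each count with its hour range and sorts the output could look like this. It is a minimal sketch: count_hour_ranges is a hypothetical name, the word is matched as a fixed string via index(), and the hour is assumed to be the second field when splitting on ; and :. Dates are ignored, so the same hour on different days is added together.

```shell
# Hypothetical helper: print "HH:00-HH:00 count" for every hour that has matches.
# Arguments: word (fixed string), then the .gz files.
count_hour_ranges() {
    word=$1
    shift
    gzip -dc "$@" | awk -F'[;:]' -v w="$word" '
        # index() returns a non-zero position when w occurs in the line.
        index($0, w) { hits[$2]++ }      # $2 is the two-digit hour
        END {
            for (h in hits)
                # (h + 1) % 24 wraps the range label for hour 23 back to 00.
                printf "%s:00-%02d:00 %d\n", h, (h + 1) % 24, hits[h]
        }' | sort
}

# Example call: count_hour_ranges abc *.gz
```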

Licensed under: CC-BY-SA with attribution