Is there a simple tool or a library for “rolling up” a set of events with timestamps into a count of events in each time window of a given span?

StackOverflow https://stackoverflow.com/questions/10111600

Question

My specific problem is that I have a set of Apache access logs, and I want to extract from them a “rolled up” count of requests, grouped into time windows of a specified span.

Example of my data:

127.0.0.1 - - [01/Dec/2011:00:00:11 -0500] "GET / HTTP/1.0" 304 266 "-" "Sosospider+(+http://help.soso.com/webspider.htm)"
127.0.0.1 - - [01/Dec/2011:00:00:24 -0500] "GET /feed/rss2/ HTTP/1.0" 301 447 "-" "Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; 1 subscribers; feed-id=12878631678486589417)"
127.0.0.1 - - [01/Dec/2011:00:00:25 -0500] "GET /feed/ HTTP/1.0" 304 189 "-" "Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; 1 subscribers; feed-id=12878631678486589417)"
127.0.0.1 - - [01/Dec/2011:00:00:30 -0500] "GET /robots.txt HTTP/1.0" 200 333 "-" "Mozilla/5.0 (compatible; ScoutJet; +http://www.scoutjet.com/)"
127.0.0.1 - - [01/Dec/2011:00:00:30 -0500] "GET / HTTP/1.0" 200 10011 "-" "Mozilla/5.0 (compatible; ScoutJet; +http://www.scoutjet.com/)"

As you can see, each line represents an event (in this case, an HTTP request) and contains a timestamp.

Assuming my data covers 3 days, and I specify a time window size of 1 day, I’d like to generate something like this:

Start               End                 Count
2011-12-01 05:00    2011-12-02 05:00    2822
2011-12-02 05:00    2011-12-03 05:00    2572
2011-12-03 05:00    2011-12-04 05:00    604

But I need to be able to vary the size of the window — I might want to analyze a given dataset using windows of 5 minutes, 10 minutes, 1 hour, 1 day, or 1 week, etc.

I also need the library/tool to be capable of analyzing a dataset (a series of lines) of hundreds or even thousands of megabytes in size.

A prebuilt tool which can accept the data via standard input would be great, but a library would be totally fine, as I could just build the tool around the library. Any language would be fine; if I don’t know it I can learn it.

I’d prefer to do this by piping the access log data directly into a tool/library with minimal dependencies — I’m not looking for suggestions to store the data in a database and then query the database to do the analysis. If I need to, I can figure that out myself.
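To make the computation concrete: conceptually, all the tool has to do is parse each timestamp, floor it to the start of its window, and count per bucket. Here is a rough sketch of that idea in GNU awk; this is only a hypothetical illustration, not an existing tool. It assumes gawk (for mktime/strftime), takes the window span in seconds, and buckets in the machine's local timezone, ignoring the log's UTC offset:

$ gzcat *.access.log.gz | gawk -v win=86400 '
  # win is the window span in seconds (86400 = one day); vary it for 5m, 1h, 1w, etc.
  BEGIN {
    split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", m, " ")
    for (i = 1; i <= 12; i++) mon[m[i]] = i       # month name -> month number
    print "Start", "End", "Count"
  }
  {
    # $4 of a combined-format line looks like "[01/Dec/2011:00:00:11";
    # drop the bracket and split on "/" and ":"
    split(substr($4, 2), t, /[\/:]/)
    epoch  = mktime(t[3] " " mon[t[2]] " " t[1] " " t[4] " " t[5] " " t[6])
    bucket = int(epoch / win) * win               # floor to the start of the window
    count[bucket]++
  }
  END {
    PROCINFO["sorted_in"] = "@ind_num_asc"        # gawk 4+: walk buckets in time order
    for (b in count)
      print strftime("%Y-%m-%d %H:%M", b), strftime("%Y-%m-%d %H:%M", b + win), count[b]
  }'

Because something like this streams the input and keeps only one counter per window, it copes fine with multi-gigabyte logs. But I would rather use (and help improve) a proper tool than maintain a one-off like this.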

I tried Splunk and found it way too heavyweight and complex for my case. It’s not just a tool, it’s a whole system with its own datastore, complex indexing and querying abilities, etc.

My question is: does such a library and/or tool exist?

Full disclosure

I must admit, I actually tried and failed to find something like this a few months ago, so I wrote my own. For some reason I didn’t think to post this question at that time. I will share the lib/tool I wrote in an answer shortly. But I really am curious if something like this does exist; maybe I just missed it when I was searching a few months ago.

Solution

As mentioned in the question, a few months ago I tried, without success, to find something like this, so I wrote my own. (For some reason I didn't think to post this question at the time.)

I took this as an opportunity to learn functional programming (FP) and to shore up my proficiency with CoffeeScript. So I wrote Rollups as a CoffeeScript tool which runs on Node. I’ve since added Scala and Clojure versions, as part of my further exploration of FP.

All the versions are intended to be usable as both a tool and a library, although none of them is fully there yet. I think currently only the Clojure version is truly safe to use as a library, and even that I haven't tested in that role.

The tools work as I described in my question. Given a file or set of files containing Apache access logs, I invoke them like so:

$ gzcat *.access.log.gz | rollup.clj -w 1d

(or rollup.coffee, rollup.scala) and the output is exactly like the example in the question.

This tool solved my problem, and I’m no longer actively using it on a day-to-day basis. But I’d love to improve it further for others’ use, if I knew that others were using it. So feedback would be welcome!

Other tips

Splunk (http://www.splunk.com/) would be the tool I'd think of for a problem like this. It's available in free and paid versions; I haven't licensed it myself, only used installations that were already set up.

So, how automatic does this have to be? Can I give a not-really answer that is still useful?

If you want a quick-and-dirty approach, what I usually end up doing is one-offing an ugly bit of shell. Here's one that sums by hour using some cut tricks and awk (which I'm admittedly not very good at, but which is incredibly fast and powerful):

cat access_log | cut -d '[' -f 2 | cut -d ' ' -f 1 | cut -d ':' -f 1,2 | awk '{ date=$1; if (date==olddate) sum=sum+1; else { if (olddate!="") {print olddate,sum}; olddate=date; sum=1}} END {print date,sum}'

(This post on plotting with awk helped me figure out the aggregation bit.)

That should output something like:

12/Apr/2012:11 207
12/Apr/2012:12 188
12/Apr/2012:13 317

Which is itself pretty easy to play with further. Awk is neat.
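One caveat: the running sum above only works because access logs are already in chronological order, so identical hour keys sit on adjacent lines. If your input isn't sorted, a plain sort | uniq -c produces the same counts, just with the count in the first column:

cat access_log | cut -d '[' -f 2 | cut -d ' ' -f 1 | cut -d ':' -f 1,2 | sort | uniq -c

And keeping one more colon-separated field (cut -d ':' -f 1,2,3) rolls the same data up by minute instead of by hour.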

License: CC-BY-SA with attribution
Not affiliated with Stack Overflow