Question

The problem: find log lines written between, say, two months ago and one month ago that contain several specified words (regexes are not strictly necessary, though they would be nice to have).

The catch: there are 20 TB+ of logs (gzipped!) to sift through, and the search has to be fast (preferably done in a few seconds).

My first thought was to use PyTables, since I already store various numerical data together with the log line timestamps and the log lines themselves in Pandas (IIRC I could use the Table format to store them in Pandas' HDFStore) and could then use the built-in PyTables querying. I haven't parsed the whole dataset yet, only a small subset of the logs (for analytical purposes). The basic parsing part is done (extract the timestamp and a few key parameters, add the log line, save), but I also need the fast querying part.

Is it feasible? Is there a better solution available for Python?

I was thinking about using the full-text indexer built into Postgres until I read that it does a linear scan of the column in the table anyway, so I might as well use grep...
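For reference, the brute-force "grep" baseline I mean can be sketched in Python roughly as below. This is only an illustration under assumptions: the function names `matching_lines` and `scan_gzip` are made up for this post, each line is assumed to start with a numeric field followed by `YYYY-MM-DD HH:MM:SS`, and timestamps are compared as strings (which works because that format sorts lexicographically). It is a linear scan, so it shows what the query should return, not how to make it fast.

```python
import gzip


def matching_lines(lines, start, end, words):
    """Yield lines with start <= timestamp < end that contain every word.

    start and end are strings in 'YYYY-MM-DD HH:MM:SS' form (an assumption
    about the log layout, not a general-purpose parser).
    """
    for line in lines:
        # Assumed layout: "<number> <date> <time> <rest of line>"
        parts = line.split(" ", 3)
        if len(parts) < 4:
            continue  # skip lines that do not fit the assumed layout
        stamp = parts[1] + " " + parts[2]
        if start <= stamp < end and all(word in line for word in words):
            yield line.rstrip("\n")


def scan_gzip(path, start, end, words):
    """Run the same filter over one gzipped log file, decompressing on the fly."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        yield from matching_lines(fh, start, end, words)
```

Every query has to decompress and read every file in the time range, which is exactly why this does not scale to 20 TB.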

What would be a preferable solution (available in Python) for indexing and scanning such big datasets for simple word patterns? Kyoto/Tokyo Cabinet?

UPDATE: (Anonymized) log example follows.

23419 2013-11-27 12:35:59 [INFO] 12772792:ce7429c9d63dc630dce613ccb5a0ae55:201311271235498008010001 func: item uploaded, path=tt6-nas/itemhome174/pool2/20131127/12/35/252273696_12772792.d

23419 2013-11-27 12:35:59 [WARNING] 12772792:ce7429c9d63dc630dce613ccb5a0ae55:201311271235498008010001 parse_zz: no test found: input=

23413 2013-11-27 12:35:59 [INFO] 15417668:a0f5116658f701fd848ac9fec3743c2c:201311271235578010010001 Test ok, funcname = zzz_get_results itemid = 15417668 ay_id = 959 ip = 22.222.22.22 session_id = a0f5116658f701fd848ac9fec3743c2c

23413 2013-11-27 12:35:59 [INFO] 15417668:a0f5116658f701fd848ac9fec3743c2c:201311271235578010010001 calling testfunc with args={'aa': False, 'medid': 15417668, '_objname': 'aa', '_clt_id': '46.238.87.23', '_pvid': '201311271235578010010001', 'limit': 3, '_login': 'aaa',...

23421 2013-11-27 12:35:59 [INFO] 5642372:1ebd76b4b5c43e36323faf846077a881:201311271235592288010001 calling item_get_info with args={'test': False, '_elemid': 5642372, '_session_id': '1ebd76b4b5c43e36323faf846077a881', ..

23421 2013-11-27 12:35:59 [INFO] 031 items:get_item=0. "time": 0.008256 query: url=http://hostname.tld:9603/getfunc?date=20131127&test1=7&limit=0&itemid=56119 body=85419:: NO_DATA
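The basic parsing step for lines like these can be sketched as follows. This is a simplified illustration rather than my actual parser: the field names (`proc`, `timestamp`, `level`, `message`) and the function `parse_line` are made up here, and I am assuming the leading number is some process or worker identifier.

```python
import re
from datetime import datetime

# Assumed layout: "<number> <date> <time> [<LEVEL>] <message>"
LOG_LINE = re.compile(
    r"^(?P<proc>\d+) "
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"\[(?P<level>[A-Z]+)\] "
    r"(?P<message>.*)$"
)


def parse_line(line):
    """Split one log line into fields; return None if it does not match."""
    match = LOG_LINE.match(line.rstrip("\n"))
    if match is None:
        return None
    return {
        "proc": int(match.group("proc")),  # meaning of this field is assumed
        "timestamp": datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S"),
        "level": match.group("level"),
        "message": match.group("message"),
    }
```

The resulting dictionaries can be collected into a list and turned into a Pandas DataFrame, with the full line kept in `message` for later text search.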

Solution

I found the solution:

http://swtch.com/~rsc/regexp/regexp4.html

https://code.google.com/p/codesearch/

I can't believe nobody here knew this; according to Russ Cox, an index of n-grams is "an old information retrieval trick".
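To illustrate the idea, here is a minimal in-memory sketch of a trigram index in Python. It is not the codesearch implementation (which is written in Go and also handles regexes); the class `TrigramIndex` and its methods are hypothetical names, and a "document" here stands for whatever unit gets indexed, for example one gzipped log file or one block of lines. The index maps every 3-character substring to the set of documents containing it, so a word query only has to open the documents that contain all of the word's trigrams and then confirm the match there.

```python
from collections import defaultdict


def trigrams(text):
    """Return the set of all 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


class TrigramIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # trigram -> ids of documents containing it
        self.docs = {}                    # doc id -> text (kept in memory for this sketch only)

    def add(self, doc_id, text):
        self.docs[doc_id] = text
        for gram in trigrams(text):
            self.postings[gram].add(doc_id)

    def candidates(self, word):
        """Documents that contain every trigram of word (a superset of the true matches)."""
        grams = trigrams(word)
        if not grams:
            # Words shorter than 3 characters cannot be narrowed by the index.
            return set(self.docs)
        return set.intersection(*(self.postings.get(g, set()) for g in grams))

    def search(self, words):
        """Ids of documents containing all words, verified by an actual substring check."""
        found = set(self.docs)
        for word in words:
            found &= self.candidates(word)
        # The trigram filter can give false positives, so confirm on the real text.
        return sorted(d for d in found if all(w in self.docs[d] for w in words))
```

For the time-range part of the query, one option would be to index each day's or hour's log file as a separate document, so the date restriction becomes a filter on document ids before the trigram lookup.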

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow