Question

I'm trying to do some data mining on a log file. It's a flat file with a massive list of events per line, and the file itself can reach upwards of 500 MB. Each line is a comma-separated, variable-width list of events, and each event contains data that pertains specifically to that event.

I've gone through a few iterations and really haven't been able to decide how I'd like the data to end up (normalized or de-normalized), whether I want to pre-process the data, post-process it after it's in the database, or do something else entirely.

Things I've used so far: sed + awk, C#, g(awk), Ruby, Postgres.

Things I've considered: possibly a NoSQL database? Any other ideas?

Ultimately, I've used each of these tools to make a single "pass" through the file and output another file that has a fixed number of columns (30) on every line. After that I've been using Postgres: I created one massive Postgres table with 30 columns, and I can quickly import that file into the table using a simple COPY command (basically a bulk copy insert).
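Roughly, the load step looks like this (the table name, column names, and file path below are placeholders rather than my real schema):

```sql
-- Staging table: one column per position in the pre-processed file.
-- Placeholder names; the real table has 30 columns.
CREATE TABLE events_raw (
    event_time  timestamp,
    event_type  text,
    col_03      text,
    -- ... columns 4 through 29 ...
    col_30      text
);

-- Bulk-load the pre-processed, fixed-column CSV in one shot.
COPY events_raw FROM '/path/to/preprocessed.csv' WITH (FORMAT csv);
```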

Annoyances: the data is totally de-normalized. I've basically got a massive dump of data in one table that I can certainly query to get at the data I need, but that massive 30-column table is testing my sensibilities.

Questions:

- Would you attempt to normalize the data? If so, what are your thoughts on it?
- Would you do post-processing of the 30-column table (something like the sketch below)?
- Or pre-processing before inserting it into the database at all?
- Any other ideas?
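To make the post-processing question concrete, here is the kind of thing I have in mind, using the same made-up table and column names as above (the real events obviously have more fields):

```sql
-- Hypothetical normalization step run after the bulk COPY into events_raw:
-- pull the repeated event-type strings into a lookup table and keep a slimmer fact table.

CREATE TABLE event_types (
    id   serial PRIMARY KEY,
    name text UNIQUE NOT NULL
);

CREATE TABLE events (
    id            bigserial PRIMARY KEY,
    event_time    timestamp NOT NULL,
    event_type_id integer REFERENCES event_types (id),
    payload       text
);

-- Populate the lookup table from the de-normalized dump.
INSERT INTO event_types (name)
SELECT DISTINCT event_type
FROM events_raw;

-- Move the rows across, replacing the repeated text with a foreign key.
INSERT INTO events (event_time, event_type_id, payload)
SELECT r.event_time, t.id, r.col_03
FROM events_raw r
JOIN event_types t ON t.name = r.event_type;
```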

Was it helpful?

Solution

Have you tried looking at Logstash or Splunk?

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow