Question

I have a huge set of log lines and I need to parse each line (so efficiency is very important).

Each log line is of the form

cust_name time_start time_end (IP or URL )*

So a customer name, two times, and a possibly empty list of IP addresses or URLs separated by semicolons. If there is only one IP or URL in the last list, there is no separator. If there is more than one, they are separated by semicolons.

I need a way to parse this line and read it into a data structure. time_start or time_end could be either system time or GMT. cust_name could also have multiple strings separated by spaces.

I can do this by reading character by character and essentially writing my own parser. Is there a better way to do this?


Solution

OTHER TIPS

I've had success with Boost Tokenizer for this sort of thing. It helps you break an input stream into tokens with custom separators between the tokens.

Using regular expressions (boost::regex is a nice implementation for C++), you can easily separate the different parts of your string (cust_name, time_start, ...) and find all the URLs and IPs.

The second step is more detailed parsing of those groups, if needed. Dates, for example, can be parsed with the Boost.Date_Time library (writing a custom parser if the string format isn't standard).
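
A minimal sketch of the first step, using std::regex from the standard library as a stand-in for boost::regex (the two have similar interfaces). The names LogFields, kTimestamp and match_line are hypothetical, and the timestamp pattern is purely an assumption: the question does not say what the timestamps look like, so it would need adjusting to the real format.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Raw fields of one log line, kept as strings for a later, more detailed pass.
struct LogFields {
    std::string cust_name;   // may contain spaces
    std::string time_start;
    std::string time_end;
    std::string targets;     // raw "a;b;c" list, empty if the line has none
};

// ASSUMPTION: timestamps look like 2009-03-01T10:00:00 with an optional 'Z'.
// The real format is not given in the question; adjust this to match it.
static const std::string kTimestamp =
    "\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}Z?";

// Returns true and fills `out` if the whole line matches the assumed layout.
bool match_line(const std::string& line, LogFields& out) {
    // Built once and reused: constructing a regex is comparatively expensive.
    static const std::regex pattern(
        "(.+?)\\s+(" + kTimestamp + ")\\s+(" + kTimestamp +
        ")(?:\\s+(\\S+))?\\s*");
    std::smatch m;
    if (!std::regex_match(line, m, pattern)) return false;
    out.cust_name = m[1].str();
    out.time_start = m[2].str();
    out.time_end = m[3].str();
    out.targets = m[4].matched ? m[4].str() : std::string();
    return true;
}

// Small usage example with a made-up log line.
void example_usage() {
    LogFields f;
    bool ok = match_line(
        "Acme Corp 2009-03-01T10:00:00Z 2009-03-01T10:05:00Z 10.0.0.1;example.com", f);
    assert(ok && f.cust_name == "Acme Corp");
    assert(f.targets == "10.0.0.1;example.com");
}
```

Because the timestamps are recognizable by shape, a multi-word cust_name does not confuse the match. Regex engines can be slow in some standard library implementations, so it is worth measuring on real data if efficiency matters.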

Why do you want to do this in C++? It sounds like an obvious job for something like Perl.

Consider using a Regular Expressions library...

Custom input demands a custom parser, especially if you want efficiency. Otherwise, you can only pray for an ideal world where errors don't exist. Posting some code may help.

For such a simple grammar you can use split; take a look at http://www.boost.org/doc/libs/1_38_0/doc/html/string_algo/usage.html#id4002194
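
The split mentioned above comes from Boost's string algorithms. As a rough standard-library equivalent for the semicolon-separated list, std::getline with a custom delimiter does a similar job. The helper name split_targets is hypothetical, and skipping empty pieces is a choice made for this sketch.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Splits "a;b;c" into {"a", "b", "c"}. An empty input yields an empty vector,
// and a single item without any separator yields a one-element vector.
std::vector<std::string> split_targets(const std::string& list) {
    std::vector<std::string> items;
    std::istringstream in(list);
    std::string item;
    while (std::getline(in, item, ';')) {
        if (!item.empty()) {   // ignore empty pieces such as a trailing ';'
            items.push_back(item);
        }
    }
    return items;
}

// Small usage example with made-up values.
void example_usage() {
    std::vector<std::string> items = split_targets("10.0.0.1;example.com");
    assert(items.size() == 2);
    assert(items[0] == "10.0.0.1");
}
```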

UPDATE: changed answer drastically!

I have a huge set of log lines and I need to parse each line (so efficiency is very important).

Just be aware that C++ won't help much in terms of efficiency in this situation. Don't be fooled into thinking that fast parsing code in C++ means your program will have high performance!

The efficiency you really need here is not the performance at the "machine code" level of the parsing code, but at the overall algorithm level.

Think about what you're trying to do.
You have a huge text file, and you want to convert each line to a data structure.

Storing a huge data structure in memory is very inefficient, no matter what language you're using!

What you need to do is fetch one line at a time, convert it to a data structure, and deal with it. Only after you're done with that data structure do you fetch the next line, convert it, deal with it, and repeat.

If you do that, you've already solved the major bottleneck.
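
The loop described above might look roughly like this in C++. The name for_each_line and the handler callback are hypothetical; the handler stands for whatever "convert and deal with it" means in your program. Skipping blank lines is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Reads the stream one line at a time and hands each line to `handle`.
// Only the current line is held in memory; the buffer is reused between lines.
// Returns the number of non-empty lines processed.
template <typename Handler>
std::size_t for_each_line(std::istream& in, Handler handle) {
    std::string line;
    std::size_t count = 0;
    while (std::getline(in, line)) {
        if (line.empty()) continue;  // ASSUMPTION: blank lines carry no record
        handle(line);                // parse and consume the record here
        ++count;
    }
    return count;
}

// Small usage example; the string stream stands in for a std::ifstream
// opened on the real log file.
void example_usage() {
    std::istringstream log("first line\n\nsecond line\n");
    std::size_t total_chars = 0;
    std::size_t n = for_each_line(
        log, [&](const std::string& l) { total_chars += l.size(); });
    assert(n == 2);
    assert(total_chars == 21);
}
```

Memory use then depends on the longest line and on whatever the handler chooses to keep, not on the size of the file.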

For parsing the line of text, the format of your data seems quite simple. Check out a similar question that I asked a while ago: C++ string parsing (python style)

In your case, I suppose you could use a string stream, and use the >> operator to read the next "thing" in the line.

See this answer for example code.
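
As a rough sketch of the string stream idea (not the code from the linked answer), the >> operator can pull out the whitespace-separated fields one by one. LogRecord and parse_record are hypothetical names. This sketch assumes that cust_name is a single word and that each time is a single token without spaces; a multi-word cust_name, which the question allows, would need extra logic to find where the name ends.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

struct LogRecord {
    std::string cust_name;             // ASSUMPTION: a single word
    std::string time_start;            // ASSUMPTION: one token, kept as text
    std::string time_end;
    std::vector<std::string> targets;  // IPs or URLs, possibly empty
};

// Returns true and fills `out` if the three mandatory fields are present.
bool parse_record(const std::string& line, LogRecord& out) {
    std::istringstream in(line);
    LogRecord rec;
    if (!(in >> rec.cust_name >> rec.time_start >> rec.time_end)) {
        return false;  // fewer than three fields
    }
    std::string list;
    if (in >> list) {  // the trailing list is optional
        std::istringstream parts(list);
        std::string item;
        while (std::getline(parts, item, ';')) {
            if (!item.empty()) rec.targets.push_back(item);
        }
    }
    out = rec;
    return true;
}

// Small usage example with a made-up log line.
void example_usage() {
    LogRecord r;
    bool ok = parse_record("acme 1236000000 1236000300 10.0.0.1;example.com", r);
    assert(ok && r.cust_name == "acme");
    assert(r.targets.size() == 2);
}
```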

Alternatively (I didn't want to delete this part!), if you could write this in Python, it would be much simpler. I don't know your situation (it seems you're stuck with C++), but still:

Look at this presentation for doing these kinds of tasks efficiently using Python generator expressions: http://www.dabeaz.com/generators/Generators.pdf

It's a worthwhile read. At slide 31 he deals with what seems to be something very similar to what you're trying to do.

It'll at least give you some inspiration.
It also demonstrates quite strongly that performance is gained not from the particular string-parsing code, but from the overall algorithm.

You could try a simple lex/yacc or flex/bison grammar to parse this kind of input.

The parser you need sounds really simple. Take a look at this. Any compiled language should be able to parse it at very high speed. Then it's an issue of what data structure you build & save.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow