Question

I have to parse a large log file (2GB) using regular expressions in Python. A regular expression matches the lines in the log file that I am interested in; the file can also contain unwanted data.

Here is a sample from the file:

"#DEBUG:: BFM [L4] 5.4401e+08ps MSG DIR:TX SCB_CB TYPE:DATA_REQ CPortID:'h8 SIZE:'d20 NumSeg:'h0001 Msg_Id:'h00000000"

My regular expression is ".DEBUG.*MSG."

First I split each matching line on whitespace, then insert the resulting "field:value" pairs into an sqlite3 database; but for large files it takes around 10 to 15 minutes to parse the file.

Please suggest the best way to do the above task in minimal time.

Solution

As others have said, profile your code to see why it is slow. The cProfile module, in conjunction with the gprof2dot tool, can produce nice, readable information.
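
For reference, a minimal way to capture and inspect a profile with the standard library might look like the following; the parse_log function and the file names are placeholders for your own code:

import cProfile
import pstats

# Run the parsing entry point under the profiler and save the raw stats to a file
cProfile.run("parse_log('big_honkin_file.dat')", "parse.prof")

# Show the 20 most expensive calls by cumulative time
stats = pstats.Stats("parse.prof")
stats.sort_stats("cumulative").print_stats(20)

The saved stats file can then be rendered as a call-graph picture with gprof2dot and Graphviz, typically with something like gprof2dot -f pstats parse.prof | dot -Tpng -o parse.png.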

Without seeing your slow code, I can guess a few things that might help:

First, you can probably get away with using the built-in string methods instead of a regex; this might be marginally quicker. If you do need a regex, it's worthwhile precompiling it outside the main loop with re.compile.
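
As a rough sketch of both points (the pattern is the one from the question; the file name is a placeholder):

import re

# Compile the pattern once, outside the loop
debug_msg_re = re.compile(r".DEBUG.*MSG.")

for line in open("big_honkin_file.dat"):
    # Plain string methods are usually cheaper than the regex for this test
    if not (line.startswith("#DEBUG") and "MSG" in line):
        continue
    # Equivalent test with the precompiled regex, if you prefer it:
    # if not debug_msg_re.search(line):
    #     continue
    ...  # parse the line here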

Second, don't do one INSERT query per line; instead, do the insertions in batches, e.g. add the parsed info to a list and, when it reaches a certain size, perform one INSERT query with the executemany method.

Some incomplete code, as an example of the above:

import fileinput

parsed_info = []
for linenum, line in enumerate(fileinput.input()):
    if not line.startswith("#DEBUG"):
        continue # Skip line

    msg = line.partition("MSG")[2] # Get everything after "MSG" (index 2 is the tail)
    words = msg.split() # Split into whitespace-separated words
    info = {}
    for w in words:
        k, _, v = w.partition(":") # Split each word on first :
        info[k] = v

    parsed_info.append(info)

    if linenum % 10000 == 0: # or maybe: if len(parsed_info) > 500:
        # Insert everything in parsed_info to database
        ...
        parsed_info = [] # Clear
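
The elided insert step could look roughly like the following, assuming one sqlite3 connection opened before the loop and a hypothetical messages table; the column names are taken from the sample line and are only illustrative:

import sqlite3

conn = sqlite3.connect("log.db")  # open a single connection up front, not one per insert
conn.execute("CREATE TABLE IF NOT EXISTS messages (dir TEXT, type TEXT, size TEXT)")

def flush(batch):
    # One multi-row INSERT via executemany instead of one query per line
    conn.executemany(
        "INSERT INTO messages (dir, type, size) VALUES (?, ?, ?)",
        [(d.get("DIR"), d.get("TYPE"), d.get("SIZE")) for d in batch],
    )
    conn.commit()

Whichever batching condition you use, remember to flush whatever is still left in parsed_info once the loop finishes, otherwise the last partial batch never reaches the database.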

OTHER TIPS

Paul's answer makes sense: you first need to understand where you "lose" time. The easiest way, if you don't have a profiler, is to record a timestamp in milliseconds before and after each "step" of your algorithm (opening the file, reading it line by line and, within that, the time taken by the split / regexp to recognise the debug lines, inserting into the DB, etc.).
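
A minimal sketch of that idea, accumulating per-step time inside the read loop (the file name and the steps are placeholders; time.perf_counter supplies the timestamps):

import time

read_ms = parse_ms = insert_ms = 0.0

t = time.perf_counter()
for line in open("big_honkin_file.dat"):
    read_ms += (time.perf_counter() - t) * 1000    # time spent fetching the line

    t = time.perf_counter()
    fields = line.split()                          # the split / regexp step
    parse_ms += (time.perf_counter() - t) * 1000

    t = time.perf_counter()
    # ... insert into the DB here ...
    insert_ms += (time.perf_counter() - t) * 1000

    t = time.perf_counter()

print("read:", read_ms, "parse:", parse_ms, "insert:", insert_ms, "(ms)")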

Without further knowledge of your code, there are possible "traps" that would be very time consuming:

- opening the log file several times
- opening the DB every time you need to insert data, instead of opening one connection and then writing as you go

"The best way to do the above task in minimal time" is to first figure out where the time is going. Look into how to profile your Python script to find what parts are slow. You may have an inefficient regex. Writing to sqlite may be the problem. But there are no magic bullets - in general, processing 2GB of text line by line, with a regex, in Python, is probably going to run in minutes, not seconds.

Here is a test script that will show how long it takes to read a file, line by line, and do nothing else:

from datetime import datetime

start = datetime.now()
with open("big_honkin_file.dat") as f:
    for line in f:
        pass
end = datetime.now()
print(end - start)
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow