Question

I have to parse a large log file (2GB) using regular expressions in Python. A regular expression matches the lines in the log file that I am interested in; the file can also contain unwanted data.

Here is a sample from the file:

"#DEBUG:: BFM [L4] 5.4401e+08ps MSG DIR:TX SCB_CB TYPE:DATA_REQ CPortID:'h8 SIZE:'d20 NumSeg:'h0001 Msg_Id:'h00000000"

My regular expression is ".DEBUG.*MSG."

First I split each matching line on whitespace, then insert the resulting "field:value" pairs into an sqlite3 database; but for large files it takes around 10 to 15 minutes to parse the file.

Please suggest the best way to do the above task in minimal time.

Solution

As others have said, profile your code to see why it is slow. The cProfile module, in conjunction with the gprof2dot tool, can produce nice, readable information.
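
For reference, a minimal way to capture and inspect a profile with the standard library might look like the following; the parse_log function and the file names are placeholders for your own code:

import cProfile
import pstats

# Run the parsing entry point under the profiler and save the raw stats to a file
cProfile.run("parse_log('big_honkin_file.dat')", "parse.prof")

# Show the 20 most expensive calls by cumulative time
stats = pstats.Stats("parse.prof")
stats.sort_stats("cumulative").print_stats(20)

The saved stats file can then be rendered as a call-graph picture with gprof2dot and Graphviz, typically with something like gprof2dot -f pstats parse.prof | dot -Tpng -o parse.png.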

Without seeing your slow code, I can guess a few things that might help:

First, you can probably get away with using the built-in string methods instead of a regex; this might be marginally quicker. If you do need a regex, it's worthwhile precompiling it outside the main loop with re.compile.
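
As a rough sketch of both points (the pattern is the one from the question; the file name is a placeholder):

import re

# Compile the pattern once, outside the loop
debug_msg_re = re.compile(r".DEBUG.*MSG.")

for line in open("big_honkin_file.dat"):
    # Plain string methods are usually cheaper than the regex for this test
    if not (line.startswith("#DEBUG") and "MSG" in line):
        continue
    # Equivalent test with the precompiled regex, if you prefer it:
    # if not debug_msg_re.search(line):
    #     continue
    ...  # parse the line here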

Second, don't do one INSERT query per line; instead, do the insertions in batches, e.g. add the parsed info to a list and, when it reaches a certain size, perform one INSERT query with the executemany method.

Some incomplete code, as an example of the above:

import fileinput

parsed_info = []
for linenum, line in enumerate(fileinput.input()):
    if not line.startswith("#DEBUG"):
        continue # Skip line

    msg = line.partition("MSG")[2] # Get everything after "MSG" (index 2 is the tail)
    words = msg.split() # Split into whitespace-separated words
    info = {}
    for w in words:
        k, _, v = w.partition(":") # Split each word on first :
        info[k] = v

    parsed_info.append(info)

    if linenum % 10000 == 0: # or maybe: if len(parsed_info) > 500:
        # Insert everything in parsed_info to database
        ...
        parsed_info = [] # Clear
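
The elided insert step could look roughly like the following, assuming one sqlite3 connection opened before the loop and a hypothetical messages table; the column names are taken from the sample line and are only illustrative:

import sqlite3

conn = sqlite3.connect("log.db")  # open a single connection up front, not one per insert
conn.execute("CREATE TABLE IF NOT EXISTS messages (dir TEXT, type TEXT, size TEXT)")

def flush(batch):
    # One multi-row INSERT via executemany instead of one query per line
    conn.executemany(
        "INSERT INTO messages (dir, type, size) VALUES (?, ?, ?)",
        [(d.get("DIR"), d.get("TYPE"), d.get("SIZE")) for d in batch],
    )
    conn.commit()

Whichever batching condition you use, remember to flush whatever is still left in parsed_info once the loop finishes, otherwise the last partial batch never reaches the database.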

OTHER TIPS

Paul's answer makes sense: you first need to understand where you "lose" time. The easiest way, if you don't have a profiler, is to record a timestamp in milliseconds before and after each "step" of your algorithm (opening the file, reading it line by line and, within that, the time taken by the split / regexp to recognise the debug lines, inserting into the DB, etc.).
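
A minimal sketch of that idea, accumulating per-step time inside the read loop (the file name and the steps are placeholders; time.perf_counter supplies the timestamps):

import time

read_ms = parse_ms = insert_ms = 0.0

t = time.perf_counter()
for line in open("big_honkin_file.dat"):
    read_ms += (time.perf_counter() - t) * 1000    # time spent fetching the line

    t = time.perf_counter()
    fields = line.split()                          # the split / regexp step
    parse_ms += (time.perf_counter() - t) * 1000

    t = time.perf_counter()
    # ... insert into the DB here ...
    insert_ms += (time.perf_counter() - t) * 1000

    t = time.perf_counter()

print("read:", read_ms, "parse:", parse_ms, "insert:", insert_ms, "(ms)")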

Without further knowledge of your code, there are possible "traps" that would be very time consuming:

- opening the log file several times
- opening the DB every time you need to insert data, instead of opening one connection and then writing as you go

"The best way to do the above task in minimal time" is to first figure out where the time is going. Look into how to profile your Python script to find what parts are slow. You may have an inefficient regex. Writing to sqlite may be the problem. But there are no magic bullets - in general, processing 2GB of text line by line, with a regex, in Python, is probably going to run in minutes, not seconds.

Here is a test script that will show how long it takes to read a file, line by line, and do nothing else:

from datetime import datetime

start = datetime.now()
with open("big_honkin_file.dat") as f:
    for line in f:
        pass
end = datetime.now()
print(end - start)
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow