Question

Right now I have a log parser reading through 515MB of plain-text files (one file for each day over the past 4 years). My code currently stands as this: http://gist.github.com/12978. I've used psyco (as seen in the code) and I'm also compiling it and using the compiled version. It's processing about 100 lines every 0.3 seconds. The machine is a standard 15" MacBook Pro (2.4GHz Core 2 Duo, 2GB RAM).

Is it possible for this to go faster, or is that a limitation of the language/database?


Solution

Don't waste time profiling. The time is always in the database operations. Do as few as possible. Just the minimum number of inserts.

Three Things.

One. Don't SELECT over and over again to conform the Date, Hostname and Person dimensions. Fetch all the data ONCE into a Python dictionary and use it in memory. Don't do repeated singleton selects. Use Python.
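
For example, a minimal sketch of the fetch-once approach (the table and column names here are hypothetical; adjust them to the real schema):

# Load each lookup table once, then resolve ids from plain dicts in memory.
cursor.execute("SELECT name, id FROM hostnames")
hostname_ids = dict(cursor.fetchall())

cursor.execute("SELECT name, id FROM people")
person_ids = dict(cursor.fetchall())

cursor.execute("SELECT day, id FROM dates")
date_ids = dict(cursor.fetchall())

# Inside the parsing loop, a dict lookup replaces a singleton SELECT:
person_id = person_ids[person_name]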

Two. Don't Update.

Specifically, do not do this. It's bad code for two reasons.

cursor.execute("UPDATE people SET chats_count = chats_count + 1 WHERE id = '%s'" % person_id)

It can be replaced with a simple SELECT COUNT(*) FROM ... . Never update to increment a count. Just count the rows that are there with a SELECT statement. [If you can't do this with a simple SELECT COUNT or SELECT COUNT(DISTINCT), you're missing some data -- your data model should always provide correct, complete counts. Never update.]

And. Never build SQL using string substitution. Completely dumb.

If, for some reason the SELECT COUNT(*) isn't fast enough (benchmark first, before doing anything lame) you can cache the result of the count in another table. AFTER all of the loads. Do a SELECT COUNT(*) FROM whatever GROUP BY whatever and insert this into a table of counts. Don't Update. Ever.
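
A rough sketch of that load-then-summarize step (it assumes a separate chat_counts summary table, which is hypothetical):

# After all of the loads: compute the counts in one pass and store them.
cursor.execute("SELECT person_id, COUNT(*) FROM chats GROUP BY person_id")
rows = cursor.fetchall()

cursor.executemany(
    "INSERT INTO chat_counts (person_id, chats_count) VALUES (%(pid)s, %(n)s)",
    [{'pid': pid, 'n': n} for pid, n in rows],
)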

Three. Use Bind Variables. Always.

cursor.execute( "INSERT INTO ... VALUES( %(x)s, %(y)s, %(z)s )", {'x':person_id, 'y':time_to_string(time), 'z':channel,} )

The SQL never changes. The values bound in change, but the SQL never changes. This is MUCH faster. Never build SQL statements dynamically. Never.

OTHER TIPS

Use bind variables instead of literal values in the SQL statements and create a cursor for each unique SQL statement so that the statement does not need to be reparsed the next time it is used. From the Python DB API doc:

Prepare and execute a database operation (query or command). Parameters may be provided as sequence or mapping and will be bound to variables in the operation. Variables are specified in a database-specific notation (see the module's paramstyle attribute for details). [5]

A reference to the operation will be retained by the cursor. If the same operation object is passed in again, then the cursor can optimize its behavior. This is most effective for algorithms where the same operation is used, but different parameters are bound to it (many times).

ALWAYS ALWAYS ALWAYS use bind variables.

In the for loop, you're inserting into the 'chats' table repeatedly, so you only need a single sql statement with bind variables, to be executed with different values. So you could put this before the for loop:

insert_statement="""
    INSERT INTO chats(person_id, message_type, created_at, channel)
    VALUES(:person_id,:message_type,:created_at,:channel)
"""

Then in place of each sql statement you execute put this in place:

cursor.execute(insert_statement, person_id='person', message_type='msg', created_at=some_date, channel=3)

This will make things run faster because:

  1. The cursor object won't have to reparse the statement each time
  2. The db server won't have to generate a new execution plan as it can use the one it created previously.
  3. You won't have to call sanitize(), as special characters in the bind variables won't be part of the SQL statement that gets executed.

Note: The bind variable syntax I used is Oracle specific. You'll have to check the psycopg2 library's documentation for the exact syntax.
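
For reference, psycopg2's paramstyle is pyformat (%(name)s placeholders), so the same insert would look roughly like this (column names assumed from the snippet above):

insert_statement = """
    INSERT INTO chats (person_id, message_type, created_at, channel)
    VALUES (%(person_id)s, %(message_type)s, %(created_at)s, %(channel)s)
"""

cursor.execute(insert_statement, {
    'person_id': person_id,
    'message_type': 'msg',
    'created_at': some_date,
    'channel': 3,
})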

Other optimizations:

  1. You're incrementing with the "UPDATE people SET chats_count" after each loop iteration. Keep a dictionary mapping user to chat count, then execute a single statement per user with the total number you've seen (see the sketch after this list). This will be faster than hitting the db after every record.
  2. Use bind variables on ALL your queries. Not just the insert statement; I chose that as an example.
  3. Change all the find_*() functions that do db lookups to cache their results so they don't have to hit the db every time.
  4. psyco optimizes Python programs that perform a large number of numeric operations. The script is IO expensive and not CPU expensive, so I wouldn't expect it to give you much, if any, optimization.
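
Here's a sketch of the first point: count in memory and hit the database once per person after the loop. It reuses the people table from the question; log_file is a stand-in for however the lines are read, and find_person stands in for whichever find_*() helper resolves the person.

from collections import Counter

chat_counts = Counter()

for line in log_file:
    # ... parse the line and INSERT into chats as before ...
    person_id = find_person(line)
    chat_counts[person_id] += 1   # count in memory instead of an UPDATE per row

# One UPDATE per person after the loop, instead of one per log line.
for pid, n in chat_counts.items():
    cursor.execute(
        "UPDATE people SET chats_count = chats_count + %(n)s WHERE id = %(id)s",
        {'n': n, 'id': pid},
    )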

As Mark suggested, use bind variables. The database only has to prepare each statement once, then "fill in the blanks" for each execution. As a nice side effect, it will automatically take care of string-quoting issues (which your program isn't handling).

Turn transactions on (if they aren't already) and do a single commit at the end of the program. The database won't have to write anything to disk until all the data needs to be committed. And if your program encounters an error, none of the rows will be committed, allowing you to simply re-run the program once the problem has been corrected.
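
If the script uses psycopg2 (as the earlier note suggests), transactions are already on by default and nothing is committed until you say so. A minimal sketch (the connection string and loop body are hypothetical):

import psycopg2

conn = psycopg2.connect("dbname=chatlogs")   # hypothetical connection string
cursor = conn.cursor()

try:
    for line in log_file:
        # ... parse and INSERT; nothing is made durable yet ...
        pass
    conn.commit()      # one commit for the whole run
except Exception:
    conn.rollback()    # nothing was written, safe to fix and re-run
    raise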

Your log_hostname, log_person, and log_date functions are doing needless SELECTs on the tables. Make the appropriate table attributes PRIMARY KEY or UNIQUE. Then, instead of checking for the presence of the key before you INSERT, just do the INSERT. If the person/date/hostname already exists, the INSERT will fail from the constraint violation. (This won't work if you use a transaction with a single commit, as suggested above.)

Alternatively, if you know you're the only one INSERTing into the tables while your program is running, then create parallel data structures in memory and maintain them while you do your INSERTs. For example, read in all the hostnames from the table into an associative array at the start of the program. When you want to know whether to do an INSERT, just do an array lookup. If no entry is found, do the INSERT and update the array appropriately. (This suggestion is compatible with transactions and a single commit, but requires more programming. It'll be wickedly faster, though.)
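
A sketch of that lookup-or-insert pattern for hostnames (hypothetical table and column names; hostname_ids is the dict read from the table at startup, and the RETURNING clause assumes PostgreSQL):

def get_hostname_id(cursor, name):
    """Return the id for a hostname, inserting a row only when it's new."""
    if name not in hostname_ids:
        cursor.execute(
            "INSERT INTO hostnames (name) VALUES (%(name)s) RETURNING id",
            {'name': name},
        )
        hostname_ids[name] = cursor.fetchone()[0]   # keep the dict in sync
    return hostname_ids[name]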

In addition to the many fine suggestions @Mark Roddy has given, do the following:

  • don't use readlines, you can iterate over file objects
  • try to use executemany rather than execute: do batch inserts rather than single inserts; this tends to be faster because there's less overhead. It also reduces the number of commits
  • str.rstrip will work just fine instead of stripping off the newline with a regex

Batching the inserts will use more memory temporarily, but that should be fine as long as you don't read the whole file into memory. A combined sketch of the three tips follows.
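
In this sketch, log_path, parse_line, and insert_statement are hypothetical; parse_line is assumed to return a dict of bind values matching insert_statement's placeholders.

batch = []

with open(log_path) as log_file:
    for line in log_file:                        # iterate lazily, no readlines()
        batch.append(parse_line(line.rstrip('\n')))

        if len(batch) >= 1000:                   # flush every 1000 rows
            cursor.executemany(insert_statement, batch)
            batch = []

if batch:                                        # flush the remainder
    cursor.executemany(insert_statement, batch)

conn.commit()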

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow