Question

I'm building a script to process web server logs, and I'm trying to incorporate MaxMind's IP dataset (http://dev.maxmind.com/geoip/legacy/geolite/) into the script in order to get the country each hit is coming from.

Currently, my script works fine when it just extracts the information I want, but when I add IP lookups it slows down a lot: by about 1800%. So I'm curious whether this is caused by my code, or whether there's a way to speed it up.

For example, the following code, which extracts the date and IP address, took about 6.5 seconds to run in this experiment.

extractedData = []

for log in logList:
    ip = log[-1]
    date = log[0]
    dateIP = [date, ip]
    extractedData.append(dateIP)

When I add pygeoip and try to incorporate the country code, it slows down. The following code took 2 minutes and 7 seconds to run.

extractedData = []

gi = pygeoip.GeoIP('/path/to/GeoIP.dat') 

for log in logList:
    ip = log[-1]
    country = gi.country_name_by_addr(ip)
    date = log[0]
    dateCountry = [date, country]
    extractedData.append(dateCountry)
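
For reference, timings like the ones above are simple wall-clock measurements. A minimal sketch of how such a measurement can be taken, using the standard library's time.perf_counter (illustrative only, not the exact harness I used):

import time

start = time.perf_counter()
extractedData = []
for log in logList:
    extractedData.append([log[0], log[-1]])
# Report how long the extraction loop took.
print('elapsed: %.1f seconds' % (time.perf_counter() - start))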

So, is there a way to speed this up? This lookup slows the process down too much.

Thanks!


Solution

Since you're doing many queries, you should load the database into memory. As it stands, you're repeatedly reading from the disk, which is painfully slow.

Replace this line:

gi = pygeoip.GeoIP('/path/to/GeoIP.dat') 

with this:

gi = pygeoip.GeoIP('/path/to/GeoIP.dat', pygeoip.MEMORY_CACHE) 
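
A further optimization worth considering: web server logs usually contain many hits from the same address, so memoizing the lookup avoids repeating work for duplicate IPs. A minimal sketch, assuming Python 3; functools.lru_cache is from the standard library, the maxsize of 100000 is an arbitrary choice, and country_for is a hypothetical helper name (the pygeoip calls are the same ones used above):

import functools
import pygeoip

# Load the whole database into memory once, as suggested above.
gi = pygeoip.GeoIP('/path/to/GeoIP.dat', pygeoip.MEMORY_CACHE)

@functools.lru_cache(maxsize=100000)
def country_for(ip):
    # Repeated addresses are answered from the cache, skipping the database lookup.
    return gi.country_name_by_addr(ip)

extractedData = []
for log in logList:
    extractedData.append([log[0], country_for(log[-1])])

With many duplicate addresses, each unique IP is resolved only once, which can cut the loop time well below even the memory-cached figure.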

Other tips

I had the same problem, but tried both Python and PHP on a CentOS box. Running 3M IP addresses through a Python script took 19.5 minutes. Applying the MEMORY_CACHE optimization brought it down to 8 minutes. Running the same data through a PHP script took about 2 2/3 minutes.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow