Question

I have a text file that has 120000 lines, where every line has exactly this format: ean_code;plu;name;price;state

I tried various operations, including working with the file directly, and the best results came from loading the file into memory line by line with readlines() and writing it to a list (at the start of the program).

So I have these two lines:

matcher = re.compile('^(?:'+eanic.strip()+'(?:;|$)|[^;]*;'+eanic.strip()+'(?:;|$))').match
line = [next(l.split(';') for l in list if matcher(l))]
# do something with line...

What these lines try to accomplish is to find (as fast as possible) a plu/ean given by user input in either the ean_code or plu field.

I am particularly interested in the second line, as it impacts my performance on a WinCE device (PyCE port of Python 2.5).

I tried every solution I could think of to make it faster, but this is the fastest way I found to iterate through the list and find the match that the compiled pattern produces.

Is there any faster way, other than a for/list comprehension, to iterate over a big list (120000 lines in my case)?

I am looking for any approach, with any kind of data structure (supported up to Python 2.5), that will give me a faster result than the above two lines...

Just to mention, this is performed on a handheld device (630 MHz ARM) with 256 MB RAM, and no connection besides USB is present. Sadly, using a database is not an option.


Solution

I made a test file and tested a few variations. The fastest way of searching for a static string (as you appear to be doing) while iterating over the file is using "string in line".

However, if you'll be using the loaded data to search more than once (actually more than 30 times according to the test numbers below), it's worth your (computational) time to produce lookup tables for the PLUs and EANs in the form of dicts and use these for future searches.
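The lookup-table idea in miniature (a sketch; the function names build_indexes and lookup are mine, and the field order follows the question's format ean_code;plu;name;price;state):

```python
def build_indexes(lines):
    """Parse each line once and index the fields by EAN and by PLU."""
    by_ean = {}
    by_plu = {}
    for line in lines:
        fields = line.rstrip('\n').split(';')
        by_ean[fields[0]] = fields  # ean_code is the first field
        by_plu[fields[1]] = fields  # plu is the second field
    return by_ean, by_plu

def lookup(code, by_ean, by_plu):
    """Average-case O(1) lookup by either key; returns None if absent."""
    return by_ean.get(code) or by_plu.get(code)
```

Building the dicts costs one pass over the 120000 lines; every search afterwards is a hash lookup instead of a scan.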

loaded 120000 lines
question regex    0.114868402481
simpler regex     0.417045307159
other regex       0.386662817001
startswith        0.236350297928
string in         0.020356798172  <-- iteration winner
dict construction 0.611148500443
dict lookup       0.000002503395  <-- best if you are doing many lookups

Test code follows:

import re
import timeit

def timefunc(function, times, *args):
    def wrap():
        function(*args)
    t = timeit.Timer(wrap)
    return t.timeit(times) / times

def question(lines):
    eanic = "D41RP9"
    matcher = re.compile('^(?:'+eanic.strip()+'(?:;|$)|[^;]*;'+eanic.strip()+'(?:;|$))').match
    line=[next(l.split(';') for l in lines if matcher(l))]
    return line

def splitstart(lines):
    eanic = "D41RP9"
    ret = []
    for l in lines:
        s = l.split(';')
        if s[0].startswith(eanic) or s[1].startswith(eanic):
            ret.append(l)
    return ret

def simpler(lines):
    eanic = "D41RP9"
    matcher = re.compile('(^|;)' + eanic)
    return [l for l in lines if matcher.search(l)]

def better(lines):
    eanic = "D41RP9"
    matcher = re.compile('^(?:' + eanic + '|[^;]*;' + eanic + ')')
    return [l for l in lines if matcher.match(l)]

def strin(lines):
    eanic = "D41RP9"
    return [l for l in lines if eanic in l]

def mkdicts(lines):
    ean = {}
    plu = {}
    for l in lines:
        s = l.split(';')
        ean[s[0]] = s
        plu[s[1]] = s
    return (ean, plu)

def searchdicts(ean, plu):
    eanic = "D41RP9"
    return (ean.get(eanic, None), plu.get(eanic, None))

with open('test.txt', 'r') as f:
    lines = f.readlines()

    print "loaded", len(lines), "lines"
    print "question regex\t", timefunc(question, 10, lines)
    print "simpler regex\t", timefunc(simpler, 10, lines)
    print "other regex\t", timefunc(better, 10, lines)
    print "startswith\t", timefunc(splitstart, 10, lines)
    print "string in\t", timefunc(strin, 10, lines)
    print "dict construction\t", timefunc(mkdicts, 10, lines)
    ean, plu = mkdicts(lines)
    print "dict lookup\t", timefunc(searchdicts, 10, ean, plu)

OTHER TIPS

At first I looked up some modules that are available for Python 2.5:

You can use the csv module to read your data. It could be faster.
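A sketch of that: csv.reader accepts any iterable of lines (an open file object or, as here, an in-memory list), and delimiter=';' matches the question's format. The function name parse_records is mine.

```python
import csv

def parse_records(lines):
    """Split each semicolon-separated record into a list of fields."""
    return [row for row in csv.reader(lines, delimiter=';')]
```

In the real program you would pass the open file object instead of a list.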

You can store your data via the pickle or cPickle module. That way you can store Python objects (like dicts, tuples, ints, and so on). Comparing ints is faster than searching in strings.

You iterate through a list, but you say your data is in a text file. Do not load your whole text file into a list. Maybe the following is fast enough, and then there is no need for the modules mentioned above.

f = open('source.txt', 'r')  # note: Python 2.5 needs "from __future__ import with_statement" for with
stripped_eanic = eanic.strip()
for line in f:
    if stripped_eanic in line:  # the IDs have a maximum length, don't they? So maybe just search in line[:20]
        # run further tests, if you think it is necessary
        if further_tests:  # placeholder for your own validation
            print line
            break
else:
    print "No match"
f.close()

Edit

I thought about what I mentioned above: do not load the whole file into a list. I think that is only true if your search is a one-time procedure and your script exits afterwards. But if you want to search several times, I suggest using dicts (as beerbajay suggested) and cPickle files instead of the text file.

Licensed under: CC-BY-SA with attribution