Question

I have an input file consisting of lines with numbers and word sequences, structured like this:

\1-grams:
number   w1    number
number   w2    number
\2-grams:
number   w1 w2   number
number   w1 w3   number
number   w2 w3   number
\end\

I want to store the word sequences (so-called n-grams) in such a way that I can easily retrieve both numbers for each unique n-gram. What I do now is the following:

import re

all_ngrams = {}
ngrams = {}
for line in open(file):
    line = line.strip()
    m = re.search(r'\\([1-9])-grams:', line)  # find the number of words per sequence
    if m is not None:
        n = int(m.group(1))
        ngrams = all_ngrams[n] = {}  # fresh dict for the new n, registered right away
    else:
        m = re.search(r'(-[0-9]+?\.?[0-9]+)\t([^\t]+)\t?(-[0-9]+\.[0-9]+)?', line)  # numbers and word sequence
        if m is not None:
            ngrams[m.group(2)] = '{0}|{1}'.format(m.group(1), m.group(3))
        # the \end\ line matches neither pattern and is simply ignored

This way I can easily and quite quickly find the numbers for, e.g., the sequence s = 'w1 w2':

all_ngrams[2][s]

The problem is that this loading procedure is rather slow, especially when there are many (>100k) n-grams, and I'm wondering whether there is a faster way to achieve the same result without sacrificing access speed. Am I doing something suboptimal here? Where can I improve?

Thanks in advance,

Joris

Solution

I would try doing fewer regexp searches.

It's worth considering a few other things:

  • Storing all the data in a single dictionary may speed things up; perhaps counterintuitively, the extra layer of hierarchy (one dict per n) doesn't help, because each n-gram string already determines its own n and is a unique key on its own.

  • Storing a tuple lets you avoid calling .format().

  • In CPython, code in functions is faster than global code.

Here's what it might look like:

def load(filename):
    ngrams = {}
    with open(filename) as f:
        for line in f:
            if not line.strip() or line[0] == '\\':
                continue  # skip blank lines and marker lines like \1-grams: and \end\
            # assumes each data line has the form: number, n-gram, number
            first, rest = line.split(None, 1)
            middle, last = rest.rsplit(None, 1)
            ngrams[middle] = first, last
    return ngrams

ngrams = load("ngrams.txt")

I would want to store float(first), float(last) rather than first, last (the numbers in your file are negative decimals, not integers). That would speed up access, but slow down load time. So it depends on your workload.
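A minimal sketch of that trade-off, using the load() function above ('w1 w2' is just the sample sequence from the question):

# String variant: fast load, but the conversion cost is paid on every access.
num1, num2 = ngrams['w1 w2']
num1 = float(num1)

# Float variant: change the assignment inside load() to
#     ngrams[middle] = float(first), float(last)
# so access becomes a plain lookup with no conversion step.
num1, num2 = ngrams['w1 w2']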

I disagree with johnthexii: doing this in pure Python should be much faster than talking to a database, even sqlite, as long as the data set fits in memory. (A file-backed database does let you do the load once and skip it on later runs, so sqlite may end up being exactly what you want; a :memory: database gives you no such persistence.)

OTHER TIPS

Regarding the optimization of your code:

1) Compile the regular expressions once, before the loop; see help(re.compile).

2) Avoid regular expressions where plain string operations suffice. For example, a section header like "\1-grams:" can be recognized with simple string checks instead of a regex (see the sketch below).
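A sketch combining both tips; the data-line pattern is the one from the question, compiled once, and the cheap string checks run before any regex:

import re

# Tip 1: compile the pattern once, outside the loop.
DATA_RE = re.compile(r'(-[0-9]+?\.?[0-9]+)\t([^\t]+)\t?(-[0-9]+\.[0-9]+)?')

def parse(filename):
    all_ngrams, ngrams = {}, {}
    for line in open(filename):
        line = line.strip()
        # Tip 2: plain string checks instead of a regex for header lines.
        if line.startswith('\\') and line.endswith('-grams:'):
            n = int(line[1:-7])  # the digits between "\" and "-grams:"
            ngrams = all_ngrams[n] = {}
        elif line and line != '\\end\\':
            m = DATA_RE.search(line)
            if m is not None:
                ngrams[m.group(2)] = m.group(1), m.group(3)
    return all_ngrams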

Personally I would move to a database (sqlite3 is built into Python) with indexes. Indexes make queries fast. Python's sqlite3 also supports in-memory databases.

You can also supply the special name :memory: to create a database in RAM.
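A minimal sketch of that route; the table and column names here are made up for illustration:

import sqlite3

# ":memory:" keeps the database in RAM; pass a filename instead if you
# want to load once and reuse the database across runs.
con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE ngrams (n INTEGER, gram TEXT PRIMARY KEY, num1 REAL, num2 REAL)')

# The PRIMARY KEY on gram creates an index, which is what makes lookups fast.
con.execute('INSERT INTO ngrams VALUES (?, ?, ?, ?)', (2, 'w1 w2', -0.42, -0.01))
con.commit()

row = con.execute('SELECT num1, num2 FROM ngrams WHERE gram = ?', ('w1 w2',)).fetchone()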

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow