Question

I have an input file consisting of lines with numbers and word sequences, structured like this:

\1-grams:
number   w1    number
number   w2    number
\2-grams:
number   w1 w2   number
number   w1 w3   number
number   w2 w3   number
\end\

I want to store the word sequences (so-called n-grams) in such a way that I can easily retrieve both numbers for each unique n-gram. What I do now is the following:

import re

all_ngrams = {}
ngrams = {}
for line in open(file):
    line = line.strip()
    m = re.search(r'\\([1-9])-grams:', line)  # find the number of words per sequence
    if m is not None:
        n = int(m.group(1))
        ngrams = all_ngrams[n] = {}  # fresh dict for the new n, registered right away
    else:
        m = re.search(r'(-[0-9]+?\.?[0-9]+)\t([^\t]+)\t?(-[0-9]+\.[0-9]+)?', line)  # numbers and word sequence
        if m is not None:
            ngrams[m.group(2)] = '{0}|{1}'.format(m.group(1), m.group(3))
        # the \end\ line matches neither pattern and is simply ignored

This way I can easily and quite quickly find the numbers for, e.g., the sequence s = 'w1 w2':

all_ngrams[2][s]

The problem is that this loading procedure is rather slow, especially when there are many (>100k) n-grams, and I'm wondering whether there is a faster way to achieve the same result without sacrificing access speed. Am I doing something suboptimal here? Where can I improve?

Thanks in advance,

Joris

Solution

I would try doing fewer regexp searches.

It's worth considering a few other things:

  • Storing all the data in a single dictionary may speed things up; perhaps counterintuitively, the extra layer of hierarchy (one dict per n) doesn't help, because each n-gram string already determines its own n and is a unique key on its own.

  • Storing a tuple lets you avoid calling .format().

  • In CPython, code in functions is faster than global code.

Here's what it might look like:

def load(filename):
    ngrams = {}
    with open(filename) as f:
        for line in f:
            if not line.strip() or line[0] == '\\':
                continue  # skip blank lines and marker lines like \1-grams: and \end\
            # assumes each data line has the form: number, n-gram, number
            first, rest = line.split(None, 1)
            middle, last = rest.rsplit(None, 1)
            ngrams[middle] = first, last
    return ngrams

ngrams = load("ngrams.txt")

I would want to store float(first), float(last) rather than first, last (the numbers in your file are negative decimals, not integers). That would speed up access, but slow down load time. So it depends on your workload.
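A minimal sketch of that trade-off, using the load() function above ('w1 w2' is just the sample sequence from the question):

# String variant: fast load, but the conversion cost is paid on every access.
num1, num2 = ngrams['w1 w2']
num1 = float(num1)

# Float variant: change the assignment inside load() to
#     ngrams[middle] = float(first), float(last)
# so access becomes a plain lookup with no conversion step.
num1, num2 = ngrams['w1 w2']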

I disagree with johnthexii: doing this in pure Python should be much faster than talking to a database, even sqlite, as long as the data set fits in memory. (A file-backed database does let you do the load once and skip it on later runs, so sqlite may end up being exactly what you want; a :memory: database gives you no such persistence.)

OTHER TIPS

Regarding the optimization of your code:

1) Compile the regular expressions once, before the loop; see help(re.compile).

2) Avoid regular expressions where plain string operations suffice. For example, a section header like "\1-grams:" can be recognized with simple string checks instead of a regex (see the sketch below).
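A sketch combining both tips; the data-line pattern is the one from the question, compiled once, and the cheap string checks run before any regex:

import re

# Tip 1: compile the pattern once, outside the loop.
DATA_RE = re.compile(r'(-[0-9]+?\.?[0-9]+)\t([^\t]+)\t?(-[0-9]+\.[0-9]+)?')

def parse(filename):
    all_ngrams, ngrams = {}, {}
    for line in open(filename):
        line = line.strip()
        # Tip 2: plain string checks instead of a regex for header lines.
        if line.startswith('\\') and line.endswith('-grams:'):
            n = int(line[1:-7])  # the digits between "\" and "-grams:"
            ngrams = all_ngrams[n] = {}
        elif line and line != '\\end\\':
            m = DATA_RE.search(line)
            if m is not None:
                ngrams[m.group(2)] = m.group(1), m.group(3)
    return all_ngrams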

Personally I would move to a database (sqlite3 is built into Python) with indexes. Indexes make queries fast. Python's sqlite3 also supports in-memory databases.

You can also supply the special name :memory: to create a database in RAM.
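A minimal sketch of that route; the table and column names here are made up for illustration:

import sqlite3

# ":memory:" keeps the database in RAM; pass a filename instead if you
# want to load once and reuse the database across runs.
con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE ngrams (n INTEGER, gram TEXT PRIMARY KEY, num1 REAL, num2 REAL)')

# The PRIMARY KEY on gram creates an index, which is what makes lookups fast.
con.execute('INSERT INTO ngrams VALUES (?, ?, ?, ?)', (2, 'w1 w2', -0.42, -0.01))
con.commit()

row = con.execute('SELECT num1, num2 FROM ngrams WHERE gram = ?', ('w1 w2',)).fetchone()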

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow