I would try doing fewer regexp searches.
It's worth considering a few other things:
Storing all the data in a single dictionary may speed things up; perhaps counterintuitively, adding extra layers of hierarchy to the data doesn't help.
Storing a tuple lets you avoid calling .format(). In CPython, code in functions is faster than global code.
Here's what it might look like:
def load(filename):
    ngrams = {}
    for line in open(filename):
        if line[0] == '\\':
            pass  # just ignore all these lines
        else:
            first, rest = line.split(None, 1)
            middle, last = rest.rsplit(None, 1)
            ngrams[middle] = first, last
    return ngrams

ngrams = load("ngrams.txt")
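To make the parsing concrete, here's a quick sketch of what one line goes through. The sample line is hypothetical (a leading count, the n-gram, a trailing count); substitute whatever your actual file format looks like:

```python
# A made-up sample line, not from the real ngrams.txt:
line = "42 the quick brown 7\n"

# split(None, 1) peels off the first whitespace-separated token;
# rsplit(None, 1) peels off the last one (and eats the trailing newline).
first, rest = line.split(None, 1)
middle, last = rest.rsplit(None, 1)

ngrams = {middle: (first, last)}
# ngrams is now {"the quick brown": ("42", "7")}
```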
I would want to store int(first), int(last) rather than first, last. That would speed up access, but slow down load time. So it depends on your workload.
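A minimal sketch of that variant, assuming the first and last fields really are integer counts (if they aren't in your data, keep them as strings):

```python
def load_ints(filename):
    # Same parsing as load() above, but pay the int() conversion
    # cost once at load time instead of on every lookup.
    ngrams = {}
    with open(filename) as f:
        for line in f:
            if line[0] == '\\':
                continue  # still ignoring these lines
            first, rest = line.split(None, 1)
            middle, last = rest.rsplit(None, 1)
            ngrams[middle] = int(first), int(last)
    return ngrams
```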
I disagree with johnthexii: doing this in Python should be much faster than talking to a database, even sqlite, as long as the data set fits in memory. (If you use a database, you can do the load once and skip it on later runs, so sqlite may end up being exactly what you want; but you can't do that with a :memory: database, which disappears when the process exits.)
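If the load-once property is what matters, here's a minimal sketch of the file-backed sqlite approach. The path, table name, and schema are all assumptions for illustration, not anything from the question:

```python
import os
import sqlite3
import tempfile

# A file-backed database (unlike ":memory:") survives between runs,
# so the expensive load from ngrams.txt only has to happen once.
db_path = os.path.join(tempfile.gettempdir(), "ngrams.db")  # hypothetical path

conn = sqlite3.connect(db_path)
conn.execute(
    "CREATE TABLE IF NOT EXISTS ngrams "
    "(middle TEXT PRIMARY KEY, first_count INT, last_count INT)"
)
# On the first run you'd insert every parsed line; here, one sample row:
conn.execute(
    "INSERT OR REPLACE INTO ngrams VALUES (?, ?, ?)",
    ("the quick brown", 42, 7),
)
conn.commit()

# On later runs, skip the load and just query:
row = conn.execute(
    "SELECT first_count, last_count FROM ngrams WHERE middle = ?",
    ("the quick brown",),
).fetchone()
# row is (42, 7)
```

The trade-off stands, though: each lookup now goes through SQL instead of a plain dict access, so for a data set that fits in memory the dict should win on query speed.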