Вопрос

I have the following code:

keywordindex = cPickle.load(open('keywordindex.p','rb'))#contains~340 thousand items
masterIndex = {}

indexes = [keywordindex]
for partialIndex in indexes:
    start = time.time()
    for key in partialIndex.iterkeys():
        if key not in masterIndex.keys():
            masterIndex[key]= partialIndex[key]
        elif key in masterIndex.keys():
            masterIndex[key].extend(partialIndex[key])
    cPickle.dump(masterIndex,open('MasterIndex.p','wb'))

    print int(time.time() - start), ' seconds'#every thousand loops

and I am experiencing degrading performance as the loop runs, the first 10 thousand take about 5 seconds per thousand, but every 10 thousand or so it takes another second, until its taking 3 times longer. I have tried to streamline the code in every possible way, but I can't seem to figure out what is causing it. Is there a reason for this? It's not a memory issue, I'm only at 30% usage

Это было полезно?

Решение

This block contains two instances of horridly inefficient coding:

    if key not in masterIndex.keys():
        masterIndex[key]= partialIndex[key]
    elif key in masterIndex.keys():
        masterIndex[key].extend(partialIndex[key])

First, the key is or is not in masterIndex, so there's no useful point at all to the elif test. If the not in test fails, the in test must succeed. So this code does the same:

    if key not in masterIndex.keys():
        masterIndex[key]= partialIndex[key]
    else:
        masterIndex[key].extend(partialIndex[key])

Second, masterIndex is a dict. Dicts support very efficient membership testing without any help from you ;-) By materializing its keys into a list (via .keys()), you're changing what should be a blazing fast dict lookup into a horridly slow linear search over a list. So do this instead:

    if key not in masterIndex:
        masterIndex[key]= partialIndex[key]
    else:
        masterIndex[key].extend(partialIndex[key])

The code will run much faster then.

Другие советы

You don't need to search through masterIndex.keys(). Also, just use an empty else clause:

if key not in masterIndex:
    ...
else:
    ...

The in operator on a dict queries the keys of that dict and the time complexity of that operation is O(1) on average.

Лицензировано под: CC-BY-SA с атрибуция
Не связан с StackOverflow
scroll top