Question

I'm using autovivification to store data in a multiprocessing setting. However, I can't figure out how to incorporate it in the multiprocessing manager function.

My autovivification code comes from Multiple levels of 'collection.defaultdict' in Python and works fine when no multiprocessing occurs.

    class Vividict(dict):
        def __getitem__(self, item):
            try:
                return dict.__getitem__(self, item)
            except KeyError:
                value = self[item] = type(self)()
                return value
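To illustrate what the class does, here is a minimal standalone sketch (the class is repeated so the snippet runs on its own): missing keys are created on the fly as nested Vividict instances, so arbitrarily deep assignments work without pre-creating the intermediate levels.

```python
class Vividict(dict):
    def __getitem__(self, item):
        try:
            return dict.__getitem__(self, item)
        except KeyError:
            # autovivify: a missing key becomes a new nested Vividict
            value = self[item] = type(self)()
            return value

d = Vividict()
d["a"]["b"]["c"] = 1  # intermediate dicts spring into existence
print(d)              # {'a': {'b': {'c': 1}}}
```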

My multiprocessing code is relatively simple:

    from multiprocessing import Manager, Process, Queue

    if __name__ == "__main__":
        man = Manager()
        ngramDict = man.dict()
        print(ngramDict)  # {}
        s_queue = Queue()

        aProces = Process(target=insert_ngram, args=(s_queue, ngramDict))
        aProces.start()
        aProces.join()
        print(ngramDict)  # {}
        write_to_file()

In insert_ngram the dictionary is read, written and updated:

    def insert_ngram(sanitize_queue, ngramDict):
        ngramDict = Vividict()  # obviously this overwrites the manager dict
        try:
            for w in iter(sanitize_queue.get, None):
                if ngramDict[w[0]][w[1]][w[2]][w[3]][w[4]]:
                    ngramDict[w[0]][w[1]][w[2]][w[3]][w[4]] += int(w[5])
                else:
                    ngramDict[w[0]][w[1]][w[2]][w[3]][w[4]] = int(w[5])
            print(ngramDict)  # prints the expected ngramDict
            return
        except KeyError as e:
            print("Key %s not found in %s" % (e, ngramDict))
        except Exception as e:
            print("%s failed with: %s" % (current_process().name, e))

I've tried a series of what I thought were good solutions, but I can't get it to work, except by calling write_to_file from inside insert_ngram, which isn't really a neat solution.

Is there a way to get Manager.dict() to autovivify?

--------- UPDATE 6-12-2013 --------

Since the Manager() provides a proxy, any mutations made to a manager.dict() within a subprocess aren't stored/tracked. (See also: How does multiprocessing.Manager() work in python?) This can be solved as follows:

    def insert_ngram(sanitize_queue, ngramDict):
        localDict = Vividict()
        localDict.update(ngramDict)
        # do stuff
        ngramDict.update(localDict)

I'm waiting for my machine to finish some tasks so I can see how this performs. Copying dicts back and forth like this seems like a performance hit (my dicts run to 200 MB+).

--------- UPDATE 8-12-2013 --------

In my application, dict.update() is only hit once, so even though the dict is ~200 MB+, the overall performance impact isn't large...


Solution

The multiprocessing Manager() provides a proxy to a dictionary or list. Any mutations to a manager.dict() within a subprocess aren't stored/tracked, so one needs to copy the mutations back to the proxy variable that belongs to the Manager. (See also: How does multiprocessing.Manager() work in python?)

This can be solved as follows:

    def insert_ngram(queue, managerDict):
        # create a local dictionary with vivification
        localDict = Vividict()
        # copy the existing manager.dict to the local dict
        localDict.update(managerDict)
        # do stuff
        # copy the local dictionary to the manager dict
        managerDict.update(localDict)
        return

Although this seems like some serious overhead, in this case it's not too bad, as the manager dictionary only needs one update() call before the subprocess joins the main process.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow