Question

I have a program (PatchDock), which takes its input from a parameters file, and produces an output file. Running this program is time-intensive, and I'd like to cache results of past runs so that I need not run the same parameters twice.

I'm able to parse the input and output files into appropriate data structures. The input file, for example, is parsed into a dictionary-like object. The input keys are all strings, and the values are primitive data types (ints, strings, and floats).

My approach

My first idea was to use the md5 hash of the input file as the keys in a shelve database. However, this fails to capture cached files with the exact same inputs, but some slight differences in the input files (comments, spacing, order of parameters, et cetera).

Hashing the parsed parameters seems like the best approach to me. But the only way I can think of getting a unique hash from a dictionary is to hash a sorted string representation.

Question

Hashing a string representation of a parameters dictionary seems like a roundabout way of achieving my end goal- keying unique input values to a specified output. Is there a more straightforward way to achieve this caching system?

Ideally, I'm looking to achieve this in Python.

Was it helpful?

Solution

Hashing a sorted representation of the parsed input is actually the most straightforward way of doing this, and the one that makes sense. Your instincts were correct.

Basically, you're normalizing the input (by parsing it and sorting it), and then using that to construct a hash key.

OTHER TIPS

Hashing seems a very viable way, but doing this yourself seems a bit overkill. Why not use the tuple of inputs as key for your dictionary? You wouldn't have to worry about hashing and possible collisions yourself. All you have to do is fix a order for the keyword arguments (and depending on your requirements add a flag-object for keywords that are not set).

You also might find the functools.lru_cache useful, if you are using Python 3.2+. This is a decorator that will enable caching for the last n calls of the decorated function. If you are using a older version there are backports of this functionality out there.

Also there seem to be a project with similar goals called FileDict which might be worth looking at.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top