Is there another way to avoid duplication of large hashable objects?

https://stackoverflow.com/questions/18410321

26-06-2022
|

Question

I am processing text and have the need to store large sequences of hashable objects - sometimes strings, sometimes tuples of words, etc. I've been thinking of using the hash function to provide an simple store and retrieve class but with my first approach it is possible that a single hash key might resolve to more than one item. Given that I add a get function that takes the return value of add as an argument I cannot know which item in the list to return.

class HashStore:
    def __init__(self):
        self.uniques = {}

    def add(self, big_hashable):
        hash_value = hash(big_hashable)
        if hash_value not in self.uniques:
            self.uniques[hash_value] = [big_hashable]
        elif big_hashable not in self.uniques[hash_value]:
            self.uniques[hash_value].append(big_hashable)

        return hash_value

Another approach ends up assuring that there is only a single mapping for each unique hashable item.

class SingleStore:
    def __init__(self):
        self.uniques = {}
        self.indexed = {}
        self.index = 0

    def add(self, big_hashable):
        if big_hashable not in self.uniques:
            self.index += 1
            self.uniques[big_hashable] = self.index
            self.indexed[self.index] = big_hashable

        return self.uniques[big_hashable]

This works and assures that the return value of add can be used to return a unique value. It just seems a bit clumsy. Is there a better, more Pythonic way of handling this situation?

I've been ambiguous as to the question. There are two issues - one is that I have millions of objects that are currently using keys ranging from 100s to 1000s of bytes each (the big_hashable thing). Converting those to integers would enable processing of more data than I currently can. Secondly, keeping only a single canonical copy of each big_hashable thing would cut down on memory usage as well, though it is the first issue that is driving my question, because each key is actually a separate copy of the big_hashable thing.

Solution

If you don't need to be able to efficiently retrieve a canonical copy of an object given a different copy, you can just use a set:

s = set()
s.add(3)
s.add(3)
# s only has one 3 in it

If you do need to be able to efficiently retrieve canonical copies of objects, don't store them by the hash value - that'd be horribly broken. Just use the hashable directly.

class Interner(object):
    def __init__(self):
        self._store = {}
    def canonical_object(self, thing):
        """Returns a canonical object equal to thing.

        Always returns the same result for equal things.

        """

        return self._store.setdefault(thing, thing)

With the weakref module, you can improve this to not keep a canonical object if the client code lets go of it, just like the built-in intern function does for strings.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow