Question

I'm not talking about distributed key/value systems, such as typically used with memcached, which use consistent hashing to make adding/removing nodes a relatively cheap procedure.

I'm talking about your standard in-memory hashtable like python's dict or perl's hash.

It would seem like the benefits of using consistent hashing would also apply to these standard data structures, by lowering the cost of resizing the hashtable. Real-time systems (and other latency-sensitive systems) would benefit from / require hashtables optimized for low-cost growth, even if overall throughput declines slightly.

Wikipedia alludes to "incremental resizing" but basically talks about a hot/cold replacement approach to resizing; there is a separate article about "extendible hashing" that uses a trie for bucket lookup to accomplish cheap rehashing.

Just curious if anyone's heard of in-core, single-node hashtables that use consistent hashing to lower growth cost. Or is this requirement better met using some other approach (à la the two Wikipedia approaches listed above)?

Or... is my whole question misguided? Do memory paging considerations make the complexity not worth it? That is, the extra indirection of consistent hashing lets you rehash only a fraction of the total keys, but perhaps that doesn't matter because you'll probably have to read from each existing page anyway; memory latency is your primary cost, and whether you rehash some or all of the keys matters little compared to the cost of the memory accesses. On the other hand, with consistent hashing all of your key remaps have the same destination page, so there will be less memory thrashing than if your keys could remap to any of the existing pages.

EDIT: added "data-structures" tag, clarified final sentence to say "page" instead of "bucket".


Solution

I haven't heard of this in the wild, but it may be a good idea if you choose the right consistent hash implementation. Specifically, Jump Consistent Hashing from Google (Lamping and Veach). First I'll go into why Jump, then I'll go into how it can be useful in a local data structure.

Jump Consistent Hashing

Jump Consistent Hashing (which I'll shorten to Jump) is great for this space for a few reasons. Jump assumes that nodes don't fail, which is great for local data structures because they, well, don't fail! This allows Jump to merely be a mapping to a range of numbers [0, numBuckets), requiring only 2-4 bytes of space.

Further, the implementation is simple and fast. And it is even faster if we remove the reference implementation's floating-point divides and replace them with half as many integer divides. (Which we can, by the way.)
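
For concreteness, here is the algorithm from the paper ("A Fast, Minimal Memory, Consistent Hash Algorithm", Lamping and Veach), ported from the paper's C++ to Java; this is the floating-point-divide version referred to above:

    // A Java port of the jump consistent hash from Lamping & Veach's paper.
    final class JumpHash {
        // Maps a 64-bit key hash to a bucket index in [0, numBuckets).
        static int bucket(long key, int numBuckets) {
            long b = -1, j = 0;
            while (j < numBuckets) {
                b = j;
                // 64-bit linear congruential step from the paper.
                key = key * 2862933555777941757L + 1;
                // Unsigned shift keeps the divisor positive despite Java's signed longs.
                j = (long) ((b + 1) * ((double) (1L << 31) / (double) ((key >>> 33) + 1)));
            }
            return (int) b;
        }
    }

Growing numBuckets by one moves, in expectation, only 1/(numBuckets + 1) of the keys, and every moved key lands in the new bucket.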

All this can be used for a variation on...

ConcurrentHashMap

But first, Java's Concurrent Hash Map at a high level.

Java's ConcurrentHashMap (in its classic, segmented form prior to Java 8) is parameterized by a number of buckets. This sharding factor is constant through the life of the map. Each of these buckets is itself a hash map with its own lock.

When inserting a key-value pair into the map, the key is hashed into one of the buckets. The lock for that bucket is taken, and the item is inserted into the bucket's hash map before the lock is released. While one thread is inserting into bucket x, another thread can insert concurrently into bucket y, but it will wait on the lock if it is also inserting into bucket x. Thus Java's ConcurrentHashMap has n-way concurrency, where n is the bucket parameter of the constructor.
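
To make that concrete, here is a minimal sketch of the segmented design just described. This is an illustrative StripedMap class of my own naming, not Java's actual ConcurrentHashMap internals:

    import java.util.HashMap;
    import java.util.Map;

    // Sketch of a segmented map: a fixed number of buckets, each an
    // independently locked HashMap. Threads touching different buckets
    // never contend with one another.
    final class StripedMap<K, V> {
        private final Map<K, V>[] buckets;
        private final Object[] locks;

        @SuppressWarnings("unchecked")
        StripedMap(int numBuckets) {
            buckets = new Map[numBuckets];
            locks = new Object[numBuckets];
            for (int i = 0; i < numBuckets; i++) {
                buckets[i] = new HashMap<>();
                locks[i] = new Object();
            }
        }

        public V put(K key, V value) {
            int b = Math.floorMod(key.hashCode(), buckets.length);
            synchronized (locks[b]) { // only blocks writers of the same bucket
                return buckets[b].put(key, value);
            }
        }

        public V get(K key) {
            int b = Math.floorMod(key.hashCode(), buckets.length);
            synchronized (locks[b]) {
                return buckets[b].get(key);
            }
        }
    }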

Just like any hash map, a bucket in ConcurrentHashMap can fill up and need to grow. Just like a regular hash map, it does this by doubling its size and rehashing everything in the bucket back into its bigger self. Except that 'its bigger self' is only the bucket's 'self'. If a bucket is a hot spot and gets more than its fair share of keys, it will grow disproportionately compared to the other buckets. And each time a bucket grows it takes longer and longer to rehash into itself. This last point is a problem not only for hot spots, but also whenever the hash table simply accumulates more keys.

Imagine if we could grow the number of buckets as the number of keys grows. With this we could dampen how much each individual bucket has to grow.

Enter consistent hashing, which allows us to add more buckets!

ConcurrentHashMap take 2: Consistent Hashing Style

We can get ConcurrentHashMap to grow its number of buckets in two easy steps.

First, replace the function that maps keys to buckets with the jump consistent hash function. So far everything should work the same.

Second, split off a new bucket when a bucket is filled; also grow the filled bucket. Actually, only split off a new bucket if the filled bucket becomes the largest in terms of capacity. That can be calculated without iterating the buckets.

With consistent hashing the split will only direct keys into the new bucket and not backwards into any of the old buckets.
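
A rough sketch of that split step, assuming the JumpHash.bucket function sketched earlier and a caller-supplied 64-bit hash of the key; locking and the "largest bucket" bookkeeping are omitted for clarity:

    import java.util.HashMap;
    import java.util.Iterator;
    import java.util.List;
    import java.util.Map;
    import java.util.function.ToLongFunction;

    final class Splitter {
        // Grow the bucket list from n to n + 1. With jump consistent hashing
        // each key either keeps its old bucket index or moves to the new
        // index n, so keys only ever move into the fresh bucket.
        static <K, V> void split(List<Map<K, V>> buckets, ToLongFunction<K> longHash) {
            int n = buckets.size();
            Map<K, V> fresh = new HashMap<>();
            buckets.add(fresh); // the new bucket gets index n
            for (int i = 0; i < n; i++) {
                Iterator<Map.Entry<K, V>> it = buckets.get(i).entrySet().iterator();
                while (it.hasNext()) {
                    Map.Entry<K, V> e = it.next();
                    // Rehash with n + 1 buckets: the result is either i or n.
                    if (JumpHash.bucket(longHash.applyAsLong(e.getKey()), n + 1) == n) {
                        fresh.put(e.getKey(), e.getValue());
                        it.remove();
                    }
                }
            }
        }
    }

Every key is rehashed with n + 1 buckets, but jump's guarantee means the old buckets never exchange keys with each other; they only donate a fraction of their keys to the new bucket.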

End notes

I'm sure there can be improvements on this scheme. To wit, splitting off a bucket requires a full table scan to move keys into the new bucket. This is surely no worse than a vanilla hash map, and likely better, but it is at a disadvantage compared to the ConcurrentHashMap implementation, which likely doesn't have to do a full scan.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow