Question

I'm using the hash_ring package for distributing objects among servers. I assumed the distribution would be uniform, since it's based on MD5 hashes. Unfortunately, that's not the case.

I'm using random keys generated with uuid.uuid4(). I've verified that MD5 itself does in fact give a uniform distribution. However, when I distribute with hash_ring.HashRing, there are differences of 20-30% between the most and least populated buckets.

  • Can the uniformity of the distribution be improved in hash_ring by tweaking some settings?
  • Are there other good alternatives for consistent hashing in Python?

The code I used to test the uniformity of the distribution:

import hash_ring
import uuid

ring = hash_ring.HashRing(range(8))
for _ in range(10):
    counters = [0]*8
    for _ in range(100000):
        counters[ring.get_node(str(uuid.uuid4()))] += 1
    print counters

Which printed out:

[11115, 11853, 14033, 11505, 13640, 12292, 12851, 12711]
[11164, 11833, 14024, 11562, 13365, 12302, 13002, 12748]
[11354, 11756, 14017, 11583, 13201, 12231, 13135, 12723]
[11182, 11672, 13936, 11441, 13563, 12240, 13129, 12837]
[11069, 11866, 14045, 11541, 13443, 12249, 12894, 12893]
[11196, 11791, 14158, 11533, 13517, 12319, 13039, 12447]
[11297, 11944, 14114, 11536, 13154, 12289, 12990, 12676]
[11250, 11770, 14145, 11657, 13340, 11960, 13161, 12717]
[11027, 11891, 14099, 11615, 13320, 12336, 12891, 12821]
[11148, 11677, 13965, 11557, 13503, 12309, 13123, 12718]

For comparison, I did non-consistent hashing directly with MD5:

import hashlib
import uuid

for _ in range(10):
    counters = [0]*8
    for _ in range(100000):
        counters[int(hashlib.md5(str(uuid.uuid4())).hexdigest(), 16) % 8] += 1
    print counters

with much better results:

[12450, 12501, 12380, 12643, 12446, 12444, 12506, 12630]
[12579, 12667, 12523, 12385, 12386, 12445, 12702, 12313]
[12624, 12449, 12451, 12401, 12580, 12449, 12562, 12484]
[12359, 12543, 12371, 12659, 12508, 12416, 12619, 12525]
[12425, 12526, 12565, 12732, 12381, 12481, 12335, 12555]
[12514, 12576, 12528, 12294, 12658, 12319, 12518, 12593]
[12500, 12471, 12460, 12502, 12637, 12393, 12442, 12595]
[12583, 12418, 12428, 12311, 12581, 12780, 12372, 12527]
[12395, 12569, 12544, 12319, 12607, 12488, 12424, 12654]
[12480, 12423, 12492, 12433, 12427, 12502, 12635, 12608]

Solution

The hash ring sacrifices the "evenness" of your MD5 test code in order to keep mappings stable when the number of nodes changes; see http://www.lexemetech.com/2007/11/consistent-hashing.html. So the differences you see are not caused by uuid4, or by a bug, but by the library using a different algorithm than your test does.

If you changed the number of bins in your MD5 code, you'd need to change the modular division (the % 8), and suddenly (almost) all mappings would change. Consistent hashing avoids this, which is why it can't use the same "obviously uniform" approach that you do.
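
To make that remapping concrete, here is a small sketch (mine, not from the question) in the same Python 2 style: it hashes the same keys into 8 and then 9 buckets with plain modular division and counts how many keys move. With a uniform hash, only about 1/9 of the keys keep their bucket.

import hashlib
import uuid

def bucket(key, n):
    # Plain (non-consistent) hashing: bucket index is the MD5 value mod n.
    return int(hashlib.md5(key).hexdigest(), 16) % n

keys = [str(uuid.uuid4()) for _ in range(100000)]

# Count how many keys land in a different bucket when a ninth bin is added.
moved = sum(1 for k in keys if bucket(k, 8) != bucket(k, 9))
print moved  # roughly 8/9 of the keys (~89%) change buckets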

The downside of the consistent approach is that it's not perfectly uniform: the distribution depends on the hashes of the bins, rather than on the hashes of the objects you put into them, so you don't get the "evening out" you would expect as you add more objects. But you can increase uniformity by increasing the number of points on the ring, i.e. by increasing the number of "replicas" used in the code.

So, assuming the final API matches the one shown at http://amix.dk/blog/viewEntry/19367, just set a larger value for replicas (try doubling it, or even just adding 1; your distribution is already pretty flat).
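
To see the effect of the replica count without relying on hash_ring's exact API, here is a minimal, self-contained consistent hash ring (an illustrative sketch, not the hash_ring implementation). Each node is hashed onto the ring `replicas` times, and a key goes to the first node point at or after its own hash; the spread across the 8 buckets should tighten noticeably as `replicas` grows.

import bisect
import hashlib
import uuid

class SimpleRing(object):
    """Toy consistent hash ring with a configurable number of replicas."""

    def __init__(self, nodes, replicas):
        self.ring = {}          # point on the ring -> node
        self.sorted_keys = []   # sorted ring points, for bisection
        for node in nodes:
            for i in range(replicas):
                point = self._hash('%s:%d' % (node, i))
                self.ring[point] = node
                bisect.insort(self.sorted_keys, point)

    def _hash(self, key):
        return int(hashlib.md5(key).hexdigest(), 16)

    def get_node(self, key):
        # Walk clockwise to the first node point at or after the key's hash,
        # wrapping around to the start of the ring if necessary.
        idx = bisect.bisect(self.sorted_keys, self._hash(key)) % len(self.sorted_keys)
        return self.ring[self.sorted_keys[idx]]

for replicas in (3, 16, 128):
    ring = SimpleRing(range(8), replicas)
    counters = [0]*8
    for _ in range(100000):
        counters[ring.get_node(str(uuid.uuid4()))] += 1
    print replicas, counters  # the buckets even out as replicas increases

The trade-off is ring size: lookups stay O(log(nodes * replicas)), so pushing replicas into the hundreds is usually cheap.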


Update: I looked around a bit more, and this blog post, http://amix.dk/blog/post/19369, describes the changes made to the latest code. It looks like it does use more than 3 replicas (I don't completely understand the code, sorry), and it also seems to be based on a well-known "standard" implementation now.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow