Question

Let's say I have a set of users in my database with GUIDs as their IDs. I use xxhash to generate a fixed-length hash for each ID, so that I can bucket the users and do random sampling with the help of the modulo function.

For example, given a hash such as 367b50760441849e, I want to be able to use hash % 20 == 0 to randomly pick 5% of the population (hence 20 "buckets"). This is the approach used by Kusto's hash() with a modulo argument.

With this in mind, what approach should I use to get an integer value from the hash, so that I can compute the modulo?

Solution

Any good hash is uniformly distributed, which means you can assume a (near-)uniform distribution after applying modulo $n$, as long as $n < 2^{M/2}$, where $M$ is the number of bits in your hash (see here). So for a 32-bit hash you would take the modulus by at most $2^{16}$.
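To get a feel for how small the deviation from uniformity is, here is a rough sketch that counts exactly how many hash values land in each bucket. It assumes an ideal, perfectly uniform $M$-bit hash; the function name `bucket_probabilities` is made up for this illustration.

```python
from fractions import Fraction


def bucket_probabilities(bits: int, n: int) -> tuple[Fraction, Fraction]:
    """Return (largest, smallest) bucket probability for an ideal
    uniform `bits`-bit hash reduced modulo n (hypothetical helper)."""
    total = 2 ** bits
    # `extra` buckets receive base + 1 hash values, the others receive base.
    base, extra = divmod(total, n)
    largest = Fraction(base + (1 if extra else 0), total)
    smallest = Fraction(base, total)
    return largest, smallest


largest, smallest = bucket_probabilities(64, 20)
# Relative difference between the fullest and the emptiest bucket.
print(float((largest - smallest) / smallest))
```

For a 64-bit hash and 20 buckets the fullest and emptiest buckets differ by a single hash value out of $2^{64}$, which is far below anything you could observe when sampling users.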

There is no special approach to calculating an integer value: what you have there is a hexadecimal representation of the hash, so if you obtained it as a string you just need to convert it to a numeric type. XXH32() and XXH64() both already return an unsigned integer (32-bit and 64-bit, respectively).
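As a minimal sketch of that conversion, assuming the hash arrives as a hex string like the one in the question (the helper names `hash_to_int` and `in_sample` are made up for this example):

```python
def hash_to_int(hex_digest: str) -> int:
    """Parse a hexadecimal hash string into a non-negative integer."""
    return int(hex_digest, 16)


def in_sample(hex_digest: str, buckets: int = 20, chosen: int = 0) -> bool:
    """True if the hash falls into the chosen bucket (1/buckets of the population)."""
    return hash_to_int(hex_digest) % buckets == chosen


h = "367b50760441849e"
print(hash_to_int(h))        # the 64-bit value as an integer
print(hash_to_int(h) % 20)   # bucket index in the range 0..19
print(in_sample(h))          # whether this user is in the 5% sample
```

If you can get the integer directly from your hashing library (the Python xxhash package, for instance, offers intdigest() on its hash objects), you can skip the string round trip and apply the modulo to that value.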

Licensed under: CC-BY-SA with attribution