Вопрос

I have a code that uses a cyclic polynomial rolling hash (Buzhash) to compute hash values of n-grams of source code. If i use small hash values (7-8 bits) then there are some collisions i.e. different n-grams map to the same hash value. If i increase the bits in the hash value to say 31, then there are 0 collisions - all ngrams map to different hash values.

I want to know why this is so? Do the collisions depend on the number of n-grams in the text or the number of different characters that an n-gram can have or is it the size of an n-gram?

How does one choose the number of bits for the hash value when hashing n-grams (using rolling hashes)?

Это было полезно?

Решение

How Length effects Collisions

This is simply a question of permutations.

If i use small hash values (7-8 bits) then there are some collisions

Well, let's analyse this. With 8 bits, there are 2^8 possible binary sequences that can be generated for any given input. That is 256 possible hash values that can be generated, which means that in theory, every 256 message digest values generated guarantee a collision. This is called the birthday problem.

If i increase the bits in the hash value to say 31, then there are 0 collisions - all ngrams map to different hash values.

Well, let's apply the same logic. With 31 bit precision, we have 2^31 possible combinations. That is 2147483648 possible combinations. And we can generalise this to:

Let N denote the amount of bits we use.
Amount of different hash values we can generate (X) = 2^N

Assuming repetition of values is allowed (which it is in this case!)

This is an exponential growth, which is why with 8 bits, you found a lot of collisions and with 31 bits, you've found very little collisions.

How does this effect collisions?

Well, with a very small amount of values, and an equal chance for each of those values being mapped to an input, you have it that:

Let A denote the number of different values already generated.
Chance of a collision is: A / X 

Where X is the possible number of outputs the hashing algorithm can generate.

When X equals 256, you have a 1/256 chance of a collision, the first time around. Then you have a 2/256 chance of a collision when a different value is generated. Until eventually, you have generated 255 different values and you have a 255/256 chance of a collision. The next time around, obviously it becomes a 256/256 chance, or 1, which is a probabilistic certainty. Obviously it usually won't reach this point. A collision will likely occur a lot more than every 256 cycles. In fact, the Birthday paradox tells us that we can start to expect a collision after 2^N/2 message digest values have been generated. So following our example, that's after we've created 16 unique hashes. We do know, however, that it has to happen, at minimum, every 256 cycles. Which isn't good!

What this means, on a mathematical level, is that the chance of a collision is inversely proportional to the possible number of outputs, which is why we need to increase the size of our message digest to a reasonable length.

A note on hashing algorithms

Collisions are completely unavoidable. This is because, there are an extremely large number of possible inputs (2^All possible character codes), and a finite number of possible outputs (as demonstrated above).

Другие советы

If you have hash values of 8 bits the total possible number of values is 256 - that means that if you hash 257 different n-grams there will be for sure at least one collision (...and very likely you will get many more collisions, even with less that 257 n-grams) - and this will happen regardless of the hashing algorithm or the data being hashed.

If you use 32 bits the total possible number of values is around 4 billion - and so the likelihood of a collision is much less.

'How does one choose the number of bits': I guess depends on the use of the hash. If it is used to store the n-grams in some kind of hashed data structure (a dictionary) then it should be related to the possible number of 'buckets' of the data structure - e.g. if the dictionary has less than 256 buckets that a 8 bit hash is OK.

See this for some background

Лицензировано под: CC-BY-SA с атрибуция
Не связан с StackOverflow
scroll top