Domanda

When using adler32() as a hash function, one should expect rare collisions.

We can do the exact math of collisions probability, but roughly speaking,
since it is a 32-bits hash function, there should not be many collisions
on a sample set of a few thousands items.

This is hardly the case.

Here is an example: let's take strings that include a date in the middle, something like

"Some prefix text " + date + " some postfix text."

where the dates` format is yyyy-mm-dd, and looping over 2012.

There are 91 collisions in this example!

Even worse: there are 7 cases where 3 dates collided.

How come such a commonly-used hash function perform so poorly?
Or am I missing something?

Here are the detailed results of the above example:

0x592a0f1f: 2012-01-30, 2012-02-02, 2012-10-21
0x592b0f1f: 2012-02-11, 2012-10-30, 2012-11-02
0x593d0f20: 2012-01-31, 2012-02-03, 2012-10-22
0x593e0f20: 2012-02-12, 2012-10-31, 2012-11-03
0x59410f20: 2012-03-11, 2012-11-30, 2012-12-02
0x59560f21: 2012-03-30, 2012-04-02, 2012-12-21
0x59690f22: 2012-03-31, 2012-04-03, 2012-12-22

0x59020f1d: 2012-01-10, 2012-10-01
0x59150f1e: 2012-01-11, 2012-10-02
0x59160f1e: 2012-01-20, 2012-10-11
0x59170f1e: 2012-02-01, 2012-10-20
0x59180f1e: 2012-02-10, 2012-11-01
0x59280f1f: 2012-01-12, 2012-10-03
0x59290f1f: 2012-01-21, 2012-10-12
0x592c0f1f: 2012-02-20, 2012-11-11
0x592d0f1f: 2012-03-01, 2012-11-20
0x592e0f1f: 2012-03-10, 2012-12-01
0x593b0f20: 2012-01-13, 2012-10-04
0x593c0f20: 2012-01-22, 2012-10-13
0x593f0f20: 2012-02-21, 2012-11-12
0x59400f20: 2012-03-02, 2012-11-21
0x59420f20: 2012-03-20, 2012-12-11
0x59430f20: 2012-04-01, 2012-12-20
0x594e0f21: 2012-01-14, 2012-10-05
0x594f0f21: 2012-01-23, 2012-10-14
0x59500f21: 2012-02-04, 2012-10-23
0x59510f21: 2012-02-13, 2012-11-04
0x59520f21: 2012-02-22, 2012-11-13
0x59530f21: 2012-03-03, 2012-11-22
0x59540f21: 2012-03-12, 2012-12-03
0x59550f21: 2012-03-21, 2012-12-12
0x59570f21: 2012-04-11, 2012-12-30
0x59610f22: 2012-01-15, 2012-10-06
0x59620f22: 2012-01-24, 2012-10-15
0x59630f22: 2012-02-05, 2012-10-24
0x59640f22: 2012-02-14, 2012-11-05
0x59650f22: 2012-02-23, 2012-11-14
0x59660f22: 2012-03-04, 2012-11-23
0x59670f22: 2012-03-13, 2012-12-04
0x59680f22: 2012-03-22, 2012-12-13
0x596a0f22: 2012-04-12, 2012-12-31
0x596c0f22: 2012-04-30, 2012-05-02
0x59740f23: 2012-01-16, 2012-10-07
0x59750f23: 2012-01-25, 2012-10-16
0x59760f23: 2012-02-06, 2012-10-25
0x59770f23: 2012-02-15, 2012-11-06
0x59780f23: 2012-02-24, 2012-11-15
0x59790f23: 2012-03-05, 2012-11-24
0x597a0f23: 2012-03-14, 2012-12-05
0x597b0f23: 2012-03-23, 2012-12-14
0x597c0f23: 2012-04-04, 2012-12-23
0x59820f23: 2012-05-30, 2012-06-02
0x59870f24: 2012-01-17, 2012-10-08
0x59880f24: 2012-01-26, 2012-10-17
0x59890f24: 2012-02-07, 2012-10-26
0x598a0f24: 2012-02-16, 2012-11-07
0x598b0f24: 2012-02-25, 2012-11-16
0x598c0f24: 2012-03-06, 2012-11-25
0x598d0f24: 2012-03-15, 2012-12-06
0x598e0f24: 2012-03-24, 2012-12-15
0x598f0f24: 2012-04-05, 2012-12-24
0x59950f24: 2012-05-31, 2012-06-03
0x59980f24: 2012-06-30, 2012-07-02
0x599a0f25: 2012-01-18, 2012-10-09
0x599b0f25: 2012-01-27, 2012-10-18
0x599c0f25: 2012-02-08, 2012-10-27
0x599d0f25: 2012-02-17, 2012-11-08
0x599e0f25: 2012-02-26, 2012-11-17
0x599f0f25: 2012-03-07, 2012-11-26
0x59a00f25: 2012-03-16, 2012-12-07
0x59a10f25: 2012-03-25, 2012-12-16
0x59a20f25: 2012-04-06, 2012-12-25
0x59ae0f25: 2012-07-30, 2012-08-02
0x59ae0f26: 2012-01-28, 2012-10-19
0x59af0f26: 2012-02-09, 2012-10-28
0x59b00f26: 2012-02-18, 2012-11-09
0x59b10f26: 2012-02-27, 2012-11-18
0x59b20f26: 2012-03-08, 2012-11-27
0x59b30f26: 2012-03-17, 2012-12-08
0x59b40f26: 2012-03-26, 2012-12-17
0x59b50f26: 2012-04-07, 2012-12-26
0x59c10f26: 2012-07-31, 2012-08-03
0x59c40f26: 2012-08-30, 2012-09-02
0x59c40f27: 2012-02-28, 2012-11-19
0x59c50f27: 2012-03-09, 2012-11-28
0x59c60f27: 2012-03-18, 2012-12-09
0x59c70f27: 2012-03-27, 2012-12-18
0x59c80f27: 2012-04-08, 2012-12-27
0x59d70f27: 2012-08-31, 2012-09-03
0x59da0f28: 2012-03-28, 2012-12-19
0x59db0f28: 2012-04-09, 2012-12-28
È stato utile?

Soluzione

Adler-32 was never intended to be and is not a hash function. It's purpose is error detection after decompression. It serves that purpose well since it is fast and since errors in the compressed data are amplified by the decompressor.

In the examples you give, you are using Adler-32 on very short strings, for which it has no chance to even make use of all 32 bits. Adler-32 requires at least a few hundred bytes to get rolling.

There are many non-cryptographic hash functions that are very fast and have good hash behavior, including avoidance of collisions. Take a look at the CityHash family of hash functions.

If you need cryptographic (non-spoofable) hash functions, then look at SHA-2 and SHA-3.

Autorizzato sotto: CC-BY-SA insieme a attribuzione
Non affiliato a StackOverflow
scroll top