Pergunta

According to Birthday paradox:

If i apply it to the Database (please correct me if I'm wrong): if we need to store UNIQUE hashed data in Database and we have a hash algorythm that can generate 365 unique hash values, there is a 50% chance that data collision will occur after first 23 data entries and 99.9%(!) chance of collision after first 75 database entries.

The amount of unique hashes our algorithm can generate and amount of data entries can grow exponentially, but the probabillity of collision will stay the same. If this right?

I have a huge table with transactions (for E-commerce) and I have field 'receipt' set as unique. And the actual receipt numbers is what bothering me.

Example of the receipt number: BHF2Z47E uppercase only A-Z/0-9 with the length of 8 symbols.

UPDATE:

The Birthday Paradox

Foi útil?

Solução

The birthday paradox just states that if you randomly generate values in a space of n, there is a rapid phase transition from no collisions to collisions when you store sqrt(n) values - that's where the probability increases to over 50%.

In your example you have an alphabet of 26 + 10 characters, and 8 digits; so that's 36^8 or about 2.8 trillion possible keys; you can expect to have more than a 50% collision probability after about 1.6 million entries; that's not very good. There is a decent chance of collision at even a fraction of that.

As a comparison, let's say you generated a 160-bit random key for each receipt (2^160 possible values); then you would need to generate about 2^80 receipts (around 10^24) to reach the same probability of collision. You could sell your product as a very large company for your whole life and probably still not see any. Another perspective is that your hard disk or computer will fail before you observe a collision.

The table in this article gives some concrete numbers for you. For example, with a 256-bit hash value and 10^31 values inserted, you would get a collision probability of 10^-15. According to that article, that is around the uncorrectable error rate of your hard disk. That's probably the magnitude of what you should be aiming for with your receipts to avoid overwriting them. It's not to hard to make the values just a little bigger.

Of course, this depends on the fact that you seed your PRNG properly with random data; otherwise you could get the same key easily :)

Licenciado em: CC-BY-SA com atribuição
Não afiliado a StackOverflow
scroll top