Question

I am saving URLs in a database, and when I insert a new one, I want to check whether that URL already exists in the database.

A common practice (if I'm not mistaken) is to hash the URLs using MD5, SHA-1, etc., and check that field in the database for duplicates before inserting a new one.

I know MD5 can produce collisions, and so can SHA-1.

What do you suggest for me? My needs are:

  • DB size: eventually 10 to 20 million records in the database

  • Performance/Speed: a small hash size, so the database is not under heavy load when checking for duplicates (there will of course be an index on that field)

  • Tolerance: I don't care if I get 1 collision in every 100,000 records. My needs lean towards performance (small hash) rather than 0% collisions (big hash).

  • Chance of attack by malformed URLs to produce collisions on purpose: Extremely Low

  • Maximum damage possible in case of such a successful attack: Extremely Low
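To put numbers on the size and tolerance requirements above, here is a rough sketch of the birthday-bound estimate. It assumes the hash output is uniformly distributed, which is only an approximation for real URLs, and the function name `expected_colliding_pairs` is made up for illustration.

```python
def expected_colliding_pairs(n_records: int, hash_bits: int) -> float:
    """Approximate expected number of colliding pairs among n_records hashes.

    There are n*(n-1)/2 distinct pairs, and under the uniformity assumption
    each pair collides with probability 2**-hash_bits.
    """
    pairs = n_records * (n_records - 1) / 2
    return pairs / 2 ** hash_bits


if __name__ == "__main__":
    # 20 million records is the upper end of the size stated in the question.
    for bits in (32, 40, 64, 128):
        print(f"{bits:>3} bits: {expected_colliding_pairs(20_000_000, bits):.3g}")
```

Under that assumption, 20 million records give roughly 46,000 colliding pairs with a 32-bit hash, about 180 with 40 bits, and on the order of 1e-5 with 64 bits. A tolerance of 1 collision per 100,000 records allows about 200 at that size, so a 64-bit value (for example, a truncated MD5) would already be far below it.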

Questions:

  • Do you believe MD5 is enough, or is there something better to suggest?

  • Maybe MD5 is even overkill for me, and I could get serious performance benefits by using something simpler?

Thank you in advance guys!

Solution

What about using MD5 or a similar relatively inexpensive hash (maybe Quark?), and, in the rare case of a hash match, comparing the full URL against the matching entries? That way, most of the time you only pay for the inexpensive hash check, yet you never actually insert a duplicate URL.
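A minimal sketch of that idea, assuming SQLite through Python's `sqlite3` module and a 64-bit key taken from the first 8 bytes of the MD5 digest. The table name `urls`, the column names, and the helper functions are hypothetical and not part of the original answer.

```python
import hashlib
import sqlite3


def url_key(url: str) -> int:
    """First 8 bytes of the MD5 digest as a signed 64-bit integer."""
    digest = hashlib.md5(url.encode("utf-8")).digest()
    # Signed so the value fits SQLite's 64-bit INTEGER type.
    return int.from_bytes(digest[:8], "big", signed=True)


def create_schema(conn: sqlite3.Connection) -> None:
    conn.execute(
        "CREATE TABLE IF NOT EXISTS urls ("
        "id INTEGER PRIMARY KEY, url_hash INTEGER NOT NULL, url TEXT NOT NULL)"
    )
    # Non-unique index: two different URLs may share a hash.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_urls_hash ON urls (url_hash)")


def insert_if_new(conn: sqlite3.Connection, url: str) -> bool:
    """Insert url unless it is already stored; return True if inserted."""
    key = url_key(url)
    # The hash filter narrows the candidates cheaply; the full URL
    # comparison then rules out false positives from hash collisions.
    row = conn.execute(
        "SELECT 1 FROM urls WHERE url_hash = ? AND url = ? LIMIT 1",
        (key, url),
    ).fetchone()
    if row is not None:
        return False
    conn.execute("INSERT INTO urls (url_hash, url) VALUES (?, ?)", (key, url))
    return True
```

Because the index is on the small integer column rather than on the full URL text, lookups stay cheap, and the extra string comparison only runs against the few rows that share a hash.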

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow