Question

How can the Google URL shortener generate a unique hash with five characters without collisions. Seems like there are bound to be collisions, where different urls generate the same hash.

stackoverflow.com => http://goo.gl/LQysz

What's also interesting, is the same URL, generates a completely different hash each time:

stackoverflow.com => http://goo.gl/Dl7sz

So, doing some math, using lower-case characters, upper-case characters, and digits, the total number of combinations are 62^5 = 916,132,832 clearly collisions bound to happen.

How does Google do this?

Was it helpful?

Solution

They have a database which tracks all previously generated URLs and the longer URL that each of those maps to. Easy to make sure that newly generated URLs don't already exist in that table. A little tricky to scale out (they surely have multiple servers so each one needs to be assigned a bucket of values from which it can give out to users). If they ever reach the point of having generated 916,132,832 URLs, they'll just add another character.

OTHER TIPS

They have a hash table with hash to url.

Count the number of rows in that table and encrypt it with a stream cipher then encode with base62.

Using a stream cipher instead of a hash will give you a short pseudo random output that doesn't collide with any previous output so you don't need to check the table.

  1. It keeps track of previously used long URLs. This means that, when someone goes to create a short URL, if the place they are pointing to already has a short URL, it will just give them the pre-existing short URL.

  2. Actually, it would be inefficient to have a system dedicated to creating 'hashes' based on a given set of data. Rather, the short URL is simply a random set of characters which has already been identified as ten digits, plus 26 lowercase letters, plus 26 uppercase letters = 916132832 permutations (not combinations). Random short URLs is the most efficient way to make it work, and that is why they are always different (though I suppose there could be some other component in the algorithm like the time of day, but I don't think it's worth it....there's no point in making it that complex; spending all of that processing power just to make a silly 5 character string which any monkey could do by pressing a button the right way on a permutation calculator).

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top