Question

I am trying to understand crc32 to generate the unique url for web page.

If we use the crc32, what is the maximum number of urls can be used so that we can avoid duplicates?

What could be the approximative string length to keep the checksum to be 2^32?

When I tried UUID for an url and convert the uuid bytes to base 64, I could reduce to 22 chars long. I wonder I can reduce still further.

Mostly I want to convert the url (maximum 1024 chars) to shorted id.

Was it helpful?

Solution

There is no such number as the "maximum number of urls can be used so that we can avoid duplicates" for CRC32.

The problem is that CRC32 can produce duplicates, and it's not a function of how many values you throw at it, it's a function of what those values look like.

So you might have a collision on the second url, if you're unlucky.

You should not base your algorithm on producing a unique hash, instead produce a unique value for each url manually.

OTHER TIPS

If you're already storing the full URL in a database table, an integer ID is pretty short, and can be made shorter by converting it to base 16, 64, or 85. If you can use a UUID, you can use an integer, and you may as well, since it's shorter and I don't see what benefit a UUID would provide in your lookup table.

The right way to make a short URL is to store the full one in the database and publish something that maps to the row index. A compact way is to use the Base64 of the row ID, for example. Or you could use a UID for the primary key and show that.

Do not use a checksum, because it's too small and very likely to conflict. A cryptographic hash is larger and less likely, but it's still not the right way to go.

CRC32 means cyclic redundancy check with 32 bits where any arbitrary amount of bits is summed up to a 32 bit check sum. And check sum functions are surjective, that means multiple input values have the same output value. So you cannot inverse the function.

No, even you use md5, or any other check sum, the URL CAN BE duplicate, it all depends on your luck.

So don't make an unique url base on those check sum

The quickest (and perhaps best!) way to solve things may be to simply use a hash of the local path and query of a given URI, as follows:

using System;

namespace HashSample
{
    class Program
    {
        static void Main(string[] args)
        {
            Uri uri = new Uri(
                "http://host.com/folder/file.jpg?code=ABC123");

            string hash = GetPathAndQueryHash(uri);

            Console.WriteLine(hash);
        }

        public static string GetPathAndQueryHash(Uri uri)
        {
            return uri.PathAndQuery.GetHashCode().ToString();
        }
    }
}

The above presumes that the URI scheme and host remain the same. If not GetHashCode will work with any string.

For a great discussion on CRC32 Hash Collision visit: http://episteme.arstechnica.com/eve/forums/a/tpc/f/6330927813/m/821008399831

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top