Question

I would like to generate a short, unique ID without having to check for collisions.

I currently do something like this, but the ID I currently generate is random and checking for collisions in a loop is annoying and will get expensive if the number of records grows significantly.

Normally worrying about collisions isn't an issue, but the unique ID I want to generate is a short unique string 5-8 characters, alpha-numeric, like tinyurl does.

EDIT: I would like to start out with 5 characters and if I hit 60 million entries, then go to 6.. so on and so forth.

To this end, I was thinking I could use an auto_increment value that is hidden from the users, and present them instead with an MD5 or some other method to generate a unique string from that.

Generated strings should not appear to be linear, so simply converting the auto_incremented ID into base 36 [0-9A-Z] is a bit too simplistic, but a function something like that is where I'm going with this.

EDIT: Security is not an issue as this will not be used to secure information. It is simply a shortcut to a longer string. Thank you.

Thank you for your suggestions and sorry for the delay. Dentist..

Was it helpful?

Solution

You'll need something that's correct by construction, i.e. a permutation function: this is a function that does a one-to-one, reversible mapping of one integer (your sequential counter) to another. Some examples (any combination of these should also work):

  • inverting some of the bits (f.i. using an XOR, ^ in PHP)
  • swapping the places of bits (($i & 0xc) >> 2 | ($i & 0x3) << 2), or just reversing the order of all bits
  • adding a constant value modulo your maximum range (must be a factor of two, if you're combining this with the ones above)

Example: this function will convert 0, 1, 2, 3, 5, .. into 13, 4, 12, 7, 15, .. for numbers up to 15:

$i=($input+97) & 0xf;
$result=((($i&0x1) << 3) + (($i&0xe) >> 1)) ^ 0x5;

EDIT

An easier way would to use a linear congruential generator (LCG, which is usually used for generating random numbers), which is defined by a formula of the form:

X_n+1 = (a * X_n + c) mod m

For good values of a, c and m, the sequence of X_0, X_1 .. X_m-1 will contain all numbers between 0 and m-1 exactly once. Now you can start from a linearly increasing index, and use the next value in the LCG sequence as your "secret" key.

EDIT2

Implementation: You can design your own LCG parameters, but if you get it wrong it won't cover the full range (and thus have duplicates) so I'll use a published and tried set of parameters here from this paper:

a = 16807, c = 0, m = 2147483647

This gives you a range of 2**31. With pack() you can get the resulting integer as a string, base64_encode() makes it a readable string (of up to 6 significant characters, 6 bits per byte) so this could be your function:

substr(base64_encode(pack("l", (16807 * $index) % 2147483647)), 0, 6)

OTHER TIPS

You could probably generate a MD5 hash of the current datetime/random number and truncate it to the length you need (5-8 characters) and store it as the id field.

If you are using storing this information in a database, you don't need to use a for loop to do the collision check, but you could just do a select statement - something like

SELECT count(1) c FROM Table WHERE id = :id

where :id would be the newly generated id. If c is greater than 0 then you know it already exists.

EDIT

This may may not be the best way to go about it. But I'll give it a shot, so I guess what you need is someway of converting a numbers into a unique short string and that is not in sequence.

I guess as you said, base64 encoding already does the number to short string conversion. To avoid the sequence problem you could have some mapping between your auto-generated id's to some "random" value (unique mapping). Then you can base64 encode this unique value.

You could generate this mapping as follows. Have a temporary table store values from 1 - 10,000,000. Sort it in random order and store it into you Map table.

INSERT INTO MappingTable (mappedId) SELECT values FROM TemporaryTable ORDER BY RAND()

Where MappingTable would have the 2 fields id (your auto-generated id would look up against this) and mappedId (which is what you would generate the base64 encoding for).

As you get closer to 10,000,000 you could rerun the above code again and change the values in the temporary table with 10,000,001-20,000,000 or something like that.

you can use a bitwise XOR to scramble some of the bits:

select thefield ^ 377 from thetable;

+-----+---------+
| a   | a ^ 377 |
+-----+---------+
| 154 |     483 |
| 152 |     481 |
|  69 |     316 |
|  35 |     346 |
|  72 |     305 |
| 139 |     498 |
|  96 |     281 |
|  31 |     358 |
|  11 |     370 |
| 127 |     262 |
+-----+---------+

I think this will never be really secure, as you only need to find the encryption method behind the short unique string to hijack an ID. Is checking for collisions in a loop really that problematic in your setting?

An MD5 of an incrementing number should be fine, but I worry that if you're truncating your MD5 (which is normally 128 bits) down to 5-8 characters, you will almost certainly be damaging it's capability to act as a unique signature...

Completely true. Especially if you reach your 80% collision chance a truncated MD5 will be as good as any random number to guarantee uniqueness by itself, i.e. worthless.

But since you're using a database anyway, why not just use a UNIQUE INDEX ? This way the uniquness check is done (in a much more efficient way than using a loop) by MySQL itself. Just try to do the INSERT with your MD5-generated key, and if it fails, try again...

If you cannot use an auto increment field, and want an absolutely unique value, use UUID. If you decide to use anything else (besides auto increment), you would be silly to NOT check for collisions.

An MD5 of an incrementing number should be fine, but I worry that if you're truncating your MD5 (which is normally 128 bits) down to 5-8 characters, you will almost certainly be damaging it's capability to act as a unique signature...

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top