i'm looking for optimal string sequence generator that could be used as user id generator. As far as i get it should implement following features:

  • Length is restricted to 8 characters.
  • Consists from latin symbols and digits
  • Public and thus should be obfuscated

For now i came up with following algorithm:

def idgen(begin, end):
    assert type(begin) is int
    assert type(end) is int
    allowed = reduce(
        lambda L, ri: L + map(chr, range(ord(ri[0]), ord(ri[1]) + 1)),
        (('a', 'z'), ('0', '9')), list()
    )
    shift = lambda c, i: allowed[(allowed.index(c) + i) % len(allowed)]
    for cur in xrange(begin, end):
        cur = str(cur).zfill(8)
        cur = izip(xrange(0, len(cur)), iter(cur))
        cur = ''.join([shift(c, i) for i, c in cur])
        yield cur

But it would give pretty similar ids, 0-100 example: '01234567', '01234568', '01234569', '0123456a' etc.

So what are best practices? How, i suppose url shorteners should use some kind of similar algorithm?

有帮助吗?

解决方案

Since you need "obfuscated" ids, you want something that looks random, but isn't. Fortunately, almost all computer generated randomness fulfills this criteria, including the one generated by the random python module.

You can generate a fixed sequence of numbers by settings the PRNG seed, like so:

import random
import string

valid_characters= string.lowercase+string.digits

def get_random_id():
    return ''.join(random.choice(valid_characters) for x in range(8))

random.seed(0)
for x in range(10):
    print get_random_id()

This will always print the same sequence of 10 "random" ids.

You probably want to generate ids on demand, instead of all at once. To do so, you need persist the PRNG state:

random.setstate( get_persisted() )
get_random_id()
persist( random.getstate )
#repeat ad infinitum

This is somewhat expensive, so you'll want to generate a couple of random ids at once, and keep them on a queue. You could also use random.jumpahead, but I suggest you try both techniques and see which one is faster.

Avoiding collisions for this task is not trivial. Even using hash functions will not save you from collisions.

The random module's PRNG has a period of 2**19937-1. You need a period of at least (26+10)**8, which is about 2**42. However, the fact that the period you need is lower than the period of the PRNG does not guarantee that there will be no collisions, because the size of your output (8 bytes, or 64 bits) may not match the output size of the PRNG. Also, depending on the particular implementation of choice, it's possible that, even if the sizes matched, the PRNG simply advances the PRNG on each call.

You could try to iterate over the keyspace, checking if something repeats:

random.seed(0)
space=len(valid_characters)**8
found=set()
x=0
while x<space:
    id= get_random_id()
    if id in found:
        print "cycle found after",x,"iterations"
    found.update(id)
    if not x% 1000000:
        print "progress:",(float(x)/space)*100,"%"
    x+=1

But this will take a long, long while.

You could also handcraft a PRNG with the desired period, but it's a bit out of scope on stackoverflow. You could try the math stackexchange.

Ultimately, and depending on your purpose, I think your best bet may be simply keeping track of already generated ids, and skipping repeated ones.

许可以下: CC-BY-SA归因
不隶属于 StackOverflow
scroll top