Question

I have a requirement to develop a service that must generate a few million random and unique alphanumeric strings of length 10 each day (I can't increase this length; this is a customer prerequisite). Given one ID, the next ID must not be guessable.

I want to be sure that all IDs will be unique, because those IDs will later be used in a database (not my own) and will represent the ID of a product, so each one MUST be unique. I know that the probability of a collision is low, but I want to prevent it.

I have no problem generating a random string, but I want to ensure that all generated IDs are really unique. I was thinking about using a local SQL database and inserting a few million rows per day into it (which would ensure uniqueness, as the ID would be the primary key). I would then pick up those IDs, mark them as "processed" and send them.

To improve insertion performance, I was thinking about having a table for each year (a partition?).
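
Roughly what I have in mind is sketched below (just a sketch; the JDBC URL, table name and column names are placeholders I made up, and the table would be created beforehand):

```java
import java.security.SecureRandom;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLIntegrityConstraintViolationException;

public class IdBatchGenerator {
    private static final String ALPHABET =
            "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
    private static final SecureRandom RANDOM = new SecureRandom();

    public static void main(String[] args) throws Exception {
        // Table created beforehand, e.g.:
        // CREATE TABLE generated_ids (id CHAR(10) PRIMARY KEY, processed BOOLEAN DEFAULT FALSE)
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:mysql://localhost/ids", "user", "password");
             PreparedStatement insert = conn.prepareStatement(
                 "INSERT INTO generated_ids (id) VALUES (?)")) {
            for (int i = 0; i < 1_000_000; i++) {       // one daily batch
                while (true) {
                    insert.setString(1, randomId());
                    try {
                        insert.executeUpdate();
                        break;                           // unique, go to the next one
                    } catch (SQLIntegrityConstraintViolationException e) {
                        // primary key collision: generate another ID and retry
                    }
                }
            }
        }
    }

    private static String randomId() {
        StringBuilder sb = new StringBuilder(10);
        for (int i = 0; i < 10; i++) {
            sb.append(ALPHABET.charAt(RANDOM.nextInt(ALPHABET.length())));
        }
        return sb.toString();
    }
}
```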

Do you think this is a good solution to ensure uniqueness? Would you have any other solution, maybe better than an SQL database?

Thanks.


Solution

Re-Think

You did not specify how hard the "not guessable" requirement is. Of course you would not use simple serial numbers, but I would not rule out a mechanism that can create unique strings using simple cryptographic techniques and does not need a database at all.

Risk evaluation

  • What are the IDs used for? Just identification of an item? Or are they used as some kind of license key or access token?

  • What is at stake? What damage is done when an ID is guessed? With millions of products per day, I don't assume that the value of each one is in the hundreds of dollars (if it were, your customer would pay an expert who wouldn't ask on Stack Exchange :-) )

  • What level of adversaries do you expect? Ordinary users, script kiddies, determined hackers, state-level actors?

  • What profit can they gain, i.e. what amount of effort would they invest to hack your scheme?

If there is a real risk of being hacked and the possible damage is high, this scheme is probably too simple, but from the limited information you provide, I don't think so.

The mechanism

10 alphanumeric characters give roughly 60 bits (if you use uppercase letters, lowercase letters and digits, 62^10 ≈ 2^59.5). That's close enough to 64 that a 64-bit block cipher can be used here. So you start with plain sequential numbers and encrypt them with a 64-bit block cipher (and a key that must be kept secret, of course), skipping those numbers whose encrypted value falls outside the representable range - on average, about one in every 22 numbers is good (2^64 / 62^10 ≈ 22).

Since the block encryption is reversible, the resulting bit patterns are unique, and you can safely use them without storing them anywhere. By decrypting a given ID, you can verify that it was generated with your algorithm and key (if the decrypted value is less than the current value of the generator, it was created by this algorithm). If you store just the last value of the sequence-number generator for each day, you can even determine on which day an ID was generated.
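
A minimal sketch of this idea, assuming Java and Blowfish (one of the 64-bit block ciphers available through the standard JCE); the class and method names are placeholders, and key management and counter persistence are deliberately left out:

```java
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;
import java.math.BigInteger;
import java.nio.ByteBuffer;

public class EncryptedSequenceIds {

    private static final String ALPHABET =
            "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
    // 62^10: the number of distinct 10-character alphanumeric strings (~2^59.5)
    private static final BigInteger LIMIT = BigInteger.valueOf(62).pow(10);

    private final Cipher encrypt;
    private final Cipher decrypt;
    private long counter;            // persist this value between runs in real use

    public EncryptedSequenceIds(byte[] key, long startCounter) throws Exception {
        SecretKeySpec spec = new SecretKeySpec(key, "Blowfish");
        // Blowfish has a 64-bit block size and ships with the standard JCE providers
        encrypt = Cipher.getInstance("Blowfish/ECB/NoPadding");
        encrypt.init(Cipher.ENCRYPT_MODE, spec);
        decrypt = Cipher.getInstance("Blowfish/ECB/NoPadding");
        decrypt.init(Cipher.DECRYPT_MODE, spec);
        counter = startCounter;
    }

    /** Encrypts counter values until one falls below 62^10, then encodes it in base 62. */
    public String nextId() throws Exception {
        while (true) {
            byte[] block = ByteBuffer.allocate(8).putLong(counter++).array();
            BigInteger value = new BigInteger(1, encrypt.doFinal(block)); // unsigned
            if (value.compareTo(LIMIT) < 0) {
                return toBase62(value);
            }
            // out of range for 10 alphanumeric characters: skip and try the next counter
        }
    }

    /** Decrypts an ID; it is genuine if the recovered counter is below the current one. */
    public boolean wasGeneratedHere(String id) throws Exception {
        if (id.length() != 10) return false;
        BigInteger value = BigInteger.ZERO;
        for (char c : id.toCharArray()) {
            int digit = ALPHABET.indexOf(c);
            if (digit < 0) return false;               // not an alphanumeric character
            value = value.multiply(BigInteger.valueOf(62)).add(BigInteger.valueOf(digit));
        }
        byte[] block = ByteBuffer.allocate(8).putLong(value.longValue()).array();
        long recovered = ByteBuffer.wrap(decrypt.doFinal(block)).getLong();
        return recovered >= 0 && recovered < counter;
    }

    private static String toBase62(BigInteger value) {
        char[] chars = new char[10];
        for (int i = 9; i >= 0; i--) {
            BigInteger[] qr = value.divideAndRemainder(BigInteger.valueOf(62));
            chars[i] = ALPHABET.charAt(qr[1].intValue());
            value = qr[0];
        }
        return new String(chars);
    }
}
```

The only state that has to survive restarts is the secret key and the last counter value; the cipher being a bijection on 64-bit blocks is what guarantees that two different counter values can never yield the same ID.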

Of course, once your algorithm and the block cipher in use are known, breaking the key requires "just" a number of IDs known to be sequential, plus enough time and computing power. State-level actors could probably do this; I'm not sure about determined hackers.

Other tips

Your approach is fine. Only the new table for each year is debatable: it gains insertion speed at the cost of query speed (you will then have to query multiple tables to make sure your uniqueness requirement is fulfilled), so it is not at all clear that it will really bring the benefit you expect from it.

I would actually keep away from this premature optimization and start thinking about improvements only when performance degradation becomes a real issue. That is probably soon enough to reorganize the DB, and by then you will have numbers at hand to make a better decision on how to partition it. Maybe it is faster to have a table per month. Maybe it is better to use a DB which can do the partitioning for you "under the hood". Maybe it is better to partition by splitting tables according to the first letter(s) of your randomly generated string, who knows. But I am pretty sure it is currently too early to commit to "partitions per year".

Licensed under: CC-BY-SA with attribution