Question

Introduction

I am working on a "prize codes" system for my website: administrators can assign credits to a user by giving them a code that they can redeem.
The idea itself is quite easy to implement: generate the codes (I'll assume the generation algorithm is hard to predict), store them in a secure database, and give administrators the ability to read one random code. But I found that some of the features I want conflict with each other:

  1. The codes must reasonably resist attacks mounted with reasonable computing power;
  2. There must be a large number of codes, 1 million;
  3. The codes must be easy to type.

After some thinking, I came up with this solution: using phrases of N random words. This seems to be a good idea, but I'm not experienced in this field.

Question 1

Is it safe to use phrases as codes? If so, how many words should I use to be safe enough? [Assume words of average length that an 8-year-old child knows.]

Question 2

Given a list of words, what is the best way to store them into the database?

  • Desktop-side: generate a file with the desired phrases, then read it and insert them into the database using PHP?
  • Server-side: read a list of words and write the phrases directly into the database?
  • Some other way?

Please note that I can't just let the administrators read the phrase files, as that is harder to code than using a database. Also note that the words must be common Italian: I estimate 4,000 to 6,000 usable words.


Solution

I'm going to assume throughout that your codes will be generated from proper (pseudo)random numbers. If you generate lots of them, make sure to provide enough entropy to the random number generator of the computer that generates them.

Codes like the ones you're describing are often created as strings of random printable characters. One way to do it, for example, is to generate a 144-bit random number and base64-encode it. This gives you 144 bits of information in a string of 24 characters. Usually people don't mind that the strings read as garbage, because they're either copied and pasted or embedded in URLs. You can think of this string as a sequence of 24 symbols each drawn from an alphabet of 64 characters, or as a sequence of 144 symbols each drawn from an alphabet of 2 bits. It comes out the same.
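As a minimal sketch of that approach (the 18-byte/24-character sizes follow from the 144-bit example above):

```python
import base64
import secrets

# Draw 144 bits (18 bytes) from the OS cryptographic random number generator.
token_bytes = secrets.token_bytes(18)

# URL-safe base64: each character encodes 6 bits, so 144 bits -> exactly
# 24 characters, with no '=' padding because 18 is a multiple of 3.
code = base64.urlsafe_b64encode(token_bytes).decode("ascii")
print(code, len(code))
```

Using `secrets` rather than the ordinary `random` module matters here: the codes must come from a CSPRNG, as assumed at the start of this answer.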

Generating a list of WORDS instead of a list of CHARACTERS (or BITs) is similar. Instead of an alphabet of 2 bits or 64 characters, you're using an alphabet of 4,000 to 6,000 words. That's a much bigger alphabet (more information per symbol), but you expect to use fewer of them in your phrase (less information overall).

Using n words, you will get log₂((4000 to 6000)ⁿ) bits. For simplicity, let's say you choose an alphabet of 4096 words. To contain the same amount of information as 144-bit tokens, your phrases will need to contain 12 words each.

Your requirement "There must be a large number of codes, 1 million" is unclear. Do you mean that the codespace must be 1 million phrases, or that you intend to generate 1 million different phrases that are each part of a codespace that is much larger?

If your codespace needs to be 1 million, that's only log₂(1000000) ≈ 20 bits, so phrases of 2 words will do. That's probably a bit of a small codespace, though. It depends on how many chances an adversary is likely to get to guess codes, how quickly they can make those guesses, and how damaging a correct guess would be. Without knowing your requirements I can only guess how big a codespace you want. Perhaps twice as many bits (40 bits, about 1 trillion codes)?
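To make that trade-off concrete, here is a back-of-the-envelope estimate; the 40-bit codespace and 1 million issued codes are just the example figures from this answer, not a recommendation:

```python
# Suppose 1 million codes are issued out of a 40-bit codespace.
issued_codes = 1_000_000
codespace_bits = 40  # about 1.1 trillion possible phrases

# Probability that a single uniformly random guess hits an issued code.
p_hit = issued_codes / 2**codespace_bits
print(p_hit)  # roughly one success per million guesses
```

An adversary who can try thousands of guesses per second against such a space would still need days on average to find one valid code, and rate-limiting the redemption endpoint shrinks that further.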

Question 2: It doesn't really make any difference where you generate the phrases so long as the computer that does it has a good source of randomness. If in doubt, use your desktop. Then it doesn't really matter how you load them into the database, be it RPCs, remote database access, or copying the file to the server.

If you generate a lot of phrases, be aware that storing them as plain strings in the database is going to be costly. You can save a lot of space by encoding them as a sequence of integers that are indices into your reference wordlist.
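A minimal sketch of that encoding, using a toy four-word list in place of the real 4,000-6,000-word Italian list:

```python
# Toy wordlist; a real one would have ~4096 entries, so each index
# fits in 12 bits (2 bytes), and a 12-word phrase stores in 24 bytes.
wordlist = ["gatto", "cane", "casa", "sole"]
index_of = {w: i for i, w in enumerate(wordlist)}

def encode(phrase: str) -> list[int]:
    """Turn a phrase into a list of indices into the wordlist."""
    return [index_of[w] for w in phrase.split()]

def decode(indices: list[int]) -> str:
    """Recover the phrase from its stored indices."""
    return " ".join(wordlist[i] for i in indices)

ids = encode("sole gatto casa")
print(ids)          # [3, 0, 2]
print(decode(ids))  # sole gatto casa
```

The index lists can then be stored as small integer columns or packed into a single integer per phrase, rather than as full text.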

EXTRA: Note that if you use more than 2 or 3 words in each phrase, you may find that the phrases are just as hard to remember and type as random sequences of characters. This is because even though they will be made up of real words, the phrases will mostly be nonsense, including words that grammatically cannot fit next to one another.

If you want to mitigate this, you can use a statistical model of how often words appear next to each other in a corpus of text (say, a collection of literature) in order to produce sequences of words that are more likely to seem natural to a human. Modelling adjacent pairs, triples, or n-tuples of words gives a first-, second-, or (n−1)-order Markov chain, respectively.
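A first-order Markov chain of this kind can be sketched as follows; the corpus here is a tiny made-up Italian sample, where a real system would train on a large body of text:

```python
import random
from collections import defaultdict

# Toy training corpus; a real model would use a large Italian corpus.
corpus = "il gatto dorme il cane corre il gatto corre".split()

# First-order Markov model: for each word, the words observed after it.
successors = defaultdict(list)
for a, b in zip(corpus, corpus[1:]):
    successors[a].append(b)

def generate(start: str, length: int, rng: random.Random) -> list[str]:
    """Walk the chain, picking each next word from the observed successors."""
    phrase = [start]
    for _ in range(length - 1):
        options = successors.get(phrase[-1])
        if not options:          # dead end: fall back to any corpus word
            options = corpus
        phrase.append(rng.choice(options))
    return phrase

print(" ".join(generate("il", 6, random.Random())))
```

For real codes the walk should of course be driven by a cryptographic RNG (e.g. `random.SystemRandom`), and, as noted below, the per-word entropy must be re-measured against the model rather than assumed to be log₂ of the wordlist size.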

Obviously, generating your phrases like this is going to REDUCE the amount of information (in the information-theoretic sense) in your phrases so you will have to compensate by making them longer. The exact amount of information reduction that results from using a particular statistical model is left as an exercise for you to calculate :-)

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow