Question

I was wondering if there was a way to encode binary data, such as crypto addresses, into a natural-language format, similar to how what3words.com can encode locations into three easy-to-remember words.

Would it be possible to use something like a recurrent neural network with a deterministic word sampling function to do this? Is there an easier way to go about accomplishing this?

I did some basic calculations, and with 171,146 words it would take a minimum of 15 words to cover a 256-bit address. Is there a fast way to do this?


Solution

So Merriam-Webster has some 470,000 English words. More than enough for this task.

Take some subset of those words and arrange them in a list indexed from 0 to N; try to make N = 2^K - 1 to keep this easy.

For each K bits, pick the associated word. It's not necessarily going to be easy to remember.

eg:

  • Cat
  • Fire
  • Tree
  • House

10 = Tree. 1001 = Tree Fire.
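A minimal sketch of this fixed-width scheme, using the toy four-word list above (the list and function names are illustrative, not part of any real scheme):

```python
# Fixed-width scheme: a 4-word list encodes 2 bits per word.
WORDS = ["Cat", "Fire", "Tree", "House"]   # indices 0b00 .. 0b11
BITS_PER_WORD = 2                          # K, since len(WORDS) == 2**K

def encode(bits: str) -> list[str]:
    """Split the bit string into K-bit chunks and map each chunk to a word."""
    assert len(bits) % BITS_PER_WORD == 0
    return [WORDS[int(bits[i:i + BITS_PER_WORD], 2)]
            for i in range(0, len(bits), BITS_PER_WORD)]

def decode(words: list[str]) -> str:
    """Map each word back to its K-bit index."""
    return "".join(format(WORDS.index(w), f"0{BITS_PER_WORD}b") for w in words)

print(encode("10"))    # ['Tree']
print(encode("1001"))  # ['Tree', 'Fire']
```

With a 2^17-word list the same code would pack 17 bits per word; only `WORDS` and `BITS_PER_WORD` change.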


Okay, let's try to leverage English phraseology a little.

  • Simple sentence: Noun Verb Noun.
  • More complex sentence: Adjective Noun Adverb Verb Adjective Noun.
  • And More so: Preposition Article Noun, Adjective Noun Adverb Verb Article Adjective Noun.

You will obviously lose some of the bits encoded per word, but this has the obvious benefit of providing some self-checking: if a word that should be an adjective isn't one, the reader knows a word was missed, and so on.

But a similar scheme can be devised: for each kind of word, create a list indexed from 0 to 2^K - 1 for some K per word class, and simply pick the word corresponding to the bits at each position.
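A sketch of that template idea, assuming toy one-bit word lists per slot (all lists are made up for illustration); note how a failed lookup during decoding doubles as the self-check described above:

```python
# Phrase-template scheme: each position in the template draws from its own
# part-of-speech list, so a decoded phrase can be sanity-checked.
TEMPLATE = [
    ("noun", ["cat", "tree"]),
    ("verb", ["eats", "burns"]),
    ("noun", ["house", "fire"]),
]

def encode(bits: str) -> list[str]:
    words, pos = [], 0
    for _, wordlist in TEMPLATE:
        k = (len(wordlist) - 1).bit_length()   # bits this slot encodes
        words.append(wordlist[int(bits[pos:pos + k], 2)])
        pos += k
    return words

def decode(words: list[str]) -> str:
    out = ""
    for (_, wordlist), w in zip(TEMPLATE, words):
        k = (len(wordlist) - 1).bit_length()
        # raises ValueError if the word is the wrong part of speech: self-check
        out += format(wordlist.index(w), f"0{k}b")
    return out

print(encode("101"))  # ['tree', 'eats', 'fire']
```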


You may want to consider adding checksums and self-healing to the scheme.

Checksums are simple enough. Calculate some set of parity bits, or a CRC (Cyclic Redundancy Check) code, and, as with the other data, select the word representing those bits.
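A sketch of the checksum idea, using CRC-32 reduced modulo the list size purely for illustration (a real scheme would choose its parity bits more deliberately):

```python
# Append a checksum word so a mis-remembered phrase can be detected.
import zlib

WORDS = ["Cat", "Fire", "Tree", "House"]

def with_checksum(words: list[str]) -> list[str]:
    """Append one word encoding a CRC-32 of the payload, reduced mod the list size."""
    crc = zlib.crc32(" ".join(words).encode())
    return words + [WORDS[crc % len(WORDS)]]

def verify(words: list[str]) -> bool:
    """Recompute the checksum word for the payload and compare."""
    return with_checksum(list(words[:-1]))[-1] == words[-1]

phrase = with_checksum(["Tree", "Fire"])
print(phrase, verify(phrase))   # the last word is the checksum word
```

With only four words this catches a corruption three times out of four; dedicating more words (more bits) to the checksum raises the detection rate accordingly.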

Self-healing has two forms. The first is along the lines of the checksum: calculate an error-correcting code as per a well-established error-correction scheme, then encode those extra bits as another word.

The alternate form of self-healing comes from the English language itself. Doubtless the human is going to make mistakes, but if they truly tried to commit the phrase to memory, those mistakes will come in the form of substituted words with similar semantics, local spelling variants, or words that sound similar.

eg:

  • substituting tragedy for disaster (following a semantic meaning).
  • neighbour and neighbor (depending on which English dialect you speak)
  • whether the weather or the wether? (following the phonetics).
  • is, was (verb tense)
  • lamb, lambs (pluralisation)

Those are just a few; it would pay to consider such words, and even whole phrases. One solution is simply to make them substitutes for each other.
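One way to sketch that substitution idea is a normalisation table consulted before decoding; the alias pairs below are the examples from the list above, and a real table would be much larger:

```python
# Canonicalise likely substitutions before decoding, so "disaster",
# "neighbor"/"neighbour", or a pluralised form all map back to the intended word.
ALIASES = {
    "disaster": "tragedy",    # semantic substitution
    "neighbor": "neighbour",  # dialect spelling
    "lambs": "lamb",          # pluralisation
}

def canonical(word: str) -> str:
    """Map a word to its canonical form before looking up its index."""
    w = word.lower()
    return ALIASES.get(w, w)

print([canonical(w) for w in ["Disaster", "neighbor", "lambs"]])
# ['tragedy', 'neighbour', 'lamb']
```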


Similarly, it might pay to ensure that words are only used once. To do this, though, you would need an adaptive scheme.

A simple adaptive scheme would be to have a list of substitute words, and as a word is used replace it with the next from the list.
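A sketch of that adaptive scheme, with a hypothetical reserve list supplying the substitutes (a decoder would mirror the same replacements to stay in sync):

```python
# Adaptive scheme: once a word is used, it is replaced in the working list by
# the next word from a reserve list, so no word appears twice in the output.
WORDS = ["Cat", "Fire", "Tree", "House"]
RESERVE = ["Dog", "Water", "Bush", "Barn"]   # substitutes, in order

def encode(indices: list[int]) -> list[str]:
    words, reserve = list(WORDS), list(RESERVE)
    out = []
    for i in indices:
        out.append(words[i])
        words[i] = reserve.pop(0)   # swap in the next substitute
    return out

print(encode([2, 2, 0]))  # ['Tree', 'Dog', 'Cat'] - index 2 used twice, no repeat
```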


On adaptive schemes: compression. Most bit sequences are not at maximal information density, so a suitable compression algorithm could shorten the overall sentence length required, or the number of sentences.

  • Run length compression
  • Huffman encoding
  • Lempel-Ziv encoding

are a few that might be applicable.
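As an illustration, zlib's DEFLATE format already combines two of the techniques listed (Lempel-Ziv matching followed by Huffman coding), so compressing the payload before word-encoding it is a one-liner:

```python
# Compress the payload first; fewer bits means fewer words to memorise.
import zlib

payload = b"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"   # low information density
packed = zlib.compress(payload)
print(len(payload), "->", len(packed), "bytes")
```

Note that already-random data, like a hash or key, won't compress; this only helps when the input has redundancy.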


But there is one more encoding scheme: arithmetic encoding.

Instead of chopping the message up into fields, consider the message in its entirety. Treat it as a fraction: (the message's bits read as an integer) / 2^(total bits in message).

As above, assemble a list of words, but this time don't align to a power of 2; 171,324 words is fine.

Imagine a number line from 0 to 1, with each word spread along that line an equal distance apart (but with no word on 1). Find the interval to which our message fraction belongs, and select the word representing that interval.

eg:

  • Cat (0 - 1/5)
  • Fire (1/5 - 2/5)
  • Tree (2/5 - 3/5)
  • House (3/5 - 4/5)
  • Plural (4/5 - 1)

1001: is 9/16 which fits in the 'Tree' range. 1010: is 10/16 which fits in the 'House' range.

Now we zoom in on that interval and again lay out our list. But first you may want to substitute another list (changing nouns for verbs), or remove some words from the list to avoid repetition. Again, arrange the words along this interval and select the next word.

  • Cat (2/5 - 11/25)
  • Fire (11/25 - 12/25)
  • Tree (12/25 - 13/25)
  • House (13/25 - 14/25)
  • Plural (14/25 - 3/5)

1001: is 9/16 which fits in the 'Tree Plural' range.

  • Cat (3/5 - 16/25)
  • Fire (16/25 - 17/25)
  • Tree (17/25 - 18/25)
  • House (18/25 - 19/25)
  • Plural (19/25 - 4/5)

1010: is 10/16 which fits in the 'House Cat' range.

Keep repeating until the interval is narrow enough to pin down the required bits. You may need to encode the message length as part of the message so that you return the exact number of bits fed in; but if you know, or can otherwise determine, the number of bits needed, that's not necessary.

'Tree Plural' is the interval from 14/25 to 15/25. The bit sequence this represents is found by reversing the process. Is the interval in the 0-1/2 range or the 1/2-1 range? It's in the second, so the first bit is 1. Is it in the 1/2-3/4 range or the 3/4-1 range? The first, so the sequence so far is 10. Continue halving: the interval sits inside 1/2-5/8, giving 100, but it straddles the next split at 9/16, which is where the message length comes in. We need 4 bits, and the only 4-bit fraction inside 14/25-15/25 is 9/16, so the bit sequence is 1001, which is exactly what we wanted to encode.

Obviously the fixed-width encoding scheme gave equal results for this toy example, but mileage should improve with more words.

Licensed under: CC-BY-SA with attribution