Combine sequences of numbers with “variable bitlengths” into short unique strings

https://softwareengineering.stackexchange.com/questions/321461

17-12-2020
|

Question

It is not unlikely that what I want to do is not possible, but it doesn't hurt to ask.

Imagine a set of lists, each containing positive integers (in my case, a list always consists of four integers, but that shouldn't make any difference).

a = [35, 2, 123684, 647]
b = [453, 346457546457, 6, 0]
c = ...

Then there is an alphabet, e.g.

alphabet = [A, .., Z, a, .., z, 0, .., 9]

What I want to do is to create a function that transforms the lists into strings using the alphabet.

f: (list[Int], alphabet) -> string

The requirements for f are as follows:

f should be injective in the sense that two different lists always result in two different strings and one specific list always results in the same string each time f is called.

Notes:
- Two lists are equal if both contain the same elements in the same order.
- It is ok if transformations of different lists using different alphabets result in the same string. The unique-requirement only applies to the transformation of different lists using the same alphabet.
- An inverse function is not required.
Now the hard part: the resulting strings must be as short as possible.

All of the numbers are 32 bit integers. But the fact that they vary greatly in size (the possible range is from 0 to Int.max) should be taken into account. Just chaining the 32 bit representations together (or doing something else that uses chunks of a fixed size) is not a viable solution.

One approach could be to choose one character of the alphabet and use it as a separator. This is basically what hashids does. E.g. if 'A' is the separator, all resulting strings would look like this: "...A...A...A...".

What I don't like about this solution:

The effective size of the alphabet is reduced by one, since one of the characters can only be used as a separator, not for encoding numbers. This results in longer strings, especially when using small alphabets
The separatoritself also extends the string. Encoding a list of four integers means three additional characters in the result.

I'm wondering if there is a less obvious solution to the problem, maybe a more mathematical approach? Essentially, the problem is to "merge" multiple numbers into a single (unique) number.

La solution

I think Doc Brown hit upon the real problem: it is a compression problem. As such, first streaming the numbers to a binary representation (compressed bits, a byte array) and then encoding using the letters is probably the best solution. When you stream to a binary representation, include the length of any variable sized collection as the header before the body. The length can be 7bit encoded, or use a a fixed sized integer, as appropriate. If all of your lists are the same size, you don't need a length for them.

How I would stream this myself:

the number of lists (length as described above)
each list (fixed size of 4 integers)
each item in the list 7 bit encoded -- this is a variable sized encoding
no separator is needed

After streaming to a memory stream, I would then take the resulting bytes, and use LZ4 or your favorite compression algorithm. Then encode that using your character set.

Note: at the time I wrote the answer above, I thought the arrays were going to be combined before compression. It turns out each must be compressed separately. This is a more difficult topic and requires more information about the source and usage of the arrays. You can find some information here: http://encode.ru/threads/343-Compressing-small-bits-of-data

Licencié sous: CC-BY-SA avec attribution

Non affilié à softwareengineering.stackexchange