Question

I'm working on a project that requires a TCP connection between a client and server. The current protocol encodes the data into hex and then sends it. However, hex encoding increases the length of the payload, which isn't optimal from a networking point of view.

Hex follows a pretty predictable pattern with repeating characters, etc. I was looking for a fast and preferably simple compression algorithm that works very well with hex-encoded strings. I've looked around for a while but I haven't been able to find a decent solution. Any thoughts?


Solution

The current protocol encodes the data into hex

What protocol? TCP certainly doesn't do that.

I believe you're actually referring to binary-to-text encoding. This is a method of converting binary data to textual data so that it can be sent over systems that only allow text.

Hex itself is just a way to present a number for viewing, not really meant for storing or transmitting. Everything stored in your computer or transmitted to your computer can be presented as HEX (regardless of encoding).

That's what a HEX editor does. It looks at the number and shows it in HEX. It doesn't decode. It doesn't even understand how the file is meant to be seen. It sees the file as a number and shows you the number in a way that makes it easy to see the byte boundaries. It could just as easily show it as 1's and 0's. It would just take up more screen space.

The two ideas collide when you take the HEX presentation AS an encoding. This can loop forever. For example, a binary 1 is 1 in hex. But to show that 1 in ascii you use the ascii code for 1, which is a completely different number (49 in decimal, 31 in hex). So now you're storing 31 when you mean 1. But if you want to show that 31, you're storing 33 and 31. And so on and so on.
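A quick sketch of that spiral (my own illustration, plain Java): hex-encode a byte, take the ascii bytes of the result, hex-encode again, and watch the size double each round:

    byte[] data = {1}; // the number 1
    for (int round = 0; round < 3; round++) {
        StringBuilder hex = new StringBuilder();
        for (byte b : data) {
            hex.append(String.format("%02X", b)); // two hex digits per byte
        }
        System.out.println(hex); // prints 01, then 3031, then 33303331
        // treat the hex text's ascii bytes as the next round's data
        data = hex.toString().getBytes(java.nio.charset.StandardCharsets.US_ASCII);
    }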

Some means of transmitting data only allow text. TCP allows full binary, so whatever protocol you're talking about, it isn't TCP. This could mean, for example, that the extended ascii characters beyond 127 can't be used.

A way to work around that limitation is to encode the binary data in a way that avoids the extended characters. Encoding in HEX means you only use 16 symbols. When you have 128 symbols available to you, using only 16 isn't very efficient.
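To put numbers on it: a hex digit carries log2(16) = 4 bits, so every 8-bit byte costs two characters, a 100% size overhead. A 64-symbol alphabet carries 6 bits per character, so 3 bytes fit into 4 characters, roughly 33% overhead.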

Please understand that text IS binary. Text is simply a certain set of encodings (ascii, ebcdic, UTF-8 from unicode, and many others). Binary is all that and anything else. The only reason a text editor knows to display a text file as a text file is because it assumes it's a text file and tries to decode it. Open an executable file in notepad or vi sometime. You'll see some interesting nonsense on the screen.

Hex follows a pretty predictable pattern with repeating characters, etc. I was looking for a fast and preferably simple compression algorithm that works very well with hex-encoded strings. I've looked around for a while but I haven't been able to find a decent solution. Any thoughts?

The ideal solution for this would be to stop encoding in hex and transmit in binary. TCP can do that just fine. If you're stuck going over some text-only protocol, there are certainly better encodings than HEX, provided the protocol allows more than hex's 16 symbols.
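For instance, a bare java.net.Socket moves arbitrary bytes without any encoding step. A minimal sketch (the host, port, and the length-prefix framing here are my own placeholders):

    import java.io.DataOutputStream;
    import java.net.Socket;

    // TCP carries raw bytes as-is; no text encoding required.
    try (Socket sock = new Socket("example.com", 9000);
         DataOutputStream out = new DataOutputStream(sock.getOutputStream())) {
        byte[] payload = {0x00, (byte) 0xFF, 0x7E, 0x21}; // any byte values at all
        out.writeInt(payload.length); // a length prefix instead of a terminator
        out.write(payload);
    }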

A typical example of a text-only transport protocol is email (certainly not TCP). I can't type a binary file into the body of an email, but I can encode one in base64 and paste that into the body. The only reason that's better than hex encoding is that it uses more of the symbols available to me. Heck, I could type nothing but 1's and 0's, but that would be even less efficient. Ideally you want to use as many symbols as you have available to you.
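You can see the size difference directly with the stock java.util.Base64 encoder (a toy comparison of my own; the 300-byte payload is arbitrary):

    byte[] payload = new byte[300];
    new java.util.Random(42).nextBytes(payload); // 300 arbitrary bytes

    StringBuilder hex = new StringBuilder();
    for (byte b : payload) {
        hex.append(String.format("%02x", b)); // two characters per byte
    }
    String b64 = java.util.Base64.getEncoder().encodeToString(payload);

    System.out.println(hex.length()); // 600 characters: 100% overhead
    System.out.println(b64.length()); // 400 characters: roughly 33% overhead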

From your comment:

By protocol I mean the high level communication protocol that was implemented, not TCP. The reason for encoding is the message based system that was implemented basically just uses a terminating character. I could change that and then just send it as raw bytes maybe. – Eclipse

In that case, use an escape sequence to ensure the terminating character never appears in the payload. Say your terminating character is !. Every time there is an actual ! in your data, replace it with ~@. Replace an actual ~ with ~~. An actual @ needs no escaping and stays as @, because @ only has special meaning when it follows a ~.

    String rawData = "abc~@!xyz";
    String encodedData = rawData
            .replaceAll("~", "~~")  // escape the escape character first
            .replaceAll("!", "~@")  // then hide the terminating character
            ;
    // Decode in a single left-to-right pass. Chained replaceAll calls can't
    // decode this reliably: even with a negative lookbehind like (?<!~)~@
    // (see http://stackoverflow.com/a/7594029/1493294), an escaped ~ right
    // before an escaped ! (raw "~!", encoded "~~~@") gets misread.
    StringBuilder decoded = new StringBuilder();
    for (int i = 0; i < encodedData.length(); i++) {
        char c = encodedData.charAt(i);
        if (c == '~') {
            // the character after ~ tells us what was escaped
            c = encodedData.charAt(++i) == '@' ? '!' : '~';
        }
        decoded.append(c);
    }
    String decodedData = decoded.toString();
    System.out.println(encodedData);
    assertEquals(rawData, decodedData); // org.junit.Assert.assertEquals

Displays

abc~~@~@xyz
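One caveat on the decoding loop above: it assumes the input is well formed. A malformed message ending in a lone ~ would walk off the end of the string, so real code should validate before decoding.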

OTHER TIPS

Hex follows a pretty predictable pattern with repeating characters, etc.

That's not true, unless the underlying data (represented in hex) has a predictable pattern.

If the data has a predictable pattern then it's compressible. You could (should) compress the data first (using any suitable compression algorithm, not necessarily an algorithm that's hex-specific), and then (if you need hex-encoding) hex-encode the compressed data.
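A sketch of that order of operations, assuming java.util.zip's DEFLATE is an acceptable choice (the method name is mine):

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.util.zip.DeflaterOutputStream;

    // Compress first; hex-encode the compressed bytes only if the channel
    // truly requires text.
    static String compressThenHex(byte[] data) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DeflaterOutputStream deflate = new DeflaterOutputStream(buf)) {
            deflate.write(data);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : buf.toByteArray()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }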

Licensed under: CC-BY-SA with attribution