Make custom string encoder .net

https://stackoverflow.com/questions/2303830

21-09-2019
Question

I know .NET supports base64 encoding of byte arrays, but I thought I could save even more space by using a larger set of characters. I read somewhere that Unicode supports thousands of different characters, so why not use base1024 encoding, for example? If this is possible, can you give some guidelines on how to implement it? Thanks.

Solution

Whether you use a 2-byte Unicode encoding (UCS-2) or a multi-byte one (UTF-8), base1024 would gain little or nothing over base64 and may well be more wasteful of space. Base64 uses 6 bits out of each 8-bit byte, so raw binary data converted to base64 becomes 4/3 of its original size (about 1.333x growth).

Base1024 using UCS-2 (16-bit) Unicode characters would use only 10 of every 16 bits, so raw binary data converted to base1024 would grow to 16/10 = 8/5, or 1.6 times, its original size. This is worse than base64.
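The two ratios above can be checked with a few lines of arithmetic. This is only a sketch in Python; the helper name `expansion` is made up for illustration, and base64 padding is ignored.

```python
from fractions import Fraction

def expansion(payload_bits: int, stored_bits: int) -> Fraction:
    """Encoded size relative to raw size, for one symbol of the alphabet."""
    return Fraction(stored_bits, payload_bits)

# base64 over ASCII: 6 payload bits carried in each 8-bit character
base64_ascii = expansion(6, 8)
# base1024 over UCS-2: 10 payload bits carried in each 16-bit character
base1024_ucs2 = expansion(10, 16)

print("base64   :", base64_ascii, float(base64_ascii))
print("base1024 :", base1024_ucs2, float(base1024_ucs2))
```

The first ratio comes out as 4/3 and the second as 8/5, matching the figures quoted above.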

If you used UTF-8 instead, and were careful to use only Unicode characters that have 1- or 2-byte encodings, you would get at most 1920 additional unique code points from the 2-byte sequences, which works out to a slight improvement in data density over UCS-2. (UTF-8 uses only 6 bits of each additional 8-bit byte to carry code point data; the other 2 bits mark the byte as a continuation byte.)
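A rough way to check the UTF-8 numbers, sketched in Python. The alphabet here is hypothetical and deliberately optimistic: it assumes every one of the 128 single-byte code points (control characters included) is usable, fills the rest of the 1024 symbols with two-byte code points, and assumes all symbols occur equally often.

```python
from fractions import Fraction

# Count code points by the length of their UTF-8 encoding.
# Everything below U+0800 encodes in 1 or 2 bytes.
one_byte = sum(1 for cp in range(0x800) if len(chr(cp).encode("utf-8")) == 1)
two_byte = sum(1 for cp in range(0x800) if len(chr(cp).encode("utf-8")) == 2)
print(one_byte, two_byte)

# Hypothetical base1024 alphabet: all 1-byte code points, the rest 2-byte.
alphabet_size = 1024
two_byte_used = alphabet_size - one_byte
avg_bytes_per_symbol = Fraction(one_byte * 1 + two_byte_used * 2, alphabet_size)

# Each symbol carries 10 bits of payload.
expansion_utf8 = avg_bytes_per_symbol * 8 / 10
print(avg_bytes_per_symbol, expansion_utf8)
```

Under these assumptions the output grows to 3/2 of the raw size: better than the 8/5 of UCS-2, but still larger than the 4/3 of base64.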

So this is not going to help. You should instead look into compressing your data before converting it to base64.
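A minimal sketch of "compress, then base64", written in Python with the standard `zlib` and `base64` modules. The helper names `pack` and `unpack` and the sample data are made up for illustration; how much is saved depends entirely on how compressible the data is.

```python
import base64
import zlib

def pack(data: bytes) -> str:
    """Compress the bytes, then base64-encode the compressed result."""
    return base64.b64encode(zlib.compress(data, 9)).decode("ascii")

def unpack(text: str) -> bytes:
    """Reverse of pack: base64-decode, then decompress."""
    return zlib.decompress(base64.b64decode(text))

# Highly repetitive sample data, chosen so that compression clearly helps.
sample = b"the quick brown fox " * 200

plain = base64.b64encode(sample).decode("ascii")
packed = pack(sample)

print(len(sample), len(plain), len(packed))
```

In .NET the same idea would presumably combine a stream from `System.IO.Compression` (such as `GZipStream`) with `Convert.ToBase64String`. For data that is already compressed or random, the compression step can make the result slightly larger instead.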

OTHER TIPS

Base64 exists for a purpose: to store or transfer binary data in a format that carries 6 bits per character, in order to get around restrictions imposed by some protocols. If you don't have such a restriction, base64 is not for you. It was never designed for saving space. If you need to save space and are free to use anything, simply store the array as binary data.

The point of base64 is to avoid encoding issues. Practically all machines still running agree on the ASCII character set, although there are probably still a few EBCDIC machines out there consuming kilowatts. ASCII encodes only about 95 printable, unambiguous characters. Base64 uses 64 of those, plus a padding character. Base128 is already too much.

There's nothing unambiguous about Unicode: common encodings in use are UTF-7, UTF-8, UTF-16, UTF-32 and UCS-2, some of which come in little-endian and big-endian varieties. Base1024 would require 1024 unambiguous characters, far too many for everybody to ever agree on. Note that it can't just be a contiguous encoded range: the Unicode charts have lots of holes in them, and they are irregularly distributed.
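To illustrate the ambiguity, here is a small Python sketch that encodes one arbitrarily chosen non-ASCII character under several encodings. The same character turns into a different byte sequence each time, so sender and receiver would have to agree on the encoding as well as on the alphabet.

```python
text = "\u0416"  # CYRILLIC CAPITAL LETTER ZHE, picked only as an example

# Hex dump of the same character under different Unicode encodings.
encoded = {
    name: text.encode(name).hex()
    for name in ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be")
}

for name, hex_bytes in encoded.items():
    print(f"{name:10s} {hex_bytes}")
```

A plain ASCII character such as "A", by contrast, is the single byte 0x41 in UTF-8 and in ASCII itself, which is part of why base64 sticks to that range.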

As the others already mentioned, base64 doesn't save any space. It actually increases the number of characters needed to hold the same information (see Wikipedia: every three bytes need four characters to represent them).
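The three-bytes-to-four-characters rule is easy to observe directly. A short Python sketch using the standard `base64` module (with its default padding):

```python
import base64

# Encoded length for inputs of 1 to 6 zero bytes.
lengths = {n: len(base64.b64encode(bytes(n))) for n in range(1, 7)}

for n, size in lengths.items():
    print(f"{n} raw byte(s) -> {size} base64 characters")
```

Inputs of 1 to 3 bytes all produce 4 characters and inputs of 4 to 6 bytes produce 8, because the output is padded up to a whole group of four.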

If you really need to save space and want to compress a byte array, take a look at the LZMA algorithm. If you need an implementation of this algorithm in C, C++, C# or Java, take a look at the 7-Zip page.
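As a quick way to get a feel for LZMA before picking an implementation, here is a sketch using Python's standard `lzma` module. This is a stand-in for illustration only, not the 7-Zip SDK mentioned above, and the sample data is invented and deliberately repetitive.

```python
import lzma

# Invented sample: a 256-byte pattern repeated 64 times (16 KiB).
data = bytes(range(256)) * 64

compressed = lzma.compress(data)
restored = lzma.decompress(compressed)

print(len(data), len(compressed), restored == data)
```

Real-world savings depend on the data; very small or already-compressed arrays may not shrink at all once the container overhead is counted.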

Licensed under: CC-BY-SA with attribution
