Question

(In .NET) I have arbitrary binary data stored in a byte[] (an image, for example). Now, I need to store that data in a string (a "Comment" field of a legacy API). Is there a standard technique for packing this binary data into a string? By "packing" I mean that for any reasonably large and random data set, bytes.Length/2 is about the same as packed.Length, because two bytes are more-or-less a single character.

The two "obvious" answers don't meet all the criteria:

string base64 = System.Convert.ToBase64String(bytes);

doesn't make very efficient use of the string since it only uses 64 characters out of roughly 60,000 available (my storage is a System.String). Going with

string utf16 = System.Text.Encoding.Unicode.GetString(bytes);

makes better use of the string, but it won't work for data that contains invalid Unicode characters (say, mismatched surrogate pairs). This MSDN article shows this exact (poor) technique.

Let's look at a simple example:

byte[] bytes = new byte[] { 0x41, 0x00, 0x31, 0x00};
string utf16 = System.Text.Encoding.Unicode.GetString(bytes);
byte[] utf16_bytes = System.Text.Encoding.Unicode.GetBytes(utf16);

In this case bytes and utf16_bytes are the same, because the original bytes were a valid UTF-16 string. Doing this same procedure with base64 encoding gives a 16-member base64_bytes array.
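(Spelled out, the base64 version of that procedure looks like this:)

string base64 = System.Convert.ToBase64String(bytes);                // "QQAxAA==", 8 characters
byte[] base64_bytes = System.Text.Encoding.Unicode.GetBytes(base64); // 16 bytes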

Now, repeat the procedure with invalid UTF-16 data:

byte[] bytes = new byte[] { 0x41, 0x00, 0x00, 0xD8};

You'll find that utf16_bytes do not match the original data: the bytes 0x00, 0xD8 decode to a lone high surrogate (U+D800), which the decoder replaces (typically with U+FFFD), so the round trip is lossy.

I've written code that uses U+FFFD as an escape before invalid Unicode characters; it works, but I'd like to know if there is a more standard technique than something I just cooked up on my own. Not to mention, I don't like catching DecoderFallbackException as the way to detect invalid characters.
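(As an aside, unpaired surrogates can be detected with a plain scan instead of exception handling. Here's a minimal sketch; the IsValidUtf16 helper is an illustrative name, not part of the escaping code described above:)

// Minimal sketch: detect invalid UTF-16 (unpaired surrogates) with a scan
// rather than by catching DecoderFallbackException.
static bool IsValidUtf16(string s)
{
    for (int i = 0; i < s.Length; i++)
    {
        if (char.IsHighSurrogate(s[i]))
        {
            // A high surrogate must be immediately followed by a low surrogate.
            if (i + 1 >= s.Length || !char.IsLowSurrogate(s[i + 1]))
                return false;
            i++; // skip the matched low surrogate
        }
        else if (char.IsLowSurrogate(s[i]))
        {
            return false; // low surrogate without a preceding high surrogate
        }
    }
    return true;
}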

I guess you could call this a "base BMP" or "base UTF-16" encoding (using all the characters in the Unicode Basic Multilingual Plane). Yes, ideally I'd follow Shawn Steele's advice and pass around byte[].


I'm going to go with Peter Housel's suggestion as the "right" answer because he's the only one who came close to suggesting a "standard technique".


Edit: base16k looks even better. Jim Beveridge has an implementation.


Solution

I stumbled onto Base16k after reading your question. Not strictly a standard, but it seems to work well and was easy enough to implement in C#.

OTHER TIPS

May I suggest you use base64 anyway? It may not be the most efficient way to do it storage-wise, but it does have its benefits (see the round-trip sketch after this list):

  1. Your worries about the code are over.
  2. You'll have the least compatibility problems with other players, if there are any.
  3. Should the encoded string ever be considered as ASCII during conversion, export, import, backup, restore, whatever, you won't have any problems either.
  4. Should you ever drop dead or end up under a bus or something, any programmer who ever gets their hands on the comment field will instantly know that it's base64 and not assume it's all encrypted or something.
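For what it's worth, the round trip is just two calls; here's a minimal sketch using the same "invalid UTF-16" bytes from the question:

byte[] bytes = new byte[] { 0x41, 0x00, 0x00, 0xD8 };

string comment = System.Convert.ToBase64String(bytes);          // pure ASCII, safe in any string field
byte[] roundTripped = System.Convert.FromBase64String(comment); // matches bytes exactly, for any data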

Firstly, remember that Unicode doesn't mean 16 bits. The fact that System.String uses UTF-16 internally is neither here nor there. Unicode characters are abstract - they only gain bit representations through encodings.

You say "my storage is a System.String" - if that's the case, you cannot talk about bits and bytes, only Unicode characters. System.String certainly has its own internal encoding, but (in theory) that could be different.

Incidentally, if you believe that the internal representation of System.String is too memory-inefficient for Base64-encoded data, why aren't you also worrying about Latin/Western strings?

If you want to store binary data in a System.String, you need a mapping between collections of bits and characters.

Option A: There's a pre-made one in the shape of Base64 encoding. As you've pointed out, this encodes six bits of data per character.

Option B: If you want to pack more bits per character, then you'll need to make an array (or encoding) of 128, 256, 512, etc. Unicode characters, and pack 7, 8, 9, etc. bits of data per character. Those characters need to be real Unicode characters (see the sketch below).
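To illustrate Option B, here is a minimal sketch of the 8-bits-per-character case, mapping each byte onto one of 256 consecutive BMP characters. The base offset U+4E00 (a run of real CJK ideographs) and the Base256 name are arbitrary illustrative choices, not a standard:

using System.Linq;
using System.Text;

static class Base256
{
    // 256 consecutive, real BMP characters (CJK ideographs) serve as the alphabet.
    const char Base = '\u4E00';

    public static string Encode(byte[] bytes)
    {
        var sb = new StringBuilder(bytes.Length);
        foreach (byte b in bytes)
            sb.Append((char)(Base + b)); // 8 bits per character
        return sb.ToString();
    }

    public static byte[] Decode(string s)
    {
        return s.Select(c => (byte)(c - Base)).ToArray();
    }
}

Widening the alphabet (and doing the bit-shifting to match) is exactly what Base16k, shown later on this page, does to reach roughly 14 bits per character.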

To answer your question simply: yes, there is a standard - it's Base64 encoding.

Is this a real problem? Do you have perf data to support your idea of not using Base64?

You could treat the binary data as UTF-8b. The UTF-8b encoding assumes that the bytes are UTF-8 multibyte sequences, but has a fallback encoding for things that are not.
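For illustration only, here is a rough sketch of that idea (my own, unvetted, and not a reference UTF-8b implementation): valid UTF-8 sequences decode normally, and each invalid byte b is escaped as the lone low surrogate U+DC00+b, which the encoder later turns back into the raw byte.

using System.Collections.Generic;
using System.Text;

static class Utf8b
{
    public static string Decode(byte[] bytes)
    {
        var sb = new StringBuilder(bytes.Length);
        int i = 0;
        while (i < bytes.Length)
        {
            byte b = bytes[i];
            if (b < 0x80) { sb.Append((char)b); i++; continue; }

            // Sequence length implied by the lead byte (0 = invalid lead).
            int len = b < 0xC2 ? 0 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : b <= 0xF4 ? 4 : 0;
            bool ok = len > 0 && i + len <= bytes.Length;
            int cp = ok ? b & (0xFF >> (len + 1)) : 0;
            for (int j = 1; ok && j < len; j++)
            {
                byte c = bytes[i + j];
                if ((c & 0xC0) != 0x80) ok = false;   // not a continuation byte
                else cp = (cp << 6) | (c & 0x3F);
            }
            // Reject overlong forms, UTF-16 surrogate values, and cp > U+10FFFF.
            if (ok && (cp < (len == 2 ? 0x80 : len == 3 ? 0x800 : 0x10000)
                       || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)))
                ok = false;

            if (ok) { sb.Append(char.ConvertFromUtf32(cp)); i += len; }
            else { sb.Append((char)(0xDC00 + b)); i++; }  // escape one raw byte
        }
        return sb.ToString();
    }

    public static byte[] Encode(string s)
    {
        var bytes = new List<byte>(s.Length);
        for (int i = 0; i < s.Length; i++)
        {
            char c = s[i];
            if (c >= '\uDC80' && c <= '\uDCFF')
                bytes.Add((byte)(c & 0xFF));          // restore the escaped raw byte
            else if (char.IsHighSurrogate(c) && i + 1 < s.Length && char.IsLowSurrogate(s[i + 1]))
                bytes.AddRange(Encoding.UTF8.GetBytes(s.Substring(i++, 2)));
            else
                bytes.AddRange(Encoding.UTF8.GetBytes(c.ToString()));
        }
        return bytes.ToArray();
    }
}

Note that the resulting strings deliberately contain lone low surrogates, so this only works if the comment field stores them untouched - essentially the same question raised about the raw UTF-16 approach.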

Here's a C# version of Jim Beveridge's C++ implementation:

using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;
using System.Linq;


//
// Base16k.cpp : Variant of base64 used to efficiently encode binary into Unicode UTF16 strings. Based on work by
// Markus Scherer at https://sites.google.com/site/markusicu/unicode/base16k
//
// This code is hereby placed in the Public Domain.
// Jim Beveridge, November 29, 2011.
//
// C# port of http://qualapps.blogspot.com/2011/11/base64-for-unicode-utf16.html
// This code is hereby placed in the Public Domain.
// J. Daniel Smith, February 23, 2015
//

namespace JDanielSmith
{
    public static partial class Convert
    {
        /// <summary>
        /// Encode a binary array into a Base16k string for Unicode.
        /// </summary>
        public static string ToBase16kString(byte[] inArray)
        {
            int len = inArray.Length;

            var sb = new StringBuilder(len*6/5);
            sb.Append(len);

            int code = 0;

            for (int i=0; i<len; ++i)
            {
                byte byteValue = inArray[i];
                switch (i%7)
                {
                case 0:
                    code = byteValue<<6;
                    break;

                case 1:
                    code |= byteValue>>2;
                    code += 0x5000;
                    sb.Append(System.Convert.ToChar(code));
                    code = (byteValue&3)<<12;
                    break;

                case 2:
                    code |= byteValue<<4;
                    break;

                case 3:
                    code |= byteValue>>4;
                    code+=0x5000;
                    sb.Append(System.Convert.ToChar(code));
                    code = (byteValue&0xf)<<10;
                    break;

                case 4:
                    code |= byteValue<<2;
                    break;

                case 5:
                    code|=byteValue>>6;
                    code+=0x5000;
                    sb.Append(System.Convert.ToChar(code));
                    code=(byteValue&0x3f)<<8;
                    break;

                case 6:
                    code|=byteValue;
                    code+=0x5000;
                    sb.Append(System.Convert.ToChar(code));
                    code=0;
                    break;
                }
            }

            // emit a character for remaining bits
            if (len%7 != 0) {
                code += 0x5000;
                sb.Append(System.Convert.ToChar(code));
            }

            return sb.ToString();
        }

        /// <summary>
        ///  Decode a Base16k string for Unicode into a binary array.
        /// </summary>
        public static byte[] FromBase16kString(string s)
        {
            // read the length
            var r = new Regex(@"^\d+", RegexOptions.None, matchTimeout: TimeSpan.FromMilliseconds(100));
            Match m = r.Match(s);
            if (!m.Success)
                return null;

            int length;
            if (!Int32.TryParse(m.Value, out length))
                return null;

            var buf = new List<byte>(length);

            int pos=0;  // position in s
            while ((pos < s.Length) && (s[pos] >= '0' && s[pos] <= '9'))
                ++pos;

            // decode characters to bytes
            int i = 0;    // byte position modulo 7 (0..6 wrapping around)
            int code=0;
            byte byteValue=0;

            while (length-- > 0)
            {
                if (((1<<i)&0x2b)!=0)
                {
                    // fetch another Han character at i=0, 1, 3, 5
                    if(pos >= s.Length)
                    {
                        // Too few Han characters representing binary data.
                        System.Diagnostics.Debug.Assert(pos < s.Length);
                        return null;
                    }

                    code=s[pos++]-0x5000;
                }

                switch (i%7)
                {
                case 0:
                    byteValue = System.Convert.ToByte(code>>6);
                    buf.Add(byteValue);
                    byteValue = System.Convert.ToByte((code&0x3f)<<2);
                    break;

                case 1:
                    byteValue |= System.Convert.ToByte(code>>12);
                    buf.Add(byteValue);
                    break;

                case 2:
                    byteValue = System.Convert.ToByte((code>>4)&0xff);
                    buf.Add(byteValue);
                    byteValue = System.Convert.ToByte((code&0xf)<<4);
                    break;

                case 3:
                    byteValue |= System.Convert.ToByte(code>>10);
                    buf.Add(byteValue);
                    break;

                case 4:
                    byteValue = System.Convert.ToByte((code>>2)&0xff);
                    buf.Add(byteValue);
                    byteValue = System.Convert.ToByte((code&3)<<6);
                    break;

                case 5:
                    byteValue |= System.Convert.ToByte(code>>8);
                    buf.Add(byteValue);
                    break;

                case 6:
                    byteValue = System.Convert.ToByte(code&0xff);
                    buf.Add(byteValue);
                    break;
                }

                // advance to the next byte position
                if(++i==7)
                    i=0;
            }

            return buf.ToArray();
        }
    }
}

namespace Base16kCS
{
    class Program
    {
        static void Main(string[] args)
        {
            var drand = new Random();

            // Create 500 different binary objects, then encode and decode them.
            // The first 16 objects will have length 0,1,2 ... 15 to test boundary conditions.
            for (int loop = 0; loop < 500; ++loop)
            {
                Console.WriteLine("{0}", loop);

                int dw = drand.Next(128000);
                var org = new List<byte>(dw);
                for (int i = 0; i < dw; ++i)
                    org.Add(Convert.ToByte(drand.Next(256)));

                if (loop < 16)
                    org = org.Take(loop).ToList();

                string wstr = JDanielSmith.Convert.ToBase16kString(org.ToArray());

                byte[] bin = JDanielSmith.Convert.FromBase16kString(wstr);

                System.Diagnostics.Debug.Assert(org.SequenceEqual(bin));
            }
        }
    }
}

I fooled around with direct char arrays, and your one failing case works with my implementation. The code has been tested, but run your own tests before relying on it.

You could speed this up by using unsafe code. But I am sure UnicodeEncoding is just as slow (if not slower).

/// <summary>
/// Represents an encoding that packs bytes tightly into a string.
/// </summary>
public class ByteEncoding : Encoding
{
    /// <summary>
    /// Gets the Byte Encoding instance.
    /// </summary>
    public static readonly Encoding Encoding = new ByteEncoding();

    private ByteEncoding()
    {
    }

    public override int GetBytes(char[] chars, int charIndex, int charCount, byte[] bytes, int byteIndex)
    {
        for (int i = 0; i < charCount; i++)
        {
            // Work out some indices.
            int j = i * 2;
            int k = byteIndex + j;

            // Get the bytes.
            byte[] packedBytes = BitConverter.GetBytes((short) chars[charIndex + i]);

            // Unpack them.
            bytes[k] = packedBytes[0];
            bytes[k + 1] = packedBytes[1];
        }

        return charCount * 2;
    }

    public override int GetChars(byte[] bytes, int byteIndex, int byteCount, char[] chars, int charIndex)
    {
        for (int i = 0; i < byteCount; i += 2)
        {
            // Work out some indices.
            int j = i / 2;
            int k = byteIndex + i;

            // Make sure we don't read too many bytes.
            byte byteB = 0;
            if (i + 1 < byteCount)
            {
                byteB = bytes[k + 1];
            }

            // Add it to the array.
            chars[charIndex + j] = (char) BitConverter.ToInt16(new byte[] { bytes[k], byteB }, 0);
        }

        return (byteCount / 2) + (byteCount % 2); // Round up.
    }

    public override int GetByteCount(char[] chars, int index, int count)
    {
        return count * 2;
    }

    public override int GetCharCount(byte[] bytes, int index, int count)
    {
        return (count / 2) + (count % 2);
    }

    public override int GetMaxByteCount(int charCount)
    {
        return charCount * 2;
    }

    public override int GetMaxCharCount(int byteCount)
    {
        return (byteCount / 2) + (byteCount % 2);
    }
}

Here is some test code:

    static void Main(string[] args)
    {
        byte[] original = new byte[256];

        // Note that we can't tell on the decode side how
        // long the array was if the original length is
        // an odd number. This will result in an
        // inconclusive result.
        for (int i = 0; i < original.Length; i++)
            original[i] = (byte) Math.Abs(i - 1);

        string packed = ByteEncoding.Encoding.GetString(original);
        byte[] unpacked = ByteEncoding.Encoding.GetBytes(packed);

        bool pass = true;

        if (original.Length != unpacked.Length)
        {
            Console.WriteLine("Inconclusive: Lengths differ.");
            pass = false;
        }

        int min = Math.Min(original.Length, unpacked.Length);
        for (int i = 0; i < min; i++)
        {
            if (original[i] != unpacked[i])
            {
                Console.WriteLine("Fail: Invalid at a position {0}.", i);
                pass = false;
            }
        }

        Console.WriteLine(pass ? "All Passed" : "Failure Present");

        Console.ReadLine();
    }

The test works, but you are going to have to test it with your API function.

There is another way to work around this limitation, although I am not sure how well it would work.

Firstly, you will need to figure out what type of string the API call is expecting - and what the structure of this string is. If I take a simple example, let's consider the .Net string:

  • Int32 _length;
  • byte[] _data;
  • byte _terminator = 0;

Add an overload to your API call, thus:

[DllImport("legacy.dll")]
private static extern void MyLegacyFunction(byte[] data);

[DllImport("legacy.dll")]
private static extern void MyLegacyFunction(string comment);

Then when you need to call the byte version you can do the following:

    public static void TheLegacyWisperer(byte[] data)
    {
        byte[] realData = new byte[data.Length + 4 /* _length */ + 1 /* _terminator */ ];
        byte[] lengthBytes = BitConverter.GetBytes(data.Length);
        Array.Copy(lengthBytes, realData, 4);
        Array.Copy(data, 0, realData, 4, data.Length);
        // realData[end] is equal to 0 in any case.
        MyLegacyFunction(realData);
    }
Licensed under: CC-BY-SA with attribution