Question

I would like to put a string into a byte array, but the string may be too big to fit. In the case where it's too large, I would like to put as much of the string as possible into the array. Is there an efficient way to find out how many characters will fit?

Solution

To truncate a string so that its UTF-8 encoding fits in a byte array, without splitting in the middle of a character, I use this:

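// Requires: using System.Text;
// Returns the longest prefix of s whose UTF-8 encoding fits in maxLength bytes.
static string Truncate(string s, int maxLength) {
    if (Encoding.UTF8.GetByteCount(s) <= maxLength)
        return s;
    var cs = s.ToCharArray();
    int length = 0; // bytes used so far
    int i = 0;      // index of the next char to consider
    while (i < cs.Length) {
        // A surrogate pair is two chars but one code point, so keep both halves together.
        int charSize = 1;
        if (i < (cs.Length - 1) && char.IsSurrogatePair(cs[i], cs[i + 1]))
            charSize = 2;
        int byteSize = Encoding.UTF8.GetByteCount(cs, i, charSize);
        if ((byteSize + length) > maxLength)
            break;
        i += charSize;
        length += byteSize;
    }
    return s.Substring(0, i);
}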

The returned string can then be safely transferred to a byte array of length maxLength.
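
For illustration, here is a rough sketch of that last step. The class name TruncateDemo and the helper Fill are made up for this example, and Truncate is a condensed copy of the function above, repeated only so the sample compiles on its own. It assumes UTF-8, as in the answer.

```csharp
using System;
using System.Text;

static class TruncateDemo
{
    // Condensed copy of the Truncate function above (hypothetical placement in this class).
    public static string Truncate(string s, int maxLength)
    {
        char[] cs = s.ToCharArray();
        int length = 0; // bytes used so far
        int i = 0;      // index of the next char to consider
        while (i < cs.Length)
        {
            // Treat a high/low surrogate pair as one unit of two chars.
            int charSize = char.IsSurrogatePair(s, i) ? 2 : 1;
            int byteSize = Encoding.UTF8.GetByteCount(cs, i, charSize);
            if (length + byteSize > maxLength)
                break;
            i += charSize;
            length += byteSize;
        }
        return s.Substring(0, i);
    }

    // Hypothetical helper: encodes as much of s as fits into buffer
    // and returns the number of bytes written.
    public static int Fill(string s, byte[] buffer)
    {
        string fitted = Truncate(s, buffer.Length);
        return Encoding.UTF8.GetBytes(fitted, 0, fitted.Length, buffer, 0);
    }
}
```

Calling TruncateDemo.Fill(text, buffer) writes directly into the existing array, so no second buffer is needed for the copy.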

OTHER TIPS

You should be using the Encoding class to do the conversion to a byte array, correct? Every Encoding object overrides the method GetMaxCharCount, which gives you "The maximum number of characters produced by decoding the specified number of bytes." You should be able to use this value to trim your string before encoding it. Keep in mind that it is a worst-case maximum, so a trimmed string that contains multi-byte characters can still be too long for the array and needs to be checked.
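
A minimal sketch of this idea, assuming the trimmed string is then verified with GetByteCount. The names MaxCharCountDemo and TrimToFit are invented for the example, and the shrinking loop is simple rather than fast.

```csharp
using System;
using System.Text;

static class MaxCharCountDemo
{
    // Hypothetical helper: returns a prefix of s that encodes to at most maxBytes bytes.
    public static string TrimToFit(string s, Encoding encoding, int maxBytes)
    {
        char[] cs = s.ToCharArray();
        // Upper bound on how many chars maxBytes bytes could ever decode to.
        int count = Math.Min(cs.Length, encoding.GetMaxCharCount(maxBytes));
        // The bound is only a maximum, so shrink until the encoded bytes really fit.
        while (count > 0 && encoding.GetByteCount(cs, 0, count) > maxBytes)
            count--;
        // Do not end the prefix on the first half of a surrogate pair.
        if (count > 0 && char.IsHighSurrogate(cs[count - 1]))
            count--;
        return s.Substring(0, count);
    }
}
```

For plain ASCII text the first estimate is usually already small enough; the loop only does real work when multi-byte characters are present.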

An efficient way would be finding out how many bytes you need per character in the worst case with

Encoding.GetMaxByteCount(1);

then dividing the size of your byte array by the result, and converting that many characters with

public virtual int Encoding.GetBytes (
 string s,
 int charIndex,
 int charCount,
 byte[] bytes,
 int byteIndex
)

If you want to waste less of the array, use

Encoding.GetByteCount(string);

but that is a much slower method, because it has to examine every character.
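
Put together, the pessimistic approach might look roughly like this. The class PessimisticFill and its Fill method are hypothetical names, and the surrogate check is an addition that the answer does not mention.

```csharp
using System;
using System.Text;

static class PessimisticFill
{
    // Hypothetical helper: writes a prefix of s that is guaranteed to fit,
    // based on the worst-case byte size of a single char. Returns bytes written.
    public static int Fill(string s, Encoding encoding, byte[] buffer)
    {
        int worstCasePerChar = encoding.GetMaxByteCount(1);
        // Number of chars that fit even if every one needs the worst case.
        int charCount = Math.Min(s.Length, buffer.Length / worstCasePerChar);
        // Added safeguard: do not cut between the halves of a surrogate pair.
        if (charCount > 0 && charCount < s.Length && char.IsHighSurrogate(s[charCount - 1]))
            charCount--;
        return encoding.GetBytes(s, 0, charCount, buffer, 0);
    }
}
```

The trade-off is that part of the array usually stays unused, since most characters need fewer bytes than the worst case.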

The Encoding class in .NET has a method called GetByteCount which can take in a string or char[]. If you pass in 1 character, it will tell you how many bytes are needed for that 1 character in whichever encoding you are using.

The method GetMaxByteCount is faster, but it does a worst case calculation which could return a higher number than is actually needed.
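
To make the difference concrete, here is a small sketch that compares the two calls for UTF-8. The class and method names are made up for the example.

```csharp
using System;
using System.Text;

static class ByteCountDemo
{
    // Exact number of bytes needed; has to look at every char of s.
    public static int Exact(string s)
    {
        return Encoding.UTF8.GetByteCount(s);
    }

    // Worst-case estimate based only on the char count; never too small.
    public static int WorstCase(string s)
    {
        return Encoding.UTF8.GetMaxByteCount(s.Length);
    }
}
```

For example, Exact("a") is 1, Exact("\u00E9") is 2, and Exact("\u20AC") is 3, while WorstCase returns the same value for all three because each string is one char long.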

Cookey, your code doesn't do what you apparently think it does. Pre-allocating the byte buffer in your case is pure waste because it will not be used. Rather, your assignment drops the allocated memory and resets the arr reference to point to another buffer, because Encoding.GetBytes returns a new array.
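
The code being commented on is not shown here, so the following is only a guess at the pattern, with invented names. It contrasts reassigning arr with the GetBytes overload that writes into an existing array.

```csharp
using System;
using System.Text;

static class BufferDemo
{
    // Assumed shape of the criticized pattern: returns true only if arr
    // still refers to the pre-allocated buffer after the assignment.
    public static bool ReassignKeepsBuffer(string s, int size)
    {
        byte[] arr = new byte[size];
        byte[] preallocated = arr;
        arr = Encoding.UTF8.GetBytes(s); // allocates and returns a fresh array
        return object.ReferenceEquals(arr, preallocated);
    }

    // Writes into the caller's array instead; throws if s does not fit.
    // Returns the number of bytes written.
    public static int WriteInto(string s, byte[] arr)
    {
        return Encoding.UTF8.GetBytes(s, 0, s.Length, arr, 0);
    }
}
```

ReassignKeepsBuffer returns false whatever size is passed, which is the waste described above, whereas WriteInto reuses the buffer it is given.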

Licensed under: CC-BY-SA with attribution