Question

.NET uses UTF-16 to represent strings, which usually takes 2 bytes per character.

Many debugging tricks (including some of my own answers) use the output of !do to get the address of the first character and then add the string length*2 to get the end address of the string.

Some examples where this can be useful:

However, UTF-16 also has 4-byte characters (U+10000 to U+10FFFF, encoded as surrogate pairs), which might screw up everything:

  • string length is counted in characters, and a 4-byte character is probably counted as only 1 character, so any length*2 calculation would be incorrect
  • du might stop early at a character whose bytes end in 00 00

So, how safe is it to use such scripts when debugging .NET applications in WinDbg?


Solution

Short version: yes, it is safe to calculate string ranges in WinDbg using String.Length, and it is safe to use du to dump them.

UTF-16 4-byte characters ending in 00 00

The Unicode specification defines that the first 6 bits of the first 16-bit code unit are 110110 (the high surrogate, D800 to DBFF) and the first 6 bits of the second code unit are 110111 (the low surrogate, DC00 to DFFF). This means that the first nibble (4 bits) of each code unit is always D, so a 4-byte UTF-16 character always looks like this: D? ?? D? ?? (with the bytes of each code unit swapped in little-endian memory) and neither code unit can ever be 00 00.

Since du stops only when it reads a 00 00 code unit, which cannot occur inside a surrogate pair, it is safe to use du commands on UTF-16 strings.
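As a sanity check, here is a minimal C# sketch (U+10437 is just an arbitrary example character from the supplementary range) that prints the UTF-16LE bytes of a surrogate pair; the high byte of each code unit falls in the D8-DF range, so no code unit is 00 00:

    using System;
    using System.Text;

    class SurrogateBytes
    {
        static void Main()
        {
            // U+10437 requires a surrogate pair in UTF-16.
            string s = "\U00010437";

            // Encoding.Unicode is UTF-16LE, the in-memory layout of .NET strings.
            byte[] bytes = Encoding.Unicode.GetBytes(s);
            Console.WriteLine(BitConverter.ToString(bytes)); // 01-D8-37-DC

            // The two code units: high surrogate D801, low surrogate DC37.
            Console.WriteLine($"{(int)s[0]:X4} {(int)s[1]:X4}"); // D801 DC37
        }
    }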

Using string.Length for calculating the range

Before answering my own question, I wanted to test the behavior in C#, so I asked a separate question about how to create 4-byte characters in C#.

Unexpectedly, this already pointed me to the answer: string.Length is the length in UTF-16 code units (2 bytes each), not in Unicode characters. A surrogate pair counts as 2 code units, so length*2 still yields the correct byte size of the string data. To get the length in Unicode characters, use the System.Globalization.StringInfo class instead.
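A minimal sketch of the difference (again using U+10437 as the example character):

    using System;
    using System.Globalization;

    class LengthDemo
    {
        static void Main()
        {
            // "a" plus one 4-byte character: 2 text elements, 3 UTF-16 code units.
            string s = "a\U00010437";

            Console.WriteLine(s.Length);                               // 3 (code units)
            Console.WriteLine(new StringInfo(s).LengthInTextElements); // 2 (characters)

            // Byte count of the character data, as used by the WinDbg scripts:
            Console.WriteLine(s.Length * 2);                           // 6
        }
    }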

Licensed under: CC-BY-SA with attribution