Question

.NET uses UTF-16 to represent strings, which usually takes 2 bytes per character.

Many debugging tricks (including some of my own answers) use the output of !do to get the address of the first character and then add the string length*2 to get the end address of the string.

Some examples where this can be useful:

However, UTF-16 also has 4-byte characters (U+10000 to U+10FFFF, encoded as surrogate pairs), which might screw up everything:

  • string length is counted in characters, and a 4-byte character is probably counted as only 1 character, so any length*2 calculation would be incorrect
  • du might stop early at a character whose bytes end in 00 00

So, how safe is it to use such scripts when debugging .NET applications in WinDbg?


Solution

Short version: yes, it is safe to calculate string ranges in WinDbg using String.Length, and it is safe to use du to dump them.

UTF-16 4-byte characters ending in 00 00

The Unicode specification defines that the first 6 bits of the first 16-bit code unit are 110110 (the high surrogate, D800 to DBFF) and the first 6 bits of the second code unit are 110111 (the low surrogate, DC00 to DFFF). This means that the first nibble (4 bits) of each code unit is always D, so a 4-byte UTF-16 character always looks like this: D? ?? D? ?? (with the bytes of each code unit swapped in little-endian memory) and neither code unit can ever be 00 00.

Since du stops only when it reads a 00 00 code unit, which cannot occur inside a surrogate pair, it is safe to use du commands on UTF-16 strings.
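As a sanity check, here is a minimal C# sketch (U+10437 is just an arbitrary example character from the supplementary range) that prints the UTF-16LE bytes of a surrogate pair; the high byte of each code unit falls in the D8-DF range, so no code unit is 00 00:

    using System;
    using System.Text;

    class SurrogateBytes
    {
        static void Main()
        {
            // U+10437 requires a surrogate pair in UTF-16.
            string s = "\U00010437";

            // Encoding.Unicode is UTF-16LE, the in-memory layout of .NET strings.
            byte[] bytes = Encoding.Unicode.GetBytes(s);
            Console.WriteLine(BitConverter.ToString(bytes)); // 01-D8-37-DC

            // The two code units: high surrogate D801, low surrogate DC37.
            Console.WriteLine($"{(int)s[0]:X4} {(int)s[1]:X4}"); // D801 DC37
        }
    }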

Using string.Length for calculating the range

Before answering my own question, I wanted to test the behavior in C#, so I asked a separate question about how to create 4-byte characters in C#.

Unexpectedly, this already pointed me to the answer: string.Length is the length in UTF-16 code units (2 bytes each), not in Unicode characters. A surrogate pair counts as 2 code units, so length*2 still yields the correct byte size of the string data. To get the length in Unicode characters, use the System.Globalization.StringInfo class instead.
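A minimal sketch of the difference (again using U+10437 as the example character):

    using System;
    using System.Globalization;

    class LengthDemo
    {
        static void Main()
        {
            // "a" plus one 4-byte character: 2 text elements, 3 UTF-16 code units.
            string s = "a\U00010437";

            Console.WriteLine(s.Length);                               // 3 (code units)
            Console.WriteLine(new StringInfo(s).LengthInTextElements); // 2 (characters)

            // Byte count of the character data, as used by the WinDbg scripts:
            Console.WriteLine(s.Length * 2);                           // 6
        }
    }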

Licensed under: CC-BY-SA with attribution