Question

I'm writing some pretty string-manipulation-intensive code in C#/.NET, and it got me curious about a couple of Joel Spolsky articles I remembered reading a while back:

http://www.joelonsoftware.com/articles/fog0000000319.html
http://www.joelonsoftware.com/articles/Unicode.html

So, how does .NET do it? Two bytes per char? There ARE some Unicode chars^H^H^H^H^H code points that need more than that. And how is the length encoded?


Solution

Before Jon Skeet turns up, see his excellent blog post on strings in C# and .NET.

In the current implementation at least, strings take up 20 + (n/2)*4 bytes (rounding the value of n/2 down), where n is the number of characters in the string. The string type is unusual in that the size of the object itself varies with its contents.
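
As a rough illustration of that formula, here is a small sketch (mine, not part of the original answer); the constant overhead and per-character cost shown apply to the CLR implementation the figure refers to and differ across runtime versions and between 32-bit and 64-bit processes.

using System;

class StringSizeEstimate
{
    // Rough estimate from the formula above: 20 + (n/2)*4 bytes,
    // where n is the character count and n/2 uses integer division (rounds down).
    // Hypothetical helper for illustration only; actual overhead varies by CLR.
    static int EstimateBytes(string s)
    {
        int n = s.Length;
        return 20 + (n / 2) * 4;
    }

    static void Main()
    {
        foreach (string s in new[] { "", "a", "hello", new string('x', 100) })
        {
            Console.WriteLine("Length {0,3} -> ~{1} bytes", s.Length, EstimateBytes(s));
        }
    }
}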

OTHER TIPS

.NET uses UTF-16. Each char in a string is a 16-bit UTF-16 code unit; code points outside the Basic Multilingual Plane are stored as surrogate pairs (two chars), and String.Length counts code units, not code points.

From System.String on MSDN:

"Each Unicode character in a string is defined by a Unicode scalar value, also called a Unicode code point or the ordinal (numeric) value of the Unicode character. Each code point is encoded using UTF-16 encoding, and the numeric value of each element of the encoding is represented by a Char object."

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow