String equivalence in C# requires encoding match?

https://stackoverflow.com/questions/23409331

13-07-2023

Question

I've been struggling with a problem for a few days and have finally worked out what's going wrong, but I've only been able to find contradictory answers on StackOverflow (et al.), so I would like to ask for an explanation of what's going on.

For example, this link (in common with many other references, for example this one, or these seemingly go-to references on the topic by Jon Skeet here and here) states that "A string in C# is always UTF-16 [Unicode?], there is no way to "convert" it. The encoding is irrelevant as long as you manipulate the string in memory, it only matters if you write the string to a stream (file, memory stream, network stream...)."

The much-simplified test case I've built to demonstrate my issue is below. It's probably not reproducible by copy and paste, as it depends on some of the strings having a different encoding, but believe me, the test passes as written. I'm using VS2012 Update 4.

The oddity is that the following two lines pass.

Assert.IsFalse(copiedFromXmlDoubleQuote == copiedFromXmlEscapedQuote);
Assert.AreNotEqual(copiedFromXmlDoubleQuote, copiedFromXmlEscapedQuote);

The seemingly identical strings fail the equality check, apparently because they are encoded differently (copiedFromXmlDoubleQuote had the \ replaced by " in the editor).

All this suggests that the Visual Studio editor is encoding-aware, and that the strings the code declares are also encoding-aware. My question is: have I done something stupid, or can anyone confirm my findings and, if possible, refer me to something that will help clarify what the story is with string encoding equivalence? As I'm going to be working in an XML world a lot, is it best practice to explicitly convert everything to Unicode at the point of deserialization, and re-encode it as required when serializing out again?

[TestMethod]
public void EscapedCharacterDoesNotEqualLiteralString()
{
  string actual = "\"";
  Assert.AreEqual("\"", actual);
  Assert.AreEqual(@"""", actual);
  string typedEscapedQuote = "<?xml version=\"1.0\" encoding=\"utf-16\"?>";
  string typedDoubleQuote = @"<?xml version=""1.0"" encoding=""utf-16""?>";
  Assert.IsTrue(typedDoubleQuote == typedEscapedQuote);
  Assert.AreEqual(typedDoubleQuote, typedEscapedQuote);
  string copiedFromXmlEscapedQuote = "<?xml version=\"1.0\" encoding=\"utf-16\"?>";
  string copiedFromXmlDoubleQuote = @"<?xml version=""1.0"" encoding=""utf-16""?>";
  Assert.IsFalse(copiedFromXmlDoubleQuote == copiedFromXmlEscapedQuote);
  Assert.AreNotEqual(copiedFromXmlDoubleQuote, copiedFromXmlEscapedQuote);
  Assert.IsTrue(copiedFromXmlDoubleQuote.ToUnicode() == copiedFromXmlEscapedQuote.ToUnicode());
  Assert.AreEqual(copiedFromXmlDoubleQuote.ToUnicode(), copiedFromXmlEscapedQuote.ToUnicode());
}

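// Requires: using System.IO; using System.Text;
// Extension methods must live in a non-generic static class
// (the class name here is arbitrary).
public static class StringEncodingExtensions
{
  // Decodes the bytes with the given encoding via a StreamReader.
  private static string BytesToString(byte[] bytes, Encoding encoding)
  {
    using (MemoryStream ms = new MemoryStream(bytes))
    using (StreamReader sr = new StreamReader(ms, encoding))
    {
      // The using blocks dispose the reader and stream; no explicit Close() is needed.
      return sr.ReadToEnd();
    }
  }

  // Round-trips the string through UTF-16 (little-endian) bytes.
  public static string ToUnicode(this string s)
  {
    return BytesToString(new UnicodeEncoding().GetBytes(s), Encoding.Unicode);
  }
}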

I've uploaded an example VS2012 solution in a zip file here.

Solution

My initial check of your ZIP file shows that:

   static string copiedFromXmlEscapedQuote = "<?xml version=\"1.0\" encoding=\"utf-16\"?>";
   static string copiedFromXmlDoubleQuote = @"<?xml version=""1.0"" encoding=""utf-16""?>";

   ? copiedFromXmlEscapedQuote.Length
   39
   ? copiedFromXmlDoubleQuote.Length
   40

One of the first checks for string equality in the .NET Framework is a length check: it doesn't bother comparing the contents if the strings have different lengths.
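To make the length mismatch concrete, here is a minimal sketch. It assumes (as the later checks suggest) that the pasted literal differs only by one invisible leading character; `FirstDifference` is a hypothetical helper written for this illustration, not a framework method.

```csharp
using System;

public static class LengthCheckDemo
{
    // Hypothetical helper: returns the first index at which the two strings
    // differ, or -1 if they are identical. If one string is a prefix of the
    // other, the length of the shorter string is returned.
    public static int FirstDifference(string a, string b)
    {
        int shared = Math.Min(a.Length, b.Length);
        for (int i = 0; i < shared; i++)
        {
            if (a[i] != b[i]) return i;
        }
        return a.Length == b.Length ? -1 : shared;
    }

    public static void Main()
    {
        string plain = "<?xml version=\"1.0\" encoding=\"utf-16\"?>";
        // Assumption: the pasted literal carries an invisible U+FEFF in front.
        string pasted = "\uFEFF" + plain;

        Console.WriteLine(plain.Length);                   // 39
        Console.WriteLine(pasted.Length);                  // 40
        Console.WriteLine(plain == pasted);                // False
        Console.WriteLine(FirstDifference(plain, pasted)); // 0
    }
}
```

Comparing lengths and then locating the first differing index is usually the quickest way to find an invisible character that the editor does not display.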

Further checking:

   ? copiedFromXmlDoubleQuote.Last()
   62 '>'
   ? copiedFromXmlEscapedQuote.Last()
   62 '>'
   ? copiedFromXmlEscapedQuote.First()
   60 '<'
   ? copiedFromXmlDoubleQuote.First()
   65279 ''

So it's the first char that is different. The value 65279 is U+FEFF, the byte order mark (BOM); it is covered in this article: What is this char? 65279 ''.
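If the stray character cannot be removed at its source, one option is to strip it after the fact. The following is only a sketch: `StripLeadingBom` is a hypothetical helper name, and it removes at most one leading U+FEFF.

```csharp
using System;

public static class BomDemo
{
    public const char ByteOrderMark = '\uFEFF'; // code point 65279

    // Hypothetical helper: removes a single leading U+FEFF if present.
    public static string StripLeadingBom(string s)
    {
        if (!string.IsNullOrEmpty(s) && s[0] == ByteOrderMark)
        {
            return s.Substring(1);
        }
        return s;
    }

    public static void Main()
    {
        string withBom = "\uFEFF<?xml version=\"1.0\" encoding=\"utf-16\"?>";

        Console.WriteLine((int)withBom[0]);                 // 65279
        Console.WriteLine(StripLeadingBom(withBom).Length); // 39
    }
}
```

The sketch compares the first char directly rather than calling `StartsWith` with a string argument, because culture-sensitive comparisons may treat U+FEFF as an ignorable character.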

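It seems you are correct that the VS.NET editor is preserving the BOM char: opening the program file in the binary editor shows that the two literals differ. My first guess was that the use of @ tells the compiler to decode the following bytes using a different encoder, but @ only changes how escape sequences are processed, so the more likely explanation is that the invisible BOM char was copied into the literal along with the XML text.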
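A leading U+FEFF would also explain why the question's ToUnicode round trip makes the two strings compare equal. U+FEFF encodes to the bytes FF FE in little-endian UTF-16, and a StreamReader treats those leading bytes as a byte order mark and skips them. The sketch below mirrors the question's helper under the name `RoundTrip`, which is chosen here purely for illustration.

```csharp
using System;
using System.IO;
using System.Text;

public static class RoundTripDemo
{
    // Mirrors the question's ToUnicode helper: encode to UTF-16LE bytes,
    // then decode them again through a StreamReader.
    public static string RoundTrip(string s)
    {
        byte[] bytes = new UnicodeEncoding().GetBytes(s);
        using (var ms = new MemoryStream(bytes))
        using (var sr = new StreamReader(ms, Encoding.Unicode))
        {
            // The reader consumes a leading FF FE as a byte order mark.
            return sr.ReadToEnd();
        }
    }

    public static void Main()
    {
        string plain = "<?xml version=\"1.0\" encoding=\"utf-16\"?>";
        // Assumption: the pasted literal starts with an invisible U+FEFF.
        string withBom = "\uFEFF" + plain;

        Console.WriteLine(withBom == plain);                       // False
        Console.WriteLine(RoundTrip(withBom) == RoundTrip(plain)); // True
        Console.WriteLine(RoundTrip(withBom).Length);              // 39
    }
}
```

So the in-memory strings are both UTF-16 throughout. The round trip changes the content (it drops one character), not the encoding.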

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow