How to tell ASCIIEncoding class not to decode the byte order mark

https://stackoverflow.com/questions/5098757

06-12-2019

Question

When decoding a byte array into a string using the .NET ASCIIEncoding class, do I need to write some code to detect and remove the byte order mark, or is it possible to tell ASCIIEncoding not to decode the byte order mark into the string?

Here's my problem. When I do this:

string someString = System.Text.ASCIIEncoding.Default.GetString(someByteArray)

someString will look like this:

ï»¿<?xml version="1.0"?>.......

Then when I call this:

XElement.Parse(someString)

an exception is thrown because of the first three bytes: EF BB BF - the UTF-8 byte order mark. So I thought that if I specified UTF8 encoding, rather than Default, like this:

System.Text.ASCIIEncoding.UTF8.GetString(someByteArray)

ASCIIEncoding would not attempt to decode the byte order mark into the string. But when I copy the returned string into Notepad++, I can see a ? character in front of the XML tag. So now the byte order mark is being decoded into a single garbage character. What is the best way to stop the byte order mark being decoded in this case?

Solution

Please don't use

ASCIIEncoding.UTF8

That's really just

Encoding.UTF8

It's not using ASCIIEncoding at all. It just looks like it in your source code.

Fundamentally, the problem is that your file is UTF-8, it's not ASCII. That's why it's got a UTF-8 byte order mark. I strongly suggest that you use Encoding.UTF8 to read the UTF-8 file, one way or the other.
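To make the difference concrete, here is a minimal sketch of what happens to the three BOM bytes under each decoding. The class and method names are made up for illustration, and ISO-8859-1 is only a stand-in for whatever single-byte code page Encoding.Default resolves to on the asker's machine (an assumption; Windows-1252 maps these three bytes to the same characters).

```csharp
using System;
using System.Text;

static class WrongEncodingDemo
{
    // UTF-8 BOM (EF BB BF) followed by the ASCII bytes for "<a/>".
    public static readonly byte[] Sample = { 0xEF, 0xBB, 0xBF, 0x3C, 0x61, 0x2F, 0x3E };

    // Single-byte decoding: each BOM byte becomes its own character.
    public static string DecodeAsLatin1(byte[] bytes)
    {
        return Encoding.GetEncoding("iso-8859-1").GetString(bytes);
    }

    // UTF-8 decoding: the three BOM bytes collapse into the single character U+FEFF.
    public static string DecodeAsUtf8(byte[] bytes)
    {
        return Encoding.UTF8.GetString(bytes);
    }

    static void Main()
    {
        string latin1 = DecodeAsLatin1(Sample);
        string utf8 = DecodeAsUtf8(Sample);

        // Three visible junk characters plus the four characters of "<a/>".
        Console.WriteLine("Latin-1 length: {0}", latin1.Length);
        // One invisible U+FEFF plus the four characters of "<a/>".
        Console.WriteLine("UTF-8 length: {0}, first char U+{1:X4}", utf8.Length, (int)utf8[0]);
    }
}
```

Either way the BOM ends up in the string; the UTF-8 decoding just represents it correctly as one character instead of three.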

If you read the file with File.ReadAllText, I suspect it'll remove the BOM automatically. Or you could just trim it afterwards, before calling XElement.Parse. Using the wrong encoding (either ASCII or Encoding.Default) is not the right approach.

Likewise it's not a garbage character. It's a perfectly useful character, giving a very strong indication that it really is a UTF-8 file - it's just you don't want it in this particular context. "Garbage" gives the impression that it's corrupt data which shouldn't be present in the file, and that's definitely not the case.

Another approach would be to avoid converting it into text at all. For example:

XElement element;
using (XmlReader reader = XmlReader.Create(new MemoryStream(bytes)))
{
    element = XElement.Load(reader);
}

That way the encoding will be auto-detected.
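For anyone who wants to try this in isolation, a self-contained version might look like the sketch below. The class name, the LoadFromBytes helper and the sample document are all invented for the demo; only XmlReader.Create and XElement.Load come from the answer above.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Linq;

static class LoadFromBytesDemo
{
    // Parses XML straight from raw bytes, leaving encoding detection
    // (BOM and/or XML declaration) to XmlReader.
    public static XElement LoadFromBytes(byte[] bytes)
    {
        using (var stream = new MemoryStream(bytes))
        using (XmlReader reader = XmlReader.Create(stream))
        {
            return XElement.Load(reader);
        }
    }

    static void Main()
    {
        // Build a sample payload: UTF-8 BOM followed by a small document.
        byte[] bom = Encoding.UTF8.GetPreamble();
        byte[] body = Encoding.UTF8.GetBytes("<?xml version=\"1.0\"?><root><item>1</item></root>");
        byte[] bytes = new byte[bom.Length + body.Length];
        bom.CopyTo(bytes, 0);
        body.CopyTo(bytes, bom.Length);

        XElement root = LoadFromBytes(bytes);
        Console.WriteLine(root.Name);
    }
}
```

The string never exists as an intermediate step here, so there is no BOM character to strip.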

OTHER TIPS

System.Text.Encoding.GetString() preserves the BOM if it is present and converts it to the UTF-16 BOM (U+FEFF). Consider this a feature. Strictly speaking, it's the proper thing to do as tossing the BOM would make the conversion lossy and not round-trippable. Bit surprising, though, that they didn't provide a flag to let you specify the desired behaviour, but there you are. So...you've got two options:

1. Convert to a string, look for the BOM and remove it prior to invoking XElement.Parse() on the string. Or...
2. Wrap the byte[] in a MemoryStream, the MemoryStream in a StreamReader and use XElement.Load() to do the parse.

Your choice. Here's some sample code that will work:

using System.IO;
using System.Text;
using System.Xml.Linq;

namespace TestDrive
{
    class Program
    {
        public static void Main()
        {
            byte[] octets = File.ReadAllBytes( "utf8-encoded-document-with-BOM.xml" ) ;

            // -----------------------------------------------
            // option 1: use a memory stream and stream reader
            // -----------------------------------------------
            using ( MemoryStream ms = new MemoryStream( octets) )
            using ( StreamReader sr = new StreamReader( ms , Encoding.UTF8 , true )   )
            {
                XElement element1 = XElement.Load( sr ) ;
            }

            // --------------------------------------------------------------------
            // option 2: convert to string, then look for and remove BOM if present
            // 
            // The .Net framework Encoding.GetString() methods preserve the BOM if
            // it is present. Since the internal format of .Net string is UTF-16,
            // the BOM is converted to the UTF-16 encoding (U+FEFF).
            // 
            // Consider this a feature.
            // --------------------------------------------------------------------
            // convert to UTF-16 string
            string       xml       = Encoding.UTF8.GetString( octets ) ;
            // Two different ways of getting the BOM
            //string UTF16_BOM = Encoding.Unicode.GetString( Encoding.Unicode.GetPreamble() ) ;
            const string UTF16_BOM = "\uFEFF" ;
            // parse the element, removing the BOM if we see it.
            // Ordinal comparison matters: a culture-sensitive StartsWith() ignores U+FEFF and so matches every string.
            XElement element2 = XElement.Parse( xml.StartsWith( UTF16_BOM , System.StringComparison.Ordinal ) ? xml.Substring(1) : xml ) ;

            return ;
        }
    }
}
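A third variant, not part of the sample above, is to check for the BOM at the byte level and skip it before decoding, so it never reaches the string. This is only a sketch: Utf8BomHelper and DecodeUtf8WithoutBom are hypothetical names, and it assumes the input really is UTF-8.

```csharp
using System;
using System.Text;

static class Utf8BomHelper
{
    // Hypothetical helper: decodes UTF-8 bytes, skipping a leading BOM if one is present.
    public static string DecodeUtf8WithoutBom(byte[] bytes)
    {
        byte[] bom = Encoding.UTF8.GetPreamble(); // EF BB BF

        // The BOM is present only if the input is long enough and every byte matches.
        bool hasBom = bytes.Length >= bom.Length;
        for (int i = 0; hasBom && i < bom.Length; i++)
        {
            hasBom = bytes[i] == bom[i];
        }

        int offset = hasBom ? bom.Length : 0;
        return Encoding.UTF8.GetString(bytes, offset, bytes.Length - offset);
    }

    static void Main()
    {
        byte[] withBom = { 0xEF, 0xBB, 0xBF, 0x57, 0x44 };
        byte[] withoutBom = { 0x57, 0x44 };

        Console.WriteLine(DecodeUtf8WithoutBom(withBom).Length);    // 2
        Console.WriteLine(DecodeUtf8WithoutBom(withoutBom).Length); // 2
    }
}
```

Comparing bytes sidesteps any question of how string comparison treats U+FEFF, at the cost of hard-wiring the helper to one encoding.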

This isn't an answer, but code in comments is horrible, and it felt a bit rude to put this in your question. Are you really trying to do this:

Byte[] bytes = new byte [] { 0xEF,0xBB,0xBF, 0x57, 0x44 };
String txt = Encoding.UTF8.GetString(bytes);
Console.WriteLine("String length {0}", txt.Length);
Console.WriteLine("String '{0}'", txt);
Console.WriteLine("Chars '{0}'", String.Join(",", txt.Select(chr => ((int)chr).ToString("x2"))));

And wondering why you get:

String length 3
String 'WD'
Chars 'feff,57,44'

I certainly am...

Licensed under: CC-BY-SA with attribution