C++ ifstream UTF8 first characters

https://stackoverflow.com/questions/3329827

29-09-2020

Question

Why does a file saved as UTF-8 (in Notepad++) have these characters at the beginning of the fstream I opened on it in my C++ program?

´╗┐

I have no idea what they are, I just know that they're not there when I save as ASCII. UPDATE: If I save as UTF-8 (without BOM) they're not there.
How can I check the encoding of a file (ASCII or UTF-8, everything else will be rejected ;) ) in C++? Is it exactly these characters?

Thanks!

Solution

When you save a file as UTF-16, each value is two bytes. Different computers use different byte orders. Some put the most significant byte first, some put the least significant byte first. Unicode reserves a special codepoint (U+FEFF) called a byte-order mark (BOM). When a program writes a file in UTF-16, it puts this special codepoint at the beginning of the file. When another program reads a UTF-16 file, it knows there should be a BOM there. By comparing the actual bytes to the expected BOM, it can tell if the reader uses the same byte order as the writer, or if all the bytes have to be swapped.
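The byte-order comparison described above can be sketched as follows. This is only a minimal illustration: the enum and function names are made up for this example, and it assumes you already have the first bytes of the file in memory.

```cpp
#include <cstddef>

// Hypothetical names for illustration; not part of any library.
enum class Utf16Order { BigEndian, LittleEndian, Unknown };

// Inspect the first two bytes of a buffer that is supposed to hold UTF-16.
// U+FEFF written big-endian is FE FF; written little-endian it is FF FE.
Utf16Order detect_utf16_order(const unsigned char* data, std::size_t size) {
    if (size < 2) {
        return Utf16Order::Unknown;  // too short to contain a BOM
    }
    if (data[0] == 0xFE && data[1] == 0xFF) {
        return Utf16Order::BigEndian;
    }
    if (data[0] == 0xFF && data[1] == 0xFE) {
        return Utf16Order::LittleEndian;
    }
    return Utf16Order::Unknown;      // no BOM; the caller has to decide
}
```

If the detected order differs from the byte order of the machine doing the reading, every 16-bit unit has to be byte-swapped before use.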

When you save a UTF-8 file, there's no ambiguity in byte order. But some programs, especially ones written for Windows, still add a BOM, encoded as UTF-8. When you encode the BOM codepoint as UTF-8, you get three bytes, 0xEF 0xBB 0xBF. Those bytes show up mostly as box-drawing characters in the OEM code pages (which are the default for a console window on Windows). In code page 850, for example, they display as ´╗┐.
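To see where the three bytes come from, here is a small sketch of a UTF-8 encoder for a single code point. The function name is hypothetical, and it assumes the input is a valid Unicode scalar value (no range or surrogate checks).

```cpp
#include <string>

// Encode one code point as UTF-8 (hypothetical helper for illustration).
// Assumes cp is a valid scalar value: at most 0x10FFFF and not a surrogate.
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {                        // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {                // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {              // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                                // 4 bytes: 11110xxx then three 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

U+FEFF falls in the three-byte range, so encode_utf8(0xFEFF) yields the bytes EF BB BF.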

The argument in favor of doing this is that it marks the files as truly UTF-8, as opposed to some other native encoding. For example, lots of text files on western Windows are in codepage 1252. Tagging the file with the UTF-8-encoded BOM makes it easier to tell the difference.

The argument against doing this is that lots of programs expect ASCII or UTF-8 regardless, and don't know how to handle the extra three bytes.

If I were writing a program that reads UTF-8, I would check for exactly these three bytes at the beginning. If they're there, skip them.
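A minimal sketch of that check, assuming a seekable stream opened in binary mode. The helper names skip_utf8_bom and read_without_bom are made up for this example, not standard functions.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Consume a UTF-8 BOM (EF BB BF) if the stream starts with one.
// Returns true if a BOM was found and skipped. Otherwise the stream is
// rewound to where it was. Assumes the stream supports tellg/seekg.
bool skip_utf8_bom(std::istream& in) {
    const std::istream::pos_type start = in.tellg();
    char bom[3] = {0, 0, 0};
    in.read(bom, 3);
    if (in.gcount() == 3 &&
        static_cast<unsigned char>(bom[0]) == 0xEF &&
        static_cast<unsigned char>(bom[1]) == 0xBB &&
        static_cast<unsigned char>(bom[2]) == 0xBF) {
        return true;
    }
    in.clear();       // a file shorter than 3 bytes sets eofbit/failbit
    in.seekg(start);  // put back whatever was read
    return false;
}

// Read the rest of the stream into a string, minus any leading BOM.
std::string read_without_bom(std::istream& in) {
    skip_utf8_bom(in);
    std::ostringstream buffer;
    buffer << in.rdbuf();
    return buffer.str();
}
```

With a file you would open it as std::ifstream in("input.txt", std::ios::binary) (the file name is a placeholder) and call skip_utf8_bom(in) once before reading anything else.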

Update: You can convert U+FEFF ZERO WIDTH NO-BREAK SPACE characters into U+2060 WORD JOINER everywhere except at the beginning of a file [Gillam, Richard, Unicode Demystified, Addison-Wesley, 2003, p. 108]. My personal code does this. If, when decoding UTF-8, I see 0xEF 0xBB 0xBF at the beginning of the file, I take it as a happy sign that I indeed have UTF-8. If the file doesn't begin with those bytes, I just proceed decoding normally. If, while decoding later in the file, I encounter a U+FEFF, I emit U+2060 and proceed. This means U+FEFF is used only as a BOM and never in its deprecated meaning.
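That substitution could look roughly like this once the text has been decoded to code points. The function and constant names are hypothetical, and dropping the leading U+FEFF (rather than keeping it) is an assumption of this sketch.

```cpp
#include <cstddef>
#include <string>

constexpr char32_t kBom = 0xFEFF;         // ZERO WIDTH NO-BREAK SPACE / BOM
constexpr char32_t kWordJoiner = 0x2060;  // WORD JOINER

// `text` holds already-decoded code points (hypothetical helper).
// A leading U+FEFF is treated as a BOM and dropped (an assumption here);
// any later U+FEFF is replaced with U+2060.
std::u32string normalize_bom(const std::u32string& text) {
    std::u32string out;
    out.reserve(text.size());
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (text[i] == kBom) {
            if (i != 0) {
                out.push_back(kWordJoiner);  // deprecated use inside the text
            }
            // at i == 0 it is the BOM, so emit nothing
        } else {
            out.push_back(text[i]);
        }
    }
    return out;
}
```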

OTHER TIPS

Without knowing what those characters really are (i.e., without a hex dump) it's only a guess, but my immediate guess would be that what you're seeing is a byte order mark (BOM) encoded as UTF-8. The Unicode Standard neither requires nor recommends a BOM for UTF-8, but in practice it's fairly common.

Just to clarify, you should realize that this is not really a byte-order mark in the usual sense. The basic idea of a byte-order mark simply doesn't apply to UTF-8, because UTF-8 is a sequence of single bytes with no byte order to mark. You can still apply the normal UTF-8 encoding rules to the code point that serves as the BOM if you want to, in which case it acts only as an encoding signature.

Why does a file saved as UTF8 not have this character in the beginning [...] I have no idea what it is, I just know that it's not there when I save to ASCII.

I suppose you are referring to the Byte Order Mark (BOM) U+FEFF, a zero-width, non-breaking space character. Here (Notepad++ 5.4.3) a file saved as UTF-8 has the bytes EF BB BF at the beginning. I suppose that's the BOM encoded in UTF-8.

How can I check the encoding of a file

You cannot. You have to know what encoding your file was written in. While Unicode-encoded files might start with a BOM, I don't think there's a requirement that they do so.

Regarding your second point, every valid ASCII string is also a valid UTF-8 string, so you don't have to check for ASCII explicitly. Simply read the file as UTF-8; if the file doesn't contain a valid UTF-8 string, you will get an error.
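If your decoder doesn't report errors for you, a validity check can be written by hand. The following is a sketch under the assumption that the whole file has been read into a std::string. The function name is made up for this example. Passing the check shows the bytes are well-formed UTF-8 (which includes pure ASCII), not that the author intended UTF-8.

```cpp
#include <cstddef>
#include <string>

// Returns true if `bytes` is well-formed UTF-8 (hypothetical helper).
// Rejects stray continuation bytes, truncated sequences, overlong forms,
// UTF-16 surrogates (U+D800..U+DFFF) and values above U+10FFFF.
bool is_valid_utf8(const std::string& bytes) {
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t extra = 0;   // continuation bytes expected after lead
        char32_t cp = 0;         // code point being assembled
        char32_t smallest = 0;   // smallest value allowed for this length
        if (lead < 0x80) {
            cp = lead;
        } else if ((lead & 0xE0) == 0xC0) {
            extra = 1; cp = lead & 0x1F; smallest = 0x80;
        } else if ((lead & 0xF0) == 0xE0) {
            extra = 2; cp = lead & 0x0F; smallest = 0x800;
        } else if ((lead & 0xF8) == 0xF0) {
            extra = 3; cp = lead & 0x07; smallest = 0x10000;
        } else {
            return false;        // continuation byte or invalid lead byte
        }
        if (n - i - 1 < extra) {
            return false;        // sequence runs past the end of the data
        }
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) {
                return false;    // not of the form 10xxxxxx
            }
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < smallest) return false;                  // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;   // surrogate
        if (cp > 0x10FFFF) return false;                  // out of range
        i += extra + 1;
    }
    return true;
}
```

A file in codepage 1252 that contains any non-ASCII character will usually fail this check, because a lone byte such as 0xE9 is not a complete UTF-8 sequence.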

I'm guessing you meant to ask why it does have those characters. Those characters are probably the byte order mark, which, according to that link, is encoded in UTF-8 as the bytes EF BB BF.

As for knowing what encoding a file is in, you cannot derive that from the file itself. You have to know it ahead of time (or ask the user who supplies you with the file). For a better understanding of encoding without having to do a lot of reading, I highly recommend Joel Spolsky's "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".

Licensed under: CC-BY-SA with attribution
