Typographic apostrophe + wide string literal broke my wofstream (C++)
03-07-2019
Question
I’ve just encountered some strange behaviour when dealing with the ominous typographic apostrophe ( ’ ), as opposed to the typewriter apostrophe ( ' ). Used in a wide string literal, the apostrophe breaks wofstream.
This code works
ofstream file("test.txt");
file << "A’B" ;
file.close();
==> A’B
This code works
wofstream file("test.txt");
file << "A’B" ;
file.close();
==> A’B
This code fails
wofstream file("test.txt");
file << L"A’B" ;
file.close();
==> A
This code fails...
wstring test = L"A’B";
wofstream file("test.txt");
file << test ;
file.close();
==> A
Any ideas?
Solution
You should "enable" locale support before using wofstream:
std::locale::global(std::locale("")); // Use the environment's locale; a plain std::locale() only copies the current global ("C") locale
wofstream file("test.txt");
file << L"A’B";
So if your system locale is en_US.UTF-8, the file test.txt will contain UTF-8 encoded data (5 bytes: one for A, three for the apostrophe, one for B). If your system locale is en_US.ISO8859-1, the text would be written in an 8-bit encoding (3 bytes), except that ISO 8859-1 has no typographic apostrophe, so the conversion of that character fails.
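As a minimal sketch of the same idea, the locale can also be imbued on the stream itself instead of being installed globally. The helper name `write_wide` and its parameters are made up for illustration; it reports failure if the named locale is not installed or if the wide-to-narrow conversion fails.

```cpp
#include <fstream>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: write `text` to `path` through a wofstream that uses
// the named locale. Returns false if the locale is unavailable, the file
// cannot be opened, or the wide-to-narrow conversion fails.
bool write_wide(const std::string& path, const std::wstring& text,
                const char* locale_name) {
    std::locale loc;
    try {
        loc = std::locale(locale_name);  // "" selects the environment's locale
    } catch (const std::runtime_error&) {
        return false;                    // locale not available on this system
    }
    std::wofstream file;
    file.imbue(loc);                     // imbue before opening the file
    file.open(path);
    if (!file.is_open()) return false;
    file << text;
    file.flush();                        // conversion errors surface on flush
    return file.good();
}
```

Calling `write_wide("test.txt", L"A\u2019B", "")` should succeed when the environment locale is a UTF-8 one; with `"C"` the apostrophe is expected to fail on common implementations, while plain ASCII text goes through either way.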
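By contrast, the narrow-literal version from the question:
wofstream file("test.txt");
file << "A’B" ;
file.close();
This code works because "A’B" is actually a UTF-8 string (assuming the source file is saved as UTF-8), and you save that UTF-8 string to the file byte by byte.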
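To make the byte-by-byte point concrete, here is a small sketch that dumps the bytes of a narrow string. The helper `hex_bytes` is a made-up name, and the apostrophe is written as explicit UTF-8 escapes so the result does not depend on how the source file is saved.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: return the bytes of `s` as space-separated hex pairs.
std::string hex_bytes(const std::string& s) {
    std::string out;
    char buf[4];
    for (unsigned char c : s) {
        std::snprintf(buf, sizeof buf, "%02x", static_cast<unsigned>(c));
        if (!out.empty()) out += ' ';
        out += buf;
    }
    return out;
}

// U+2019 is E2 80 99 in UTF-8. The literal is split before "B" because
// "\x99B" would otherwise be parsed as one (too large) hex escape.
const std::string narrow_sample = "A\xE2\x80\x99" "B";
```

`hex_bytes(narrow_sample)` yields `41 e2 80 99 42`: five plain bytes that a stream can copy without knowing anything about the encoding.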
Note: I assume you are using a POSIX-like OS and that your system locale is something other than "C", which is the locale every program starts with by default.
OTHER TIPS
Are you sure it's not your compiler's support for Unicode characters in source files that is "broken"? What if you use \x or similar to encode the character in the string literal? Is your source file even in whatever encoding might map to a wchar_t for your compiler?
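One way to take the source file's encoding out of the picture is to spell the character as a universal character name. This is only a sketch, and `sample_text` is a made-up name:

```cpp
#include <string>

// Hypothetical helper: the sample text with the apostrophe written as a
// universal character name, so the source file's encoding does not matter.
std::wstring sample_text() {
    return L"A\u2019B";  // U+2019 RIGHT SINGLE QUOTATION MARK
}
```

If writing `sample_text()` to the wofstream still truncates the output after the A, the compiler's handling of the source file is not the culprit and the stream's conversion is.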
Try wrapping the stream insertion in a try-catch block and tell us what, if any, exception it throws.
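Keep in mind that iostreams do not throw by default; a failed write only sets failbit or badbit. A minimal sketch of how to make the failure visible, assuming a made-up helper name `try_write`:

```cpp
#include <exception>
#include <fstream>
#include <string>

// Hypothetical helper: returns an empty string on success, otherwise a short
// description of what went wrong.
std::string try_write(const std::string& path, const std::wstring& text) {
    std::wofstream file(path);
    if (!file.is_open()) return "could not open file";
    try {
        // Ask the stream to throw when failbit or badbit gets set.
        file.exceptions(std::ios_base::failbit | std::ios_base::badbit);
        file << text;
        file.flush();  // buffered characters are converted and written here
    } catch (const std::exception& e) {
        return std::string("stream error: ") + e.what();
    }
    return std::string();
}
```

Without the `exceptions()` call, the equivalent check is to test `file.fail()` or `file.bad()` after flushing.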
I am not sure what is going on here, but I'll hazard a guess anyway. The typographic apostrophe probably has a value that fits into one byte. This works with "A’B" since it blindly copies bytes without bothering about the underlying encoding. However, with L"A’B", an implementation-dependent encoding factor comes into play. It probably doesn't find the proper UTF-16 (if you are on Windows) or UTF-32 (if you are on *nix/Mac) value to store for this particular character.