Typographic apostrophe + wide string literal broke my wofstream (C++)

https://stackoverflow.com/questions/816092

03-07-2019

Question

I’ve just encountered some strange behaviour when dealing with the ominous typographic apostrophe ( ’ ) – not the typewriter apostrophe ( ' ). When used in a wide string literal, the apostrophe breaks wofstream: everything from the apostrophe onward is missing from the file.

This code works

ofstream file("test.txt");
file << "A’B" ;
file.close();

==> A’B

This code works

wofstream file("test.txt");
file << "A’B" ;
file.close();

==> A’B

This code fails

wofstream file("test.txt");
file << L"A’B" ;
file.close();

==> A

This code fails...

wstring test = L"A’B";
wofstream file("test.txt");
file << test ;
file.close();

==> A

Any ideas?

Solution

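You should "enable" the locale before using wofstream. Note the empty string argument: std::locale("") picks up the locale from the environment, whereas a default-constructed std::locale() is only a copy of the current global locale and changes nothing.

std::locale::global(std::locale("")); // Use the environment's locale instead of the default "C" locale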
wofstream file("test.txt");
file << L"A’B";

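So if your system locale is en_US.UTF-8, the file test.txt will contain UTF-8 encoded data (5 bytes, since the apostrophe U+2019 takes 3 bytes in UTF-8). If your system locale is en_US.ISO8859-1, the stream will try to use that 8-bit encoding instead (1 byte per character), but ISO 8859-1 has no typographic apostrophe, so the conversion of that character fails.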
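
A minimal sketch of the same idea, applied to one stream instead of the global locale. The helper name write_wide and the file names are hypothetical, and the behaviour for non-ASCII text depends on which locales are installed on your system.

```cpp
#include <fstream>
#include <ios>
#include <locale>
#include <string>

// Hypothetical helper: write `text` to `path` through a wofstream that
// converts wide characters with the codecvt facet of `loc`.
// Returns true only if the text was converted and flushed without error.
bool write_wide(const std::string& path, const std::wstring& text,
                const std::locale& loc) {
    std::wofstream file;
    file.imbue(loc);  // imbue before open() so the file buffer uses loc
    file.open(path.c_str(), std::ios::binary);
    if (!file) return false;
    file << text;
    file.flush();     // the wide-to-narrow conversion happens on flush
    return file.good();
}
```

With a UTF-8 environment locale, write_wide("test.txt", L"A\u2019B", std::locale("")) should return true and produce a 5-byte file. Passing std::locale::classic() instead is expected to return false, which matches the truncated output in the question. Keep in mind that std::locale("") throws std::runtime_error if the environment names a locale that is not installed.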

wofstream file("test.txt");
file << "A’B" ;
file.close();

This code works because "A’B" is actually a UTF-8 string (assuming your source file is saved as UTF-8), and its bytes are written to the file one by one without any real conversion.
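
To see what "byte by byte" means here, a small sketch that dumps the bytes of the narrow string. The helper hex_bytes is hypothetical, and the literal spells out the bytes a UTF-8 source file would contain for "A’B" so the result does not depend on your editor settings.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: render every byte of `s` as two uppercase hex digits,
// separated by single spaces.
std::string hex_bytes(const std::string& s) {
    std::string out;
    char buf[4];
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        unsigned value = static_cast<unsigned char>(s[i]);
        std::snprintf(buf, sizeof buf, "%02X", value);
        if (i != 0) out += ' ';
        out += buf;
    }
    return out;
}

// "A’B" as UTF-8 bytes. The literal is split before "B" because a hex
// escape would otherwise swallow the B as another hex digit.
const std::string utf8_literal = "A\xE2\x80\x99" "B";
```

hex_bytes(utf8_literal) gives "41 E2 80 99 42": one byte for A, three for the apostrophe, one for B.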

Note: I assume you are using a POSIX-like OS and that your environment's locale is something other than "C". Every C++ program starts in the "C" locale until it explicitly selects another one.

OTHER TIPS

Are you sure it's not your compiler's support for Unicode characters in source files that is "broken"? What if you use \x or a similar escape to encode the character in the string literal? Is your source file even in an encoding that your compiler maps correctly to wchar_t?
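
One way to rule out the source encoding, as a sketch: write the apostrophe as a universal character name and check what the compiler stored. The function name apostrophe_is_intact is made up for this example.

```cpp
#include <string>

// U+2019 RIGHT SINGLE QUOTATION MARK written as a universal character name,
// so the value does not depend on how the source file is encoded.
const std::wstring escaped = L"A\u2019B";

// Hypothetical check: true if the string is three code units long and the
// middle one is exactly 0x2019.
bool apostrophe_is_intact(const std::wstring& s) {
    return s.size() == 3 && s[1] == static_cast<wchar_t>(0x2019);
}
```

If apostrophe_is_intact(escaped) is true but the same check fails for a literal with the apostrophe typed directly, the compiler is misreading the source file. GCC's -finput-charset option and MSVC's /utf-8 option control how the source is decoded.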

Try wrapping the stream insertion in a try-catch block and tell us what exception, if any, it throws.
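
Note that iostreams do not throw by default, so a plain try-catch around the insertion will catch nothing. A rough sketch that turns the exceptions on first (the helper name try_write is hypothetical, and the message text differs between standard libraries):

```cpp
#include <exception>
#include <fstream>
#include <ios>
#include <string>

// Hypothetical helper: write `text` to `path` with stream exceptions enabled.
// Returns an empty string on success, otherwise a description of the failure.
std::string try_write(const std::string& path, const std::wstring& text) {
    std::wofstream file(path.c_str());
    try {
        // Streams only throw when asked to. This call itself throws if the
        // stream is already in a failed state, for example if open failed.
        file.exceptions(std::ios::failbit | std::ios::badbit);
        file << text;
        file.flush();  // force the wide-to-narrow conversion now
    } catch (const std::exception& e) {
        return std::string("write failed: ") + e.what();
    }
    return std::string();
}
```

Calling try_write("test.txt", L"A\u2019B") in the default "C" locale should then report the conversion failure instead of silently truncating the file.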

I am not sure what is going on here, but I'll hazard a guess anyway. The typographic apostrophe probably has a value that fits into one byte. This works with "A’B" since the stream blindly copies bytes without bothering about the underlying encoding. However, with L"A’B", an implementation-dependent encoding step comes into play. It probably doesn't find the proper UTF-16 (if you are on Windows) or UTF-32 (if you are on *nix/Mac) value to store for this particular character.
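
If you want to avoid the implementation-dependent conversion altogether, you can encode the wide string to UTF-8 yourself and write the result through a plain ofstream. This is a different technique from the locale fix, shown only as a sketch; the function names are made up, and it assumes each wchar_t holds one whole code point.

```cpp
#include <string>

// Append the UTF-8 encoding of one Unicode code point to `out`.
// Sketch only: it does not reject surrogates or values above U+10FFFF.
void append_utf8(std::string& out, unsigned long cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Assumes every wchar_t is a whole code point: true for UTF-32 wchar_t, and
// for UTF-16 wchar_t as long as the text contains no surrogate pairs.
std::string to_utf8(const std::wstring& text) {
    std::string out;
    for (std::wstring::size_type i = 0; i < text.size(); ++i)
        append_utf8(out, static_cast<unsigned long>(text[i]));
    return out;
}
```

to_utf8(L"A\u2019B") yields the 5 bytes 41 E2 80 99 42, which you can write with an ofstream opened in binary mode regardless of the current locale.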

Licensed under: CC-BY-SA with attribution
