Why is it necessary to imbue a stream with a fixed-length encoding? Also, how can I prevent leaking memory when imbuing?

StackOverflow https://stackoverflow.com/questions/19719467

  •  02-07-2022

Question

The other day, I was writing code similar to:

wchar_t buffer[1024];
std::wifstream input(L"input.txt");

while (input.good())
{
    input.getline(buffer, 1024);
    // ... do stuff...
}

input.close();

I found that, after the first call to getline, buffer contained the correct (UTF-16 LE) bytes, but instead of buffer being seen as a wchar_t array, it had magically transformed into a byte array. I used reinterpret_cast<wchar_t *>(buffer) and obtained the result I wanted.

Then came the next call to getline... this time, buffer was again seen as a byte array, but the bytes were skewed: I expected to see 0x31 0x00 0x32 0x00 0x33 0x00, but instead I saw 0x00 0x31 0x00 0x32 0x00 0x33.

Now, I can understand how things might get skewed if characters had variable-length encodings... but ALL the characters in my input.txt file are ASCII and consequently can be encoded with 2 bytes each (using UTF-16 LE). Why the skew?

An answerer on SO informed me that I should imbue the stream like so:

std::wifstream fin("text.txt", std::ios::binary);
// apply facet
fin.imbue(std::locale(fin.getloc(),
          new std::codecvt_utf16<wchar_t, 0x10ffff, std::little_endian>));

Indeed, this resolved my issue completely, but I don't understand why imbuing is necessary when all the characters you are dealing with have a fixed-length encoding.
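
For completeness, the loop after applying the fix looks roughly like this (same file name and buffer size as my original code; I am also testing the result of getline rather than good(), which I understand is the more robust pattern):

#include <codecvt>
#include <fstream>
#include <locale>

int main()
{
    std::wifstream input("input.txt", std::ios::binary);

    // Imbue before the first read so the facet governs all conversions.
    input.imbue(std::locale(input.getloc(),
                new std::codecvt_utf16<wchar_t, 0x10ffff, std::little_endian>));

    wchar_t buffer[1024];
    while (input.getline(buffer, 1024))
    {
        // ... do stuff with buffer, which now holds proper wchar_t characters ...
    }
}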

Secondly, the facet object passed to the std::locale constructor (the new-ed second argument in the code above) seems to be leaking memory?! If I instead allocate a std::codecvt_utf16<wchar_t, 0x10ffff, std::little_endian> object on the stack and pass its address, everything seems to work until my stack variable goes out of scope (right before main's closing brace), at which point the application crashes, complaining that some function is calling a pure virtual function. I see the same behavior if I use the code above and instead call delete on the facet before main returns.
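
For reference, my current understanding of the two ownership options, based on the facet constructor's refs parameter, is roughly this (a sketch only; please correct me if I have it wrong):

#include <codecvt>
#include <fstream>
#include <locale>

int main()
{
    // Option A: heap-allocate the facet with the default refs = 0.
    // The locale takes ownership and destroys the facet once the last
    // locale/stream referencing it goes away, so no delete is needed
    // and (as far as I can tell) nothing actually leaks.
    std::wifstream a("text.txt", std::ios::binary);
    a.imbue(std::locale(a.getloc(),
            new std::codecvt_utf16<wchar_t, 0x10ffff, std::little_endian>));

    // Option B: construct the facet with refs = 1, in which case the
    // locale never destroys it; the object just has to outlive every
    // stream/locale that uses it (a static works, unlike my stack
    // variable, which died before the stream was done with it).
    static std::codecvt_utf16<wchar_t, 0x10ffff, std::little_endian> facet(1);
    std::wifstream b("text.txt", std::ios::binary);
    b.imbue(std::locale(b.getloc(), &facet));
}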

Thank you in advance for your comments and answers.


Solution

Your file contains something like 31 00 32 00 33 00 0A 00 34 00 .... 0A is the line feed character.

With the default codecvt facet, each byte is individually converted to Unicode. So 31 becomes U+0031, 00 becomes U+0000, and so on. getline stops at the 0A byte.

The next getline call continues where the previous one left off: the leftover 00 becomes U+0000, and so on. That stray U+0000 is why every value in the second line appears shifted by one byte relative to what you expected.
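
A minimal sketch that makes the effect visible, assuming the same file contents as described above (the file name input.txt and the dump_lines helper are just for illustration): it prints the code units getline produces with the default facet and with the imbued one.

#include <codecvt>
#include <fstream>
#include <iomanip>
#include <iostream>
#include <locale>
#include <string>

// Print each wchar_t of every line as a 4-digit hex code unit.
static void dump_lines(std::wifstream& in)
{
    std::wstring line;
    while (std::getline(in, line))
    {
        for (wchar_t c : line)
            std::cout << std::setw(4) << std::setfill('0') << std::hex
                      << static_cast<unsigned long>(c) << ' ';
        std::cout << '\n';
    }
}

int main()
{
    // Default facet: every byte becomes one character, so "123" comes back
    // as 0031 0000 0032 0000 0033 0000, the line ends at the 0A byte, and
    // the trailing 00 of the newline is left behind for the next getline.
    std::wifstream plain("input.txt", std::ios::binary);
    dump_lines(plain);

    // Imbued facet: byte pairs are combined into UTF-16 LE code units, so
    // the same line comes back as 0031 0032 0033 and nothing is left over.
    std::wifstream fixed("input.txt", std::ios::binary);
    fixed.imbue(std::locale(fixed.getloc(),
                new std::codecvt_utf16<wchar_t, 0x10ffff, std::little_endian>));
    dump_lines(fixed);
}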

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow