Can _wfopen read a UCS-2 LE encoded file into a wide-char string buffer?

https://stackoverflow.com/questions/13617786

03-12-2021

Question

I want to read a .reg file exported from the Windows registry. I found that the .reg file is encoded as a Windows "Unicode" file (which I think means UCS-2 LE, because the first two bytes are FF FE).

So I read the file like this:

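// _wfopen() takes a wide-character mode string, so use L"..." rather than
// _T(), which is only wide when _UNICODE is defined.
fp = _wfopen(lpszRegFilePath, L"r, ccs=UNICODE");
if (NULL == fp)
{
    dwErr = ERROR_NOT_FOUND;
    break;  // exits the enclosing loop or switch (not shown)
}
szData = new WCHAR[8192];
// ZeroMemory() takes a size in bytes, not a count of WCHARs.
ZeroMemory(szData, 8192 * sizeof(WCHAR));

fgetws(szData, 8192, fp);
//........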

Here is the szData result: (screenshot not included)

Can _wfopen recognize a BOM? If so, why does it just ignore the FF FE BOM?

Solution

The "ccs" parameter allows _wfopen() to detect BOMs and mark the FILE* accordingly, so that it decodes the rest of the file correctly (if a BOM is present, it overrides the "ccs" value). However, it does not discard the BOM, and there is nothing in the documentation to say it does. So you will just have to check the first 2 WCHARs read from the file to see if they are the UTF-16LE BOM (a UTF-8 BOM would get decoded into a UTF-16LE BOM) and ignore them if needed.
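As a minimal sketch of that check: once the file has been decoded as Unicode, a UTF-16LE BOM (bytes FF FE) shows up in the buffer as the single character U+FEFF. The helper name SkipLeadingBom and the constant kBom below are made up for illustration; they are not part of the CRT or the Windows API.

```cpp
#include <cassert>

// U+FEFF is what a UTF-16LE BOM (bytes FF FE) looks like once decoded.
constexpr wchar_t kBom = 0xFEFF;

// Hypothetical helper: returns a pointer to the first character after a
// leading BOM, or `line` itself when the buffer does not start with one.
// `line` must point to a NUL-terminated wide string.
inline const wchar_t* SkipLeadingBom(const wchar_t* line) {
    assert(line != nullptr);
    return (line[0] == kBom) ? line + 1 : line;
}
```

With the buffer from the question, this would be called once, right after the first fgetws(), for example as `const WCHAR* text = SkipLeadingBom(szData);`. Later lines do not need the check, since a BOM only appears at the start of the file.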

Update: something just occurred to me. fgetws() is returning the individual bytes of the BOM as individual WCHAR values in your buffer. It should not be doing that if it is respecting the BOM, which means it is parsing the file as ANSI/MBCS and not as UTF-16LE. Are you using Visual C++? The "ccs" parameter is a VC++-specific extension to _wfopen(). Runtimes from other compiler vendors may not support it.
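If the runtime's handling of "ccs" is in doubt, one way to take it out of the picture is to open the file in binary mode, read the first few bytes yourself, and look for a BOM. The sketch below only classifies a byte buffer; the names Bom and DetectBom are hypothetical, and how the bytes get read (for example with fread() on a stream opened in "rb" mode) is left to the caller.

```cpp
#include <cassert>
#include <cstddef>

enum class Bom { None, Utf8, Utf16LE, Utf16BE };

// Hypothetical helper: classify the first bytes of a file that was read in
// binary mode. Only the three BOMs relevant here are recognized; a UTF-32LE
// BOM (FF FE 00 00) would be reported as Utf16LE by this simple check.
inline Bom DetectBom(const unsigned char* data, std::size_t len) {
    assert(data != nullptr || len == 0);
    if (len >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        return Bom::Utf8;
    if (len >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return Bom::Utf16LE;
    if (len >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        return Bom::Utf16BE;
    return Bom::None;
}
```

For a .reg file like the one in the question, the first two bytes FF FE would be classified as Bom::Utf16LE, which tells you the remaining bytes can be treated as little-endian 16-bit code units regardless of what the stream layer does.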

Licensed under: CC-BY-SA with attribution
