Writing UTF-16 to a file in binary mode

https://stackoverflow.com/questions/207662

03-07-2019

Question

I'm trying to write a wstring to a file with ofstream in binary mode, but I think I'm doing something wrong. This is what I've tried:

ofstream outFile("test.txt", std::ios::out | std::ios::binary);
wstring hello = L"hello";
outFile.write((char *) hello.c_str(), hello.length() * sizeof(wchar_t));
outFile.close();

Opening test.txt in, for example, Firefox with the encoding set to UTF16 shows it as:

h .e�l .l .o

Could anyone tell me why this happens?

EDIT:

Opening the file in a hex editor gives:

FF FE 68 00 00 00 65 00 00 00 6C 00 00 00 6C 00 00 00 6F 00 00 00

For some reason there seem to be two extra bytes between every character.

Solution

I suspect that sizeof(wchar_t) is 4 in your environment, i.e. it's writing out UTF-32/UCS-4 instead of UTF-16. That's certainly what the hex dump looks like.

That's easy enough to test (just print out sizeof(wchar_t)).

To go from a UTF-32 wstring to UTF-16 you'll need to apply a proper encoding, as surrogate pairs come into play.
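One way to do that conversion by hand is sketched below. It assumes each wchar_t holds either a UTF-32 code point (4-byte wchar_t) or a UTF-16 code unit (2-byte wchar_t, passed through unchanged). The names `put_u16le`, `to_utf16le` and `write_utf16le_file` are hypothetical, and replacing out-of-range values with U+FFFD is a choice made for this sketch, not something the answer prescribes.

```cpp
#include <cstdint>
#include <fstream>
#include <string>

// Append one 16-bit code unit in little-endian byte order.
static void put_u16le(std::string& out, std::uint16_t unit) {
    out.push_back(static_cast<char>(unit & 0xFF));
    out.push_back(static_cast<char>(unit >> 8));
}

// Encode a wstring as UTF-16LE bytes, independent of sizeof(wchar_t).
std::string to_utf16le(const std::wstring& text) {
    std::string out;
    out.reserve(text.size() * 2);
    for (wchar_t wc : text) {
        std::uint32_t cp = static_cast<std::uint32_t>(wc);
        if (cp > 0x10FFFF) {
            // Not a Unicode code point: emit the replacement character.
            put_u16le(out, 0xFFFD);
        } else if (cp >= 0x10000) {
            // Outside the BMP: split into a high and a low surrogate.
            cp -= 0x10000;
            put_u16le(out, static_cast<std::uint16_t>(0xD800 + (cp >> 10)));
            put_u16le(out, static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            // BMP value (or an existing UTF-16 code unit): copy as is.
            put_u16le(out, static_cast<std::uint16_t>(cp));
        }
    }
    return out;
}

// Write a UTF-16LE BOM followed by the encoded text, in binary mode.
bool write_utf16le_file(const std::string& path, const std::wstring& text) {
    std::ofstream out(path.c_str(), std::ios::out | std::ios::binary);
    if (!out) return false;
    out.write("\xFF\xFE", 2);
    const std::string bytes = to_utf16le(text);
    out.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
    return static_cast<bool>(out);
}
```

With this, `write_utf16le_file("test.txt", L"hello")` should produce FF FE 68 00 65 00 6C 00 6C 00 6F 00 regardless of the platform's wchar_t size.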

Other tips

Here we run into the little-used locale properties. If you output your string as a string (rather than as raw data), you can get the locale to do the appropriate conversion auto-magically.

N.B. This code does not take into account the endianness of the wchar_t character.

#include <locale>
#include <fstream>
#include <iostream>
// See Below for the facet
#include "UTF16Facet.h"

int main(int argc,char* argv[])
{
   // Construct a custom Unicode facet and add it to a locale.
   UTF16Facet *unicodeFacet = new UTF16Facet();
   const std::locale unicodeLocale(std::cout.getloc(), unicodeFacet);

   // Create a stream and imbue it with the facet
   std::wofstream   saveFile;
   saveFile.imbue(unicodeLocale);


   // Now the stream is imbued we can open it.
   // NB If you open the file stream first, any attempt to imbue it with a locale will silently fail.
   saveFile.open("output.uni");
   saveFile << L"This is my Data\n";


   return(0);
}

The File: UTF16Facet.h

 #include <locale>

class UTF16Facet: public std::codecvt<wchar_t,char,std::char_traits<wchar_t>::state_type>
{
   typedef std::codecvt<wchar_t,char,std::char_traits<wchar_t>::state_type> MyType;
   typedef MyType::state_type          state_type;
   typedef MyType::result              result;


   /* This function deals with converting data from the input stream into the internal stream.*/
   /*
    * from, from_end:  Points to the beginning and end of the input that we are converting 'from'.
    * to,   to_limit:  Points to where we are writing the conversion 'to'
    * from_next:       When the function exits this should have been updated to point at the next location
    *                  to read from. (ie the first unconverted input character)
    * to_next:         When the function exits this should have been updated to point at the next location
    *                  to write to.
    *
    * status:          This indicates the status of the conversion.
    *                  possible values are:
    *                  error:      An error occurred the bad file bit will be set.
    *                  ok:         Everything went to plan
    *                  partial:    Not enough input data was supplied to complete any conversion.
    *                  nonconv:    no conversion was done.
    */
   virtual result  do_in(state_type &s,
                           const char  *from,const char *from_end,const char* &from_next,
                           wchar_t     *to,  wchar_t    *to_limit,wchar_t*    &to_next) const
   {
       // Loop over both the input and output array/
       // Loop while a whole 2-byte unit of input remains and there is room for output.
       for(;(from_end - from >= 2) && (to < to_limit);from += 2,++to)
       {
           /*Input the Data*/
           /* As the input 16 bits may not fill the wchar_t object,
            * initialise it so that all its bits are zeroed out. This
            * is important on systems with 32-bit wchar_t objects.
            */
           (*to)                               = L'\0';

           /* Next read the data from the input stream into the
            * wchar_t object. Remember that we need to copy
            * into the bottom 16 bits no matter what size the
            * wchar_t object is (this byte-wise copy assumes a
            * little-endian host).
            */
           reinterpret_cast<char*>(to)[0]  = from[0];
           reinterpret_cast<char*>(to)[1]  = from[1];
       }
       from_next   = from;
       to_next     = to;

       // Unconsumed input (an odd trailing byte or a full output buffer) means a partial conversion.
       return((from < from_end)?partial:ok);
   }



   /* This function deals with converting data from the internal stream to a C/C++ file stream.*/
   /*
    * from, from_end:  Points to the beginning and end of the input that we are converting 'from'.
    * to,   to_limit:  Points to where we are writing the conversion 'to'
    * from_next:       When the function exits this should have been updated to point at the next location
    *                  to read from. (ie the first unconverted input character)
    * to_next:         When the function exits this should have been updated to point at the next location
    *                  to write to.
    *
    * status:          This indicates the status of the conversion.
    *                  possible values are:
    *                  error:      An error occurred the bad file bit will be set.
    *                  ok:         Everything went to plan
    *                  partial:    Not enough input data was supplied to complete any conversion.
    *                  nonconv:    no conversion was done.
    */
   virtual result do_out(state_type &state,
                           const wchar_t *from, const wchar_t *from_end, const wchar_t* &from_next,
                           char          *to,   char          *to_limit, char*          &to_next) const
   {
       // Loop while input remains and there is room for a whole 2-byte unit of output.
       for(;(from < from_end) && (to_limit - to >= 2);++from,to += 2)
       {
           /* Output the Data */
           /* NB I am assuming the characters are encoded as UTF-16.
            * This means they are 16 bits inside a wchar_t object.
            * As the size of wchar_t varies between platforms I need
            * to take this into consideration and only take the bottom
            * 16 bits of each wchar_t object (this byte-wise copy
            * assumes a little-endian host).
            */
           to[0]     = reinterpret_cast<const char*>(from)[0];
           to[1]     = reinterpret_cast<const char*>(from)[1];

       }
       from_next   = from;
       to_next     = to;

       // Unconverted input means the output buffer filled up: report a partial conversion.
       return((from < from_end)?partial:ok);
   }
};

It is easy if you can use the C++11 standard, because it adds many includes, such as "utf8", that solve this problem for good.
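As an illustration of the C++11 route, here is a sketch that uses `std::wstring_convert` with `std::codecvt_utf16` from `<codecvt>`. The answer does not name these facilities, so treat the choice as an assumption. Both are deprecated since C++17, and `to_bytes` throws `std::range_error` if the input cannot be converted. The function name `to_utf16le_cpp11` is hypothetical.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Sketch: encode a wstring as UTF-16LE bytes with the C++11 <codecvt> facets.
// Note: std::wstring_convert and std::codecvt_utf16 are deprecated since C++17.
std::string to_utf16le_cpp11(const std::wstring& text) {
    std::wstring_convert<
        std::codecvt_utf16<wchar_t, 0x10ffff, std::little_endian>> converter;
    return converter.to_bytes(text);
}
```

The resulting bytes can then be written with an ofstream opened in binary mode.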

But if you want to use multi-platform code with older standards, you can use this method to write with streams:

Read the article about UTF converter for streams
Add stxutif.h to your project from sources above

Open the file in ANSI mode and add the BOM to the start of a file, like this:

std::ofstream fs;
fs.open(filepath, std::ios::out|std::ios::binary);

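// UTF-8 byte order mark
const char smarker[3] = { '\xEF', '\xBB', '\xBF' };

// Write exactly three bytes; streaming the array with << would treat it as a
// null-terminated string and read past its end.
fs.write(smarker, sizeof(smarker));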
fs.close();

Then open the file as UTF and write your content there:

std::wofstream fs;
fs.open(filepath, std::ios::out|std::ios::app);

std::locale utf8_locale(std::locale(), new utf8cvt<false>);
fs.imbue(utf8_locale); 

fs << .. // Write anything you want...

On Windows, using wofstream with the UTF-16 facet defined above fails because the wofstream converts every byte with the value 0A into the two bytes 0D 0A. This happens irrespective of how you pass the 0A byte in: '\x0A', L'\x0A', L'\x000A', '\n', L'\n' and std::endl all give the same result. On Windows you have to open the file with an ofstream (not a wofstream) in binary mode and write the output just as it is done in the original post.

The provided Utf16Facet didn't work in gcc for big strings, here is the version that worked for me... This way the file will be saved in UTF-16LE. For UTF-16BE, simply invert the assignments in do_in and do_out, e.g. to[0] = from[1] and to[1] = from[0]

#include <locale>   // declares std::codecvt; the libstdc++-internal <bits/codecvt.h> is not needed


class UTF16Facet: public std::codecvt<wchar_t,char,std::char_traits<wchar_t>::state_type>
{
   typedef std::codecvt<wchar_t,char,std::char_traits<wchar_t>::state_type> MyType;
   typedef MyType::state_type          state_type;
   typedef MyType::result              result;


   /* This function deals with converting data from the input stream into the internal stream.*/
   /*
    * from, from_end:  Points to the beginning and end of the input that we are converting 'from'.
    * to,   to_limit:  Points to where we are writing the conversion 'to'
    * from_next:       When the function exits this should have been updated to point at the next location
    *                  to read from. (ie the first unconverted input character)
    * to_next:         When the function exits this should have been updated to point at the next location
    *                  to write to.
    *
    * status:          This indicates the status of the conversion.
    *                  possible values are:
    *                  error:      An error occurred the bad file bit will be set.
    *                  ok:         Everything went to plan
    *                  partial:    Not enough input data was supplied to complete any conversion.
    *                  nonconv:    no conversion was done.
    */
   virtual result  do_in(state_type &s,
                           const char  *from,const char *from_end,const char* &from_next,
                           wchar_t     *to,  wchar_t    *to_limit,wchar_t*    &to_next) const
   {

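       // Convert whole 2-byte units while input remains and there is room for output.
       for(;(from_end - from >= 2) && (to < to_limit);from += 2,++to)
       {
           (*to)                               = L'\0';

           reinterpret_cast<char*>(to)[0]  = from[0];
           reinterpret_cast<char*>(to)[1]  = from[1];
       }
       from_next   = from;
       to_next     = to;

       // Leftover input means the caller must call again with more data or more room.
       return((from < from_end)?partial:ok);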
   }



   /* This function deals with converting data from the internal stream to a C/C++ file stream.*/
   /*
    * from, from_end:  Points to the beginning and end of the input that we are converting 'from'.
    * to,   to_limit:  Points to where we are writing the conversion 'to'
    * from_next:       When the function exits this should have been updated to point at the next location
    *                  to read from. (ie the first unconverted input character)
    * to_next:         When the function exits this should have been updated to point at the next location
    *                  to write to.
    *
    * status:          This indicates the status of the conversion.
    *                  possible values are:
    *                  error:      An error occurred the bad file bit will be set.
    *                  ok:         Everything went to plan
    *                  partial:    Not enough input data was supplied to complete any conversion.
    *                  nonconv:    no conversion was done.
    */
   virtual result do_out(state_type &state,
                           const wchar_t *from, const wchar_t *from_end, const wchar_t* &from_next,
                           char          *to,   char          *to_limit, char*          &to_next) const
   {

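       // Convert while input remains and there is room for a whole 2-byte unit of output.
       for(;(from < from_end) && (to_limit - to >= 2);++from, to += 2)
       {
           to[0]     = reinterpret_cast<const char*>(from)[0];
           to[1]     = reinterpret_cast<const char*>(from)[1];
       }
       from_next   = from;
       to_next     = to;

       // Leftover input means the output buffer filled up: report a partial conversion.
       return((from < from_end)?partial:ok);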
   }
};
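The byte swap described for UTF-16BE can also be written with shifts instead of `reinterpret_cast`, which makes the result independent of the host's own byte order. This is only a sketch: `store_utf16_unit` and `load_utf16_unit` are hypothetical helpers that could stand in for the two assignments inside do_out and do_in.

```cpp
#include <cstdint>

// Store the low 16 bits of wc at to[0..1] in the requested byte order.
inline void store_utf16_unit(char* to, wchar_t wc, bool big_endian) {
    const std::uint16_t unit = static_cast<std::uint16_t>(wc);
    const char lo = static_cast<char>(unit & 0xFF);
    const char hi = static_cast<char>(unit >> 8);
    to[0] = big_endian ? hi : lo;
    to[1] = big_endian ? lo : hi;
}

// Rebuild a wchar_t from two bytes read in the requested byte order.
inline wchar_t load_utf16_unit(const char* from, bool big_endian) {
    const std::uint16_t b0 = static_cast<unsigned char>(from[0]);
    const std::uint16_t b1 = static_cast<unsigned char>(from[1]);
    const std::uint16_t unit = big_endian
        ? static_cast<std::uint16_t>((b0 << 8) | b1)
        : static_cast<std::uint16_t>((b1 << 8) | b0);
    return static_cast<wchar_t>(unit);
}
```

Inside the facet, `store_utf16_unit(to, *from, false)` would write UTF-16LE and `store_utf16_unit(to, *from, true)` UTF-16BE.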

You should look at the output file in a hex editor such as WinHex so you can see the actual bits and bytes, to verify that the output is actually UTF-16. Post it here and let us know the result. That will tell us whether to blame Firefox or your C++ program.

But it looks to me like your C++ program works and Firefox is not interpreting your UTF-16 correctly. UTF-16 calls for two bytes for every character. But Firefox is printing twice as many characters as it should, so it is probably trying to interpret your string as UTF-8 or ASCII, which generally just have 1 byte per character.

When you say "Firefox with encoding set to UTF16" what do you mean? I'm skeptical that that would work.
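If no hex editor is at hand, the same check can be done from the program itself. The sketch below uses two hypothetical helpers, `read_file_bytes` and `hex_dump`, to print a file's bytes in the same space-separated form as the dump in the question.

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

// Read a whole file as raw bytes (empty string if it cannot be opened).
std::string read_file_bytes(const std::string& path) {
    std::ifstream in(path.c_str(), std::ios::in | std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// Format bytes as upper-case hex pairs separated by single spaces.
std::string hex_dump(const std::string& bytes) {
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        const unsigned char b = static_cast<unsigned char>(bytes[i]);
        if (i != 0) out += ' ';
        out += digits[b >> 4];
        out += digits[b & 0x0F];
    }
    return out;
}
```

Printing `hex_dump(read_file_bytes("test.txt"))` makes it easy to see whether each character takes two bytes (UTF-16) or four (UTF-32).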

License: CC-BY-SA with attribution

Not affiliated with StackOverflow