Buffer size for reading a UTF-8-encoded file using ICU (ICU4C)

https://stackoverflow.com/questions/17511783

02-06-2022

Question

I am trying to read a UTF-8-encoded file using ICU4C on Windows with msvc11. I need to determine the size of the buffer to build a UnicodeString. Since there is no fseek-like function in the ICU4C API I thought I could use an underlying C-file:

#include <unicode/ustdio.h>
#include <stdio.h>
/*...*/
UFILE *in = u_fopen("utfICUfseek.txt", "r", NULL, "UTF-8");
FILE* inFile = u_fgetfile(in);
fseek(inFile,  0, SEEK_END); /* Access violation here */
int size = ftell(inFile);
auto uChArr = new UChar[size];

There are two problems with this code:

It "throws" access violation at the fseek() line for some reason (Unhandled exception at 0x000007FC5451AB00 (ntdll.dll) in test.exe: 0xC0000005: Access violation writing location 0x0000000000000024.)
The size returned by the ftell function will not be the size I want because UTF-8 can use up to 4 bytes for a code point (a u8"tю" string will be of length 3).
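
To make the second point concrete: UnicodeString stores UTF-16 code units (UChar), and a well-formed UTF-8 sequence never needs more UTF-16 code units than it has bytes. So the byte size of the file is a safe upper bound for the UChar buffer, even though it is usually larger than needed. Here is a minimal sketch that computes the exact count; utf16_units_for_utf8 is a hypothetical helper name, not an ICU function, and it assumes the input is well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Counts the UTF-16 code units needed to hold a UTF-8 byte sequence.
// Assumption: the input is well-formed UTF-8 (no validation is done here).
std::size_t utf16_units_for_utf8(const std::string& bytes)
{
    std::size_t units = 0;
    for (unsigned char b : bytes) {
        if ((b & 0xC0) == 0x80) {
            continue;                  // continuation byte (10xxxxxx): adds nothing
        }
        units += (b >= 0xF0) ? 2 : 1;  // a 4-byte sequence needs a surrogate pair
    }
    return units;
}
```

For the u8"tю" example above this gives 2 code units for 3 bytes, so allocating one UChar per byte would over-allocate but never overflow.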

So the questions are:

How do I determine a buffer size for a UnicodeString if I know that the input file is UTF-8-encoded?
Is there a portable way to use iostream/fstream for both reading and writing ICU's UnicodeStrings?

Edit: Here is a possible solution (tested on msvc11 and gcc 4.8.1), based on the first answer and the C++11 standard. A few relevant quotes from ISO/IEC 14882:2011:

"The fundamental storage unit in the C++ memory model is the byte. A byte is at least large enough to contain any member of the basic execution character set (2.3) and the eight-bit code units of the Unicode UTF-8 encoding form..."
"The basic source character set consists of 96 characters..." (so at least 7 bits are needed already)
"The basic execution character set and the basic execution wide-character set shall each contain all the members of the basic source character set..."
"Objects declared as characters (char) shall be large enough to store any member of the implementation’s basic character set."

So, to make this portable across platforms where the implementation-defined size of char is exactly 8 bits (I don't know of any where it isn't), we can read the UTF-8 bytes into chars using an unformatted input operation:

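std::ifstream is;
is.open("utfICUfSeek.txt", std::ios::binary); // binary mode, so tellg() matches the bytes actually read
is.seekg(0, is.end);
std::streamsize strSize = is.tellg(); // size in bytes, not in code points
auto inputCStr = new char[strSize + 1];
inputCStr[strSize] = '\0'; // add null character at the end
is.seekg(0, is.beg);
is.read(inputCStr, strSize);
is.close();
UnicodeString uStr = UnicodeString::fromUTF8(inputCStr);
delete[] inputCStr; // fromUTF8 copies the data, so the temporary buffer can be freed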

What troubles me is that I have to create an additional buffer for chars and only then convert them to the required UnicodeString.

Solution

This is an alternative to using ICU.

Using the standard std::fstream, you can read the whole file, or part of it, into a std::string and then iterate over that with a Unicode-aware iterator: http://code.google.com/p/utf-iter/

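#include <cerrno>
#include <fstream>
#include <iterator>
#include <string>

std::string get_file_contents(const char *filename)
{
    // Binary mode: the byte count from tellg() matches what is read.
    std::ifstream in(filename, std::ios::in | std::ios::binary);
    if (in)
    {
        std::string contents;
        in.seekg(0, std::ios::end);
        contents.reserve(in.tellg()); // pre-size to the file length in bytes
        in.seekg(0, std::ios::beg);
        contents.assign((std::istreambuf_iterator<char>(in)), std::istreambuf_iterator<char>());
        in.close();
        return contents;
    }
    throw errno; // open failed; errno is thrown as an int
}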

Then, in your code:

std::string myString = get_file_contents( "foobar" );
unicode::iterator< std::string, unicode::utf8 /* or utf16/32 */ > iter = myString.begin();

while ( iter != myString.end() )
{
    ...
    ++iter;
}
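
If the utf-iter library is not available, walking the code points of a std::string by hand is not much code. The following is only a sketch: decode_utf8 is a hypothetical helper, not part of utf-iter or ICU. It replaces stray or truncated sequences with U+FFFD, but it does not reject overlong encodings or encoded surrogates, so treat it as an illustration rather than a validator.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Decodes UTF-8 bytes into code points (hypothetical helper for illustration).
// Stray continuation bytes, invalid lead bytes and truncated sequences
// each produce U+FFFD.
std::vector<std::uint32_t> decode_utf8(const std::string& s)
{
    std::vector<std::uint32_t> out;
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::uint32_t cp = 0;
        std::size_t len = 0;
        if (lead < 0x80)                { cp = lead;        len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
        else { out.push_back(0xFFFD); ++i; continue; }  // not a valid lead byte

        if (i + len > s.size()) {       // sequence runs past the end of the data
            out.push_back(0xFFFD);
            break;
        }
        bool ok = true;
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) { ok = false; break; }
            cp = (cp << 6) | (c & 0x3F);  // append the 6 payload bits
        }
        if (ok) { out.push_back(cp); i += len; }
        else    { out.push_back(0xFFFD); ++i; }
    }
    return out;
}
```

Calling decode_utf8(get_file_contents("foobar")) would then give one element per code point, which is roughly what the iterator loop above does lazily.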

Other answers

Well, either you want to read in the whole file at once for some kind of postprocessing, in which case icu::UnicodeString is not really the best container...

#include <iostream>
#include <fstream>
#include <sstream>

int main()
{
    std::ifstream in( "utfICUfSeek.txt" );
    std::stringstream buffer;
    buffer << in.rdbuf();
    in.close();
    // ...
    return 0;
}

...or what you really want is to read into icu::UnicodeString just like into any other string object but went the long way around...

#include <iostream>
#include <fstream>

#include <unicode/unistr.h>
#include <unicode/ustream.h>

int main()
{
    std::ifstream in( "utfICUfSeek.txt" );
    icu::UnicodeString uStr;
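    in >> uStr; // reads one whitespace-delimited token, converted from ICU's default charset (not necessarily UTF-8)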
    // ...
    in.close();
    return 0;
}

...or I am completely missing what your problem really is about. ;)
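
One thing to keep in mind when choosing between the two snippets above: stream extraction with operator>> stops at the first whitespace, while copying rdbuf() takes everything. The sketch below shows the difference with plain std::string; read_token and read_all are hypothetical helper names. As far as I can tell from ustream.h, ICU's operator>> for UnicodeString is whitespace-delimited in the same way, so it would not read a whole multi-word file in one call.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Reads a single token with operator>>; extraction stops at whitespace.
std::string read_token(std::istream& in)
{
    std::string token;
    in >> token;
    return token;
}

// Reads everything left in the stream, whitespace included.
std::string read_all(std::istream& in)
{
    std::ostringstream buffer;
    buffer << in.rdbuf();
    return buffer.str();
}
```

So for "read the whole file, then convert", the rdbuf() route followed by UnicodeString::fromUTF8 is probably the closer fit.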

Licensed under: CC-BY-SA with attribution
