fstream file size in codepoints

Question 1

If I understand you right, you expect that:

`std::basic_fstream<CharT,Traits>::seekg`

(which, by inheritance, is basic_istream<CharT,Traits>::seekg) ought to perform stream positioning in units of the intern_type of whatever codecvt the stream is imbued with.

Template basic_istream is declared:

template< 
    class CharT, 
    class Traits = std::char_traits<CharT>
> class basic_istream;

In the declaration of the member function:

basic_istream & basic_istream<CharT,Traits>::seekg(pos_type pos)

pos_type is Traits::pos_type, which by default is std::char_traits<CharT>::pos_type. It is therefore a type determined in any implementation solely by the template arguments of the basic_istream class, without reference to any codecvt.

A basic_fstream<char>, for instance, remains a basic_fstream<char>, and its pos_type remains basic_fstream<char>::pos_type, regardless of the encoding chosen to read or write it.
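As a compile-time illustration of this point, the sketch below asserts that the position type of the standard file streams is the one supplied by the traits class. The helper name pos_type_is_fixed is hypothetical and exists only so the result can be inspected at run time.

```cpp
#include <fstream>
#include <istream>
#include <string>
#include <type_traits>

// pos_type is taken from the Traits argument, which defaults to
// std::char_traits<CharT>; no codecvt facet appears in its definition.
static_assert(std::is_same<std::fstream::pos_type,
                           std::char_traits<char>::pos_type>::value,
              "fstream positions come from char_traits<char>");
static_assert(std::is_same<std::wfstream::pos_type,
                           std::char_traits<wchar_t>::pos_type>::value,
              "wfstream positions come from char_traits<wchar_t>");

// Hypothetical helper: imbuing a locale is a run-time operation on a stream
// object, so it cannot change the static type that seekg accepts.
inline bool pos_type_is_fixed()
{
    return std::is_same<std::istream::pos_type,
                        std::fstream::pos_type>::value;
}
```

Because both assertions are evaluated by the compiler, the file builds only if the types agree; nothing about an imbued locale can enter into that check.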

The declarations above are respectively as per C++11 Standard § 27.7.1 and § 27.7.2.1. The fact that pos_type is invariant with respect to any imbued codecvt, and hence also the behaviour of seekg(pos_type), are therefore consequences of the Standard.

Equivalent remarks apply for basic_istream& seekg( off_type off, std::ios_base::seekdir dir).

The std::codecvt::intern_type is the type of the elements of the internal sequence to which or from which the specified encoding will translate an external sequence of elements of type extern_type. The intern_type is the element type of the "in-program" sequence and the extern_type is the type of "in-file" sequence. The intern_type has got nothing to do with positioning operations on the file.

If you must find out the size of a file in codepoints, and presuming that the encodings of interest are UTF-8, UTF-16 and UTF-32, then for the first two you have no choice but to read the entire file. They are variable-length encodings: a codepoint occupies 1 to 4 bytes in UTF-8 and 2 or 4 bytes in UTF-16. UTF-32 is a fixed-length 4-byte encoding, so in that case you can compute the number of complete codepoints as the byte length of the file, minus the BOM length if any, divided by 4, provided you discount the possibility of encoding errors except at end-of-file.
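The UTF-32 shortcut can be sketched as follows. The function name utf32_codepoints_from_size and its -1 error convention are assumptions made for illustration; the code does no decoding and, as stated above, ignores encoding errors other than a truncated final unit.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: number of complete UTF-32 codepoints in a file,
// computed from its byte length without decoding the contents.
// Returns -1 if the file cannot be opened.
inline long long utf32_codepoints_from_size(const std::string& path)
{
    // Open in binary mode, positioned at the end, so tellg() is the byte length.
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    if (!in) return -1;
    const long long bytes = static_cast<long long>(in.tellg());

    // Look for a UTF-32 BOM: FF FE 00 00 (little-endian)
    // or 00 00 FE FF (big-endian).
    long long bom = 0;
    if (bytes >= 4) {
        unsigned char b[4] = {0, 0, 0, 0};
        in.seekg(0, std::ios::beg);
        in.read(reinterpret_cast<char*>(b), 4);
        const bool le = b[0] == 0xFF && b[1] == 0xFE && b[2] == 0x00 && b[3] == 0x00;
        const bool be = b[0] == 0x00 && b[1] == 0x00 && b[2] == 0xFE && b[3] == 0xFF;
        if (le || be) bom = 4;
    }

    // Integer division discards a truncated trailing unit.
    return (bytes - bom) / 4;
}
```

Note that the positioning here is in bytes, which is all that seekg and tellg offer on a narrow stream.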

For the variable-length encodings, the simplest way of counting the codepoints is a template function parameterized by an indicator of the presumed encoding. It reads the file in units of char or char16_t as appropriate, first consuming the BOM, if any. It then identifies each unit that is the lead element of a codepoint in the presumed encoding, verifies the presence of the number of subsequent elements that the lead element requires, and increments the codepoint count if they are found.
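A minimal sketch of such a counter is given below. It makes several assumptions that are not in the answer above: the units have already been read into a std::string or std::u16string, UTF-16 units are already in native byte order, and all names (utf8_tag, utf16_tag, sequence_length, is_trail, bom_length, count_codepoints) are hypothetical. It checks only the lead/trail structure, so it is not a full validator; overlong UTF-8 forms, for example, are not rejected.

```cpp
#include <cstddef>
#include <string>

// Hypothetical encoding tags; each names the unit type of its encoding.
struct utf8_tag  { using unit = char; };
struct utf16_tag { using unit = char16_t; };

// Number of units in the sequence introduced by `lead`,
// or 0 if `lead` cannot start a sequence.
inline std::size_t sequence_length(utf8_tag, char lead)
{
    const unsigned char c = static_cast<unsigned char>(lead);
    if (c < 0x80) return 1;            // 0xxxxxxx
    if ((c & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((c & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((c & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;                          // continuation or invalid byte
}
inline std::size_t sequence_length(utf16_tag, char16_t lead)
{
    if (lead >= 0xD800 && lead <= 0xDBFF) return 2;  // high surrogate
    if (lead >= 0xDC00 && lead <= 0xDFFF) return 0;  // stray low surrogate
    return 1;
}

// True if `u` is a valid non-lead element of a sequence.
inline bool is_trail(utf8_tag, char u)
{
    return (static_cast<unsigned char>(u) & 0xC0) == 0x80;  // 10xxxxxx
}
inline bool is_trail(utf16_tag, char16_t u)
{
    return u >= 0xDC00 && u <= 0xDFFF;  // low surrogate
}

// Number of leading units occupied by a BOM, if present.
inline std::size_t bom_length(utf8_tag, const std::string& s)
{
    return s.compare(0, 3, "\xEF\xBB\xBF") == 0 ? 3 : 0;
}
inline std::size_t bom_length(utf16_tag, const std::u16string& s)
{
    return (!s.empty() && s[0] == 0xFEFF) ? 1 : 0;
}

template <class Encoding>
std::size_t count_codepoints(
    const std::basic_string<typename Encoding::unit>& units)
{
    std::size_t count = 0;
    std::size_t i = bom_length(Encoding{}, units);
    while (i < units.size()) {
        const std::size_t len = sequence_length(Encoding{}, units[i]);
        if (len == 0) { ++i; continue; }    // not a lead element: skip it
        if (i + len > units.size()) break;  // sequence truncated at the end
        bool ok = true;
        for (std::size_t k = 1; k < len; ++k)
            if (!is_trail(Encoding{}, units[i + k])) { ok = false; break; }
        if (ok) { ++count; i += len; }      // complete codepoint found
        else    { ++i; }                    // malformed: resume at next unit
    }
    return count;
}
```

A call such as count_codepoints<utf8_tag>(bytes) or count_codepoints<utf16_tag>(units) selects the routine through the tag type. Malformed sequences are skipped rather than reported, which is one possible policy among several.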

Question 2

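The length function of std::char_traits returns the number of CharT characters, which isn't necessarily the number of bytes. So basically what you would need to do is read the file through a wide input stream imbued with a UTF-8 locale into a std::wstring and print its size(). Reading the raw bytes into a std::string would give the byte count instead, and the result below is a codepoint count only where wchar_t is 32 bits wide; with a 16-bit wchar_t it is a count of UTF-16 code units:

#include <fstream>
#include <iostream>
#include <iterator>
#include <locale>
#include <string>

int main()
{
    std::wifstream in;
    // Imbue before opening; the locale name is platform-dependent.
    in.imbue(std::locale("en_US.UTF8"));
    in.open("out.txt");

    std::wstring data; // elements are wchar_t, the intern_type of the codecvt

    data.assign(std::istreambuf_iterator<wchar_t>(in),
                std::istreambuf_iterator<wchar_t>());

    // size() counts wchar_t elements, not bytes
    std::cout << "We have " << data.size() << " codepoints\n";
}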