Is this method for converting Unicode chars truly endian agnostic?

https://stackoverflow.com/questions/23621301

21-07-2023

Question

I'm a little skeptical of my own code (and tests, naturally) and would like someone to verify whether this approach is truly endian agnostic.

Internally, in a cross-platform project, I'm using UTF-32 (std::u32string) as the string type. To make I/O easier across platforms, I convert UTF-32 to UTF-8 before sending any text to a file or over the wire.

I would say this approach is endian agnostic. UTF-8 is a byte-oriented encoding, so the endianness of the machine doesn't affect the byte stream. The 32-bit characters are converted to UTF-8 in the order they appear in the string before being sent to the stream.
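
To make the distinction concrete, here is a minimal sketch of where endianness can and cannot show up. The helper names (host_is_little_endian, memory_bytes, value_byte) are hypothetical and not part of the project above. The in-memory bytes of a char32_t depend on the host, while bytes computed from its value do not.

```cpp
#include <array>
#include <cstdint>
#include <cstring>

// Hypothetical helpers for illustration; none of these names come from the post.

// True when the host stores the least significant byte first.
bool host_is_little_endian() {
    const std::uint32_t probe = 1;
    unsigned char first = 0;
    std::memcpy(&first, &probe, 1);
    return first == 1;
}

// In-memory bytes of one UTF-32 code unit: this layout depends on the host.
std::array<unsigned char, sizeof(char32_t)> memory_bytes(char32_t c) {
    std::array<unsigned char, sizeof(char32_t)> out{};
    std::memcpy(out.data(), &c, sizeof c);
    return out;
}

// Byte number `index`, counted from the least significant end, computed from
// the value alone: the result is the same on every host.
unsigned char value_byte(char32_t c, unsigned index) {
    return static_cast<unsigned char>((c >> (8 * index)) & 0xFF);
}
```

A converter that only ever uses value-based operations like value_byte produces the same UTF-8 output on any machine; only code that copies the raw memory representation, as memory_bytes does, can observe the byte order.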

Here are snippets of code from a JSON string class that show what I'm doing:

/**
*   Provides conversion facilities between UTF-8 and Unicode32 strings.
*/
typedef std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> UnicodeConverter;

/**
* Converts the JSON value to a valid JSON string.
* @param os The UTF-8 stream to write to.
* @returns The UTF-8 stream.
*/
inline std::ostream& VToJson(std::ostream& os) const override { return EscapeJsonString(os << STRING_JSON_QUOTE) << STRING_JSON_QUOTE; }

/**
*   Streams a json string in its escaped form.
*   @param os the UTF-8 stream to write to.
*   @returns the original stream.
*/
std::ostream& JsonString::EscapeJsonString(std::ostream &os) const 
{
    UnicodeConverter conv;
    for (char32_t c : i_Value)
    {
        // Check if character is a special character
        if (c == '"')
            // Output escaped quotation mark (required by JSON)
            os << "\\\"";
        else if (c == '\\')
            // Output escaped rev solidus
            os << "\\\\";
        else if (c == '/')
            // Output escaped solidus
            os << "\\/";
        else if (c == '\b')
            // Output escaped backspace
            os << "\\b";
        else if (c == '\f')
            // Output escaped formfeed
            os << "\\f";
        else if (c == '\n')
            // Output escaped new line
            os << "\\n";
        else if (c == '\r')
            // Output escaped carriage return
            os << "\\r";
        else if (c == '\t')
            // Output escaped tab
            os << "\\t";
        else if (is_unicode_control(c))
        {
            // Output escape
            os << "\\u";

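            // Output hex representation. Cast to an integer type first:
            // streaming a char32_t directly into a narrow stream is ill-formed since C++20.
            std::stringstream str;
            str << std::setfill('0') << std::setw(4) << std::hex << static_cast<unsigned long>(c);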
            os << str.str();
        }
        else
            // Normal character
            os << conv.to_bytes(c);
    }
    return os;
}

Solution

In general, this approach can be implemented in an endian-agnostic way: UTF-32 is only used within a single system, where the endianness is consistent, while every interface to a system that may have a different endianness uses UTF-8. UTF-8 is defined as a byte stream, so it has no endianness.

However, the conversion itself is endian-sensitive and must be implemented so that the host's byte order never leaks into the output (e.g. use arithmetic shifts on the code point value rather than memcpy on its in-memory bytes). It is reasonable to assume that your standard library implementation performs this conversion correctly.
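
As an illustration of the shift-based approach, here is a minimal sketch of a hand-written UTF-32 to UTF-8 encoder. The function name append_utf8 is hypothetical, and this is not how any particular standard library implements it. It only shows that shifts and masks operate on the value, so the output bytes are the same on every host.

```cpp
#include <string>

// Hypothetical helper (not from the original post): appends the UTF-8
// encoding of one code point to `out`. Returns false for values that are
// not Unicode scalar values (surrogates, or anything above U+10FFFF).
bool append_utf8(std::string& out, char32_t cp) {
    if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate range
    if (cp > 0x10FFFF) return false;                 // beyond Unicode

    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return true;
}
```

For example, appending U+20AC (the euro sign) yields the three bytes E2 82 AC regardless of how the host stores the char32_t in memory.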

To clarify why this code is unaffected by endianness, here is what the C++ standard says (22.5/4):

For the facet codecvt_utf8:
- The facet shall convert between UTF-8 multibyte sequences and UCS2 or UCS4 (depending on the size of Elem) within the program.
- Endianness shall not affect how multibyte sequences are read or written.
- The multibyte sequences may be written as either a text or a binary file.

The little_endian member of the codecvt_mode enumeration type is only intended for reading and writing UTF-16 and UTF-32 multibyte sequences.
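
A quick way to check this on a given platform is a round trip through the same converter type the question uses. The sketch below wraps it in two helper functions whose names (to_utf8, from_utf8) are hypothetical. Note that std::wstring_convert and the codecvt header have been deprecated since C++17, so newer compilers may emit warnings.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Same converter type as in the question.
typedef std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> UnicodeConverter;

// Hypothetical helper: encodes a UTF-32 string as UTF-8 bytes.
std::string to_utf8(const std::u32string& text) {
    UnicodeConverter conv;
    return conv.to_bytes(text);
}

// Hypothetical helper: decodes UTF-8 bytes back into a UTF-32 string.
std::u32string from_utf8(const std::string& bytes) {
    UnicodeConverter conv;
    return conv.from_bytes(bytes);
}
```

If the facet behaves as the standard requires, to_utf8 applied to U+20AC gives the bytes E2 82 AC on both little-endian and big-endian hosts, and from_utf8 restores the original code point.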

Licensed under: CC-BY-SA with attribution
