TinyXML parsing multi-byte characters but skipping following [x] chars

https://stackoverflow.com/questions/15436988

24-03-2022

Question

I've got a C++ program which receives some XML from a server and then attempts to parse it in order to populate some combo-boxes, for example:

<?xml version="1.0"?>
<CustomersMachines>
    <Customer name="bob" id="1">
        <Machine name="office1" id="1" />
        <Machine name="officeserver" id="2" />
    </Customer>
</CustomersMachines>

For these values, TinyXML parses fine and the resulting combo-boxes populate as intended. The problem arises when a multi-byte character is placed at (or near, depending on how many bytes) the end of the name attribute.

<Customer name="boß" id="3">

will result in the combo-box being populated with the value boß" id=

Stepping through in the debugger, I see that when a multi-byte character is passed to ReadText(), the following 1-3 single-byte characters in the element are not examined individually but are copied into the value as part of that character, so TinyXML never registers the closing quote and keeps reading until it reaches the next one. The application on the server that sends the XML predominantly uses ISO-8859-1 encoding, whereas TinyXML defaults to UTF-8.

I've tried tweaking TinyXML to default to TIXML_ENCODING_UNKNOWN, which appears to solve the problem but causes a substantial number of issues elsewhere in the program. I've also tried calling utf8_encode on the XML server-side before sending it (but this causes strange characters to display in the combo-boxes where the multi-byte character should be), and forcing the encoding into the XML being sent to the client program, to no avail.

Does anyone have any idea how to prevent multi-byte characters from swallowing the following 1-3 characters in this case?

Solution

The <?xml?> prolog is not specifying an encoding. If the encoding is not available outside of the XML through out-of-band means then the encoding has to be guessed through analysis of the XML's starting bytes, per the rules outlined in Appendix F of the XML spec. In this case, that likely leads to UTF-8 being selected. If the XML is not actually UTF-8 encoded, that would account for the behavior you are seeing.

In ISO-8859-1, ß is encoded as the octet 0xDF, and " is encoded as the octet 0x22.

In UTF-8, 0xDF is the lead byte of a 2-byte sequence, which accounts for the " being skipped. However, 0xDF 0x22 is not a valid UTF-8 2-byte sequence, because the second byte of such a sequence must be a continuation byte in the range 0x80-0xBF, so TinyXML should have failed the parse with an error. If it does not, then that is a bug in TinyXML.
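To make the byte-level argument concrete, here is a minimal sketch of a structural UTF-8 check. It is not TinyXML's code; the function names (utf8SequenceLength, isContinuation, looksLikeUtf8) are made up for illustration, and the check only looks at lead and continuation bytes (it does not reject every overlong or surrogate encoding).

```cpp
#include <cstddef>
#include <string>

// Expected total length of a UTF-8 sequence given its lead byte,
// or 0 if the byte cannot start a sequence.
std::size_t utf8SequenceLength(unsigned char lead) {
    if (lead < 0x80) return 1;                   // 0xxxxxxx: plain ASCII
    if (lead >= 0xC2 && lead <= 0xDF) return 2;  // 110xxxxx: 2-byte sequence
    if (lead >= 0xE0 && lead <= 0xEF) return 3;  // 1110xxxx: 3-byte sequence
    if (lead >= 0xF0 && lead <= 0xF4) return 4;  // 11110xxx: 4-byte sequence
    return 0;  // continuation byte or invalid lead (0x80-0xC1, 0xF5-0xFF)
}

// Continuation bytes have the bit pattern 10xxxxxx (0x80-0xBF).
bool isContinuation(unsigned char b) {
    return (b & 0xC0) == 0x80;
}

// Structural check: every lead byte must be followed by the right
// number of continuation bytes.
bool looksLikeUtf8(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        std::size_t len = utf8SequenceLength(static_cast<unsigned char>(s[i]));
        if (len == 0 || i + len > s.size()) return false;
        for (std::size_t k = 1; k < len; ++k) {
            if (!isContinuation(static_cast<unsigned char>(s[i + k]))) return false;
        }
        i += len;
    }
    return true;
}
```

With this sketch, utf8SequenceLength(0xDF) reports a 2-byte sequence, while isContinuation(0x22) is false, so the ISO-8859-1 bytes for boß" are rejected instead of the quote being consumed. A check like this could be run on the received buffer before parsing to detect that the data is not UTF-8.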

If the XML is actually ISO-8859-1 encoded, the server must provide that info, either in the <?xml?> prolog or out-of-band. If it does not, then that is a bug in the server.
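If the server cannot be fixed, one possible client-side workaround (an assumption on my part, not something tested against this program) is to transcode the received ISO-8859-1 bytes to UTF-8 before handing them to the parser. The helper name latin1ToUtf8 is hypothetical. Every ISO-8859-1 byte maps directly to the Unicode code point of the same value, so the conversion needs at most two output bytes per input byte.

```cpp
#include <string>

// Convert an ISO-8859-1 (Latin-1) byte string to UTF-8.
// Assumes the input really is ISO-8859-1; bytes below 0x80 are copied
// as-is, bytes 0x80-0xFF become a 2-byte UTF-8 sequence.
std::string latin1ToUtf8(const std::string& in) {
    std::string out;
    out.reserve(in.size() * 2);  // worst case: every byte expands to two
    for (char ch : in) {
        unsigned char c = static_cast<unsigned char>(ch);
        if (c < 0x80) {
            out.push_back(static_cast<char>(c));
        } else {
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));    // lead byte
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));  // continuation
        }
    }
    return out;
}
```

For example, the byte 0xDF (ß) becomes 0xC3 0x9F, so the closing quote that follows it is seen as a separate character again. Note that the parsed values are then UTF-8; if the combo-boxes expect a different encoding, the text may need converting back before display, which may be what produced the strange characters when utf8_encode was tried server-side.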

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow