Question

From what I understand, given a sequence of bytes without any further information, it is generally impossible to determine which encoding it uses. Of course we can guess (e.g. with Perl's Encode::Guess and similar tools), but sometimes that is just not feasible.

In my case, I've got the byte array \xe2\x80\xa1, and I can guess from the context (as a human) that it should correspond either to an à character (\xc3\xa0 in UTF-8) or to an á (\xc3\xa1 in UTF-8). It comes from an XML document which declares iso8859-1 in its header and is produced by a third-party tool, which is clearly broken, but I have to deal with it. As you can guess, the decoding fails, and I had no luck with Encode::Guess.

How would you approach this kind of problem? I know there's no silver bullet, but is there a tool that stands out from the others?

Solution

XML encoding can be tricky because some XML generators hard-code a generic encoding such as ISO-8859-1 even when the document actually contains, say, UTF-8. Part of the reason is that most text is ASCII, and valid (7-bit) ASCII is also valid in most other encodings. Developers may not understand character encodings, or may not care ("works with my test data!").

One general approach is to attempt to decode the XML using the declared encoding. As with HTML, the declaration is located near the top of the document, and no non-ASCII characters should appear between the start of the document and the declaration (except for a byte order mark, which itself implicitly identifies an encoding).
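As a minimal sketch in Python (the helper name and regex are my own, and a real parser would also check for a byte order mark), extracting the declared encoding can look like this:

```python
import re

def declared_encoding(raw: bytes) -> str:
    """Return the encoding named in the XML declaration, or UTF-8 by default.

    The XML prolog is ASCII-compatible in every common encoding except
    UTF-16/UTF-32 (which announce themselves via a byte order mark), so a
    bytes-level regex is a reasonable first pass.
    """
    m = re.match(rb'<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']', raw)
    return m.group(1).decode("ascii") if m else "utf-8"

# The (mis)declared document from the question:
raw = b'<?xml version="1.0" encoding="iso8859-1"?><doc>\xe2\x80\xa1</doc>'
enc = declared_encoding(raw)  # "iso8859-1", exactly as the broken tool wrote it
```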

If that decoding fails, try one or more predefined encodings. Good candidates are UTF-8, UTF-16, and ISO-8859-1. Plain old (7-bit) ASCII decodes fine as both UTF-8 and ISO-8859-1, so it is covered implicitly.
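A sketch of that fallback strategy in Python (the function name and candidate order are assumptions, not a standard API):

```python
def decode_with_fallbacks(raw: bytes, declared=None):
    """Try the declared encoding first, then common fallbacks.

    Strict decoding (Python's default) makes a wrong guess raise instead of
    silently producing mojibake. Note that ISO-8859-1 maps every possible
    byte value, so it never fails and must come last as a catch-all.
    """
    candidates = ([declared] if declared else []) + ["utf-8", "utf-16", "iso-8859-1"]
    for enc in candidates:
        try:
            return raw.decode(enc), enc
        except (UnicodeDecodeError, LookupError):
            continue
    raise ValueError("no candidate encoding matched")

# The bytes from the question happen to be valid UTF-8 (for U+2021):
text, enc = decode_with_fallbacks(b"\xe2\x80\xa1")
```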

Note that, depending on the language and XML implementation, you may or may not be able to "plow through" errors: it is best to fail fast on errors so you know to try another encoding.
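In Python, for instance, strict decoding (the default) fails fast, while errors="replace" plows through and silently loses data; the choice of ASCII here is just an illustration:

```python
raw = b"\xe2\x80\xa1"  # the bytes from the question

# Fail fast: a wrong guess raises immediately, telling us to try another encoding.
try:
    raw.decode("ascii")
except UnicodeDecodeError:
    pass  # good: we learned the data is not ASCII

# "Plowing through": each undecodable byte becomes U+FFFD and the
# original information is gone for good.
mangled = raw.decode("ascii", errors="replace")  # three replacement characters
```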

Licensed under: CC-BY-SA with attribution