Question

My program receives UTF-8 encoded strings from a data source. I need to tamper with these strings, then output them as part of an XML structure. When I serialize the whole XML document, the output is double encoded and thus broken. When I serialize only the root element, the output is fine but, of course, lacks the XML header.

Here's a piece of code trying to visualize the problem:

use strict; use diagnostics;    use feature 'unicode_strings';
use utf8;   use v5.14;      use encoding::warnings;
binmode(STDOUT, ":encoding(UTF-8)");    use open qw( :encoding(UTF-8) :std );
use XML::LibXML;

# Simulate actual data source with a UTF-8 encoded file containing '¿Üßıçñíïì'
open( IN, "<", "./input" ); my $string = <IN>; close( IN ); chomp( $string );
$string = "Value of '" . $string . "' has no meaning";

# create example XML document as <response><result>$string</result></response>
my $xml = XML::LibXML::Document->new( "1.0", "UTF-8" );
my $rsp = $xml->createElement( "response" );    $xml->setDocumentElement( $rsp );
$rsp->appendTextChild( "result", $string );

# Try to forward the resulting XML to a receiver. Using STDOUT here, but files/sockets etc. yield the same results
# This will not warn and be encoded correctly but lack the XML header
print( "Just the root document looks good: '" . $xml->documentElement->serialize() . "'\n" );
# This will include the header but wide chars are mangled
print( $xml->serialize() );
# This will even issue a warning from encoding::warnings
print( "The full document looks mangled: '" . $xml->serialize() . "'\n" );

Spoiler 1: Good case:

<response><result>Value of '¿Üßıçñíïì' has no meaning</result></response>

Spoiler 2: Bad case:

<?xml version="1.0" encoding="UTF-8"?><response><result>Value of '¿ÃÃıçñíïì' has no meaning</result></response>

The root element and its contents are already UTF-8 encoded. XML::LibXML accepts the input, is able to work on it, and outputs it again as valid UTF-8. As soon as I try to serialize the whole XML document, the wide characters inside get mangled. In a hex dump, it looks as if the already UTF-8 encoded string gets passed through a UTF-8 encoder again. I've searched, tried and read a lot, from Perl's own Unicode tutorial all the way through tchrist's great answer to the question "Why does modern Perl avoid UTF-8 by default?". I don't think this is a general Unicode problem, though, but rather a specific issue between me and XML::LibXML.

What do I need to do to be able to output a full XML document including the header so that its contents remain correctly encoded? Is there a flag/property/switch to set?

(I'll gladly accept links to the corresponding part(s) of TFM that I should have R for as long as they are actually helpful ;)


Solution

ikegami is correct, but he didn't really explain what's wrong. To quote the docs for XML::LibXML::Document:

IMPORTANT: unlike toString for other nodes, on document nodes this function returns the XML as a byte string in the original encoding of the document (see the actualEncoding() method)!

(serialize is just an alias for toString)

When you print a byte string to a filehandle that has an :encoding layer, Perl treats each byte as a character with that code point (that is, as if the string were ISO-8859-1) and encodes it again. Since your string already contains UTF-8 bytes, it gets double encoded.
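
The double encoding can be reproduced without XML::LibXML. The following is a minimal sketch using only the core Encode module; the helper hex_dump is a hypothetical name introduced for illustration, and the second encode() call stands in for what the :encoding(UTF-8) layer does to a byte string.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: render each byte of a string as two hex digits.
sub hex_dump {
    my ($str) = @_;
    return join ' ', map { sprintf '%02X', ord } split //, $str;
}

# One character: U+00DC (LATIN CAPITAL LETTER U WITH DIAERESIS).
my $chars = "\x{DC}";

# Roughly what a document-level toString() returns: UTF-8 bytes.
my $bytes = encode('UTF-8', $chars);

# Roughly what an :encoding(UTF-8) layer does to that byte string:
# each byte is taken as a code point (U+00C3, U+009C) and encoded again.
my $double = encode('UTF-8', $bytes);

print "encoded once:  ", hex_dump($bytes),  "\n";   # C3 9C
print "encoded twice: ", hex_dump($double), "\n";   # C3 83 C2 9C
```

The four bytes in the second line are the same pattern the asker sees in the hex dump of the mangled output.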

As ikegami said, use binmode(STDOUT) to remove the encoding layer from STDOUT. You could also decode the result of serialize back into characters before printing it, but that assumes the document is using the same encoding you have set on your output filehandle. (Otherwise, you'll emit an XML document whose actual encoding doesn't match what its header claims.) If you're printing to a file instead of STDOUT, open it with '>:raw' to avoid double encoding.
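
Both approaches can be compared side by side. This is only a sketch: it writes to in-memory filehandles (standing in for STDOUT or a real file) so the two results can be compared, and the variable names are made up for the example. It assumes the script itself is saved as UTF-8, since it uses a literal string under use utf8.

```perl
use strict;
use warnings;
use utf8;                       # the source file is assumed to be UTF-8
use Encode qw(decode);
use XML::LibXML;

# Build <response><result>...</result></response> as in the question.
my $xml = XML::LibXML::Document->new('1.0', 'UTF-8');
my $rsp = $xml->createElement('response');
$xml->setDocumentElement($rsp);
$rsp->appendTextChild('result', "Value of '¿Üßıçñíïì' has no meaning");

# On a document node, toString() returns bytes in the document's encoding.
my $bytes = $xml->toString();

# Approach 1: a raw handle passes the bytes through untouched.
open(my $raw, '>:raw', \my $raw_out) or die "Cannot open raw handle: $!";
print {$raw} $bytes;
close($raw);

# Approach 2: keep the encoding layer, but decode to characters first.
# This only works because the layer matches the document's encoding.
open(my $enc, '>:encoding(UTF-8)', \my $enc_out)
    or die "Cannot open encoded handle: $!";
print {$enc} decode($xml->actualEncoding(), $bytes);
close($enc);

print $raw_out eq $enc_out ? "identical output\n" : "output differs\n";
```

For real output, replace the in-memory handles with binmode(STDOUT) or with a file opened using '>:raw'.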

OTHER TIPS

Since an XML document declares its own encoding and can be parsed without any external information, it is a binary file rather than a text file.

You're telling Perl to encode anything sent to STDOUT[1], but then you proceed to output an XML document to it. You can't apply a character encoding to a binary file as it corrupts it.

Replace

binmode(STDOUT, ":encoding(UTF-8)");

with

binmode(STDOUT);

Note: This assumes the rest of the text you are outputting is just temporary debugging information. The output doesn't otherwise make sense.
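
If the debugging text is needed alongside the document, one possible arrangement (an assumption for illustration, not something the question asks for) is to send the text to STDERR with an encoding layer and keep STDOUT raw for the XML bytes:

```perl
use strict;
use warnings;
use utf8;                       # the source file is assumed to be UTF-8
use XML::LibXML;

binmode(STDOUT);                        # raw: XML bytes pass through untouched
binmode(STDERR, ':encoding(UTF-8)');    # text: characters are encoded on output

my $xml = XML::LibXML::Document->new('1.0', 'UTF-8');
my $rsp = $xml->createElement('response');
$xml->setDocumentElement($rsp);
$rsp->appendTextChild('result', "Value of '¿Üßıçñíïì' has no meaning");

# toString() on an element returns characters, so it goes to the text handle.
my $root_text = $xml->documentElement->toString();
print STDERR "Root element: '$root_text'\n";

# toString() on the document returns bytes, so it goes to the raw handle.
my $doc_bytes = $xml->toString();
print STDOUT $doc_bytes;
```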


  1. In fact, you do this twice! Once using use open qw( :encoding(UTF-8) :std );, and then a second time using binmode(STDOUT, ":encoding(UTF-8)");.
Licensed under: CC-BY-SA with attribution