Question

Ages ago, I found some Perl online that neatly formats valid XML (adding tabs and newlines) when the whole document is on a single line. The code is below.

It uses XML::Twig to do that. It creates the XML::Twig object without keep_encoding ($twig = XML::Twig->new()), but if I give it a UTF-8 encoded XML file containing a non-ASCII character, it produces a file which is not valid UTF-8 according to the isutf8 command on Ubuntu. Opening the files in xxd, I can see the character goes from two bytes to one.

If I use my $twig = XML::Twig->new(keep_encoding => 1); instead, the same input produces valid UTF-8 and both bytes are preserved.

According to the Perldoc for keep_encoding:

This is a (slightly?) evil option: if the XML document is not UTF-8 encoded and you want to keep it that way, then setting keep_encoding will use the Expat original_string method for character, thus keeping the original encoding, as well as the original entities in the strings.

Why is a non-UTF-8 doc being produced without that option and why does setting it cause the UTF-8-ness to be preserved?

The non-ASCII character is a non-breaking space (c2 a0) by the way.

use strict;
use warnings;
use XML::Twig;

# Slurp all input (files named on the command line, or STDIN) into one string.
my $sXML = join "", (<>);

# Available pretty-print styles; index 3 selects 'indented'.
my $params        = [qw(none nsgmls nice indented record record_c)];
my $sPrettyFormat = $params->[3] || 'none';

my $twig = XML::Twig->new();
$twig->set_indent(" " x 4);
$twig->parse($sXML);
$twig->set_pretty_print($sPrettyFormat);
$sXML = $twig->sprint;
print $sXML;

Solution

It's hard to test without your data, but I would guess that Perl is printing the output as ISO-8859-1, since it has no information about the encoding of the string (it gets it "raw" from XML::Parser). Try binmode STDOUT, ':utf8'; before printing.
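To see the general Perl behaviour this guess relies on, here is a minimal sketch that does not involve XML::Twig at all. It writes a single non-breaking space (U+00A0) to an in-memory handle, once with no encoding layer and once with a UTF-8 layer, and shows the bytes that come out. The helper name bytes_written is made up for this illustration, and the sketch uses ':encoding(UTF-8)', the stricter relative of ':utf8'. It illustrates the mechanism only; it is not a confirmed diagnosis of the script in the question.

```perl
use strict;
use warnings;

# Hypothetical helper: write $text to an in-memory handle opened with the
# given PerlIO layer, then return the bytes that were written, as hex.
sub bytes_written {
    my ($text, $layer) = @_;
    my $buffer = '';
    open my $fh, ">$layer", \$buffer
        or die "Cannot open in-memory handle: $!";
    print {$fh} $text;
    close $fh;
    return unpack 'H*', $buffer;
}

my $nbsp = "\x{A0}";    # one character: U+00A0 NO-BREAK SPACE

# With no layer, characters below U+0100 are written as single Latin-1 bytes.
print "no layer:    ", bytes_written($nbsp, ''), "\n";                   # a0

# With a UTF-8 layer, the same character is encoded as two bytes.
print "UTF-8 layer: ", bytes_written($nbsp, ':encoding(UTF-8)'), "\n";   # c2a0
```

The single byte a0 in the first case matches the two-bytes-to-one change seen in xxd. For the script in the question, the equivalent change would be a binmode STDOUT, ':encoding(UTF-8)'; line before the final print.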

Also, it may not be a great idea to read the file first and then pass a string to the parser. Using parsefile (on the file name) is safer and potentially avoids encoding problems.
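A minimal sketch of the parsefile approach, assuming the file name is passed as the first command-line argument. The sub name pretty_xml_from_file is made up for this example; pretty_print, set_indent, parsefile and sprint are existing XML::Twig calls. The UTF-8 output layer is set explicitly so that the result does not depend on Perl's default handling of STDOUT.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper: parse the file at $path and return the indented
# document as a string.
sub pretty_xml_from_file {
    my ($path) = @_;
    my $twig = XML::Twig->new( pretty_print => 'indented' );
    $twig->set_indent( ' ' x 4 );
    $twig->parsefile($path);    # the parser reads the file itself
    return $twig->sprint;
}

if (@ARGV) {
    # Encode the output as UTF-8 when it is written to STDOUT.
    binmode STDOUT, ':encoding(UTF-8)';
    print pretty_xml_from_file( $ARGV[0] );
}
```

Saved under a name of your choosing, it could be run as perl script.pl input.xml > output.xml, after which isutf8 can be used to check the result.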

Licensed under: CC-BY-SA with attribution