XML::Writer doesn't support anything but US-ASCII and UTF-8 (as mentioned in the documentation of its ENCODING
constructor argument). Creating a UCS-2be XML document using XML::Writer is tricky, but not impossible.
use XML::Writer qw( );
# XML::Writer doesn't encode for you, so we need to use :encoding.
# The :raw avoids a problem with CRLF conversion on Windows.
open(my $fh, '>:raw:encoding(UCS-2be)', $qfn)
    or die("Can't create \"$qfn\": $!\n");
# This prints the BOM. It's optional, but it's useful when using an
# encoding that's not a superset of US-ASCII (such as UCS-2be).
print($fh "\x{FEFF}");
my $writer = XML::Writer->new(
    OUTPUT   => $fh,
    ENCODING => 'US-ASCII',  # Use entities for > U+007F
);
$writer->xmlDecl('UCS-2be');
$writer->startTag('root');
$writer->characters("\x{00041}");
$writer->characters("\x{000C9}");
$writer->characters("\x{10000}");
$writer->endTag();
$writer->end();
Downside: All characters above U+007F will be present as XML entities (numeric character references). In the above example,
- U+00041 will be present as "A" (00 41). Good.
- U+000C9 will be present as "&#xC9;" (00 26 00 23 00 78 00 43 00 39 00 3B). Suboptimal, but ok.
- U+10000 will be present as "&#x10000;" (00 26 00 23 00 78 00 31 00 30 00 30 00 30 00 30 00 3B). Good, XML entities are needed to store U+10000 with UCS-2be.
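The byte sequences quoted above can be reproduced with Encode alone. This is a minimal sketch; `hex_ucs2be` is a hypothetical helper name that does not appear in the original code, and it assumes the `UCS-2BE` encoding that ships with Encode.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Hypothetical helper: encode a string as UCS-2BE and return its bytes
# as space-separated, uppercase hex pairs.
sub hex_ucs2be {
    my ($str) = @_;
    my $bytes = encode('UCS-2BE', $str);
    return join(' ', map { sprintf('%02X', ord($_)) } split(//, $bytes));
}

print hex_ucs2be('A'), "\n";          # the literal character
print hex_ucs2be('&#xC9;'), "\n";     # the entity written for U+00C9
print hex_ucs2be('&#x10000;'), "\n";  # the entity written for U+10000
```

Each ASCII character of an entity costs two bytes, which is why the entity form of U+00C9 takes twelve bytes instead of two.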
You can avoid the downside mentioned above if and only if you can guarantee that no character above U+FFFF will be provided to the writer.
use XML::Writer qw( );
# XML::Writer doesn't encode for you, so we need to use :encoding.
# The :raw avoids a problem with CRLF conversion on Windows.
open(my $fh, '>:raw:encoding(UCS-2be)', $qfn)
    or die("Can't create \"$qfn\": $!\n");
# This prints the BOM. It's optional, but it's useful when using an
# encoding that's not a superset of US-ASCII (such as UCS-2be).
print($fh "\x{FEFF}");
my $writer = XML::Writer->new(
    OUTPUT   => $fh,
    ENCODING => 'UTF-8',  # Don't use entities.
);
$writer->xmlDecl('UCS-2be');
$writer->startTag('root');
$writer->characters("\x{00041}");
$writer->characters("\x{000C9}");
#$writer->characters("\x{10000}"); # This causes a fatal error
$writer->endTag();
$writer->end();
- U+00041 will be present as "A" (00 41). Good.
- U+000C9 will be present as "É" (00 C9). Good.
- U+10000 causes a fatal error.
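The guarantee this variant relies on can be checked before any text reaches the writer. The following is a rough sketch only; `fits_in_ucs2` is a hypothetical helper name, not something XML::Writer or Encode provides.

```perl
use strict;
use warnings;

# Hypothetical helper: true if every character of $str is at or below
# U+FFFF, i.e. none of them would need an entity in UCS-2be output.
sub fits_in_ucs2 {
    my ($str) = @_;
    return $str !~ /[^\x{0000}-\x{FFFF}]/;
}

for my $text ("\x{00041}", "\x{000C9}", "\x{10000}") {
    printf("U+%05X: %s\n", ord($text),
        fits_in_ucs2($text) ? 'safe to write' : 'needs an entity');
}
```

Text that fails the check has to be rejected or handled some other way before it is passed to `characters`.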
And here's how you can do it without any of the downsides:
use Encode qw( decode encode );
use XML::Writer qw( );
my $xml;
{
    # XML::Writer doesn't encode for you, so we need to use :encoding.
    open(my $fh, '>:encoding(UTF-8)', \$xml)
        or die("Can't create in-memory file: $!\n");
    # This prints the BOM. It's optional, but it's useful when using an
    # encoding that's not a superset of US-ASCII (such as UCS-2be).
    print($fh "\x{FEFF}");
    my $writer = XML::Writer->new(
        OUTPUT   => $fh,
        ENCODING => 'UTF-8',  # Don't use entities.
    );
    $writer->xmlDecl('UCS-2be');
    $writer->startTag('root');
    $writer->characters("\x{00041}");
    $writer->characters("\x{000C9}");
    $writer->characters("\x{10000}");
    $writer->endTag();
    $writer->end();
    close($fh);
}
# Fix encoding.
$xml = decode('UTF-8', $xml);
$xml =~ s/([^\x{0000}-\x{FFFF}])/ sprintf('&#x%X;', ord($1)) /eg;
$xml = encode('UCS-2be', $xml);
open(my $fh, '>:raw', $qfn)
    or die("Can't create \"$qfn\": $!\n");
print($fh $xml);
- U+00041 will be present as "A" (00 41). Good.
- U+000C9 will be present as "É" (00 C9). Good.
- U+10000 will be present as "&#x10000;" (00 26 00 23 00 78 00 31 00 30 00 30 00 30 00 30 00 3B). Good, XML entities are needed to store U+10000 with UCS-2be.
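The "Fix encoding" step can be tried in isolation on a short string. This is a sketch under the same assumptions as the code above; `to_ucs2be_with_entities` is a hypothetical name introduced here for illustration.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Hypothetical helper: replace each character above U+FFFF with a
# hexadecimal character reference, then encode the result as UCS-2BE.
sub to_ucs2be_with_entities {
    my ($str) = @_;
    $str =~ s/([^\x{0000}-\x{FFFF}])/ sprintf('&#x%X;', ord($1)) /eg;
    return encode('UCS-2BE', $str);
}

my $bytes = to_ucs2be_with_entities("\x{000C9}\x{10000}");
print(uc(unpack('H*', $bytes)), "\n");
```

U+00C9 passes through as two bytes, while U+10000 becomes the nine-character entity and therefore eighteen bytes.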