Question

I'm writing a Perl script that creates an XML file, "settings.xml", using XML::Writer. I'd like the file to be encoded in UCS-2 big endian, but I'm unsure how.

I've tried things like open(my $output, "> :encoding(UCS-2BE)", "settings.xml");, but all that does is turn the output into a big mess (e.g. either http://i.imgur.com/p9cruCf.png or a series of Chinese characters), while the file's encoding is still reported as ANSI.

Any idea how to fix this, or alternatively, how to convert a file into UCS-2?

I'm a beginner at Perl, sorry if some of this doesn't make sense.

EDIT: For anyone else encountering this problem, please see the answers below. They provide a thorough explanation of how to fix it.

Solution

XML::Writer doesn't support anything but US-ASCII and UTF-8 (as mentioned in the documentation of its ENCODING constructor argument). Creating a UCS-2be XML document using XML::Writer is tricky, but not impossible.

use strict;
use warnings;
use XML::Writer qw( );

my $qfn = 'settings.xml';   # Output file name, as in the question.

# XML::Writer doesn't encode for you, so we need to use :encoding.
# The :raw avoids a problem with CRLF conversion on Windows.
open(my $fh, '>:raw:encoding(UCS-2be)', $qfn)
   or die("Can't create \"$qfn\": $!\n");

# This prints the BOM. It's optional, but it's useful when using an
# encoding that's not a superset of US-ASCII (such as UCS-2be).
print($fh "\x{FEFF}");

my $writer = XML::Writer->new(
   OUTPUT   => $fh,
   ENCODING => 'US-ASCII',   # Use entities for > U+007F
);
$writer->xmlDecl('UCS-2be');
$writer->startTag('root');
$writer->characters("\x{00041}");
$writer->characters("\x{000C9}");
$writer->characters("\x{10000}");
$writer->endTag();
$writer->end();

Downside: All characters above U+007F will be present as XML entities. In the above example,

  • U+00041 will be present as "A" (00 41). Good.
  • U+000C9 will be present as "&#xC9;" (00 26 00 23 00 78 00 43 00 39 00 3B). Suboptimal, but ok.
  • U+10000 will be present as "&#x10000;" (00 26 00 23 00 78 00 31 00 30 00 30 00 30 00 30 00 3B). Good, XML entities are needed to store U+10000 with UCS-2be.
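
The byte sequences listed above can be checked without XML::Writer. The following is a minimal sketch that uses only the core Encode module; the helper name ucs2be_hex is made up for illustration and is not part of any library.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Hypothetical helper: returns the UCS-2be bytes of a string
# as space-separated uppercase hex pairs.
sub ucs2be_hex {
   my ($text) = @_;
   return join(' ', map { sprintf('%02X', $_) } unpack('C*', encode('UCS-2BE', $text)));
}

print(ucs2be_hex('A'), "\n");        # 00 41
print(ucs2be_hex('&#xC9;'), "\n");   # 00 26 00 23 00 78 00 43 00 39 00 3B
print(ucs2be_hex("\x{C9}"), "\n");   # 00 C9
```

The second and third lines show the cost of the entity form: six characters (twelve bytes) instead of one character (two bytes) for U+00C9.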

You can avoid the downside mentioned above if and only if you can guarantee that no character above U+FFFF will be provided to the writer.

use strict;
use warnings;
use XML::Writer qw( );

my $qfn = 'settings.xml';   # Output file name, as in the question.

# XML::Writer doesn't encode for you, so we need to use :encoding.
# The :raw avoids a problem with CRLF conversion on Windows.
open(my $fh, '>:raw:encoding(UCS-2be)', $qfn)
   or die("Can't create \"$qfn\": $!\n");

# This prints the BOM. It's optional, but it's useful when using an
# encoding that's not a superset of US-ASCII (such as UCS-2be).
print($fh "\x{FEFF}");

my $writer = XML::Writer->new(
   OUTPUT   => $fh,
   ENCODING => 'UTF-8',   # Don't use entities.
);
$writer->xmlDecl('UCS-2be');
$writer->startTag('root');
$writer->characters("\x{00041}");
$writer->characters("\x{000C9}");
#$writer->characters("\x{10000}");  # This causes a fatal error
$writer->endTag();
$writer->end();

  • U+00041 will be present as "A" (00 41). Good.
  • U+000C9 will be present as "É" (00 C9). Good.
  • U+10000 causes a fatal error.

And here's how you can do it without any of the downsides:

use strict;
use warnings;
use Encode      qw( decode encode );
use XML::Writer qw( );

my $qfn = 'settings.xml';   # Output file name, as in the question.

my $xml;
{
   # XML::Writer doesn't encode for you, so we need to use :encoding.
   open(my $fh, '>:encoding(UTF-8)', \$xml)
      or die("Can't open in-memory file: $!\n");

   # This prints the BOM. It's optional, but it's useful when using an
   # encoding that's not a superset of US-ASCII (such as UCS-2be).
   print($fh "\x{FEFF}");

   my $writer = XML::Writer->new(
      OUTPUT   => $fh,
      ENCODING => 'UTF-8',   # Don't use entities.
   );
   $writer->xmlDecl('UCS-2be');
   $writer->startTag('root');
   $writer->characters("\x{00041}");
   $writer->characters("\x{000C9}");
   $writer->characters("\x{10000}");
   $writer->endTag();
   $writer->end();
   close($fh);
}

# Fix encoding.
$xml = decode('UTF-8', $xml);
$xml =~ s/([^\x{0000}-\x{FFFF}])/ sprintf('&#x%X;', ord($1)) /eg;
$xml = encode('UCS-2be', $xml);

open(my $fh, '>:raw', $qfn)
   or die("Can't create \"$qfn\": $!\n");

print($fh $xml);
close($fh)
   or die("Can't write \"$qfn\": $!\n");

  • U+00041 will be present as "A" (00 41). Good.
  • U+000C9 will be present as "É" (00 C9). Good.
  • U+10000 will be present as "&#x10000;" (00 26 00 23 00 78 00 31 00 30 00 30 00 30 00 30 00 3B). Good, XML entities are needed to store U+10000 with UCS-2be.
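
The "Fix encoding" step can also be tried on its own. Here is a small sketch, assuming the text is already a decoded character string; the helper name escape_non_bmp is hypothetical. It applies the same substitution as above and then encodes with Encode::FB_CROAK, so that a character the encoding cannot represent would raise an error instead of being silently replaced.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Hypothetical helper: replaces every character above U+FFFF
# with a hexadecimal character reference.
sub escape_non_bmp {
   my ($text) = @_;
   $text =~ s/([^\x{0000}-\x{FFFF}])/ sprintf('&#x%X;', ord($1)) /eg;
   return $text;
}

my $escaped = escape_non_bmp("A\x{C9}\x{10000}");

# LEAVE_SRC keeps encode() from modifying $escaped in place.
my $bytes = encode('UCS-2BE', $escaped, Encode::FB_CROAK | Encode::LEAVE_SRC);

print(length($bytes), " bytes\n");   # 22 bytes
```

The 22 bytes are two for "A", two for "É", and eighteen for the nine characters of "&#x10000;".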

OTHER TIPS

You don't describe what goes wrong, but you may be running into a bug some perl versions had on Windows with bad interaction between the encoding and crlf layers. If so, this should work:

open(my $output, "> :raw:perlio:encoding(UCS-2BE):crlf:utf8", "settings.xml");

(See http://www.perlmonks.org/?node_id=608532 for an explanation.)

If not, please provide more information than "all that does is make the file output a big mess". A short script demonstrating the problem would be helpful.
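
When putting together such a script, it may help to print which layers actually ended up on the handle. The sketch below uses PerlIO::get_layers, which is built into Perl, and a scratch file from File::Temp (an assumption made here so that settings.xml is not touched). The exact list differs between platforms and Perl versions, so no expected output is shown.

```perl
use strict;
use warnings;
use File::Temp qw( tempfile );

# Scratch file used only for this sketch.
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
close($tmp_fh);

open(my $fh, '>:raw:encoding(UCS-2BE)', $tmp_name)
   or die("Can't create \"$tmp_name\": $!\n");

# The layers are returned in the order open() or binmode() would use them.
my @layers = PerlIO::get_layers($fh);
print(join(' ', @layers), "\n");

close($fh);
```

If a crlf layer shows up below the encoding layer on Windows, that would be consistent with the problem described above.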

Licensed under: CC-BY-SA with attribution