Question

I have a file with the following content, in which some characters are UTF-8 hex encoded in the string literal:

<root>
<element type=\"1\">\"Hello W\xC3\x96rld\"</element>
</root>

I want to read the file, decode the UTF-8 hex encoded characters into the actual Unicode characters they represent, and then write the result to a new file. Given the above content, the new file should look like the following when you open it in a text editor with UTF-8 encoding:

<root>
<element type=\"1\">\"Hello WÖrld\"</element>
</root>

Notice the double quotes are still escaped and the UTF-8 hex encoded \xC3\x96 has now become Ö (U+00D6 LATIN CAPITAL LETTER O WITH DIAERESIS).

I have got code that is partially working, as follows:

#! /usr/bin/perl -w

use strict;
use Encode::Escape;

while (<>)
{
    # STDOUT is redirected to a new file.
    print decode 'unicode-escape', $_;
}

The problem, however, is that all the other escape sequences, such as \", are decoded as well by decode 'unicode-escape', $_. So in the end I get the following:

<root>
<element type="1">"Hello WÖrld"</element>
</root>

I have tried reading the file in UTF-8 encoding and/or using Unicode::Escape::unescape, for example:

open(my $UNICODESFILE, "<:encoding(UTF-8)", shift(@ARGV));
Unicode::Escape::unescape($line);

but neither of them decodes the \xhh escape sequences.

Basically, all I want is the behavior of decode 'unicode-escape', $_, except that it should decode only the \xhh escape sequences and ignore all other escape sequences.

Is this possible? Is using decode 'unicode-escape', $_ appropriate for this case? Any other way? Thanks!


Solution

Match only the runs of consecutive \xNN escapes and pass just those to decode, so that other escapes such as \" are never touched:

s{((?:\\x[0-9A-Fa-f]{2})+)}{decode 'unicode-escape', $1}ge
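
For comparison, here is a minimal sketch of the same idea using only core Perl, in case Encode::Escape is not available. The helper name decode_hex_escapes and the sample line are made up for illustration. The sketch assumes the input is handled as raw bytes with no encoding layer on the filehandles, so that the bytes produced from \xC3\x96 are written out unchanged and show up as Ö in a UTF-8 editor.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of any module): replace each \xHH escape
# with the single byte it names. Other escapes such as \" do not match
# the pattern and therefore pass through unchanged.
sub decode_hex_escapes {
    my ($line) = @_;
    $line =~ s/\\x([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $line;
}

# Sample line from the question; single quotes keep the backslashes literal.
my $sample = '<element type=\"1\">\"Hello W\xC3\x96rld\"</element>';

# STDOUT has no encoding layer here, so the two bytes 0xC3 0x96 are
# written as-is, which is the UTF-8 encoding of U+00D6.
print decode_hex_escapes($sample), "\n";
```

To process a whole file, the same call could replace the decode line inside the while (<>) loop from the question. Note that this sketch does not treat an escaped backslash specially, so a literal \\x41 in the input would also be rewritten.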
Licensed under: CC-BY-SA with attribution