Perl regex replace with UTF-8 characters

https://stackoverflow.com/questions/21092427

27-09-2022

Question

I'm struggling with a function I'm trying to write in Perl. It should filter a string down to specific characters: I allow A-Z, a-z and 0-9, and I also want to allow some German umlauts. But whenever I add them to my regular expression, the replacement fails.

My encoding is UTF-8 (server, perl, scripts).

This is my function:

sub cleanXSS{

    my $string = shift;

    $string =~ s/[^A-Za-z0-9öäü]//g;

    return $string;
}

My script looks like this:

my $scalar = "áéíóúÁÉÍüÓÚâêÄîôßû()ÂÊÎÔÛabcäüöÄÜÖý#µzdjheäöü";
print cleanXSS($scalar)."\n";

So it should remove all characters except A-Z, a-z, 0-9 and the lowercase umlauts. The German umlauts in my test string are handled correctly, but all other Latin characters seem to be replaced only partially.

The console output looks like this:

▒▒▒▒▒▒▒▒▒ü▒▒▒▒▒▒▒▒▒▒▒▒▒▒abcäüö▒▒▒▒zdjheäöü

I've tried many approaches, such as "use locale", other encodings, and explicit encoding via "use Encode".

It seems that in a character like á only 1 of the 2 bytes is replaced. If I change my replacement to this:

$string =~ s/[^A-Za-z0-9öäü]/_/g;

I get the following output:

▒_▒_▒_▒_▒_ö▒_▒_▒_ü▒_▒_▒_▒_▒_▒_▒_▒_▒___▒_▒_▒_▒_▒_abcäüö▒_▒_▒_▒____zdjheäöü

How can I achieve this?

Answer

> It seems that in a character like "á" only 1 of the 2 bytes is replaced.

Decode inputs.

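You didn't tell Perl that your script is encoded in UTF-8, so the string literal and the umlauts in the character class are treated as sequences of bytes rather than as characters. Add: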
```
use utf8;
```
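
To illustrate why the "▒_" pairs appear, here is a minimal sketch of what happens at the byte level. It assumes the script runs without `use utf8`, so "á" is spelled out as its two UTF-8 bytes and the umlauts in the class are written as the bytes they occupy in the file. The variable names are only for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# "á" as it sits in a UTF-8 source file when `use utf8` is absent: two bytes.
my $bytes = "\xC3\xA1";

# Without `use utf8`, the literals ö, ä, ü in the class are also byte pairs,
# so the class effectively lists the bytes C3, B6, A4 and BC.
(my $filtered = $bytes) =~ s/[^A-Za-z0-9\xC3\xB6\xC3\xA4\xC3\xBC]/_/g;

# Decoding turns the byte pair into a single character (U+00E1).
my $chars = decode('UTF-8', $bytes);

printf "bytes: %d, characters: %d\n", length($bytes), length($chars);
printf "filtered bytes: %s\n",
    join ' ', map { sprintf '%02X', ord } split //, $filtered;
```

The lead byte C3 is shared with the umlauts and therefore survives, while the second byte is replaced by "_" (5F). The result is an invalid UTF-8 sequence, which the terminal shows as "▒_".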
Encode output.

You'll also need the following to encode the output:
```
use open ':std', ':encoding(UTF-8)';
```
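
Putting both pragmas together, the script from the question could look like the sketch below. It assumes the file is saved as UTF-8 and that the umlauts are stored as precomposed characters. Only the `strict` and `warnings` lines and the comments are additions beyond what the answer describes.

```perl
use strict;
use warnings;
use utf8;                              # string and regex literals are UTF-8 text
use open ':std', ':encoding(UTF-8)';   # encode STDOUT/STDERR, decode STDIN

sub cleanXSS {
    my ($string) = @_;
    # The class now matches whole characters instead of single bytes.
    $string =~ s/[^A-Za-z0-9öäü]//g;
    return $string;
}

my $scalar = "áéíóúÁÉÍüÓÚâêÄîôßû()ÂÊÎÔÛabcäüöÄÜÖý#µzdjheäöü";
print cleanXSS($scalar), "\n";   # prints: üabcäüözdjheäöü
```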

Other suggestions

Put this line at the beginning of the script:

binmode STDOUT, ":encoding(UTF-8)";

See the documentation for binmode.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow