I am struggling with a function that I am trying to write in Perl. It should filter a string so that only specific characters remain. I allow A-Z, a-z, 0-9, and I also want to allow some German umlauts. But whenever I add them to my regular expression, the replacement fails.
Everything is encoded as UTF-8 (server, Perl, scripts).
This is my function:
sub cleanXSS {
    my $string = shift;
    $string =~ s/[^A-Za-z0-9öäü]//g;
    return $string;
}
My script looks like this:
my $scalar = "áéíóúÁÉÍüÓÚâêÄîôßû()ÂÊÎÔÛabcäüöÄÜÖý#µzdjheäöü";
print cleanXSS($scalar)."\n";
So it should remove all characters except A-Z, a-z, 0-9 and the lowercase umlauts. The German umlauts in my test string are handled correctly, but all other accented Latin characters seem to be removed only partially.
The console output looks like this:
▒▒▒▒▒▒▒▒▒ü▒▒▒▒▒▒▒▒▒▒▒▒▒▒abcäüö▒▒▒▒zdjheäöü
I've tried many approaches, such as "use locale", other encodings, explicit encoding via "use Encode", and so on.
It seems that for a character like á only 1 of its 2 bytes is replaced. If I change my replacement to this:
$string =~ s/[^A-Za-z0-9öäü]/_/g;
I get the following output:
▒_▒_▒_▒_▒_ö▒_▒_▒_ü▒_▒_▒_▒_▒_▒_▒_▒_▒___▒_▒_▒_▒_▒_abcäüö▒_▒_▒_▒____zdjheäöü
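To check that suspicion, here is a minimal sketch that spells out the UTF-8 bytes by hand. It assumes Perl is treating my source as raw bytes (I do not have "use utf8" enabled), so that öäü in the character class becomes the byte list C3 B6 C3 A4 C3 BC. The variable names are only for illustration.

```perl
use strict;
use warnings;

# Assumption: without "use utf8", Perl reads the UTF-8 source as raw bytes,
# so the class [^A-Za-z0-9öäü] effectively lists the bytes C3 B6 C3 A4 C3 BC.
my $input = "\xC3\xA1";    # the two UTF-8 bytes of "á"

# Same substitution as above, with the umlaut bytes written explicitly.
(my $output = $input) =~ s/[^A-Za-z0-9\xC3\xB6\xC3\xA4\xC3\xBC]/_/g;

# Show each remaining byte as two hex digits.
my $hex = join ' ', map { sprintf('%02X', ord($_)) } split //, $output;
print "$hex\n";
```

This prints C3 5F: the lead byte C3 survives because it is also the lead byte of ö, ä and ü, and only the second byte is replaced. That would match the ▒_ pairs in my output.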
How can I get the filter to remove these characters completely?