Question

I have this regex:

/\「(?>[^\「\」\\]+|\\{2}|\\.)*\」/

(with # -*- encoding : utf-8 -*- in my file), which runs without any errors in my application. When I use the brakeman gem to check my application, it returns the following:

WARNING: invalid multibyte character: /\「(?>[^\「\」\\]+|\\{2}|\\.)*\」/ for "\\「(?>[^\\「\\」\\\\]+|\\\\{2}|\\\\.)*\\」" ""

+Errors+
+------------------------------------------------------------------------------------------------------>>
| Error                                                                                                     >>
+------------------------------------------------------------------------------------------------------->>
| /.../n has a non escaped non ASCII character in non ASCII-8BIT script: /\「(?>[^\「\」\\]+|\\{2}|\\.)*\」/>>
+------------------------------------------------------------------------------------------------------->>

1) Why is the warning displayed? (Isn't the bracket character escaped?)
2) Will anything bad happen if I ignore the warning?
3) Is there any way to change my code so that it achieves the same objective but does not have this issue?

Solution

I do not know anything about brakeman. But since your file is encoded in UTF-8, it looks as if the byte stream of your regular expression is being read as single-byte ANSI text with code page Windows-1252, which gives

/\ã€Œ(?>[^\ã€Œ\ã€\\]+|\\{2}|\\.)*\ã€/

In hexadecimal, these bytes are

2F 5C E3 80 8C 28 3F 3E 5B 5E 5C E3 80 8C 5C E3 80 8D 5C 5C 5D 2B 7C 5C 5C 7B 32 7D 7C 5C 5C 2E 29 2A 5C E3 80 8D 2F

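As you can see, there are many bytes with a value greater than 127 decimal (hexadecimal 7F) that are not preceded by a backslash, unless the byte stream is first converted from UTF-8 to Unicode (usually UTF-16 Little Endian). The backslash sits only in front of the first byte (E3) of each three-byte sequence, so the following bytes 80 and 8C or 8D remain unescaped.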
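The byte-level view above can be reproduced in Ruby. This is only a minimal sketch: `hex_bytes` is a hypothetical helper name, and it assumes that the tool misreads the bytes as Windows-1252, which is this answer's guess and not something confirmed for brakeman.

```ruby
# Hypothetical helper: format the bytes of a string as space-separated hex.
def hex_bytes(text)
  text.bytes.map { |byte| format("%02X", byte) }.join(" ")
end

# The opening bracket, written as an ASCII-only escape so that this
# snippet does not depend on the encoding of the source file.
open_bracket = "\u300c"

utf8_hex = hex_bytes(open_bracket)

# Reinterpret the same three bytes as Windows-1252 (assumption: this is
# how the bytes get misread), then convert to UTF-8 for display.
misread = open_bracket.dup.force_encoding("Windows-1252").encode("UTF-8")

puts utf8_hex                  # E3 80 8C
puts misread                   # ã€Œ
puts open_bracket.ascii_only?  # false
```

A single backslash in front of the character therefore ends up in front of E3 only.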

It is always possible to write a Perl regular expression without any character with a code value greater than 127, even if the expression has to match characters in the full Unicode range.

The scripts forum of the text editor UltraEdit has a topic named "Creating a Perl regular expression string with ANSI/Unicode characters" which explains how such an expression can be created. It also links to an UltraEdit script, written mainly in JavaScript, that converts a regular expression containing ANSI or Unicode characters into one that uses their hexadecimal representations and therefore only ASCII characters.

Running this script in UltraEdit on your regular expression, after removing the unnecessary backslashes before the Unicode characters, puts the following Perl regular expression string on the clipboard:

/\x{300c}(?>[^\x{300c}\x{300d}\\]+|\\{2}|\\.)*\x{300d}/

In a Ruby script, \u must be used instead of \x, resulting in the expression:

/\u{300c}(?>[^\u{300c}\u{300d}\\]+|\\{2}|\\.)*\u{300d}/

This regular expression should match the same text as yours without producing any warning from brakeman, as it now consists only of ASCII characters with a code value smaller than 128 decimal.
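As a quick sanity check, the escaped expression can be compared against one that contains the bracket characters literally. This is a sketch under assumptions: the sample string is made up for illustration, and the literal pattern omits the unnecessary backslashes before the brackets. It only shows that both patterns match the same text in plain Ruby; whether brakeman stays silent is not tested here.

```ruby
# ASCII-only pattern from this answer.
escaped = /\u{300c}(?>[^\u{300c}\u{300d}\\]+|\\{2}|\\.)*\u{300d}/

# Same pattern with the brackets typed literally (the source file must be UTF-8).
literal = /「(?>[^「」\\]+|\\{2}|\\.)*」/

# Made-up sample: a bracketed run that contains a backslash-escaped
# closing bracket, followed by the real closing bracket.
sample = "before \u300cquoted \\\u300d still inside\u300d after"

escaped_match = escaped.match(sample)[0]
literal_match = literal.match(sample)[0]

puts escaped_match
puts escaped_match == literal_match  # true
```

Both patterns skip the backslash-escaped closing bracket and stop at the unescaped one.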

Licensed under: CC-BY-SA with attribution