Python regex with unicode characters bug?
25-09-2019
Question
Long story short:
>>> re.compile(r"\w*").match(u"Français")
<_sre.SRE_Match object at 0x1004246b0>
>>> re.compile(r"^\w*$").match(u"Français")
>>> re.compile(r"^\w*$").match(u"Franais")
<_sre.SRE_Match object at 0x100424780>
>>>
Why doesn't it match the string with unicode characters with ^ and $ in the regex? As far as I understand, ^ stands for the beginning of the string (line) and $ for the end of it.
Solution
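You need to specify the UNICODE flag (re.UNICODE, or re.U for short); otherwise, in Python 2, \w is just equivalent to [a-zA-Z0-9_], which does not include the character 'ç'. The unanchored \w* still reports a match because match() only has to succeed at the start of the string, and it stops after "Fran". With ^ and $ the whole string has to consist of \w characters, so the 'ç' makes the match fail.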
>>> re.compile(r"^\w*$", re.U).match(u"Fran\xe7ais")
<_sre.SRE_Match object at 0x101474168>
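For anyone reading this on Python 3, here is a small sketch of the same behaviour. It assumes Python 3, where str patterns are Unicode-aware by default, so re.ASCII is used to reproduce what Python 2 does without re.UNICODE. The variable names are only for illustration.

```python
import re

text = "Fran\xe7ais"  # "Français", with a precomposed 'ç' (U+00E7)

# re.ASCII restricts \w to [a-zA-Z0-9_], like Python 2 without re.UNICODE.
ascii_word = re.compile(r"\w*", re.ASCII)
ascii_anchored = re.compile(r"^\w*$", re.ASCII)

# In Python 3, str patterns are Unicode-aware unless re.ASCII is given.
unicode_anchored = re.compile(r"^\w*$")

# Unanchored: match() succeeds, but only on the ASCII prefix before 'ç'.
prefix = ascii_word.match(text).group()

# Anchored: every character must be a \w character, so 'ç' blocks the match.
ascii_result = ascii_anchored.match(text)

# Unicode-aware \w accepts 'ç', so the anchored pattern covers the whole string.
unicode_result = unicode_anchored.match(text)

print(repr(prefix))            # 'Fran'
print(ascii_result)            # None
print(unicode_result.group())  # Français
```

Calling .group() on the unanchored match is a quick way to see that the first example in the question only ever matched "Fran", not the whole word.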
Licensed under: CC-BY-SA with attribution