Python regex with unicode ranges matches characters not in range

https://stackoverflow.com/questions/21516335

06-10-2022
Question

I'm using a regex to strip "bullet points" from text. These bullet points are often symbols found in unicode ranges such as geometric shapes (\u25a0-\u25ff) or similar. Below is an example of such bullets:

 ◉ This is a bullet
 ♦︎ This is also a bullet
 ☉ And so is this

This is not a bullet.

I'm using the following regular expression to match these bullet points:

This works in Ruby (see an example at http://rubular.com/r/O7ZObURmlt), but in Python it matches the first character of any string. For example, the T in the string "This is not a bullet." is matched. You can copy the above regex and example text to http://www.pythonregex.com/ to see this for yourself.

The regex is compiled with the UNICODE flag.

How can I make Python's regex engine play nice with this expression?

Solution

Make the string that generates your expression a unicode string, so that the \uXXXX sequences are interpreted as unicode characters instead of as a plain backslash followed by u, 2, 0, and so on. Try the following:

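import re
# Every piece needs the u prefix in Python 2; "\\s" keeps the backslash for the regex engine.
regex = re.compile(
    u"\\s*([\u00a4\u00b7]|[\u2010-\u2017]|"
    u"[\u2020-\u206f]|[\u2300-\u23f3]|[\u25a0-\u25ff]|"
    u"[\u2600-\u26ff]|[\u2700-\u27bf]|[\u2b00-\u2bff])\\s*", re.UNICODE)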
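As for why the T was matched in the first place: as far as I can tell, when the escapes are not resolved by the string literal, the Python 2 regex parser treats \u inside a character class as a plain u. A class such as [\u2010-\u2017] then contains the literals u, 2, 0, 1, 7 plus the range from "0" to "u", and that range covers every uppercase ASCII letter. The sketch below reproduces that reading in Python 3 by writing the degraded class out by hand; the names misread and intended are only for illustration.

```python
import re

# Assumption: this is the character class as a Python 2 byte-string pattern
# would be read, with each "\u" degraded to a plain "u".
# It holds the literals u, 2, 0, 1, 7 and the range "0" through "u".
misread = re.compile(r"[u2010-u2017]")

# The intended class, with the escapes resolved to real code points.
intended = re.compile("[\u2010-\u2017]")

line = "This is not a bullet."
misread_hit = misread.match(line)    # "T" (0x54) lies between "0" (0x30) and "u" (0x75)
intended_hit = intended.match(line)  # the line does not start with a dash-like character

print(misread_hit is not None, intended_hit is None)
```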

You're most probably not using Python 3, in which all strings are unicode and the u prefix is not needed. Since Python 3.3 the re module also understands \uXXXX escapes on its own.
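For comparison, here is a minimal Python 3 sketch of the same pattern applied to lines like the ones in the question. The ^ anchor and the strip_bullet helper are my own assumptions about how the stripping is meant to work, not something taken from the question. re.UNICODE is left out because it is already the default for str patterns in Python 3.

```python
import re

# In Python 3 every str literal is Unicode, so "\u25a0" is already one character.
# "\\s" is doubled so the regex engine, not the string literal, sees the escape.
# Assumption: bullets only appear at the start of a line, hence the "^" anchor.
BULLET = re.compile(
    "^\\s*([\u00a4\u00b7]|[\u2010-\u2017]|"
    "[\u2020-\u206f]|[\u2300-\u23f3]|[\u25a0-\u25ff]|"
    "[\u2600-\u26ff]|[\u2700-\u27bf]|[\u2b00-\u2bff])\\s*"
)


def strip_bullet(line):
    """Remove one leading bullet symbol and the whitespace around it (hypothetical helper)."""
    return BULLET.sub("", line, count=1)


samples = [
    " \u25c9 This is a bullet",  # U+25C9, in the Geometric Shapes range
    " \u2609 And so is this",    # U+2609, in the Miscellaneous Symbols range
    "This is not a bullet.",
]
cleaned = [strip_bullet(s) for s in samples]
for text in cleaned:
    print(text)
```

With the anchor in place, a line that does not start with one of the listed symbols is returned unchanged.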

Licensed under: CC-BY-SA with attribution
