Regular expression and unicode literals

https://stackoverflow.com/questions/8245392

07-03-2021

Question

I'd like to remove some characters from a string (either byte string or unicode string) using a regular expression like this:

pattern = re.compile(ur'\u00AE|\u2122', re.UNICODE)

If the characters are specified as unicode literals, the resulting regexp does not work properly on a byte string:

q = 'Canon\xc2\xae  EOS  7D'
pattern.sub('', q)  # 'Canon\xc2  EOS  7D'

If I convert the argument of the substitution to a unicode string, however, it works as expected...

pattern.sub('', unicode(q))  # u'Canon  EOS  7D'

Can someone please explain to me why this is the case?

thanks,

Peter

Solution

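Because a standard (byte) string is not a Unicode string. Python does not know what encoding it is in (or whether it is even Unicode at all), so it has no way to determine whether a particular Unicode character matches some character in it. Instead, the pattern is compared against the string one byte at a time: \u00AE matches the single byte \xae, which is why the \xc2 half of the UTF-8 sequence \xc2\xae is left behind in your first example. The solution is to convert the byte string to Unicode before matching, as you have figured out. Note that unicode(q) with no encoding argument assumes ASCII by default, so it is safer to name the encoding explicitly, for example unicode(q, 'utf-8') or q.decode('utf-8').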
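
As a rough illustration of the same idea in Python 3 (where str is already Unicode and byte strings are a separate bytes type), the sketch below decodes byte input before applying the pattern. The helper name strip_marks and the UTF-8 default are assumptions made for this example, not part of the original question; the right encoding depends on where the bytes came from.

```python
import re

# The same two characters as in the question: U+00AE (registered sign)
# and U+2122 (trade mark sign).
pattern = re.compile('\u00ae|\u2122')


def strip_marks(text, encoding='utf-8'):
    """Remove the two marks from text.

    Hypothetical helper: byte input is decoded first, and the
    UTF-8 default is an assumption about the data.
    """
    if isinstance(text, bytes):
        # Turn raw bytes into a Unicode string so that the pattern is
        # compared against characters rather than individual bytes.
        text = text.decode(encoding)
    return pattern.sub('', text)


raw = b'Canon\xc2\xae  EOS  7D'
print(strip_marks(raw))  # Canon  EOS  7D
```

Decoding once at the boundary, where the bytes enter the program, keeps the rest of the code working on Unicode strings only.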

Licensed under: CC-BY-SA with attribution

Not affiliated with Stack Overflow