Question

I'm trying to use the sub function from Python's re module to find and change a pattern in a string. My code is below.

old_string = "afdëhë:dfp"
newString = re.sub(ur'([aeiouäëöüáéíóúàèìò]|ù:|e:|i:|o:|u:|ä:|ë:|ö:|ü:|á:|é:|í:|ó:|ú:|à:|è:|ì:|ò:|ù:)h([aeiouäëöüáéíóúàèìòù])', ur'\1\2', old_string)

So what I'm looking to get after the code is applied is afdëë:dfp (without the h). So I'm trying to match a vowel (sometimes with accents, sometimes with a colon after it) then the h then another vowel (sometimes with accents). So a few examples...

ò:ha becomes ò:a
ä:hà becomes ä:à
aha becomes aa
üha becomes üa
ëhë becomes ëë

So I'm trying to remove the h when it is between two vowels, and also when it follows a vowel with a colon after it and precedes another vowel (i.e. a:ha). Any help is greatly appreciated. I've been playing around with this for a while.

Solution

A single user-perceived character may consist of multiple Unicode codepoints. Such characters can break a u'[abc]'-style regex, because Python's re module sees only codepoints. To work around this, you could use a u'(?:a|b|c)' regex instead. In addition, don't mix byte strings and Unicode strings, i.e., old_string should also be a Unicode string.

Applying that last rule (making old_string a Unicode string) fixes your example.
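
To illustrate the point about codepoints, here is a minimal sketch. It is written for Python 3 (an assumption; the code in this thread is Python 2), where every str is already Unicode. The variable names are made up for the example. It compares how a character class and a non-capturing group treat an 'ë' that is stored as two codepoints.

```python
import re

# 'e' followed by U+0308 COMBINING DIAERESIS; displays as a single 'ë'
decomposed = "e\u0308"
assert len(decomposed) == 2

# Inside [...] every codepoint is a separate alternative
class_pattern = re.compile("[" + decomposed + "]")
# Inside (?:...) the two codepoints are matched as one sequence
group_pattern = re.compile("(?:" + decomposed + ")")

sample = "x" + decomposed + "y"
class_hits = class_pattern.findall(sample)
group_hits = group_pattern.findall(sample)

print(len(class_hits))  # 2: the 'e' and the combining mark match separately
print(len(group_hits))  # 1: the whole two-codepoint character matches once
```

The character class splits the character into two unrelated matches, which is why the alternation form is the safer choice when the input may contain combining marks.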

You could write your regex using lookahead/lookbehind assertions:

# -*- coding: utf-8 -*-
import re
from functools import partial

old_string = u"""
  ò:ha becomes ò:a
  ä:hà becomes ä:à
  aha becomes aa
  üha becomes üa
  ëhë becomes ëë"""

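# The pattern has the shape (?<=a|b|c)(:?)h(?=a|b|c): an 'h', optionally
# preceded by a colon, that sits between two vowels.
chars = u"a e i o u ä ë ö ü á é í ó ú à è ì ò ù".split()
pattern = u"(?<=%(vowels)s)(:?)h(?=%(vowels)s)" % dict(vowels=u"|".join(chars))
# Replace each match with group 1 only: the colon (if any) is kept, the 'h' is dropped
remove_h = partial(re.compile(pattern).sub, ur'\1')
# remove 'h' followed and preceded by vowels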
print(remove_h(old_string))

Output

  ò:a becomes ò:a
  ä:à becomes ä:à
  aa becomes aa
  üa becomes üa
  ëë becomes ëë
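
The ur'' prefix used above is Python 2 only. As a rough sketch, the same lookaround approach might look like this in Python 3. The Python 3 port, the nfc helper, the NFC normalization step and the added 'ù' are assumptions of this example, not part of the answer above.

```python
import re
import unicodedata


def nfc(text):
    """Return text in NFC form, so each accented vowel is a single codepoint."""
    return unicodedata.normalize("NFC", text)


# Vowel list from the answer, plus 'ù' from the question's own pattern
VOWELS = [nfc(v) for v in "a e i o u ä ë ö ü á é í ó ú à è ì ò ù".split()]
ALTERNATION = "|".join(VOWELS)

# Lookbehind: a vowel; group 1: an optional colon; then 'h'; lookahead: a vowel.
# Every alternative is one codepoint wide, which the lookbehind requires.
PATTERN = re.compile("(?<=%s)(:?)h(?=%s)" % (ALTERNATION, ALTERNATION))


def remove_h(text):
    """Drop an 'h' that sits between two vowels, keeping a colon before it."""
    return PATTERN.sub(r"\1", nfc(text))


print(remove_h("afdëhë:dfp"))  # afdëë:dfp
```

Normalizing both the vowel list and the input keeps the lookbehind fixed-width even if the source file or the data uses combining marks.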

For completeness, you could also normalize all Unicode strings in the program using the unicodedata.normalize() function (see the example in the docs to understand why you might need it).
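
As a small sketch of why normalization can matter here (Python 3 assumed; the shortened vowel class and the variable names are illustrative only): a character-class pattern built from precomposed vowels misses input typed with combining marks until that input is normalized to NFC.

```python
import re
import unicodedata

# Shortened vowel class using precomposed ä ë ö ü (one codepoint each)
vowel_class = "[aeiou\u00e4\u00eb\u00f6\u00fc]"
pattern = re.compile("(%s)h(%s)" % (vowel_class, vowel_class))

# 'ëhë' where each 'ë' is 'e' + U+0308 COMBINING DIAERESIS
decomposed_input = "e\u0308he\u0308"

# Without normalization the codepoint before 'h' is the combining mark,
# which is not in the class, so nothing is replaced
result_raw = pattern.sub(r"\1\2", decomposed_input)

# After NFC normalization each 'ë' is the single codepoint U+00EB
normalized_input = unicodedata.normalize("NFC", decomposed_input)
result_nfc = pattern.sub(r"\1\2", normalized_input)

print(result_raw == decomposed_input)  # True: the 'h' is still there
print(result_nfc == "\u00eb\u00eb")    # True: the 'h' was removed
```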

Other tips

It is an encoding issue. Different combinations of file encoding and a non-Unicode old_string behave differently across Python versions.

For example, your code works fine on Python 2.6 and 2.7 this way (all data below is cp1252 encoded):

# -*- coding: cp1252 -*-
old_string = "afdëhë:dfp"

but fails with SyntaxError: Non-ASCII character '\xeb' if no encoding is declared in the file.

However, those lines fail on Python 2.5 with

`UnicodeDecodeError: 'ascii' codec can't decode byte 0xeb in position 0: ordinal not in range(128)`

With old_string as a non-Unicode string in a UTF-8 file, every Python version fails to remove the h:

# -*- coding: utf8 -*-
old_string = "afdëhë:dfp"

So you have to declare the correct encoding and also make old_string a Unicode string. For example, this will do:

# -*- coding: cp1252 -*-
old_string = u"afdëhë:dfp"
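
The difference between the cp1252 and UTF-8 cases can be imitated in Python 3. This is only a sketch, resting on the assumption that Python 2 compared the bytes of a non-Unicode string against a Unicode pattern as if each byte were a codepoint below 256; decoding the bytes as Latin-1 reproduces that view. The shortened vowel class and the variable names are illustrative.

```python
import re

text = "afd\u00ebh\u00eb:dfp"  # 'afdëhë:dfp'
pattern = re.compile("([aeiou\u00eb])h([aeiou\u00eb])")

# View the encoded bytes one byte per character, as the assumed Python 2
# behaviour would for a non-Unicode string
seen_as_cp1252 = text.encode("cp1252").decode("latin-1")
seen_as_utf8 = text.encode("utf-8").decode("latin-1")

# In cp1252 'ë' is the single byte 0xEB, which coincides with U+00EB,
# so the pattern matches and the 'h' is removed
result_cp1252 = pattern.sub(r"\1\2", seen_as_cp1252)

# In UTF-8 'ë' is the two bytes 0xC3 0xAB, so the byte before 'h' is 0xAB,
# which is not a vowel, and nothing is replaced
result_utf8 = pattern.sub(r"\1\2", seen_as_utf8)

print(result_cp1252 == "afd\u00eb\u00eb:dfp")  # True
print(result_utf8 == seen_as_utf8)             # True
```

Under that assumption, the cp1252 case works only by coincidence of byte values, which is why making old_string a Unicode string is the dependable fix.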
License: CC-BY-SA with attribution
Not affiliated with StackOverflow