A single user-perceived character may consist of multiple Unicode codepoints. Such characters can break u'[abc]'
-like regex that sees only codepoints in Python. To workaround it, you could use u'(?:a|b|c)'
regex instead. In addition, don't mix bytes and Unicode strings i.e., old_string
should be also Unicode.
Applying the last rule fixes your example.
You could write your regex using lookahead/lookbehind assertions:
# -*- coding: utf-8 -*-
import re
from functools import partial
old_string = u"""
ò:ha becomes ò:a
ä:hà becomes ä:à
aha becomes aa
üha becomes üa
ëhë becomes ëë"""
# (?<=a|b|c)(:?)h(?=a|b|c)
chars = u"a e i o u ä ë ö ü á é í ó ú à è ì ò".split()
pattern = u"(?<=%(vowels)s)(:?)h(?=%(vowels)s)" % dict(vowels=u"|".join(chars))
remove_h = partial(re.compile(pattern).sub, ur'\1')
# remove 'h' followed and preceded by vowels
print(remove_h(old_string))
Output
ò:a becomes ò:a
ä:à becomes ä:à
aa becomes aa
üa becomes üa
ëë becomes ëë
For completeness, you could also normalize all Unicode strings in the program using unicodedata.normalize()
function (see the example in the docs, to understand why you might need it).