Python 2.5 sub function from regex module not recognizing a pattern

Question 1

A single user-perceived character may consist of multiple Unicode codepoints. Such characters can break a u'[abc]'-style regex, which sees only individual codepoints in Python. To work around this, you could use a u'(?:a|b|c)' regex instead. In addition, don't mix byte strings and Unicode strings, i.e., old_string should also be Unicode.

Applying the last rule fixes your example.
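
To illustrate the first point, here is a minimal sketch. It assumes the text contains a decomposed ë (the letter e followed by U+0308 COMBINING DIAERESIS); the variable names are only for illustration, and escapes are used so the result does not depend on the source file's encoding.

```python
import re

# 'e' followed by COMBINING DIAERESIS: one visible character, two codepoints.
decomposed = u"e\u0308"

char_class = re.compile(u"[" + decomposed + u"]")      # same as u"[e\u0308]"
alternation = re.compile(u"(?:" + decomposed + u")")   # same as u"(?:e\u0308)"

text = u"xe\u0308y"

# The class treats the two codepoints as separate members and matches each alone.
class_matches = char_class.findall(text)
# The alternation only matches the complete two-codepoint sequence.
alt_matches = alternation.findall(text)

# A bare 'e' is accepted by the class, but not by the alternation.
class_hits_plain_e = char_class.search(u"e") is not None
alt_hits_plain_e = alternation.search(u"e") is not None

print(class_matches == [u"e", u"\u0308"])  # True
print(alt_matches == [decomposed])         # True
```

The character class effectively becomes "e or a lone combining mark", which is why the alternation form is safer for such characters.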

You could write your regex using lookahead/lookbehind assertions:

# -*- coding: utf-8 -*-
import re
from functools import partial

old_string = u"""
  ò:ha becomes ò:a
  ä:hà becomes ä:à
  aha becomes aa
  üha becomes üa
  ëhë becomes ëë"""

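# The generated pattern has the shape (?<=a|b|c)(:?)h(?=a|b|c).
# Every alternative must be a single codepoint, because lookbehind needs a fixed width.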
chars = u"a e i o u ä ë ö ü á é í ó ú à è ì ò".split()
pattern = u"(?<=%(vowels)s)(:?)h(?=%(vowels)s)" % dict(vowels=u"|".join(chars))
# Remove each 'h' that is preceded by a vowel (plus an optional ':') and followed by a vowel.
remove_h = partial(re.compile(pattern).sub, ur'\1')
print(remove_h(old_string))

Output

  ò:a becomes ò:a
  ä:à becomes ä:à
  aa becomes aa
  üa becomes üa
  ëë becomes ëë

For completeness, you could also normalize all Unicode strings in the program using the unicodedata.normalize() function (see the example in the docs to understand why you might need it).
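
A short sketch of what normalization does, assuming the same ë example in its composed and decomposed forms (the variable names are illustrative):

```python
import unicodedata

composed = u"\u00eb"      # LATIN SMALL LETTER E WITH DIAERESIS, one codepoint
decomposed = u"e\u0308"   # 'e' + COMBINING DIAERESIS, two codepoints

# The two strings look identical on screen but compare unequal.
equal_before = (composed == decomposed)

# NFC composes where possible, NFD decomposes.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)

print(equal_before)        # False
print(nfc == composed)     # True
print(nfd == decomposed)   # True
```

Normalizing both the input text and the characters used to build the pattern to the same form (for example NFC) keeps each accented vowel a single codepoint, which is what the lookbehind-based pattern relies on.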

Question 2

It is an encoding issue. Different combinations of source file encoding and a non-Unicode old_string behave differently across Python versions.

For example, your code works fine on Python 2.6 to 2.7 this way (all data below is cp1252 encoded):

# -*- coding: cp1252 -*-
old_string = "afdëhë:dfp"

but fails with SyntaxError: Non-ASCII character '\xeb' if no encoding is specified in the file.

However, on Python 2.5 those lines fail with

`UnicodeDecodeError: 'ascii' codec can't decode byte 0xeb in position 0: ordinal not in range(128)`

And on all Python versions the code fails to remove the h when old_string is non-Unicode:

# -*- coding: utf8 -*-
old_string = "afdÃ«hÃ«:dfp"
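
The difference comes down to which bytes end up in a non-Unicode literal. A minimal sketch, using explicit encode/decode calls instead of a coding declaration so that it does not depend on how the file is saved (the variable names are illustrative):

```python
text = u"afd\u00ebh\u00eb:dfp"   # u"afdëhë:dfp" written with escapes

cp1252_bytes = text.encode("cp1252")   # each ë is the single byte 0xEB
utf8_bytes = text.encode("utf-8")      # each ë is the two bytes 0xC3 0xAB

lengths = (len(text), len(cp1252_bytes), len(utf8_bytes))
print(lengths)  # (10, 10, 12)

# Reading UTF-8 bytes as cp1252 turns every ë into two unrelated characters.
misread = utf8_bytes.decode("cp1252")
# Decoding with the matching codec restores the original text.
roundtrip = utf8_bytes.decode("utf-8")

print(misread == u"afd\u00c3\u00abh\u00c3\u00ab:dfp")  # True
print(roundtrip == text)                               # True
```

In the UTF-8 byte string no single byte corresponds to ë, so a pattern written in terms of the ë codepoint has nothing to match against.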

So you have to provide the correct encoding and define old_string as a Unicode string as well; for example, this will do:

# -*- coding: cp1252 -*-
old_string = u"afdëhë:dfp"