Extract Only Unicode Characters from a String using Regular Expressions

Question 1

As stated above, ASCII is a subset of Unicode, so the question doesn't quite make sense as-is. If you really want to remove all codepoints below U+0080 from the string, that's easy:

re.sub(r"[\x00-\x7f]+", "", mystring)
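As a quick check of what that substitution does, here is a minimal Python 3 sketch. The helper name `strip_ascii` and the sample string are made up for illustration; note that spaces and digits are ASCII too, so they disappear along with the letters.

```python
import re

# Matches runs of codepoints U+0000 to U+007F (the ASCII range).
ASCII_RUN = re.compile(r"[\x00-\x7f]+")

def strip_ascii(text):
    """Return text with every ASCII codepoint removed (hypothetical helper)."""
    return ASCII_RUN.sub("", text)

# Sample input: ASCII letters, a Hindi word, ASCII digits.
sample = "abc \u0939\u093f\u0928\u094d\u0926\u0940 123"
result = strip_ascii(sample)
print(result)
```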

If you want to keep only certain "whitelisted" characters, you need to specify precisely which codepoints to keep.

For example, to keep only Devanagari codepoints (U+0900 to U+097F, used for writing Hindi), you can use (Python 3.3+, where the re module itself understands \uXXXX escapes)

re.sub(r"[^\u0900-\u097F]+", "", mystring)

or, in Python 2, where the pattern has to be a Unicode literal (thanks @bobince for the heads-up!)

re.sub(ur"[^\u0900-\u097F]+", "", mystring)
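To make the whitelist approach concrete, here is a small Python 3 sketch. The function name `keep_devanagari` and its `replacement` parameter are assumptions for illustration, not part of any library. Because whitespace is not in the Devanagari block, it is removed as well; passing a space as the replacement keeps words apart.

```python
import re

# Any run of codepoints outside the Devanagari block U+0900 to U+097F.
NON_DEVANAGARI = re.compile(r"[^\u0900-\u097F]+")

def keep_devanagari(text, replacement=""):
    """Replace each run of non-Devanagari codepoints (hypothetical helper)."""
    return NON_DEVANAGARI.sub(replacement, text)

mixed = "Hindi: \u0939\u093f\u0928\u094d\u0926\u0940 (hindi)"
only = keep_devanagari(mixed)

# Using " " as the replacement keeps the two Devanagari words separated.
two_words = "\u0928\u092e\u0938\u094d\u0924\u0947 world \u092d\u093e\u0930\u0924"
spaced = keep_devanagari(two_words, " ")
```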

You do need to make sure that you're working on a Unicode string, so decode your input bytes before applying the regex and encode the result again when you write it out:

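import re
import urllib2  # Python 2 only

url = 'http://www.bhaskar.com/'
# Decode the raw bytes to a Unicode string first; "utf-8-sig" also strips a BOM.
data = urllib2.urlopen(url).read().decode("utf-8-sig")
regex = re.compile(ur"[^\u0900-\u097F]+")
hindionly = regex.sub(u"", data)  # drop every run of non-Devanagari codepoints
print hindionly.encode("utf-8")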
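The same decode, filter, encode sequence in Python 3 might look like the sketch below. To keep it self-contained it does not fetch anything; the `raw` bytes are a made-up stand-in for what a file or HTTP response would return.

```python
import codecs
import re

# Stand-in for downloaded bytes: a UTF-8 BOM followed by a bit of markup.
raw = codecs.BOM_UTF8 + "<p>\u0939\u093f\u0928\u094d\u0926\u0940</p>".encode("utf-8")

# Decode before matching; "utf-8-sig" also drops a leading BOM if present.
text = raw.decode("utf-8-sig")

# Filter on the decoded str, not on the bytes.
hindi_only = re.sub(r"[^\u0900-\u097F]+", "", text)

# Encode again only at the point where bytes are needed for output.
encoded = hindi_only.encode("utf-8")
```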

Question 2

Using the third-party regex module, you could express the pattern using Unicode script properties (Python 2 syntax shown):

import regex
print(repr(regex.sub(ur'[^\p{Devanagari}\p{Han}]', u'', u'abc123\u0900'))) 
# u'\u0900'
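If installing a third-party module is not an option, a rough approximation with the standard re module is to list block ranges by hand. The sketch below assumes U+0900 to U+097F for Devanagari and U+4E00 to U+9FFF (the CJK Unified Ideographs block) for Han. That is narrower than a real script property, since Han characters also live in several extension blocks, so treat it as an approximation only.

```python
import re

# Approximation by block ranges, not true script properties:
# Devanagari block plus the CJK Unified Ideographs block.
NOT_DEVANAGARI_OR_CJK = re.compile(r"[^\u0900-\u097F\u4E00-\u9FFF]")

filtered = NOT_DEVANAGARI_OR_CJK.sub("", "abc123\u0900\u4e2d\u6587")
print(ascii(filtered))
```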