Question

I want to extract Unicode characters from a string using regular expressions, removing ASCII letters, numbers and special symbols from a string or a text file. Is this possible with a regular expression? For instance, I want only the Hindi or Chinese characters from a text taken from a news article.


Solution

ASCII is a subset of Unicode, so the question doesn't quite make sense as-is. If you really want to remove all codepoints below U+0080 (that is, every ASCII character) from the string, that's easy:

re.sub(r"[\x00-\x7f]+", "", mystring)
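A minimal, self-contained sketch of the substitution above, assuming Python 3 (where every str is already Unicode). The helper name strip_ascii and the sample text are made up for illustration:

```python
import re

# Matches runs of codepoints U+0000 through U+007F (the ASCII range).
ASCII_RUN = re.compile(r"[\x00-\x7f]+")


def strip_ascii(text):
    """Return text with every ASCII codepoint removed (hypothetical helper)."""
    return ASCII_RUN.sub("", text)


sample = "Hello 123 नमस्ते!"
result = strip_ascii(sample)
print(result)
```

Note that spaces and newlines are ASCII too, so this also removes the separators between words.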

If you want to keep only certain "whitelisted" characters, you need to specify precisely which codepoints to keep.

For example, to keep Devanagari codepoints (used for writing Hindi), you can use

re.sub(r"[^\u0900-\u097F]+", "", mystring)

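or, in Python 2, where the pattern has to be a ur"..." Unicode literal because the re module there does not interpret \u escapes itself (thanks @bobince for the heads-up!)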

re.sub(ur"[^\u0900-\u097F]+", "", mystring)
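Deleting everything outside the range also deletes the spaces between words. If separate words are wanted, one possible variation is to collect the Devanagari runs instead of removing the rest. This is a sketch assuming Python 3; devanagari_words and the sample sentence are hypothetical:

```python
import re

# Runs of codepoints from the Devanagari block (U+0900 to U+097F).
DEVANAGARI_RUN = re.compile(r"[\u0900-\u097F]+")


def devanagari_words(text):
    """Return each maximal run of Devanagari codepoints (hypothetical helper)."""
    return DEVANAGARI_RUN.findall(text)


sample = "Delhi: आज का मौसम sunny है 25°C"
words = devanagari_words(sample)
print(" ".join(words))
```

Joining the runs with a space keeps the Hindi words readable, at the cost of losing the original punctuation.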

You do need to make sure that you're working on a Unicode string, so decode your input bytes before matching and encode the result again when you output it:

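import re
import urllib2  # Python 2 only

url = 'http://www.bhaskar.com/'
# Decode the raw bytes to a unicode string ("utf-8-sig" also drops a leading BOM)
data = urllib2.urlopen(url).read().decode("utf-8-sig")
regex = re.compile(ur"[^\u0900-\u097F]+")
# Replace every non-Devanagari run with "foo"; pass u"" to delete it instead
hindionly = regex.sub("foo", data)
print hindionly.encode("utf-8")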
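For comparison, here is a rough Python 3 sketch of the same decode-then-filter flow. To stay runnable offline it uses a made-up byte string in place of the downloaded page; the sample markup is an assumption, not taken from the site above:

```python
import re

# Bytes as they might arrive from a file or HTTP response: UTF-8 with a BOM.
# (Sample content is invented for illustration.)
raw = b"\xef\xbb\xbf" + "<h1>समाचार</h1>".encode("utf-8")

# "utf-8-sig" decodes UTF-8 and drops a leading byte order mark if present.
text = raw.decode("utf-8-sig")

# Delete every run of codepoints outside the Devanagari block.
non_devanagari = re.compile(r"[^\u0900-\u097F]+")
hindi_only = non_devanagari.sub("", text)
print(hindi_only)
```

In Python 3 print accepts the str directly, so the explicit encode step is only needed when writing bytes to a file or socket.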

OTHER TIPS

Using the third-party regex module, you could express the pattern using Unicode script properties (Python 2 syntax shown):

import regex
print(repr(regex.sub(ur'[^\p{Devanagari}\p{Han}]', u'', u'abc123\u0900'))) 
# u'\u0900'
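If installing a third-party module is not an option, a similar filter can be approximated with the standard re module and explicit ranges. This is only a sketch assuming Python 3: the Han range used here is just the basic CJK Unified Ideographs block (U+4E00 to U+9FFF), which is narrower than what the Han script property covers, and the sample text is invented:

```python
import re

# Assumption: Devanagari block plus the basic CJK Unified Ideographs block.
# The Han script property also covers extension blocks and radicals,
# which this character class does not.
NOT_HINDI_OR_HAN = re.compile(r"[^\u0900-\u097F\u4E00-\u9FFF]+")

sample = "abc123 हिंदी 中文 news"
kept = NOT_HINDI_OR_HAN.sub("", sample)
print(kept)
```

Further ranges can be added to the character class if rarer ideographs matter for the text at hand.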
Licensed under: CC-BY-SA with attribution