Question

I want to extract Unicode characters from a string using regular expressions, removing ASCII letters, numbers and special symbols from a string or a text file. Is this possible with a regular expression? For instance, I want to keep only the Hindi or Chinese characters from text taken from a news article.


Solution

As stated above, ASCII is a subset of Unicode, so the question doesn't quite make sense as-is. If you really want to remove all codepoints below U+0080 from the string, that's easy:

re.sub(r"[\x00-\x7f]+", "", mystring)
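As a quick check of what that pattern removes, here is a minimal Python 3 sketch. The helper name strip_ascii and the sample string are made up for illustration and are not part of the original answer.

```python
import re

# Matches runs of code points U+0000 through U+007F (the ASCII range).
ASCII_RUN = re.compile(r"[\x00-\x7f]+")

def strip_ascii(text):
    """Return text with every ASCII character removed (hypothetical helper)."""
    return ASCII_RUN.sub("", text)

sample = "abc 123 हिन्दी, 中文!"
print(strip_ascii(sample))
```

Note that spaces and common punctuation are ASCII too, so the surviving words end up joined together.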

If you want to keep only certain "whitelisted" characters, you need to specify precisely which codepoints to keep.

For example, to keep Devanagari codepoints (used for writing Hindi), you can use

re.sub(r"[^\u0900-\u097F]+", "", mystring)

or, in Python 2, where the pattern has to be a Unicode literal (thanks @bobince for the heads-up!):

re.sub(ur"[^\u0900-\u097F]+", "", mystring)
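Since the question also mentions Chinese, the whitelist idea can be extended to more than one range. The following is only a sketch in Python 3: it assumes the CJK Unified Ideographs block (U+4E00 to U+9FFF) is enough for the Chinese text at hand, it keeps whitespace so words stay separated, and the name keep_whitelisted is hypothetical.

```python
import re

# Negated class: anything that is NOT in one of the whitelisted ranges.
#   \u0900-\u097F  Devanagari block
#   \u4E00-\u9FFF  CJK Unified Ideographs block (assumed sufficient here)
#   \s             whitespace, kept so that words stay separated
NOT_WANTED = re.compile(r"[^\u0900-\u097F\u4E00-\u9FFF\s]+")

def keep_whitelisted(text):
    """Drop everything outside the whitelist, then tidy the whitespace."""
    cleaned = NOT_WANTED.sub("", text)
    # Collapse the gaps left behind by removed words into single spaces.
    return " ".join(cleaned.split())

print(keep_whitelisted("Breaking: भारत 2024 中国 news!"))
```

Further ranges can be added to the character class in the same way if the text uses other blocks.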

You do need to make sure that you're working on a Unicode string, so decode your input bytes before matching and encode the result again before writing it out:

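import re
import urllib2  # Python 2 only

url = 'http://www.bhaskar.com/'
# Decode the raw bytes to a unicode string; "utf-8-sig" also drops a leading BOM
data = urllib2.urlopen(url).read().decode("utf-8-sig")
regex = re.compile(ur"[^\u0900-\u097F]+")
# Replace each run of non-Devanagari characters with a single space
hindionly = regex.sub(u" ", data)
print hindionly.encode("utf-8")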
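For Python 3, where urllib2 became urllib.request, the same decode-then-filter step might look like the sketch below. To keep it runnable without network access, it works on a hard-coded byte string standing in for the downloaded page; the function name hindi_only and the sample markup are assumptions for illustration.

```python
import re

# Runs of anything outside the Devanagari block.
NON_DEVANAGARI = re.compile(r"[^\u0900-\u097F]+")

def hindi_only(raw_bytes, encoding="utf-8-sig"):
    """Decode raw bytes, then replace non-Devanagari runs with a space."""
    text = raw_bytes.decode(encoding)  # "utf-8-sig" strips a leading BOM
    return NON_DEVANAGARI.sub(" ", text).strip()

# Stand-in for the bytes that urllib.request.urlopen(url).read() would return.
raw = "\ufeff<h1>समाचार</h1> <p>News: आज 2024</p>".encode("utf-8")
print(hindi_only(raw))
```

In Python 3, print encodes the string for the terminal itself, so the explicit encode("utf-8") call is only needed when writing bytes to a file or socket.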

Other tips

Using the third-party regex module, you could express the pattern using Unicode script properties instead of hard-coded ranges:

import regex
print(repr(regex.sub(ur'[^\p{Devanagari}\p{Han}]', u'', u'abc123\u0900'))) 
# u'\u0900'
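If installing a third-party module is not an option, a rough standard-library approximation is to filter on Unicode character names. This is a different technique from script properties and only a sketch: it assumes that a name prefix of "DEVANAGARI" or "CJK UNIFIED IDEOGRAPH" is a good enough stand-in for the script, which will not match \p{...} exactly in every case. The name keep_by_name is hypothetical.

```python
import unicodedata

# Name prefixes used as a stand-in for script membership (an approximation).
WANTED_PREFIXES = ("DEVANAGARI", "CJK UNIFIED IDEOGRAPH")

def keep_by_name(text):
    """Keep only characters whose Unicode name starts with a wanted prefix."""
    return "".join(
        ch for ch in text
        # unicodedata.name returns the default "" for unnamed characters
        if unicodedata.name(ch, "").startswith(WANTED_PREFIXES)
    )

print(repr(keep_by_name("abc123\u0939\u4e2d")))
```

This checks one character at a time, so it is slower than a compiled pattern on large inputs.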
Licensed under: CC-BY-SA with attribution