Extract Only Unicode Characters from a String using Regular Expressions

https://stackoverflow.com/questions/23633796

21-07-2023

Question

I want to extract Unicode characters from a string using regular expressions, removing ASCII letters, digits and special symbols from a string or a text file. Is this possible with a regular expression? For instance, I want to keep only the Hindi or Chinese characters from text taken from a news article.

Solution

ASCII is a subset of Unicode, so the question doesn't quite make sense as-is. If you really want to remove all codepoints below U+0080 (that is, every ASCII character) from the string, that's easy:

re.sub(r"[\x00-\x7f]+", "", mystring)
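As a quick check of what this pattern does, here is a minimal Python 3 sketch. The helper name `strip_ascii` and the sample string are made up for illustration and are not part of the original answer.

```python
import re

# Matches one or more codepoints in the ASCII range U+0000-U+007F.
ASCII_RUN = re.compile(r"[\x00-\x7f]+")

def strip_ascii(text):
    """Return text with every ASCII character removed (hypothetical helper)."""
    return ASCII_RUN.sub("", text)

# Sample input mixing ASCII, Devanagari and Chinese characters.
sample = "abc 123 \u0939\u093f\u0928\u094d\u0926\u0940 !? \u4e2d\u6587"
print(strip_ascii(sample))
```

Note that this keeps everything outside ASCII, including accented Latin letters and typographic punctuation, and that ASCII spaces are removed along with the rest, so neighbouring words run together.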

If you want to keep only certain "whitelisted" characters, you need to specify precisely which codepoints to keep.

For example, to keep only Devanagari codepoints (the script used for writing Hindi), you can use the following in Python 3:

re.sub(r"[^\u0900-\u097F]+", "", mystring)

or, in Python 2 (thanks @bobince for the heads-up!):

re.sub(ur"[^\u0900-\u097F]+", "", mystring)
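Since the question also mentions Chinese, the same whitelist idea can be extended to more than one range. The following Python 3 sketch assumes that the CJK Unified Ideographs block (U+4E00-U+9FFF) is enough for the Chinese text at hand, which will not cover every Chinese character; the helper name `keep_whitelisted` is hypothetical.

```python
import re

# Assumed whitelist: the Devanagari block (U+0900-U+097F) from the answer,
# plus the CJK Unified Ideographs block (U+4E00-U+9FFF) as an illustrative
# addition for Chinese. Add further ranges to the class as needed.
NOT_WHITELISTED = re.compile(r"[^\u0900-\u097F\u4E00-\u9FFF]+")

def keep_whitelisted(text, replacement=""):
    """Replace every run of non-whitelisted characters with `replacement`."""
    return NOT_WHITELISTED.sub(replacement, text)

mixed = "News: \u0939\u093f\u0928\u094d\u0926\u0940 and \u4e2d\u6587, 2023"
print(keep_whitelisted(mixed))               # kept characters run together
print(keep_whitelisted(mixed, " ").strip())  # kept runs separated by a space
```

Passing a space as the replacement keeps the surviving runs apart, which is usually easier to read than one concatenated string.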

You do need to make sure that you're working on a Unicode string, so decode your input before matching and encode the result on output (Python 2 shown):

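import re
import urllib2  # Python 2 only

url = 'http://www.bhaskar.com/'
# Decode the raw bytes to a unicode string; "utf-8-sig" also strips a leading BOM.
data = urllib2.urlopen(url).read().decode("utf-8-sig")
regex = re.compile(ur"[^\u0900-\u097F]+")
# "foo" marks each removed run visibly; pass u"" to delete the runs instead.
hindionly = regex.sub("foo", data)
print hindionly.encode("utf-8")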
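For readers on Python 3, the decode step can be sketched without a network call as follows. This is only an illustration: the helper name `hindi_only` and the sample bytes are assumptions, and in practice the bytes would come from wherever the page or file is read.

```python
import re

# Same Devanagari whitelist as above, in Python 3 syntax.
NON_DEVANAGARI = re.compile(r"[^\u0900-\u097F]+")

def hindi_only(raw_bytes, encoding="utf-8-sig"):
    """Decode bytes to str, then keep only Devanagari runs (hypothetical helper)."""
    text = raw_bytes.decode(encoding)  # "utf-8-sig" drops a leading BOM if present
    # Replace each removed run with a space so kept words stay separated.
    return NON_DEVANAGARI.sub(" ", text).strip()

# Stand-in for downloaded data: UTF-8 bytes with a BOM and some markup.
raw = "\ufeff<p>\u0939\u093f\u0928\u094d\u0926\u0940</p>".encode("utf-8")
print(hindi_only(raw))
```

In Python 3, `print` accepts `str` directly, so an explicit `.encode("utf-8")` is only needed when writing bytes, for example to a file opened in binary mode.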

Other tips

Using the third-party regex module, you could express the pattern with Unicode script properties instead of explicit codepoint ranges (Python 2 syntax shown):

import regex
print(repr(regex.sub(ur'[^\p{Devanagari}\p{Han}]', u'', u'abc123\u0900'))) 
# u'\u0900'
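If installing the regex module is not an option, a rough standard-library approximation is to filter on Unicode character names. This is a different technique from the script properties above and only a sketch: the helper name `keep_by_name_prefix` and the chosen prefixes are assumptions, and name prefixes do not match script membership exactly (for example, CJK compatibility ideographs and radicals have other names).

```python
import unicodedata

def keep_by_name_prefix(text, prefixes=("DEVANAGARI", "CJK UNIFIED IDEOGRAPH")):
    """Keep characters whose Unicode name starts with one of `prefixes`.

    Hypothetical helper: approximates script filtering using character names.
    """
    kept = []
    for ch in text:
        # Unnamed characters (e.g. control codes) get the empty default name.
        name = unicodedata.name(ch, "")
        if name.startswith(prefixes):
            kept.append(ch)
    return "".join(kept)

print(ascii(keep_by_name_prefix("abc123\u0900")))
# '\u0900'
```

This walks the string one character at a time, so it is slower than a compiled pattern on large inputs.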

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow