''.join(filter(lambda character:ord(character) < 0x3000,my_unicode_string))
should work, I think ... it keeps only the characters whose code point is below 0x3000
or maybe you want to limit it to characters that fit in a single byte
''.join(filter(lambda character:ord(character) < 0x100,my_unicode_string))
basically it's pretty easy to filter out any ranges you want ... (realistically it's probably safe to keep only code points < 0x100)
for example
>>> test_text = u'\u30e62\u30fcX\u30ba\u30c9T'
>>> ''.join(filter(lambda character:ord(character) < 0x3000,test_text))
u'2XT'
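if you want the cutoff to be a parameter, the same filter can be wrapped in a small helper. This is only a sketch: the name keep_below and its default limit are made up for illustration, and it uses a generator expression rather than filter so it should behave the same on Python 2 and Python 3.

```python
def keep_below(text, limit=0x3000):
    """Return text with every character at or above `limit` dropped.

    keep_below is a hypothetical helper name, not part of any library.
    """
    # ord() gives the code point; keep only characters strictly below limit
    return u''.join(ch for ch in text if ord(ch) < limit)


test_text = u'\u30e62\u30fcX\u30ba\u30c9T'
print(repr(keep_below(test_text)))         # katakana removed, ASCII kept
print(repr(keep_below(u'caf\xe9', 0x80)))  # a limit of 0x80 keeps ASCII only
```

passing 0x100 instead of 0x80 would keep the accented character in the second call, since U+00E9 is below 0x100.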
with respect to your problem as linked in your question
dirtyname = parsed_body.xpath() ... #this returns a list ... not a string, so we will use our own list as a stand-in to demonstrate the issue
dirtyname = [u"Hello\u2345world"]
you were then calling unicode on that list
dirtyname = unicode(dirtyname)
now if you were to print the repr as I suggested in my comment you would see
>>> print repr(dirtyname)
u"[u'Hello\\u2345world']"
>>> for item in dirtyname:
...     print item
[
u
'
H
#and so on
notice that it is now just a string ... it is not a list, and there are no non-ASCII characters left in it, since the \u2345 has become a literal escaped backslash followed by u2345
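here is a small standalone sketch of the same mistake, written so it should run on both Python 2 and Python 3. The names text_type, dirty_list and stringified are just stand-ins for illustration; note that on Python 3 the stringified list keeps the real character rather than an escape, but it is still a string of the list's repr and not the element itself.

```python
text_type = type(u'')                # unicode on Python 2, str on Python 3

dirty_list = [u'Hello\u2345world']   # stand-in for the xpath() result
stringified = text_type(dirty_list)  # the mistake: converts the whole list

# Iterating the stringified list yields single characters, starting with
# the opening bracket of the list's repr, not the original element.
first_chars = list(stringified)[:1]
print(first_chars)

# Indexing the list instead gives back the real text, untouched.
element = dirty_list[0]
print(len(element))
print(len(stringified))
```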
you can easily fix this, by simply getting the element in the array rather than the whole array .... parsed_body.xpath(...)[0]
>>> dirtyname = parsed_body.xpath("//h1[contains(@class, 'b-ttl-main')]/text()")[0]
>>> #notice that we got the unicode element that is in the array
>>> print repr(dirtyname)
u'Hello\u2345world'
>>> cleanname = ''.join(filter(lambda character:ord(character) < 0x100, dirtyname))
>>> print repr(cleanname)
u'Helloworld'
>>> #notice that everything is correctly filtered (the cutoff is 0x100 here, since u'\u2345' is below 0x3000 and would survive that filter)
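one caveat with indexing [0] is that it raises IndexError when the xpath matches nothing. A minimal sketch of a guard, assuming you want an empty string in that case: clean_first, its limit and its default are hypothetical names chosen for this example, and the list below stands in for the real xpath() result.

```python
def clean_first(results, limit=0x100, default=u''):
    """Filter the first item of a list of strings, or return default.

    clean_first is a hypothetical helper, not part of lxml.
    """
    if not results:
        # nothing matched, so there is no element to index
        return default
    # keep only characters whose code point is strictly below limit
    return u''.join(ch for ch in results[0] if ord(ch) < limit)


results = [u'Hello\u2345world']   # stand-in for parsed_body.xpath(...)
print(repr(clean_first(results)))
print(repr(clean_first([])))
```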