Question

I have a script that gathers some text elements from the web; the content in question is machine translated and leaves remnants in a mix of the original language and English. I would like to strip out any non-Latin characters, but I haven't been able to find a good sub to do so. Here is an example of the string and desired output: I want to remove this: \u30e6\u30fc\u30ba\u30c9 but keep everything else. >> I want to remove this: but keep everything else.

Here is my current code to demonstrate the problem:

import requests
from lxml import html
from pprint import pprint
import os
import re
import logging

# necessary to perform the HTTP GET request
header = {'User-Agent' : 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36', 'Accept' : 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8', 'Accept-Language' : 'en-US,en;q=0.8', 'Cookie' : 'search_layout=grid; search.ab=test-A' }

def main():
    # get page content
    response = requests.get('http://global.rakuten.com/en/store/wanboo/item/w690-3/', headers=header)
    # return parsed body for the lxml module to process
    parsed_body = html.fromstring(response.text)
    # get the title tag
    dirtyname = unicode(parsed_body.xpath("//h1[contains(@class, 'b-ttl-main')]/text()"))
    # test that this tag returns undesired unicode output for the japanese characters
    print dirtyname
    # attempt to clean the unicode using a custom filter to remove any characters in this particular range
    clean_name = ''.join(filter(lambda character:ord(character) < 0x3000, unicode(dirtyname)))
    # output of the filter should return no unicode characters but currently does not
    print clean_name
    # the remainder of the script is unnecessary for the problem in question, so I have removed it

if __name__ == '__main__':
    main()

Solution

''.join(filter(lambda character:ord(character) < 0x3000,my_unicode_string))

I think that would work ...

or maybe you want to limit it to single-byte characters

''.join(filter(lambda character:ord(character) < 0xff,my_unicode_string))

Basically, it's pretty easy to filter out any ranges you want ... (realistically, it's probably safe to keep only codepoints below 0x100)

for example

>>> test_text = u'\u30e62\u30fcX\u30ba\u30c9T'
>>> ''.join(filter(lambda character:ord(character) < 0x3000,test_text))
u'2XT'
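Since the question mentions looking for a good sub, the same filtering can also be written as a regex substitution. This is just a sketch; the kept range (everything below U+3000) is an assumption matching the cutoff used above:

```python
import re

test_text = u'\u30e62\u30fcX\u30ba\u30c9T'

# remove every character at codepoint U+3000 or above by
# keeping only the lower range in a negated character class
clean = re.sub(u'[^\u0000-\u2fff]', '', test_text)
```

This produces the same result as the `filter`/`ord` approach, and the character class can be narrowed (e.g. to `\u0000-\u00ff`) if you only want single-byte characters.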

With respect to the problem in your question:

dirtyname = parsed_body.xpath(...)  # this returns a list, not a string, so we will use our own list as a stand-in to demonstrate the issue


dirtyname = [u"Hello\u2345world"]

you were then calling unicode on that list

dirtyname = unicode(dirtyname)

Now, if you were to print the repr as I suggested in my comment, you would see:

>>> print repr(dirtyname)
u'[u"Hello\\u2345world"]' 
>>> for item in dirtyname:
...    print item
[
u
"
H
#and so on 

Notice that it is now just a string ... it is not a list, and there are no unicode characters in the string, since the backslash is escaped.
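To see why the filter appeared to do nothing, you can run it on that repr string directly (a sketch; the literal below is what `unicode()` on the list yields in Python 2):

```python
# the list's repr, as produced by unicode([u"Hello\u2345world"]) in Python 2
stringified = u'[u"Hello\\u2345world"]'

filtered = u''.join(filter(lambda c: ord(c) < 0x3000, stringified))

# every character of the repr (brackets, quotes, a literal backslash,
# ASCII letters and digits) is already below U+3000, so nothing is removed
```

The escape sequence `\u2345` survives as six plain ASCII characters, which is exactly the symptom in the question.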

You can easily fix this by taking the element out of the list rather than using the whole list .... parsed_body.xpath(...)[0]

>>> dirtyname = parsed_body.xpath("//h1[contains(@class, 'b-ttl-main')]/text()")[0]
>>> #notice that we got the unicode element that is in the array
>>> print repr(dirtyname)
u'Hello\u2345world'
>>> clean_name = ''.join(filter(lambda character:ord(character) < 0x3000, dirtyname))
>>> print repr(clean_name)
u'Helloworld'
>>> #notice that everything is correctly filtered 
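Putting the fix together, the whole cleanup can be sketched as a small helper (the name `strip_above` and its default cutoff are my own, not from the original code):

```python
def strip_above(text, limit=0x3000):
    # keep only characters whose codepoint is below `limit`
    return u''.join(ch for ch in text if ord(ch) < limit)

# the example string from the question; the Japanese characters
# sit at U+30BA etc., above the cutoff, so they are dropped
clean = strip_above(u'I want to remove this: \u30e6\u30fc\u30ba\u30c9 but keep everything else.')
```

In the real script you would call this on `parsed_body.xpath(...)[0]`, i.e. on the unicode element itself, not on the list.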
Licensed under: CC-BY-SA with attribution