Question

I have a script that gathers some text elements from the web; the content in question is machine translated and leaves remnants in a mix of the original language and English. I would like to strip out any non-Latin characters, but I haven't been able to find a good sub to do so. Here is an example of the string and desired output: I want to remove this: \u30e6\u30fc\u30ba\u30c9 but keep everything else. >> I want to remove this: but keep everything else.

Here is my current code to demonstrate the problem:

import requests
from lxml import html
from pprint import pprint
import os
import re
import logging

# necessary to perform the HTTP GET request
header = {'User-Agent' : 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36', 'Accept' : 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8', 'Accept-Language' : 'en-US,en;q=0.8', 'Cookie' : 'search_layout=grid; search.ab=test-A' }

def main():
    # get page content
    response = requests.get('http://global.rakuten.com/en/store/wanboo/item/w690-3/', headers=header)
    # return parsed body for the lxml module to process
    parsed_body = html.fromstring(response.text)
    # get the title tag
    dirtyname = unicode(parsed_body.xpath("//h1[contains(@class, 'b-ttl-main')]/text()"))
    # test that this tag returns undesired unicode output for the japanese characters
    print dirtyname
    # attempt to clean the unicode using a custom filter to remove any characters in this particular range
    clean_name = ''.join(filter(lambda character:ord(character) < 0x3000, unicode(dirtyname)))
    # output of the filter should return no unicode characters but currently does not
    print clean_name
    # the remainder of the script is unnecessary for the problem in question, so I have removed it

if __name__ == '__main__':
    main()

Solution

''.join(filter(lambda character:ord(character) < 0x3000,my_unicode_string))

I think that would work ...

or maybe you want to limit it to single-byte characters

''.join(filter(lambda character:ord(character) < 0xff,my_unicode_string))

Basically, it's pretty easy to filter out any ranges you want ... (realistically, it's probably safe to keep only codepoints below 0x100)

for example

>>> test_text = u'\u30e62\u30fcX\u30ba\u30c9T'
>>> ''.join(filter(lambda character:ord(character) < 0x3000,test_text))
u'2XT'
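Since the question mentions looking for a good sub, the same filtering can also be written as a regex substitution. This is just a sketch; the kept range (everything below U+3000) is an assumption matching the cutoff used above:

```python
import re

test_text = u'\u30e62\u30fcX\u30ba\u30c9T'

# remove every character at codepoint U+3000 or above by
# keeping only the lower range in a negated character class
clean = re.sub(u'[^\u0000-\u2fff]', '', test_text)
```

This produces the same result as the `filter`/`ord` approach, and the character class can be narrowed (e.g. to `\u0000-\u00ff`) if you only want single-byte characters.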

With respect to the problem in your question:

dirtyname = parsed_body.xpath(...)  # this returns a list, not a string, so we will use our own list as a stand-in to demonstrate the issue


dirtyname = [u"Hello\u2345world"]

you were then calling unicode on that list

dirtyname = unicode(dirtyname)

Now, if you were to print the repr as I suggested in my comment, you would see:

>>> print repr(dirtyname)
u'[u"Hello\\u2345world"]' 
>>> for item in dirtyname:
...    print item
[
u
"
H
#and so on 

Notice that it is now just a string ... it is not a list, and there are no unicode characters in the string, since the backslash is escaped.
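To see why the filter appeared to do nothing, you can run it on that repr string directly (a sketch; the literal below is what `unicode()` on the list yields in Python 2):

```python
# the list's repr, as produced by unicode([u"Hello\u2345world"]) in Python 2
stringified = u'[u"Hello\\u2345world"]'

filtered = u''.join(filter(lambda c: ord(c) < 0x3000, stringified))

# every character of the repr (brackets, quotes, a literal backslash,
# ASCII letters and digits) is already below U+3000, so nothing is removed
```

The escape sequence `\u2345` survives as six plain ASCII characters, which is exactly the symptom in the question.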

You can easily fix this by taking the element out of the list rather than using the whole list .... parsed_body.xpath(...)[0]

>>> dirtyname = parsed_body.xpath("//h1[contains(@class, 'b-ttl-main')]/text()")[0]
>>> #notice that we got the unicode element that is in the array
>>> print repr(dirtyname)
u'Hello\u2345world'
>>> clean_name = ''.join(filter(lambda character:ord(character) < 0x3000, dirtyname))
>>> print repr(clean_name)
u'Helloworld'
>>> #notice that everything is correctly filtered 
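Putting the fix together, the whole cleanup can be sketched as a small helper (the name `strip_above` and its default cutoff are my own, not from the original code):

```python
def strip_above(text, limit=0x3000):
    # keep only characters whose codepoint is below `limit`
    return u''.join(ch for ch in text if ord(ch) < limit)

# the example string from the question; the Japanese characters
# sit at U+30BA etc., above the cutoff, so they are dropped
clean = strip_above(u'I want to remove this: \u30e6\u30fc\u30ba\u30c9 but keep everything else.')
```

In the real script you would call this on `parsed_body.xpath(...)[0]`, i.e. on the unicode element itself, not on the list.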
Licensed under: CC-BY-SA with attribution