Question

I am trying to scrape cell phone numbers from a website. A number on the page looks like this:

+971553453301‪ 

Here is the piece of code for the task:

try:
    phone=soup.find("div", "phone-content")
    for a in phone:
        phone_result= str(a).get_text().strip().encode("utf-8")
    print "Phone information:", phone_result
except StandardError as e:
    phone_result="Error was {0}".format(e)
    print phone_result

The error I am getting is:

'ascii' codec can't encode character u'\u202a' in position 54: ordinal not in range(128)

Any help?


Solution

Several things are off in this line of code:

phone_result= str(a).get_text().strip().encode("utf-8")

First of all, BeautifulSoup works with unicode, so in Python 2 casting its text to str is error-prone. I think that is where the mistake is, because even if the cast works, you are then calling get_text() on a str object, which will raise AttributeError.

Finally, you call encode on the str, which in Python 2 is already encoded. That can fail because Python 2 will first decode it (with the default encoding, ASCII) and then encode it again.
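The implicit round trip can be spelled out explicitly. This is only a sketch: reencode is a hypothetical helper that imitates what Python 2 does behind the scenes when encode is called on a byte string, and the sample text is the number from the question followed by the u'\u202a' character from the error message.

```python
# -*- coding: utf-8 -*-
# Unicode text, as Beautiful Soup hands it back.
text = u"+971553453301\u202a"
# Encoding once is fine; the result is a byte string (a Python 2 str).
encoded = text.encode("utf-8")


def reencode(byte_string):
    """Hypothetical helper: imitate Python 2's str.encode("utf-8").

    Python 2 first decodes the bytes with the default ASCII codec,
    then encodes the result again.
    """
    return byte_string.decode("ascii").encode("utf-8")


try:
    reencode(encoded)
    failed = False
except UnicodeDecodeError:
    # The UTF-8 bytes of u"\u202a" are outside the ASCII range.
    failed = True
```

Pure ASCII bytes survive the round trip, which is why this kind of bug often goes unnoticed until a non-ASCII character shows up in the page.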

So try this fix, assuming the web page is encoded in UTF-8:

phone_result= a.get_text().strip().encode("utf-8")
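Note that the u'\u202a' in the error message is U+202A (LEFT-TO-RIGHT EMBEDDING), an invisible formatting character that follows the number on the page. If you don't want it in the result, something like the following might work. It is a minimal sketch, and clean_phone is a name I made up for illustration, not part of BeautifulSoup.

```python
# -*- coding: utf-8 -*-
import unicodedata


def clean_phone(text):
    """Hypothetical helper: drop invisible format characters, then trim.

    Characters in Unicode category "Cf" (format), such as U+202A,
    are removed; surrounding whitespace is stripped afterwards.
    """
    visible = u"".join(ch for ch in text if unicodedata.category(ch) != "Cf")
    return visible.strip()


raw = u"+971553453301\u202a "   # the number as it appears in the question
cleaned = clean_phone(raw)
encoded = cleaned.encode("utf-8")  # safe: encoding unicode text directly
```

In the loop from the question this would be applied to a.get_text() before encoding.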

There is also a problem with this line:

phone=soup.find("div", "phone-content")

find returns only a single result, a Tag object. You are better off using find_all, which returns a list of Tag objects. The difference is that when you iterate over a single Tag you get its children, which can be NavigableString objects, and those don't have a get_text method. When you iterate over the list returned by find_all, each item is a Tag, which does have get_text.
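A small sketch of the find_all approach, assuming the bs4 package and the built-in "html.parser" backend. The HTML snippet and the second phone number are made-up sample data, since the real page is not shown in the question.

```python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

# Made-up sample markup; only the "phone-content" class comes from the question.
html = u"""
<div class="phone-content"><span>+971553453301</span></div>
<div class="phone-content"><span>+971500000000</span></div>
"""
soup = BeautifulSoup(html, "html.parser")

# find returns the first matching Tag only.
first = soup.find("div", "phone-content")

# find_all returns every matching Tag, so each item has get_text().
numbers = [div.get_text().strip() for div in soup.find_all("div", "phone-content")]
```

If the page only ever has one such div, calling first.get_text() directly, without the loop, should work just as well.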

Hope this helps!

Other suggestions

Try replacing str(a) with unicode(a) and skip the .encode()

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow