Question

I am trying to figure out how to handle a UnicodeError in Python and skip the URL that caused it. I assume I need a try / except UnicodeError structure, but I don't know what to put in the except block to skip that URL and continue scraping. Here is the traceback:

  File "imagescraper.py", line 24, in <module>
    urllib.urlretrieve(image, "image0"+str(page)+str(i)+".jpg")
  File "/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 94, in urlretrieve
    return _urlopener.retrieve(url, filename, reporthook, data)
  File "/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 228, in retrieve
    url = unwrap(toBytes(url))
  File "/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 1055, in toBytes
    " contains non-ASCII characters")
UnicodeError: URL u'http://blogging.com/wp-content/uploads/2013/11/design-p\xe1gina-de-fans.png' contains non-ASCII characters

Any ideas?

Solution

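Wrap the call that fails in a try block and catch the exception. Something like this seems to be what you want:

try:
    urllib.urlretrieve(image, "image0"+str(page)+str(i)+".jpg")
except UnicodeError:
    pass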

pass is just a placeholder statement that does nothing. The except block handles the exception, and execution then carries on after the try statement, so your script can move on to the next URL.

Note that if you are doing this inside a loop, you can use the continue keyword instead: it jumps straight to the next iteration of the innermost enclosing loop, skipping whatever code remains in the loop body.
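To make the loop version concrete, here is a minimal sketch. The function name download_images, the retrieve parameter and the "image%d.jpg" naming scheme are made up for illustration and are not from the question. Passing the download function in as an argument is only there so the sketch can be tried without network access.

```python
def download_images(urls, retrieve):
    """Call retrieve(url, filename) for every URL, skipping URLs that raise UnicodeError.

    `retrieve` is a hypothetical stand-in for a download function such as
    Python 2's urllib.urlretrieve. Returns (downloaded_filenames, skipped_urls).
    """
    downloaded = []
    skipped = []
    for i, url in enumerate(urls):
        filename = "image%d.jpg" % i  # illustrative naming scheme only
        try:
            retrieve(url, filename)
        except UnicodeError:
            skipped.append(url)
            continue  # go straight to the next URL
        # Only reached when retrieve() did not raise
        downloaded.append(filename)
    return downloaded, skipped
```

On Python 2 you would presumably call it as download_images(image_urls, urllib.urlretrieve). Keeping the skipped URLs in a list lets you see afterwards which ones were dropped instead of losing them silently.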

OTHER TIPS

Instead of skipping the URL, just encode it to a valid URL:

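# Python 2 modules, matching the traceback in the question
import urllib, urlparse

# Split the URL, percent-encode the UTF-8 bytes of the path, then reassemble it
parts = urlparse.urlsplit(image)
parts = parts._replace(path=urllib.quote(parts.path.encode('utf8')))
image = parts.geturl()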

This turns:

http://blogging.com/wp-content/uploads/2013/11/design-página-de-fans.png

into

http://blogging.com/wp-content/uploads/2013/11/design-p%C3%A1gina-de-fans.png
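
If you want to reuse this, the same steps can be wrapped in a small helper. This is only a sketch: the name encode_url_path is made up, and the import fallback is my addition so that the snippet also runs on Python 3, where quote and urlsplit live in urllib.parse.

```python
try:
    from urllib import quote          # Python 2
    from urlparse import urlsplit
except ImportError:
    from urllib.parse import quote, urlsplit  # Python 3


def encode_url_path(url):
    """Return url with its path percent-encoded as UTF-8; other parts are left alone."""
    parts = urlsplit(url)
    # quote() leaves '/' unescaped by default, so the path structure is preserved
    path = quote(parts.path.encode('utf8'))
    return parts._replace(path=path).geturl()
```

You would then write image = encode_url_path(image) before the urlretrieve call. Two caveats: quote also escapes '%', so a path that is already percent-encoded would be encoded a second time, and this sketch does not touch non-ASCII characters in the query string or host name.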
Licensed under: CC-BY-SA with attribution