How to skip Unicode Errors in a url
21-12-2019
Question
I am trying to figure out how to handle Unicode errors in Python and skip them. I guess I have to use a try / except UnicodeError structure, but I don't know what to put in the except UnicodeError block to skip that URL and continue scraping. Here is the traceback I get:
  File "imagescraper.py", line 24, in <module>
    urllib.urlretrieve(image, "image0"+str(page)+str(i)+".jpg")
  File "/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 94, in urlretrieve
    return _urlopener.retrieve(url, filename, reporthook, data)
  File "/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 228, in retrieve
    url = unwrap(toBytes(url))
  File "/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 1055, in toBytes
    " contains non-ASCII characters")
UnicodeError: URL u'http://blogging.com/wp-content/uploads/2013/11/design-p\xe1gina-de-fans.png' contains non-ASCII characters
Any ideas?
Solution
Something like this seems to be what you want:
try:
    urllib.urlretrieve(image, "image0"+str(page)+str(i)+".jpg")
except UnicodeError:
    pass
pass is basically just a placeholder; it does nothing. The exception is handled by the except block, and pass then lets you move on to your next URL.
Note that if you are doing this within a loop, use the continue keyword instead: it skips the rest of the current iteration and moves straight on to the next cycle of the enclosing loop.
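As a minimal sketch of the loop version described above: the names download_all, fetch and fake_fetch are hypothetical and not part of the original script. fetch stands for whatever does the download (urllib.urlretrieve in the question); fake_fetch is only a stand-in that raises a UnicodeError subclass for non-ASCII URLs, so the example runs without network access.

```python
def download_all(urls, fetch):
    """Try every URL; skip the ones that raise UnicodeError."""
    downloaded = []
    skipped = []
    for i, url in enumerate(urls):
        filename = "image0%d.jpg" % i
        try:
            fetch(url, filename)
        except UnicodeError:
            # Remember the bad URL and jump to the next iteration.
            skipped.append(url)
            continue
        # Reached only when fetch() did not raise.
        downloaded.append(filename)
    return downloaded, skipped


def fake_fetch(url, filename):
    # Hypothetical stand-in for the real download call: encoding a
    # non-ASCII URL as ASCII raises a subclass of UnicodeError.
    url.encode("ascii")
```

Because continue sits inside the except block, the line that records a successful download never runs for a URL that failed, and the loop carries on with the remaining URLs.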
OTHER TIPS
Instead of skipping the URL, just encode it to a valid URL:
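import urllib, urlparse
# Split the URL so that only the path component gets quoted
parts = urlparse.urlsplit(image)
# Encode the path as UTF-8, then percent-escape anything that is not URL-safe
parts = parts._replace(path=urllib.quote(parts.path.encode('utf8')))
image = parts.geturl()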
This turns:
http://blogging.com/wp-content/uploads/2013/11/design-página-de-fans.png
into
http://blogging.com/wp-content/uploads/2013/11/design-p%C3%A1gina-de-fans.png
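The snippet above targets Python 2. On Python 3 the same functions live in urllib.parse, so a rough equivalent might look like the sketch below. The helper name quote_url_path is hypothetical, and adding "%" to the safe characters is an assumption made here so that a path which is already percent-encoded is not escaped a second time.

```python
from urllib.parse import quote, urlsplit


def quote_url_path(url):
    """Percent-encode non-ASCII characters in the path of a URL."""
    parts = urlsplit(url)
    # quote() encodes str input as UTF-8 by default. "/" keeps the path
    # separators intact and "%" leaves existing escapes untouched.
    quoted_path = quote(parts.path, safe="/%")
    return parts._replace(path=quoted_path).geturl()
```

Only the path is rewritten; the scheme, host and query string are passed through unchanged, so a non-ASCII host name or query would still need separate handling.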
Licensed under: CC-BY-SA with attribution