Python urllib, minidom and parsing international characters

https://stackoverflow.com/questions/1407874

05-07-2019

Question

When I try to retrieve information from the Google weather API with the following URL,

http://www.google.com/ig/api?weather=Munich,Germany&hl=de

and then try to parse it with minidom, I get an error saying that the document is not well-formed.

I use the following code:

import urllib
from xml.dom import minidom
sock = urllib.urlopen(url)  # the URL mentioned above
doc = minidom.parse(sock)   # fails: document is not well-formed

I think the German characters in the response are the cause of the error.

What is the correct way to do this?

Solution

According to Python's urllib.urlopen, the encoding sent in the headers is iso-8859-1 (although Firefox's Live HTTP Headers seems to disagree in this case and reports utf-8). The XML itself does not specify an encoding, which is why xml.dom.minidom assumes it is utf-8.
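To see the mismatch without relying on the live service, here is a minimal offline sketch. It is written for Python 3 (the code in this thread is Python 2), and the sample XML is a made-up stand-in, not the real API response. It only illustrates that bytes encoded as iso-8859-1 with no XML encoding declaration are rejected, while the same text encoded as utf-8 parses.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Hypothetical sample document; NOT the actual Google weather response.
text = u'<weather><city data="München"/></weather>'

latin1_bytes = text.encode("iso-8859-1")  # what the server apparently sent
utf8_bytes = text.encode("utf-8")         # what minidom assumes by default


def parses(raw):
    """Return True if minidom accepts the raw bytes, False otherwise."""
    try:
        minidom.parseString(raw)
        return True
    except ExpatError:
        return False


latin1_ok = parses(latin1_bytes)  # the lone 0xFC byte is not valid UTF-8
utf8_ok = parses(utf8_bytes)
```

Under these assumptions, only the utf-8 bytes are accepted, which matches the "not well-formed" error described in the question.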

So the following should fix this specific issue:

import urllib
from xml.dom import minidom

sock = urllib.urlopen('http://www.google.com/ig/api?weather=Munich,Germany&hl=de')
s = sock.read()
# Take the charset from the Content-Type header (iso-8859-1 here).
encoding = sock.headers['Content-type'].split('charset=')[1]
# Re-encode as UTF-8, which minidom assumes when the XML declares no encoding.
doc = minidom.parseString(s.decode(encoding).encode('utf-8'))
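Splitting the header on 'charset=' raises an IndexError when the server sends no charset. As a hedged alternative, the sketch below (Python 3, standard library only) lets email.message.Message parse the Content-Type value. The helper names charset_from_content_type and parse_xml_bytes are hypothetical, the utf-8 fallback is an assumption, and the sample bytes are made up so the example runs without network access.

```python
from email.message import Message
from xml.dom import minidom


def charset_from_content_type(content_type, default="utf-8"):
    """Extract the charset parameter from a Content-Type header value.

    Falls back to `default` (an assumption) when no charset is given.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset() or default


def parse_xml_bytes(raw, content_type):
    """Decode raw bytes with the header charset, then parse them as UTF-8.

    Assumes the XML carries no encoding declaration of its own, as in
    the case discussed above.
    """
    encoding = charset_from_content_type(content_type)
    text = raw.decode(encoding)
    return minidom.parseString(text.encode("utf-8"))


# Hypothetical payload standing in for the server response.
raw = u'<weather><city data="München"/></weather>'.encode("iso-8859-1")
doc = parse_xml_bytes(raw, "text/xml; charset=ISO-8859-1")
city = doc.getElementsByTagName("city")[0].getAttribute("data")
```

In Python 3 the headers object returned by urllib.request.urlopen is itself a Message subclass, so get_content_charset() can be called on it directly.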

Edit: I've updated this answer after Glenn Maynard's comment. I also took the liberty of borrowing one line from Lennert Regebro's answer.

OTHER TIPS

This seems to work:

import urllib
from xml.dom import minidom

sock = urllib.urlopen(url)
# There is a nicer way for this, but I don't remember right now:
encoding = sock.headers['Content-type'].split('charset=')[1]
data = sock.read()
# Replace every non-ASCII character with a numeric character reference (&#...;).
dom = minidom.parseString(data.decode(encoding).encode('ascii', 'xmlcharrefreplace'))

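I initially guessed that minidom can't handle non-ASCII input, but the underlying problem is more likely that the raw bytes don't match the encoding minidom assumes; replacing non-ASCII characters with numeric character references sidesteps this. You might also want to look into lxml.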
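As a rough illustration of what the 'xmlcharrefreplace' error handler does, the following Python 3 sketch encodes a made-up sample document (not the real API response) to pure ASCII and parses it. The parser resolves the character references again, so no text is lost.

```python
from xml.dom import minidom

# Hypothetical sample document; NOT the actual Google weather response.
text = u'<weather><city data="München"/></weather>'

# Non-ASCII characters become numeric character references such as &#252;
ascii_bytes = text.encode("ascii", "xmlcharrefreplace")

# The result is plain ASCII, so the encoding minidom assumes no longer matters.
doc = minidom.parseString(ascii_bytes)
city = doc.getElementsByTagName("city")[0].getAttribute("data")
```

After parsing, city holds the original string with the umlaut restored.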

Licensed under: CC-BY-SA with attribution
