Question

When accessing Glosbe.com via their API, the following code doesn't decode special characters or apostrophes.

As an example, it prints an HTML-escaped form (such as perch&#233;) instead of perché. The website source declares the charset as utf-8. Any ideas?

# -*- coding: utf-8 -*-
import urllib.request
import json

url = 'http://glosbe.com/gapi/translate?from=fra&dest=eng&format=json&phrase=chat&pretty=true'

weburl = urllib.request.urlopen(url)
data = weburl.read().decode('utf-8')

theJSON = json.loads(data)
print(theJSON)

Solution

That site appears to return data containing HTML entities, so the UTF-8 decoding itself is fine and a separate unescaping step is needed. Decode the HTML entities with:

# HTMLParser.unescape() was deprecated in Python 3.4 and removed in 3.9;
# html.unescape() is the supported replacement.
from html import unescape

def unescape_entities(value):
    return unescape(value)

def process(ob):
    if isinstance(ob, list):
        return [process(v) for v in ob]
    elif isinstance(ob, dict):
        return {k: process(v) for k, v in ob.items()}
    elif isinstance(ob, str):
        return unescape_entities(ob)
    return ob

theJSON = process(theJSON)
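
To see what the unescaping step does to a single string, the standard library's html.unescape (available since Python 3.4) can be called directly. This is only a minimal sketch: the sample strings are made up for illustration and are not taken from the API response.

```python
from html import unescape

# Hypothetical sample strings covering named, decimal, and hex entities;
# these are assumptions for illustration, not actual API output.
samples = ["perch&eacute;", "l&#39;&eacute;tat", "l&#x27;&eacute;tat"]

# unescape() replaces each character reference with the character it names.
decoded = [unescape(s) for s in samples]
print(decoded)
```

Strings without any entities pass through unchanged, so applying this to every string in the parsed JSON should be harmless.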

Demo:

>>> theJSON['tuc'][0]['meanings'][-1]
{'language': 'fra', 'text': 'Mammifère carnivore, félin de taille moyenne au museau court et arrondi, domestiqué ou encore à l&#39;état sauvage (Felis silvestris).'}
>>> theJSON = process(theJSON)
>>> theJSON['tuc'][0]['meanings'][-1]
{'language': 'fra', 'text': "Mammifère carnivore, félin de taille moyenne au museau court et arrondi, domestiqué ou encore à l'état sauvage (Felis silvestris)."}
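
The same idea can be tried without a network call. The sketch below assumes a hand-written JSON string that only imitates the layout seen above (a 'tuc' list with 'meanings' entries); the payload is hypothetical, and html.unescape stands in for the unescaping helper.

```python
import json
from html import unescape

def process(ob):
    """Recursively unescape HTML entities in every string value."""
    if isinstance(ob, list):
        return [process(v) for v in ob]
    if isinstance(ob, dict):
        return {k: process(v) for k, v in ob.items()}
    if isinstance(ob, str):
        return unescape(ob)
    return ob  # numbers, booleans, and None are returned unchanged

# Hypothetical payload imitating the response layout; not real API output.
raw = ('{"tuc": [{"meanings": [{"language": "fra", '
       '"text": "&agrave; l&#39;&eacute;tat sauvage"}]}]}')

cleaned = process(json.loads(raw))
print(cleaned["tuc"][0]["meanings"][-1]["text"])
```

Running the unescape step after json.loads, rather than on the raw response text, keeps an entity such as &quot; from turning into a bare quote that would break the JSON syntax.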
Licensed under: CC-BY-SA with attribution