Question

I am trying to retrieve JSON from the MusicBrainz API service. The returned data for some songs contains Unicode escape sequences, and I am having trouble converting them to the actual characters. Kindly let me know what I should be doing here.

JSON:

{
    "status": "ok",
    "results": [{
        "recordings": [{
            "duration": 402,
            "tracks": [{
                "duration": 402,
                "position": 6,
                "medium": {
                    "release": {
                        "id": "dde6ecee-8e9b-4b46-8c28-0f8d659f83ac",
                        "title": "Tecno Fes, Volume 2"
                    },
                    "position": 1,
                    "track_count": 11
                },
                "artists": [{
                    "id": "57c1e5ea-e08f-413a-bcb1-f4e4b675bead",
                    "name": "Gigi D\u2019Agostino"
                }],
                "title": "You Spin Me Round"
            }],
            "id": "2e0a7bce-9e44-4a63-a789-e8c4d2a12af9"
        }, ....

Failed Code (example):

string = '\u0420\u043e\u0441\u0441\u0438\u044f'
print string.encode('utf-8')

I am running this on a Windows 7 machine with Python 2.7, from a command-line terminal. The output I get is below:

C:\Python27>python junk.py
Gigi DGÇÖAgostino
Gigi D?Agostino
Gigi D\u2019Agostino

I am expecting the output to be Gigi D’Agostino


Solution

Are you using cmd on Windows? In that case, getting Unicode to display correctly at all takes a bit of a hack. You might want to use another "terminal" to test your scripts: MSYS provides a nice terminal/shell, and IDLE, which is included in the Windows Python distribution, has a Python shell (right click, open in IDLE, F5).
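To illustrate why the first two output lines in the question look garbled, here is a rough sketch. It assumes the console was interpreting bytes as code page 437, which is only a guess based on the GÇÖ in the output; the variable names are made up for the example.

```python
name = u"Gigi D\u2019Agostino"

# UTF-8 encodes U+2019 as the three bytes E2 80 99.
utf8_bytes = name.encode("utf-8")

# Assumption: the console reads those bytes as code page 437,
# so each of the three bytes is shown as a separate character.
misread = utf8_bytes.decode("cp437")

# Encoding to a charset that lacks U+2019 with errors="replace"
# substitutes a question mark instead.
replaced = name.encode("ascii", "replace")

print(len(name), len(misread))  # the misread text is two characters longer
```

Neither result means the data is wrong; the bytes are simply being displayed with an encoding the console does not expect.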

If you really want to make it work in the cmd:

You have to set Lucida Console as the font in cmd. Then:

> chcp
Active code page: 850
> chcp 65001

Then you should have Unicode output in the cmd. Your "Active code page" might be different. Note it down somewhere, because you might want to change it back afterwards:

> chcp 850

Otherwise you will run into other problems (for example, starting .bat files doesn't work). (See also batch-file-encoding)

In your script you also need this:

import codecs

def cp65001(name):
    """This might be buggy, but better than just a LookupError
    """
    if name.lower() == "cp65001":
        return codecs.lookup("utf-8")

codecs.register(cp65001)

Otherwise Python will crash. (See windows-cmd-encoding-change-causes-python-crash)

I had a similar bug report for my script.


You might also consider using a library to access the MusicBrainz Web Service. Python-musicbrainzngs works with the current ws/2.

Other tips

Unicode escapes such as \u2019 are only interpreted in unicode strings; in a regular (byte) string they stay as literal backslash sequences. To convert your regular string to unicode, use str.decode('unicode-escape'):

In [1]: s='\u0420\u043e\u0441\u0441\u0438\u044f'

In [2]: s
Out[2]: '\\u0420\\u043e\\u0441\\u0441\\u0438\\u044f'

In [3]: s.decode('unicode-escape')
Out[3]: u'\u0420\u043e\u0441\u0441\u0438\u044f'

In [4]: print s.decode('unicode-escape')
Россия

In [5]: s2="Gigi D\u2019Agostino"

In [6]: s2
Out[6]: 'Gigi D\\u2019Agostino'

In [7]: print s2.decode('unicode-escape')
Gigi D’Agostino
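The same decoding step can be sketched so that it runs on both Python 2.7 and Python 3, since the b prefix marks a byte string in either version. The variable names are illustrative only.

```python
# Byte string holding a literal backslash-u sequence, not a real character.
raw = b"Gigi D\\u2019Agostino"

# 'unicode-escape' interprets the \uXXXX sequence and returns a unicode string.
name = raw.decode("unicode-escape")

print(len(raw))   # 20: the escape occupies six bytes
print(len(name))  # 15: the escape collapsed into a single character
```

Note that 'unicode-escape' treats any non-ASCII bytes as Latin-1, so this approach only suits input that is plain ASCII plus escapes.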

You should use a JSON parser; any valid JSON parser returns Unicode strings. Your failing example shows a bytestring, i.e., you haven't used a JSON parser.

For example, to parse json data:

obj = json.load(urllib2.urlopen(request))

To pretty print obj without using Unicode escapes:

print json.dumps(obj, indent=4, ensure_ascii=False)
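As a self-contained sketch of both steps, the example below parses a small hard-coded payload with json.loads instead of reading from the network; the payload is a cut-down, made-up stand-in for the MusicBrainz response.

```python
import json

# A JSON document as it might arrive over the wire; \u2019 is a JSON escape.
payload = '{"artists": [{"name": "Gigi D\\u2019Agostino"}]}'

obj = json.loads(payload)
name = obj["artists"][0]["name"]  # a unicode string with the escape resolved

# ensure_ascii=False keeps the real character instead of re-escaping it.
pretty = json.dumps(obj, indent=4, ensure_ascii=False)

print(len(name))  # 15: the apostrophe is a single character
```

The parser resolves the escape on its own, so no manual decode('unicode-escape') call is needed once the data has gone through json.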

It is also useful to understand the difference between:

print unicode_string

And:

print repr(unicode_string)
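A small sketch of that difference, with illustrative variable names. The exact repr text depends on the Python version: on Python 2.7 it shows the \u2019 escape, which may be where the third output line in the question came from, while Python 3 shows the character itself.

```python
name = u"Gigi D\u2019Agostino"

# repr() produces a source-code-like representation wrapped in quotes.
# On Python 2.7 this is u'Gigi D\u2019Agostino' (with a visible escape);
# on Python 3 the character itself appears inside the quotes.
as_repr = repr(name)

# Either way the string itself holds one real character at that position.
print(name[6] == u"\u2019")       # True
print(len(as_repr) > len(name))   # True: repr adds quotes (and escapes on 2.7)
```

Printing the string sends its characters to the terminal, whereas printing the repr sends a description of the string, so an escape in the repr does not mean the data is broken.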
Licensed under: CC-BY-SA with attribution