How to get string objects instead of Unicode from JSON?

https://stackoverflow.com/questions/956867

12-09-2019

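Question

I'm using Python 2 to parse JSON from ASCII-encoded text files.

When loading these files with either json or simplejson, all my string values are cast to Unicode objects instead of string objects. The problem is that I have to use the data with some libraries that only accept string objects. I can't change the libraries or update them.

Is it possible to get string objects instead of Unicode ones?

Example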

>>> import json
>>> original_list = ['a', 'b']
>>> json_list = json.dumps(original_list)
>>> json_list
'["a", "b"]'
>>> new_list = json.loads(json_list)
>>> new_list
[u'a', u'b']  # I want these to be of type `str`, not `unicode`

Update

This question was asked a long time ago, when I was stuck with Python 2. One easy and clean solution today is to use a recent version of Python, i.e. Python 3 and later.
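As a minimal sketch of what the update describes, assuming a Python 3 interpreter (where `str` is the Unicode text type and the separate `unicode` type no longer exists), the example from the question round-trips to plain `str` values:

```python
import json

original_list = ['a', 'b']
json_list = json.dumps(original_list)

# On Python 3, json.loads returns ordinary str objects for JSON strings.
new_list = json.loads(json_list)
print(new_list)
print([type(item).__name__ for item in new_list])
```

Libraries that expect byte strings would still need an explicit `.encode(...)` step on Python 3.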

Solution

A solution with `object_hook`

import json

def json_load_byteified(file_handle):
    return _byteify(
        json.load(file_handle, object_hook=_byteify),
        ignore_dicts=True
    )

def json_loads_byteified(json_text):
    return _byteify(
        json.loads(json_text, object_hook=_byteify),
        ignore_dicts=True
    )

def _byteify(data, ignore_dicts = False):
    # if this is a unicode string, return its string representation
    if isinstance(data, unicode):
        return data.encode('utf-8')
    # if this is a list of values, return list of byteified values
    if isinstance(data, list):
        return [ _byteify(item, ignore_dicts=True) for item in data ]
    # if this is a dictionary, return dictionary of byteified keys and values
    # but only if we haven't already byteified it
    if isinstance(data, dict) and not ignore_dicts:
        return {
            _byteify(key, ignore_dicts=True): _byteify(value, ignore_dicts=True)
            for key, value in data.iteritems()
        }
    # if it's anything else, return it in its original form
    return data

Example usage:

>>> json_loads_byteified('{"Hello": "World"}')
{'Hello': 'World'}
>>> json_loads_byteified('"I am a top-level string"')
'I am a top-level string'
>>> json_loads_byteified('7')
7
>>> json_loads_byteified('["I am inside a list"]')
['I am inside a list']
>>> json_loads_byteified('[[[[[[[["I am inside a big nest of lists"]]]]]]]]')
[[[[[[[['I am inside a big nest of lists']]]]]]]]
>>> json_loads_byteified('{"foo": "bar", "things": [7, {"qux": "baz", "moo": {"cow": ["milk"]}}]}')
{'things': [7, {'qux': 'baz', 'moo': {'cow': ['milk']}}], 'foo': 'bar'}
>>> json_load_byteified(open('somefile.json'))
{'more json': 'from a file'}

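How does this work and why would I use it?

Mark Amery's function is shorter and clearer than these, so what's the point of them? Why would you want to use them?

Purely for performance. Mark's answer first decodes the JSON text fully with unicode strings, then recurses through the entire decoded value to convert all strings to byte strings. This has a couple of undesirable effects:

A copy of the entire decoded structure gets created in memory.
If your JSON object is really deeply nested (500 levels or more), you'll hit Python's maximum recursion depth.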
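The second point can be reproduced with a small sketch. It is written for Python 3 and uses a hypothetical `walk` function as a stand-in for any recursive post-processing step. The nesting depth is derived from `sys.getrecursionlimit()` rather than the 500 levels quoted above, since the exact threshold depends on the interpreter and on how many frames each level consumes.

```python
import sys

def walk(value):
    # Stand-in for a recursive converter: visits every nested list.
    if isinstance(value, list):
        return [walk(item) for item in value]
    return value

# Build a list nested deeper than the interpreter's recursion limit.
depth = sys.getrecursionlimit() + 100
nested = "leaf"
for _ in range(depth):
    nested = [nested]

try:
    walk(nested)
    hit_limit = False
except RecursionError:
    hit_limit = True

print(hit_limit)
```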

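This answer mitigates both of those performance issues by using the object_hook parameter of json.load and json.loads. From the docs:

object_hook is an optional function that will be called with the result of any object literal decoded (a dict). The return value of object_hook will be used instead of the dict. This feature can be used to implement custom decoders

Since dictionaries nested many levels deep in other dictionaries get passed to object_hook as they're decoded, we can byteify any strings or lists inside them at that point and avoid the need for deep recursion later.

Mark's answer isn't suitable for use as an object_hook as it stands, because it recurses into nested dictionaries. We prevent that recursion in this answer with the ignore_dicts parameter to _byteify, which gets passed to it at all times except when object_hook passes it a new dict to byteify. The ignore_dicts flag tells _byteify to ignore dicts since they have already been byteified.

Finally, our implementations of json_load_byteified and json_loads_byteified call _byteify (with ignore_dicts=True) on the result returned from json.load or json.loads, to handle the case where the JSON text being decoded doesn't have a dict at the top level.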
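The order in which the decoder calls the hook can be observed directly. The following is a small sketch for Python 3; `recording_hook` and the sample text are made up for the illustration and are not part of the answer's code.

```python
import json

calls = []

def recording_hook(decoded_dict):
    # Record the keys of each dict in the order the decoder hands them over.
    calls.append(sorted(decoded_dict))
    return decoded_dict

text = '{"outer": {"inner": {"leaf": 1}}, "items": [{"in_list": 2}]}'
json.loads(text, object_hook=recording_hook)
print(calls)

# A top-level value that is not a dict never reaches the hook at all.
calls_for_string = []
json.loads('"top-level string"', object_hook=calls_for_string.append)
print(calls_for_string)
```

The innermost dict arrives first and the top-level dict last, so each dict's nested dicts have already been through the hook when it is processed. The second call shows why the wrapper functions still apply _byteify to the final result: a top-level string or list is never passed to object_hook.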

Other tips

While there are some good answers here, I ended up using PyYAML to parse my JSON files, since it gives the keys and values as str type strings instead of unicode type. Because JSON is a subset of YAML, it works nicely:

>>> import json
>>> import yaml
>>> list_org = ['a', 'b']
>>> list_dump = json.dumps(list_org)
>>> list_dump
'["a", "b"]'
>>> json.loads(list_dump)
[u'a', u'b']
>>> yaml.safe_load(list_dump)
['a', 'b']

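Notes

Some things to note though:

I get string objects because all my entries are ASCII encoded. If I used Unicode-encoded entries, I would get them back as unicode objects: there is no conversion!
You should (probably always) use PyYAML's safe_load function; if you use it to load JSON files, you don't need the "additional power" of the load function anyway.
If you want a YAML parser that has more support for version 1.2 of the spec (and correctly parses very low numbers), try Ruamel YAML: pip install ruamel.yaml and import ruamel.yaml as yaml was all I needed in my tests.

Conversion

As stated, there is no conversion! If you can't be sure to only deal with ASCII values (and you can't be sure most of the time), better use a conversion function:

I have used the one from Mark Amery a couple of times now; it works great and is very easy to use. You can also use a similar function as an object_hook instead, as it might gain you a performance boost on big files. See the slightly more involved answer from Mirec Miskuf for that.

There's no built-in option to make the json module functions return byte strings instead of unicode strings. However, this short and simple recursive function will convert any decoded JSON object from using unicode strings to UTF-8-encoded byte strings: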

def byteify(input):
    if isinstance(input, dict):
        return {byteify(key): byteify(value)
                for key, value in input.iteritems()}
    elif isinstance(input, list):
        return [byteify(element) for element in input]
    elif isinstance(input, unicode):
        return input.encode('utf-8')
    else:
        return input

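Just call this on the output you get from a json.load or json.loads call.

A couple of notes:

To support Python 2.6 or earlier, replace return {byteify(key): byteify(value) for key, value in input.iteritems()} with return dict([(byteify(key), byteify(value)) for key, value in input.iteritems()]), since dictionary comprehensions weren't supported until Python 2.7.
Since this answer recurses through the entire decoded object, it has a couple of undesirable performance characteristics that can be avoided with very careful use of the object_hook or object_pairs_hook parameters. Mirec Miskuf's answer is so far the only one that manages to pull this off correctly, although as a consequence it is significantly more complicated than my approach.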
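For readers on Python 3, a rough equivalent can be sketched as follows. This is an adaptation, not the code from the answer: it assumes that "byte string" means `bytes` and that `str` plays the role Python 2's `unicode` did, and the `encoding` parameter is an addition for illustration.

```python
import json

def byteify(value, encoding="utf-8"):
    # Recursively convert every str in a decoded JSON value to bytes.
    if isinstance(value, dict):
        return {byteify(k, encoding): byteify(v, encoding)
                for k, v in value.items()}
    if isinstance(value, list):
        return [byteify(item, encoding) for item in value]
    if isinstance(value, str):
        return value.encode(encoding)
    # Numbers, booleans and None pass through unchanged.
    return value

decoded = json.loads('{"a": ["b", 1, {"c": "d"}], "e": null}')
print(byteify(decoded))
```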

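You can use the object_hook parameter of json.loads to pass in a converter. You don't have to do the conversion after the fact. The json module will always pass the object_hook dicts only, and it will recursively pass in nested dicts, so you don't have to recurse into nested dicts yourself. I don't think I would convert unicode strings to numbers like Wells shows. If it's a unicode string, it was quoted as a string in the JSON file, so it is supposed to be a string (or the file is bad).

Also, I'd try to avoid doing something like str(val) on a unicode object. You should use value.encode(encoding) with a valid encoding, depending on what your external lib expects.

So, for example: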

def _decode_list(data):
    rv = []
    for item in data:
        if isinstance(item, unicode):
            item = item.encode('utf-8')
        elif isinstance(item, list):
            item = _decode_list(item)
        elif isinstance(item, dict):
            item = _decode_dict(item)
        rv.append(item)
    return rv

def _decode_dict(data):
    rv = {}
    for key, value in data.iteritems():
        if isinstance(key, unicode):
            key = key.encode('utf-8')
        if isinstance(value, unicode):
            value = value.encode('utf-8')
        elif isinstance(value, list):
            value = _decode_list(value)
        elif isinstance(value, dict):
            value = _decode_dict(value)
        rv[key] = value
    return rv

obj = json.loads(s, object_hook=_decode_dict)
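The advice about choosing an explicit encoding instead of relying on str(val) can be illustrated with a short sketch. It targets Python 3 (so `str` here corresponds to Python 2's `unicode`), and the sample text is made up for the illustration.

```python
text = "caf\u00e9"  # 'café', written with an escape to keep the source ASCII

# The same text yields different bytes under different encodings.
utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("latin-1")
print(utf8_bytes, latin1_bytes)

# An encoding that cannot represent the text fails loudly.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)
```

On Python 2, str(val) on a unicode object implicitly uses the default codec (normally ASCII), which is why it breaks in the same way on non-ASCII text.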

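That's because json makes no distinction between string objects and unicode objects. They're all strings in JavaScript.

I think JSON is right to return unicode objects. In fact, I wouldn't accept anything less, since JavaScript strings are in fact unicode objects (i.e. JSON (JavaScript) strings can store any kind of Unicode character), so it makes sense to create unicode objects when translating strings from JSON. Plain strings just wouldn't fit, since the library would have to guess the encoding you want.

It's better to use unicode string objects everywhere. So your best option is to update your libraries so they can deal with unicode objects.

But if you really want bytestrings, just encode the results to the encoding of your choice: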

>>> nl = json.loads(js)
>>> nl
[u'a', u'b']
>>> nl = [s.encode('utf-8') for s in nl]
>>> nl
['a', 'b']

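There exists an easy work-around.

TL;DR: use ast.literal_eval() instead of json.loads(). Both ast and json are in the standard library.

While not a 'perfect' answer, it gets one pretty far if your plan is to ignore Unicode altogether. In Python 2.7: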

import json, ast
d = { 'field' : 'value' }
print "JSON Fail: ", json.loads(json.dumps(d))
print "AST Win:", ast.literal_eval(json.dumps(d))

gives:

JSON Fail:  {u'field': u'value'}
AST Win: {'field': 'value'}

This gets more hairy when some objects are really Unicode strings. The full answer gets hairy quickly.
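One further limitation is worth checking before relying on this work-around: ast.literal_eval parses Python literals, not JSON. A minimal sketch (run here on Python 3, with made-up sample text) shows it rejecting the JSON literals true and null:

```python
import ast
import json

json_text = '{"active": true, "spouse": null}'

# json understands the JSON literals true/false/null...
print(json.loads(json_text))

# ...but they are not Python literals, so ast.literal_eval rejects them.
try:
    ast.literal_eval(json_text)
    literal_eval_ok = True
except ValueError:
    literal_eval_ok = False
print(literal_eval_ok)
```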

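Mike Brennan's answer is close, but there is no reason to re-traverse the entire structure. If you use the object_pairs_hook (Python 2.7+) parameter:

object_pairs_hook is an optional function that will be called with the result of any object literal decoded with an ordered list of pairs. The return value of object_pairs_hook will be used instead of the dict. This feature can be used to implement custom decoders that rely on the order that the key and value pairs are decoded (for example, collections.OrderedDict will remember the order of insertion). If object_hook is also defined, the object_pairs_hook takes priority.

With it, you get each JSON object handed to you, so you can do the decoding with no need for recursion: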

def deunicodify_hook(pairs):
    new_pairs = []
    for key, value in pairs:
        if isinstance(value, unicode):
            value = value.encode('utf-8')
        if isinstance(key, unicode):
            key = key.encode('utf-8')
        new_pairs.append((key, value))
    return dict(new_pairs)

In [52]: open('test.json').read()
Out[52]: '{"1": "hello", "abc": [1, 2, 3], "def": {"hi": "mom"}, "boo": [1, "hi", "moo", {"5": "some"}]}'                                        

In [53]: json.load(open('test.json'))
Out[53]: 
{u'1': u'hello',
 u'abc': [1, 2, 3],
 u'boo': [1, u'hi', u'moo', {u'5': u'some'}],
 u'def': {u'hi': u'mom'}}

In [54]: json.load(open('test.json'), object_pairs_hook=deunicodify_hook)
Out[54]: 
{'1': 'hello',
 'abc': [1, 2, 3],
 'boo': [1, 'hi', 'moo', {'5': 'some'}],
 'def': {'hi': 'mom'}}

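Notice that I never have to call the hook recursively, since every object gets handed to the hook when you use object_pairs_hook. You do have to care about lists, but as you can see, an object within a list is properly converted, and you don't have to recurse to make it happen.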
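To see exactly what the decoder passes to the hook, the pairs can be recorded. This is a sketch for Python 3; `recording_pairs_hook` is a hypothetical helper used only for this illustration.

```python
import json

seen = []

def recording_pairs_hook(pairs):
    # Keep a copy of exactly what the decoder passes in for each JSON object.
    seen.append(list(pairs))
    return dict(pairs)

json.loads('{"boo": [1, "hi", {"5": "some"}]}',
           object_pairs_hook=recording_pairs_hook)
for pairs in seen:
    print(pairs)
```

The object inside the list reaches the hook on its own, before the enclosing object does. The string "hi" only shows up inside the list value of the "boo" pair, so a hook that should also convert strings sitting directly in lists has to iterate over list values itself.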

EDIT: A coworker pointed out that Python 2.6 doesn't have object_pairs_hook. You can still use this with Python 2.6 by making a very small change. In the hook above, change:

for key, value in pairs:

to

for key, value in pairs.iteritems():

Then use object_hook instead of object_pairs_hook:

In [66]: json.load(open('test.json'), object_hook=deunicodify_hook)
Out[66]: 
{'1': 'hello',
 'abc': [1, 2, 3],
 'boo': [1, 'hi', 'moo', {'5': 'some'}],
 'def': {'hi': 'mom'}}

Using object_pairs_hook results in one less dictionary being instantiated for each object in the JSON document, which might be worthwhile if you are parsing a huge document.
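Two details from the documentation excerpt quoted earlier, the OrderedDict use case and the priority of object_pairs_hook over object_hook, can be checked with a short sketch (Python 3; the sample text and the lambdas are made up for the illustration):

```python
import json
from collections import OrderedDict

text = '{"b": 1, "a": 2, "c": 3}'

# Passing OrderedDict as the hook builds each object from its ordered pairs.
ordered = json.loads(text, object_pairs_hook=OrderedDict)
print(list(ordered.keys()))

# When both hooks are given, only object_pairs_hook is called.
hook_calls = []
json.loads(
    text,
    object_hook=lambda d: hook_calls.append("object_hook") or d,
    object_pairs_hook=lambda pairs: hook_calls.append("object_pairs_hook") or dict(pairs),
)
print(hook_calls)
```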

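I'm afraid there's no way to achieve this automatically within the simplejson library.

The scanner and decoder in simplejson are designed to produce unicode text. To do this, the library uses a function called c_scanstring (if it's available, for speed), or py_scanstring if the C version is not available. The scanstring function is called several times by nearly every routine that simplejson has for decoding a structure that might contain text. You'd have to either monkeypatch the scanstring value in simplejson.decoder, or subclass JSONDecoder and provide pretty much your own entire implementation of anything that might contain text.

The reason that simplejson outputs unicode, however, is that the JSON spec specifically mentions that "A string is a collection of zero or more Unicode characters". Support for unicode is assumed as part of the format itself. Simplejson's scanstring implementation goes so far as to scan and interpret unicode escapes (even error-checking for malformed multi-byte charset representations), so the only way it can reliably return the value to you is as unicode.

If you have an aged library that needs a str, I recommend you either laboriously search the nested data structure after parsing (which I acknowledge is what you explicitly said you wanted to avoid... sorry), or perhaps wrap your libraries in some sort of facade where you can massage the input parameters at a more granular level. The second approach might be more manageable than the first if your data structures are indeed deeply nested.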
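The facade idea might look roughly like the sketch below. It is written for Python 3, where the legacy requirement would be `bytes`; `accepts_bytes_only`, `to_bytes` and `legacy_length` are hypothetical names invented for this illustration, with `legacy_length` standing in for the old library call.

```python
import functools

def to_bytes(value, encoding="utf-8"):
    # Encode text; leave every other type untouched.
    return value.encode(encoding) if isinstance(value, str) else value

def accepts_bytes_only(func, encoding="utf-8"):
    # Wrap func so text arguments are encoded just before the call.
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        new_args = [to_bytes(arg, encoding) for arg in args]
        new_kwargs = {key: to_bytes(val, encoding) for key, val in kwargs.items()}
        return func(*new_args, **new_kwargs)
    return wrapper

def legacy_length(payload):
    # Hypothetical stand-in for an old library call that insists on bytes.
    if not isinstance(payload, bytes):
        raise TypeError("bytes required")
    return len(payload)

safe_length = accepts_bytes_only(legacy_length)
print(safe_length("h\u00e9llo"))
```

The conversion then happens only at the boundary to the old library, and the rest of the program can keep working with Unicode text. This wrapper only handles top-level arguments; nested containers would still need a recursive converter.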

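As Mark (Amery) correctly notes: using PyYaml's deserializer on a json dump works only if you have ASCII only. At least out of the box.

Two quick comments on the PyYaml approach:

Never use yaml.load on data from the field. It is a feature(!) of yaml to execute arbitrary code hidden within the structure.

You can make it work also for non-ASCII via this: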

def to_utf8(loader, node):
    return loader.construct_scalar(node).encode('utf-8')
yaml.add_constructor(u'tag:yaml.org,2002:str', to_utf8)

But performance-wise it bears no comparison to Mark Amery's answer:

Throwing some deeply nested sample dicts at the two methods, I get this (with dt[j] = time delta of json.loads(json.dumps(m))):

     dt[yaml.safe_load(json.dumps(m))] =~ 100 * dt[j]
     dt[byteify recursion(Mark Amery)] =~   5 * dt[j]

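So deserialization, including fully walking the tree and encoding, is well within the order of magnitude of json's C-based implementation. I find this remarkably fast, and it is also more robust than the yaml load on deeply nested structures. And it is less prone to security errors, looking at yaml.load.

=> While I would appreciate a pointer to a C-only converter, the byteify function should be the default answer.

This holds especially true if your json structure is from the field, containing user input. In that case you probably need to walk over your structure anyway, independent of your desired internal data structures ('unicode sandwich' or byte strings only).

Why?

Unicode normalisation. For the unaware: take a painkiller and read up on it.

So using the byteify recursion you kill two birds with one stone:

get your bytestrings from nested json dumps
get user input values normalised, so that you find the stuff in your storage.

In my tests it turned out that replacing input.encode('utf-8') with unicodedata.normalize('NFC', input).encode('utf-8') was even faster than without NFC, but that depends heavily on the sample data, I guess.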
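Why normalisation matters for lookups can be shown with a minimal sketch (Python 3; the sample name is made up). Two strings that render identically compare unequal, and encode to different bytes, until both are brought to the same normalisation form:

```python
import unicodedata

decomposed = "Jose\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT
composed = "Jos\u00e9"      # precomposed U+00E9

# Visually identical, but different code point sequences.
print(decomposed == composed)

# NFC composes the base letter and the combining accent into one code point.
normalized = unicodedata.normalize("NFC", decomposed)
print(normalized == composed)
print(decomposed.encode("utf-8"), normalized.encode("utf-8"))
```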

The gotcha is that simplejson and json are two different modules, at least in the way they deal with unicode. You have json in Python 2.6+, and this gives you unicode values, whereas simplejson returns string objects. Just try easy_install-ing simplejson in your environment and see if that works. It did for me.

Just use pickle instead of json for dump and load, like so:

    import json
    import pickle

    d = { 'field1': 'value1', 'field2': 2, }

    json.dump(d,open("testjson.txt","w"))

    print json.load(open("testjson.txt","r"))

    pickle.dump(d,open("testpickle.txt","w"))

    print pickle.load(open("testpickle.txt","r"))

The output it produces is (strings and integers are handled correctly):

    {u'field2': 2, u'field1': u'value1'}
    {'field2': 2, 'field1': 'value1'}

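So, I've run into the same problem. Guess what the first Google result was.

Because I need to pass all data to PyGTK, unicode strings aren't very useful to me either. So I have another recursive conversion method. It's actually also needed for typesafe JSON conversion: json.dump() would bail on any non-literals, like Python objects. It doesn't convert dict indexes though.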

# removes any objects, turns unicode back into str
def filter_data(obj):
        if type(obj) in (int, float, str, bool):
                return obj
        elif type(obj) == unicode:
                return str(obj)
        elif type(obj) in (list, tuple, set):
                obj = list(obj)
                for i,v in enumerate(obj):
                        obj[i] = filter_data(v)
        elif type(obj) == dict:
                for i,v in obj.iteritems():
                        obj[i] = filter_data(v)
        else:
                print "invalid object in data, converting to string"
                obj = str(obj) 
        return obj

I had a JSON dict as a string. The keys and values were unicode objects, as in the following example:

myStringDict = "{u'key':u'value'}"

I could use the byteify function suggested above after converting the string to a dict object using ast.literal_eval(myStringDict).

Support Python 2 & 3 using a hook (from https://stackoverflow.com/a/33571117/558397)

import requests
import six
from six import iteritems

requests.packages.urllib3.disable_warnings()  # @UndefinedVariable
r = requests.get("http://echo.jsontest.com/key/value/one/two/three", verify=False)

def _byteify(data):
    # if this is a unicode string, return its string representation
    if isinstance(data, six.string_types):
        return str(data.encode('utf-8').decode())

    # if this is a list of values, return list of byteified values
    if isinstance(data, list):
        return [ _byteify(item) for item in data ]

    # if this is a dictionary, return dictionary of byteified keys and values
    # but only if we haven't already byteified it
    if isinstance(data, dict):
        return {
            _byteify(key): _byteify(value) for key, value in iteritems(data)
        }
    # if it's anything else, return it in its original form
    return data

w = r.json(object_hook=_byteify)
print(w)

Returns:

 {'three': '', 'key': 'value', 'one': 'two'}

This is late to the game, but I built this recursive caster. It works for my needs and I think it's relatively complete. It may help you.

def _parseJSON(self, obj):
    newobj = {}

    for key, value in obj.iteritems():
        key = str(key)

        if isinstance(value, dict):
            newobj[key] = self._parseJSON(value)
        elif isinstance(value, list):
            if key not in newobj:
                newobj[key] = []
                for i in value:
                    newobj[key].append(self._parseJSON(i))
        elif isinstance(value, unicode):
            val = str(value)
            if val.isdigit():
                val = int(val)
            else:
                try:
                    val = float(val)
                except ValueError:
                    val = str(val)
            newobj[key] = val

    return newobj

Just pass it a JSON object like so:

obj = json.loads(content, parse_float=float, parse_int=int)
obj = _parseJSON(obj)

I have it as a private member of a class, but you can repurpose the method as you see fit.

I rewrote Wells's _parse_json() to handle cases where the JSON object itself is an array (my use case).

def _parseJSON(self, obj):
    if isinstance(obj, dict):
        newobj = {}
        for key, value in obj.iteritems():
            key = str(key)
            newobj[key] = self._parseJSON(value)
    elif isinstance(obj, list):
        newobj = []
        for value in obj:
            newobj.append(self._parseJSON(value))
    elif isinstance(obj, unicode):
        newobj = str(obj)
    else:
        newobj = obj
    return newobj

Here is a recursive encoder written in C: https://github.com/axiros/nested_encode

The performance overhead for "average" structures is around 10% compared to json.loads.

python speed.py                                                                                            
  json loads            [0.16sec]: {u'a': [{u'b': [[1, 2, [u'\xd6ster..
  json loads + encoding [0.18sec]: {'a': [{'b': [[1, 2, ['\xc3\x96ster.
  time overhead in percent: 9%

using this test structure:

import json, nested_encode, time

s = """
{
  "firstName": "Jos\\u0301",
  "lastName": "Smith",
  "isAlive": true,
  "age": 25,
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "\\u00d6sterreich",
    "state": "NY",
    "postalCode": "10021-3100"
  },
  "phoneNumbers": [
    {
      "type": "home",
      "number": "212 555-1234"
    },
    {
      "type": "office",
      "number": "646 555-4567"
    }
  ],
  "children": [],
  "spouse": null,
  "a": [{"b": [[1, 2, ["\\u00d6sterreich"]]]}]
}
"""


t1 = time.time()
for i in xrange(10000):
    u = json.loads(s)
dt_json = time.time() - t1

t1 = time.time()
for i in xrange(10000):
    b = nested_encode.encode_nested(json.loads(s))
dt_json_enc = time.time() - t1

print "json loads            [%.2fsec]: %s..." % (dt_json, str(u)[:20])
print "json loads + encoding [%.2fsec]: %s..." % (dt_json_enc, str(b)[:20])

print "time overhead in percent: %i%%"  % (100 * (dt_json_enc - dt_json)/dt_json)

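Check out this answer to a similar question, which states:

The u- prefix just means that you have a Unicode string. When you really use the string, it won't appear in your data. Don't be thrown by the printed output.

For example, try this: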

print mail_accounts[0]["i"]

You won't see a u.

With Python 3.6, I sometimes still run into this problem. For example, when getting a response from a REST API and loading the response text as JSON, I still get unicode strings. I found a simple solution using json.dumps().

response_message = json.loads(json.dumps(response.text))
print(response_message)
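It may help to check what this round trip returns. In the sketch below (Python 3, with a made-up `response_text` standing in for `response.text`), json.dumps wraps the text in a JSON string literal and json.loads unwraps it again, so the result is the original text rather than a parsed object; calling json.loads on the text directly is what yields a dict.

```python
import json

response_text = '{"key": "value"}'  # stand-in for response.text

# dumps followed by loads gives back the same str, still unparsed.
round_tripped = json.loads(json.dumps(response_text))
print(type(round_tripped).__name__, round_tripped == response_text)

# Parsing the text itself produces the decoded object.
parsed = json.loads(response_text)
print(type(parsed).__name__, parsed)
```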

I ran into this problem too, and having to deal with JSON, I came up with a small loop that converts the unicode keys to strings. (simplejson on GAE does not return string keys.)

obj is the object decoded from JSON:

if NAME_CLASS_MAP.has_key(cls):
    kwargs = {}
    for i in obj.keys():
        kwargs[str(i)] = obj[i]
    o = NAME_CLASS_MAP[cls](**kwargs)
    o.save()

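kwargs is what I pass to the constructor of the GAE application (which does not like unicode keys in **kwargs).

Not as robust as the solution from Wells, but much smaller.

I've adapted the code from the answer of Mark Amery, particularly in order to get rid of isinstance in favour of duck typing.

The encoding is done manually and ensure_ascii is disabled. The Python docs for json.dump say that

If ensure_ascii is True (the default), all non-ASCII characters in the output are escaped with \uXXXX sequences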
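The effect of the flag can be seen in a short sketch (Python 3; the sample dict is made up for the illustration):

```python
import json

data = {"a": "T\u00fcsk\u00e9s"}  # "Tüskés", written with escapes

escaped = json.dumps(data)                       # ensure_ascii=True is the default
readable = json.dumps(data, ensure_ascii=False)  # keeps non-ASCII characters as-is
print(escaped)
print(readable)

# Both forms decode back to the same value.
print(json.loads(escaped) == json.loads(readable) == data)
```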

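Disclaimer: in the doctest I used the Hungarian language. Some notable Hungarian-related character encodings are: cp852, the IBM/OEM encoding used e.g. in DOS (sometimes referred to as ascii, incorrectly I think; it depends on the codepage setting); cp1250, used e.g. in Windows (sometimes referred to as ansi, depending on the locale settings); and iso-8859-2, sometimes used on HTTP servers. The test text Tüskéshátú kígyóbűvölő is attributed to Koltai László (native personal name form) and is from Wikipedia.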

# coding: utf-8
"""
This file should be encoded correctly with utf-8.
"""
import json

def encode_items(input, encoding='utf-8'):
    u"""original from: https://stackoverflow.com/a/13101776/611007
    adapted by SO/u/611007 (20150623)
    >>> 
    >>> ## run this with `python -m doctest <this file>.py` from command line
    >>> 
    >>> txt = u"Tüskéshátú kígyóbűvölő"
    >>> txt2 = u"T\\u00fcsk\\u00e9sh\\u00e1t\\u00fa k\\u00edgy\\u00f3b\\u0171v\\u00f6l\\u0151"
    >>> txt3 = u"uúuutifu"
    >>> txt4 = b'u\\xfauutifu'
    >>> # txt4 shouldn't be 'u\\xc3\\xbauutifu', string content needs double backslash for doctest:
    >>> assert u'\\u0102' not in b'u\\xfauutifu'.decode('cp1250')
    >>> txt4u = txt4.decode('cp1250')
    >>> assert txt4u == u'u\\xfauutifu', repr(txt4u)
    >>> txt5 = b"u\\xc3\\xbauutifu"
    >>> txt5u = txt5.decode('utf-8')
    >>> txt6 = u"u\\u251c\\u2551uutifu"
    >>> there_and_back_again = lambda t: encode_items(t, encoding='utf-8').decode('utf-8')
    >>> assert txt == there_and_back_again(txt)
    >>> assert txt == there_and_back_again(txt2)
    >>> assert txt3 == there_and_back_again(txt3)
    >>> assert txt3.encode('cp852') == there_and_back_again(txt4u).encode('cp852')
    >>> assert txt3 == txt4u,(txt3,txt4u)
    >>> assert txt3 == there_and_back_again(txt5)
    >>> assert txt3 == there_and_back_again(txt5u)
    >>> assert txt3 == there_and_back_again(txt4u)
    >>> assert txt3.encode('cp1250') == encode_items(txt4, encoding='utf-8')
    >>> assert txt3.encode('utf-8') == encode_items(txt5, encoding='utf-8')
    >>> assert txt2.encode('utf-8') == encode_items(txt, encoding='utf-8')
    >>> assert {'a':txt2.encode('utf-8')} == encode_items({'a':txt}, encoding='utf-8')
    >>> assert [txt2.encode('utf-8')] == encode_items([txt], encoding='utf-8')
    >>> assert [[txt2.encode('utf-8')]] == encode_items([[txt]], encoding='utf-8')
    >>> assert [{'a':txt2.encode('utf-8')}] == encode_items([{'a':txt}], encoding='utf-8')
    >>> assert {'b':{'a':txt2.encode('utf-8')}} == encode_items({'b':{'a':txt}}, encoding='utf-8')
    """
    try:
        input.iteritems
        return {encode_items(k): encode_items(v) for (k,v) in input.iteritems()}
    except AttributeError:
        if isinstance(input, unicode):
            return input.encode(encoding)
        elif isinstance(input, str):
            return input
        try:
            iter(input)
            return [encode_items(e) for e in input]
        except TypeError:
            return input

def alt_dumps(obj, **kwargs):
    """
    >>> alt_dumps({'a': u"T\\u00fcsk\\u00e9sh\\u00e1t\\u00fa k\\u00edgy\\u00f3b\\u0171v\\u00f6l\\u0151"})
    '{"a": "T\\xc3\\xbcsk\\xc3\\xa9sh\\xc3\\xa1t\\xc3\\xba k\\xc3\\xadgy\\xc3\\xb3b\\xc5\\xb1v\\xc3\\xb6l\\xc5\\x91"}'
    """
    if 'ensure_ascii' in kwargs:
        del kwargs['ensure_ascii']
    return json.dumps(encode_items(obj), ensure_ascii=False, **kwargs)

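I'd also like to highlight the answer of Jarret Hardie, which references the JSON spec, quoting:

A string is a collection of zero or more Unicode characters

In my use case I had files with JSON. They are utf-8 encoded files. ensure_ascii results in properly escaped but not very readable JSON files, which is why I adapted Mark Amery's answer to fit my needs.

The doctest is not particularly thoughtful, but I share the code in the hope that it will be useful for someone.

License: CC-BY-SA with attribution

Not affiliated with StackOverflow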
