Convert bytes to a string?

https://stackoverflow.com/questions/606191

03-07-2019

Question

I'm using this code to get standard output from an external program:

>>> from subprocess import *
>>> command_stdout = Popen(['ls', '-l'], stdout=PIPE).communicate()[0]

The communicate() method returns an array of bytes:

>>> command_stdout
b'total 0\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file1\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file2\n'

However, I'd like to work with the output as a normal Python string, so that I could print it like this:

>>> print(command_stdout)
-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file1
-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file2

I thought that's what the binascii.b2a_qp() method is for, but when I tried it, I got the same byte array again:

>>> binascii.b2a_qp(command_stdout)
b'total 0\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file1\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file2\n'

Does anybody know how to convert the bytes value back to a string? I mean, using the "batteries" instead of doing it manually. And I'd like it to work with Python 3.

Solution

You need to decode the bytes object to produce a string:

>>> b"abcde"
b'abcde'

# utf-8 is used here because it is a very common encoding, but you
# need to use the encoding your data is actually in.
>>> b"abcde".decode("utf-8") 
'abcde'
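
Applied to the output in the question, the same call might look like the sketch below. It hard-codes a shortened copy of the bytes instead of running ls, and it assumes the data is UTF-8; the variable names are only illustrative.

```python
# Shortened copy of the bytes shown in the question (not a live `ls` call).
command_stdout = b"total 0\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file1\n"

# Assumption: the program wrote UTF-8. Use the real encoding if it differs.
command_text = command_stdout.decode("utf-8")

# The result is an ordinary str, so the usual string methods apply.
lines = command_text.splitlines()
print(lines[0])
```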

Other tips

I think this way is easy:

bytes_data = [112, 52, 52]
"".join(map(chr, bytes_data))
# 'p44'

You need to decode the byte string and turn it into a character (Unicode) string.

On Python 2

encoding = 'utf-8'
b'hello'.decode(encoding)

On Python 3

encoding = 'utf-8'
str(b'hello', encoding)

If you don't know the encoding, then to read binary input into a string in a way that is compatible with both Python 3 and Python 2, use the ancient MS-DOS CP437 encoding:

import sys

PY3K = sys.version_info >= (3, 0)

lines = []
for line in stream:
    if not PY3K:
        lines.append(line)
    else:
        lines.append(line.decode('cp437'))

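Because the encoding is unknown, expect non-English symbols to be translated to characters of cp437 (English characters are not translated, because they match in most single-byte encodings and in UTF-8).

Decoding arbitrary binary input as UTF-8 is unsafe, because you may get this: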
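
A quick way to see why cp437 never fails is to push every possible byte value through it. This is a minimal sketch for Python 3 only; the variable names are illustrative and not part of the answer above.

```python
# Every possible byte value, including ones that are invalid in UTF-8.
data = bytes(range(256))

# cp437 assigns a character to each byte, so this does not raise.
text = data.decode("cp437")

# Encoding with the same codec gives the original bytes back.
restored = text.encode("cp437")
print(len(text), restored == data)
```

The decoded characters are only meaningful if the data really was cp437; otherwise they are placeholders that happen to be reversible.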

>>> b'\x00\x01\xffsd'.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 2: invalid start byte

The same applies to latin-1, which was popular (the default?) for Python 2. See the missing points in Codepage Layout; that is where Python chokes with the infamous ordinal not in range.

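Update 20150604: There are rumors that Python 3 has the surrogateescape error strategy for encoding stuff into binary data without data loss or crashes, but it needs conversion tests, [binary] -> [str] -> [binary], to validate both performance and reliability.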
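
The [binary] -> [str] -> [binary] check mentioned above can be sketched in a few lines. This only tests one sample on Python 3 and says nothing about performance; the sample bytes are the ones from the traceback earlier in this answer.

```python
# Bytes that are not valid UTF-8 (0xff can never start a UTF-8 sequence).
data = b"\x00\x01\xffsd"

# surrogateescape turns each undecodable byte into a lone surrogate
# code point (0xff becomes U+DCFF) instead of raising.
text = data.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes.
restored = text.encode("utf-8", "surrogateescape")
print(restored == data)
```

Note that a string containing lone surrogates cannot be encoded with the default strict handler, so it may fail when printed or written to a file.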

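Update 20170116: Thanks to a comment by Nearoo, there is also the possibility to slash-escape all unknown bytes with the backslashreplace error handler. That works only on Python 3, so even with this workaround you will still get inconsistent output from different Python versions:

import sys

PY3K = sys.version_info >= (3, 0)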

lines = []
for line in stream:
    if not PY3K:
        lines.append(line)
    else:
        lines.append(line.decode('utf-8', 'backslashreplace'))

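See https://docs.python.org/3/howto/unicode.html#python-s-unicode-support for details.

Update 20170119: I decided to implement a slash-escaping decode that works for both Python 2 and Python 3. It should be slower than the cp437 solution, but it should produce identical results on every Python version.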

# --- preparation

import codecs

def slashescape(err):
    """ codecs error handler. err is a UnicodeDecodeError instance. return
    a tuple with a replacement for the undecodable part of the input
    and a position where decoding should continue"""
    #print err, dir(err), err.start, err.end, err.object[:err.start]
    thebyte = err.object[err.start:err.end]
    repl = u'\\x'+hex(ord(thebyte))[2:]
    return (repl, err.end)

codecs.register_error('slashescape', slashescape)

# --- processing

stream = [b'\x80abc']

lines = []
for line in stream:
    lines.append(line.decode('utf-8', 'slashescape'))
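
The handler above calls ord() on the failing slice, which assumes the slice is exactly one byte long. On Python 3 the UTF-8 decoder can report a longer range, for example for a multi-byte sequence that is cut off at the end of the input. The variant below is a sketch for Python 3 only that escapes each byte in the range; the name slashescape_multi is made up for this example.

```python
import codecs

def slashescape_multi(err):
    """Escape every undecodable byte in the reported range as \\xNN."""
    if not isinstance(err, UnicodeDecodeError):
        raise err  # this handler only makes sense for decoding
    bad = err.object[err.start:err.end]
    # Iterating a bytearray yields integers, one per byte.
    repl = "".join("\\x{:02x}".format(b) for b in bytearray(bad))
    return (repl, err.end)

codecs.register_error("slashescape_multi", slashescape_multi)

# A lone continuation byte, and a three-byte sequence cut off after two bytes.
single = b"\x80abc".decode("utf-8", "slashescape_multi")
truncated = b"ab\xe2\x82".decode("utf-8", "slashescape_multi")
print(single, truncated)
```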

In Python 3, the default encoding is "utf-8", so you can use it directly:

b'hello'.decode()

which is equivalent to

b'hello'.decode(encoding="utf-8")

On the other hand, in Python 2, the encoding defaults to the default string encoding. Thus, you should use:

b'hello'.decode(encoding)

where encoding is the encoding you want.

Note: support for keyword arguments was added in Python 2.7.

I think what you actually want is this:

>>> from subprocess import *
>>> command_stdout = Popen(['ls', '-l'], stdout=PIPE).communicate()[0]
>>> command_text = command_stdout.decode(encoding='windows-1252')

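Aaron's answer was correct, except that you need to know which encoding to use. And I believe that Windows uses 'windows-1252'. It only matters if you have some unusual (non-ASCII) characters in your content, but then it does make a difference.

By the way, the fact that it does matter is the reason Python moved to using two different types for binary and text data. The only way you would know is to read the Windows documentation (or read it here).

Set universal_newlines to True, i.e.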

command_stdout = Popen(['ls', '-l'], stdout=PIPE, universal_newlines=True).communicate()[0]

While @Aaron Maenpaa's answer just works, a user recently asked:

Is there a simpler way? 'fhand.read().decode("ascii")' [...] It's so long!

You can use:

command_stdout.decode()

decode() has a standard argument:

codecs.decode(obj, encoding='utf-8', errors='strict')
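
The third parameter in that signature, errors, controls what happens when a byte cannot be decoded. The sketch below compares three of the standard handlers on a sample that is not valid UTF-8; the sample bytes and variable names are illustrative.

```python
import codecs

data = b"caf\xe9"  # "café" in Latin-1, but not valid UTF-8

# errors="strict" (the default) raises on the bad byte.
try:
    codecs.decode(data, "utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# errors="replace" substitutes U+FFFD; errors="ignore" drops the byte.
replaced = codecs.decode(data, "utf-8", errors="replace")
ignored = data.decode("utf-8", errors="ignore")
print(strict_failed, ascii(replaced), ignored)
```

Both non-strict handlers lose information, so they suit display or logging better than data that has to be written back unchanged.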

To interpret a byte sequence as text, you have to know the corresponding character encoding:

unicode_text = bytestring.decode(character_encoding)

Example:

>>> b'\xc2\xb5'.decode('utf-8')
'µ'

The ls command may produce output that can't be interpreted as text. File names on Unix may be any sequence of bytes except slash b'/' and zero b'\0':

>>> open(bytes(range(0x100)).translate(None, b'\0/'), 'w').close()

Trying to decode such byte soup using the UTF-8 encoding raises UnicodeDecodeError.

It can be worse. The decoding may fail silently and produce mojibake if you use a wrong, incompatible encoding:

>>> '—'.encode('utf-8').decode('cp1252')
'â€”'

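The data is corrupted, but your program remains unaware that a failure has occurred.

In general, which character encoding to use is not embedded in the byte sequence itself. You have to communicate this information out-of-band. Some outcomes are more likely than others, and therefore the chardet module exists, which can guess the character encoding. A single Python script may use multiple character encodings in different places.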
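
When the encoding cannot be communicated out-of-band, one crude fallback is to try a short list of likely encodings in order. The sketch below is not chardet and does no statistical guessing; the function name, the candidate list and the latin-1 last resort are all assumptions made for this example.

```python
def decode_with_candidates(data, candidates=("utf-8", "cp1252")):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Hypothetical last resort: latin-1 maps every byte, so it cannot fail.
    return data.decode("latin-1"), "latin-1"

print(decode_with_candidates("µ".encode("utf-8")))
print(decode_with_candidates(b"caf\xe9"))
```

A clean decode does not prove the guess is right: as the mojibake example shows, a wrong single-byte encoding often decodes without any error.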

ls output can be converted to a Python string using the os.fsdecode() function, which succeeds even for undecodable filenames (it uses sys.getfilesystemencoding() and the surrogateescape error handler on Unix):

import os
import subprocess

output = os.fsdecode(subprocess.check_output('ls'))

To get the original bytes, you could use os.fsencode().
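
The round trip through os.fsdecode() and os.fsencode() can be checked on a name that is not valid UTF-8. This sketch assumes a Unix-like system, where the surrogateescape handler is in effect; the file name is made up and no file is created.

```python
import os

# Hypothetical file name as raw bytes; 0xe9 alone is not valid UTF-8.
raw_name = b"caf\xe9.txt"

# fsdecode() returns a str even if the bytes do not fit the
# filesystem encoding (assumption: Unix-like platform).
name = os.fsdecode(raw_name)

# fsencode() reverses the conversion and restores the exact bytes.
restored = os.fsencode(name)
print(restored == raw_name)
```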

If you pass the universal_newlines=True parameter, then subprocess uses locale.getpreferredencoding(False) to decode bytes, e.g., it can be cp1252 on Windows.

To decode the byte stream on the fly, io.TextIOWrapper() could be used.

Different commands may use different character encodings for their output, e.g., the dir internal command (cmd) may use cp437. To decode its output, you could pass the encoding explicitly (Python 3.6+):
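
A minimal sketch of the io.TextIOWrapper() approach follows. An in-memory io.BytesIO object stands in for a real binary stream such as a pipe from a child process, and UTF-8 is an assumption about the data.

```python
import io

# Stand-in for a binary stream (for example, a child process's stdout pipe).
raw = io.BytesIO("first\nsecond µ\n".encode("utf-8"))

# The wrapper decodes incrementally as lines are read from the stream.
reader = io.TextIOWrapper(raw, encoding="utf-8")

lines = [line.rstrip("\n") for line in reader]
print(lines)
```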

output = subprocess.check_output('dir', shell=True, encoding='cp437')

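The filenames may differ from os.listdir() (which uses the Windows Unicode API), e.g., '\xb6' can be substituted with '\x14': Python's cp437 codec maps b'\x14' to the control character U+0014 instead of U+00B6 (¶). To support filenames with arbitrary Unicode characters, see Decode PowerShell output possibly containing non-ASCII Unicode characters into a Python string.

Since this question is actually asking about subprocess output, a more direct approach is available, because Popen accepts an encoding keyword (in Python 3.6+):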

>>> from subprocess import Popen, PIPE
>>> text = Popen(['ls', '-l'], stdout=PIPE, encoding='utf-8').communicate()[0]
>>> type(text)
str
>>> print(text)
total 0
-rw-r--r-- 1 wim badger 0 May 31 12:45 some_file.txt
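
The same keyword is accepted by subprocess.run(). The sketch below runs the current Python interpreter instead of ls so that it does not depend on the platform; the child command and the choice of UTF-8 are assumptions for this example.

```python
import subprocess
import sys

# Run a small child process and let subprocess decode its output.
# Assumption: the child writes plain ASCII, which is valid UTF-8.
result = subprocess.run(
    [sys.executable, "-c", "print('hello')"],
    stdout=subprocess.PIPE,
    encoding="utf-8",
)

text = result.stdout  # already a str, no decode() call needed
print(type(text).__name__, text.strip())
```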

The general answer for other users is to decode bytes to text:

>>> b'abcde'.decode()
'abcde'

With no argument, sys.getdefaultencoding() will be used. If your data is not in sys.getdefaultencoding(), then you must specify the encoding explicitly in the decode call:

>>> b'caf\xe9'.decode('cp1250')
'café'

If you get the following when trying decode():

AttributeError: 'str' object has no attribute 'decode'

You can also specify the encoding type directly in a cast:

>>> my_byte_str
b'Hello World'

>>> str(my_byte_str, 'utf-8')
'Hello World'

When working with data from Windows systems (with \r\n line endings), my answer is

String = Bytes.decode("utf-8").replace("\r\n", "\n")

Why? Try this with a multiline Input.txt:

Bytes = open("Input.txt", "rb").read()
String = Bytes.decode("utf-8")
open("Output.txt", "w").write(String)

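All your line endings will be doubled (to \r\r\n), leading to extra empty lines. Python's text-read functions usually normalize line endings so that strings use only \n. If you receive binary data from a Windows system, Python does not have a chance to do that. Thus,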

Bytes = open("Input.txt", "rb").read()
String = Bytes.decode("utf-8").replace("\r\n", "\n")
open("Output.txt", "w").write(String)

will replicate your original file.

I made a function to clean a list
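
An alternative that is not part of the answer above is to leave the decoded text alone and switch off newline translation when writing, by passing newline="" to open(). The sketch below uses a temporary directory and in-memory sample bytes in place of Input.txt; the names are illustrative.

```python
import os
import tempfile

# Sample bytes with Windows line endings, standing in for Input.txt.
data = b"one\r\ntwo\r\n"
text = data.decode("utf-8")  # "\r\n" is still present in the str

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "Output.txt")
    # newline="" writes "\n" and "\r" exactly as they appear in the str.
    with open(path, "w", encoding="utf-8", newline="") as outfile:
        outfile.write(text)
    with open(path, "rb") as check:
        written = check.read()

print(written == data)
```

Replacing "\r\n" as shown in the answer is the better fit when the text will be processed in Python; newline="" is mainly useful for copying the file through unchanged.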

def cleanLists(self, lista):
    lista = [x.strip() for x in lista]
    lista = [x.replace('\n', '') for x in lista]
    lista = [x.replace('\b', '') for x in lista]
    lista = [x.encode('utf8') for x in lista]
    lista = [x.decode('utf8') for x in lista]

    return lista

For Python 3, this is a much safer and more Pythonic approach to convert from bytes to a string:

def byte_to_str(bytes_or_str):
    if isinstance(bytes_or_str, bytes): #check if its in bytes
        print(bytes_or_str.decode('utf-8'))
    else:
        print("Object not of byte type")

byte_to_str(b'total 0\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file1\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file2\n')

Output:

total 0
-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file1
-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file2

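def toString(string):
    try:
        # bytes decode to str; a str has no decode() method on Python 3
        return string.decode("utf-8")
    except (AttributeError, ValueError):
        # already a str, or not valid UTF-8: return the input unchanged
        return string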

b = b'97.080.500'
s = '97.080.500'
print(toString(b))
print(toString(s))

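If you want to convert any bytes, not just a string that was converted to bytes:

import base64
import json

with open("bytesfile", "rb") as infile:
    # b85encode() returns bytes, so decode the ASCII result to get a str
    str1 = base64.b85encode(infile.read()).decode("ascii")

with open("bytesfile", "rb") as infile:
    str2 = json.dumps(list(infile.read()))

This is not very efficient, though. It will turn a 2 MB picture into 9 MB.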
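
The size difference between the two representations can be checked on a small in-memory sample instead of a file. This is only a sketch: the sample data is made up, and the exact ratio depends on the byte values involved.

```python
import base64
import json

# Made-up sample: 1024 bytes covering every byte value four times.
data = bytes(range(256)) * 4

# Base85 produces ASCII bytes; decode them to get a str.
b85_text = base64.b85encode(data).decode("ascii")

# JSON list of integers, one number per byte.
json_text = json.dumps(list(data))

# Both forms can be turned back into the original bytes.
from_b85 = base64.b85decode(b85_text)
from_json = bytes(json.loads(json_text))
print(len(data), len(b85_text), len(json_text))
```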

From http://docs.python.org/3/library/sys.html,

To write or read binary data from/to the standard streams, use the underlying binary buffer. For example, to write bytes to stdout, use sys.stdout.buffer.write(b'abc').

License: CC-BY-SA with attribution

Not affiliated with StackOverflow