Convert bytes to a string?

https://stackoverflow.com/questions/606191

03-07-2019

Question

I'm using this code to get standard output from an external program:

>>> from subprocess import *
>>> command_stdout = Popen(['ls', '-l'], stdout=PIPE).communicate()[0]

The communicate() method returns an array of bytes:

>>> command_stdout
b'total 0\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file1\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file2\n'

However, I'd like to work with the output as a normal Python string, so that I could print it like this:

>>> print(command_stdout)
-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file1
-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file2

I thought that's what the binascii.b2a_qp() method is for, but when I tried it, I got the same byte array again:

>>> binascii.b2a_qp(command_stdout)
b'total 0\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file1\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file2\n'

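Does anybody know how to convert the bytes value back to a string? I mean, using the "batteries" instead of doing it manually. And I'd like it to work with Python 3.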

Solution

You need to decode the bytes object to produce a string:

>>> b"abcde"
b'abcde'

# utf-8 is used here because it is a very common encoding, but you
# need to use the encoding your data is actually in.
>>> b"abcde".decode("utf-8") 
'abcde'
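
As a rough illustration of why the comment above insists on the data's actual encoding, the sketch below decodes the same bytes two ways. The sample bytes and variable names are made up for this example.

```python
# Bytes for "café" encoded as UTF-8 (sample data chosen for this sketch).
raw = b"caf\xc3\xa9"

# Decoding with the encoding the data is actually in gives the intended text.
text = raw.decode("utf-8")      # 'café'

# A different single-byte encoding does not fail here,
# but it silently produces the wrong characters.
wrong = raw.decode("latin-1")   # 'cafÃ©'

# Encoding goes the other way: str -> bytes.
round_trip = text.encode("utf-8")
```

Only the first call recovers the original text; the second shows that "no exception" does not mean "correct".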

Other tips

I think this way is easy:

bytes_data = [112, 52, 52]
"".join(map(chr, bytes_data))
>> p44
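
Note that chr() maps each integer to a code point, so the join above only matches a real decode for ASCII or Latin-1 values. A minimal sketch of the difference, assuming the integers are byte values (0 to 255) of UTF-8 text:

```python
bytes_data = [112, 52, 52]

# bytes() accepts an iterable of integers in range(256).
text = bytes(bytes_data).decode("utf-8")        # 'p44'

# For multi-byte characters the two approaches differ.
utf8_values = [0xC2, 0xB5]                      # UTF-8 encoding of the micro sign
joined = "".join(map(chr, utf8_values))         # two characters: 'Â' and 'µ'
decoded = bytes(utf8_values).decode("utf-8")    # one character: 'µ'
```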

You need to decode the byte string and turn it into a character (Unicode) string.

On Python 2

encoding = 'utf-8'
b'hello'.decode(encoding)

On Python 3

encoding = 'utf-8'
str(b'hello', encoding)

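If you don't know the encoding, then to read binary input into a string in a way that is compatible with both Python 3 and Python 2, use the ancient MS-DOS cp437 encoding: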

import sys

PY3K = sys.version_info >= (3, 0)

lines = []
for line in stream:
    if not PY3K:
        lines.append(line)
    else:
        lines.append(line.decode('cp437'))

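Because the encoding is unknown, expect non-English symbols to be translated to characters of cp437 (English characters are not translated, because they match in most single-byte encodings and in UTF-8).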
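
One way to see why the cp437 approach never raises: in CPython's cp437 codec every one of the 256 byte values maps to some character, so the original bytes can always be recovered. A small check, offered as a sketch rather than a proof for other codecs:

```python
all_bytes = bytes(range(256))

# Every byte value decodes to exactly one character under cp437.
text = all_bytes.decode("cp437")
assert len(text) == 256

# Encoding the text again gives back the original bytes.
assert text.encode("cp437") == all_bytes

# ASCII bytes are unchanged; high bytes become accented or box-drawing characters.
sample = b"abc\x80".decode("cp437")   # 'abcÇ'
```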

Decoding arbitrary binary input as UTF-8 is unsafe, because you may get this:

>>> b'\x00\x01\xffsd'.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 2: invalid start byte

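The same applies to latin-1, which was popular (the default?) for Python 2. See the missing points in the code page layout; that is where Python chokes with the infamous "ordinal not in range" error.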

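UPDATE 20150604: There are rumors that Python 3 has the surrogateescape error strategy for encoding stuff into binary data without data loss and crashes, but it needs conversion tests, [binary] -> [str] -> [binary], to validate both performance and reliability.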
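
The [binary] -> [str] -> [binary] conversion test mentioned above can be sketched in a few lines. This only checks reliability on one sample input, not performance; the variable names are made up for this example.

```python
data = bytes(range(256))  # includes bytes that are invalid in UTF-8

# Undecodable bytes become lone surrogates (U+DC80..U+DCFF) instead of raising.
text = data.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
restored = text.encode("utf-8", errors="surrogateescape")
assert restored == data

# The resulting str cannot be encoded with the default strict handler.
try:
    text.encode("utf-8")
except UnicodeEncodeError:
    strict_failed = True
else:
    strict_failed = False
```

The last part matters in practice: a string produced this way may fail later if it is written out without the same error handler.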

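UPDATE 20170116: Thanks to a comment by Nearoo, there is also the possibility to slash-escape all unknown bytes with the backslashreplace error handler. That works only for Python 3, so even with this workaround you will still get inconsistent output from different Python versions: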

import sys

PY3K = sys.version_info >= (3, 0)

lines = []
for line in stream:
    if not PY3K:
        lines.append(line)
    else:
        lines.append(line.decode('utf-8', 'backslashreplace'))

See https://docs.python.org/3/howto/unicode.html#python-s-unicode-support for details.
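
For a concrete picture of what backslashreplace produces when decoding (supported since Python 3.5), here is a short sketch with sample input chosen for illustration:

```python
raw = b"\x80abc"

# The invalid byte is kept visible as a four-character escape sequence.
text = raw.decode("utf-8", errors="backslashreplace")
print(text)  # \x80abc

# The result is not reversible: input that already contains a literal
# backslash escape decodes to the same text.
ambiguous = b"\\x80abc".decode("utf-8", errors="backslashreplace")
```

Because of that ambiguity, this handler suits logs and display better than anything that must be converted back to the original bytes.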

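UPDATE 20170119: I decided to implement a slash-escaping decode that works for both Python 2 and Python 3. It should be slower than the cp437 solution, but it should produce identical results on every Python version.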

# --- preparation

import codecs

def slashescape(err):
    """Codecs error handler. err is a UnicodeDecodeError instance. Return
    a tuple with a replacement for the undecodable part of the input
    and the position where decoding should continue."""
    # bytearray yields integers on both Python 2 and Python 3, and the
    # failing span may cover more than one byte.
    thebytes = bytearray(err.object[err.start:err.end])
    repl = u''.join(u'\\x%02x' % b for b in thebytes)
    return (repl, err.end)

codecs.register_error('slashescape', slashescape)

# --- processing

stream = [b'\x80abc']

lines = []
for line in stream:
    lines.append(line.decode('utf-8', 'slashescape'))

In Python 3, the default encoding is "utf-8", so you can directly use:

b'hello'.decode()

which is equivalent to

b'hello'.decode(encoding="utf-8")

On the other hand, in Python 2, the encoding defaults to the default string encoding. Thus, you should use:

b'hello'.decode(encoding)

where encoding is the encoding you want.

Note: support for keyword arguments was added in Python 2.7.

I think what you actually want is this:

>>> from subprocess import *
>>> command_stdout = Popen(['ls', '-l'], stdout=PIPE).communicate()[0]
>>> command_text = command_stdout.decode(encoding='windows-1252')

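Aaron's answer was correct, except that you need to know WHICH encoding to use. I believe that Windows uses 'windows-1252'. It only matters if you have some unusual (non-ASCII) characters in your content, but then it makes a difference.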

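By the way, the fact that it DOES matter is the reason Python moved to using two different types for binary and text data: it can't convert magically between them, because it doesn't know the encoding unless you tell it! The only way YOU would know is to read the Windows documentation (or read it here).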

Set universal_newlines to True, i.e.

command_stdout = Popen(['ls', '-l'], stdout=PIPE, universal_newlines=True).communicate()[0]
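
A small sketch of the same idea with subprocess.run. To avoid depending on ls, it launches a child Python process through sys.executable; that substitution is an assumption made for this example only.

```python
import subprocess
import sys

# Run a child Python process so the sketch does not depend on `ls`.
result = subprocess.run(
    [sys.executable, "-c", "print('hello')"],
    stdout=subprocess.PIPE,
    universal_newlines=True,  # also spelled text=True since Python 3.7
)

# stdout is already a str, decoded with the locale's preferred encoding,
# and line endings are normalized to "\n".
output = result.stdout
print(type(output).__name__)  # str
```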

While @Aaron Maenpaa's answer just works, a user recently asked:

Is there any simpler way? 'fhand.read().decode("ASCII")' [...] It's so long!

You can use:

command_stdout.decode()

decode() has default arguments:

codecs.decode(obj, encoding='utf-8', errors='strict')
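
The errors argument in that signature controls what happens on invalid input. A brief sketch of three standard handlers, using sample bytes picked for illustration:

```python
import codecs

raw = b"caf\xe9"  # 'café' in Latin-1, not valid UTF-8

# errors='strict' (the default) raises on invalid input.
try:
    codecs.decode(raw, "utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# errors='replace' substitutes U+FFFD for the undecodable byte.
replaced = codecs.decode(raw, "utf-8", errors="replace")   # 'caf\ufffd'

# errors='ignore' silently drops it.
ignored = raw.decode("utf-8", errors="ignore")             # 'caf'
```

Both 'replace' and 'ignore' lose information, so they are a choice about how to fail, not a substitute for knowing the encoding.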

To interpret a byte sequence as text, you have to know the corresponding character encoding:

unicode_text = bytestring.decode(character_encoding)

Example:

>>> b'\xc2\xb5'.decode('utf-8')
'µ'

The ls command may produce output that can't be interpreted as text. File names on Unix may be any sequence of bytes except slash b'/' and zero b'\0':

>>> open(bytes(range(0x100)).translate(None, b'\0/'), 'w').close()

Trying to decode such byte soup using the utf-8 encoding raises UnicodeDecodeError.

It can be worse. The decoding may fail silently and produce mojibake if you use a wrong, incompatible encoding:

>>> '—'.encode('utf-8').decode('cp1252')
'â€”'

The data is corrupted, but your program remains unaware that a failure has occurred.
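
When the wrong codec happened to accept every byte, the damage can sometimes be undone by reversing the steps. This is a sketch that assumes you already know both codecs involved; it is not a general repair tool, and it fails if the mojibake passed through a lossy step.

```python
original = "\u2014"  # the dash character from the example above

# Simulate the damage: UTF-8 bytes decoded as cp1252.
mojibake = original.encode("utf-8").decode("cp1252")
assert mojibake == "\u00e2\u20ac\u201d"

# Reverse the wrong step, then decode with the right codec.
repaired = mojibake.encode("cp1252").decode("utf-8")
assert repaired == original
```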

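In general, the character encoding to use is not embedded in the byte sequence itself. You have to communicate this information out-of-band. Some outcomes are more likely than others, and therefore the chardet module exists, which can guess the character encoding. A single Python script may use multiple character encodings in different places.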
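
chardet is a third-party package and is not shown here. As a much cruder stand-in, a script can try a list of candidate encodings in order. The function name decode_with_fallback and its candidate list are hypothetical, chosen only for this sketch.

```python
def decode_with_fallback(data, encodings=("utf-8", "cp1252")):
    """Return (text, encoding) for the first candidate that decodes cleanly.

    This is a heuristic, not detection: a later candidate may decode
    without error and still be the wrong encoding.
    """
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Last resort: keep the bytes visible as escapes rather than failing.
    return data.decode("ascii", errors="backslashreplace"), "ascii"


text, used = decode_with_fallback(b"caf\xe9")   # falls through to cp1252
```

Putting UTF-8 first is deliberate: it rejects most input that is not UTF-8, whereas single-byte codecs accept almost anything.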

ls output can be converted to a Python string using the os.fsdecode() function, which succeeds even for undecodable filenames (it uses sys.getfilesystemencoding() and the surrogateescape error handler on Unix):

import os
import subprocess

output = os.fsdecode(subprocess.check_output('ls'))

To get the original bytes, you could use os.fsencode().
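
A short round-trip sketch of os.fsdecode() and os.fsencode(). The file name is made up, and the behavior shown is what is expected on Unix; Windows handles file names differently.

```python
import os

raw_name = b"report-\xff.txt"  # \xff is never valid in UTF-8

name = os.fsdecode(raw_name)   # str, usable with open(), os.path, etc.
assert isinstance(name, str)

# On Unix the undecodable byte is carried as a lone surrogate,
# so the original bytes come back unchanged.
assert os.fsencode(name) == raw_name
```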

If you pass the universal_newlines=True parameter, then subprocess uses locale.getpreferredencoding(False) to decode the bytes; for example, that can be cp1252 on Windows.

To decode a byte stream on the fly, io.TextIOWrapper() can be used.
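
A minimal sketch of on-the-fly decoding with io.TextIOWrapper. An io.BytesIO object stands in for a real binary stream such as a subprocess pipe; that stand-in is an assumption made to keep the example self-contained.

```python
import io

# Stand-in for a binary stream such as a subprocess pipe.
binary_stream = io.BytesIO("caf\u00e9\nsecond line\n".encode("utf-8"))

# TextIOWrapper decodes incrementally as lines are read.
text_stream = io.TextIOWrapper(binary_stream, encoding="utf-8")
lines = [line.rstrip("\n") for line in text_stream]
```

The same wrapper can be placed around any readable binary file object, so the whole output never has to be held in memory as bytes first.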

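Different commands may use different character encodings for their output; for example, the dir internal command (cmd) may use cp437. To decode its output, you can pass the encoding explicitly (Python 3.6+):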

output = subprocess.check_output('dir', shell=True, encoding='cp437')

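The filenames may differ from those returned by os.listdir() (which uses the Windows Unicode API); for example, '\xb6' can be substituted with '\x14'. Python's cp437 codec maps b'\x14' to the control character U+0014 instead of U+00B6 (¶). To support filenames with arbitrary Unicode characters, see "Decode PowerShell output possibly containing non-ASCII Unicode characters into a Python string".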

Since this question is actually asking about subprocess output, you have a more direct approach available, because Popen accepts an encoding keyword (in Python 3.6+):

>>> from subprocess import Popen, PIPE
>>> text = Popen(['ls', '-l'], stdout=PIPE, encoding='utf-8').communicate()[0]
>>> type(text)
str
>>> print(text)
total 0
-rw-r--r-- 1 wim badger 0 May 31 12:45 some_file.txt
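
The encoding keyword can be combined with errors to decide what happens when the child writes bytes that do not fit the encoding. In this sketch the child process is a Python one-liner started via sys.executable, chosen only so the example runs anywhere.

```python
import subprocess
import sys

# Child process that writes one byte which is invalid in UTF-8.
child_code = "import sys; sys.stdout.buffer.write(b'ok\\xff\\n')"

text = subprocess.check_output(
    [sys.executable, "-c", child_code],
    encoding="utf-8",
    errors="replace",  # undecodable bytes become U+FFFD instead of raising
)
```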

The general answer for other users is to decode bytes to text:

>>> b'abcde'.decode()
'abcde'

With no argument, sys.getdefaultencoding() will be used. If your data is not in sys.getdefaultencoding(), then you must specify the encoding explicitly in the decode call:



>>> b'caf\xe9'.decode('cp1250')
'café'



	
		
	
	
If you get the following when trying decode():


   AttributeError: 'str' object has no attribute 'decode'


You can also specify the encoding type directly in a cast:

>>> my_byte_str
b'Hello World'

>>> str(my_byte_str, 'utf-8')
'Hello World'
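
One pitfall with the cast: calling str() on bytes without an encoding does not decode anything. A short sketch of the difference:

```python
data = b"Hello World"

# Without an encoding, str() returns the repr of the bytes object.
as_repr = str(data)             # "b'Hello World'"

# With an encoding, str() decodes, like data.decode("utf-8").
as_text = str(data, "utf-8")    # 'Hello World'
```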
	


	
		
	
	
When working with data from Windows systems (with \r\n line endings), my answer is

String = Bytes.decode("utf-8").replace("\r\n", "\n")


Why? Try this with a multiline Input.txt:

Bytes = open("Input.txt", "rb").read()
String = Bytes.decode("utf-8")
open("Output.txt", "w").write(String)


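All your line endings will be doubled (to \r\r\n on Windows, where text mode writes each \n as \r\n), leading to extra empty lines. Python's text-read functions usually normalize line endings so that strings use only \n. If you receive binary data from a Windows system, Python does not have a chance to do that. Thus,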

Bytes = open("Input.txt", "rb").read()
String = Bytes.decode("utf-8").replace("\r\n", "\n")
open("Output.txt", "w").write(String)


will replicate your original file.
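
An alternative, if you would rather keep the \r\n endings in the string, is to turn off newline translation when writing. The sketch below uses a temporary directory and made-up sample data.

```python
import os
import tempfile

original = b"first\r\nsecond\r\n"
text = original.decode("utf-8")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "Output.txt")
    # newline="" writes "\n" and "\r\n" exactly as they appear in the string.
    with open(path, "w", encoding="utf-8", newline="") as outfile:
        outfile.write(text)
    with open(path, "rb") as infile:
        written = infile.read()

assert written == original
```

Which approach fits depends on whether the rest of the program expects normalized \n line endings in its strings.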
	


	
		
	
	
I made a function to clean a list

def cleanLists(self, lista):
    lista = [x.strip() for x in lista]
    lista = [x.replace('\n', '') for x in lista]
    lista = [x.replace('\b', '') for x in lista]
    lista = [x.encode('utf8') for x in lista]
    lista = [x.decode('utf8') for x in lista]

    return lista
	


	
		
	
	
For Python 3, this is a much safer and more Pythonic approach to convert from bytes to a string:

def byte_to_str(bytes_or_str):
    if isinstance(bytes_or_str, bytes): #check if its in bytes
        print(bytes_or_str.decode('utf-8'))
    else:
        print("Object not of byte type")

byte_to_str(b'total 0\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file1\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file2\n')


Output:

total 0
-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file1
-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file2
	


	
		
	
	
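def toString(string):
    try:
        return string.decode("utf-8")
    except (AttributeError, ValueError):
        # str has no decode() in Python 3; undecodable bytes raise a ValueError subclass
        return string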

b = b'97.080.500'
s = '97.080.500'
print(toString(b))
print(toString(s))
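
A variant that checks the type up front instead of relying on exceptions, and that returns the text rather than printing it. The name ensure_str and its parameters are hypothetical, introduced only for this sketch.

```python
def ensure_str(value, encoding="utf-8", errors="strict"):
    """Return value as str, decoding it first if it is bytes-like."""
    if isinstance(value, (bytes, bytearray)):
        return value.decode(encoding, errors)
    if isinstance(value, str):
        return value
    raise TypeError("expected bytes or str, got %s" % type(value).__name__)


from_bytes = ensure_str(b"97.080.500")
from_text = ensure_str("97.080.500")
```

Unlike a broad except clause, this lets a genuine decoding error surface instead of silently handing back the undecoded bytes.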
	


	
		
	
	
If you want to convert any bytes, not just a string that was converted to bytes:

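import base64
import json

with open("bytesfile", "rb") as infile:
    # b85encode returns bytes; decode the ASCII result to get a str
    text = base64.b85encode(infile.read()).decode("ascii")

with open("bytesfile", "rb") as infile:
    text2 = json.dumps(list(infile.read()))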


However, this is not very efficient. It will turn a 2 MB picture into 9 MB.
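
To get a feel for the size overhead, the sketch below encodes 1024 bytes of sample data both ways and converts each result back. The payload is invented for the example, and real files will give somewhat different ratios for the JSON form.

```python
import base64
import json

payload = bytes(range(256)) * 4  # 1024 bytes of sample data

b85_text = base64.b85encode(payload).decode("ascii")
json_text = json.dumps(list(payload))

# Both forms can be turned back into the original bytes.
assert base64.b85decode(b85_text) == payload
assert bytes(json.loads(json_text)) == payload

# Base85 grows the data by a quarter; the JSON list is several times larger.
print(len(payload), len(b85_text), len(json_text))
```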
	


	
		
	
	
From http://docs.python.org/3/library/sys.html,

To write or read binary data from/to the standard streams, use the underlying binary buffer. For example, to write bytes to stdout, use sys.stdout.buffer.write(b'abc').
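
The layering behind sys.stdout.buffer can be reproduced with in-memory objects. Here an io.TextIOWrapper over io.BytesIO stands in for sys.stdout (an assumption made so the result can be inspected); note also that sys.stdout may lack a buffer attribute when it has been replaced by something like io.StringIO.

```python
import io

# Stand-in for sys.stdout: a text layer on top of a binary buffer.
raw = io.BytesIO()
stream = io.TextIOWrapper(raw, encoding="utf-8")

stream.write("text ")               # str goes through the text layer
stream.flush()                      # flush before mixing in direct buffer writes
stream.buffer.write(b"\xff\x00")    # raw bytes bypass the encoder entirely

assert raw.getvalue() == b"text \xff\x00"
```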



	
		
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow