Convert bytes to a string?

https://stackoverflow.com/questions/606191

03-07-2019

Question

I'm using this code to get standard output from an external program:

>>> from subprocess import *
>>> command_stdout = Popen(['ls', '-l'], stdout=PIPE).communicate()[0]

The communicate() method returns an array of bytes:

>>> command_stdout
b'total 0\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file1\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file2\n'

However, I'd like to work with the output as a normal Python string, so that I could print it like this:

>>> print(command_stdout)
-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file1
-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file2

I thought that was what the binascii.b2a_qp() method is for, but when I tried it, I got the same byte array again:

>>> binascii.b2a_qp(command_stdout)
b'total 0\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file1\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file2\n'

Does anybody know how to convert the bytes value back to a string? I mean, using the "batteries" instead of doing it manually. And I'd like it to work with Python 3.

Solution

You need to decode the bytes object to produce a string:

>>> b"abcde"
b'abcde'

# utf-8 is used here because it is a very common encoding, but you
# need to use the encoding your data is actually in.
>>> b"abcde".decode("utf-8") 
'abcde'
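Applied to the subprocess case from the question, the same decode call might look like the sketch below. It assumes the child process writes UTF-8. The helper name run_and_decode is made up for illustration, and sys.executable stands in for ls only so that the example runs on any platform.

```python
import subprocess
import sys


def run_and_decode(args, encoding="utf-8"):
    """Run a command and return its standard output as str (hypothetical helper)."""
    # stdout=PIPE makes communicate() return the raw bytes written by the child
    raw, _ = subprocess.Popen(args, stdout=subprocess.PIPE).communicate()
    # Assumption: the child writes text in `encoding`; adjust it for your data
    return raw.decode(encoding)


text = run_and_decode([sys.executable, "-c", "print('hello')"])
print(type(text).__name__, text.strip())
```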

Other tips

I think this way is easy:

bytes_data = [112, 52, 52]
"".join(map(chr, bytes_data))
>> p44
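Mapping each byte value through chr() gives the same result as decoding with latin-1, since that codec maps byte N to code point N. A small sketch of the equivalence follows; it is only appropriate when the data really is latin-1 or plain ASCII.

```python
bytes_data = [112, 52, 52]

# chr() turns each integer into the character with that code point
joined = "".join(map(chr, bytes_data))

# latin-1 maps every byte value 0-255 to the code point with the same number
decoded = bytes(bytes_data).decode("latin-1")

print(joined, decoded)
```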

You need to decode the byte string to turn it into a character (Unicode) string.

In Python 2:

encoding = 'utf-8'
b'hello'.decode(encoding)

In Python 3:

encoding = 'utf-8'
str(b'hello', encoding)

If you don't know the encoding, then to read binary input into a string in a way that is compatible with both Python 3 and Python 2, use the ancient MS-DOS cp437 encoding:

import sys

PY3K = sys.version_info >= (3, 0)

lines = []
for line in stream:
    if not PY3K:
        lines.append(line)
    else:
        lines.append(line.decode('cp437'))

Because the encoding is unknown, expect non-English symbols to be translated to cp437 characters (English characters are not changed, because they match in most single-byte encodings and in UTF-8).
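A quick way to see why cp437 never raises: Python's cp437 codec assigns a character to every one of the 256 byte values, so any input decodes and can be encoded back unchanged. The sketch below checks this. It says nothing about whether the resulting characters are the ones the author of the data intended.

```python
every_byte = bytes(range(256))

# Decoding cannot fail because each byte value has a cp437 character
text = every_byte.decode("cp437")

# Encoding with the same codec restores the original bytes
restored = text.encode("cp437")

print(len(text), restored == every_byte)
```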

Decoding arbitrary binary input as UTF-8 is unsafe, because you may get this:

>>> b'\x00\x01\xffsd'.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 2: invalid start byte

The same applies to latin-1, which was common (the default?) in Python 2. The gaps in the code page layout are where Python chokes with the infamous "ordinal not in range" error.

UPDATE 20150604: There are rumors that Python 3 has the surrogateescape error strategy for encoding data into binary without data loss or crashes, but it needs conversion tests, [binary] -> [str] -> [binary], to validate both performance and reliability.
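The [binary] -> [str] -> [binary] round trip can be checked directly. In this sketch, a byte that is not valid UTF-8 is decoded to a lone surrogate code point and restored when encoding with the same error handler. Such a string cannot be encoded with the default strict handler, so printing it or writing it to a file may fail.

```python
raw = b"\x00\x01\xffsd"

# 0xff is not valid UTF-8; surrogateescape maps it to the lone surrogate U+DCFF
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler turns the surrogate back into the byte 0xff
restored = text.encode("utf-8", errors="surrogateescape")

print(restored == raw)
```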

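UPDATE 20170116: Thanks to a comment by Nearoo, there is also the option of slash-escaping all unknown bytes with the backslashreplace error handler. That works only in Python 3, so even with this workaround you will still get inconsistent output from different Python versions: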

import sys

PY3K = sys.version_info >= (3, 0)

lines = []
for line in stream:
    if not PY3K:
        lines.append(line)
    else:
        lines.append(line.decode('utf-8', 'backslashreplace'))

See https://docs.python.org/3/howto/unicode.html#python-s-unicode-support for details.
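To make the effect of backslashreplace concrete, here is a minimal Python 3 sketch with a made-up two-item stream. The undecodable byte ends up in the string as a literal backslash escape, while valid input is left alone.

```python
stream = [b"\x80abc", b"plain ascii"]

# Undecodable bytes become a literal backslash escape such as \x80
lines = [line.decode("utf-8", "backslashreplace") for line in stream]

print(lines)
```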

UPDATE 20170119: I decided to implement a slash-escaping decode that works in both Python 2 and Python 3. It should be slower than the cp437 solution, but it should produce identical results on every Python version.

# --- preparation

import codecs

def slashescape(err):
    """ codecs error handler. err is UnicodeDecode instance. return
    a tuple with a replacement for the unencodable part of the input
    and a position where encoding should continue"""
    #print err, dir(err), err.start, err.end, err.object[:err.start]
    thebyte = err.object[err.start:err.end]
    repl = u'\\x'+hex(ord(thebyte))[2:]
    return (repl, err.end)

codecs.register_error('slashescape', slashescape)

# --- processing

stream = [b'\x80abc']

lines = []
for line in stream:
    lines.append(line.decode('utf-8', 'slashescape'))

In Python 3, the default encoding is "utf-8", so you can call decode() directly:

b'hello'.decode()

This is equivalent to:

b'hello'.decode(encoding="utf-8")

In Python 2, on the other hand, the encoding defaults to the default string encoding, so you should use:

b'hello'.decode(encoding)

where encoding is the encoding you want.

Note: support for keyword arguments was added in Python 2.7.

I think this is what you actually want:

>>> from subprocess import *
>>> command_stdout = Popen(['ls', '-l'], stdout=PIPE).communicate()[0]
>>> command_text = command_stdout.decode(encoding='windows-1252')

Aaron's answer was correct, except that you need to know which encoding to use. I believe that Windows uses 'windows-1252'. It only matters if your content contains unusual (non-ASCII) characters, but then it does make a difference.

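By the way, the fact that it does matter is the reason Python moved to two different types for binary and text data: it cannot convert between them unless you tell it the encoding. The only way you would know is to read the Windows documentation (or read it here).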

Set universal_newlines to True, i.e.

command_stdout = Popen(['ls', '-l'], stdout=PIPE, universal_newlines=True).communicate()[0]
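A self-contained sketch of the same idea follows. It assumes Python 3.5 or later for subprocess.run, and it uses sys.executable rather than ls so that it runs on any platform. With universal_newlines=True, the output is decoded with the locale's preferred encoding and line endings are normalized to \n.

```python
import subprocess
import sys

result = subprocess.run(
    [sys.executable, "-c", "print('file1'); print('file2')"],
    stdout=subprocess.PIPE,
    universal_newlines=True,  # decode output to str and normalize newlines
)

# result.stdout is already str, so no decode() call is needed
names = result.stdout.splitlines()
print(names)
```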

@Aaron Maenpaa's answer works, but a user recently asked:

Is there any simpler way? 'fhand.read().decode("ASCII")' [...] It's so long!

You can use:

command_stdout.decode()

decode() has standard default arguments:

codecs.decode(obj, encoding='utf-8', errors='strict')

To interpret a byte sequence as text, you have to know the corresponding character encoding:

unicode_text = bytestring.decode(character_encoding)

Example:

>>> b'\xc2\xb5'.decode('utf-8')
'µ'

The ls command may produce output that can't be interpreted as text. File names on Unix may be any sequence of bytes except the slash b'/' and the zero byte b'\0':

>>> open(bytes(range(0x100)).translate(None, b'\0/'), 'w').close()

Trying to decode such byte soup with the utf-8 encoding raises UnicodeDecodeError.

It can be worse. Decoding may fail silently and produce mojibake if you use a wrong, incompatible encoding:

>>> '—'.encode('utf-8').decode('cp1252')
'â€”'

The data is corrupted, but your program remains unaware that a failure has occurred.

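In general, the character encoding to use is not embedded in the byte sequence itself. You have to communicate this information out-of-band. Some outcomes are more likely than others, which is why the chardet module exists: it can guess the character encoding. A single Python script may use multiple character encodings in different places.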
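chardet is a third-party package and is not shown here. As a much cruder alternative that uses only the standard library, the sketch below tries a caller-supplied list of candidate encodings in order and reports the first one that decodes without error. The function name decode_with_fallbacks and the candidate list are assumptions for illustration. A successful decode does not prove the guess is right, as the mojibake example shows.

```python
def decode_with_fallbacks(data, candidates=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding) for the first candidate that decodes cleanly.

    Hypothetical helper: the order of `candidates` encodes your own prior
    about which encodings are likely. latin-1 accepts every byte, so it
    only makes sense as the last entry.
    """
    for name in candidates:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


print(decode_with_fallbacks("µ".encode("utf-8")))
print(decode_with_fallbacks(b"caf\xe9"))
```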

The output of ls can be converted to a Python string with os.fsdecode(), a function that succeeds even for undecodable filenames (on Unix it uses sys.getfilesystemencoding() and the surrogateescape error handler):

import os
import subprocess

output = os.fsdecode(subprocess.check_output('ls'))

To get the original bytes back, you can use os.fsencode().
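A minimal round-trip sketch, assuming a Unix system where the filesystem error handler is surrogateescape. On Windows the error handler differs, so the same bytes may raise instead. The file name is invented for the example and no file is created.

```python
import os

# A file name as raw bytes; 0xe9 on its own is not valid UTF-8
raw_name = b"caf\xe9.txt"

# On Unix, undecodable bytes are preserved in the str instead of raising
name = os.fsdecode(raw_name)

# os.fsencode() reverses the conversion and restores the original bytes
restored = os.fsencode(name)

print(restored == raw_name)
```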

If you pass the universal_newlines=True parameter, subprocess uses locale.getpreferredencoding(False) to decode the bytes; on Windows, for example, that can be cp1252.

To decode a byte stream on the fly, io.TextIOWrapper() can be used.
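A short sketch of on-the-fly decoding. An in-memory io.BytesIO object stands in for a real binary stream such as a pipe, and UTF-8 is an assumed encoding.

```python
import io

# Stand-in for a binary stream such as a pipe or a file opened in "rb" mode
byte_stream = io.BytesIO("first line\nsecond line with µ\n".encode("utf-8"))

# The wrapper decodes incrementally as lines are requested
text_stream = io.TextIOWrapper(byte_stream, encoding="utf-8")

lines = [line.rstrip("\n") for line in text_stream]
print(lines)
```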

Different commands may use different character encodings for their output; for example, the dir internal command (cmd) may use cp437. To decode its output, you can pass the encoding explicitly (Python 3.6+):

output = subprocess.check_output('dir', shell=True, encoding='cp437')

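The filenames may differ from those returned by os.listdir() (which uses the Windows Unicode API); for example, '\xb6' can be substituted with '\x14', because Python's cp437 codec maps b'\x14' to the control character U+0014 instead of U+00B6 (¶). To support filenames with arbitrary Unicode characters, see "Decode PowerShell output possibly containing non-ASCII Unicode characters into a Python string".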

Since this question is actually asking about subprocess output, a more direct approach is available: Popen accepts an encoding keyword (Python 3.6+):

>>> from subprocess import Popen, PIPE
>>> text = Popen(['ls', '-l'], stdout=PIPE, encoding='utf-8').communicate()[0]
>>> type(text)
str
>>> print(text)
total 0
-rw-r--r-- 1 wim badger 0 May 31 12:45 some_file.txt

The general answer for other users is to decode bytes to text:

>>> b'abcde'.decode()
'abcde'

With no argument, sys.getdefaultencoding() is used. If your data is not in sys.getdefaultencoding(), you must specify the encoding explicitly in the decode call:

>>> b'caf\xe9'.decode('cp1250')
'café'

If you call decode() and get the following:

AttributeError: 'str' object has no attribute 'decode'

the object is already a str. For a bytes object, you can also specify the encoding directly in a cast:

>>> my_byte_str
b'Hello World'

>>> str(my_byte_str, 'utf-8')
'Hello World'
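One pitfall with the cast is worth a sketch: the encoding argument is what triggers decoding. Without it, str() returns the printable representation of the bytes object, including the b prefix and the quotes.

```python
my_byte_str = b"Hello World"

# With an encoding, str() decodes the bytes
decoded = str(my_byte_str, "utf-8")

# Without an encoding, str() returns the repr of the bytes object instead
not_decoded = str(my_byte_str)

print(decoded)
print(not_decoded)
```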

When working with data from Windows systems (with \r\n line endings), my answer is:

String = Bytes.decode("utf-8").replace("\r\n", "\n")

Why? Try this with a multiline Input.txt:

Bytes = open("Input.txt", "rb").read()
String = Bytes.decode("utf-8")
open("Output.txt", "w").write(String)

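All your line endings will be doubled (to \r\r\n), leading to extra empty lines. Python's text-read functions normally normalize line endings so that strings use only \n. If you receive binary data from a Windows system, Python does not get a chance to do that. Thus,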

Bytes = open("Input.txt", "rb").read()
String = Bytes.decode("utf-8").replace("\r\n", "\n")
open("Output.txt", "w").write(String)

will replicate your original file.
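An alternative sketch, assuming Python 3: instead of stripping the carriage returns, the output file can be opened with newline="" so that Python writes the decoded text without translating line endings. The temporary directory and the sample bytes are only there to make the example self-contained.

```python
import os
import tempfile

raw = b"line one\r\nline two\r\n"  # stand-in for the contents of Input.txt
text = raw.decode("utf-8")  # the \r\n pairs are still present in the str

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "Output.txt")
    # newline="" disables newline translation on write, on every platform
    with open(path, "w", encoding="utf-8", newline="") as outfile:
        outfile.write(text)
    with open(path, "rb") as check:
        written = check.read()

print(written == raw)
```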

I made a function to clean a list:

def cleanLists(self, lista):
    lista = [x.strip() for x in lista]
    lista = [x.replace('\n', '') for x in lista]
    lista = [x.replace('\b', '') for x in lista]
    lista = [x.encode('utf8') for x in lista]
    lista = [x.decode('utf8') for x in lista]

    return lista

For Python 3, this is a much safer and more Pythonic approach to convert from bytes to a string:

def byte_to_str(bytes_or_str):
    if isinstance(bytes_or_str, bytes): #check if its in bytes
        print(bytes_or_str.decode('utf-8'))
    else:
        print("Object not of byte type")

byte_to_str(b'total 0\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file1\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file2\n')

Output:

total 0
-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file1
-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file2

def toString(string):
    try:
        return string.decode("utf-8")
    except AttributeError:
        # str objects have no decode() in Python 3; return them unchanged
        return string

b = b'97.080.500'
s = '97.080.500'
print(toString(b))
print(toString(s))

If you want to convert arbitrary bytes, not just a string that was converted to bytes:

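import base64
import json

with open("bytesfile", "rb") as infile:
    # b85encode() returns ASCII-only bytes, so decode them to get a str
    text = base64.b85encode(infile.read()).decode("ascii")

with open("bytesfile", "rb") as infile:
    text2 = json.dumps(list(infile.read()))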

This is not very efficient, though. It will turn a 2 MB picture into 9 MB.
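A round-trip sketch of the Base85 variant, using made-up sample data rather than a file. Base85 represents every 4 input bytes as 5 ASCII characters, so the text form is about 25% larger than the input.

```python
import base64

raw = bytes(range(256))  # arbitrary binary data, not valid UTF-8 text

# b85encode() returns ASCII-only bytes, so decoding as ASCII cannot fail
as_text = base64.b85encode(raw).decode("ascii")

# The original bytes are recovered by reversing both steps
restored = base64.b85decode(as_text.encode("ascii"))

print(len(raw), len(as_text), restored == raw)
```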

From http://docs.python.org/3/library/sys.html:

To write or read binary data from/to the standard streams, use the underlying binary buffer. For example, to write bytes to stdout, use sys.stdout.buffer.write(b'abc').

License: CC-BY-SA with attribution

Not affiliated with StackOverflow