Python File Slurp w/ endian conversion

https://stackoverflow.com/questions/1632673

06-07-2019

Question

I recently asked how to slurp a whole file in Python, and the accepted answer suggested something like:

with open('x.txt') as x: f = x.read()

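How would I go about reading a file in and converting the endian representation of its data?

For example, suppose I have a 1 GB binary file that is just single-precision floats packed as big endian, and I want to convert it to little endian and dump it into a numpy array. Below is the function I wrote to accomplish this, along with the real code that calls it. I use struct.unpack to do the endian conversion and tried to speed everything up with mmap.

My question is: am I using the slurp correctly with mmap and struct.unpack? Is there a cleaner, faster way to do this? What I have now works, but I would like to learn how to do it better.

Thanks in advance!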

#!/usr/bin/python
from struct import unpack
import mmap
import numpy as np

def mmapChannel(arrayName,  fileName,  channelNo,  line_count,  sample_count):
    """
    We need to read in the asf internal file and convert it into a numpy array.
    It is stored as a single row, and is binary. The number of lines (rows), samples (columns),
    and channels all come from the .meta text file
    Also, internal format files are packed big endian, but most systems use little endian, so we need
    to make that conversion as well.
    Memory mapping seemed to improve the ingestion speed a bit
    """
    # memory-map the file, size 0 means whole file
    # length = line_count * sample_count * arrayName.itemsize
    print "\tMemory Mapping..."
    with open(fileName, "rb") as f:
        map = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        map.seek(channelNo*line_count*sample_count*arrayName.itemsize)

        for i in xrange(line_count*sample_count):
            arrayName[0, i] = unpack('>f', map.read(arrayName.itemsize) )[0]

        # Same method as above, just more verbose for the maintenance programmer.
        #        for i in xrange(line_count*sample_count): #row
        #            be_float = map.read(arrayName.itemsize) # arrayName.itemsize should be 4 for float32
        #            le_float = unpack('>f', be_float)[0] # > for big endian, < for little endian
        #            arrayName[0, i]= le_float

        map.close()
    return arrayName

print "Initializing the Amp HH HV, and Phase HH HV arrays..."
HHamp = np.ones((1,  line_count*sample_count),  dtype='float32')
HHphase = np.ones((1,  line_count*sample_count),  dtype='float32')
HVamp = np.ones((1,  line_count*sample_count),  dtype='float32')
HVphase = np.ones((1,  line_count*sample_count),  dtype='float32')



print "Ingesting HH_Amp..."
HHamp = mmapChannel(HHamp, 'ALPSRP042301700-P1.1__A.img',  0,  line_count,  sample_count)
print "Ingesting HH_phase..."
HHphase = mmapChannel(HHphase, 'ALPSRP042301700-P1.1__A.img',  1,  line_count,  sample_count)
print "Ingesting HV_AMP..."
HVamp = mmapChannel(HVamp, 'ALPSRP042301700-P1.1__A.img',  2,  line_count,  sample_count)
print "Ingesting HV_phase..."
HVphase = mmapChannel(HVphase, 'ALPSRP042301700-P1.1__A.img',  3,  line_count,  sample_count)

print "Reshaping...."
HHamp_orig = HHamp.reshape(line_count, -1)
HHphase_orig = HHphase.reshape(line_count, -1)
HVamp_orig = HVamp.reshape(line_count, -1)
HVphase_orig = HVphase.reshape(line_count, -1)

Solution

with open(fileName, "rb") as f:
  arrayName = numpy.fromfile(f, numpy.float32)
arrayName.byteswap(True)

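Pretty hard to beat for speed and conciseness ;-). For byteswap, see the NumPy documentation (the True argument means "do it in place"); the same goes for fromfile.

This works as is on little-endian machines (the data are big-endian, so the byteswap is needed). To perform the byteswap only when it is needed, test for that case and change the last line from an unconditional call to byteswap into: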

if struct.pack('=f', 2.3) == struct.pack('<f', 2.3):
  arrayName.byteswap(True)

i.e., a call to byteswap that is conditional on a test for little-endianness.

Other tips

A slight modification of @Alex Martelli's answer:
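As a variation on the test above, the host byte order can also be read from sys.byteorder instead of comparing packed floats. The following is only a sketch written for Python 3: the helper name to_native_from_big_endian and the sample values are made up for illustration, and the in-memory bytes stand in for what numpy.fromfile(f, numpy.float32) would return for a big-endian file.

```python
import sys

import numpy as np


def to_native_from_big_endian(arr):
    """Byteswap in place only when the host is little-endian.

    Assumes `arr` holds big-endian bytes that were read as native float32.
    (Hypothetical helper, not part of NumPy.)
    """
    if sys.byteorder == "little":
        arr.byteswap(inplace=True)  # same effect as arr.byteswap(True)
    return arr


# Build big-endian bytes for known values, then reinterpret them as native
# float32, mimicking a native-dtype read of a big-endian file.
raw = np.array([1.0, 2.5, -3.0], dtype=">f4").tobytes()
data = np.frombuffer(raw, dtype=np.float32).copy()  # copy: frombuffer is read-only
data = to_native_from_big_endian(data)
```

On a big-endian host the bytes are already in native order, so the helper leaves the array untouched.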

arr = numpy.fromfile(filename, numpy.dtype('>f4'))
# no byteswap is needed regardless of the endianness of the machine
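Applied to the per-channel layout from the question, the same idea might look like the sketch below (Python 3). It assumes, as the seek offset in the question implies, that the channels are stored back to back as big-endian float32. The function name read_channel and the tiny temporary file are invented for the demonstration and are not part of the original code.

```python
import os
import tempfile

import numpy as np


def read_channel(file_name, channel_no, line_count, sample_count):
    """Read one channel of big-endian float32 samples as a native 2-D array."""
    be_float = np.dtype(">f4")  # explicit big-endian, 4-byte float
    count = line_count * sample_count
    with open(file_name, "rb") as f:
        # Skip the channels that come before the requested one.
        f.seek(channel_no * count * be_float.itemsize)
        channel = np.fromfile(f, dtype=be_float, count=count)
    # astype converts to the machine's native byte order on any host.
    return channel.astype(np.float32).reshape(line_count, sample_count)


# Demonstration on a small stand-in file: 4 channels of 2 lines x 3 samples.
line_count, sample_count, channel_count = 2, 3, 4
expected = np.arange(channel_count * line_count * sample_count, dtype=np.float32)
with tempfile.NamedTemporaryFile(suffix=".img", delete=False) as tmp:
    tmp.write(expected.astype(">f4").tobytes())
    path = tmp.name
try:
    hv_amp = read_channel(path, 2, line_count, sample_count)
finally:
    os.remove(path)
```

Because the dtype carries the byte order, no separate endianness test is needed. The astype call costs one extra copy of the channel, which may or may not matter at 1 GB.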

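You could cobble together an ASM-based solution using CorePy. I wonder, though, whether you could gain enough performance from some other part of your algorithm instead. I/O and manipulation of 1 GB chunks of data will take a while whichever way you slice it.

Another thing you might find helpful is to switch to C once you have prototyped the algorithm in Python. I did this once for manipulations on a whole-world DEM (height) data set. The whole thing was much more tolerable once I got away from the interpreted script.

I'd expect something like this to be faster: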

arrayName[0] = unpack('>'+'f'*line_count*sample_count, map.read(arrayName.itemsize*line_count*sample_count))

Please don't use map as a variable name; it shadows the built-in map function.
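A small sketch of the single-call approach, written for Python 3 and using a name that does not shadow a built-in. The helper unpack_floats, the variable mapped, and the four sample values are illustrative assumptions. A repeat count in the format string (for example ">3f") stands in for building 'f' * count.

```python
import mmap
import os
import struct
import tempfile


def unpack_floats(buf, offset, count):
    """Unpack `count` big-endian float32 values starting at byte `offset`."""
    fmt = ">%df" % count  # e.g. ">3f": big-endian, three 4-byte floats
    return struct.unpack_from(fmt, buf, offset)


# Stand-in data file holding four big-endian floats.
values = (1.0, 2.0, 0.5, -4.0)
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(struct.pack(">4f", *values))
    path = tmp.name
try:
    with open(path, "rb") as f:
        mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            # Skip the first float (4 bytes) and unpack the remaining three.
            result = unpack_floats(mapped, 4, 3)
        finally:
            mapped.close()
finally:
    os.remove(path)
```

This still builds one Python float object per sample, so for files in the gigabyte range the NumPy-based answers above are likely to stay ahead.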

License: CC-BY-SA with attribution

Not affiliated with StackOverflow