Someone recently asked how to slurp a file in Python, and the accepted answer suggested something like:

with open('x.txt') as x: f = x.read()

How would I go about doing this to read in a file and convert the endian representation of the data?

For example, I have a 1 GB binary file that is just a bunch of single-precision floats packed big-endian, and I want to convert it to little-endian and dump it into a numpy array. Below is the function I wrote to accomplish this, along with some actual code that calls it. I use struct.unpack for the endian conversion and tried to speed everything up with mmap.

My question is: am I using slurp correctly with mmap and struct.unpack? Is there a cleaner, faster way to do this? Right now what I have works, but I would really like to learn how to do it better.

Thanks in advance!

#!/usr/bin/python
from struct import unpack
import mmap
import numpy as np

def mmapChannel(arrayName, fileName, channelNo, line_count, sample_count):
    """
    We need to read in the asf internal file and convert it into a numpy array.
    It is stored as a single row, and is binary. The number of lines (rows), samples (columns),
    and channels all come from the .meta text file.
    Also, internal format files are packed big endian, but most systems use little endian, so we need
    to make that conversion as well.
    Memory mapping seemed to improve the ingestion speed a bit.
    """
    # memory-map the file, size 0 means whole file
    # length = line_count * sample_count * arrayName.itemsize
    print "\tMemory Mapping..."
    with open(fileName, "rb") as f:
        map = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        map.seek(channelNo*line_count*sample_count*arrayName.itemsize)

        for i in xrange(line_count*sample_count):
            arrayName[0, i] = unpack('>f', map.read(arrayName.itemsize) )[0]

        # Same method as above, just more verbose for the maintenance programmer.
        #        for i in xrange(line_count*sample_count): #row
        #            be_float = map.read(arrayName.itemsize) # arrayName.itemsize should be 4 for float32
        #            le_float = unpack('>f', be_float)[0] # > for big endian, < for little endian
        #            arrayName[0, i]= le_float

        map.close()
    return arrayName

print "Initializing the Amp HH HV, and Phase HH HV arrays..."
HHamp = np.ones((1, line_count*sample_count), dtype='float32')
HHphase = np.ones((1, line_count*sample_count), dtype='float32')
HVamp = np.ones((1, line_count*sample_count), dtype='float32')
HVphase = np.ones((1, line_count*sample_count), dtype='float32')



print "Ingesting HH_Amp..."
HHamp = mmapChannel(HHamp, 'ALPSRP042301700-P1.1__A.img', 0, line_count, sample_count)
print "Ingesting HH_phase..."
HHphase = mmapChannel(HHphase, 'ALPSRP042301700-P1.1__A.img', 1, line_count, sample_count)
print "Ingesting HV_AMP..."
HVamp = mmapChannel(HVamp, 'ALPSRP042301700-P1.1__A.img', 2, line_count, sample_count)
print "Ingesting HV_phase..."
HVphase = mmapChannel(HVphase, 'ALPSRP042301700-P1.1__A.img', 3, line_count, sample_count)

print "Reshaping...."
HHamp_orig = HHamp.reshape(line_count, -1)
HHphase_orig = HHphase.reshape(line_count, -1)
HVamp_orig = HVamp.reshape(line_count, -1)
HVphase_orig = HVphase.reshape(line_count, -1)

Solution

with open(fileName, "rb") as f:
  arrayName = numpy.fromfile(f, numpy.float32)
arrayName.byteswap(True)

Hard to beat for speed and simplicity ;-). For byteswap, see here (the True argument means "do it in place"); for fromfile, see here.

This works on little-endian machines (since the data is big-endian, the byteswap is needed). You can test the machine's byte order and perform the byteswap conditionally, changing the last line from an unconditional call to byteswap into, for example:

if struct.pack('=f', 2.3) == struct.pack('<f', 2.3):
  arrayName.byteswap(True)

i.e., a call to byteswap conditioned on a test of little-endianness.
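As a sanity check, here is a minimal, self-contained sketch of that approach (assuming Python 3 and a current NumPy, with a small throwaway temp file standing in for the real 1 GB .img file):

```python
import struct
import tempfile

import numpy as np

# Pack a few single-precision floats big-endian, as the real file would be.
values = [1.5, -2.25, 3.0]
with tempfile.NamedTemporaryFile(suffix=".img", delete=False) as tmp:
    tmp.write(struct.pack('>3f', *values))
    path = tmp.name

# Slurp the whole file, then byteswap in place only on little-endian hosts.
with open(path, "rb") as f:
    arr = np.fromfile(f, np.float32)
if struct.pack('=f', 2.3) == struct.pack('<f', 2.3):
    arr.byteswap(True)

print(arr.tolist())  # [1.5, -2.25, 3.0]
```

The same two lines scale unchanged to the 1 GB file; numpy performs the read and the swap in bulk rather than one float at a time.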

Other tips

A slight modification of @Alex Martelli's answer:

arr = numpy.fromfile(filename, numpy.dtype('>f4'))
# no byteswap is needed regardless of endianness of the machine
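A quick sketch of why this works, using made-up values and a small temp file in place of the real data: the '>f4' dtype records the on-disk byte order, so numpy interprets the bytes correctly on any host.

```python
import struct
import tempfile

import numpy as np

# A few floats packed big-endian, as the .img channel data would be.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(struct.pack('>4f', 0.5, 1.0, -8.0, 42.0))
    path = tmp.name

# '>f4' tells numpy the on-disk byte order; no explicit byteswap is needed.
arr = np.fromfile(path, np.dtype('>f4'))
print(arr.tolist())  # [0.5, 1.0, -8.0, 42.0]

# Optionally convert to native byte order for downstream code.
native = arr.astype(np.float32)
```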

You could put together an ASM-based solution using CorePy. I wonder, though, whether you could gain enough performance from some other part of your algorithm. I/O and manipulation of a 1 GB chunk of data are going to take a while no matter how you slice it.

One other thing you might find helpful would be to switch to C once you have prototyped the algorithm in Python. I did this once for manipulations of a whole-world DEM (height) data set. The whole thing was much more tolerable once I got away from the interpreted script.

I would expect something like this to be faster:

arrayName[0] = unpack('>'+'f'*line_count*sample_count, map.read(arrayName.itemsize*line_count*sample_count))
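The idea is to issue a single unpack call for the whole channel instead of one per float, avoiding millions of Python-level round trips. A toy sketch with made-up values (three floats standing in for line_count*sample_count):

```python
import struct

import numpy as np

# One unpack call for all the big-endian floats at once.
n = 3  # stand-in for line_count * sample_count
raw = struct.pack('>' + 'f' * n, 1.0, 2.5, -0.5)
arr = np.array(struct.unpack('>' + 'f' * n, raw), dtype=np.float32)
print(arr.tolist())  # [1.0, 2.5, -0.5]
```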

Please don't use map as a variable name, since it shadows the built-in function.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow