How to pipe binary data into numpy arrays without tmp storage?

https://stackoverflow.com/questions/13059444

14-07-2021

Question

There are several similar questions but none of them answers this simple question directly:

How can I capture a command's output and stream it into numpy arrays without creating a temporary string object to read from?

So, what I would like to do is this:

import subprocess
import numpy
import StringIO

def parse_header(fileobject):
    # this function moves the filepointer and returns a dictionary
    d = do_some_parsing(fileobject)
    return d

sio = StringIO.StringIO(subprocess.check_output(cmd))
d = parse_header(sio)
# now the file pointer is at the start of data, parse_header takes care of that.
# ALL of the data is now available in the next line of sio
dt = numpy.dtype([(key, 'f8') for key in d.keys()])

# I don't know how to make this work:
data = numpy.fromxxxx(sio , dt)

# If I do this, I create another copy besides the StringIO object, don't I?
# So this works, but isn't it 'bad'?
datastring = sio.read()
data = numpy.fromstring(datastring, dtype=dt)

I tried it with StringIO and cStringIO, but neither is accepted by numpy.frombuffer or numpy.fromfile.

With a StringIO object I first have to read the stream into a string and then use numpy.fromstring, but I would like to avoid creating the intermediate object (several gigabytes).

An alternative for me would be to stream sys.stdin into numpy arrays, but that does not work with numpy.fromfile either (seek needs to be implemented).

Are there any work-arounds for this? I can't be the first one trying this (unless this is a PEBKAC case?)

Solution: This is my current solution. It is a mix of unutbu's instructions on how to use Popen with PIPE and eryksun's hint to use a bytearray, so I don't know whose answer to accept!? :S

import subprocess as sp
import numpy as np

proc = sp.Popen(cmd, stdout=sp.PIPE, shell=True)
d = parse_des_header(proc.stdout)
rec_dtype = np.dtype([(key, 'f8') for key in d.keys()])
data = bytearray(proc.stdout.read())
ndata = np.frombuffer(data, dtype=rec_dtype)

I didn't check whether this really avoids creating another copy, and I don't know how to check. But I noticed that it works much faster than everything I tried before, so many thanks to the authors of both answers!

Solution

You can use Popen with stdout=subprocess.PIPE. Read in the header, then load the rest into a bytearray to use with np.frombuffer.

Additional comments based on your edit:

If you're going to call proc.stdout.read(), it's equivalent to using check_output(). Both create a temporary string. If you preallocate data, you could use proc.stdout.readinto(data). Then if the number of bytes read into data is less than len(data), free the excess memory, else extend data by whatever is left to be read.

data = bytearray(2**32) # 4 GiB
n = proc.stdout.readinto(data)
if n < len(data):
    del data[n:]         # drop the unused tail (works on Python 2 and 3)
else:
    data += proc.stdout.read()
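If the output size is not known and preallocating several gigabytes up front is undesirable, one alternative is to append fixed-size chunks to a growing bytearray, so that the only temporary object is a single chunk. This is a different technique from readinto(), shown here as a minimal sketch. The helper name read_stream_into_bytearray, the chunk size, the one-line header format, and the io.BytesIO stand-in for proc.stdout are all assumptions made for illustration.

```python
import io
import struct

import numpy as np


def read_stream_into_bytearray(stream, chunk_size=1 << 20):
    """Read a binary stream to EOF, holding at most one chunk as a temporary."""
    data = bytearray()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # b'' signals end of stream
            break
        data += chunk  # appended in place; the chunk itself is then discarded
    return data


# Stand-in for proc.stdout: a header line followed by packed little-endian doubles.
values = [1.0, 2.0, 3.0, 4.0]
stream = io.BytesIO(b'a b\n' + struct.pack('<4d', *values))

names = stream.readline().decode('ascii').split()  # hypothetical header format
data = read_stream_into_bytearray(stream, chunk_size=8)
records = np.frombuffer(data, dtype=np.dtype([(name, '<f8') for name in names]))
print(records)
```

The bytearray may still be reallocated as it grows, so this trades the large up-front allocation for some copying during growth.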

You could also come at this starting with a pre-allocated ndarray ndata and use buf = np.getbuffer(ndata) (available on Python 2 only; on Python 3, memoryview(ndata) serves the same purpose). Then readinto(buf) as above.
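A rough sketch of the pre-allocated ndarray variant on Python 3 might look like the following. It assumes the number of records is known before reading (for example from the header), and it uses io.BytesIO as a stand-in for proc.stdout; the helper name read_exactly_into and the field names are hypothetical.

```python
import io
import struct

import numpy as np


def read_exactly_into(stream, array):
    """Fill `array` from a binary stream without an intermediate bytes object."""
    view = memoryview(array.view(np.uint8))  # flat, writable byte view of the array
    filled = 0
    while filled < len(view):
        n = stream.readinto(view[filled:])  # writes straight into the array's memory
        if not n:
            raise EOFError('stream ended after %d of %d bytes' % (filled, len(view)))
        filled += n
    return filled


rec_dtype = np.dtype([('a', '<f8'), ('b', '<f8')])  # hypothetical field names
n_records = 3  # assumed known in advance
# Stand-in for proc.stdout with three (a, b) records.
stream = io.BytesIO(struct.pack('<6d', 1.0, 2.0, 3.0, 4.0, 5.0, 6.0))

ndata = np.empty(n_records, dtype=rec_dtype)
n_bytes = read_exactly_into(stream, ndata)
print(n_bytes, ndata)
```

The loop is there because a single readinto() call is not guaranteed to fill the whole buffer on every kind of stream.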

Here's an example to show that the memory is shared between the bytearray and the np.ndarray:

>>> data = bytearray(b'\x01')
>>> ndata = np.frombuffer(data, np.int8)
>>> ndata
array([1], dtype=int8)
>>> ndata[0] = 2
>>> data
bytearray(b'\x02')
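The sharing can also be checked without modifying the data. The sketch below inspects the array's flags and uses np.shares_memory; the variable names probe and copied are only for illustration.

```python
import numpy as np

data = bytearray(b'\x01\x02\x03\x04')
ndata = np.frombuffer(data, dtype=np.int8)

# An array created by frombuffer borrows the buffer instead of owning its memory.
print(ndata.flags.owndata)

# A second view of the same bytearray overlaps with the first one.
probe = np.frombuffer(data, dtype=np.uint8)
print(np.shares_memory(ndata, probe))

# For contrast: bytes(data) makes a copy, so an array built on it does not overlap.
copied = np.frombuffer(bytes(data), dtype=np.int8)
print(np.shares_memory(ndata, copied))
```

One consequence of the sharing: while such an array is alive, resizing the bytearray (for example with extend or del data[n:]) raises BufferError, so any trimming has to happen before np.frombuffer is called.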

Other tips

Since your data can easily fit in RAM, I think the easiest way to load the data into a numpy array is to use a ramfs.

On Linux,

sudo mkdir /mnt/ramfs
sudo mount -t ramfs -o size=5G ramfs /mnt/ramfs
sudo chmod 777 /mnt/ramfs

Then, for example, if this is the producer of the binary data:

writer.py:

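import random
import struct
import sys

# Binary stdout: sys.stdout.buffer on Python 3, sys.stdout itself on Python 2.
out = getattr(sys.stdout, 'buffer', sys.stdout)
N = random.randrange(100)
out.write(b'a b\n')  # header line naming the two fields
for i in range(2 * N):
    out.write(struct.pack('<d', random.random()))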

Then you could load it into a numpy array like this:

reader.py:

import subprocess
import sys
import numpy

def parse_header(f):
    # this function moves the file pointer and returns a dictionary
    header = f.readline().decode('ascii')
    d = dict.fromkeys(header.split())
    return d

filename = '/mnt/ramfs/data.out'
with open(filename, 'wb') as f:
    # run writer.py with the current interpreter and send its stdout to the file
    proc = subprocess.Popen([sys.executable, 'writer.py'], stdout=f)
    proc.communicate()
with open(filename, 'rb') as f:
    header = parse_header(f)
    # field order follows dict order (insertion order on Python 3.7+)
    dt = numpy.dtype([(key, 'f8') for key in header.keys()])
    data = numpy.fromfile(f, dt)

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow