Question

I have some large files (around 10 GB even gzipped) which contain an ASCII header and then, in principle, numpy.recarrays of about 3 MB each; we call them "events". My first approach looked like this:

import gzip
import numpy as np

f = gzip.GzipFile(filename)
f.read(10000)  # skip the fixed-length ASCII header
event_dtype = np.dtype([
    ('Id', '>u4'),                # simplified
    ('UnixTimeUTC', '>u4', 2),
    ('Data', '>i2', (1600, 1024)),
])
event = np.fromfile(f, dtype=event_dtype, count=1)

However, this is not possible, since np.fromfile needs a real FILE object because it makes low-level C calls (I found a pretty old ticket about this: https://github.com/numpy/numpy/issues/1103).

So as I understand it, I have to do it like this:

s = f.read(event_dtype.itemsize)
event = np.fromstring(s, dtype=event_dtype, count=1)

And yes, it works! But isn't this awfully inefficient? Isn't the memory for s allocated and garbage-collected for every event? On my laptop I reach something like 16 events/s, i.e. ~50 MB/s.

I wonder if anybody knows a smart way to allocate the memory once and then let numpy read directly into that memory.
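Something like the following sketch is what I have in mind, assuming GzipFile supports readinto() (it inherits it from io.BufferedIOBase) and with process() as a placeholder for the per-event work:

import gzip
import numpy as np

event = np.empty(1, dtype=event_dtype)  # allocate the buffer once
with gzip.open(filename, 'rb') as f:
    f.read(10000)  # skip the fixed-length ASCII header
    # numpy arrays expose a writable buffer, so readinto() can fill
    # the same memory for every event; a short read means EOF
    while f.readinto(event) == event.nbytes:
        process(event[0])  # placeholder per-event handler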

Btw. I'm a physicist, so ... well still a newbie in this business.

Solution

@Bakuriu is probably correct that this is a micro-optimization. Your bottleneck is almost certainly I/O, and after that, decompression. Allocating the memory twice probably isn't significant.

However, if you wanted to avoid the extra memory allocation, you could use numpy.frombuffer to view the string as a numpy array.

This avoids duplicating memory (the string and the array share the same buffer), but the array will be read-only when the source is an immutable bytes object. If you need a writable array, copy it (event.copy()) or read into a mutable buffer such as a bytearray.

In your case, it would be as simple as replacing fromstring with frombuffer:

import gzip
import numpy as np

f = gzip.GzipFile(filename)
f.read(10000)  # skip the fixed-length ASCII header
event_dtype = np.dtype([
    ('Id', '>u4'),                # simplified
    ('UnixTimeUTC', '>u4', 2),
    ('Data', '>i2', (1600, 1024)),
])
s = f.read(event_dtype.itemsize)
event = np.frombuffer(s, dtype=event_dtype, count=1)
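To read the whole file, the same pattern can simply go in a loop. A minimal sketch (the truncation check and the process() placeholder are my additions):

while True:
    s = f.read(event_dtype.itemsize)
    if len(s) < event_dtype.itemsize:  # EOF or a truncated trailing record
        break
    event = np.frombuffer(s, dtype=event_dtype, count=1)
    process(event[0])  # placeholder for the per-event work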

Just to prove that memory is not duplicated using this approach:

import numpy as np

x = "hello"
y = np.frombuffer(x, dtype=np.uint8)

# Make "y" writeable...
y.flags.writeable = True

# Prove that we're using the same memory
y[0] = 121
print x # <-- Notice that we're outputting changing y and printing x...

This yields bytearray(b'yello') instead of hello: writing through y changed x.
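Note that on Python 3, f.read() returns an immutable bytes object, so the frombuffer view is read-only and recent numpy versions will refuse to flip its writeable flag. A quick illustration of getting a writable array via a copy:

b = b'hello'
z = np.frombuffer(b, dtype=np.uint8)
# z[0] = 121  # would raise ValueError: assignment destination is read-only
w = z.copy()  # an independent, writable copy
w[0] = 121    # fine; b itself is unchanged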

Regardless of whether or not it's a significant optimization in this particular case, it's a useful approach to be aware of.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow