Question

I have an array with multiple dimensions (x, y, channels, z, time-steps). However, the raw data is stored in a TIFF image as a single stack of (x, y, channels) frames, z * time-steps frames in total.

Finally, Pillow's Image.getdata() function returns a 1D array-like object that needs to be reshaped.

What is the best way to read this into HDF5 if the dataset is too large to fit in memory? Is it possible to reshape the array once it's been written into HDF5, or to write the 1D data in a way that automatically fills in the array (i.e. writes with x varying fastest, y second-fastest, and so on)? Update: something like numpy.ndarray.flat would be ideal.
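
For the "fills in automatically" part, here is a tiny numpy-only sketch (with made-up sizes) of how that ordering works: for a C-ordered (row-major) array, ndarray.flat / ravel() walk the elements with the last index varying fastest, and reshape rebuilds the array from a flat buffer in exactly that order:

import numpy

sx, sy, channels = 4, 3, 2                   # made-up example sizes
a = numpy.arange(sx * sy * channels)         # stand-in for the flat pixel data from getdata()
frame = a.reshape(sx, sy, channels)          # C order: channels varies fastest, then sy, then sx
assert numpy.array_equal(frame.ravel(), a)   # ravel()/.flat walk back out in the same order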

Here's what I've tried so far (img is a PIL.Image, dset is an h5py dataset):

1) Reading individual frames. This method is too slow: it takes ~20 min for 300 MB across 1000 frames, and most of the time is spent in the dset[...] = a call (a chunking sketch follows the snippet).

for i in range(0, img_layers):
  img.seek(i)
  a = numpy.array(img.getdata(), dtype=dtype) # a.shape = (sx * sy * channels,)
  a.resize(sx, sy, channels)
  z = i % sz
  frame = i // sz
  dset[..., z, frame] = a
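
One thing that may help approach 1 without changing the array layout (a sketch, not from the original post): if the dataset is created with chunks that match a single frame, each dset[..., z, frame] = a assignment touches exactly one HDF5 chunk, which can reduce the per-write overhead. The dimension sizes (time_steps, etc.) are the hypothetical names used above:

import h5py

f = h5py.File('stack.h5', 'w')
# One chunk per (z, frame) slice, so each per-frame write fills exactly one chunk
dset = f.create_dataset('data', (sx, sy, channels, sz, time_steps),
                        dtype=dtype, chunks=(sx, sy, channels, 1, 1))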

2) Incomplete: reading in chunks. This is much faster (2 min for the same dataset), but I've only got it working for a 4D image (sx, sy, channels, time-steps) and need an additional dimension for z-slices (a possible 5D extension is sketched after the snippet):

chunk_bits = 256 * 1000**2 # 256MB
frame_bits = depth_bits[dtype] * sx * sy * channels
chunk_frames = chunk_bits // frame_bits
a = numpy.zeros((sx, sy, channels, chunk_frames), dtype=dtype)
for i in range(0, layers):
  img.seek(i)
  temp = numpy.array(img.getdata(), dtype=dtype)
  temp.resize(sx, sy, channels)
  a[..., i % chunk_frames] = temp
  if (i + 1) % chunk_frames == 0 or i == (layers - 1):
    chunk = i // chunk_frames
    dset[..., chunk * chunk_frames : i + 1] = a[..., : i % chunk_frames + 1]
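
For completeness, here is a sketch of one way the chunked approach could be extended to 5D. It assumes the reordered (time-steps, z, sx, sy, channels) layout from the solution below, that frames in the TIFF are ordered with z varying fastest within each time-step (matching z = i % sz above), and that the total frame count is a multiple of sz; chunk_frames is rounded down to a whole number of z-stacks so every flush writes a contiguous block of time-steps:

chunk_bits = 256 * 1000**2 # 256MB
frame_bits = depth_bits[dtype] * sx * sy * channels
chunk_frames = max(sz, (chunk_bits // frame_bits) // sz * sz)  # whole z-stacks only
buf = numpy.zeros((chunk_frames // sz, sz, sx, sy, channels), dtype=dtype)
flat = buf.reshape(-1, sx, sy, channels)  # view onto buf, indexed by frame-within-chunk
for i in range(0, layers):
  img.seek(i)
  temp = numpy.array(img.getdata(), dtype=dtype)
  temp.resize(sx, sy, channels)
  flat[i % chunk_frames] = temp
  if (i + 1) % chunk_frames == 0 or i == (layers - 1):
    start = (i // chunk_frames) * (chunk_frames // sz)  # first time-step in this chunk
    stop = (i + 1) // sz                                # one past the last filled time-step
    dset[start:stop, ...] = buf[: stop - start]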

Solution

Option 1 was the correct answer. However, it makes a large difference which dimension varies fastest:

~15 minutes:

for i in range(0, img_layers):
  img.seek(i)
  a = numpy.array(img.getdata(), dtype=dtype)
  a.resize(sx, sy, channels)
  z = i % sz
  frame = i // sz
  dset[..., z, frame] = a # Majority of time in this call

~3 minutes:

for i in range(0, img_layers):
  img.seek(i)
  a = numpy.array(img.getdata(), dtype=dtype) # Majority of time in this call
  a.resize(sx, sy, channels)
  z = i % sz
  frame = i // sz
  dset[frame, z, ...] = a

To read (and write) this data quickly, the fastest-varying index should be LAST, not first.
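
For illustration, a minimal sketch of a dataset created with that ordering (same hypothetical dimension names as above); HDF5 stores datasets in row-major (C) order, so with the time-step and z indices first, each dset[frame, z, ...] assignment writes one contiguous slab:

import h5py

f = h5py.File('stack.h5', 'w')
# Slow axes (time-steps, z) first, fast within-frame axes (sx, sy, channels) last
dset = f.create_dataset('data', (time_steps, sz, sx, sy, channels), dtype=dtype)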

Licensed under: CC-BY-SA with attribution