Question

I have a large dataset of compound data in an HDF5 file. The dtype of the compound data looks as follows:

    numpy.dtype([('Image', h5py.special_dtype(ref=h5py.Reference)), 
                 ('NextLevel', h5py.special_dtype(ref=h5py.Reference))])

With that I create a dataset holding, at each position, a reference to an image and a reference to another dataset. These datasets have dimensions n x n, where n is typically at least 256 and often >2000. I initially have to fill every position of these datasets with the same value:

    [[(image.ref, dataset.ref)...(image.ref, dataset.ref)],
      .
      .
      .
     [(image.ref, dataset.ref)...(image.ref, dataset.ref)]]

I try to avoid filling it with two for-loops like:

    for i in xrange(0, n):
        for j in xrange(0, n):
            dset[i, j] = (image.ref, dataset.ref)

because the performance is very bad. So I'm searching for something like numpy.fill, numpy.shape, numpy.reshape, numpy.array, numpy.arange, [:] and so on. I tried those functions in various ways, but they all seem to work only with numeric and string datatypes. Is there any way to fill these datasets faster than with the for-loops?

Thank you in advance.

Solution

You can use either numpy broadcasting or a combination of numpy.repeat and numpy.reshape:

    import numpy
    import h5py

    my_dtype = numpy.dtype([('Image', h5py.special_dtype(ref=h5py.Reference)),
                            ('NextLevel', h5py.special_dtype(ref=h5py.Reference))])
    ref_array = numpy.array((image.ref, dataset.ref), dtype=my_dtype)
    # use a new name so the h5py dataset whose .ref we just used is not shadowed
    filled = numpy.repeat(ref_array, n * n)
    filled = filled.reshape((n, n))

Note that numpy.repeat returns a flattened array, hence the use of numpy.reshape. It seems repeat is faster than just broadcasting it:

    %timeit empty_dataset = np.empty(2*2, dtype=my_dtype); empty_dataset[:] = ref_array
    100000 loops, best of 3: 9.09 us per loop

    %timeit repeat_dataset = np.repeat(ref_array, 2*2).reshape((2,2))
    100000 loops, best of 3: 5.92 us per loop
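For completeness, the two fill strategies can be demonstrated without h5py at all. In this sketch, two int64 fields stand in for the two reference fields, and the value `(1, 2)` and the names `broadcast`/`repeated` are illustrative only:

```python
import numpy as np

# Illustrative stand-in: two int64 fields play the role of the two
# h5py reference fields; h5py is not needed to show the fill pattern.
my_dtype = np.dtype([('Image', np.int64), ('NextLevel', np.int64)])
n = 4
ref_array = np.array((1, 2), dtype=my_dtype)  # one structured scalar

# Broadcasting fill: assign the scalar to every cell of an empty array.
broadcast = np.empty((n, n), dtype=my_dtype)
broadcast[:] = ref_array

# repeat + reshape fill: repeat the scalar n*n times, then reshape,
# since numpy.repeat returns a flattened array.
repeated = np.repeat(ref_array, n * n).reshape((n, n))

# Both approaches produce the same n x n array of identical tuples.
assert (broadcast == repeated).all()
print(broadcast['Image'].sum())  # n*n copies of the value 1 -> 16
```

Either variant fills the whole array in a single vectorized operation instead of n*n individual assignments.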

Licensed under: CC-BY-SA with attribution