Question

I have some event data from an HDF5 file:

>>> events
<class 'h5py._hl.dataset.Dataset'>

I get the array data like so:

>>> events = events[:]

And the structure is like so:

>>> type(events)
<type 'numpy.ndarray'>
>>> events.shape
(273856,)
>>> type(events[0])
<type 'numpy.void'>
>>> events[0]
(0, 30, 3523, 5352)
>>> # More information on structure 
>>> [type(e) for e in events[0]]    
[<type 'numpy.uint64'>, 
 <type 'numpy.uint32'>, 
 <type 'numpy.float64'>, 
 <type 'numpy.float64'>]   
>>> events.dtype 
[('start', '<u8'), 
 ('length', '<u4'), 
 ('mean', '<f8'), 
 ('variance', '<f8')]

I need to get the largest index of a particular event where the first field is less than some value. The brute force approach is:

>>> for i, e in enumerate(events):
...     if e[0] >= val:
...         break

The first field of each tuple is sorted, so I can use bisection to speed things up:

>>> import bisect
>>> field1 = [row[0] for row in events]
>>> index = bisect.bisect_right(field1, val)

This shows an improvement, but [row[0] for row in events] is slower than I expected. Any ideas on how to tackle this problem?


Solution

Yep, iterating over numpy arrays as you're currently doing is relatively slow. Normally, you'd use slicing instead (which creates a view, rather than copying the data into a list).

It looks like you have an object array. This will make things even slower. Do you really need an object array? It looks like all of the values are ints. (Is this a "vlen" hdf5 dataset?)

The use case where an object array would make sense is if you have a different number of items in each element of events. If you don't, then there's no reason to use one.

If you were using a 2D array of ints instead of an object array of tuples, you'd just do:

field1 = events[:,0]

However, in that case you could skip the intermediate variable, because searchsorted already uses bisection:

index = np.searchsorted(events[:,0], val)

Edit

Ah! Okay, you have a structured array. In other words, it's an array (1D, in this case) where each item is a C-like struct. From:

>>> events.dtype 
[('start', '<u8'), 
 ('length', '<u4'), 
 ('mean', '<f8'), 
 ('variance', '<f8')]

...we can see that the first field is named "start".

Therefore, you just want:

index = np.searchsorted(events["start"], val)
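One detail worth checking is which side of `val` you want. The sketch below uses a small made-up array with the dtype from the question (the values and `val` are assumptions for illustration, not the real data) and shows how the `side` argument of `np.searchsorted` relates to the brute-force loop and to `bisect.bisect_right`.

```python
import numpy as np

# Hypothetical stand-in for the HDF5 data, using the dtype from the question.
dtype = [("start", "<u8"), ("length", "<u4"), ("mean", "<f8"), ("variance", "<f8")]
events = np.zeros(6, dtype=dtype)
events["start"] = [0, 10, 20, 20, 30, 40]  # sorted, as the question states

val = 20  # illustrative threshold

# side="left" (the default): first index whose start is >= val,
# i.e. where the brute-force loop in the question would break.
first_ge = np.searchsorted(events["start"], val, side="left")

# side="right": first index whose start is > val, like bisect.bisect_right.
first_gt = np.searchsorted(events["start"], val, side="right")

# Largest index whose start is strictly less than val (-1 if there is none).
last_lt = first_ge - 1

print(first_ge, first_gt, last_lt)
```

If duplicates of `val` cannot occur in the sorted field, the two sides only differ when `val` itself is present.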

In more general terms, if we didn't know the name of the field, but knew that it was a structured array of some sort, you'd do (paring things down to just the slicing step):

events[events.dtype.names[0]]

As far as whether or not it's a good idea to convert everything to a "normal" 2D array of ints, that depends on your use case. For basic slicing and calling searchsorted, there's no reason to. There shouldn't (untested) be any significant speed increase.

Based on what you're doing at the moment, I'd just leave it as is.

However, structured arrays are often cumbersome to deal with.

There are plenty of cases where structured arrays are very useful (e.g. reading in certain binary formats from disk), but if you want to think of it as a "table-like" array, you'll quickly hit pain points. You're often better off storing the columns as separate arrays. (Or better yet, use a pandas.DataFrame for "tabular" data.)
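As a rough illustration of the "separate columns" idea, the sketch below splits a structured array into one plain array per field. The sample values and the `columns` and `ends` names are made up for the example.

```python
import numpy as np

# Hypothetical sample data with the dtype from the question.
dtype = [("start", "<u8"), ("length", "<u4"), ("mean", "<f8"), ("variance", "<f8")]
events = np.zeros(4, dtype=dtype)
events["start"] = [0, 30, 60, 90]
events["length"] = [30, 30, 30, 30]

# One plain 1D array per field. events[name] is a strided view into the
# structs; .copy() turns it into an independent, contiguous array.
columns = {name: events[name].copy() for name in events.dtype.names}

# Each column keeps its own dtype, so ordinary array arithmetic works.
ends = columns["start"] + columns["length"]
print(ends)
```

Unlike a single 2D array, this keeps the integer and float fields in their original types.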

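If you did want to convert it to a regular 2D array, do:

events = np.column_stack([events[name] for name in events.dtype.names])

This will automatically find a compatible datatype for the new array (float64 in this case, because the "mean" and "variance" fields are floats) and stack the fields of the structured array into columns of a 2D array. (np.hstack would not work here: on 1D inputs it concatenates everything into one long 1D array.)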
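A minimal sketch of a 2D conversion, assuming a NumPy version recent enough (1.16 or later) to ship `numpy.lib.recfunctions.structured_to_unstructured`. The two sample rows are invented; `np.column_stack` puts each field in its own column, and the recfunctions helper is shown as an alternative that should give the same result.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# Hypothetical sample rows with the dtype from the question.
dtype = [("start", "<u8"), ("length", "<u4"), ("mean", "<f8"), ("variance", "<f8")]
events = np.array(
    [(0, 30, 3523.0, 5352.0), (30, 12, 3400.5, 5100.25)],
    dtype=dtype,
)

# One column per field; the result dtype is the common type of all fields.
stacked = np.column_stack([events[name] for name in events.dtype.names])

# Alternative helper from numpy.lib.recfunctions.
unstructured = rfn.structured_to_unstructured(events)

print(stacked.shape, stacked.dtype)
print(np.array_equal(stacked, unstructured))
```

Note that mixing uint64 and float64 fields promotes everything to float64, so very large "start" values (above 2**53) would lose precision in the 2D form.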

Calling events = events.astype(int) will effectively just yield the first column. (This is because each item of events is a C-like struct, and astype operates element-wise, so each struct is converted to a single int.)

Other tips

You can use numpy.searchsorted:

>>> import bisect
>>> import numpy as np
>>> a = np.arange(10000).reshape(2500,4)
>>> np.searchsorted(a[:,0], 1000)
250

Timing comparisons:

>>> %timeit np.searchsorted(a[:,0], 1000)
100000 loops, best of 3: 11.7 µs per loop
>>> %timeit field1 = [row[0] for row in a];bisect.bisect_right(field1, 1000)
100 loops, best of 3: 2.63 ms per loop
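The timings above use a plain 2D array. To repeat the comparison on a structured array like the one in the question, something along these lines could be used; the array size, the synthetic "start" values, `val`, and the two helper function names are all assumptions for the sketch, and the absolute numbers will depend on your machine.

```python
import bisect
import timeit

import numpy as np

# Synthetic structured array with the dtype from the question.
dtype = [("start", "<u8"), ("length", "<u4"), ("mean", "<f8"), ("variance", "<f8")]
n = 100_000
events = np.zeros(n, dtype=dtype)
events["start"] = np.arange(n, dtype=np.uint64) * 10  # sorted start values

val = 500_005  # illustrative threshold


def with_list_and_bisect():
    # Approach from the question: build a Python list, then bisect it.
    field1 = [row[0] for row in events]
    return bisect.bisect_right(field1, val)


def with_searchsorted():
    # Field access is a view, so no per-row Python loop is needed.
    return int(np.searchsorted(events["start"], val, side="right"))


# Both approaches should agree on the index.
assert with_list_and_bisect() == with_searchsorted()

print("list + bisect:", timeit.timeit(with_list_and_bisect, number=3))
print("searchsorted: ", timeit.timeit(with_searchsorted, number=3))
```

`side="right"` is passed so that the result matches `bisect.bisect_right`; the default `side="left"` corresponds to `bisect.bisect_left`.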
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow