I have a struct array created by MATLAB and stored in a v7.3 format MAT-file:

struArray = struct('name', {'one', 'two', 'three'}, ...
                   'id', {1, 2, 3}, ...
                   'data', {[1:10], [3:9], [0]})
save('test.mat', 'struArray', '-v7.3')

Now I want to read this file from Python using h5py:

import h5py

data = h5py.File('test.mat', 'r')
struArray = data['/struArray']

I have no idea how to get the struct data one by one from struArray:

for index in range(<the size of struArray>):
    elem = <the index th struct in struArray>
    name = <the name of elem>
    id = <the id of elem>
    data = <the data of elem>
Any help?

Solution

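The MATLAB v7.3 file format is not especially easy to work with from h5py. It relies on HDF5 object references (see the h5py documentation on references): each field of the struct array is a dataset of references, and each reference has to be dereferenced through the file object to reach the actual data.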

>>> import h5py
>>> f = h5py.File('test.mat', 'r')
>>> list(f.keys())
['#refs#', 'struArray']
>>> struArray = f['struArray']
>>> struArray['name'][0, 0]  # this is the HDF5 reference
<HDF5 object reference>
>>> f[struArray['name'][0, 0]][()]  # this is the actual data
array([[111],
       [110],
       [101]], dtype=uint16)

To read struArray(i).id:

>>> f[struArray['id'][0, 0]][0, 0]
1.0
>>> f[struArray['id'][1, 0]][0, 0]
2.0
>>> f[struArray['id'][2, 0]][0, 0]
3.0

Notice that Matlab stores a number as an array of size (1, 1), hence the final [0, 0] to get the number.

To read struArray(i).data:

>>> f[struArray['data'][0, 0]][()]
array([[  1.],
       [  2.],
       [  3.],
       [  4.],
       [  5.],
       [  6.],
       [  7.],
       [  8.],
       [  9.],
       [ 10.]])

To read struArray(i).name, it is necessary to convert the array of integers to string:

>>> f[struArray['name'][0, 0]][()].tobytes()[::2].decode()
'one'
>>> f[struArray['name'][1, 0]][()].tobytes()[::2].decode()
'two'
>>> f[struArray['name'][2, 0]][()].tobytes()[::2].decode()
'three'
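
Putting these pieces together, the loop from the question might look like the sketch below. It assumes the layout shown above: each field is an (N, 1) dataset of references, numbers are (1, 1) arrays, and strings are stored as uint16 code units. The helper names matlab_str, read_struct_array and load_struct_array are made up for illustration, and empty arrays or 1x1 structs, which may be stored differently, are not handled.

```python
import numpy as np


def matlab_str(codes):
    """Decode a MATLAB char array, stored as uint16 code units, into a str."""
    return np.asarray(codes).astype("<u2").tobytes().decode("utf-16-le")


def read_struct_array(f, name="struArray"):
    """Return one dict per struct element.

    f is an open h5py.File (anything that maps names and references to
    datasets works); the field names are the ones from the question.
    """
    s = f[name]
    count = s["name"].shape[0]  # references are stored as an (N, 1) dataset
    elems = []
    for i in range(count):
        elems.append({
            "name": matlab_str(f[s["name"][i, 0]][()]),
            "id": float(f[s["id"][i, 0]][0, 0]),     # (1, 1) array -> scalar
            "data": f[s["data"][i, 0]][()].ravel(),  # (n, 1) array -> 1-D
        })
    return elems


def load_struct_array(path, name="struArray"):
    """Open a v7.3 MAT-file read-only and read the struct array from it."""
    import h5py  # only needed for opening the file
    with h5py.File(path, "r") as f:
        return read_struct_array(f, name)
```

With the file from the question, iterating over load_struct_array('test.mat') should then yield one dictionary per struct, with keys 'name', 'id' and 'data'.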

Other tips

visit or visititems is a quick way of seeing the overall structure of an HDF5 file (fs below is the open h5py.File):

fs['struArray'].visititems(lambda n,o:print(n, o))

When I run this on a file produced by Octave save -hdf5 I get:

type <HDF5 dataset "type": shape (), type "|S7">
value <HDF5 group "/struArray/value" (3 members)>
value/data <HDF5 group "/struArray/value/data" (2 members)>
value/data/type <HDF5 dataset "type": shape (), type "|S5">
value/data/value <HDF5 group "/struArray/value/data/value" (4 members)>
value/data/value/_0 <HDF5 group "/struArray/value/data/value/_0" (2 members)>
value/data/value/_0/type <HDF5 dataset "type": shape (), type "|S7">
value/data/value/_0/value <HDF5 dataset "value": shape (10, 1), type "<f8">
value/data/value/_1 <HDF5 group "/struArray/value/data/value/_1" (2 members)>
...
value/data/value/dims <HDF5 dataset "dims": shape (2,), type "<i4">
value/id <HDF5 group "/struArray/value/id" (2 members)>
value/id/type <HDF5 dataset "type": shape (), type "|S5">
value/id/value <HDF5 group "/struArray/value/id/value" (4 members)>
value/id/value/_0 <HDF5 group "/struArray/value/id/value/_0" (2 members)>
...
value/id/value/_2/value <HDF5 dataset "value": shape (), type "<f8">
value/id/value/dims <HDF5 dataset "dims": shape (2,), type "<i4">
value/name <HDF5 group "/struArray/value/name" (2 members)>
...
value/name/value/dims <HDF5 dataset "dims": shape (2,), type "<i4">

This may not be the same as what MATLAB 7.3 produces, but it gives an idea of a structure's complexity.

A more refined callback can display values, and could be the starting point for recreating a Python object (dictionary, lists, etc).

import h5py

def callback(name, obj):
    # Print a short description of each node that visititems passes in.
    if name.endswith('type'):
        print('type:', obj[()])
    elif name.endswith('value'):
        # Groups can also be named 'value'; only datasets hold data to print.
        if isinstance(obj, h5py.Dataset):
            print(obj[()].T)  # http://stackoverflow.com/questions/21624653
    elif name.endswith('dims'):
        print('dims:', obj[()])
    else:
        print('name:', name)

fs.visititems(callback)

produces:

name: struArray
type: b'struct'
name: struArray/value/data
type: b'cell'
name: struArray/value/data/value/_0
type: b'matrix'
[[  1.   2.   3.   4.   5.   6.   7.   8.   9.  10.]]
name: struArray/value/data/value/_1
type: b'matrix'
[[ 3.  4.  5.  6.  7.  8.  9.]]
name: struArray/value/data/value/_2
type: b'scalar'
0.0
dims: [3 1]
name: struArray/value/id
type: b'cell'
name: struArray/value/id/value/_0
type: b'scalar'
1.0
...
dims: [3 1]
name: struArray/value/name
type: b'cell'
name: struArray/value/name/value/_0
type: b'sq_string'
[[111 110 101]]
...
dims: [3 1]
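
Building on that callback idea, one possible way to recreate a Python object is to copy the tree into nested dicts and then collapse the type/value wrappers seen in the Octave listing above. This is only a sketch for that Octave layout, not for MATLAB v7.3 files, and to_dict and simplify are hypothetical helper names.

```python
from collections.abc import Mapping


def to_dict(node):
    """Recursively read an HDF5 group into nested dicts of in-memory values."""
    if isinstance(node, Mapping):  # h5py groups behave like mappings
        return {key: to_dict(child) for key, child in node.items()}
    return node[()]  # dataset: read its whole value


def simplify(node):
    """Collapse Octave's {'type': ..., 'value': ...} wrappers."""
    if not (isinstance(node, dict) and set(node) == {"type", "value"}):
        return node
    kind, value = node["type"], node["value"]
    if kind == b"struct":
        # One entry per field of the struct.
        return {field: simplify(child) for field, child in value.items()}
    if kind == b"cell":
        # Cell entries are named _0, _1, ...; 'dims' holds the cell's shape.
        keys = sorted((k for k in value if k != "dims"),
                      key=lambda k: int(k[1:]))
        return [simplify(value[k]) for k in keys]
    return value  # matrix, scalar, sq_string, ...: keep the raw value
```

Something like simplify(to_dict(fs['struArray'])) would then give a dictionary with one list per field; the sq_string entries still hold character codes and would need decoding.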

I'm sorry but I think it will be quite challenging to get contents of cells/structures from outside Matlab. If you view the produced files (eg with HDFView) you will see there are lots of cross-references and no obvious way to proceed.

If you stick to simple numerical arrays it works fine. If you have small cell arrays containing numerical arrays, you can convert them to separate variables (i.e. cellcontents1, cellcontents2, etc.), which usually takes just a few lines and allows them to be saved and loaded directly. So in your example I would save a file with variables name1, name2, name3, id1, id2, id3, and so on.

EDIT: You specified h5py in the question, so that's what I answered, but it is worth mentioning that scipy.io.loadmat converts the original variables to NumPy equivalents (e.g. object arrays). Note that it only reads MAT-files up to v7.2, so the file would have to be saved without the -v7.3 flag.

I would start by firing up the interpreter and running help on struarray. It should give you enough information to get you started. Failing that, you can dump the attributes of any Python object by printing the __dict__ attribute.

I know of two solutions (one of which I wrote, and which works better if the *.mat file is very large or very deep) that abstract away direct interaction with the h5py library.

  • the hdf5storage package, which is well maintained and meant to help load v7.3 saved matfiles into Python
  • my own matfile loader, which I wrote to overcome certain problems that even the latest version (0.2.0) of hdf5storage has when loading large (~500 MB) and/or deep arrays (I'm actually not sure which of the two causes the issue)

Assuming you've downloaded both packages into a place where you can load them into Python, you can see that they produce similar outputs for your example 'test.mat':

In [1]: pyInMine = LoadMatFile('test.mat')
In [2]: pyInHdf5 = hdf5.loadmat('test.mat')  
In [3]: pyInMine.keys()
Out[3]: dict_keys(['struArray'])
In [4]: pyInMine['struArray'].keys()                                                                                                                             
Out[4]: dict_keys(['data', 'id', 'name'])
In [5]: pyInHdf5.keys()                                                                                                                                      
Out[5]: dict_keys(['struArray'])
In [6]: pyInHdf5['struArray'].dtype                                                                                                                          
Out[6]: dtype([('name', 'O'), ('id', '<f8', (1, 1)), ('data', 'O')])
In [7]: pyInHdf5['struArray']['data']                                                                                                                        
Out[7]:
array([[array([[ 1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10.]]),
        array([[3., 4., 5., 6., 7., 8., 9.]]), array([[0.]])]],
      dtype=object)
In [8]: pyInMine['struArray']['data']                                                                                                                            
Out[8]: 
array([[array([[ 1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10.]]),
        array([[3., 4., 5., 6., 7., 8., 9.]]), array([[0.]])]],
      dtype=object)

The big difference is that my library converts structure arrays in Matlab into Python dictionaries whose keys are the structure's fields, whereas hdf5storage converts them into numpy object arrays with various dtypes storing the fields.

I also note that the indexing behavior of the array is different from how you would expect it from the Matlab approach. Specifically, in Matlab, in order to get the name field of the second structure, you would index the structure:

[Matlab] >> struArray(2).name
[Matlab] >> 'two'

In my package, you have to first grab the field and then index:

In [9]: pyInMine['struArray'].shape                                                                                                                              
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-64-a2f85945642b> in <module>
----> 1 pyInMine['struArray'].shape

AttributeError: 'dict' object has no attribute 'shape'
In [10]: pyInMine['struArray']['name'].shape
Out[10]: (1, 3)
In [11]: pyInMine['struArray']['name'][0,1]
Out[11]: 'two'

The hdf5storage package is a little bit nicer and lets you either index the structure and then grab the field, or vice versa, because of how structured numpy object arrays work:

In [12]: pyInHdf5['struArray'].shape
Out[12]: (1, 3)
In [13]: pyInHdf5['struArray'][0,1]['name']
Out[13]: array([['two']], dtype='<U3')
In [14]: pyInHdf5['struArray']['name'].shape
Out[14]: (1, 3)
In [15]: pyInHdf5['struArray']['name'][0,1]
Out[15]: array([['two']], dtype='<U3')
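
If a list of per-element dictionaries is more convenient than either layout, a structured array like the one hdf5storage returns can be converted with plain NumPy. The sketch below assumes the dtype shown in Out[6]; records_to_dicts and unwrap are illustrative names, not part of hdf5storage.

```python
import numpy as np


def unwrap(value):
    """Turn a size-1 array such as array([['two']]) into a plain scalar."""
    if isinstance(value, np.ndarray) and value.size == 1:
        return value.item()
    return value


def records_to_dicts(arr):
    """Convert a structured array into a list with one dict per record."""
    return [{field: unwrap(rec[field]) for field in arr.dtype.names}
            for rec in arr.ravel()]
```

Calling records_to_dicts(pyInHdf5['struArray']) would then give one dictionary per struct, so the second element's name is reached with [1]['name']. Note that unwrap also turns a 1x1 data array into a scalar, which matches MATLAB's view that a scalar is a 1x1 matrix.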

Again, the two packages treat the final output a little differently, but in general both are quite good at reading v7.3 matfiles. One final note: for files of ~500 MB or more, I've found that the hdf5storage package hangs while loading, while my package does not (though it still takes ~1.5 minutes to complete the load).

It's really a problem with MATLAB 7.3 and h5py. My trick is to convert the h5py._hl.dataset.Dataset object to a NumPy array. For example,

np.array(data['data'])

will solve your problem with the 'data' field.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow