I am trying to store a variable length list of string to a HDF5 Dataset. The code for this is

import h5py
h5File=h5py.File('xxx.h5','w')
strList=['asas','asas','asas']  
h5File.create_dataset('xxx',(len(strList),1),'S10',strList)
h5File.flush() 
h5File.Close()  

I am getting an error stating that "TypeError: No conversion path for dtype: dtype('&lt U3')" where the &lt means actual less than symbol
How can I solve this problem.

有帮助吗?

解决方案

You're reading in Unicode strings, but specifying your datatype as ASCII. According to the h5py wiki, h5py does not currently support this conversion.

You'll need to encode the strings in a format h5py handles:

asciiList = [n.encode("ascii", "ignore") for n in strList]
h5File.create_dataset('xxx', (len(asciiList),1),'S10', asciiList)

Note: not everything encoded in UTF-8 can be encoded in ASCII!

其他提示

From https://docs.h5py.org/en/stable/special.html:

In HDF5, data in VL format is stored as arbitrary-length vectors of a base type. In particular, strings are stored C-style in null-terminated buffers. NumPy has no native mechanism to support this. Unfortunately, this is the de facto standard for representing strings in the HDF5 C API, and in many HDF5 applications.

Thankfully, NumPy has a generic pointer type in the form of the “object” (“O”) dtype. In h5py, variable-length strings are mapped to object arrays. A small amount of metadata attached to an “O” dtype tells h5py that its contents should be converted to VL strings when stored in the file.

Existing VL strings can be read and written to with no additional effort; Python strings and fixed-length NumPy strings can be auto-converted to VL data and stored.

Example

In [27]: dt = h5py.special_dtype(vlen=str)

In [28]: dset = h5File.create_dataset('vlen_str', (100,), dtype=dt)

In [29]: dset[0] = 'the change of water into water vapour'

In [30]: dset[0]
Out[30]: 'the change of water into water vapour'

I am in a similar situation wanting to store column names of dataframe as a dataset in hdf5 file. Assuming df.columns is what I want to store, I found the following works:

h5File = h5py.File('my_file.h5','w')
h5File['col_names'] = df.columns.values.astype('S')

This assumes the column names are 'simple' strings that can be encoded in ASCII.

许可以下: CC-BY-SA归因
不隶属于 StackOverflow
scroll top