Question

When exploring a large, new dataset I like to import the entire file as string data, do some printouts and frequency counts, and then fine-tune a more accurate data description for the final pre-processing step.

PyTables seems ideal for this, and it supports a string data type. However, when I add enough columns to the description that the maximum row size exceeds 16,384 bytes, I receive an error. I have verified that this is what causes the error by adding columns one at a time and creating the h5file; a sketch of such a probe follows.
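
A minimal sketch of such a probe, using a dict-based description so the column count can be varied in a loop (the helper name try_n_columns, the file name probe.h5, and the column counts are illustrative, not the original test code):

import tables

def try_n_columns(n, width=16):
    # n StringCol(width) fields -> n * width bytes per row
    desc = {'var%d' % i: tables.StringCol(width, pos=i) for i in range(n)}
    with tables.open_file('probe.h5', mode='w') as h5file:
        try:
            h5file.create_table('/', 'probe', desc)
            return True
        except tables.HDF5ExtError:
            return False

print(try_n_columns(1024))  # 1024 * 16 = 16,384 bytes per row
print(try_n_columns(1025))  # 16,400 bytes per row -- past the reported threshold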

Is there a maximum size that a PyTables row can have? I could not find anything about it in the documentation, nor a way to increase the limit if one exists.

Code:

from tables import *

# record descriptor
class Record(IsDescription):
    var1 = StringCol(16)
    var2 = StringCol(16)
    var3 = StringCol(16)
    var4 = StringCol(16)
    ...
    varN = StringCol(16)


h5file = open_file("test.h5", mode="w", title="Test file")

group = h5file.create_group("/", 'Test', 'Test group')

table = h5file.create_table(group, 'Test', Record, 'Test example')

Error:

HDF5ExtError: Problems creating the table

Version Info:

In [0]: tables.__version__
Out[0]: '3.1.0'

In [1]: sys.version
Out[1]: '2.7.6 |Anaconda 1.9.1 (64-bit)| (default, Nov 11 2013, 10:49:15) [MSC v.1500 64 bit (AMD64)]'

Solution

Here is one limit I know about: 512 columns per row (though the docs say it can be changed; I'm not sure whether you have to recompile). See here.
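
If the limit in question is the one exposed through PyTables' runtime parameters, it can be changed without recompiling. A hedged sketch follows; note that in PyTables 3.x, MAX_COLUMNS appears to govern only when a PerformanceWarning is issued, so raising it will not necessarily avoid the HDF5ExtError above:

import tables

print(tables.parameters.MAX_COLUMNS)  # defaults to 512
tables.parameters.MAX_COLUMNS = 2048  # set before creating the table; may only silence the warning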

I don't know whether the per-row byte limit is a hard HDF5 limit, though I would suspect it is; there are various hard limits, e.g. 64 KB of metadata per group. These limits allow a fixed-size layout of the HDF5 file, which is good for performance.

Maybe splitting the data into several sub-tables is your best bet; see the sketch below.
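
A minimal sketch of that workaround, splitting a wide all-string layout into fixed-width chunks that stay well under the observed 16,384-byte ceiling (the column counts, file name, and table names are illustrative):

import tables

N_COLS = 1500          # total columns in the raw data (illustrative)
COLS_PER_TABLE = 500   # 500 * 16 = 8,000 bytes per row, safely under the ceiling

with tables.open_file('test_split.h5', mode='w', title='Test file') as h5file:
    group = h5file.create_group('/', 'Test', 'Test group')
    for t, start in enumerate(range(0, N_COLS, COLS_PER_TABLE)):
        stop = min(start + COLS_PER_TABLE, N_COLS)
        # one sub-table per chunk of columns, original names preserved
        desc = {'var%d' % i: tables.StringCol(16, pos=i - start)
                for i in range(start, stop)}
        h5file.create_table(group, 'part%d' % t, desc,
                            'Columns %d to %d' % (start, stop - 1))

Appending each record's fields to every part table in step keeps the row numbers aligned, so record i can be reassembled by reading row i from each part.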

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow