Appending a large amount of data to a tables (HDF5) database where database.numcols != newdata.numcols?

StackOverflow https://stackoverflow.com/questions/7327739


Question

I am trying to append a large dataset (>30Gb) to an existing pytables table. The table has N columns, and the dataset has N-1 columns; one column is calculated after I know the other N-1 columns.

I'm using numpy.fromfile() to read chunks of the dataset into memory before appending them to the database. Ideally, I'd like to put the data into the database, then calculate the final column, and finish up by using Table.modifyColumn() to complete the operation.

I've considered appending numpy.zeros((len(new_data), N)) to the table, then using Table.modifyColumns() to fill in the new data, but I'm hopeful someone knows a nice way to avoid generating a huge array of empty data for each chunk that I need to append.
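
In code, that fallback would look roughly like this. This is just a sketch, assuming an open Table handle table with N uniform float64 columns and an (L, N-1) chunk new_data; modify_column is the modern snake_case spelling of modifyColumn:

import numpy as np
import tables

# Assumed setup: `table` is an open pytables Table with N uniform
# float64 columns; `new_data` is an (L, N-1) chunk from numpy.fromfile().
L = len(new_data)
start = table.nrows
table.append(np.zeros((L, N)))              # placeholder rows
for i, name in enumerate(table.colnames[:-1]):
    # Overwrite each placeholder column with the real data
    table.modify_column(start=start, stop=start + L,
                        column=new_data[:, i], colname=name)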

Solution

If the columns are all the same type, you can use numpy.lib.stride_tricks.as_strided to make the (L, N-1) array you read from the file look like an array of shape (L, N). For example,

In [5]: a = numpy.arange(12).reshape(4,3)

In [6]: a
Out[6]: 
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])

In [7]: a.strides
Out[7]: (24, 8)

In [8]: b = numpy.lib.stride_tricks.as_strided(a, shape=(4, 4), strides=(24, 8))

In [9]: b
Out[9]: 
array([[  0,   1,   2,   3],
       [  3,   4,   5,   6],
       [  6,   7,   8,   9],
       [  9,  10,  11, 112]])

Now you can use this array b to fill up the table. The last column of each row aliases the first column of the next row (and the very last element, 112 above, is whatever happened to lie past the end of a's buffer), but you'll overwrite those values once you can compute them.
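
Putting the trick into the append workflow, a sketch under the same assumptions: uniform float64 columns, fh an open binary file, compute_final a hypothetical function for the derived column, and 'final' a placeholder column name, none of which come from the question itself:

import numpy as np
from numpy.lib.stride_tricks import as_strided

itemsize = 8                                 # bytes per float64
chunk = np.fromfile(fh, dtype=np.float64,
                    count=L * (N - 1)).reshape(L, N - 1)
# View the (L, N-1) chunk as (L, N): the extra column aliases the
# next row's first value, and the final element reads past the end
# of chunk's buffer, so treat the whole column as scratch to overwrite.
padded = as_strided(chunk, shape=(L, N),
                    strides=((N - 1) * itemsize, itemsize))
start = table.nrows
table.append(padded)
table.modify_column(start=start, stop=start + L,
                    column=compute_final(chunk), colname='final')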

This won't work if a is a record array (i.e. has a compound dtype). For that, you can try numpy.lib.recfunctions.append_fields. Since it copies the data to a new array, it won't save you any significant amount of memory, but it will allow you to do all the writing at once.
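
A minimal sketch of the record-array route; the field names and the compute_final function are placeholders, not from the question:

import numpy as np
from numpy.lib import recfunctions as rfn

# Assumed: `raw` holds the N-1 fields read from the file
raw = np.zeros(4, dtype=[('a', 'f8'), ('b', 'f8')])
# append_fields copies everything into a new array with the extra field
full = rfn.append_fields(raw, 'c', compute_final(raw),   # hypothetical f
                         dtypes='f8', usemask=False)
table.append(full)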

OTHER TIPS

You could add the results to another table. Unless there's some compelling reason for the calculated column to be adjacent to the other columns, that's probably the easiest approach. There's also something to be said for separating raw data from calculations anyway.
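
A sketch of that two-table layout; all the names here are illustrative, not from the question:

import numpy as np
import tables

h5 = tables.open_file('data.h5', mode='a')
# Raw columns in one table, the derived column in another; rows in
# the two tables line up by index.
raw_t = h5.create_table('/', 'raw', np.dtype([('x0', 'f8'), ('x1', 'f8')]))
calc_t = h5.create_table('/', 'calc', np.dtype([('final', 'f8')]))

chunk = np.zeros(4, dtype=raw_t.dtype)       # stand-in for a file chunk
raw_t.append(chunk)
derived = np.empty(len(chunk), dtype=calc_t.dtype)
derived['final'] = compute_final(chunk)      # hypothetical derived-column function
calc_t.append(derived)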

If you must increase the size of the table, look into using h5py. It provides a more direct interface to the h5 file. Keep in mind that, depending on how the dataset was created in the h5 file, it may not be possible to simply append a column to it. See section 1.2.4, "Dataspace", in http://www.hdfgroup.org/HDF5/doc/UG/03_DataModel.html for a discussion of the general data format. h5py supports resize if the underlying dataset supports it.
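
For example, a sketch of a row-resizable h5py dataset; the dataset must be created chunked with an unlimited maxshape along the axis you want to grow, and the names and sizes below are illustrative:

import h5py
import numpy as np

N, L = 5, 1000                               # columns, rows per block
with h5py.File('data.h5', 'a') as f:
    # maxshape=(None, N) leaves the row axis unlimited; the column
    # count is fixed when the dataset is created.
    dset = f.create_dataset('data', shape=(0, N), maxshape=(None, N),
                            dtype='f8', chunks=True)
    block = np.zeros((L, N))                 # stand-in for a file chunk
    dset.resize(dset.shape[0] + L, axis=0)
    dset[-L:] = block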

You could also reuse a single buffer for the input data, like so:

import numpy as np

# data_file is an open binary file handle; f computes the derived column
z = np.zeros((nrows, N))
while more_data_in_file:
    # Read a block of nrows rows (N-1 values each) into the first N-1 columns
    z[:, :N-1] = np.fromfile(data_file,
                             count=nrows * (N - 1)).reshape(nrows, N - 1)
    # Compute the final column from the others
    z[:, N-1] = f(z[:, :N-1])
    # Append the filled buffer to the table
    tables_handle.append(z)

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow