Pandas, large file with varying number of columns, in-memory append

Question

I would like to maintain a large PyTables table in an HDF5 file. Normally, as new data comes in, I append it to the existing table:

    store = pd.HDFStore(path_to_dataset, 'a')
    store.append("data", newdata)
    store.close()

However, if the columns of the stored data and those of the incoming newdata only partially overlap, the following error is raised:

    Exception: cannot match existing table structure for [col1,col2,col3] on appending data

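In these cases, I would like behavior similar to the normal DataFrame append function, which fills non-overlapping entries with NaN:

    import pandas as pd

    a = pd.DataFrame({"col1": range(10), "col2": range(10)})
    b = pd.DataFrame({"b1": range(10), "b2": range(10)})
    # Columns present in only one frame are filled with NaN in the other's rows.
    # Note: DataFrame.append was removed in pandas 2.0; pd.concat([a, b]) is the equivalent.
    a.append(b)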

Is it possible to do a similar operation "in memory", or do I need to create a completely new file?

Solution

HDFStore stores tables row-oriented, with a fixed set of columns per table, so this is currently not possible.

You would need to read it in, append, and write it out. Alternatively, multiple-table queries might help: http://pandas.pydata.org/pandas-docs/stable/io.html#multiple-table-queries
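
The read, append, and write-out route could look roughly like the sketch below. The helper names `union_append` and `rewrite_with_new_columns` are made up for illustration, the key is assumed to hold a table that fits in memory, and `pd.concat` is used in place of `DataFrame.append` (which newer pandas versions no longer provide).

```python
import pandas as pd


def union_append(existing: pd.DataFrame, newdata: pd.DataFrame) -> pd.DataFrame:
    """Stack two frames; columns missing on either side are filled with NaN."""
    return pd.concat([existing, newdata], sort=False)


def rewrite_with_new_columns(path: str, key: str, newdata: pd.DataFrame) -> None:
    """Read the stored table, append newdata in memory, and write it back.

    Hypothetical helper: assumes the whole table under `key` fits in memory.
    """
    with pd.HDFStore(path, mode="a") as store:
        if key in store:
            combined = union_append(store[key], newdata)
            store.remove(key)  # drop the old table before rewriting it
        else:
            combined = newdata
        store.append(key, combined)
```

Calling `rewrite_with_new_columns(path_to_dataset, "data", newdata)` would then replace the plain `store.append` whenever the columns differ. Since the whole table is rewritten, this is only reasonable if column changes are rare.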

However, you could also create the table up front with all the columns that could possibly occur (and just leave the unused ones as NaN).
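
A minimal sketch of that idea, assuming the full column set is known in advance and all columns are numeric. `ALL_COLUMNS`, `conform`, and `append_chunk` are hypothetical names. Every chunk is reindexed to the fixed schema before it is appended, so the table structure never changes. The cast to float64 is there because an integer column cannot hold NaN, and a dtype that differs between chunks would likely trigger the same structure-mismatch error.

```python
import pandas as pd

# Hypothetical: every column that may ever show up in incoming data.
ALL_COLUMNS = ["col1", "col2", "b1", "b2"]


def conform(df: pd.DataFrame, columns=ALL_COLUMNS) -> pd.DataFrame:
    """Return df with exactly `columns`, in that order.

    Missing columns are added as NaN; columns not listed are dropped.
    Everything is cast to float64 (assumes numeric data only).
    """
    return df.reindex(columns=columns).astype("float64")


def append_chunk(path: str, key: str, newdata: pd.DataFrame) -> None:
    """Append one chunk after conforming it to the fixed schema."""
    with pd.HDFStore(path, mode="a") as store:
        store.append(key, conform(newdata))
```

With this in place, both `a` and `b` from the question can be appended to the same table, each leaving the other's columns as NaN.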

Licensed under: CC-BY-SA with attribution