This augments the previous answer with some examples and explanations. For my versions of Pandas (1.2.3) and PyTables (3.6.1), I see the following behavior when writing to an HDF store:
import pandas as pd
df = pd.DataFrame([[1, "a"], [2, "b"], [3, "c"]])
# Create a store in fixed format: considerable storage overhead!
# File size store1.h5: 1.1MB
store = pd.HDFStore("store1.h5")
store.put(key="some/key", value=df, format="fixed")
store.close()
# Better: create a store with table format.
# File size store2.h5: 86kB!
store = pd.HDFStore("store2.h5")
store.put(key="some/key", value=df, format="table")
store.close()
Note: Instead of using the store explicitly, you can also use DataFrame.to_hdf() directly:
df = pd.DataFrame([[1, "a"], [2, "b"], [3, "c"]])
df.to_hdf("store1.h5", key="some/key", format="fixed")
df.to_hdf("store2.h5", key="some/key", format="table")
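To see the difference programmatically, one can compare the two file sizes on disk. A minimal sketch (the exact numbers depend on your Pandas/PyTables versions):

```python
import os
import pandas as pd

df = pd.DataFrame([[1, "a"], [2, "b"], [3, "c"]])
df.to_hdf("store1.h5", key="some/key", format="fixed")
df.to_hdf("store2.h5", key="some/key", format="table")

# Compare the file sizes; the fixed-format store is much larger
# for this tiny frame.
size_fixed = os.path.getsize("store1.h5")
size_table = os.path.getsize("store2.h5")
print(f"fixed: {size_fixed} bytes, table: {size_table} bytes")
```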
In this example, the storage overhead is drastically reduced by the second approach (store2.h5). In more realistic situations, this overhead becomes less significant with larger amounts of data. A fixed-format store allows for fast read/write operations, while the table format is more flexible (see the docs for details).
For instance, the table format can handle mixed data types (per column) better than the fixed format. See, for instance, what happens if you use df.T.to_hdf(...) in the above examples: the fixed format issues the PerformanceWarning below (see this post on SO, or this pandas issue), while the table format works just fine.
PerformanceWarning: your performance may suffer as PyTables will pickle
object types that it cannot map directly to c-types
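The warning can be observed directly. A minimal sketch (the file name store_mixed.h5 is just an example):

```python
import warnings
import pandas as pd

df = pd.DataFrame([[1, "a"], [2, "b"], [3, "c"]])
# df.T has object columns that mix ints and strings, which the fixed
# format can only store by pickling them.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    df.T.to_hdf("store_mixed.h5", key="some/key", format="fixed")
messages = [str(w.message) for w in caught]
print(messages)
```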
ptrepack is a command-line utility that comes with PyTables (the package is named tables). To see the current version of PyTables: python -m pip show tables.
Using ptrepack, I can further reduce the file sizes of my dummy examples by applying some compression. (Using the option --chunkshape=auto did not have a noticeable effect.)
# store1.repack.h5: 1.1MB -> 22kB
ptrepack --complevel=9 --complib=blosc "store1.h5" "store1.repack.h5"
# store2.repack.h5: 86kB -> 9kB
ptrepack --complevel=9 --complib=blosc "store2.h5" "store2.repack.h5"
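Alternatively, compression can be applied already at write time via the complevel and complib arguments of to_hdf(), without a separate ptrepack step. A sketch (the file name store2.compressed.h5 is just an example):

```python
import pandas as pd

df = pd.DataFrame([[1, "a"], [2, "b"], [3, "c"]])
# Write the table-format store with blosc compression enabled up front.
df.to_hdf("store2.compressed.h5", key="some/key", format="table",
          complevel=9, complib="blosc")
# The data round-trips unchanged.
roundtrip = pd.read_hdf("store2.compressed.h5", key="some/key")
print(roundtrip.equals(df))
```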
In summary, saving the data frame in table format and repacking the resulting store with compression lets you reduce the store's storage footprint. Whether it's worthwhile to minimize the storage overhead of an HDF store depends on your application.