Question

The following code snippet:

    import threading
    import pandas

    HDFStore = pandas.io.pytables.HDFStore
    lock = threading.RLock()

    def get_column():
        with lock:
            store = HDFStore('my_datafile.hdf', 'r')
            data_frame = store['my_series']
            store.close()
        return data_frame['my_column']

is executed in response to web requests, so it may run on multiple threads at the same time. It is also possible that execution is interrupted before store.close() is called.

I'm experiencing intermittent problems (exceptions inside the HDFStore library, or empty data returned) that I cannot reproduce reliably.

What is the correct way to make this code thread-safe and to guarantee that the file is closed even when an exception occurs?

While investigating, I found that HDFStore has a caching mechanism for open files. Could that be the problem?

Solution

For reference, see the pandas docs and the just-released PyTables 3.1 release notes.

This should work on PyTables 3.0.0, as long as you are not writing the file anywhere else (in other words, the file already exists); a sketch of that setup follows.
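
For instance, a minimal sketch of that pattern, assuming the file and key names from the question: create the file once, single-threaded, before any request threads start reading it.

    import pandas as pd

    # One-time, single-threaded setup: write the data before any web
    # requests start reading it; afterwards the file is only ever
    # opened in read-only mode.
    df = pd.DataFrame({'my_column': range(10)})
    with pd.HDFStore('my_datafile.hdf', mode='w') as store:
        store['my_series'] = df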

You can try doing this as well:

    with get_store('my_datafile.hdf', mode='r') as store:
        return store['my_series']

which will automatically close the store for you. I don't think this is thread-safe per se, but if you put it inside your with lock: block it should be (see the sketch below).
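
For example, a minimal sketch combining the two, assuming the same lock and names as the question (get_store was the top-level pandas helper at the time):

    import threading
    from pandas import get_store

    lock = threading.RLock()

    def get_column():
        # Serialize access across threads; the context manager closes
        # the store even if an exception is raised mid-read.
        with lock:
            with get_store('my_datafile.hdf', mode='r') as store:
                return store['my_series']['my_column']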

If you are only reading, then you don't need to care about thread safety at all. But DO NOT, UNDER ANY CIRCUMSTANCES, try to write from multiple threads (or even multiple processes). This will blow your file up.

PyTables 3.1 was just released and changes the file caching mechanism, at least with lower HDF5 versions. To see which HDF5 version you have:

    import tables
    # version string of the underlying HDF5 library, e.g. '1.8.x'
    print(tables.hdf5_version)

I don't know what effect this will have on thread safety.
