I am not sure if I can give a proper solution, but we have a similar setup.
We have meta-information stored in an RDBMS (PostgreSQL) and the actual scientific data in HDF5 files.
We have a couple of analyses that are run on our HPC. The way it is done is as follows:
- User wants to run an analysis (from a web-frontend)
- A message is sent to a central message broker (AMQP, RabbitMQ) containing the type of analysis and some additional information
- A worker machine (VM) picks up the message from the central message broker. The worker uses REST to retrieve meta-information from the RDBMS, stages the files on the HPC, and then creates a PBS job on the cluster.
- Once the PBS job is submitted, a message with the job ID is sent back to the message broker to be stored in the RDBMS.
- The HPC job will run the scientific analysis and then store the result in an HDF5 file.
- Once the job is finished, the worker machine will stage-out the HDF5 files onto an NFS share and store the link in the RDBMS.
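The worker's submission step can be sketched roughly like this. It is a minimal sketch, not our exact code: the script template, resource line, and the `run_analysis` command are assumptions for illustration; `qsub` reads a job script on stdin and prints an ID such as `12345.headnode`:

```python
import re
import subprocess

def make_pbs_script(analysis_type, staged_input, result_path):
    """Build a minimal PBS job script (hypothetical template)."""
    return "\n".join([
        "#!/bin/bash",
        f"#PBS -N {analysis_type}",
        "#PBS -l nodes=1:ppn=4,walltime=02:00:00",
        # run_analysis is a placeholder for the actual scientific tool
        f"run_analysis --type {analysis_type} --in {staged_input} --out {result_path}",
    ])

def parse_job_id(qsub_output):
    """qsub prints something like '12345.headnode'; keep the numeric part."""
    match = re.match(r"(\d+)", qsub_output.strip())
    return match.group(1) if match else None

def submit(script_text):
    """Pipe the script to qsub and return the job ID (needs a real PBS cluster)."""
    out = subprocess.run(["qsub"], input=script_text, text=True,
                         capture_output=True, check=True).stdout
    return parse_job_id(out)

script = make_pbs_script("spectral_fit", "/scratch/in.h5", "/scratch/out.h5")
print(parse_job_id("12345.headnode.example"))  # → 12345
```

The job ID returned here is what the worker sends back to the broker so it can be stored alongside the analysis record.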
I would recommend against storing binary files in an RDBMS as BLOBs.
I would keep them in HDF5 format. You can have different backup policies for the database and the filesystem.
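The pattern is that the database holds only metadata plus a link to the file. A minimal sketch of that schema, using an in-memory SQLite database purely to keep the example self-contained (we use PostgreSQL; the table and column names are illustrative):

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL here; schema is illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE analysis_result (
        job_id     TEXT PRIMARY KEY,
        analysis   TEXT NOT NULL,
        hdf5_path  TEXT NOT NULL   -- link to the HDF5 file on the NFS share
    )
""")

# The worker records only the link; the binary data stays in the HDF5 file.
conn.execute(
    "INSERT INTO analysis_result VALUES (?, ?, ?)",
    ("12345", "spectral_fit", "/nfs/results/12345.h5"),
)

path, = conn.execute(
    "SELECT hdf5_path FROM analysis_result WHERE job_id = ?", ("12345",)
).fetchone()
print(path)  # → /nfs/results/12345.h5
```

Because the database row is tiny, you can back it up frequently, while the large HDF5 files on the filesystem can follow a cheaper, less frequent backup policy.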
A couple of additional pointers:
- You could hide everything (both the RDBMS and the HDF5 storage) behind a REST interface. This might solve your containment issue.
- If you want to store everything in a NoSQL DB, I would recommend having a look at Elasticsearch. It works well with time-series data, it is distributed out of the box, and it also has a Hadoop plugin.
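The REST-facade idea can be sketched with the standard library alone: clients ask for a result by ID and get back the metadata plus the file link, without ever touching the database or the filesystem directly. The endpoint shape, the `RESULTS` dict (standing in for the RDBMS), and the paths are all hypothetical:

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Stand-in for the RDBMS lookup; keys and paths are made up for the sketch.
RESULTS = {"42": {"analysis": "spectral_fit", "hdf5_path": "/nfs/results/42.h5"}}

class ResultHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Expect paths like /results/<id>; take the last path segment as the ID.
        rid = self.path.rstrip("/").split("/")[-1]
        meta = RESULTS.get(rid)
        if meta is None:
            self.send_response(404)
            self.end_headers()
            return
        body = json.dumps(meta).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet

# Bind to port 0 so the OS picks a free port, then query our own endpoint.
server = ThreadingHTTPServer(("127.0.0.1", 0), ResultHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]
with urllib.request.urlopen(f"http://127.0.0.1:{port}/results/42") as resp:
    result = json.loads(resp.read())
server.shutdown()
print(result["hdf5_path"])  # → /nfs/results/42.h5
```

In production you would put a real framework and authentication in front of this, but the point stands: consumers see one stable HTTP API, and you are free to move the metadata store or the file storage behind it.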