Question

I have a strong use case for mixing scientific data, i.e. double matrices and vectors, with relational data and using this as the data source for distributed computation, e.g. MapReduce, Hadoop etc. Up to now I have been storing my scientific data in HDF5 files with custom HDF5 schemas and the relational data in Postgres, but since this setup does not scale very well, I was wondering whether there is a more hybrid NoSQL approach that supports the heterogeneity of this data?

For example, my use case would be to distribute a complex process that involves:

  1. loading GBs of data from a time-series database provider
  2. linking the time series to static data, e.g. symbol information, expiry and maturity dates, etc.
  3. launching a series of scientific computations, e.g. covariance matrices, distribution fitting, MC simulations (a small sketch of steps 2-3 follows this list)
  4. distributing the computations across many separate HPC nodes and storing the intermediate results for traceability.
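
As a rough illustration of steps 2-3, here is a minimal sketch assuming the time series arrive as a pandas DataFrame and the static data comes from the relational side; all symbols, dates and column names are made up:

```python
import numpy as np
import pandas as pd

# Hypothetical time series (step 1): daily prices per symbol.
prices = pd.DataFrame({
    "symbol": ["ABC"] * 3 + ["XYZ"] * 3,
    "date": pd.to_datetime(["2015-01-02", "2015-01-05", "2015-01-06"] * 2),
    "price": [100.0, 101.5, 100.8, 50.0, 49.2, 49.9],
})

# Hypothetical static/relational data: contract metadata per symbol.
static = pd.DataFrame({
    "symbol": ["ABC", "XYZ"],
    "expiry": pd.to_datetime(["2015-12-18", "2016-03-18"]),
})

# Step 2: link the time series to the static data.
linked = prices.merge(static, on="symbol", how="left")

# Step 3: a simple scientific computation, e.g. the covariance matrix
# of the per-symbol return series.
returns = (linked.pivot(index="date", columns="symbol", values="price")
                 .pct_change()
                 .dropna())
cov = np.cov(returns.to_numpy(), rowvar=False)
print(cov)
```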

These steps require a distributed database that can handle both relational and scientific data. One possibility would be to keep the scientific data in HDF5 and put it into BLOB columns of a relational database, but that is a misuse of the database. Another would be to store the HDF5 files on disk and have the relational database link to them, but then we lose self-containment. Neither of these two approaches accounts for distributing the data for direct access on the HPC nodes: the data would have to be pulled from a central node, which is not ideal.

Solution

I am not sure if I can give a proper solution, but we have a similar setup.

We have the meta-information stored in an RDBMS (PostgreSQL) and the actual scientific data in HDF5 files.
We have a couple of analyses that are run on our HPC. The way it is done is as follows:

  1. User wants to run an analysis (from a web-frontend)
  2. A message is sent to a central message broker (AMQP, RabbitMQ) containing the type of analysis and some additional information
  3. A worker machine (VM) picks up the message from the central message broker. The worker retrieves the meta-information from the RDBMS via REST, stages the input files on the HPC and then creates a PBS job on the cluster (a rough sketch of such a worker follows this list).
  4. Once the PBS job is submitted, a message with the job id is sent back to the message broker and stored in the RDBMS.
  5. The HPC job runs the scientific analysis and stores the result in an HDF5 file.
  6. Once the job is finished, the worker machine stages the HDF5 files out onto an NFS share and stores the link in the RDBMS.
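
As a rough sketch of what such a worker might look like (not our actual code): it consumes a request from RabbitMQ with pika, submits a PBS job with qsub and reports the job id back to the broker. The queue names, the message format and the job script are all assumptions:

```python
import json
import subprocess

import pika

QUEUE = "analysis-requests"      # hypothetical queue name

def on_message(channel, method, properties, body):
    """Pick up an analysis request, submit a PBS job, report the job id."""
    request = json.loads(body)   # e.g. {"analysis": "covariance", "dataset": 42}

    # (Staging omitted: fetch meta-information over REST and copy the
    #  HDF5 input files to the cluster's scratch space.)

    # Submit the job to the PBS scheduler; qsub prints the job id.
    result = subprocess.run(
        ["qsub", "-v", f"DATASET={request['dataset']}", "run_analysis.pbs"],
        capture_output=True, text=True, check=True,
    )
    job_id = result.stdout.strip()

    # Send the job id back via the broker so it ends up in the RDBMS (step 4).
    channel.basic_publish(exchange="", routing_key="job-status",
                          body=json.dumps({"job_id": job_id, **request}))
    channel.basic_ack(delivery_tag=method.delivery_tag)

connection = pika.BlockingConnection(pika.ConnectionParameters("broker.example.org"))
channel = connection.channel()
channel.queue_declare(queue=QUEUE, durable=True)
channel.queue_declare(queue="job-status", durable=True)
channel.basic_consume(queue=QUEUE, on_message_callback=on_message)
channel.start_consuming()
```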

I would recommend against storing the binary files in the RDBMS as BLOBs; I would keep them in HDF5 format. That way you can have different backup policies for the database and the filesystem.
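
A minimal sketch of that pattern (write the result to HDF5, store only the path plus some metadata in the database); the NFS path, table and column names are hypothetical:

```python
import h5py
import numpy as np
import psycopg2

# Write the scientific result to an HDF5 file on the shared filesystem.
result_path = "/nfs/results/job_1234.h5"     # hypothetical NFS path
covariance = np.random.rand(10, 10)          # stand-in for a real result
with h5py.File(result_path, "w") as f:
    f.create_dataset("covariance", data=covariance)
    f.attrs["job_id"] = "1234"

# Store only the link (plus metadata) in the relational database.
conn = psycopg2.connect("dbname=analysis user=worker")   # hypothetical DSN
with conn, conn.cursor() as cur:
    cur.execute(
        "INSERT INTO results (job_id, hdf5_path) VALUES (%s, %s)",
        ("1234", result_path),
    )
conn.close()
```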

A couple of additional pointers:

  • You could hide everything (both the RDBMS and the HDF5 storage) behind a REST interface. This might solve your self-containment issue (a minimal sketch follows this list).
  • If you want to store everything in a NoSQL DB, I would recommend having a look at Elasticsearch. It works well with time-series data, it is distributed out of the box and it also has a Hadoop plugin.
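
A minimal sketch of such a REST facade, assuming Flask and the same hypothetical results layout as above; the client never needs to know whether the bytes come from the RDBMS or from an HDF5 file:

```python
import h5py
from flask import Flask, jsonify

app = Flask(__name__)

# Hypothetical lookup: in practice this would query the RDBMS for the
# HDF5 path that belongs to a given job id.
HDF5_PATHS = {"1234": "/nfs/results/job_1234.h5"}

@app.route("/results/<job_id>/<dataset>")
def get_dataset(job_id, dataset):
    """Return one dataset from the job's HDF5 file as JSON."""
    path = HDF5_PATHS[job_id]
    with h5py.File(path, "r") as f:
        data = f[dataset][...]
    return jsonify(data.tolist())

if __name__ == "__main__":
    app.run(port=5000)
```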