Question

I'm looking for a good database solution to store large amounts of scientific data (hundreds of GB to several TB). Ideally it would also scale to larger quantities of data.

Requirements

My datafiles are "images", a ~4 million entry array (1000x1000x3 ints + 1000x1000 floats), plus associated metadata of ~50-100 entries per image. The metadata is stored hierarchically. Images will be organized into one or several "folders" (or "projects"), which themselves can contain other folders. Everything has owners, etc.

I will need to search 100-10,000 images, in one or several folders, based predominantly on their metadata. Then, I might need to pull slices from the images; I really don't want to load all of the data if I only need a fraction of it. The images should be stored in a compressed format.

Edit: It is important to emphasize that my data is not uniform. Images, for instance, are floats or ints of unknown dimensions with typically 10^5-10^6 entries, and the number of metadata entries per image might vary. Searching metadata across images would of course be limited to those with identical keys.
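
One way to picture searching over metadata whose keys vary per image is a key-value table beside the image records. The following is only a sketch under stated assumptions: it uses Python's built-in sqlite3 as a stand-in for a server database, assumes all metadata values are numeric, and assumes folders are stored as paths ending in "/". The table names and the helpers build_index, add_image and find_images are hypothetical.

```python
import sqlite3


def build_index(conn):
    """Create one row per image plus one row per (image, metadata key)."""
    conn.executescript(
        """
        CREATE TABLE image (
            id INTEGER PRIMARY KEY,
            folder TEXT NOT NULL,      -- assumed convention: path ending in '/'
            hdf5_path TEXT NOT NULL    -- where the array data actually lives
        );
        CREATE TABLE metadata (
            image_id INTEGER NOT NULL REFERENCES image(id),
            key TEXT NOT NULL,
            value REAL NOT NULL,       -- assumption: numeric metadata only
            PRIMARY KEY (image_id, key)
        );
        CREATE INDEX metadata_key_value ON metadata (key, value);
        """
    )


def add_image(conn, folder, hdf5_path, metadata):
    """Register an image; each image may carry a different set of keys."""
    cur = conn.execute(
        "INSERT INTO image (folder, hdf5_path) VALUES (?, ?)",
        (folder, hdf5_path),
    )
    image_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO metadata (image_id, key, value) VALUES (?, ?, ?)",
        [(image_id, key, value) for key, value in metadata.items()],
    )
    return image_id


def find_images(conn, key, low, high, folder_prefix=""):
    """Return HDF5 paths of images whose `key` lies in [low, high].

    Images that lack `key` never match. The folder prefix is used in a
    LIKE pattern, so this sketch assumes it contains no '%' or '_'.
    """
    rows = conn.execute(
        """
        SELECT image.hdf5_path
        FROM image
        JOIN metadata ON metadata.image_id = image.id
        WHERE metadata.key = ?
          AND metadata.value BETWEEN ? AND ?
          AND image.folder LIKE ? || '%'
        ORDER BY image.id
        """,
        (key, low, high, folder_prefix),
    )
    return [row[0] for row in rows]
```

A prefix such as "projA/" also matches nested folders like "projA/run2/", which roughly mirrors folders that contain other folders. Only the paths come back from the query, so the array data itself stays untouched until a slice is actually needed.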

Current Approach

My current, and not so great, solution is to mix databases. First, I'm using a SQL database (Django + MySQL right now) to handle folders and owners; it has a record for each image, but none of its data. I might create records for the metadata as well. Second, I'm using PyTables to store the images and metadata in HDF5 format and treating it like a database. This solves the slicing and compression problem, and allows me to store the metadata hierarchically, but PyTables does not seem scalable and is far less developed than commercial databases. (It's not made for a multiuser environment: I'm writing my own locks, which is a bad sign.)
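
For readers unfamiliar with the PyTables half of this setup, a minimal sketch of compressed, chunked storage with partial reads might look like the following. It assumes a 3-D integer image and scalar metadata stored as node attributes; the "/images" group, the 100x100 chunk size, and the helpers write_image and read_slice are illustrative choices, not the actual layout used in the question.

```python
import numpy as np
import tables


def write_image(path, name, image, metadata):
    """Store a 3-D array as a compressed, chunked CArray with attributes."""
    filters = tables.Filters(complevel=5, complib="zlib")
    # Chunks are the unit of compression and I/O; keep them within the array.
    chunkshape = tuple(min(100, n) for n in image.shape[:2]) + image.shape[2:]
    with tables.open_file(path, mode="a") as h5:
        arr = h5.create_carray(
            "/images", name, obj=image,
            filters=filters, chunkshape=chunkshape, createparents=True,
        )
        for key, value in metadata.items():
            setattr(arr.attrs, key, value)


def read_slice(path, name, rows, cols):
    """Read only the requested tile; only the chunks it touches are read."""
    with tables.open_file(path, mode="r") as h5:
        arr = h5.get_node("/images", name)
        tile = arr[rows, cols, :]
        attrs = {k: getattr(arr.attrs, k) for k in arr.attrs._f_list("user")}
    return tile, attrs
```

Calling read_slice(path, "img0", slice(10, 20), slice(30, 50)) would return a 10x20 tile with all channels plus the stored metadata, without loading the rest of the image. Note that this sketch does nothing about concurrent writers, which is the multiuser gap described above.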

Help!

I'm not a hardcore programmer, so a standard database solution is strongly preferred. My "optimization" would definitely include maintenance and programming cost. Can anyone recommend favorite database solutions or architectures? Ideas on relational vs hierarchical vs other?

Options might be SciDB (not common, could be good), SQL (heard it's bad for these applications, maybe PostgreSQL?), and HBase (actually, I know nothing about it). I feel like there must be good solutions in the scientific, especially astronomy, community, but the large-scale projects seem to require a serious team to build and maintain.

I'm happy to provide lots more info.

Solution

Did you store the data in HDF5 format? Since you already mentioned that you were reluctant to load all of the data, you may not really like the array database options like SciDB, MonetDB or RasDaMan. It is very painful to load big data in raw scientific format into a database, and it usually also requires some extra programming work.

You can check this paper: "Supporting a Light-Weight Data Management Layer over HDF5". It proposes running SQL queries directly over HDF5 files.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow