Is there an OCaml library to store/use data structures on disk?

https://stackoverflow.com/questions/11514626

21-06-2021

Question

Something like bdb. However, I looked at ocaml-bdb, and it seems to be made to store only strings. My problem is that I have arrays that store giant data. Sure, I can serialize them into many files, or encode/decode my data and put it in a database or one of those key-value db things, which is my last resort. I'm wondering if there's a better way.

Solution

The HDF4 / HDF5 file format might suit your needs. See http://forge.ocamlcore.org/projects/ocaml-hdf/

OTHER TIPS

In addition to the HDF4 bindings mentioned by jrouquie, there are HDF5 bindings available (http://opam.ocaml.org/packages/hdf5/). Depending on the type of data you're storing, there are also bindings to GDAL (http://opam.ocaml.org/packages/gdal/).

For data which can fit in a bigarray you also have the option of memory mapping a large file on disk. See https://caml.inria.fr/pub/docs/manual-ocaml/libref/Bigarray.Genarray.html#VALmap_file for example. While it ties you to a rather strict on-disk format, it does make it relatively simple to manipulate arrays which are larger than the available RAM.
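
A minimal sketch of the memory-mapping approach, under a few assumptions: the data is a flat array of float64 values, the helper name map_floats is made up for this example, and the compiler is recent enough that the mapping function lives at Unix.map_file (it was moved there from Bigarray.Genarray.map_file), so the unix library has to be linked.

```ocaml
(* Requires the unix library, e.g. ocamlfind ocamlopt -package unix ... *)
open Bigarray

(* Hypothetical helper: map [n] float64 values stored in [path].
   The file is created if missing and grown if it is too small. *)
let map_floats path n =
  let fd = Unix.openfile path [ Unix.O_RDWR; Unix.O_CREAT ] 0o644 in
  (* shared = true: writes to the bigarray go through to the file. *)
  let ga = Unix.map_file fd float64 c_layout true [| n |] in
  (* On POSIX systems the mapping stays valid after the descriptor is closed. *)
  Unix.close fd;
  array1_of_genarray ga

let () =
  let path = Filename.temp_file "mapped" ".bin" in
  let a = map_floats path 1_000 in
  for i = 0 to Array1.dim a - 1 do
    a.{i} <- float_of_int i
  done;
  (* Map the same file a second time and read a value back. *)
  let b = map_floats path 1_000 in
  assert (b.{999} = 999.0);
  Sys.remove path
```

The on-disk format here is just the raw float64 values in native byte order, which is the "rather strict" part: there is no header, no type information and no portability across architectures with a different endianness.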

There was an OCaml BerkeleyDB wrapper in the past: OCamlDB

Apparently someone looked into it recently: recent patch for OCamlDB

However, the GDAL bindings from hcarty are probably production-ready and in heavy use somewhere.

Also, there are bindings for dbm in opam: dbm and cryptodbm

HDF5 is probably the answer, but given that the question is somewhat vague, another solution is possible.

Disclaimer: I don't know OCaml (but I knew Caml Light) and I know Berkeley DB (a.k.a. bsddb, a.k.a. bdb).

However, I looked at the ocaml-bdb, seems like it's made to store only string.

That may be true of ocaml-bdb, but in reality bdb stores bytes. I am not sure about your case, because in Python 2 there was no clear difference between bytes and strings of unicode chars. It was only with Python 3 that a proper bytes type arrived, and the bdb bindings now take and return bytes. That said, the difference is subtle, but you'd rather work with bytes because that is what bdb understands and uses.

My problem is I have arrays that store giant data. Sure, I can serialize them into many files, or encode/decode my data and put them on database

or use those key-value db things, which is my last resort.

I'm wondering if there's a better way.

It depends on your needs and on what the data looks like.

If the data can all stay in memory, you are better off dumping memory to a file and loading it back.
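
In OCaml, one way to do the dump-and-reload is the standard Marshal module. This is only a sketch: save and load are names invented for the example, and the resulting file is in OCaml's own marshalling format, so it is meant to be read back by OCaml code that knows which type was written.

```ocaml
(* Write any OCaml value to [path] using the standard Marshal module. *)
let save path v =
  let oc = open_out_bin path in
  Fun.protect
    ~finally:(fun () -> close_out oc)
    (fun () -> Marshal.to_channel oc v [])

(* Read a value back. Marshal is not type-safe: the type annotation at the
   call site must match the type of the value that was written. *)
let load path =
  let ic = open_in_bin path in
  Fun.protect
    ~finally:(fun () -> close_in ic)
    (fun () -> Marshal.from_channel ic)

let () =
  let path = Filename.temp_file "data" ".bin" in
  let data = Array.init 1_000_000 float_of_int in
  save path data;
  let (back : float array) = load path in
  assert (back = data);
  Sys.remove path
```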

If you need to share that data among several architectures or operating systems, you'd rather use a serialisation framework like HDF5. Remember that HDF5 doesn't handle circular references.

If the data cannot all stay in memory, then you need to use something like bdb (or wiredtiger).

Why bdb (or wiredtiger)

Simply said, several decades of work have gone into:

splitting data
storing it on disk
retrieving data

As fast as possible.

wiredtiger is the successor of bdb.

So yes, you could split the files yourself and so on, but that will require a lot of work. Only specialized companies do that (Bloomberg included...); among those who manage all of the above themselves are the well-known PostgreSQL, MariaDB, Google and Algolia.

Ordered key-value stores like wiredtiger and bdb use algorithms similar to those of higher-level databases like PostgreSQL and MySQL, or specialized ones like Lucene/Solr or Sphinx, i.e. MVCC, B-trees, LSM trees, PSSI, etc.

Since 3.2, MongoDB uses wiredtiger as its default storage backend.

Some people argue that key-value stores are not good at storing relational data. That said, several projects have built distributed databases on top of key-value stores, e.g. FoundationDB or CockroachDB, which is a clue that the approach is useful.

The idea behind key-value stores is to deliver a generic framework for:

splitting data
storing it on disk
retrieving data

As fast as possible, giving some guarantees (like ACID) and other nice to haves (like compression or cryptography).

To take advantage of the power offered by those libraries, you need to learn about key-value composition.
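
To give an idea of what key composition can look like for the "giant arrays" in the question, here is a rough sketch. Everything in it is an assumption made for illustration: the array is split into chunks, each chunk is stored under a key built from an array id and a chunk index, and a Map over strings stands in for the real ordered store (bdb or wiredtiger bindings would replace it). The names key, encode_chunk, decode_chunk and chunks_of are invented for the example.

```ocaml
(* In-memory stand-in for an ordered key-value store: keys are byte strings
   kept in lexicographic order. *)
module Store = Map.Make (String)

(* Compose a key from an array id and a chunk index. Fixed-width big-endian
   encoding makes byte order agree with numeric order (non-negative values). *)
let key ~array_id ~chunk =
  let b = Bytes.create 8 in
  Bytes.set_int32_be b 0 (Int32.of_int array_id);
  Bytes.set_int32_be b 4 (Int32.of_int chunk);
  Bytes.to_string b

(* Values are plain bytes: each float is stored as its 8-byte IEEE 754 bits. *)
let encode_chunk (xs : float array) =
  let b = Bytes.create (8 * Array.length xs) in
  Array.iteri
    (fun i x -> Bytes.set_int64_be b (8 * i) (Int64.bits_of_float x))
    xs;
  Bytes.to_string b

let decode_chunk s =
  let b = Bytes.of_string s in
  Array.init (Bytes.length b / 8) (fun i ->
      Int64.float_of_bits (Bytes.get_int64_be b (8 * i)))

(* Range scan: start at chunk 0 of [array_id] and stop at the first key
   whose 4-byte array id prefix differs. *)
let chunks_of store ~array_id =
  let first = key ~array_id ~chunk:0 in
  let prefix = String.sub first 0 4 in
  let rec go acc seq =
    match seq () with
    | Seq.Cons ((k, v), rest) when String.sub k 0 4 = prefix ->
        go (decode_chunk v :: acc) rest
    | _ -> List.rev acc
  in
  go [] (Store.to_seq_from first store)

let () =
  let store =
    Store.empty
    |> Store.add (key ~array_id:1 ~chunk:1) (encode_chunk [| 3.0; 4.0 |])
    |> Store.add (key ~array_id:2 ~chunk:0) (encode_chunk [| 9.0 |])
    |> Store.add (key ~array_id:1 ~chunk:0) (encode_chunk [| 1.0; 2.0 |])
  in
  (* Chunks come back in key order, whatever the insertion order was. *)
  let whole = Array.concat (chunks_of store ~array_id:1) in
  assert (whole = [| 1.0; 2.0; 3.0; 4.0 |])
```

The point of the composition is that all chunks of one array sit next to each other in key order, so reading a whole array (or a slice of it) becomes one ordered scan instead of many unrelated lookups.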

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow