Question

I'm researching possible options how to organize data storage for an Erlang application. The data it supposed to use is basically a huge collection of binary blobs indexed by short string ids. Each blob is under 10 Kb but there are many of them. I'd expect that in total they would have size up to 200 Gb so obviously it cannot fit into memory. The typical operation on this data is either reading a blob by its id or updating a blob by its id or adding a new one. At each given period of day only a subset of ids is being used so the data storage access performance might benefit from in-memory cache. Speaking about performance - it is quite critical. The target is to have around 500 reads and 500 updates per second on commodity hardware (say on EC2 VM).

Any suggestions what to use here? As I understand dets is out of question as it is limited to 2G (or was it 4G?). Mnesia probably out of question too; my impression is that it was mainly designed for cases when data fits memory. I'm considering trying EDTK's Berkeley DB driver for the task. Would it work in the above scenario? Does anybody have experience using it in the production in the similar conditions?

Was it helpful?

Solution

tcerl came out of facing the same size limit. I'm not using Erlang these days but it sounds like what you're looking for.

OTHER TIPS

Have you looked at what CouchDB is doing? It might not be quite what you are after as a drop in product, but there is lots of erlang code in there for storing data. There is also some talk of providing a native erlang interface instead of the REST api.

Is there any reason why you can't just use a file system, treating filename as your string id and file contents as a binary blob? You can choose one (filesystem) that fits your performance requirements, and you should get caching basically for free, provided by your OS.

Mnesia can store data on disk just fine. There's also dets (disk based term storage) which is roughly analogous to Berkeley DB. It's in the standard lib: http://www.erlang.org/doc/apps/stdlib/index.html

I would recommend Apache CouchDB.

It's a great fit for Erlang, and from the sound of it (you mention ID-based blobs and don't mention any relational requirements) you're looking for a document-oriented database.

Since the interface is REST, you can very simply add a commodity HTTP cache in front of it if you need caching.

The documentation for CouchDB is of a very high quality.

It also has built-in Map-Reduce :)

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top