Question


Just brainstorming here: I'm searching for the most suitable distributed storage solution. I'm looking for an efficient key/value store with a flat namespace and minimal latency.

Scenario

I plan to store small blob records, 1 KB or less. They follow a mostly produce/consume pattern:

  • 1 write
  • 1 read, occasionally more
  • 1 delete, after several months, for archiving

However, some records may grow up to 10 MB; that is the maximum, but it must be supported.

The data must be persisted to disk.
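The lifecycle above can be sketched against any disk-backed key/value API. This is only an illustration of the workload, not a recommendation; Python's built-in dbm module stands in for the real store, and the path and key name are made up:

```python
import dbm
import os
import tempfile

def lifecycle(path, key, value):
    """Write once, read once, delete: the workload described above."""
    with dbm.open(path, "c") as db:
        db[key] = value              # 1 write (blob of ~1 KB)
        blob = db[key]               # 1 read, occasionally more
        del db[key]                  # delete, after several months in practice
        present = key in db
    return blob, present

path = os.path.join(tempfile.mkdtemp(), "blobs")
blob, still_there = lifecycle(path, b"record-00000001", b"\x00" * 1024)
```

Any candidate store only needs to serve this put/get/delete pattern fast at a scale of hundreds of millions of keys; iteration can stay slow.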

Important

My top priority is a store that provides good response times over a really huge number of records, possibly several hundred million.

Of course, at that scale, I don't care about iterating over the records (I want the functionality, but only for debugging or maintenance, so its performance doesn't matter).

And of course, it should scale, ideally without a SPOF.

It must run on Linux, and no cloud services are allowed (the data is private).

What I have found so far

I looked at Voldemort, Cassandra and HBase.

  • I'm afraid that Cassandra and HBase are not really efficient for blob records.
  • Voldemort still looks immature, and I can't find information about the record sizes and record counts it supports.

I also checked Lustre and Ceph, but they're not key/value stores.

Couchbase and MongoDB have terrible performance with persistence enabled.

I'm running some tests but can't launch a solid benchmark just yet. Does anyone have information about these solutions, or know of another product designed for this kind of workload?


Solution

Have you taken a look at in-memory data grids like Infinispan or Hazelcast? They scale well and are responsive, but storing 10 MB objects could become a problem if you ever want to run any processing over those entries. However, Hazelcast, for example, allows tasks to be executed on the cluster member that owns the target entries, which reduces the amount of inter-member data transfer.
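The data-locality idea behind executing a task on the owning member can be sketched generically: route each key to a deterministic owner, then run the task where the value already lives instead of shipping a potentially 10 MB blob across the network. This toy model is only illustrative; the class names and hash scheme are made up and are not Hazelcast's API (Hazelcast partitions keys into 271 partitions by default):

```python
import hashlib

class Node:
    def __init__(self, name):
        self.name = name
        self.store = {}          # this node's local partition of the data

class Cluster:
    def __init__(self, nodes):
        self.nodes = nodes

    def owner(self, key):
        # Deterministic key -> node mapping, so every member agrees
        # on who holds a given entry.
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return self.nodes[h % len(self.nodes)]

    def put(self, key, value):
        self.owner(key).store[key] = value

    def execute_on_key_owner(self, key, task):
        # The task travels to the owning node; the value does not move.
        node = self.owner(key)
        return task(node.store.get(key))

cluster = Cluster([Node("a"), Node("b"), Node("c")])
cluster.put("record-1", b"\x00" * 1024)
size = cluster.execute_on_key_owner("record-1", len)
```

Only the (small) task and its (small) result cross the wire, which is exactly why this pattern helps when entries are large.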

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow