Question

I'm looking for a key value store that can be used from an EC2 instance.

  • item is just an unstructured string, no indexing required
  • item size up to ~5MB but usually below 10kB
  • lots of writes
  • reading doesn't need to be fast, memcache can be put in front that caches frequently needed reads
  • data is too big to fit into memory
  • Eventual Consistency is fine
  • daemon that can be accessed from multiple machines is required

Ideally something AWS hosted would be perfect but:

  • S3 doesn't fit because of too many writes
  • SimpleDB/DynamoDB don't fit because of their item size limits, and indexing is not required

With so many key-value stores on the market, it's hard to choose the best one. Which one would you recommend?


Solution

I found the perfect solution for my use case: memcachedb

It doesn't do fancy document/indexing, it's just a simple key value store.

I haven't done any performance testing yet, though.
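Since memcachedb speaks the standard memcached text protocol, any memcached client can talk to it. Below is a minimal sketch over a raw socket; the host, port (21201 is memcachedb's usual default), and key names are assumptions, and `demo()` is only meant to be run against a live daemon:

```python
import socket

def encode_set(key: bytes, value: bytes) -> bytes:
    """Build a memcached-protocol 'set' command (flags=0, no expiry)."""
    return b"set %s 0 0 %d\r\n%s\r\n" % (key, len(value), value)

def encode_get(key: bytes) -> bytes:
    """Build a memcached-protocol 'get' command."""
    return b"get %s\r\n" % key

def kv_set(host: str, port: int, key: bytes, value: bytes) -> bool:
    # memcachedb answers "STORED" on a successful set, just like memcached.
    with socket.create_connection((host, port)) as s:
        s.sendall(encode_set(key, value))
        return s.recv(1024).strip() == b"STORED"

def demo():
    # Assumes a memcachedb daemon reachable at this address (hypothetical).
    print(kv_set("127.0.0.1", 21201, b"user:42", b"some unstructured blob"))
```

In practice you would use an existing memcached client library rather than raw sockets; the point is that no memcachedb-specific client is needed.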

Edit:

We dropped memcachedb due to problems with replication and now run MongoDB instead. MongoDB requires much more disk space, and more resources in general, but its replica sets work very reliably and are easy to set up.

OTHER TIPS

Maybe you should try MongoDB:
http://www.mongodb.org/display/DOCS/Amazon+EC2

Quickstart:
http://www.mongodb.org/display/DOCS/Amazon+EC2+Quickstart

Free courses at 10gen and video presentations:
http://www.10gen.com/presentations/nyc-meetup-group/mongodb-and-ec2-a-love-story

Other key-value stores:
http://google-opensource.blogspot.com/2011/07/leveldb-fast-persistent-key-value-store.html

Comments about Riak and its storage backends, especially Bitcask and InnoStore:
http://basho.com/blog/technical/2011/07/01/Leveling-the-Field/

RaptorDB: An extremely small and fast embedded, NoSQL, persisted dictionary database using B+tree or MurMur hash indexing. It was primarily designed to store JSON data (see my fastJSON implementation), but it can store any type of data that you give it.

HamsterDB: A delightful engine written in C++, which impressed me a lot with its speed while I was using Aaron Watters' code for indexing. (RaptorDB eats it alive now... ahem!) It's quite large at 600KB for the 64-bit edition.

Esent PersistentDictionary: A project on CodePlex which is part of another project that implements a managed wrapper over the built-in Windows ESENT data storage engine. The dictionary's performance degrades exponentially after 40,000 items indexed, and the index file just grows on GUID keys. After talks with the project owners, it's apparently a known issue at the moment.

Tokyo/Kyoto Cabinet: A C++ implementation of a key store which is very fast. Tokyo Cabinet is a B+tree indexer, while Kyoto Cabinet is a MurMur2 hash indexer.

4aTech Dictionary: This is another article on CodeProject which does the same thing; the commercial version on the web site is huge (450KB) and fails dismally, performance-wise, on GUID keys after 50,000 items indexed.

BerkeleyDB: The granddaddy of all databases, which is owned by Oracle and comes in 3 flavours: a C++ key store, a Java key store, and an XML database.

(Quotation source: http://www.codeproject.com/Articles/190504/RaptorDB)

Seems like a perfect use case for HBase. It gives great write throughput, especially if your insert keys are somewhat random. HBase is not usually advertised as a K/V store, but it should work just fine. The AWS documentation presents some use cases you might want to have a closer look at. The downside is that HBase can do a lot more than just K/V, so it might be more complex (and complicated) than what you need.
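Since HBase writes spread best when row keys are somewhat random, a common trick for sequential or clustered keys is to salt them with a deterministic hash-derived prefix, so inserts fan out across regions instead of hot-spotting one region server. A sketch (the bucket count, table name, and Thrift host are assumptions; `demo()` uses the third-party `happybase` client and is only meant to run against a live cluster):

```python
import hashlib

def salted_row_key(key: str, buckets: int = 16) -> bytes:
    """Prefix the key with a stable hash bucket (00..15) so that
    lexicographically close keys land in different HBase regions.
    Reads recompute the same prefix, so lookups stay cheap."""
    bucket = int(hashlib.md5(key.encode()).hexdigest(), 16) % buckets
    return b"%02d:%s" % (bucket, key.encode())

def demo():
    # Assumes happybase and a reachable HBase Thrift server (hypothetical names).
    import happybase
    conn = happybase.Connection("hbase-thrift-host")
    table = conn.table("kvstore")
    table.put(salted_row_key("user:42"), {b"d:value": b"some unstructured blob"})
    row = table.row(salted_row_key("user:42"))
    print(row[b"d:value"])
```

The trade-off is that range scans over the original key order now require scanning every bucket, which is fine for a pure K/V workload like the one described.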

Couchbase sounds like a good match for your needs. It's a lot like having memcached with disk storage.

Pros:

  • It's a key/value database. You can store whatever binary blob you want. As of version 2.0 it has support for storing your data as JSON and running queries and map/reduce on it. But if you don't need that, using it as a key/value store works great.

  • Of all the NoSQL databases I've tried, it's the fastest. This may be because your writes are not immediately committed to disk. Instead, you get an acknowledgment once a write is replicated in the cluster. Data is written to disk asynchronously. So, one potential downside is that if all your nodes crashed simultaneously (e.g. your data center loses power), you may lose data. Depending on the application this may or may not be an issue (and if your whole cluster goes down, you probably have bigger problems).

  • In my experience it has been reliable. If a node goes down, the cluster keeps working and it's very easy to do a failover. Adding new nodes is pretty easy too.

  • Data doesn't have to fit in memory. It gets stored on disk and paged in and out as necessary.

  • The admin interface is very, very nice. It has nifty live graphs to monitor the cluster.

  • It's backwards compatible with the memcached protocol. If you already have code that uses memcached, it'd be pretty straightforward to have it use Couchbase instead.

Cons:

  • The product is still somewhat young, so documentation and support tools are somewhat lacking. This can be a bit annoying sometimes.
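The memcached compatibility mentioned above means migrating existing code is mostly a matter of pointing the client at a Couchbase node. A sketch using the third-party `python-memcached` library (the library choice, hostname, and port are assumptions; 11211 is the classic memcached port, and `demo()` is only meant to run against a live node):

```python
def connect_kv(servers):
    """Return an ordinary memcached client pointed at Couchbase. Because
    Couchbase is wire-compatible with memcached, existing memcached code
    needs only a different server address, not a new client library."""
    import memcache  # python-memcached, third-party
    return memcache.Client(servers)

def demo():
    # Hypothetical Couchbase node exposing the memcached-compatible port.
    mc = connect_kv(["couchbase-node:11211"])
    mc.set("user:42", "some unstructured blob")
    print(mc.get("user:42"))
```

Later, if the 2.0 JSON/map-reduce features become interesting, you can switch to the native Couchbase SDK without changing how data was stored.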
Licensed under: CC-BY-SA with attribution