Question

We're setting out to build an online platform (API, Servers, Data, Wahoo!). For context, imagine that we need to build something like twitter, but with the comments (tweets) organized around a live event. Information about the live event itself must be delivered to clients as fast and consistently as possible, while comments about the event can probably wait a bit longer to be delivered. We'll be read-heavy after the live event finishes.

Scalability is very important. We want to start out renting VPS slices, and scale from there. I'm a big fan of the cloud, and would like to remain there as long as possible. We'll probably be using ruby.

I'm convinced that I want to try a document store instead of an RDBMS. I like the idea of schema-less storage and the promises of easier scalability by focusing on key-value.

The problem is I don't know which technology is the most appropriate for our platform. I've looked at Couch, Mongo, Tokyo Cabinet, Cassandra, and an RDBMS with blobbed documents. Any help picking the right tool for this particular job?

Was it helpful?

Solution

Checkout the NO SQL alternatives comparison by BJ Clark.

Scalability is very important.

Then you need to consider the excerpts from his blog:

  1. Tokyo Cabinet - Doesn't scale
  2. Redis - Doesn't scale
  3. Project Voldemort - scales
  4. MongoDB - limted (sharding is been implemented)
  5. Cassandra - scales
  6. Amazon S3 - scales
  7. Couch - Doesn't scale (Clustering & replication)
  8. MySQL - Doesn't scale

And consider HyperTable. This is also a serious contender in No-SQL alternatives. It's an open source implementation of Google's BigTable concept. I believe it scales well because it's extensively used by the Chinese search engine Baidu and entertainment portal Rediff.

You were saying:

Information about the live event itself must be delivered to clients as fast and consistently as possible, while comments about the event can probably wait a bit longer to be delivered. We'll be read-heavy after the live event finishes.

This is something like Twitter's approach. Your programming language selection is also very important, because Twitter initially went with Ruby for back-end message delivery but they were saying it's not a correct choice and they have moved the entire message delivery system to the Scala language.

They are still using Ruby for their front-end. If you want to go with a highly reliable, fault tolerant system that is well suited for scalable environments, then you should consider Scala or Erlang.

OTHER TIPS

Ramesh has a good summary. I would add that Cassandra has a richer data model than vanilla Dynamo clones (like Voldemort or Dynomite): rows with named, sorted columns rather than just key/value. Cassandra is being used by Twitter, Mahalo, Ooyala, SimpleGeo, WebEx, and others (http://n2.nabble.com/Cassandra-users-survey-td4040068.html), at least some of which are running Cassandra clusters on EC2 or rackspace cloud servers.

If you want to scale horizontally (distribute your data over more than one node) you have to take the CAP theorem into account.

http://www.julianbrowne.com/article/viewer/brewers-cap-theorem

It is not easy stuff but you have to choose, there is always some kind of trade off.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top