I'm currently working on a problem that involves querying a tremendous amount of data (billions of rows) and, being somewhat inexperienced with this type of thing, would love some clever advice.

The data/problem looks like this:

  1. Each table has 2-5 key columns and 1 value column.
  2. Every row has a unique combination of keys.
  3. I need to be able to query by any subset of keys (e.g. key1='blah' and key4='bloo').
  4. It would be nice to be able to quickly insert new rows (updating the value if the row already exists), but I'd be satisfied if I could do this slowly.

Currently I have this implemented in MySQL running on a single machine with separate indexes defined on each key, one index across all keys (unique) and one index combining the first and last keys (which is currently the most common query I'm making, but that could easily change). Unfortunately, this is quite slow (and the indexes end up taking ~10x the disk space, which is not a huge problem).
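
For concreteness, the layout described above might look roughly like the following sketch. It uses SQLite through Python's `sqlite3` module only so that it is self-contained (the real system is MySQL), and it assumes four text keys. The table name `facts`, the column names, and the helpers `build_db`, `upsert`, and `query` are all made up for illustration. The separate index on `key1` is left out because the unique index over all keys already starts with that column.

```python
import sqlite3

KEY_COLUMNS = ("key1", "key2", "key3", "key4")  # hypothetical column names


def build_db():
    """Create an in-memory table with the indexes described in the question."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE facts (
            key1 TEXT NOT NULL,
            key2 TEXT NOT NULL,
            key3 TEXT NOT NULL,
            key4 TEXT NOT NULL,
            value REAL,
            -- unique index across all keys
            PRIMARY KEY (key1, key2, key3, key4)
        );
        -- one index per remaining key
        CREATE INDEX idx_key2 ON facts (key2);
        CREATE INDEX idx_key3 ON facts (key3);
        CREATE INDEX idx_key4 ON facts (key4);
        -- first and last key combined (the most common query)
        CREATE INDEX idx_key1_key4 ON facts (key1, key4);
    """)
    return conn


def upsert(conn, keys, value):
    """Insert a row, replacing the existing row if the key combination exists."""
    conn.execute(
        "INSERT OR REPLACE INTO facts (key1, key2, key3, key4, value) "
        "VALUES (?, ?, ?, ?, ?)",
        (*keys, value),
    )


def query(conn, **filters):
    """Return rows matching any subset of keys, e.g. query(conn, key1='blah')."""
    for col in filters:
        if col not in KEY_COLUMNS:
            raise ValueError(f"unknown key column: {col}")
    sql = "SELECT key1, key2, key3, key4, value FROM facts"
    if filters:
        sql += " WHERE " + " AND ".join(f"{col} = ?" for col in filters)
    return conn.execute(sql, tuple(filters.values())).fetchall()
```

A call such as `query(conn, key1='blah', key4='bloo')` corresponds to the subset query in point 3. In MySQL the usual counterpart of the upsert would be `INSERT ... ON DUPLICATE KEY UPDATE`.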

I happen to have a bevy of fast computers at my disposal (~40), which makes the incredible slowness of this single-machine database all the more annoying. I want to take advantage of all this power to make this database fast. I've considered building a distributed hash table, but that would make it hard to query for only a subset of the keys. It seems that something like BigTable / HBase would be a decent solution but I'm not yet convinced that a simpler solution doesn't exist.
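
To illustrate why a plain distributed hash table is awkward here, the sketch below shards rows by a hash of the full key tuple. It is a toy, in-memory model and not a real distributed system: `ShardedTable` and its methods are hypothetical names, each "machine" is just a Python dict, and a real shard would use local indexes instead of a scan. An exact lookup touches one shard, but a query on a subset of keys cannot compute the hash and has to ask every shard.

```python
import hashlib


class ShardedTable:
    """Toy model of a hash-partitioned table; each shard stands in for one machine."""

    def __init__(self, num_shards=40):
        self.shards = [dict() for _ in range(num_shards)]

    def _shard_index(self, keys):
        # Stable hash of the full key tuple decides which shard owns the row.
        digest = hashlib.sha1("\x00".join(keys).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.shards)

    def put(self, keys, value):
        """Insert or update: only the owning shard is touched."""
        self.shards[self._shard_index(keys)][tuple(keys)] = value

    def get(self, keys):
        """Exact lookup with all keys known: a single shard is consulted."""
        return self.shards[self._shard_index(keys)].get(tuple(keys))

    def query(self, filters):
        """Subset query, e.g. {0: 'blah', 3: 'bloo'} for the first and last key.

        The hash cannot be computed from a partial key, so every shard is asked
        (a real deployment would do this in parallel and merge the results).
        """
        results = []
        for shard in self.shards:
            for keys, value in shard.items():
                if all(keys[pos] == wanted for pos, wanted in filters.items()):
                    results.append((keys, value))
        return results
```

The broadcast in `query` is the cost being worried about: every subset query lands on all 40 machines, although each one only has to search about 1/40 of the data.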

Thanks very much, any help would be greatly appreciated!

No correct solution


I'd suggest you listen to this podcast for some excellent information on distributed databases: Episode 109, "eBay's Architecture Principles with Randy Shoup".

To point out the obvious: you're probably disk bound.

At some point if you're doing randomish queries and your working set is sufficiently larger than RAM then you'll be limited by the small number of random IOPS a disk can do. You aren't going to be able to do better than a few tens of sub-queries per second per attached disk.
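
A rough back-of-the-envelope version of that limit, as a sketch: the figures below (9 ms average seek, a 7200 RPM drive, 3 uncached random reads per query) are illustrative assumptions, not measurements of any particular system.

```python
def random_iops(avg_seek_ms, rpm):
    """Approximate random IOPS of one spinning disk.

    Each random read pays an average seek plus, on average, half a rotation.
    """
    half_rotation_ms = (60_000 / rpm) / 2
    return 1000 / (avg_seek_ms + half_rotation_ms)


def queries_per_second(iops, random_reads_per_query):
    """Upper bound on queries/second when every read misses RAM."""
    return iops / random_reads_per_query


# Assumed figures for illustration only.
iops = random_iops(avg_seek_ms=9.0, rpm=7200)
qps = queries_per_second(iops, random_reads_per_query=3)
print(f"~{iops:.0f} random IOPS, ~{qps:.0f} queries/second per disk")
```

With these assumed numbers a single disk manages on the order of 75 random reads per second, so a query that needs a handful of uncached index and row reads comes out at a few tens of queries per second.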

If you're up against that bottleneck, you might gain more by switching to an SSD, a larger RAID, or lots of RAM than you would by distributing the database among many computers (which would mostly just get you more of the last two resources).

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow