In-memory database scalability

Question 1

in-memory databases scale in size the same way as on-disk (aka persistent) databases do: either throw more storage at it (memory, in this case) or distribute it across multiple nodes of a cluster. The latter alternative increases the complexity (both of the DBMS, and your administration of it), relative to an in-memory database on a single system. Consider the difference between vanilla MySQL and MySQL Cluster. And, you'll want to have a really fast network for those times when the DBMS has to perform inter-node operations (e.g. distribute the data or pull data from multiple nodes to satisfy a query).

There's nothing particularly special about in-memory databases in this regard. There are some special optimizations in the database engine when you know storage is memory. But it doesn't change the fundamental principles of database systems.

What you don't want to do is create an in-memory database larger than physical memory. You'll force the OS to swap in-memory database pages in/out of swap space, and the performance will suck. You're better off, in that case, using a conventional DBMS and giving it as much cache as you have memory available for. The DBMS will use the cache more intelligently than the OS' will the swap space.

Question 2

Current production-ready in-memory databases have mainly focused on scale-up as opposed to scale-out. So-far, they have either managed to integrate main memory tier into their core, existing architecture (IBM via Blu acceleration) or have re-built the database from almost-scratch to leverage the main-memory as primary storage layer (SAP HANA), and in both cases their claim to fame is the obvious speedup that DRAM offers in comparision to the disk.

However very few databases, presently, have a complete offering which scales-out in-memory performance benefit accross multiple nodes. Most of the in-memory databases require the applications to manage the distribution of data/objects across nodes (Ex: SAP HANA).

Oracle's DBIM and MemSQL are a few scalable and distributed options, at this time, that implement distributed in-memory database/tier by collective utilization of memory resources across the cluster (RAC in case of Oracle). MemSQL can be deployed on a cluster of commodity compute nodes and it claims to scale by utilizing aggregate resources, including memory. Oracle RAC is a shared cache architecture that overcomes the limitations of traditional shared-nothing and shared-disk approaches to provide highly scalable and available database solutions, including in-memory benefits.