Question

I have been testing out Titan-Cassandra and OrientDB lately and a question came to mind.

I was just wondering how graph databases shard graphs across clusters, and how their query interfaces support querying sharded graphs, e.g. finding the shortest path between two nodes.

I know that Gremlin implements the MapReduce pattern for its groupBy function.

But I want to understand in more depth how querying and sharding interact, and how the two databases handle queries over sharded graphs. In particular, I'm interested in how OrientDB's SQL interface supports querying across sharded graphs.

I know Neo4j argues against sharding, as suggested in a previous question I asked.


Solution

Please see the following two posts about Titan (http://titan.thinkaurelius.com):

Typically, when you begin developing a graph application, you are using a single machine. In this model, the entire graph is on one machine. If the graph is small (in data size) and the transactional load is low (not a massive amount of reads/writes), then when you go into production you simply add replication for high availability. With non-distributed replication, the data is fully copied to the other machines, and if any one machine goes down, the others are still available to serve requests. Again, note that in this situation your data is not partitioned/distributed, just replicated.

Next, as your graph grows in size (beyond the memory and HD space of a single machine), you need to start thinking about distribution. With distribution, you partition your graph over a multi-machine cluster and (to ensure high availability) make sure you have some data redundancy (e.g. replication factor 3).
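To make the replication-factor idea concrete, here is a minimal sketch (not Titan's actual placement code) of assigning each partition to three machines in a ring, so that losing any single machine still leaves two copies; the function and machine names are illustrative assumptions.

```python
import hashlib

def replica_machines(partition_id, machines, replication_factor=3):
    """Illustrative ring-style placement: pick `replication_factor`
    consecutive machines for a partition so copies land on distinct nodes."""
    digest = int(hashlib.md5(partition_id.encode()).hexdigest(), 16)
    start = digest % len(machines)
    return [machines[(start + i) % len(machines)]
            for i in range(replication_factor)]

machines = ["m0", "m1", "m2", "m3", "m4"]
print(replica_machines("partition-7", machines))  # three distinct machines
```

The key property is that every partition lives on multiple machines, so a single failure never makes any part of the graph unavailable.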

There are two ways to partition data in Titan currently:

  1. Random partitioning: Vertices and their co-located incident edges are distributed amongst the cluster. That is, a vertex and its incident edges form a "bundle of data" and exist together on a machine. Random partitioning ensures that the cluster is properly balanced so no one machine is maintaining all the data. A simple distribution strategy that is generally effective.
  2. User directed partitioning: A vertex (and its incident edges) is assigned to a partition (this partition ultimately represents a machine -- though not fully true because of replication and the same data existing on multiple machines). User directed partitioning is useful for applications that understand the topology of their domain. For example, you may know that there are fewer edges between people of different universities than there are between people of the same university. Thus, a smart partition would be based on university. This ensures proper vertex-vertex colocation and reduces multi-machine hopping to solve a traversal. The drawback is you want to make sure your cluster isn't too unbalanced (all the data on one partition).
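The two strategies above can be sketched as follows; this is a toy illustration under assumed names (the vertex dicts, `partition_of` map, and university example are mine, not Titan's API):

```python
import hashlib

def random_partition(vertex_id, num_partitions):
    # Hash-based placement: a vertex and its incident edges are
    # bundled together, but which machine they land on is arbitrary.
    digest = hashlib.md5(str(vertex_id).encode()).hexdigest()
    return int(digest, 16) % num_partitions

def user_directed_partition(vertex, partition_of):
    # Domain-aware placement: key the partition on a property the
    # application knows clusters the edges, e.g. university.
    return partition_of[vertex["university"]]

partition_of = {"MIT": 0, "Stanford": 1}
alice = {"id": 1, "university": "MIT"}
bob = {"id": 2, "university": "MIT"}

# Alice and Bob share a partition, so a traversal between them
# stays on one machine instead of hopping across the cluster.
assert user_directed_partition(alice, partition_of) == \
       user_directed_partition(bob, partition_of)
```

Random partitioning balances load; user-directed partitioning trades some balance for fewer cross-machine hops on traversals that follow the domain's natural clustering.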

At the end of the day, the whole story is about co-location. Can you ensure that co-retrieved data is close in physical space?

Finally, note that Titan allows for parallel reads (and writes) using Faunus (http://faunus.thinkaurelius.com). Thus, if you have an OLAP question that requires scanning the entire graph, then Titan's co-location model is handy, as a vertex and its edges are a sequential read from disk. Again, the story remains the same -- co-location in space in accordance with co-retrieval in time.
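The OLAP pattern can be sketched in miniature: each partition is scanned independently (a sequential read of vertices with their co-located edges), and per-partition results are merged, in the spirit of a Faunus-style MapReduce job. The partition layout and edge encoding below are illustrative assumptions, not Titan's storage format.

```python
from collections import Counter

# Two toy partitions: vertex id -> list of co-located outgoing edges.
partitions = [
    {1: ["knows->2", "knows->3"], 2: ["knows->3"]},  # machine A
    {3: ["knows->1"]},                               # machine B
]

def scan_partition(part):
    # "Map" step: compute out-degree per vertex from local data only,
    # with no cross-machine communication needed.
    return Counter({v: len(edges) for v, edges in part.items()})

total = Counter()
for part in partitions:  # in a real cluster these scans run in parallel
    total.update(scan_partition(part))  # "reduce" step: merge results

print(dict(total))  # → {1: 2, 2: 1, 3: 1}
```

Because each vertex and its edges live together, the per-partition "map" step is a pure sequential scan, which is exactly the co-location payoff the paragraph above describes.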

Licensed under: CC-BY-SA with attribution