Titan vertex centric indices vs Neo4j labels

Question 1

Agreeing with everything Marko said, one could take it further and argue that in the graph database world local indexes can (and even should) substitute global ones. In my opinion, the single greatest advantage of a graph data model is that it lets you encode your data model into the graph topology, gaining qualitative advantages in terms of flexibility, ease of evolution and performance. With this in mind, I'd argue that labels in Neo4j actually detract from all this; reifying a label into a node with adjacent edges pointing to the source having that label is much more in line with the "schema is the graph" philosophy.

Of course, if your engine lacks local indexes we are back at the supernode problem. But if you do have them (something which I'd say should be a requirement for something to be called a graph database), you can easily transform your label into a node L, and create relationships pointing to that node for those vertices which you want labeled with L

v -[L]-> L

meaning that v has label L. Now if you want this in Titan to behave like a Neo4j label, just make the -[L]-> relation to be "manyToOne" (see Titan cardinality constraints) and create a vertex-centric index. This pattern lets you get everything that you could with labels and much more; you can

effectively use this as a namespace for properties relating to that label
sort your elements inside one label
nest labels easily without losing performance (just use a composite key)
separate the declaration of a label L with how elements labeled with it are accessed

Question 2

The difference between the two concepts is the difference between global and local indexing.

As I understand it, Neo4j vertex labels allow you to break up your index space by "categories" of vertices. In this way, a O(log(|V|)) lookup is now an O(log(|V|/c)), where c is the number of categories/labels you have over your vertex set and (the equation) assumes an equal number of vertices in each category. As such, vertex label aid in global index calls as this is a function of V.

Next, Titan's vertex-centric indices sort and index the incident edges of a vertex. The cost to find a particular edge by its label/properties incident to a vertex is O(log(inc(v))), where inc(v) is the size of the incident edge set to vertex v. As such, vertex-centric indices are local indices as this is a function of v.

As I understand it, Neo4j does not support vertex-centric indices. You see this concept currently in Titan, OrientDB, and TinkerGraph (…and RDF stores sort in this manner as well -- via spog pairings). Next, all known graph databases support global indices though, (I believe only Neo4j and OrientDB), support a vertex set partition via the concept of a label.

Again, assuming my assumptions are correct about the use of vertex labels in Neo4j, we are talking about two different use cases — global vs. local indexing. From the perspective of the supernode problem, global indices do not quell the issue of traversing through a large vertex, while this is the sole purpose of the local vertex-centric indices.

You can read about the supernode problem and vertex-centric indices here:
http://thinkaurelius.com/2012/10/25/a-solution-to-the-supernode-problem/

Question 3

Labels may afford some design patterns that improve performance by de-densifying the graph. For example: they eliminate the need for type nodes, which can often get quite dense. Labels can optionally be associated with a unique index. Here, the ability to index a property isn't new, but the ability to constrain it uniquely is. If you were previously doing work in your application, you may experience some performance gains by letting the database handle this. (It's certainly much more convenient to do so.) Finally, if you don't assign a unique index to a label, it will still be indexed, in order to help performance for certain kinds of queries (e.g. "give me all of the nodes having label ")

All that said, while labels may help with performance in certain cases, they were introduced more with ease-of-use in mind. We're just getting started with Neo4j 2.1, which specifically addresses dense node performance (something I know you've been waiting for), along with other performance & scalability improvements... including removing (for all practical purposes eliminating) the upper size limits.

Philip