Why would index nodes or an indexed property be better in a graph database?

https://stackoverflow.com/questions/12754619

05-07-2021
|

Question

I'm just getting into graph databases, and I seem to keep running into a problem deciding between using an "index node" or an "indexed property" for tracking things like "node type". Since I've no real experience thus far, I don't have any information to base the decision on and both approaches seem to be equally valid.

So, the question is: What are the tradeoffs between two approaches, and how does scale (ie. number of nodes) affect the decision?

For a sample scenario, lets assume there are two types of "things": User and Product, and the edges between the User nodes and the Product nodes don't matter so much, but what we care about is if we want type: User and type: Product properties on each node, or if we want each node to have an edge pointing back at a User node and a Product node, respectively.

Which approach is better under which circumstances?

_{Note: I'm looking at Neo4j and Titan in particular, but I would think that this will tend to apply more generally as well.}

Solution

First, you need to ask yourself: Does the type of a vertex/node need to be indexed? I.e. do you need to retrieve vertices/nodes by their type, let's say, retrieve all 'user' vertices from the graph or do you need to answer queries that start by retrieving all vertices of a given type and then filter/process those further?

If the answer to this question is yes, then I suggest you store the type as a string property that is indexed. Or, if you are developing in a jvm based language, you could define a type enum and use that as the property type for more type safety and automatic error checking. Titan supports arbitrary user defined classes/enums as property types and will compress those for a low memory footprint.

However, the downside of this approach is that this won't scale because you are building a low selectivity index. What that means is that there will likely be very many vertices of type 'user' or 'product' and all those need to be associated with the index entry for 'user' or 'product' respectively. This makes maintaining and querying this index very expensive and hard to scale (imagine facebook had a 'type' index: the 'photo' entry would have billions of vertices under it). If you are not (yet) concerned with scaling, then this can work.

If the answer to the question is no, then I suggest to model types as vertices/nodes in the graph. I.e. have a 'user' vertex and a 'product' vertex and an edge labeled 'type' from each user to the 'user' vertex, etc.

The advantage of this approach is that you use the graph to model your data rather than having string values outside of your database represent crucial type information. As you build your application, the graph database will become its central component and last for a long time. As programming languages and developers come and go, you don't want data modeling and type information to go with them and be faced with the question: "What does SPECIAL_USER mean?" Rather, have a SPECIAL_USER vertex and add provenance information to it, i.e., who created this type, what does it represent and a short description - all in the database.

One problem with this approach is that the 'user' and 'product' vertices will have a lot of edges incident on them as your application scales. In other words, you are creating supernodes which create scaling issues. This is why Titan introduced the concept of a unidirectional edge. A unidirectional edge is like a link on the web: the starting vertex points to another vertex, but that vertex is unaware of the edge. Since you don't want to traverse from the 'user' vertex to all user vertices, you aren't loosing anything but gaining in scalability and performance.

OTHER TIPS

What kind of query do you want to ask? In Neo4j, you would create a User and a Product index or even combine them in one, and then be able to ask things like

start bob = node:User(name='Bob') match ....

and even fulltext search. For easy checking if a node is a User or Product, you could have the property still on the nodes, just for convenient and fast traversal. If you are not traversing from User/Product to the instance nodes (you do the index lookups for that), you can even do the check by having a PRODUCT or USER relationship back to the type (super)nodes, giving you a check in-traversal like

start s = node:User(name='Bob') match s-[r]-(product)-[typeRel:PRODUCT]->() return product

HTH

A very important reason for indexing has been missed here imo. Suppose you have a complex graph with many different properties, and many different node types, and you want to match a pattern with a "person" who has a bunch of properties.

With no indexes, you have no option but to traverse the graph, a graph in which maybe only 0.01% of the nodes are of type person. And traversals may not reach unconnected regions of the graph.

Instead, if I have indexed person, I just iterate through every person, and search locally around each person to see if their pattern matches.

You should be able to see instantly that the first of these approaches scales with the total size of the graph, but the second only scales with the total number of people in the graph.

Moral: If you envisage a use case where there will be many search of the graph with a particular type of node as the bound node in your pattern (e.g. here lots of searches for "people with pattern X"), then you should index these nodes for improved search performance.

If you are going to search for things like "all nodes within two links of person Peter", then indexing person by their name would be critical, and would allow constant time performance regardless of graph size - as you are essentially looking up the location of Peter in a hash table.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow