What is the difference in these cypher queries?

https://stackoverflow.com/questions/23390741

neo4j
cypher

12-07-2023
Question

These queries are all logically equivalent and return the same 6 results (except the last, which returns only 5), but their performance differs widely, ranging from 31 ms to 45 seconds. I'm using Neo4j 2.0.2. I have an index ON :SEGMENT(propertyId), but the lookup of (n) is not why the query is slow.

match (n {productId:6122})<-[:PARENT_OF*]-(p) return n,p;
[...]
6 rows
43879 ms

match (n:SEGMENT {productId:6122})<-[:PARENT_OF*]-(p) return n,p;
[...]
6 rows
44926 ms

start n=node(111426) match (n)<-[:PARENT_OF*]-(p) return n,p;
[...]
6 rows
31 ms

match (n {productId:6122}) match path=(n)<-[:PARENT_OF*]-(p) return path;
[...]
6 rows
694 ms

match (n:SEGMENT {productId:6122}) match path=(n)<-[:PARENT_OF*]-(p) return path;
[...]
6 rows
161 ms

match (n:SEGMENT)<-[:PARENT_OF*]-(p:SEGMENT) where n.productId=6122 return n,p;
[...]
5 rows
45332 ms

Added PROFILE output:

PROFILE match (n:SEGMENT {productId:6122})<-[:PARENT_OF*]-(p:SEGMENT) return n,p;
`ColumnFilter(symKeys=["n", "p", "  UNNAMED34"], returnItemNames=["n", "p"], _rows=5, _db_hits=0)
Filter(pred="(hasLabel(n:SEGMENT(0)) AND Property(n,productId(9)) == Literal(6122))", _rows=5, _db_hits=1895169)
  TraversalMatcher(start={"label": "SEGMENT", "producer": "NodeByLabel", "identifiers": ["p"]}, trail="(p)-[:PARENT_OF*1..]->(n)", _rows=1895169, _db_hits=1895169)`

PROFILE match (n {productId:6122}) match path=(n)<-[:PARENT_OF*]-(p) return path; 
`ColumnFilter(symKeys=["n", "p", "  UNNAMED41", "path"], returnItemNames=["path"], _rows=6, _db_hits=0)
ExtractPath(name="path", patterns=["ParsedVarLengthRelation(  UNNAMED41,Map(),ParsedEntity(n,n,Map(),List()),ParsedEntity(p,p,Map(),List()),List(PARENT_OF),INCOMING,false,None,None,None)"], _rows=6, _db_hits=0)
  PatternMatcher(g="(n)-['  UNNAMED41']-(p)", _rows=6, _db_hits=0)
    Filter(pred="Property(n,productId(9)) == Literal(6122)", _rows=1, _db_hits=48531)
      AllNodes(identifier="n", _db_hits=48531, _rows=48531, identifiers=["n"], producer="AllNodes")`

Solution

The fastest query is the internal ID lookup, which is not surprising. The internal ID (which you should avoid using as an external identifier) is tightly coupled to the storage layout: it is roughly equivalent to telling Cypher where the node sits in the node store file. (*)

For the next two fastest queries, I might be totally mistaken, but I think they are faster because you only match a path, although I'm not sure exactly how this influences the query's behaviour. The small difference between the two can be explained by the fact that one query uses the schema index under the hood (161 ms), while the other does not (694 ms), because it does not specify the label.
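
As a minimal sketch of that two-step shape, the anchor lookup can be separated from the traversal with WITH, and the planner can be asked to use the schema index through a USING INDEX hint. This assumes an index exists on :SEGMENT(productId); the question mentions one on propertyId, and the hint fails if the named index does not exist. Whether this is faster on the asker's data is untested.

```cypher
// Step 1: find the anchor node through the schema index.
// Assumption: an index on :SEGMENT(productId) exists; drop the hint if it does not.
MATCH (n:SEGMENT)
USING INDEX n:SEGMENT(productId)
WHERE n.productId = 6122
// Step 2: carry only n forward, then expand from it.
WITH n
MATCH (n)<-[:PARENT_OF*]-(p)
RETURN n, p;
```

Prefixing the query with PROFILE, as in the question, shows whether the plan starts from n or scans every candidate p.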

For the three slowest queries, the time spent looking up the start node may be negligible compared with the cost of traversing the PARENT_OF relationships. You may end up traversing long paths, though I'm not sure.
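
If long paths are the cost, one way to test that guess is to put an upper bound on the variable-length pattern and compare timings. The bound of 10 below is a hypothetical value, not taken from the question; any node further away than the bound would drop out of the result.

```cypher
// Same pattern as one of the slow queries, but expanding at most 10 hops.
// Assumption: 10 is an illustrative limit; choose one that fits the real depth of the tree.
MATCH (n:SEGMENT {productId:6122})<-[:PARENT_OF*1..10]-(p)
RETURN n, p;
```

If the bounded query is just as slow, path length is probably not the bottleneck.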

(*) Still, I don't understand how looking up the start node by ID alone would explain such a difference from the two similar slowest queries (which also don't match by path).

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow