سؤال

I've been evaluating Neo4j 1.9.M03 for a while now, and have come to a head that I did not expect.

I have a graph of ~140,000 vertices. I also have three classes of edges, let's call them father, mother, and husband. There are about 80,000 edges per class. There are no properties, and no indexes. The vertex store size is about 1.3 MB, and the edge store is about 8 MB.

The data originates in SQL Server and the quality of migration from SQL to Neo4j is known to be correct. A SQL shortest path stored procedure has been run for dozens of vertex pairs, such that the shortest path distance and path is known.

The shortest path query is Cypher: START one=node(0), two=node(1234) MATCH p = shortestPath(one-[*..1000]-two) RETURN p;

PARTIAL TEST CASE ONE: I use only husband and father relations, the occurrence of cycles (e.g. v[0] -> v[1] -> v[2] -> v[0]) is low. If I perform a shortest path calculation on a specific known long path (e.g. known to be ~450 hops), it returns within 50ms (non-cached) with a path of ~550 hops. The increased length is expected, since we are excluding a portion of the edges.

PARTIAL TEST CASE TWO: Likewise, if I use only husband and mother relations, the occurrence of cycles (e.g. v[0] -> v[1] -> v[2] -> v[0]) is low. If I perform the same shortest path, I get a result on the same order as before: about 50ms (non-cached), with a similar increase in path length.

FULL TEST CASE: I use all (father, mother, and husband) relations. The occurrence of cycles is now predictably high because of the common case v[0] mother-> v[1] husband-> v[2] <-father v[0]. When I execute the shortest path query, the JVM allocates 4 gigabytes of memory and the calculation does not complete. This is the problem.


My thesis is that the regular occurrence of cycles is causing this behavior, otherwise I would not expect such a huge difference in performance when I only add another class of parent edge--unless the shortest path algorithm was not accounting for cycles.

I applied the Dijkstra Algorithm using the Java API directly, with a cost of 1 across all edges, and achieved similar results to the standard ShortestPath algorithm used. As a result, I received this exception after 6 minutes of IntelliJ debug time.

Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
    at org.neo4j.kernel.impl.util.RelIdArray$RelIdIteratorImpl.<init>(RelIdArray.java:661)
    at org.neo4j.kernel.impl.util.RelIdArray$DirectionWrapper$3.iterator(RelIdArray.java:327)
    at org.neo4j.kernel.impl.util.RelIdArray.iterator(RelIdArray.java:270)
    at org.neo4j.kernel.impl.core.NodeImpl.getAllRelationships(NodeImpl.java:172)
    at org.neo4j.kernel.impl.core.NodeImpl.getRelationships(NodeImpl.java:270)
    at org.neo4j.kernel.impl.core.NodeProxy.getRelationships(NodeProxy.java:82)
    at org.neo4j.kernel.StandardExpander$AllExpander.doExpand(StandardExpander.java:303)
    at org.neo4j.kernel.StandardExpander$RelationshipExpansion.iterator(StandardExpander.java:194)
    at org.neo4j.kernel.impl.traversal.TraversalBranchImpl.expandRelationshipsWithoutChecks(TraversalBranchImpl.java:114)
    at org.neo4j.kernel.impl.traversal.TraversalBranchImpl.expandRelationships(TraversalBranchImpl.java:104)
    at org.neo4j.kernel.impl.traversal.TraversalBranchImpl.initialize(TraversalBranchImpl.java:130)
    at org.neo4j.kernel.impl.traversal.TraversalBranchImpl.next(TraversalBranchImpl.java:150)
    at org.neo4j.graphalgo.impl.util.BestFirstSelectorFactory$BestFirstSelector.next(BestFirstSelectorFactory.java:73)
    at org.neo4j.kernel.impl.traversal.TraverserIterator.fetchNextOrNull(TraverserIterator.java:65)
    at org.neo4j.kernel.impl.traversal.TraverserIterator.fetchNextOrNull(TraverserIterator.java:34)
    at org.neo4j.helpers.collection.PrefetchingIterator.hasNext(PrefetchingIterator.java:55)
    at org.neo4j.graphalgo.impl.util.StopAfterWeightIterator.fetchNextOrNull(StopAfterWeightIterator.java:45)
    at org.neo4j.graphalgo.impl.util.StopAfterWeightIterator.fetchNextOrNull(StopAfterWeightIterator.java:29)
    at org.neo4j.helpers.collection.PrefetchingIterator.hasNext(PrefetchingIterator.java:55)
    at org.neo4j.helpers.collection.IteratorUtil.firstOrNull(IteratorUtil.java:51)
    at org.neo4j.helpers.collection.IteratorUtil.firstOrNull(IteratorUtil.java:201)
    at org.neo4j.graphalgo.impl.path.Dijkstra.findSinglePath(Dijkstra.java:98)
    at org.neo4j.graphalgo.impl.path.Dijkstra.findSinglePath(Dijkstra.java:50)
    at ShortestPathCalc.Dijkstra(Main.java:198)
    at Main.main(Main.java:53)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:601)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:120)

Do you think I'm right? Is this a known limitation of graph databases or their shortest path algorithms? It seems quite silly to me that previously-visited vertexes would not be stored in a hash table, such that the shortest path algorithm would not attempt to path out of a previously visited vertex more than once.


UPDATE 1/25/2013

A Github repo so you can follow along!

https://github.com/squirrelsama/neo4j-shortestpath-issue

UPDATE 2/7/2013

See the accepted answer. In short, cycles have nothing to do with it.

هل كانت مفيدة؟

المحلول 2

If one attempts to get the shortest path between nodes 44715 and 17173, whose shortest path is known to be 112 hops, one can observe the issue.

If we limit the shortest path evaluation to 111 hops, the query completes very quickly, but with no path. START one=node(44715), two=node(17173) MATCH p = shortestPath(one-[*..111]-two) RETURN p;

However, if we then limit the shortest path evaluation to 112 hops, we observe the failure of the query to complete and the JVM rapidly allocating memory up to 4 gigabytes. START one=node(44715), two=node(17173) MATCH p = shortestPath(one-[*..112]-two) RETURN p;

Neo has confirmed that this is an edge case bug relating to the assembly of the Path object to be returned. It's in their bug backlog.

In other words, cycles have nothing to do with the issue.

نصائح أخرى

Using the neo4j traversal framework you can select which uniqueness to use in a traversal, for example RELATIONSHIP_GLOBAL, such that it only traverses a relationship once during a traversal. That may that would resolve your problems:

// single directional
Traversal.traversal( Uniqueness.RELATIONSHIP_GLOBAL )
         .evaluator( Evaluators.returnWhereEndNodeIs( myEndNode )
         .traverse( myStartNode );

// bi-directional
Traversal.bidirectionalTraversal()
         .mirroredSides( Traversal.traversal( Uniqueness.RELATIONSHIP_GLOBAL ) )
         .traverse( myStartNode, myEndNode );

the above examples are in principal and may need to be modified to work with your query though.

مرخصة بموجب: CC-BY-SA مع الإسناد
لا تنتمي إلى StackOverflow
scroll top