Aggregating data while traversing from root node to all the nodes and back in OrientDB graph database

StackOverflow https://stackoverflow.com/questions/21407521

Frage

I am using OrientDB graph database. I have to traverse a tree and collect data at every node and roll up. For example:

if A is root node and it has got A1 and A2 nodes connected by 'has a' relationship. A1 is connected to A11 and A12 with 'has a' relationship. Similarly A2 is connected to A21 and A22 with 'has a' relationship. The leaf nodes A11, A12, A21 and A22 has property called 'points'. I have to calculate the average points at each parent node based on the child nodes. If A11.points=20 and A12.points=10. Then the average points at A1 becomes 15. For A node I have to calculate average points based on the average points calculated at A1 and A2.

                        A
                      /   \
                    A1     A2
                   /  \    /  \
                 A11  A12 A21  A22

In short, I have to start from root node of tree and go on traversing all the nodes and traverse back collecting the data. Does anybody know how to do that using either OrientDB API or Gremlin?

Actually I tried to simplify the problem statement. The average is not simple average, it's weighted average. There is one more field at leaf nodes, let's say hours. The average will change as per the hours. If A11 has 90 points for 100 hours and A12 has 10 points for 10 hours. We need to consider hours also while calculating the average points.

War es hilfreich?

Lösung

This solution might get you closer to what you need, though I'm not sure it's an exact fit as it only let's you calculate the the average at a selected level of the hierarchy (i.e. at "A" or "A1"). Here's my Gremlin session:

gremlin> g = new TinkerGraph()
==>tinkergraph[vertices:0 edges:0]
gremlin> a = g.addVertex("a")  
==>v[a]
gremlin> a1 = g.addVertex("a1")
==>v[a1]
gremlin> a2 = g.addVertex("a2")
==>v[a2]
gremlin> a.addEdge('has',a1)    
==>e[0][a-has->a1]
gremlin> a.addEdge('has',a2)
==>e[1][a-has->a2]
gremlin> a1.addEdge('relationship',g.addVertex("a11",[points:20]))
==>e[2][a1-relationship->a11]
gremlin> a1.addEdge('relationship',g.addVertex("a12",[points:20]))
==>e[3][a1-relationship->a12]
gremlin> a2.addEdge('relationship',g.addVertex("a21",[points:100]))
==>e[4][a2-relationship->a21]
gremlin> a2.addEdge('relationship',g.addVertex("a22",[points:0]))  
==>e[5][a2-relationship->a22]
gremlin> p=g.v("a").out.loop(1){it.loops<10}{true}.path.filter{it.last().getProperty("points")!=null}.toList()               
==>[v[a], v[a2], v[a22]]
==>[v[a], v[a2], v[a21]]
==>[v[a], v[a1], v[a12]]
==>[v[a], v[a1], v[a11]]
gremlin> p.collect{[it, it.last().getProperty("points")]}._().groupBy{it[0][0]}{it[1]}{it.sum()/it.size()}.cap.next()
==>v[a]=35
gremlin> p.collect{[it, it.last().getProperty("points")]}._().groupBy{it[0][1]}{it[1]}{it.sum()/it.size()}.cap.next()
==>v[a1]=20
==>v[a2]=50

So, this line gets us the paths that matter (i.e. those that end with a leaf node that has points:

p=g.v("a").out.loop(1){it.loops<10}{true}.path.filter{it.last().getProperty("points")!=null}.toList()

I store those in p for later usage. Note that this will explore the tree to a depth of 10 as controlled by it.loops<10. From there it's pretty simple to use p to calculate the averages. Here's the example for calculating it for A:

p.collect{[it, it.last().getProperty("points")]}._().groupBy{it[0][0]}{it[1]}{it.sum()/it.size()}.cap.next()

The above basically says, for each path, convert that to a new List where the first item is the path and the second item is the points at the leaf node. Convert that list into a pipeline with the identity function and group on the first item in the path (identified by it[0][0]) and grab the points value (the second closure to groupBy) for that path. The third closure to groupBy is a reducer function which sums the points and calculates the average from that.

Another alternative, if you only need to calculate the average for a single vertex would be with this approach:

gremlin> g.v("a").out.loop(1){it.loops<10}{true}.path{it.points}.filter{it.last()!=null}
                 .transform{it.last()}.gather.transform{it.sum()/it.size()}
==>35

Note that the traversal is largely the same at the start, but uses a closure when grabbing the path. That closure converts the vertex to the value of the points property (note that using it.getProperty("points") is much more efficient than it.points). From there I again filter out paths that have a null value the last item in the path (i.e. as leaf nodes are the only ones with a points property this should leave us with paths that end with leaves). I then transform those paths to grab the points, gather them to a list and convert the list to an average points for "A".

Andere Tipps

You could start by looking at Traverse statement and store average points: - into parent vertices or - into the run-time context

By writing few lines of Java (or Javascript) would be easier using the Java Traverse.

Example to traverse all the nodes in "out" starting from node A assuming A is the name property (I suggest to index the "name" property for faster retrieval):

traverse out('has_a') from (
  select from Node where name = 'A'
)

With this query you create a in-memory map with key = record and value = the average of the underlying node points:

select avg( out('has_a').points ) ) as total_points from (
  traverse out('has_a') from (
    select from Node where name = 'A'
  )
)

This works only if all the nodes under "A" already have points.

Lizenziert unter: CC-BY-SA mit Zuschreibung
Nicht verbunden mit StackOverflow
scroll top