Question

The following example graph is given:

CREATE (app1:Application {Id: 1, Name: 'A big application'}),
    (db1:Database {Name: 'db1'}),
    (db2:Database {Name: 'db2'}),
    (di1:DatabaseInstance {Name: 'db1-i1'}),
    (di2:DatabaseInstance {Name: 'db1-i2'}),
    (di3:DatabaseInstance {Name: 'db2-i1'}),
    (di4:DatabaseInstance {Name: 'db2-i2'}),
    (s1:Server {Name: 'Server 1'}),
    (s2:Server {Name: 'Server 2'}),
    (s1)-[:ClusteredWith]->(s2),
    (s2)-[:ClusteredWith]->(s1),
    (di1)-[:InstalledOn]->(s1),
    (di3)-[:InstalledOn]->(s1),
    (di2)-[:InstalledOn]->(s2),
    (di4)-[:InstalledOn]->(s2),
    (di1)-[:Instantiate]->(db1),
    (di3)-[:Instantiate]->(db2),
    (di2)-[:Instantiate]->(db1),
    (di4)-[:Instantiate]->(db2),
    (app1)-[:Utilize]->(db1),
    (app1)-[:Utilize]->(db2)

This is an IT inventory in which I describe the relations between deployed applications in a large environment. In reality, this application has ~500 databases and ~1000 servers.

My goal is to get all related assets of an application, database or server.

Using Cypher 2, my idea for a query that gets all related assets of an application is:

MATCH (a:Application)
WHERE a.Id = 1
OPTIONAL MATCH (a)-->(d:Database)
OPTIONAL MATCH (a)-[*1..]-(s:Server)
RETURN
    a AS Application,
    collect(d) AS Databases,
    collect(s) AS Servers

With Neo4jClient:

var result = client.Cypher
    .Match("(a:Application)")
    .Where((Application a) => a.Id == request.Id)
    .OptionalMatch("(a)-->(d:Database)")
    .OptionalMatch("(a)-[*1..]-(s:Server)")
    .Return((a, d, s) => new {
        Application = a.As<Anwendung>(),
        Databases = d.CollectAs<Database>(),
        Servers = s.CollectAs<Server>()
    });

I'm using OPTIONAL MATCH because the inventory graph is not always in a consistent state.

But this query runs too long and uses 99% CPU. Running it with only one of the two optional matches (databases or servers) works, but it is still slow: 400 to 900 ms.

Is my data model bad, or am I using Cypher the wrong way? How can I speed things up?


Solution

When I run that query on just the sample data you provided I get 128 matches. You can see this by changing your return clause to

RETURN
     a AS Application,
     count(d) AS Databases,
     count(s) AS Servers

Try it on your full data set as well; I bet you a lollipop you get an uncomfortable number of matches there. If you look at the data that is returned, you can also see that there are tons of duplicates. Even though only two :Database nodes are matched in your first optional pattern, the collection that you return as "Databases" contains 128 items. Try making the resolution of the database nodes explicit before matching the servers. You can do this by introducing this between the two optional patterns:

WITH a, collect(d) as Databases

If you run this query and return the counts you'll see that you've cut the result in half. Your original query considers the second optional pattern twice, once for each result in the first pattern. When you explicitly resolve the first pattern with WITH ..., the second pattern is only considered once. This is probably the reason why the queries are less absurdly slow when you run them separately.
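As a rough sketch of that check, the count version of the query with the WITH step in place could look like the following. The aliases DatabaseCount and ServerMatches are just names I picked for illustration, and size() assumes a reasonably recent Cypher version (older 2.x releases used length() on collections).

```cypher
MATCH (a:Application)
WHERE a.Id = 1
OPTIONAL MATCH (a)-->(d:Database)
// Aggregate here so the next pattern starts from one row per application
WITH a, collect(d) AS Databases
OPTIONAL MATCH (a)-[*1..]-(s:Server)
RETURN
    a AS Application,
    size(Databases) AS DatabaseCount,   // assumption: size() is available; try length() on old versions
    count(s) AS ServerMatches
```

On the original sample data this should report 2 databases and 64 server matches, which is the halving described above.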

The second optional match still matches 64 patterns in the tiny data sample, however, and that's probably the ultimate culprit causing your query to be slow. Here the problem is that your pattern is too vague. The pattern matcher is allowed to travel any kind of relationship in the graph in any direction and to any depth, as long as the last node is a :Server. There are a couple of ways that this blows up. Consider adding a second application. This application utilizes a different database, but one of the servers that hosts instances of that database also hosts instances of a database used by the first application, from your sample. The query to extend the graph could look like

MATCH (s2:Server {Name: 'Server 2'})
CREATE (app2:Application {Id:2, Name: 'Another application'}),
    (db3:Database {Name: 'db3'}),
    (di5:DatabaseInstance {Name: 'db3-i1'}),
    (di6:DatabaseInstance {Name: 'db3-i2'}),
    (s3:Server {Name: 'Server 3'}),
    (di5)-[:Instantiate]->(db3),
    (di5)-[:InstalledOn]->(s3),
    (di6)-[:Instantiate]->(db3),
    (di6)-[:InstalledOn]->(s2),
    (app2)-[:Utilize]->(db3)
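To see where the matches of the open-ended pattern come from, one option is to group the matched paths by end server and path length. This is only a sketch for exploring the sample; the aliases Server, Hops and Paths are my own, and on a large graph you would want to add an upper bound to the variable-length part before running it.

```cypher
// Break down the matches of the unconstrained pattern by end node and length
MATCH p = (a:Application {Id: 1})-[*1..]-(s:Server)
RETURN
    s.Name AS Server,
    length(p) AS Hops,     // number of relationships in the path
    count(*) AS Paths
ORDER BY Hops, Server
```

Rows with a large Hops value are paths that wander through the cluster relationships or back through other database instances before ending on a server.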

Your second optional match is given free rein to traverse the graph, so while it does find the correct servers it can also continue to travel 'backwards' via other database instances installed on that same server. It will follow paths like

(s2:Server {Name: 'Server 2'})<-[:InstalledOn]-(di6:DatabaseInstance {Name: 'db3-i2'})

and on this path a different database can be reached, one which the application doesn't use. Once that database is reached, any server related to it will be matched (in fact, any server reachable through any application using that database, or simply any server connected in any way), for instance 'Server 3':

(di6)-[:Instantiate]->(db3:Database {Name: 'db3'})<-[:Instantiate]-(di5:DatabaseInstance {Name: 'db3-i1'})-[:InstalledOn]->(s3:Server {Name: 'Server 3'})

If you run the count query after adding the application as above, you'll get 192 results: still only two paths to databases, but now 96 paths to servers, and 2x96=192. What you should do is make the second pattern more specific, so that it excludes the paths you don't want to follow. Since I can only go by what's in your sample, I can't tell with confidence what the more specific pattern should look like, but you could start with something like

OPTIONAL MATCH (a)-[:Utilize]->()<-[:Instantiate]-()-[:InstalledOn]->(server)
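A quick way to compare this against the open-ended pattern is to count its matches on their own. This is a sketch based only on the sample graph; ServerPaths and Servers are aliases I made up.

```cypher
MATCH (a:Application {Id: 1})
OPTIONAL MATCH (a)-[:Utilize]->()<-[:Instantiate]-()-[:InstalledOn]->(server)
RETURN
    count(server) AS ServerPaths,        // one per application-database-instance-server chain
    count(DISTINCT server) AS Servers    // each server counted once
```

On the sample this should come out as 4 paths and 2 distinct servers, with or without the second application, since the fixed-length typed pattern can no longer wander off through 'Server 2' to db3.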

I've excluded the labels from this pattern, and this brings me to my final observation, that...


Aside:
...labels are very useful, but they are only labels. Putting on a philosopher's hat for a moment, types speak to what something is, labels speak to what someone thinks something is. This happens to be reflected in such details as how the database is structured in persistent memory. Relationships are typed, nodes aren't; the type of a relationship is stored together with it, labels and properties are not stored with the node that has them. At least this was true of properties, I'm not positive about labels. I understand the main purpose of labels to be acquiring starting points for pattern matching, and that means that, primarily, labels have references to their nodes. This is borne out by the schema dynamics, where labels provide indexes and constraints. You could think of labels as something akin to mathematical sets and relationship types as something more like natural kinds (or ideally, Aristotelian categoriae). Anyway, shoot me if I'm wrong, I say it is both theoretically more sound and computronically less expensive to ask for the type of relationship than for the labels or properties of a node. Regardless...


...for performance considerations, when resolving a traversal step the type of a relationship is relevant before the label of the remote node, and often relationships of the same type will lead to nodes with the same label (this is the case in your sample). Indeed, most graph modelling relies on relationships for meaning, and labels are used to provide convenient starting points for traversals or for 'object node mappings'. My final suggestion, therefore, is generally to emphasize relationship types over labels when declaring patterns, and particularly to specify your second optional pattern something like the above and, for the same reasons, your first optional pattern something like

OPTIONAL MATCH (a)-[:Utilize]->(d)

If an application utilizes other things than databases in your model, then keep the :Database label for disambiguation; the same for the other patterns. My point is this: don't substitute labels for relationship types when declaring patterns in Cypher.

Based on the sample you provided, you could probably improve the performance of your query by changing it to something like

MATCH (a:Application {Id:1}) // From 2.0RC1 and on you can match with properties
OPTIONAL MATCH (a)-[:Utilize]->(d:Database)
WITH a, collect(d) as Databases
OPTIONAL MATCH (a)-[:Utilize]->()<-[:Instantiate]-()-[:InstalledOn]->(server)
RETURN
    a AS Application,
    Databases,
    collect(server) AS Servers
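One further tweak you may want, going by the sample: a server that hosts instances of more than one utilized database is reached once per instance, so a plain collect can list it several times. A hedged variant that collapses those repeats with DISTINCT might look like this.

```cypher
MATCH (a:Application {Id: 1})
OPTIONAL MATCH (a)-[:Utilize]->(d:Database)
WITH a, collect(d) AS Databases
OPTIONAL MATCH (a)-[:Utilize]->()<-[:Instantiate]-()-[:InstalledOn]->(server)
RETURN
    a AS Application,
    Databases,
    collect(DISTINCT server) AS Servers  // each server once, however many instances lead to it
```

In the sample, both 'Server 1' and 'Server 2' host an instance of db1 and of db2, so without DISTINCT each would appear twice in Servers.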
Licensed under: CC-BY-SA with attribution