py2neo: performance and return values of various commands

https://stackoverflow.com/questions/18926102

29-06-2022

Question

I'm using py2neo (1.5.1) and neo4j (1.9.2), and I'm wondering about the performance of different commands (with about 80k relationships in the graph):

So first I get all relationships (~80k), which obviously takes some time.

from py2neo import neo4j

graph_db = neo4j.GraphDatabaseService("http://localhost:7474/db/data/")
rels = graph_db.match()

However, why does it take a remarkable amount of time (~1-2 minutes) to loop over the relationships and print them (or store them in a variable)? rels is a list of relationships, but what does each relationship contain?

for relation in rels:
    print relation.start_node
    print relation.type
    print relation.end_node
    print relation.get_properties()

When I remove the line print relation.get_properties(), the loop runs much faster (~10 seconds). So I assume that each relation.get_properties() call executes a query against the db? Sounds reasonable.

However, the weird thing to me: why is the following code much faster, even though print relation contains all the information I need?

for relation in rels:
    print relation     #example output: (244358)-[:KNOWS {"since":2011,"reason":"unknown"}]->(244359)
    print relation.start_node
    print relation.type
    print relation.end_node

So it actually prints all the information I need, and it executes much faster, yet I'm not able to extract the properties of the relation and store them in a variable.

for relation in rels:
    print relation     #example output: (244358)-[:KNOWS {"since":2011,"reason":"unknown"}]->(244359)
    print relation.start_node
    print relation.type
    print relation.end_node
    #print relation["since"] #would slow down the execution significantly, why??

So what information is stored in the relationship? How can I extract all properties without using get_properties()? Does this have anything to do with the cache? I don't get it, this is driving me nuts... I'm already looking forward to your answer, Nigel ;-)

Note: I know that I could optimize it by using batches, but that is not really the question right now.

EDIT: Does print relation["since"] also result in a query for each iteration?

EDIT2: And while we are talking about performance, one more thing: comparing the following Cypher queries, I noticed that the first one is way slower than the second one. Why? (Executed on a cold graph, so no cache influence.)

query1: START n=node(*) RETURN n

query2: START n=node(*) RETURN n.name, n.age

Solution

A Relationship object stores the immutable pieces of that relationship, which are returned in the REST response. These are the start node, end node and type. The properties are mutable and a call to get_properties will indeed (as you suggest) make a separate call to the server.
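To make the cost concrete, here is a toy model of that behaviour. It is not py2neo code: FakeServer and FakeRelationship are hypothetical stand-ins that only mimic "start node, end node and type are held locally, properties cost one request per call". It counts simulated round trips instead of talking to a real server.

```python
# Toy model only: FakeServer and FakeRelationship are hypothetical stand-ins,
# not py2neo classes. They imitate one server request per get_properties() call.

class FakeServer(object):
    def __init__(self, properties_by_id):
        self.properties_by_id = properties_by_id
        self.requests = 0  # number of simulated round trips so far

    def fetch_properties(self, rel_id):
        self.requests += 1
        return dict(self.properties_by_id[rel_id])


class FakeRelationship(object):
    def __init__(self, server, rel_id, start_node, rel_type, end_node):
        self._server = server
        self._id = rel_id
        # Immutable pieces: kept on the object, reading them costs no request.
        self.start_node = start_node
        self.type = rel_type
        self.end_node = end_node

    def get_properties(self):
        # Mutable piece: every call goes back to the (fake) server.
        return self._server.fetch_properties(self._id)


def count_round_trips(n):
    """Return (requests after reading immutable parts, requests after reading properties)."""
    server = FakeServer(dict((i, {"since": 2011}) for i in range(n)))
    rels = [FakeRelationship(server, i, i, "KNOWS", i + 1) for i in range(n)]

    for rel in rels:
        _ = (rel.start_node, rel.type, rel.end_node)
    after_immutable = server.requests

    for rel in rels:
        rel.get_properties()
    return after_immutable, server.requests
```

In this model, reading properties for n relationships costs n requests. With the ~80k relationships and 1-2 minutes from the question, that would work out to roughly a millisecond per request, which is plausible for local HTTP calls, though this is only a back-of-the-envelope estimate.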

Side note: if you enable logging...

import logging
logging.basicConfig(level=logging.DEBUG)

...you should be able to see the traffic going to and fro.

You can however use a small trick to pick up a snapshot of the properties without a separate call. The __metadata__ attribute for any resource contains the last known details fetched from the server so the following should return the properties:

props = my_rel.__metadata__["data"]
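As a rough sketch of how that trick could replace get_properties() in the loop from the question: the helper names snapshot_properties and relationship_rows are made up for illustration, and the code assumes only what is described above, namely that each relationship exposes start_node, type, end_node and a __metadata__ mapping with a "data" key. Since __metadata__ looks like an internal attribute, it may not exist in other py2neo versions.

```python
def snapshot_properties(rel):
    """Return a copy of the last properties received for rel, without a new request.

    Assumes rel.__metadata__["data"] holds the properties, as described above.
    The result is a snapshot and may be stale if the server-side values changed.
    """
    return dict(rel.__metadata__["data"])


def relationship_rows(rels):
    """Collect (start_node, type, end_node, properties) for each relationship."""
    rows = []
    for rel in rels:
        rows.append((rel.start_node, rel.type, rel.end_node,
                     snapshot_properties(rel)))
    return rows
```

With the objects from the question, usage would look like rows = relationship_rows(rels). Copying the dictionary means later changes to the returned properties do not touch the cached metadata.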

With py2neo 1.6 (almost released!), the match method has evolved slightly. Instead of fetching everything before returning it to you, it returns a generator and can be iterated over. So code such as...

for rel in graph_db.match("KNOWS"):
    print rel.start_node["name"] + " knows " + rel.end_node["name"]

...will start producing results sooner, because it no longer waits for the full response to be received.
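The difference between the two styles can be illustrated without a database. The sketch below is not py2neo code: eager_match and lazy_match are hypothetical stand-ins for "fetch everything, then return" and "return a generator", and the helper counts how many records have to be read before the caller sees the first result.

```python
def eager_match(records):
    # Read the whole source before returning anything.
    return list(records)


def lazy_match(records):
    # Hand each record to the caller as soon as it has been read.
    for record in records:
        yield record


def records_read_before_first_result(match, n):
    """Count how many of n source records are read before the first result arrives."""
    read = [0]  # single-element list so the nested function can update the counter

    def source():
        for i in range(n):
            read[0] += 1
            yield i

    results = iter(match(source()))
    next(results)  # take only the first result
    return read[0]
```

In this sketch the eager version reads all n records before the first one is available, while the generator version reads just one. The total work for a full pass is the same, so the gain is mainly in how soon processing can begin.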

EDIT ANSWER: Yes, print relation["since"] also results in a call to the server on each iteration.

EDIT 2 ANSWER: That's a question for the Neo guys :-P

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow