how to improve Neo4J performance in creating edges?

https://stackoverflow.com//questions/25038980

21-12-2019

Question

i'm building a traffic schedule application using Neo4J, NodeJS and GTFS-data; currently, i'm trying to get things working for the traffic on a single day on the Berlin subway network. these are the grand totals i've collected so far:

10 routes
211 stops
4096 trips
83322 stoptimes

to put it simply, GTFS (General Transit Feed Specification) has the concept of a stoptime which denotes the event of a given train or bus stopping for passengers to board and alight. stoptimes happen on a trip, which is a series of stoptimes, they happen on a specific date and time, and they happen on a given stop for a given route (or 'line') in a transit network. so there's a lot of references here.

the problem i'm running into is the amount of data and the time it takes to build the database. in order to speed up things, i've already (1) cut down the data to a single day, (2) deleted the database files and have the server create a fresh one (very effective!), (3) searched a lot to get better queries. alas, with the figures as given above, it still takes 30~50 minutes to get all the edges of the graph.

these are the indexes i'm building:

CREATE CONSTRAINT ON (n:trip)     ASSERT n.id IS UNIQUE;
CREATE CONSTRAINT ON (n:stop)     ASSERT n.id IS UNIQUE;
CREATE CONSTRAINT ON (n:route)    ASSERT n.id IS UNIQUE;
CREATE CONSTRAINT ON (n:stoptime) ASSERT n.id IS UNIQUE;
CREATE INDEX ON :trip(`route-id`);
CREATE INDEX ON :stop(`name`);
CREATE INDEX ON :stoptime(`trip-id`);
CREATE INDEX ON :stoptime(`stop-id`);
CREATE INDEX ON :route(`name`);

i'd guess the unique primary keys should be most important.

and here are the queries that take up like 80% of the running time (with 10% that are unrelated to Neo4J, and 10% needed to feed the node data using plain HTTP post requests):

MATCH (trip:`trip`), (route:`route`)
WHERE trip.`route-id` = route.id
CREATE UNIQUE (trip)-[:`trip/route` {`~label`: 'trip/route'}]-(route);

MATCH (stoptime:`stoptime`), (trip:`trip`)
WHERE stoptime.`trip-id` = trip.id
CREATE UNIQUE (trip)-[:`trip/stoptime` {`~label`: 'trip/stoptime'}]-(stoptime);

MATCH (stoptime:`stoptime`), (stop:`stop`)
WHERE stoptime.`stop-id` = stop.id
CREATE UNIQUE (stop)-[:`stop/stoptime` {`~label`: 'stop/stoptime'}]-(stoptime);

MATCH (a:stoptime), (b:stoptime)
WHERE a.`trip-id` = b.`trip-id`
AND ( a.idx + 1 = b.idx OR a.idx - 1 = b.idx )
CREATE UNIQUE (a)-[:linked]-(b);

MATCH (stop1:stop)-->(a:stoptime)-[:next]->(b:stoptime)-->(stop2:stop)
CREATE UNIQUE (stop1)-[:distance {`~label`: 'distance', value: 0}]-(stop2);

the first query is still in the range of some minutes, which i find longish given that there are only thousands (not hundreds of thousands or millions) of trips in the database. the subsequent queries that involve stoptimes take tens of minutes each on my desktop machine.

(i've also calculated whether the schedule really contains 83322 stoptimes each day, and yes, it's plausible: in Berlin, subway trains run on 10 lines for 20 hours a day with 6 or 12 trips per hour, and there are 173 subway stations: 10 lines x 2 directions x 17.3 stops per line x 20 hours x 9 trips per hour gives 62280, close enough. there are some faulty? / double / extra stop nodes in the data (211 stops instead of 173), but those are few.)

frankly, if i don't find a way to speed up things at least tenfold (rather more), it'll make little sense to use Neo4J for this project. just in order to cover the single city of Berlin many, many more stoptimes have to be added, as the subway is just a tiny fraction of the overall public transport here (e.g. bus and tramway have like 170 routes with 7,000 stops, so expect around 7,000,000 stoptimes each day).

Update: the above edge creation queries, which i perform one by one, have now been running for over an hour and have not yet finished, meaning that, if things scale in a linear fashion, feeding the Berlin public transport data for a single day would take something like a week. therefore, the code currently runs several orders of magnitude too slowly to be viable.

Update: @MichaelHunger's solution did work; see my response below.

Solution

I just imported 12M nodes and 12M rels into Neo4j in 10 minutes using LOAD CSV.
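GTFS feeds are distributed as CSV text files, so a LOAD CSV import of the stoptimes might look roughly like the sketch below. The file path is a placeholder, the column names trip_id, stop_id and stop_sequence are those of the GTFS stop_times.txt file, and filling idx from stop_sequence is an assumption about how the question's data was built. USING PERIODIC COMMIT and toInt are the Neo4j 2.x spellings; later versions rename toInt to toInteger and batch imports differently.

```cypher
// Sketch only: path and property mapping are assumptions, not the asker's actual import.
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///path/to/stop_times.txt' AS row
CREATE (:`stoptime` {
  `trip-id`: row.trip_id,          // LOAD CSV yields strings
  `stop-id`: row.stop_id,
  idx: toInt(row.stop_sequence)    // assumed to correspond to the question's idx
});
```

GTFS stop_times.txt has no single-column id, so the unique stoptime id used in the question would have to be derived separately.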

You should see your issues when you profile your queries in the shell. Prefix your query with PROFILE and check whether the profile output mentions an index lookup or just a label scan.
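As a sketch of that check, the first query from the question can be profiled with a read-only RETURN in place of the CREATE UNIQUE clause, so that nothing is written while the plan is inspected. The exact operator names in the output depend on the Neo4j version.

```cypher
// Profile the matching part only; count(*) stands in for the edge creation.
PROFILE
MATCH (trip:`trip`), (route:`route`)
WHERE trip.`route-id` = route.id
RETURN count(*);
```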

Do you use parameters for your insert queries? So that Neo4j can re-use built queries?
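A parameterized node insert might look like the sketch below. The property set is illustrative, and the values for id and name would be sent alongside the query text as a parameter map rather than spliced into the string. The {name} placeholder form is the Neo4j 2.x syntax; Neo4j 3.0 and later write placeholders as $id and $name.

```cypher
// The query text stays identical for every stop, so the built query can be re-used;
// only the parameter map (for example {"id": ..., "name": ...}) changes per request.
CREATE (s:`stop` {id: {id}, name: {name}});
```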

For queries like this:

MATCH (trip:`trip`), (route:`route`)
WHERE trip.`route-id` = route.id
CREATE UNIQUE (trip)-[:`trip/route` {`~label`: 'trip/route'}]-(route);

It will very probably not use your index. Can you perhaps point to your datasource? We can convert it into CSV if it isn't and then import even more quickly. Perhaps we can create a graph gist for your model?

I would rather use:

MATCH (route:`route`)
MATCH (trip:`trip` {`route-id`: route.id})
CREATE (trip)-[:`trip/route` {`~label`: 'trip/route'}]->(route);

For your initial import you also don't need create unique as you match every trip only once. And I'm not sure what your "~label" is good for?

Similar for your other queries.

As the data is public it would be cool to work together on this.

Something I'd love to hear more about is how you plan to express your query use-cases.

I had a really great discussion about timetables for public transport with training attendees last time in Leipzig. You can also email me on michael at neo4j.org

Also perhaps you want to check out these links:

Tramchester

London Tube Graph

Other tips

detailed solution

i'm happy to report that @MichaelHunger's solution works like a charm. i modified the edge-building queries from the question with the below shapes that keep to the suggested query outline:

MATCH (route:`route`)
MATCH (trip:`trip` {`route-id`: route.id})
CREATE (trip)-[:`trip/route` {`~label`: 'trip/route'}]->(route)

MATCH (trip:`trip`)
MATCH (stoptime:`stoptime` {`trip-id`: trip.id})
CREATE (trip)-[:`trip/stoptime` {`~label`: 'trip/stoptime'}]->(stoptime)

MATCH (stop:`stop`)
MATCH (stoptime:`stoptime` {`stop-id`: stop.id})
CREATE (stop)-[:`stop/stoptime` {`~label`: 'stop/stoptime'}]->(stoptime)

MATCH (a:stoptime)
MATCH (b:stoptime {`trip-id`: a.`trip-id`, `idx`: a.idx + 1})
CREATE (a)-[:linked {`~label`: 'linked'}]->(b)

MATCH (stop1:stop)--(a:stoptime)-[:linked]-(b:stoptime)--(stop2:stop)
CREATE (stop1)-[:distance {`~label`: 'distance', value: 0}]->(stop2)

as can be seen, the trick here is to give each participating node a MATCH statement of its own and to move the WHERE clause inside the second match condition; presumably, as mentioned above, Neo4J can only then take advantage of its indexes.

with these queries in place, the process of reading in nodes and building edges takes roughly 13 minutes; of these 13 minutes, fetching the data from an external source, building the node representations and issuing CREATE queries takes about 10 minutes, and building almost a half million edges between them is done in about 3 minutes.

right now none of my queries (especially the node CREATE statements and updates for stop distances) use parametrized queries, which is another potential source for performance gains.

as for the ~label field and also the question why i use dashes in names where underscores would be more convenient, well, that's a long story about what i perceive as good and practical naming that sometimes clashes with the syntax of some languages (of most languages, should i say). but that's boring detail. maybe more interesting is the question: why is there a ~label attribute that repeats what the element label says (what you write after the colon)? well, it's an attempt to comply with Neo4J conventions (we use labels here), take advantage of the 'identifier, colon, label' syntax of cypher queries, AND to make it so the labels do appear in the returned values.

mind you, labels are so central to graph thinking the Neo4J way, but in query results, labels are conspicuously absent. when you include a relationship that is marked with nothing but a label in your result set, then that edge will arrive as an empty object, telling you only that there is something but not what. so i decided to duplicate the label on each single node and each single edge. not an optimal solution, but at least now i get an informative graph display in the Neo4J browser.
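One possible alternative to duplicating the label as a property, offered here only as a sketch, is to ask Cypher for it explicitly: labels() returns the labels of a node and type() returns the type of a relationship. The column aliases and the LIMIT value are arbitrary.

```cypher
// Return label and type information alongside the matched pattern.
MATCH (a)-[r]->(b)
RETURN labels(a) AS fromLabels, type(r) AS relType, labels(b) AS toLabels
LIMIT 5;
```

This makes the information available in query results without extra properties, although it does not by itself change what the Neo4J browser displays.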

as for how to express query use-cases, that's an active field of research for me right now. i guess it will all start with a 'field of interest', like 'show all Berlin subway stops', or 'all buses departing within the next 15 minutes from a bus stop near me'. the data already allows one to see which stops are directly connected by a subway line, their geographical distance, what services are present and what routes they take. the idea is to grab the data and present them in novel, usable and beautiful ways. 9292 is quite close to what i imagine; what's missing are graphical representations of spatial and temporal relationships.

License: CC-BY-SA with attribution

Not affiliated with StackOverflow