Question

Hi, I have been trying to implement the DBSCAN algorithm on Neo4j, but I am running into serious performance bottlenecks. I'll describe the implementation and then ask for help.

I discretized the possible epsilon values and put counts of the number of neighbors under each discretization in each node in order to be able to retrieve all of the core nodes in one query.
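For context, the per-node neighbour counting under each discretized epsilon could be sketched like this (the cutoff values, function name, and counting scheme here are illustrative assumptions, not the actual implementation):

```python
import bisect

# Hypothetical epsilon cutoffs, in ascending order. Each node stores,
# for every cutoff, how many neighbours lie within that distance.
CUTOFFS = [0.1, 0.25, 0.5, 1.0]

def neighbor_counts(distances, cutoffs=CUTOFFS):
    """For each cutoff, count the neighbours with distance <= cutoff."""
    counts = [0] * len(cutoffs)
    for d in distances:
        # index of the first cutoff that is >= d
        i = bisect.bisect_left(cutoffs, d)
        # d falls within this cutoff and every larger one
        for j in range(i, len(cutoffs)):
            counts[j] += 1
    return dict(zip(cutoffs, counts))
```

These per-cutoff counts are what make the single "find all core nodes" query possible: being a core node just means the stored count for the chosen cutoff is at least minPts.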

START a = node(*)
WHERE a.rel<cutoff threshold>! >= {minp}
RETURN a

This part is fast; the part that isn't fast is the follow-up query:

START a = node({i})
SET a.label<cutoff threshold>_<minpoints> = {clust}
WITH a
MATCH a -[:'|'.join(<valid distance relations>)]- (x)
WHERE not(has(x.label<cutoff threshold>_<minpoints>))
WITH x
SET x.label<cutoff threshold>_<minpoints>={clust}
RETURN x

I then pick a core node to start from, and as long as there are still core node neighbors, run the above query to label their neighbors.
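That expansion loop is essentially a breadth-first search over core nodes. A driver-side sketch, with `neighbors` and `is_core` standing in for the two Cypher queries above (names and signatures are illustrative):

```python
from collections import deque

def expand_cluster(start, clust, neighbors, is_core, labels):
    """DBSCAN cluster expansion: label the starting core node, then keep
    expanding through core neighbours until the frontier is exhausted.

    `neighbors(n)` and `is_core(n)` stand in for the graph queries;
    `labels` maps node -> cluster id and doubles as the visited set.
    """
    frontier = deque([start])
    labels[start] = clust
    while frontier:
        a = frontier.popleft()
        for x in neighbors(a):
            if x in labels:
                continue
            labels[x] = clust      # border or core point joins the cluster
            if is_core(x):         # only core points keep expanding
                frontier.append(x)
    return labels
```

Each `neighbors(a)` call corresponds to one round of the labelling query, which is why its per-call overhead dominates the total runtime.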

I think the problem is that my graph has very different levels of sparsity: starting from only weak similarity it is almost fully connected, with ~50M relations between ~10k nodes, whereas at strong similarity there are as few as ~20k relations between ~10k nodes (or fewer). No matter what, it is always REALLY slow. What is the best way for me to handle this? Is it to index on relationship type and starting node? I haven't been able to find any resources on this problem, and surprisingly there isn't already an implementation, since this is a pretty standard graph algorithm. I could use scikit-learn, but then I would be restricted to in-memory distance matrices only :(


Solution

What version of Neo4j did you try this with?

Up until 1.8, performance was not a design goal of Cypher (the language itself was). Have a look at a recent snapshot (1.9-SNAPSHOT).

Also make sure that your hot dataset is not being loaded from disk on every query (otherwise you are measuring disk I/O), so check that your memory-mapped settings and your JVM heap are large enough.

You might also want to check out the GCR cache from Neo4j enterprise which has a smaller memory footprint.

What is the cardinality of count(x) in your query? If it is too small, you have too many small transactions going on. Depending on whether you run Python embedded or via REST, use a larger transaction scope or REST batch operations.
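A minimal sketch of the batching idea, assuming you collect the node ids to relabel on the client side and then process each chunk in a single transaction or REST batch (the helper name and chunk size are illustrative):

```python
def batched(ids, size):
    """Split node ids into chunks so each chunk runs in one transaction
    (or one REST batch) instead of one tiny transaction per node."""
    for i in range(0, len(ids), size):
        yield ids[i:i + size]

# Usage sketch: each chunk would be passed as a parameter list to one
# labelling query, e.g. for chunk in batched(node_ids, 1000): ...
```

Fewer, larger transactions amortise the per-transaction commit overhead, which is often the dominant cost when updating one node at a time.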

You're already using parameters, which is great. What is the variability of your rel-types?

Any chance to share your dataset/generator and the code with us (Neo4j) for performance testing on our side?

OTHER TIPS

There are DBSCAN implementations around that use indexing. I don't know enough about Neo4j to tell whether your approach is efficient. The thing you might need to precompute is actually a sparse version of your graph, containing only the edges that are within the epsilon threshold.
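Precomputing that sparse graph is a simple filter; a sketch, assuming edges are available as (node, node, distance) triples (the representation is an assumption for illustration):

```python
def sparsify(edges, eps):
    """Keep only the edges within the epsilon threshold, so the
    clustering runs on the sparse graph rather than the near-complete
    weak-similarity one."""
    return [(a, b, d) for (a, b, d) in edges if d <= eps]
```

For the dataset described above, this is the difference between traversing ~50M relations and ~20k relations per expansion step.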

What I'd like to point out is that you apparently have different densities in your data set, so you might want to use OPTICS instead, a variant of DBSCAN that does away with the epsilon parameter (and also doesn't need to distinguish "core" nodes, since every node is a core node for some epsilon). Do not use the Weka version (or the Weka-inspired Python version that is floating around); they are half OPTICS and half DBSCAN.

When you have efficient sorted updatable heaps available, OPTICS can be pretty fast.
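Python's `heapq` has no decrease-key operation, but the sorted updatable heap OPTICS needs for its seed list can be sketched with the standard lazy-deletion trick (an illustrative sketch, not a tuned implementation):

```python
import heapq

class UpdatableHeap:
    """Min-heap with decrease-key via lazy deletion: stale entries stay
    in the heap and are skipped on pop, which keeps both push and pop
    at O(log n)."""

    def __init__(self):
        self._heap = []
        self._best = {}  # current best (lowest) priority per key

    def push(self, key, priority):
        # Only accept the entry if it improves on the known priority,
        # mirroring how OPTICS lowers a point's reachability distance.
        if priority < self._best.get(key, float("inf")):
            self._best[key] = priority
            heapq.heappush(self._heap, (priority, key))

    def pop(self):
        while self._heap:
            priority, key = heapq.heappop(self._heap)
            if self._best.get(key) == priority:  # skip stale entries
                del self._best[key]
                return key, priority
        raise IndexError("pop from empty heap")
```

In OPTICS, `push` would be called whenever a point's reachability distance improves, and `pop` yields the next point in the cluster ordering.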

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow