Question

I'm quite new to graph databases and I'm trying to decide if Neo4j is the right tool to use for data mining on network graphs or if there is something more suitable out there.

I'm planning on using a graph database to perform analyses on some large graphs (millions of nodes, tens to hundreds of millions of edges), but I'll be looking to apply algorithms and calculate metrics for everybody in the graph. For example:

  • for each person, how many people in their extended network have a certain attribute
  • how many steps each person is from someone with a certain attribute
  • performing community detection
  • running PageRank

From looking into it a bit, it seems like Neo4j is well suited to running queries starting from a certain node, but is it also suited to applying a calculation over everybody in the network? I've come across the term 'graph compute engine' as a distinction between the two, but can't seem to find much on it.

Are there any other tools that would be useful at this scale? (Gephi and similar won't handle the volume of data I need to use.)

No correct solution

OTHER TIPS

Since you need a graph analytics engine rather than just a graph database, you might be interested in Faunus. This is their description:

Faunus is a Hadoop-based graph analytics engine for analyzing graphs represented across a multi-machine compute cluster.

I know of it because I keep an eye on their graph database, Titan, which integrates nicely with TinkerPop, but I have not used Faunus itself.

So by using Faunus you can also have a graph backend which IMO goes hand in hand with what you want to do.

Another really good graph analytics engine is GraphLab (and its single-machine version, GraphChi). Very impressive performance; see http://graphlab.com/

Mirroring other comments (and to keep this from becoming a product thread, which will get it locked on SO): Neo4j is a graph database, very useful for queries, exploring and so on. GraphLab and the other examples given are more for whole-graph analytics, things like PageRank, graph triangle counts, etc.
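To make "whole-graph analytics" a bit more concrete, here is a minimal single-machine sketch in plain Python of the two metrics mentioned above. It is not how GraphLab or any other engine implements them; the function names, the edge-list input format and the default parameters (`damping`, `iterations`) are assumptions made for illustration. The point is only that every node and edge gets visited, repeatedly in the case of PageRank.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a directed edge list [(src, dst), ...]."""
    nodes = {node for edge in edges for node in edge}
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without out-links is spread evenly over all nodes.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for src, targets in out_links.items():
            if not targets:
                continue
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank


def triangle_count(edges):
    """Count triangles, treating every edge as undirected.

    Node ids must be mutually comparable (all ints or all strings).
    """
    neighbours = {}
    for a, b in edges:
        if a == b:
            continue  # ignore self-loops
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    total = 0
    for u, adjacent in neighbours.items():
        for v in adjacent:
            if v <= u:
                continue
            # Count each triangle once, as the ordered triple u < v < w.
            total += sum(1 for w in adjacent & neighbours[v] if w > v)
    return total
```

On a few thousand edges this runs fine as is; at the scale in the question the same loops are what a graph engine distributes or streams from disk.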

It doesn't look like Neo4j is what you are looking for here. In my opinion you really need a graph engine rather than a graph database.

  • With a graph database you can run queries, and they will be very fast when dealing with highly connected data. For instance, Neo4j should be lightning fast at picking a node, finding its friends, and then finding the friends of friends of the starting node in a social graph. In this scenario the graph database outperforms relational (SQL) models when there is a high number of nodes. Note that the efficiency comes precisely from the fact that the database doesn't have to look over the whole graph to answer your query.

  • With a graph engine you can perform computations on the whole graph, as you describe it.
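The difference between the two bullets can be sketched in a few lines of plain Python. This is only an illustration under assumed names: `adjacency` is a hypothetical dict mapping each node to the set of its neighbours, and neither function reflects how Neo4j or any graph engine is implemented. The first function is a local query that touches one neighbourhood; the second answers "how many steps is each person from someone with a certain attribute" and has to walk the whole graph.

```python
from collections import deque


def friends_of_friends(adjacency, start):
    """Local query: only the two-hop neighbourhood of `start` is read."""
    friends = adjacency.get(start, set())
    result = set()
    for friend in friends:
        result |= adjacency.get(friend, set())
    return result - friends - {start}


def steps_to_nearest(adjacency, flagged):
    """Whole-graph computation: multi-source BFS from every flagged node.

    Returns the hop distance from each reachable node to its nearest
    flagged node. Unreachable nodes are absent from the result.
    """
    distance = {node: 0 for node in flagged}
    queue = deque(flagged)
    while queue:
        node = queue.popleft()
        for neighbour in adjacency.get(node, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return distance
```

For example, on a chain 1-2-3-4-5 with nodes 1 and 5 flagged, `steps_to_nearest` gives distance 2 for node 3 and distance 1 for nodes 2 and 4.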

If you want to scale and analyse a high number of nodes, I'd suggest you take a look at the MapReduce approach; see Hadoop (and perhaps Mahout).
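As a rough idea of what the MapReduce style looks like for a graph metric, here is a plain-Python imitation, not actual Hadoop code. It counts, for each person, how many direct neighbours have a certain attribute. The function names and the input format (each undirected edge listed once, plus a set `flagged` of people who have the attribute) are assumptions for the example. On a cluster the map and reduce steps would run as separate distributed phases, and covering an extended network rather than direct neighbours would take further rounds.

```python
from collections import defaultdict


def map_edge(edge, flagged):
    """Map step: emit (person, 1) for each endpoint whose neighbour is flagged."""
    a, b = edge
    if b in flagged:
        yield a, 1
    if a in flagged:
        yield b, 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted values per person."""
    totals = defaultdict(int)
    for person, value in pairs:
        totals[person] += value
    return dict(totals)


def flagged_neighbour_counts(edges, flagged):
    """Run the map step over every edge, then reduce the emitted pairs."""
    pairs = (pair for edge in edges for pair in map_edge(edge, flagged))
    return reduce_counts(pairs)
```

People with no flagged neighbour simply do not appear in the result.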

Hope this helps!

I realise this is late, but I'm adding it for the benefit of future Googlers.

You might also want to try the GraphX project built on Spark. It's alpha as of now but looks good for large-scale graph analytics.

https://spark.apache.org/graphx/

If you want a pure Neo4j solution, you should check this project.

Implemented algorithms:

  1. PageRank
  2. Triangle Count
  3. Label Propagation for Community Detection
  4. Modularity (for Community Detection)
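For anyone wondering what label propagation actually does, here is a minimal sketch in plain Python. It is not the code of the project mentioned above. The function name, the `adjacency` dict (every node must appear as a key, mapped to the set of its neighbours) and the deterministic tie-break are all assumptions for illustration; the usual formulation visits nodes in random order and breaks ties randomly.

```python
from collections import Counter


def label_propagation(adjacency, max_rounds=100):
    """Each node repeatedly adopts the most common label among its neighbours.

    Ties go to the smallest label and nodes are visited in sorted order,
    which keeps the result reproducible. Nodes that end up sharing a label
    form one community.
    """
    labels = {node: node for node in adjacency}  # start with unique labels
    for _ in range(max_rounds):
        changed = False
        for node in sorted(adjacency):
            neighbours = adjacency[node]
            if not neighbours:
                continue  # isolated nodes keep their own label
            counts = Counter(labels[n] for n in neighbours)
            best = max(counts.values())
            new_label = min(l for l, c in counts.items() if c == best)
            if new_label != labels[node]:
                labels[node] = new_label
                changed = True
        if not changed:
            break
    return labels
```

Note that the smallest-label tie-break is crude: on a graph where two dense groups are joined by a thin bridge it can let one label flood both groups, which is one reason real implementations randomise.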

Hope it helps

Licensed under: CC-BY-SA with attribution