Question

I have used Neo4j to implement a content recommendation engine. I like Cypher, and I find graph databases intuitive.

Looking at scaling to a larger data set, I am not confident Neo4j + Cypher will be performant. Spark has the GraphX project, which I have not used before.

Has anybody switched from Neo4j to Spark GraphX? Do the use cases overlap, aside from scalability? Or does GraphX address a completely different problem set than Neo4j?


Solution

Neo4j and Spark GraphX solve problems at different levels, and they are complementary to each other.

They can be connected by Neo4j's Mazerunner extension:

Mazerunner is a Neo4j unmanaged extension and distributed graph processing platform that extends Neo4j to do big data graph processing jobs while persisting the results back to Neo4j.

Mazerunner uses a message broker to distribute graph processing jobs to Apache Spark's GraphX module. When an agent job is dispatched, a subgraph is exported from Neo4j and written to Apache Hadoop HDFS.
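To make the export step concrete, a subgraph is typically serialized as a flat edge list before being written to HDFS, since such a file is easy for a distributed job to partition. This is only an illustrative sketch in Python; Mazerunner's actual export format and HDFS paths may differ.

```python
# Illustrative sketch: serializing a subgraph as a tab-separated edge list,
# the kind of flat file a distributed graph job can partition easily.
# Mazerunner's real export format may differ.

def export_subgraph(edges):
    """Serialize (source, target) node-ID pairs as tab-separated lines."""
    return "\n".join(f"{src}\t{dst}" for src, dst in edges)

edges = [(0, 1), (0, 2), (1, 2)]
print(export_subgraph(edges))
# One "src<TAB>dst" line per relationship
```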

After Neo4j exports a subgraph to HDFS, a separate Mazerunner service for Spark is notified to begin processing that data. The Mazerunner service will then start a distributed graph processing algorithm using Scala and Spark's GraphX module. The GraphX algorithm is serialized and dispatched to Apache Spark for processing.
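The canonical example of such an algorithm is PageRank, which Mazerunner's tutorial uses. GraphX runs it distributed in Scala; the pure-Python sketch below only illustrates the iterative computation itself (dangling nodes are ignored for brevity).

```python
# Minimal, single-machine sketch of PageRank, the kind of iterative
# algorithm GraphX runs over the exported subgraph. GraphX's actual
# implementation is distributed Scala; this only shows the math.

def pagerank(edges, num_nodes, damping=0.85, iterations=20):
    out_degree = [0] * num_nodes
    for src, _ in edges:
        out_degree[src] += 1

    # Start with a uniform rank for every node.
    ranks = [1.0 / num_nodes] * num_nodes
    for _ in range(iterations):
        contrib = [0.0] * num_nodes
        # Each node splits its rank evenly among its outgoing edges.
        for src, dst in edges:
            contrib[dst] += ranks[src] / out_degree[src]
        ranks = [(1 - damping) / num_nodes + damping * c for c in contrib]
    return ranks

# A 3-node cycle converges to equal ranks of 1/3 each.
print(pagerank([(0, 1), (1, 2), (2, 0)], 3))
```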

Once the Apache Spark job completes, the results are written back to HDFS as a Key-Value list of property updates to be applied back to Neo4j.

Neo4j is then notified that a property update list is available from Apache Spark on HDFS. Neo4j batch imports the results and applies the updates back to the original graph.
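The shape of that update step can be sketched as follows. The data structures here are hypothetical stand-ins: a real import goes through Neo4j's batch APIs or Cypher, but the idea is the same, each (node, property, value) entry from the result list is applied to the stored graph.

```python
# Hedged sketch of the batch-update step: applying a key-value list of
# property updates back to node records. The dict-based "graph" here is
# a stand-in for Neo4j's store; a real import uses its batch APIs.

def apply_updates(nodes, updates):
    """nodes: {node_id: {prop: value}}; updates: [(node_id, prop, value)]."""
    for node_id, prop, value in updates:
        nodes.setdefault(node_id, {})[prop] = value
    return nodes

graph = {1: {"name": "a"}, 2: {"name": "b"}}
updates = [(1, "pagerank", 0.38), (2, "pagerank", 0.62)]
apply_updates(graph, updates)
print(graph[1])
```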

Check out this tutorial to get an idea on how to combine the two: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html

Licensed under: CC-BY-SA with attribution