Question

I am doing some work for an organisation that has offices in 48 countries of the world. Essentially the way they work now is that they all store data in a local copy of the database and that is replicated out to all the regions/offices in the world. On the odd occasion where they need to work directly on something where the "development copy" is on the London servers, they have to connect directly to the London servers, regardless of where they are in the world.

So lets say I want to have a single graph spanning the whole organisation which is sharded so that each region has relatively fast reads of the graph. I am worried that writes are going to kill performance. I understand that writes go through a single master, does that mean there is a single master globally? i.e. if that master happens to be in London then each write to the database from Sydney has to traverse that distance regardless of the local sharding? And what would happen if Sydney and London were cut off (for whatever reason)?

Essentially, how does Neo4j solve the global distribution problem?

Solution

The replication mechanism in Neo4j Enterprise edition is indeed master-slave style. A write request to the master is committed locally and synchronously pushed to the number of slaves defined by push_factor (default: 1). A write request to a slave is synchronously applied to the master first, then to the slave itself, and then to enough other machines to fulfill the push factor. This synchronous slave-to-master communication can hurt performance, which is why it's recommended to redirect writes to the master and distribute reads over the slaves. The cluster communication itself works fine on high-latency networks.
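As a rough sketch, the relevant HA settings in neo4j.properties might look like the following (setting names as in the Neo4j 1.9/2.x HA line; hostnames are made up, so check the documentation for your exact version):

```
# Unique ID of this cluster member
ha.server_id=1

# Initial members used for cluster discovery (hypothetical hostnames)
ha.initial_hosts=neo4j-01:5001,neo4j-02:5001,neo4j-03:5001

# Number of slaves a transaction is synchronously pushed to at commit time
ha.tx_push_factor=1
```

Raising ha.tx_push_factor makes commits more durable across the cluster at the cost of higher write latency, which matters a lot once slaves sit in remote regions.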

In a multi-region setup I'd recommend having a full cluster (i.e. a minimum of 3 instances) in the primary region. Another 3-instance cluster runs in a secondary region in slave-only mode. In case the primary region goes down completely (this happens very rarely, but it does happen), your monitoring tooling triggers a config change in the secondary region that allows its instances to become master. All other offices requiring fast read access then get x slave-only instances (x >= 1, depending on read load). In each location you run an HAProxy (or another load balancer) that directs writes to the master (normally in the primary region) and reads to the local region.
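To sketch the HAProxy side of this: Neo4j HA instances expose REST endpoints (/db/manage/server/ha/master and /db/manage/server/ha/slave) that return 200 only on a node currently in that role, so a config along these lines can route writes to the master and reads to local slaves (IPs and ports here are hypothetical):

```
# Writes: only the current master passes the health check
frontend neo4j-write
    bind *:7474
    default_backend neo4j-master

backend neo4j-master
    option httpchk GET /db/manage/server/ha/master
    server db1 10.0.0.1:7474 check
    server db2 10.0.0.2:7474 check
    server db3 10.0.0.3:7474 check

# Reads: prefer the slave instances in the local region
frontend neo4j-read
    bind *:7475
    default_backend neo4j-slaves

backend neo4j-slaves
    option httpchk GET /db/manage/server/ha/slave
    server local1 10.1.0.1:7474 check
    server local2 10.1.0.2:7474 check
```

Applications in each office then point writes at port 7474 and reads at 7475 of their local proxy, and failover is handled by the health checks rather than by the application.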

If you want to go beyond ~20 instances in a single cluster, do a serious proof of concept first. Due to the master-slave architecture, this approach does not scale indefinitely.

OTHER TIPS

As of 2020, Neo4j is still a replication-only graph database. It has some number of read-write "Core Servers" that form its clustered core, plus some number of "Read Replicas". An update is performed against one of the Core Servers and is then synchronized to (N/2)+1 of the Core Servers (a majority) before the client is notified that the commit succeeded. This is Neo4j's implementation of the Raft protocol. In other words, Neo4j implements replication, and the replicas are distributed, but all of the nodes and edges of a graph must live on the same server.
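The (N/2)+1 majority rule can be illustrated with a small sketch (illustrative only, not Neo4j code): a Raft-style commit succeeds once a majority of core servers have acknowledged the write, which is also what determines how many failures the cluster tolerates.

```python
def quorum(cluster_size: int) -> int:
    """Majority needed for a Raft-style commit: floor(N/2) + 1."""
    return cluster_size // 2 + 1

def commit_succeeds(cluster_size: int, acks: int) -> bool:
    """A write commits once a majority of core servers acknowledge it."""
    return acks >= quorum(cluster_size)

# A 5-core cluster needs 3 acknowledgements and can lose 2 servers.
print(quorum(5))              # -> 3
print(commit_succeeds(5, 3))  # -> True
print(commit_succeeds(5, 2))  # -> False
```

This is why core clusters are typically sized with an odd number of members: going from 3 to 4 servers raises the quorum from 2 to 3 without tolerating any additional failures.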

Objectivity/DB implements a distributed database. Objectivity/DB allows a user to distribute their graph across up to 65,000 servers where a Node-A can be on Server-10 and Node-B can be on Server-47550 and the edge between them can be on Server-543. This allows a single connected graph to grow beyond the scale of any single server. The interface (API) to the database creates what is called a Single Logical View where, once connected to the federated database, the client has a Single Logical View of all of the data in the entire federation regardless of what host it resides on. This allows the user to create massive graphs using smaller commodity hardware instead of having to buy an exabyte-scale server to create an exabyte-scale graph.

The other advantage is that it allows a user to place data close to where it will be used. If your organization is distributed around the globe, you can place the Hong Kong data in or near Hong Kong, and the New York data in or near New York. You can still create edges between the nodes in Hong Kong and New York, and a user in Denver can run navigational queries across all of it, because the query engine itself is distributed.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow