Question

I want one primary collection of items of a single type that is modified over time. Periodically, several slave collections will synchronize with the primary collection. The primary collection should send a delta of items to the slave collections.

Primary Collection: A, C, D
Slave Collection 1: A, C    (add D)
Slave Collection 2: A, B    (add C, D; remove B)

The slave collections cannot add or remove items on their own, and they may exist in a different process, so I'm probably going to use pipes to push the data.

I don't want to push more data than necessary since the collection may become quite large.

What kind of data structures and strategies would be ideal for this?


Solution

For that I use differential execution.

(BTW, the word "slave" is uncomfortable for some people, with reason.)

For each remote site, there is a sequential file at the primary site representing what exists on the remote site.

There is a procedure at the primary site that walks through the primary collection, and as it walks it reads the corresponding file, detecting differences between what currently exists on the remote site and what should exist. Those differences produce deltas, which are transmitted to the remote site. At the same time, the procedure writes a new file representing what will exist at the remote site after the deltas are processed.

The advantage of this is that it does not depend on detecting change events in the primary collection. Change events are often unreliable, self-cancelling, or made irrelevant by later changes, so this approach cuts way down on needless transmissions to the remote site.

In the case that the collections are simple lists of things, this boils down to having local copies of the remote collections and running a diff algorithm to get the delta. Here are a couple of such algorithms:

If the collections can be sorted (like your A,B,C example), just run a merge loop:

while(ix<nx && iy<ny){
  if (X[ix] < Y[iy]){
    // X[ix] was inserted in X
    ix++;
  } else if (Y[iy] < X[ix]){
    // Y[iy] was deleted from X
    iy++;
  } else {
    // the two elements are equal. skip them both;
    ix++; iy++;
  }
}
while(ix<nx){
  // X[ix] was inserted in X
  ix++;
}
while(iy<ny){
  // Y[iy] was deleted from X
  iy++;
}

If the collections cannot be sorted (note relationship to Levenshtein distance),

Until we have read through both collections X and Y,
  See if the current items are equal

  else see if a single item was inserted in X
  else see if a single item was deleted from X

  else see if 2 items were inserted in X
  else see if a single item was replaced in X
  else see if 2 items were deleted from X

  else see if 3 items were inserted in X
  else see if 2 items in X replaced 1 item in Y
  else see if 1 item in X replaced 2 items in Y
  else see if 3 items were deleted from X

  etc. etc. up to some limit

Performance is generally not an issue, because the procedure does not have to be run at high frequency.

There's a crude video demonstrating this concept, and source code where it is used for dynamically changing user interfaces.

OTHER TIPS

  • If you don't push all the data, some sort of log is required, which trades pipe bandwidth for main memory. The parameter for striking a good balance between CPU and memory usage is the push frequency.
  • From your question, I assume you have more than one slave process. In that case, a shared-memory or CMA (Linux) approach with double buffering in the master process should far outperform multiple pipes, since it doesn't even require multithreaded pushing to optimize the overall pipe throughput during synchronization.
    The slave processes could be notified via a global synchronization barrier that it is safe to read masterCollectionA without copying, while the master modifies masterCollectionB (initialized as a copy of masterCollectionA), and vice versa. Access to a collection should be interlocked between the slaves and the master. A slave that would otherwise block a collection past the master's next update attempt could copy it (take a snapshot) instead, allowing the master to continue. Modifications in the slave processes could be implemented with a copy-on-write strategy for individual elements. This cooperative approach is fairly simple to implement, and as long as the slave processes don't copy whole snapshots every time, overall memory consumption stays low.
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow