In house simple data synchronization

https://stackoverflow.com/questions/2056210

20-09-2019
|

Question

I know the post is long but it also contains suggested solutions to my problem. Therefore, the question itself is not that long... don't be scared. :)

Please help me with the following scenario:

I will have a SQL server where I will store data sets. I will also have a bunch of PC clients (all with their own local SQL server) that will need to be synchronized with the server. The PCs will be uniquely identified.

Example:

Let's say I have 2 PC clients, PC1 and PC2.

On the server, I have the following data sets:

DS1 DS2

On the client PCs, I won't have anything to start with. When PC1 will connect to the server to see if "anything is new", it will note that there are two datasets that it needs to retrieve so it will update its local database with DS1 and DS2. The same will happen with PC2. Now, let's say PC2, locally, modifies DS2. When PC2 will connect to the server, it will update DS2 on the server with the changes that it made to its local DS2. Finally, when PC1 will connect again to see if there are any changes, it will note that DS2 has changed and it will retrieve it and overwrite its local DS2 with the DS2 from the server.

Don't worry about concurrency problems because not all the PCs can change any data set. Only the owner of a certain dataset can change it (the owner being defined as a single PC.)

I thought of a few solutions but they don't seem very efficient; maybe someone else will have a few ideas.

1st Solution:

On the server, I will create a table called "SyncTable" where I will write any changes to the data sets.

Example:

Let's say PC1 needs to retrieve DS1 and PC2 needs to retrieve DS4 and DS5.

The sync table would contain:

PC1 DS1
PC2 DS4
PC2 DS5

So, when PC2 connects to the server, it looks at this table, notes that it needs to download DS4 and DS5, goes ahead and does that and then it deletes its two entries from the server table. Thus, the server table would only contain "PC1 DS1" after PC2 synced itself. Next time, when PC2 connects, notes that there are no entries for it and thus knows that it is "up to date". When PC1 would connect, the exact same thing would happen: It would note that it needs to download DS1, download it and then delete the entry.

The problem with this is that if there are 10 000 PCs, and maybe 5000 datasets modified, I would have a lot of entries in this table.

The second solution would be to store a modified timestamp associated with each dataset. The problem with this is that the client PCs would have to go through ALL their local records and compare the local timestamps with the last timestamp on the server to see if there are any changes. Not sure how efficient this is given a large number of records. It seems that it would be better if they would know directly where to look for changes rather than go through all the records each time...

Therefore, do you have any suggestions about this ?

Technology used: MS SQL Server Compact Edition 3.5 on the client PCs and MySQL on the server. The communication would be done through web services. Therefore, Merge Replication/Remote Data Access are out.

Thanks!

Solution

You are considering two options

generating a list of required updates at the time of modification, client asks "what's in my list of updates"?
keeping timestamps such that a client can ask "My latest time for DSx is ..., do I need an update?"

I see option 2 as prefereable. I think it is more resilient to problems, recovery from client crashes etc. This is because each participant keeps only what it knows: the date of its data. The server does not need to "understand" who took what.

I think you can optimise the determination of what to upload. You speak as though the client needs to iterate through all it's data sets, retrieveing each time stamp one at a time, and making a decision to retrieve. Instead you could have a web service call:

I have DS1=<time>, DS2=<time> ...; which do I need to download?

The actual decision is made by the server on the basis of the data sent by the client, rather than the client fetching data to allow it to make the decision itself.

In passing I should point out that there are commercial products than handle all this. Do you really need to code it?

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow