Best practices for app-cloud synchronization of database representation

https://softwareengineering.stackexchange.com/questions/379422

13-02-2021

Question

I'm planning a cookbook-like Android app that would let users store/add/edit their e-juice recipes, keep track of the ingredients, etc. Internally, I'll be storing the data in a database on the user's device (Room, which is an SQLite wrapper) and keeping a copy on the user's Google Drive so that they can access it from other devices. On the Google Drive side, the database will be represented as JSON.

What I'm stuck with is, how do I make the synchronization most efficient and least prone to data corruption? I definitely don't want to rewrite the whole database for every change -- that would be far too expensive and most likely to mess up the data.

Other approaches I'm considering are:

Keep a separate JSON file for every table (with a timestamp for the latest change) containing a JSONArray of entries. On the positive side, I wouldn't have to rewrite all of the semi-static data. On the negative side, all of the most vulnerable data -- current recipes, ingredients in stock, etc. -- will stay just as vulnerable.
Keep a subfolder for every table, with every entry in a separate file, and timestamps for the latest changes both on the subfolder and on the affected files/entries. This one should minimise the possibility of data corruption, but dealing with so many files doesn't seem like a good idea -- for one thing, it would be too expensive.

Honestly, as someone who's only had to deal with client-server apps till now I'm a bit out of my depth, so any advice would be appreciated.

Solution

You are trying to create a distributed eventually-consistent system. Those are inherently complicated. It's understandable that you are having difficulty coming up with a good solution because there isn't one.

For example, consider a user with two devices “phone” and “tablet”, and the following sequence of events:

The phone and tablet start with the same database, and are disconnected from any network.
At 1pm, the user deletes the “Lemon bars” recipe on their tablet.
At 2pm, the user edits the “Lemon bars” recipe on their phone.
At 3pm, the phone synchronizes with the cloud backup. The backed-up database will contain the Lemon bars edit.
At 4pm, the tablet synchronizes with the cloud backup. But now, the backed-up database contains an edit of a record that the tablet wants to delete. What to do?

In a version control system, this would be a “merge conflict” that would require manual resolution. However, it is generally undesirable to implement the user interface for conflict resolution, at least in consumer apps.

One possible approach is to separate the concepts of the queryable database state, and the sequence of events that led up to that state (Event Sourcing). The state can be recomputed from replaying all events. Here, the difficult question remains how multiple event streams should be merged, in particular how the events should be ordered. In many cases, it is fine to order by timestamp of the event and merge by a “last one wins” strategy in case of conflict. However, the exact merging rules will depend on your problem domain. There is no universal approach that will always work.
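A minimal sketch of the timestamp-ordered, last-one-wins merge described above, written in Python for brevity. The `Event` fields, the `merge` and `replay` names, and the rule that a later edit brings a deleted recipe back are all assumptions made for illustration; a real app would choose whatever rule fits its domain.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    timestamp: int            # toy clock: hour of the day
    device: str               # tie-breaker when timestamps are equal
    kind: str                 # "create", "rename" or "delete" (hypothetical kinds)
    recipe_id: str
    name: Optional[str] = None


def merge(*streams):
    """Combine event streams into one deterministic order, dropping duplicates."""
    unique = {event for stream in streams for event in stream}
    return sorted(
        unique,
        key=lambda e: (e.timestamp, e.device, e.kind, e.recipe_id, e.name or ""),
    )


def replay(events):
    """Rebuild the queryable state (recipe_id -> name) from an ordered event list."""
    state = {}
    for event in events:
        if event.kind == "delete":
            state.pop(event.recipe_id, None)
        else:
            # Assumed policy: the later write wins, even over an earlier delete.
            state[event.recipe_id] = event.name
    return state


# The Lemon bars scenario: both devices share history, then diverge offline.
shared = [Event(9, "phone", "create", "r1", "Lemon bars")]
tablet_log = shared + [Event(13, "tablet", "delete", "r1")]
phone_log = shared + [Event(14, "phone", "rename", "r1", "Lemon bars (less sugar)")]

state = replay(merge(tablet_log, phone_log))
```

Because the merged order depends only on the events themselves, both devices arrive at the same state regardless of which one synced first. Under the policy assumed here, the 2pm edit silently overrides the 1pm delete, which is exactly the kind of quiet data-quality loss a last-one-wins merge can cause.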

In particular, the result of this is that while all devices will eventually show the same state (the devices will be consistent), there is a good chance that the merge will have led to data loss or data quality loss. For consumer apps this is probably somewhat acceptable, especially since conflicts will be rare. But this depends on your precise requirements.

With such an event-oriented approach, you would store the event logs in the cloud backup, and possibly database snapshots, although deleting old snapshots without preventing another device from syncing correctly is so difficult that it might be best to avoid snapshots entirely. You can give each log a unique name, e.g. based on device + timestamp or based on a hash of the contents. If the files are immutable, this simplifies syncing: upload any files that you have but the cloud doesn't, and download any new logs that the cloud has (compare also the idea of content-addressable storage). While syncing will involve the local database state being reconstructed from the event logs, this shouldn't be a problem for reasonably small databases.
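As a rough illustration of syncing immutable, content-named logs, the sketch below uses plain Python dictionaries as stand-ins for the local log folder and the cloud folder. The `log_name` and `sync` helpers are hypothetical, and no real Google Drive API calls are shown; in practice the two loops would become uploads and downloads.

```python
import hashlib
import json


def log_name(events):
    """Return (file name, bytes) for a log, named after the SHA-256 of its content."""
    payload = json.dumps(events, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest() + ".json", payload


def sync(local, cloud):
    """Two-way sync of immutable files: copy whatever the other side is missing."""
    for name in local.keys() - cloud.keys():   # "upload"
        cloud[name] = local[name]
    for name in cloud.keys() - local.keys():   # "download"
        local[name] = cloud[name]


# Stand-ins for the storage on each device and in the cloud (name -> bytes).
phone, tablet, cloud = {}, {}, {}

name_a, data_a = log_name([{"kind": "rename", "recipe": "r1", "name": "Lemon bars v2"}])
phone[name_a] = data_a
name_b, data_b = log_name([{"kind": "delete", "recipe": "r1"}])
tablet[name_b] = data_b

sync(phone, cloud)    # the phone uploads its log
sync(tablet, cloud)   # the tablet uploads its own log and downloads the phone's
sync(phone, cloud)    # the phone picks up the tablet's log
```

Since a file's name is derived from its content and files are never modified, comparing names is enough to decide what to transfer, and a log that is uploaded twice simply lands on the same name.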

All of this becomes much easier if you run your own server that can serve as a source of truth for the application state. Normally events would be sent to the server immediately and produce a new state there. If the device is disconnected it could offer an offline mode that buffers events for later, with the understanding that these events might be discarded in case of conflict.

A note on defining events: an event should not replace an entity with a new version, but describe a small self-contained change that results in a logically consistent data model. This will make the events easier to merge. For example, an event that replaces a recipe with an edited version is problematic. It might be better to create multiple events “mark as favorite”, “change ingredient amount”, “insert step”, “edit step”. Note that entities must not be referenced by an ID that depends on the order of events, e.g. an autoincrement ID. Instead, prefer UUIDs.
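To make the point about small, self-contained events concrete, here is a hedged sketch in Python. The event kinds, field names and the `apply_event` helper are invented for illustration. Entities are keyed by UUIDs generated on the device, so an event recorded offline still points at the same ingredient or step after merging.

```python
import copy
import uuid


def new_id():
    """Generate an ID that does not depend on event order (unlike autoincrement)."""
    return str(uuid.uuid4())


def apply_event(recipe, event):
    """Return a copy of the recipe with one small, self-contained change applied."""
    recipe = copy.deepcopy(recipe)
    kind = event["kind"]
    if kind == "mark_favorite":
        recipe["favorite"] = True
    elif kind == "change_ingredient_amount":
        recipe["ingredients"][event["ingredient_id"]]["amount_ml"] = event["amount_ml"]
    elif kind == "edit_step":
        recipe["steps"][event["step_id"]]["text"] = event["text"]
    else:
        raise ValueError(f"unknown event kind: {kind}")
    return recipe


def replay(recipe, events):
    for event in events:
        recipe = apply_event(recipe, event)
    return recipe


lemon_id, step_id = new_id(), new_id()
base = {
    "id": new_id(),
    "name": "Lemon bars",
    "favorite": False,
    "ingredients": {lemon_id: {"name": "Lemon juice", "amount_ml": 120}},
    "steps": {step_id: {"text": "Bake for 20 minutes."}},
}

from_phone = {"kind": "change_ingredient_amount", "ingredient_id": lemon_id, "amount_ml": 100}
from_tablet = {"kind": "mark_favorite"}

phone_first = replay(base, [from_phone, from_tablet])
tablet_first = replay(base, [from_tablet, from_phone])
```

The two events touch different parts of the recipe, so they produce the same result in either order and neither device's change is lost. Had each device instead sent a "replace the whole recipe" event, whichever arrived last would have overwritten the other.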

Licensed under: CC-BY-SA with attribution

Not affiliated with softwareengineering.stackexchange