Question

Example

My distributed event-sourced system simulates houses being built and purchased over a period of time. For simplicity's sake, we will use the year as the distributed clock value (setting aside vector clocks for now).

Houses take 1 year to build in version 1 of the system, but take twice as long in version 2. This is a change in logic rather than structure.

To cope with this change, events recorded under version 1 must also be replayed with version 1 logic when rebuilding state/snapshots. When the version 2 portion of the log is reached, the application switches over to version 2 of the logic and replays the remaining events. A valid snapshot is built.
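To make the replay rule concrete, here is a minimal sketch in Python (the question names no language). The event shapes, the `BUILD_YEARS` table, and the use of a `NodeUpgradedTo` marker to flip logic mid-replay are all assumptions for illustration, not part of any real framework:

```python
# Assumed build durations per logic version: V1 = 1 year, V2 = 2 years.
BUILD_YEARS = {1: 1, 2: 2}

def replay(events):
    """Rebuild state from (year, kind, arg) tuples, switching logic
    when the NodeUpgradedTo marker event is reached."""
    version = 1
    started = {}      # house name -> year its build started
    built = set()     # houses whose (implicit) build has completed
    purchased = set()
    for year, kind, arg in events:
        # Complete any builds whose duration, under the *current*
        # version's logic, has elapsed by this event's year.
        for house, start_year in list(started.items()):
            if year - start_year >= BUILD_YEARS[version]:
                built.add(house)
                del started[house]
        if kind == "NodeUpgradedTo":
            version = int(arg.lstrip("V"))
        elif kind == "HouseBuildStarted":
            started[arg] = year
        elif kind == "HousePurchased":
            if arg not in built:
                raise ValueError(f"{arg} purchased before it was built")
            purchased.add(arg)
    return built, purchased

# A single node's log replays cleanly with one switch-over point:
log_a = [
    (2000, "HouseBuildStarted", "Alpha"),
    (2001, "HousePurchased", "Alpha"),
    (2002, "NodeUpgradedTo", "V2"),
]
print(replay(log_a))  # ({'Alpha'}, {'Alpha'})
```

This works precisely because a single node's log has one clean boundary between V1 and V2 events, which is the assumption the merge below breaks.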

Problem

The nodes in my distributed system will be updated to version 2 at different times, creating a window during which multiple versions run simultaneously. My current understanding is that this window can only be reduced through techniques like feature switching, but cannot be completely removed (unless you sacrifice availability by bringing the entire system down for an upgrade).

This creates a problem when merging the event logs from distributed nodes. The event versions bleed into each other, making it impossible to simply upgrade from version 1 to 2 during the replay. E.g.:

Node    Clock   Event

... pre-merge ...

A       2000    HouseBuildStarted('Alpha')   
A       2001    HousePurchased('Alpha')    <- 'HouseBuilt' event is implicit (inferred through logic).
A       2002    NodeUpgradedTo('V2')
B       2002    HouseBuildStarted('Bravo')
B       2003    HousePurchased('Bravo')
B       2004    NodeUpgradedTo('V2')

... post-merge ...

A       2000    HouseBuildStarted('Alpha')
A       2001    HousePurchased('Alpha')
B       2002    HouseBuildStarted('Bravo')
A       2002    NodeUpgradedTo('V2')        
B       2003    HousePurchased('Bravo')    <- 'Bravo' does not exist yet (1 year early)
B       2004    NodeUpgradedTo('V2')
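The failure can be sketched directly from the merged table above. This is a hypothetical check, using assumed tuple-shaped events and a single global `version` flag, showing that after node A's 2002 upgrade marker a naive replay applies V2's two-year rule to Bravo, which node B actually built under V1 logic:

```python
BUILD_YEARS = {1: 1, 2: 2}  # assumed build durations per version

merged = [
    ("A", 2000, "HouseBuildStarted", "Alpha"),
    ("A", 2001, "HousePurchased", "Alpha"),
    ("B", 2002, "HouseBuildStarted", "Bravo"),
    ("A", 2002, "NodeUpgradedTo", "V2"),   # flips the global version flag
    ("B", 2003, "HousePurchased", "Bravo"),
    ("B", 2004, "NodeUpgradedTo", "V2"),
]

version = 1
started = {}
errors = []
for node, year, kind, arg in merged:
    if kind == "NodeUpgradedTo":
        version = int(arg.lstrip("V"))
    elif kind == "HouseBuildStarted":
        started[arg] = year
    elif kind == "HousePurchased":
        elapsed = year - started[arg]
        if elapsed < BUILD_YEARS[version]:
            errors.append(f"{arg}: purchased after {elapsed} year(s), "
                          f"V{version} needs {BUILD_YEARS[version]}")

print(errors)  # ['Bravo: purchased after 1 year(s), V2 needs 2']
```

Alpha passes because both its events precede any upgrade marker; Bravo fails because the global flag cannot express that node B was still on V1 in 2003.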

How is this usually handled in systems where taking all the nodes down is not acceptable?


Solution

The questions of upgrading logic and distributing upgrades are different. If you need to upgrade an event stream (e.g. your implicit "HouseBuilt" event), then you should do so. Your read models will have to be rebuilt by replaying the event stream with logic that performs the upgrade. This is no different in concept from patching a database when you upgrade your program: facts about the persisted data now need to be reconsidered in light of your newer representations (default values may have to be substituted, obsolete events ignored, and so on).
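One way to read "upgrade the event stream" is an upcasting pass, in the event-sourcing sense: rewrite old events into the new shape once, so the new logic can replay everything uniformly. A minimal sketch, assuming the same tuple-shaped events as the question and that making the implicit `HouseBuilt` fact explicit is the chosen upgrade:

```python
V1_BUILD_YEARS = 1  # under V1 logic, a build finished one year after it started

def upcast(v1_events):
    """Yield a V2-shaped stream: the fact that a V1 build completed a
    year after it started is recorded as an explicit HouseBuilt event,
    so V2 replay logic no longer needs to infer it."""
    for year, kind, house in v1_events:
        if kind == "HouseBuildStarted":
            yield (year, "HouseBuildStarted", house)
            yield (year + V1_BUILD_YEARS, "HouseBuilt", house)
        else:
            yield (year, kind, house)

v1_log = [
    (2000, "HouseBuildStarted", "Alpha"),
    (2001, "HousePurchased", "Alpha"),
]
print(list(upcast(v1_log)))
# [(2000, 'HouseBuildStarted', 'Alpha'),
#  (2001, 'HouseBuilt', 'Alpha'),
#  (2001, 'HousePurchased', 'Alpha')]
```

Because the upcast bakes the V1 duration into the data itself, merged streams no longer depend on knowing which version a node was running at replay time.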

How you determine which node runs which version of the code is a separate question. If you have a no-downtime-upgrade policy, what do you normally do? Have some customers serviced by old versions and some by new? The same thing could happen, building new read models while the old ones are online, servicing old aggregates and application services while the new ones are being deployed, etc.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow