Can message passing be used for a CPU redundancy and load-balancing construct

https://softwareengineering.stackexchange.com/questions/334913

31-12-2020
|

Question

In bare metal or minimal RTOS type embedded systems with multiple processors is it possible to have an identical program running on each processor that uses Message Passing Interface (MPI) to provide load balancing and also redundancy in the case of processor failure? Such as a state machine that changes what actions other CPUs perform based on passed messages, for example asking another processor to take over some part of the system loop for load balancing or sending periodic alive messages and remembering what each CPU is responsible for as far as CPU redundancy.

In this example diagram the actual parts of the full system loop that are "open" could be any distinct systems. There could be no cooperation just the ability to open and or close parts of the full system loop running on each CPU in a kind of very primitive asymmetric multiprocessing. "Process migration" to another CPU would be triggered by a request for another CPU to open that part of the system loop after which the requesting CPU closes its portion, or a lack of response from another CPU when queried if alive for some amount of time.

It has been proposed as a solution to potential processor failure and solution to load balancing since we can not port an embedded OS to truly do symmetric or asymmetric multiprocessing on the custom board, and it sounds like it is theoretically possible, but an incredibly poor design idea. Also, I have not been able to find any design patterns or algorithms for using message passing in this way.

Some background important to the software engineering decisions: A student CubeSat project (not graded or for a class), we have a small software development team with mostly junior students with little to no knowledge of operating systems design. For various reasons we can not do any of the many real world solutions I have read about. This while it sounds like it is possible sounds like it will introduce too much complexity for the team to deal with, and even if it can be done will cause a terrible design that will lead to some problem that turns the CubeSat into an orbiting rock.

I am not even sure we would be able to implement message passing in a way reliable enough for spacefairing, I have not even been able to find any production ready communication protocols that can be used to pass messages on a bus with a tiny OS or bare metal like we need. But I am also curious to know if this proposed solution for process migration, CPU redundancy, and load balancing is even viable for a safety critical system. It seems like it could lead to a state where two CPUs are running the same "Process" or part of the program loop if one wakes that would be hard to detect.

Solution

Excellent questions because I actually worked some of this out back in the mid 90s. Spacecraft are expensive and it is difficult to modify software once in orbit. I thought about a variant of this issue when thinking how spacecraft software resources could re-allocate based upon changing mission requirements. As far as we took it in the lab (VxWorks) :

Estimate Task load for essential for each processor per requirements.
Estimate Task load for sub mission task set. This is the new configuration desired based upon a providing the necessary tasks per processor required to fullfill the most important mission requirements. Basically what you can't life without.
For each processor we now have a primary mission tasking model and variants thereof based upon other processing states we may have to switch to as expeditiously as possible. This is simple planned adaptation. Nothing special, just different sets of tasking models cutting in and out to some stimuli. Load balancing in my experiments were essentially pre-planned. We used basic RMA scheduling for this operation. Basically this is a big context switch on a system wide tasking model level.

On Station Program updates under a RTOS. Basically plug in a new task set, hook up the queue network and start the data flow again.
So in this simple implementation we suspend or remove some tasks and allowed others to run. We took it a bit further in what we called the "Heart Transplant" technique. This was for on station software updates. We could disconnect and re-route the queue networks within the tasking model. Basically disconnect the task and eliminate it if so desired, kill the queues and reconnect the new task (heart) and arteries (queue network). We did this bit of play time back in 1995/96. I not only wanted the ability to add functionality but to remove that not required since memory is a much limited resource. Don't know much about MPI, I never used it. Is it deterministic? Using information theory, you do not need much to send a keep alive signal. Use minimal bits. Most common info like "keep alive" takes only one bit, true or false. Events occurring with a much lower probability need more bits to represent. Eliminate any software overhead you can. Follow the KISS principle (Keep It Simple..Stupid!).

Now radiation protection of some sort. Student project means likely flying CMOS. I'd at least put CRC checks on memory and run a watchdog to catch errors like hangups radiation does weird stuff to electronics. Single event upset effects can be mitigated using CRC on memory. Latch-up requires a power up reset.

I would suggest trying something like FreeRTOS and see what features you can bend to your will. Space is a very challenging environment. Have Fun.

Licensed under: CC-BY-SA with attribution

Not affiliated with softwareengineering.stackexchange