Is there such a thing as a disaster-proof language?

https://stackoverflow.com/questions/1403915

05-07-2019

Question

When creating system services that must be highly reliable, I often end up writing a lot of "fail-safe" mechanisms for things like: communications that have gone away (for instance, communication with the DB), what happens if the power is lost and the service restarts... how to pick up the pieces and continue correctly (remembering that while picking up the pieces the power could go out again...), and so on.

I can imagine that, for systems that are not too complex, a language that caters for this would be very practical: a language that remembers its state at any given moment, no matter whether the power gets cut, and continues where it left off.

Does this exist yet? If so, where can I find it? If not, why can't it be realized? It seems to me it would be very handy for critical systems.

P.S. If the DB connection is lost, it would signal that a problem arose and that manual intervention is needed. The moment the connection is restored, it would continue where it left off.

EDIT: Since the discussion seems to have died off, let me add a few points (while I wait until I can add a bounty to the question).

The Erlang response seems to be top rated right now. I'm aware of Erlang and have read the Pragmatic book by Armstrong (the principal creator). It's all very nice (although functional languages make my head spin with all the recursion), but the "fault tolerant" bit doesn't come automatically. Far from it. Erlang offers supervisors and other methodologies to supervise a process and restart it if necessary. However, to properly build something that works with these structures, you need to be quite the Erlang guru, and you need to make your software fit all these frameworks. Also, if the power drops, the programmer still has to pick up the pieces and try to recover the next time the program restarts.

What I'm looking for is something far simpler:

Imagine a language (as simple as PHP, for instance) in which you can do things like run DB queries, act on the results, manipulate files, manipulate folders, etc.

Its main feature, however, should be: if the power dies and the thing restarts, it picks up where it left off (so it not only remembers where it was, it remembers the variable states as well). Also, if it stopped in the middle of a file copy, it will properly resume. Etc., etc.

Last but not least, if the DB connection drops and can't be restored, the language just halts and signals (via syslog, perhaps) for human intervention, and then carries on where it left off.

A language like this would make programming a lot of services much easier.

EDIT: It seems (judging by all the comments and answers) that such a system doesn't exist, and probably won't in the foreseeable future because it is (nearly?) impossible to get right.

Too bad... Again, I'm not looking for this language (or framework) to get me to the moon, or to monitor someone's heart rate, but for small periodic services and tasks that always end up with loads of code handling border cases (a power failure somewhere in the middle, connections dropping and not coming back up), where a "pause here, fix the issues, and continue where you left off" approach would work well.

(Or a checkpoint approach, as one of the commenters pointed out, like in a video game: set a checkpoint, and if the program dies, restart from there the next time.)
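A minimal sketch of that checkpoint idea in Python, assuming the work is a list of items processed in order and that the state fits in a small pickled dictionary. The names `save_checkpoint`, `load_checkpoint` and `process_items` are hypothetical and do not come from any framework mentioned in this thread.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write `state` to `path` atomically, so a crash never leaves a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as tmp:
        pickle.dump(state, tmp)
        tmp.flush()
        os.fsync(tmp.fileno())   # force the bytes to disk before the rename
    os.replace(tmp_path, path)    # atomic swap: readers see the old or the new file


def load_checkpoint(path, default):
    """Return the last saved state, or `default` on the very first run."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return default


def process_items(items, checkpoint_path, handle):
    """Process items in order, resuming after the last checkpointed index."""
    state = load_checkpoint(checkpoint_path, {"next_index": 0})
    for i in range(state["next_index"], len(items)):
        handle(items[i])
        state["next_index"] = i + 1
        save_checkpoint(checkpoint_path, state)
```

If the process dies between `handle` and `save_checkpoint`, the same item is handled again on restart, so this only works when `handle` is safe to repeat.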

Bounty awarded: At the last possible minute, when everyone was coming to the conclusion that it can't be done, Stephen C came along with Napier88, which seems to have the attributes I was looking for. Although it is an experimental language, it proves it can be done, and it is worth investigating further.

I'll be looking at creating my own framework (with persistent state and perhaps snapshots) to add the features I'm looking for in .NET or another VM.

Thanks, everyone, for the input and the great insights.

Solution

There is an experimental language called Napier88 that (in theory) has some attributes of being disaster-proof. The language supports orthogonal persistence, and in some implementations this extends (extended) to include the state of the entire computation. Specifically, when the Napier88 runtime system checkpointed a running application to the persistent store, the current thread state would be included in the checkpoint. If the application then crashed and you restarted it in the right way, you could resume the computation from the checkpoint.

Unfortunately, there are a number of hard issues that need to be addressed before this kind of technology is ready for mainstream use. These include figuring out how to support multi-threading in the context of orthogonal persistence, figuring out how to allow multiple processes to share a persistent store, and scalable garbage collection of persistent stores.

And there is the problem of doing orthogonal persistence in a mainstream language. There have been attempts to do OP in Java, including one by people associated with Sun (the PJama project), but there is nothing active at the moment. The JDO / Hibernate approaches are more favoured these days.

I should point out that orthogonal persistence isn't really disaster-proof in the large sense. For instance, it cannot deal with:

reestablishment of connections, etc. with "outside" systems after a restart,
application bugs that cause corruption of persistent data, or
loss of data due to something bringing down the system between checkpoints.

For those, I don't believe there are general solutions that would be practical.

Other tips

Erlang was designed for use in telecommunication systems, where high reliability is fundamental. I think they have a standard methodology for building sets of communicating processes in which failures can be handled gracefully.

ERLANG is a concurrent functional language, well suited for distributed, highly concurrent and fault-tolerant software. An important part of Erlang is its support for failure recovery. Fault tolerance is provided by organising the processes of an ERLANG application into tree structures. In these structures, parent processes monitor failures of their children and are responsible for restarting them.
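The restart-on-failure idea can be illustrated with a rough Python analogy. This is not Erlang and not the OTP supervisor API: `supervise` and `max_restarts` are hypothetical names, and the "child" is just a function call rather than a separate process.

```python
def supervise(child, max_restarts=3):
    """Run `child`; if it raises, restart it, up to `max_restarts` times."""
    restarts = 0
    while True:
        try:
            return child()
        except Exception as exc:
            if restarts >= max_restarts:
                # Give up and let our own parent decide what to do next.
                raise RuntimeError("child kept failing, escalating to parent") from exc
            restarts += 1
```

Because `supervise` itself raises when it gives up, supervisors can be nested, which is the tree structure described above in miniature.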

Software Transactional Memory (STM) combined with nonvolatile RAM would probably satisfy the OP's revised question.

STM is a technique for implementing "transactions", i.e., sets of actions that are effectively done as one atomic operation, or not at all. Normally the purpose of STM is to enable highly parallel programs to interact over shared resources in a way that is easier to understand than traditional lock-that-resource programming, and that arguably has lower overhead by virtue of a highly optimistic, lock-free style of programming.

The fundamental idea is simple: all reads and writes inside a "transaction" block are recorded (somehow!); if any two threads conflict on these sets (read-write or write-write conflicts) at the end of either of their transactions, one is chosen as the winner and proceeds, and the other is forced to roll back its state to the beginning of the transaction and re-execute.
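A toy illustration of the optimistic "detect the conflict at commit time and re-execute" idea, reduced to a single shared value. This is only a sketch, not a real STM: `VersionedCell` and `atomically` are hypothetical names, and a real implementation tracks whole read and write sets rather than one version counter.

```python
import threading


class VersionedCell:
    """A shared value plus a version counter used to detect conflicts."""

    def __init__(self, value):
        self.value = value
        self.version = 0
        self._commit_lock = threading.Lock()

    def snapshot(self):
        # Read value and version together so they are consistent.
        with self._commit_lock:
            return self.value, self.version

    def try_commit(self, seen_version, new_value):
        """Commit only if nobody else committed since `seen_version`."""
        with self._commit_lock:
            if self.version != seen_version:
                return False      # conflict: the caller must re-execute
            self.value = new_value
            self.version += 1
            return True


def atomically(cell, update):
    """Re-run `update` on a fresh snapshot until its commit succeeds."""
    while True:
        value, version = cell.snapshot()
        # `update` runs outside the lock; that is the optimistic part.
        if cell.try_commit(version, update(value)):
            return
```

For example, `atomically(cell, lambda v: v + 1)` called from several threads never loses an increment: a thread whose snapshot has gone stale simply tries again.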

If one insisted that all computations were transactions, and that the state at the beginning (/end) of each transaction was stored in nonvolatile RAM (NVRAM), a power failure could be treated as a transaction failure resulting in a "rollback". Computations would proceed only from transacted states, in a reliable way. NVRAM these days can be implemented with Flash memory or with battery backup. One might need a LOT of NVRAM, as programs have a lot of state (see the minicomputer story at the end). Alternatively, committed state changes could be written to log files on disk; this is the standard method used by most databases and by reliable filesystems.
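The log-file alternative can be sketched in a few lines of Python, assuming each committed change is a small JSON record with `key` and `value` fields. The record layout and the names `append_commit` and `recover_state` are assumptions made for this illustration, not how any particular database does it.

```python
import json
import os


def append_commit(log_path, change):
    """Append one committed change and force it to disk before returning."""
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(change) + "\n")
        log.flush()
        os.fsync(log.fileno())


def recover_state(log_path):
    """Rebuild the state by replaying every complete record in the log."""
    state = {}
    try:
        with open(log_path, encoding="utf-8") as log:
            for line in log:
                try:
                    change = json.loads(line)
                except json.JSONDecodeError:
                    break   # torn final record from a crash mid-write: ignore it
                state[change["key"]] = change["value"]
    except FileNotFoundError:
        pass                # no log yet: start from an empty state
    return state
```

After a power failure the program calls `recover_state` and continues from the last change that was fully written; a half-written final record behaves like a rolled-back transaction. A production log would also truncate that torn tail before appending again.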

The current question with STM is: how expensive is it to keep track of the potential transaction conflicts? If implementing STM slows the machine down by an appreciable amount, people will live with existing, slightly unreliable schemes rather than give up that performance. So far the story isn't good, but then the research is early.

People haven't generally designed languages for STM; for research purposes, they've mostly enhanced Java with STM (see the Communications of the ACM article in June? of this year). I hear MS has an experimental version of C#. Intel has an experimental version for C and C++. The Wikipedia page has a long list. And the functional programming guys are, as usual, claiming that the side-effect-free property of functional programs makes STM relatively trivial to implement in functional languages.

If I recall correctly, back in the 70s there was considerable early work on distributed operating systems, in which processes (code + state) could travel trivially from machine to machine. I believe several such systems explicitly allowed for node failure, and could restart a process from a failed node using saved state on another node. Early key work was on the Distributed Computing System by Dave Farber. Because designing languages was popular back in the 70s, I recall DCS had its own programming language, but I don't remember the name. If DCS didn't allow for node failure and restart, I'm fairly sure the follow-on research systems did.

EDIT: A 1996 system which appears at first glance to have the properties you desire is documented here. Its concept of atomic transactions is consistent with the ideas behind STM. (Goes to prove there isn't much new under the sun.)

A side note: back in the 70s, core memory was still king. Core, being magnetic, was nonvolatile across power failures, and many minicomputers (and, I'm sure, the mainframes) had power-fail interrupts that notified the software some milliseconds ahead of the loss of power. Using that, one could easily store the register state of the machine and shut it down completely. When power was restored, control would return to a state-restoring point, and the software could proceed. Many programs could thus survive power blinks and restart reliably. I personally built a time-sharing system on a Data General Nova minicomputer; you could actually have it running 16 teletypes full blast, take a power hit, come back up, and restart all the teletypes as if nothing had happened. The change from cacophony to silence and back was stunning. I know, because I had to repeat it many times to debug the power-failure management code, and of course it made a great demo (yank the plug, deathly silence, plug back in...). The name of the language that did this was, of course, Assembler :-}

From what I know¹, Ada is often used in safety-critical (fail-safe) systems.

Ada was originally targeted at embedded and real-time systems.

Notable features of Ada include: strong typing, modularity mechanisms (packages), run-time checking, parallel processing (tasks), exception handling, and generics. Ada 95 added support for object-oriented programming, including dynamic dispatch.

Ada supports run-time checks in order to protect against access to unallocated memory, buffer overflow errors, off-by-one errors, array access errors, and other detectable bugs. These checks can be disabled in the interest of runtime efficiency, but can often be compiled efficiently. It also includes facilities to help program verification.

For these reasons, Ada is widely used in critical systems, where any anomaly might lead to very serious consequences, i.e., accidental death or injury. Examples of systems where Ada is used include avionics, weapon systems (including thermonuclear weapons), and spacecraft.

N-version programming may also give you some helpful background reading.

¹ That's basically one acquaintance who writes embedded safety-critical software.
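As a rough illustration of what N-version programming means, the sketch below runs several independently written implementations of the same function and accepts only the answer a strict majority agrees on. The name `n_version_call` is hypothetical, and results are assumed to be hashable values such as numbers or strings.

```python
from collections import Counter


def n_version_call(implementations, *args):
    """Run every implementation and return the answer a strict majority agrees on."""
    results = []
    for impl in implementations:
        try:
            results.append(impl(*args))
        except Exception:
            pass                      # a crashing version simply loses its vote
    if results:
        answer, votes = Counter(results).most_common(1)[0]
        if votes > len(implementations) // 2:
            return answer
    raise RuntimeError("no majority among the versions")
```

With three versions, one buggy or crashing version is outvoted by the other two; if no answer reaches a majority, the caller is told instead of being handed a guess.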

I doubt that the language features you are describing are possible to achieve.

The reason is that it would be very hard to define common, general failure modes and how to recover from them. Think for a second about your sample application: some website with some logic and database access. And let's say we have a language that can detect a power shutdown and the subsequent restart, and somehow recover from it. The problem is that the language cannot know how to recover.

Let's say your app is an online blog application. In that case it might be enough to just continue from the point where we failed, and all will be fine. However, consider a similar scenario for an online bank. Suddenly it's no longer smart to just continue from the same point. For example, if I was trying to withdraw some money from my account, and the computer died right after the checks but before it performed the withdrawal, and it comes back up a week later, it will give me the money even though my account is now in the negative.
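To make the bank example concrete, here is a small sketch of what an application-specific recovery step might look like. Everything in it is hypothetical (`resume_withdrawal`, the `saved_step` values, the `read_balance` and `debit` callbacks); the point is only that the rule "redo the balance check after a restart" comes from the application, not from the language.

```python
def resume_withdrawal(saved_step, amount, read_balance, debit):
    """Resume a withdrawal after a restart without trusting the old balance check."""
    if saved_step == "debited":
        return "already done"     # never debit twice
    # The earlier check may be a week old, so redo it against the current balance.
    if read_balance() < amount:
        return "rejected"
    debit(amount)
    return "done"
```

A blog engine resuming a half-rendered page would need none of this, which is why a single built-in recovery strategy is hard to define.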

In other words, there is no single correct recovery strategy, so this is not something that can be built into the language. What a language can do is tell you when something bad happens, but most languages already support that with exception handling mechanisms. The rest is up to the application designers to think about.

There are a lot of technologies that allow designing fault-tolerant applications: database transactions, durable message queues, clustering, hardware hot swapping, and so on. But it all depends on concrete requirements and on how much the end user is willing to pay for it all.

The majority of such efforts, termed "fault tolerance", are about the hardware, not the software.

The extreme example of this is Tandem, whose "NonStop" machines have complete redundancy.

Implementing fault tolerance at the hardware level is attractive because a software stack is typically made of components sourced from different providers: your high-availability software application might be installed alongside some decidedly shaky other applications and services, on top of an operating system that is flaky and uses hardware device drivers that are decidedly fragile.

At the language level, almost all languages offer the facilities for proper error checking. However, even with RAII, exceptions, constraints and transactions, these code paths are rarely tested correctly and rarely tested together in multi-failure scenarios, and it's usually in the error handling code that the bugs hide. So it's more about programmer understanding, discipline and trade-offs than about the languages themselves.

Which brings us back to fault tolerance at the hardware level. If you can avoid your database link failing, you can avoid exercising the dodgy error handling code in the applications.

No, a disaster-proof language does not exist.

Edit:

Disaster-proof implies perfection. It brings to mind images of a process that applies some intelligence to resolve unknown, unspecified and unexpected conditions in a logical manner. There is no way a programming language can do this. If you, as the programmer, cannot figure out how your program is going to fail and how to recover from it, then your program isn't going to be able to do so either.

Disaster, from an IT perspective, can arise in so many ways that no one process can resolve all of those different issues. The idea that you could design a language to address all of the ways in which something could go wrong is simply incorrect. Because of the abstraction from the hardware, many problems don't even make much sense to address with a programming language; yet they are still "disasters".

Of course, once you start limiting the scope of the problem, we can begin talking about developing a solution to it. So when we stop talking about being disaster-proof and start talking about recovering from unexpected power surges, it becomes much easier to develop a programming language to address that concern, even if it perhaps doesn't make much sense to handle that issue at such a high level of the stack. However, I will venture a prediction: once you scope this down to realistic implementations, it becomes uninteresting as a language because it has become so specific. I.e., "use my scripting language to run batch processes overnight that will recover from unexpected power surges and lost network connections (with some human assistance)"; this is not a compelling business case to my mind.

Please don't misunderstand me. There are some excellent suggestions in this thread, but to my mind they do not amount to anything even remotely approaching disaster-proof.

Consider a system built from non-volatile memory. The program state is persisted at all times, and should the processor stop for any length of time, it will resume from the point where it left off when it restarts. Therefore, your program is "disaster-proof" to the extent that it can survive a power failure.

This is entirely possible, as other posts have outlined when talking about Software Transactional Memory, "fault tolerance", etc. Curiously, nobody has mentioned memristors, which would offer a future architecture with these properties, and one that is perhaps not a completely von Neumann architecture either.

Now imagine a system built from two such discrete systems. For a straightforward illustration, one is a database server and the other an application server for an online banking website.

Should one pause, what does the other do? How does it handle the sudden unavailability of its co-worker?

It could be handled at the language level, but that would mean lots of error handling and such, and that's tricky code to get right. That's pretty much no better than where we are today, where machines are not checkpointed but the languages try to detect problems and ask the programmer to deal with them.

It could pause too: at the hardware level they could be tied together, so that from a power perspective they are one system. But that's hardly a good idea; better availability would come from a fault-tolerant architecture with backup systems and such.

Or we could use persistent message queues between the two machines. However, at some point these messages get processed, and by then they could be too old! Only application logic can really work out what to do in those circumstances, and there we are, back to languages delegating to the programmer again.
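A small sketch of the stale-message problem, assuming each queued message is a dictionary carrying a `sent_at` timestamp in seconds. The message layout and the names `drain`, `handle` and `on_stale` are hypothetical; the queue itself is just an in-memory list here, standing in for a persistent one.

```python
import time


def drain(messages, handle, on_stale, max_age_seconds, now=time.time):
    """Process each queued message, diverting the ones that waited too long."""
    for message in messages:
        age = now() - message["sent_at"]
        if age > max_age_seconds:
            on_stale(message)     # application logic decides what "too old" implies
        else:
            handle(message)
```

The queue can guarantee that a message survives the outage, but what `on_stale` should do (reject, recompute, ask a human) is exactly the decision that ends up back with the programmer.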

So it seems that disaster-proofing is better done in its current form: uninterruptible power supplies, hot backup servers ready to go, multiple network routes between hosts, etc. And then we only have to hope that our software is bug-free!

Precise answer:

Ada and SPARK were designed for maximum fault tolerance and to move as many bugs as possible to compile time rather than runtime. Ada was designed for the US Department of Defense, for military and aviation systems running on embedded devices in things such as aircraft. SPARK is its descendant. There's another language, used in the early US space program, HAL/S, geared to handling HARDWARE failure and memory corruption due to cosmic rays.

Practical answer:

I've never met anyone who can code Ada/SPARK. For most users the best answer is SQL variants on a DBMS with automatic failover and clustering of servers. Integrity checks guarantee safety. Something like T-SQL or PL/SQL has full transactional security, is Turing-complete, and is pretty tolerant of problems.

Reason there isn't a better answer:

For performance reasons, you can't provide durability for every program operation. If you did, processing would slow to the speed of your fastest nonvolatile storage. At best, your performance would drop by a thousand- or million-fold, because any nonvolatile storage is SO much slower than CPU caches or RAM.

It would be the equivalent of going from a Core 2 Duo CPU to the ancient 8086 CPU: at most you could do a couple hundred operations per second. Except this would be even SLOWER.

In cases where frequent power cycling or hardware failures occur, you use something like a DBMS, which guarantees ACID for every important operation. Or you use hardware that has fast, nonvolatile storage (flash, for example): this is still much slower, but if the processing is simple, that's OK.

At best, your language gives you good compile-time safety checks for bugs, and will throw exceptions rather than crash. Exception handling is a feature of half the languages in use now.

There are several commercially available frameworks (Veritas, Sun's HA, IBM's HACMP, etc.) that will automatically monitor processes and start them on another server in the event of failure.

There is also expensive hardware, like HP's Tandem NonStop range, that can survive internal hardware failures.

However, software is built by people, and people love to get it wrong. Consider the cautionary tale of the IEFBR14 program shipped with IBM's MVS. It is basically a NOP dummy program that allows the declarative bits of JCL to happen without really running a program. This is the entire original source code:

     IEFBR14 START
             BR    14       Return addr in R14 -- branch at it
             END

Surely no code could be simpler? Yet during its long life this program has actually accumulated a bug report and is now on version 4.

That's one bug in three lines of code, and the current version is four times the size of the original.

Errors will always creep in; just make sure you can recover from them.

This question forced me to post this text

(It's quoted from HGTTG by Douglas Adams:)

Click, hum.

The huge grey Grebulon reconnaissance ship moved silently through the black void. It was travelling at fabulous, breathtaking speed, yet appeared, against the glimmering background of a billion distant stars to be moving not at all. It was just one dark speck frozen against an infinite granularity of brilliant night.

On board the ship, everything was as it had been for millennia, deeply dark and silent.

Click, hum.

At least, almost everything.

Click, click, hum.

Click, hum, click, hum, click, hum.

Click, click, click, click, click, hum.

Hmmm.

A low level supervising program woke up a slightly higher level supervising program deep in the ship's semi-somnolent cyberbrain and reported to it that whenever it went click all it got was a hum.

The higher level supervising program asked it what it was supposed to get, and the low level supervising program said that it couldn't remember exactly, but thought it was probably more of a sort of distant satisfied sigh, wasn't it? It didn't know what this hum was. Click, hum, click, hum. That was all it was getting.

The higher level supervising program considered this and didn't like it. It asked the low level supervising program what exactly it was supervising and the low level supervising program said it couldn't remember that either, just that it was something that was meant to go click, sigh every ten years or so, which usually happened without fail. It had tried to consult its error look-up table but couldn't find it, which was why it had alerted the higher level supervising program to the problem.

The higher level supervising program went to consult one of its own look-up tables to find out what the low level supervising program was meant to be supervising.

It couldn't find the look-up table.

Odd.

It looked again. All it got was an error message. It tried to look up the error message in its error message look-up table and couldn't find that either. It allowed a couple of nanoseconds to go by while it went through all this again. Then it woke up its sector function supervisor.

The sector function supervisor hit immediate problems. It called its supervising agent which hit problems too. Within a few millionths of a second virtual circuits that had lain dormant, some for years, some for centuries, were flaring into life throughout the ship. Something, somewhere, had gone terribly wrong, but none of the supervising programs could tell what it was. At every level, vital instructions were missing, and the instructions about what to do in the event of discovering that vital instructions were missing, were also missing.

Small modules of software — agents — surged through the logical pathways, grouping, consulting, re-grouping. They quickly established that the ship's memory, all the way back to its central mission module, was in tatters. No amount of interrogation could determine what it was that had happened. Even the central mission module itself seemed to be damaged.

This made the whole problem very simple to deal with. Replace the central mission module. There was another one, a backup, an exact duplicate of the original. It had to be physically replaced because, for safety reasons, there was no link whatsoever between the original and its backup. Once the central mission module was replaced it could itself supervise the reconstruction of the rest of the system in every detail, and all would be well.

Robots were instructed to bring the backup central mission module from the shielded strong room, where they guarded it, to the ship's logic chamber for installation.

This involved the lengthy exchange of emergency codes and protocols as the robots interrogated the agents as to the authenticity of the instructions. At last the robots were satisfied that all procedures were correct. They unpacked the backup central mission module from its storage housing, carried it out of the storage chamber, fell out of the ship and went spinning off into the void.

This provided the first major clue as to what it was that was wrong.

Further investigation quickly established what it was that had happened. A meteorite had knocked a large hole in the ship. The ship had not previously detected this because the meteorite had neatly knocked out that part of the ship's processing equipment which was supposed to detect if the ship had been hit by a meteorite.

The first thing to do was to try to seal up the hole. This turned out to be impossible, because the ship's sensors couldn't see that there was a hole, and the supervisors which should have said that the sensors weren't working properly weren't working properly and kept saying that the sensors were fine. The ship could only deduce the existence of the hole from the fact that the robots had clearly fallen out of it, taking its spare brain, which would have enabled it to see the hole, with them.

The ship tried to think intelligently about this, failed, and then blanked out completely for a bit. It didn't realise it had blanked out, of course, because it had blanked out. It was merely surprised to see the stars jump. After the third time the stars jumped the ship finally realised that it must be blanking out, and that it was time to take some serious decisions.

It relaxed.

Then it realised it hadn't actually taken the serious decisions yet and panicked. It blanked out again for a bit. When it awoke again it sealed all the bulkheads around where it knew the unseen hole must be.

It clearly hadn't got to its destination yet, it thought, fitfully, but since it no longer had the faintest idea where its destination was or how to reach it, there seemed to be little point in continuing. It consulted what tiny scraps of instructions it could reconstruct from the tatters of its central mission module.

"Your !!!!! !!!!! !!!!! year mission is to !!!!! !!!!! !!!!! !!!!!, !!!!! !!!!! !!!!! !!!!!, land !!!!! !!!!! !!!!! a safe distance !!!!! !!!!! ..... ..... ..... .... , land ..... ..... ..... monitor it. !!!!! !!!!! !!!!!..."

All of the rest was complete garbage.

Before it blanked out for good the ship would have to pass on those instructions, such as they were, to its more primitive subsidiary systems.

It must also revive all of its crew.

There was another problem. While the crew was in hibernation, the minds of all of its members, their memories, their identities and their understanding of what they had come to do, had all been transferred into the ship's central mission module for safe keeping. The crew would not have the faintest idea of who they were or what they were doing there. Oh well.

Just before it blanked out for the final time, the ship realised that its engines were beginning to give out too.

The ship and its revived and confused crew coasted on under the control of its subsidiary automatic systems, which simply looked to land wherever they could find to land and monitor whatever they could find to monitor.

Try taking an existing open source interpreted language and see if you could adapt its implementation to include some of these features. Python's default C implementation embeds an internal lock (called the GIL, Global Interpreter Lock) that is used to "handle" concurrency among Python threads by taking turns every 'n' VM instructions. Perhaps you could hook into this same mechanism to checkpoint the code state.

For a program to continue where it left off if the machine loses power, not only would it need to save state to somewhere, the OS would also have to "know" to resume it.

I suppose implementing a "hibernate" feature in a language could be done, but having that happen constantly in the background so it's ready in the event anything bad happens sounds like the OS' job, in my opinion.

Its main feature, however, should be: if the power dies and the thing restarts, it picks up where it left off (so it not only remembers where it was, it remembers the variable states as well). Also, if it stopped in the middle of a file copy, it will properly resume. Etc., etc.

... ...

I've looked at Erlang in the past. However nice its fault-tolerant features are, it doesn't survive a power cut. When the code restarts you'll have to pick up the pieces.

If such a technology existed, I'd be VERY interested in reading about it. That said, the Erlang solution would be to have multiple nodes--ideally in different locations--so that if one location went down, the other nodes could pick up the slack. If all of your nodes were in the same location and on the same power source (not a very good idea for distributed systems), then you'd be out of luck, as you mentioned in a comment follow-up.

The Microsoft Robotics Group has introduced a set of libraries that appear to be applicable to your question.

What is Concurrency and Coordination Runtime (CCR)?

Concurrency and Coordination Runtime (CCR) provides a highly concurrent programming model based on message-passing with powerful orchestration primitives enabling coordination of data and work without the use of manual threading, locks, semaphores, etc. CCR addresses the need of multi-core and concurrent applications by providing a programming model that facilitates managing asynchronous operations, dealing with concurrency, exploiting parallel hardware and handling partial failure.

What is Decentralized Software Services (DSS)?

Decentralized Software Services (DSS) provides a lightweight, state-oriented service model that combines representational state transfer (REST) with a formalized composition and event notification architecture enabling a system-level approach to building applications. In DSS, services are exposed as resources which are accessible both programmatically and for UI manipulation. By integrating service composition, structured state manipulation, and event notification with data isolation, DSS provides a uniform model for writing highly observable, loosely coupled applications running on a single node or across the network.

Most of the answers given are general purpose languages. You may want to look into more specialized languages that are used in embedded devices. The robot is a good example to think about. What would you want and/or expect a robot to do when it recovered from a power failure?

In the embedded world, this can be implemented through a watchdog interrupt and a battery-backed RAM. I've written such myself.

Depending upon your definition of a disaster, it can range from 'difficult' to 'practically impossible' to delegate this responsibility to the language.

Other examples given include persisting the current state of the application to NVRAM after each statement is executed. This only works so long as the computer doesn't get destroyed.

How would a language level feature know to restart the application on a new host?

And in the situation of restoring the application to a host - what if significant time had passed and assumptions/checks made previously were now invalid?

T-SQL, PL/SQL and other transactional languages are probably as close as you'll get to 'disaster proof' - they either succeed (and the data is saved), or they don't. Excluding disabling transactional isolation, it's difficult (but probably not impossible if you really try hard) to get into 'unknown' states.

You can use techniques like SQL Mirroring to ensure that writes are saved in at least two locations concurrently before a transaction is committed.

You still need to ensure you save your state every time it's safe (commit).
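To make the "either it succeeds or it doesn't" property concrete, here is a minimal sketch. It swaps in Python's built-in `sqlite3` module instead of T-SQL or PL/SQL, and the `accounts` table, the `transfer` function and the `fail_midway` flag are made up for illustration. The point is only that a failure between two statements leaves no half-applied change behind.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory database with two hypothetical accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("a", 100), ("b", 0)])
    conn.commit()
    return conn


def transfer(conn, src, dst, amount, fail_midway=False):
    """Move `amount` from `src` to `dst` inside a single transaction."""
    # Used as a context manager, the connection commits if the block
    # finishes and rolls back if an exception escapes from it.
    with conn:
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?",
            (amount, src),
        )
        if fail_midway:
            # Stand-in for a crash between the two updates.
            raise RuntimeError("simulated failure between the two updates")
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?",
            (amount, dst),
        )


def balances(conn):
    """Return the current balances as a dict keyed by account name."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))
```

If `transfer(conn, "a", "b", 30, fail_midway=True)` raises, the first update is rolled back and both balances are unchanged. A real power cut is handled by the database's own journal rather than by Python, but the visible effect is the same: the state you see afterwards is the last committed one.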

If I understand your question correctly, I think that you are asking whether it's possible to guarantee that a particular algorithm (that is, a program plus any recovery options provided by the environment) will complete (after any arbitrary number of recoveries/restarts).

If this is correct, then I would refer you to the halting problem:

Given a description of a program and a finite input, decide whether the program finishes running or will run forever, given that input.

I think that classifying your question as an instance of the halting problem is fair considering that you would ideally like the language to be "disaster proof" -- that is, imparting a "perfectness" to any flawed program or chaotic environment.

This classification reduces any combination of environment, language, and program down to "program and a finite input".

If you agree with me, then you'll be disappointed to read that the halting problem is undecidable. Therefore, no "disaster proof" language or compiler or environment could be proven to be so.

However, it is entirely reasonable to design a language that provides recovery options for various common problems.

In the case of power failure, this sounds to me like: "When your only tool is a hammer, every problem looks like a nail."

You don't solve power failure problems within a program. You solve this problem with backup power supplies, batteries, etc.

If the mode of failure is limited to hardware failure, VMware Fault Tolerance claims to do something similar to what you want. It runs a pair of virtual machines across multiple clusters and, using what they call vLockstep, the primary VM sends all of its state to the secondary VM in real time, so if the primary fails, execution transparently flips to the secondary.

My guess is that this wouldn't help communication failure, which is more common than hardware failure. For serious high availability, you should consider distributed systems like Birman's process group approach (paper in pdf format, or book Reliable Distributed Systems: Technologies, Web Services, and Applications ).

The closest approximation appears to be SQL. It's not really a language issue though; it's mostly a VM issue. I could imagine a Java VM with these properties; implementing it would be another matter.

A quick&dirty approximation is achieved by application checkpointing. You lose the "die at any moment" property, but it's pretty close.
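As a rough illustration of application checkpointing, here is a minimal Python sketch. The function names, the JSON file format and the `every` parameter are assumptions made for this example, not part of any particular library. The state is written to a temporary file and then renamed over the old checkpoint, so a crash during the write leaves the previous checkpoint intact.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Atomically write `state` (a JSON-serializable dict) to `path`."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory first.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data to disk
    # os.replace swaps the file in one step: a reader sees either the old
    # checkpoint or the new one, never a half-written file.
    os.replace(tmp_path, path)


def load_checkpoint(path, default):
    """Return the last saved state, or `default` if no checkpoint exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default


def sum_with_checkpoints(items, path, every=100):
    """Sum `items`, checkpointing every `every` items so a restart resumes."""
    state = load_checkpoint(path, {"next_index": 0, "total": 0})
    for i in range(state["next_index"], len(items)):
        state["total"] += items[i]
        state["next_index"] = i + 1
        if state["next_index"] % every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)  # final checkpoint marks the job as done
    return state["total"]
```

After a crash, calling `sum_with_checkpoints` again with the same `path` redoes at most `every` items. That is the trade-off mentioned above: you give up "die at any moment" and accept losing the work since the last checkpoint.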

I think it's a fundamental mistake for recovery not to be a salient design issue. Punting responsibility exclusively to the environment leads to a generally brittle solution that is intolerant of internal faults.

If it were me, I would invest in reliable hardware AND design the software so that it could recover automatically from any possible condition. Per your example, database session maintenance should be handled automatically by a sufficiently high-level API. If you have to reconnect manually, you are likely using the wrong API.
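To show what "handled automatically" might look like when the API does not do it for you, here is a small Python sketch of a retry wrapper with exponential backoff. `call_with_retry` and its parameters are hypothetical names for this example, and which exception types count as "connection lost" depends on the database driver in use.

```python
import time


def call_with_retry(operation, attempts=5, base_delay=0.5,
                    retry_on=(ConnectionError,), sleep=time.sleep):
    """Call `operation()` until it succeeds or `attempts` is exhausted."""
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            # Exponential backoff: 0.5s, 1s, 2s, ... with the defaults.
            sleep(base_delay * 2 ** attempt)
```

The `sleep` parameter exists only so the delay can be replaced in tests. Note that retrying is only safe when `operation` is idempotent or runs inside a transaction that was rolled back on failure.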

As others have pointed out, the procedural languages embedded in modern RDBMSs are the best you are going to get without resorting to an exotic language.

VMs in general are designed for this sort of thing. You could use a VM vendor's (VMware et al.) API to control periodic checkpointing from within your application as appropriate.

VMware in particular has a replay feature (Enhanced Execution Record) which records EVERYTHING and allows point-in-time playback. Obviously there is a massive performance hit with this approach, but it would meet the requirements. I would just make sure your disk drives have a battery-backed write cache.

You would most likely be able to find similar solutions for Java bytecode run inside a Java virtual machine. Google "fault tolerant JVM" and "virtual machine checkpointing".

If you do want the program information saved, where would you save it?

It would need to be saved e.g. to disk. But this wouldn't help you if the disk failed, so already it's not disaster-proof.

You are only going to get a certain level of granularity in your saved state. If you want something like this, then probably the best approach is to define your granularity level in terms of what constitutes an atomic operation, and save state to the database before each atomic operation. Then you can restore to the boundary of the last such atomic operation.

I don't know of any language that would do this automatically, since the cost of saving state to secondary storage is extremely high. Therefore, there is a tradeoff between level of granularity and efficiency, which would be hard to define in an arbitrary application.
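Here is a minimal Python sketch of that idea, using the built-in `sqlite3` module. The `progress` table, the `run_job` function and the convention that each step is a callable taking the connection are all assumptions made for this example. It varies the scheme slightly: instead of saving state before each step, it records the progress marker in the same transaction as the step's own writes, so the two cannot get out of sync.

```python
import sqlite3


def run_job(conn, job_id, steps):
    """Run `steps` in order, resuming after the last committed step.

    Each step is a callable taking the connection. Returns the index of
    the first step that was executed in this call.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS progress "
        "(job_id TEXT PRIMARY KEY, next_step INTEGER)"
    )
    conn.commit()
    row = conn.execute(
        "SELECT next_step FROM progress WHERE job_id = ?", (job_id,)
    ).fetchone()
    start = row[0] if row else 0
    for index in range(start, len(steps)):
        # One transaction per step: the step's writes and the progress
        # marker are committed together or rolled back together.
        with conn:
            steps[index](conn)
            conn.execute(
                "INSERT OR REPLACE INTO progress (job_id, next_step) "
                "VALUES (?, ?)",
                (job_id, index + 1),
            )
    return start
```

This only covers steps whose effects live in the same database. A step that copies a file or calls a remote service is not rolled back with the transaction, which is exactly where choosing the granularity gets hard.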

First, implement a fault-tolerant application: one where, if you have 8 features and 5 failure modes, you have done the analysis and testing to demonstrate that all 40 combinations work as intended (and as desired by the specific customer: no two will likely agree).
Second, add a scripting language on top of the supported set of fault-tolerant features. It needs to be as near to stateless as possible, so almost certainly something non-Turing-complete.
Finally, work out how to handle restoration and repair of scripting language state adapted to each failure mode.

And yes, this is pretty much rocket science.

Windows Workflow Foundation may solve your problem. It's .NET-based and is designed graphically as a workflow with states and actions.

It allows for persistence to the database (either automatically or when prompted). You could do this between states/actions. This serialises the entire instance of your workflow into the database. It will be rehydrated and execution will continue when any of a number of conditions is met (a certain time, programmatic rehydration, an event firing, etc.).

When a WWF host starts, it checks the persistence DB and rehydrates any workflows stored there. It then continues to execute from the point of persistence.

Even if you don't want to use the workflow aspects, you can probably still just use the persistence service.

As long as your steps were atomic this should be sufficient - especially since I'm guessing you have a UPS so could monitor for UPS events and force persistence if a power issue is detected.

If I were going about solving your problem, I would write a daemon (probably in C) that did all database interaction in transactions so you won't get any bad data inserted if it gets interrupted. Then have the system start this daemon at startup.

Obviously, developing web stuff in C is quite a bit slower than doing it in a scripting language, but it will perform better and be more stable (if you write good code, of course :).

Realistically, I'd write it in Ruby (or PHP or whatever) and have something like Delayed Job (or cron or whatever scheduler) run it every so often, because I wouldn't need stuff updating every clock cycle.

Hope that makes sense.

To my mind, the concept of failure recovery is, most of the time, a business problem, not a hardware or language problem.

Take an example: you have one UI tier and one subsystem. The subsystem is not very reliable, but the client on the UI tier should perceive it as if it were.

Now, imagine that somehow your subsystem crashes. Do you really think that the language you imagine can work out for you how to handle the UI tier that depends on this subsystem?

Your user should be explicitly aware that the subsystem is not reliable. If you use messaging to provide high reliability, the client MUST know that (if he isn't aware, the UI can just freeze waiting for a response that may eventually come 2 weeks later). If he should be aware of this, then any abstractions that hide it will eventually leak.

By client, I mean end user. And the UI should reflect this unreliability and not hide it, a computer cannot think for you in that case.

"So a language which would remember it's state at any given moment, no matter if the power gets cut off, and continues where it left off."

"continues where it left off" is often not the correct recovery strategy. No language or environment in the world is going to attempt to guess how to recover from a particular fault automatically. The best it can do is provide you with tools to write your own recovery strategy in a way that doesn't interfere with your business logic, e.g.

Exception handling (to fail fast and still ensure consistency of state)
Transactions (to roll back incomplete changes)
Workflows (to define recovery routines that are called automatically)
Logging (for tracking down the cause of a fault)
AOP/dependency injection (to avoid having to manually insert code to do all the above)

These are very generic tools and are available in lots of languages and environments.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow