Is it safe to “let it crash”?

Published on 14.09.2013
Pull the plug or “let it crash”?

If something gets out of control, why not just “let it crash”? If you think about critical parts of the infrastructure or critical systems like planes, trains or nuclear reactors – well, obviously you can’t.

Nobody wants their system to crash, but is that necessarily true of all its components? Often it is preferable for a misbehaving control unit or software component to stop working altogether and be replaced instantly by a redundant (or, better, diverse) part, rather than to compromise the whole system for a longer period of time. Of course, the crash then has to be recognized as early as possible and the failed component replaced by a hot spare as fast as possible. Detecting and compensating for a part of the hardware or software that is still actively misbehaving and corrupting the whole system is much more difficult.

The robustness paradigm “let it crash” (LiC) [1] of the Erlang programming language [2] is one approach to isolating and handling local failures. Very lightweight processes enable straightforward concurrency, with communication based solely on message passing. Processes are able to monitor each other and, when a process terminates, to restart it very swiftly. The exception handling method of choice for a worker process is to terminate itself if it is unable to handle the situation locally. Supervisor hierarchies ensure appropriate error responses by starting a different process or by starting a new instance of the terminated one.
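The control flow described above can be imitated in a few lines. The following is only a sketch in Python, not Erlang: a raised exception stands in for a process exit, and the names `WorkerCrash`, `supervise`, `make_flaky_worker` and the `max_restarts` budget are made up for illustration. In Erlang itself this role is played by real processes and supervisors.

```python
class WorkerCrash(Exception):
    """Raised by a worker that cannot handle a situation locally."""


def supervise(start_worker, max_restarts=3):
    """Run a worker; on a crash, start a fresh instance (hypothetical helper).

    Returns (result, restarts). Once the restart budget is used up the
    crash is re-raised, so a supervisor one level higher could react.
    """
    restarts = 0
    while True:
        try:
            return start_worker(), restarts
        except WorkerCrash:
            if restarts >= max_restarts:
                raise  # escalate instead of retrying forever
            restarts += 1


def make_flaky_worker(failures):
    """Build a worker that crashes `failures` times and then succeeds."""
    remaining = {"count": failures}

    def worker():
        if remaining["count"] > 0:
            remaining["count"] -= 1
            raise WorkerCrash("cannot handle this locally")
        return "measurement delivered"

    return worker


result, restarts = supervise(make_flaky_worker(2))
print(result, restarts)
```

Note that the worker contains no recovery code at all. Everything that deals with the failure lives in `supervise`, which is the separation the paradigm is after.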

But what about robustness and safety? Apparently you can’t have both at the same time, at least not to the extreme. If safety – especially of human life – is at stake, a critical system has to react very fast, for example by immediately shutting down completely or by stopping all unnecessary operation. In most cases this does not delight users.

A perfectly robust system on the other hand will try to recover from failure at any cost and thus might even delay some safety critical reaction.

So, is it even possible to apply a robustness paradigm like Erlang’s “let it crash” to safety-related software projects?

Within Zühlke’s Emerging Technologies Center we did an internal research project [3] on this topic. Have a look at the report to get a detailed answer.

Proof Of Concept Scenario

As a proof of concept, Erlang’s LiC is used to implement safety-related functionality in a close-to-reality scenario. The technical background is a fictional telemonitoring system that enables home treatment. Wearing so-called “functional clothing” (garments that incorporate wirelessly connected sensors as well as a power source), patients can be monitored without being restricted in their movement by cables or probes. As far as their condition allows, they can move freely at home while their vital signs are checked constantly.

“Let it crash” – proof of concept scenario

Each sensor measures either heart rate, breathing rate or blood pressure, and several redundant sensors of each kind exist in order to compensate for sensor failures and invalid measurement positions. Because the power supply is limited (even though body movement and body heat may be used for recharging), only one sensor of each kind is powered and active at any point in time. Measured data is transferred via a short-range wireless connection (IEEE 802.15.4) to a base station located in the same house or room as the patient. This device maintains a constant connection with all active sensors, is able to power them on and off, and switches to an alternative measurement location if the need arises (failure, implausible data). At the same time the base station connects to the hospital system via a dedicated line and continuously transfers the current vital signs for automated evaluation and storage. As a redundant measure, in addition to the hospital system, the device itself is also able, to a limited extent, to detect critical situations as well as major measurement failures. In this case it raises an audio-visual alarm, notifies the hospital via the dedicated line and, using a GSM modem, sends a short message or places an emergency call to a configurable phone number.

The subject of the LiC proof of concept is the software development for the base station. The project focuses on the highest possible reliability of measurement data acquisition and of the transfer of the patient’s vital signs to the hospital system. The maximum number of simultaneously active sensors has to be enforced in order to support a fail-safe power supply. At the same time a minimum temporal coverage of the most important vital signs has to be guaranteed: at every point in time at least two out of three critical values (heart rate, breathing rate and blood pressure) have to be available.

Implementation

Parts of the fictional scenario are implemented as a prototype PC application in Erlang. This makes it straightforward to deploy different setups of worker processes and supervisors and to evaluate the separation of business logic from error handling.

The development concentrates on the software of the base station and just simulates the external sensors on the one side and the hospital system on the other. The diagram depicts one example setup, showing a structural view of the available processes and their dependencies.

Structural view of processes and dependencies

The generic supervisor hierarchy is solely responsible for creating the worker processes (sensor drivers and data collector) and for handling errors by restarting or replacing terminated processes.

The sensor drivers and the data collector, on the other hand, contain the core functionality (business logic) and no error handling at all. If sensor values are missing, for example, the sensor driver just terminates and gets replaced. The same happens if data is available but the values are outside valid physical limits.
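To make the split between business logic and error handling more concrete, here is a small Python sketch (again not the project’s Erlang code). The limits in `PHYSICAL_LIMITS` are illustrative assumptions, not values from the project, and `DriverTerminated`, `sensor_driver` and `first_valid_reading` are hypothetical names. Only two of the three vital signs are listed to keep the example short.

```python
# Assumed plausibility limits, purely illustrative (not from the project).
PHYSICAL_LIMITS = {
    "heart_rate": (20, 250),      # beats per minute
    "breathing_rate": (4, 60),    # breaths per minute
}


class DriverTerminated(Exception):
    """Stands in for the exit of an Erlang driver process."""


def sensor_driver(kind, raw_value):
    """Business logic only: forward a valid reading or terminate."""
    if raw_value is None:
        raise DriverTerminated(f"{kind}: no sensor value")
    low, high = PHYSICAL_LIMITS[kind]
    if not low <= raw_value <= high:
        raise DriverTerminated(f"{kind}: {raw_value} outside physical limits")
    return (kind, raw_value)


def first_valid_reading(kind, redundant_readings):
    """Supervisor side: replace a terminated driver by one on the next sensor."""
    for raw_value in redundant_readings:
        try:
            return sensor_driver(kind, raw_value)
        except DriverTerminated:
            continue  # start a new driver for the next redundant sensor
    raise DriverTerminated(f"{kind}: all redundant sensors failed")


# First sensor delivers nothing, second an implausible value, third is fine.
print(first_valid_reading("heart_rate", [None, 400, 72]))
```

The driver never tries to repair anything. Deciding what happens after a termination is left entirely to the supervising side.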

Testing fault-tolerance

A simple and effective way of testing fault tolerance is based on a so-called “Chaos Monkey” [4]. This is the name for one or more processes that are injected into the system under test with the sole task of randomly terminating other system processes. In traditional systems with a small number of complex tasks or threads this typically leads to complete failure within a very short period of time. In systems following the LiC philosophy it only triggers the process monitoring and thus a fast replacement of the terminated software part.

Even while the chaos monkey keeps deactivating random components, a properly designed system is able to maintain basic functionality.
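The idea can be sketched as a toy model in Python. This is an assumption-laden illustration, not the test setup used in the project: workers are plain flags rather than processes, and the `Supervisor` class, the `chaos_monkey` function and the worker names are invented for the example.

```python
import random


class Supervisor:
    """Keeps one live worker per name and restarts any terminated one."""

    def __init__(self, names):
        self.alive = {name: True for name in names}
        self.restarts = 0

    def kill(self, name):
        self.alive[name] = False

    def heal(self):
        """Monitoring step: replace every terminated worker."""
        for name, is_alive in self.alive.items():
            if not is_alive:
                self.alive[name] = True
                self.restarts += 1


def chaos_monkey(supervisor, rounds, rng):
    """Terminate one random worker per round, then let the supervisor react."""
    for _ in range(rounds):
        victim = rng.choice(sorted(supervisor.alive))
        supervisor.kill(victim)
        supervisor.heal()


sup = Supervisor(["heart_driver", "breath_driver", "pressure_driver", "collector"])
chaos_monkey(sup, rounds=100, rng=random.Random(42))
print(all(sup.alive.values()), sup.restarts)
```

A fixed seed keeps the run reproducible. In a real test the interesting part is what the system still delivers between a kill and the corresponding restart, which this toy model does not capture.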

Results and Conclusions

At the beginning of the proof-of-concept project some application hypotheses were proposed, based on specific software failure modes derived from IEC 60812 examples. Revisiting those hypotheses at the end of the project, the following can be stated regarding the applicability of the “let it crash” paradigm in a safety-related environment:

  1. Hypothesis: LiC and supervisor hierarchies make it possible to ensure the execution of critical functions.
    Ill-performing tasks and groups of tasks are stopped and restarted, no matter the cause. For instance, upon malfunction a sensor driver gets terminated and replaced by another one. This of course does not ensure a correct execution per se.
  2. Hypothesis: LiC and supervisor hierarchies make it possible to detect and prevent the unintended execution of a function.
    When a functional monitor detects a worker executing an unintended function, this worker gets terminated and replaced, thereby preventing the execution. For instance, the calibration of a sensor is aborted when there is another calibration request of higher priority, in order to keep the required number of sensors operational at a time.
  3. Hypothesis: LiC and supervisor hierarchies allow defining and monitoring the conditions for carrying out a critical function.
    Workers and functional monitors can check task execution and results by means of distinct validation checks. From a LiC perspective this excludes any measures to correct the situation besides restarting affected processes. For instance, a sensor driver validates the data received from its connected sensor before forwarding it to the collector. If a violation is detected, the driver terminates itself so that the supervisor can start another driver, which in turn can connect to another physical sensor.
  4. Hypothesis: In spite of concurrency, carrying out critical functions at a specific instant of time and in specific order is ensured.
    Leaving real-time requirements aside, conflicts within well-ordered task sequences can be resolved by terminating blocking processes that violate the order or time constraint at hand. For instance, updating dedicated timestamps for sensor calibration establishes an order. Thus livelocks in concurrent calibration can be prevented, since only one sensor is allowed to calibrate at a time, and calibration of any sensor type is guaranteed within a distinct time interval.
  5. Hypothesis: Unexpected component failures at random points in time as well as erroneous input either do not influence the correct system behavior (reduced quality of service is accepted) or result in the system switching to a safe state.
    Malfunctioning processes are immediately replaced by new ones. Fatal function loss immediately results in system shutdown. For instance, the patient monitoring system is robust with regard to sporadic process crashes as well as to the complete loss of one sensor data type. Losing more than one sensor type however is considered fatal and immediately results in system shutdown.
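The rule from hypothesis 5 (one lost sensor type is tolerated, more than one is fatal) can be written down as a tiny decision function. This is a Python sketch under assumptions: the state names "normal", "degraded" and "shutdown" as well as the identifiers `CRITICAL_KINDS` and `system_state` are chosen here for illustration and do not come from the project.

```python
CRITICAL_KINDS = {"heart_rate", "breathing_rate", "blood_pressure"}


def system_state(available_kinds):
    """Map the currently available vital-sign types to a system state."""
    lost = CRITICAL_KINDS - set(available_kinds)
    if not lost:
        return "normal"
    if len(lost) == 1:
        return "degraded"  # reduced quality of service is accepted
    return "shutdown"      # fatal function loss: switch to the safe state


print(system_state({"heart_rate", "blood_pressure"}))
```

Keeping this decision in one place, outside the workers, mirrors the idea that individual processes only crash, while the judgement about whether the system as a whole is still safe is made one level above them.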

The missing hard real-time abilities of Erlang pose a problem when it comes to time-critical safety applications. There are strategies to solve this issue, though. In the future those solutions have to be analyzed and developed further. Regarding the applicability of LiC for safety critical systems the underlying Erlang language features have to be evaluated against safety standards like IEC 61508-3. Some research is necessary on whether it is possible to apply LiC in embedded systems without Erlang. Analyzing the underlying language features and corresponding counterparts in other languages or frameworks will provide the necessary information.

References

  1. Armstrong, J.: Making Reliable Distributed Systems in the Presence of Software Errors.
    Ph.D. dissertation, Royal Institute of Technology, Stockholm (2003)
  2. Erlang. http://www.erlang.org
  3. Trzeciecki, M., Schwedes, F., Woskowski, C.: Assessing the Applicability of the Let-it-Crash Paradigm for Safety-Related Software Development
  4. Chaos Monkey. http://techblog.netflix.com/2012/07/chaos-monkey-released-into-wild.html
