Monday, May 19, 2014

Task Death In Safety Critical Systems

Safety critical systems must be designed with an expectation that one or more tasks on the same CPU might fail via hanging, might fail to be scheduled, or otherwise might not execute in a periodic manner as intended, leading to a partial software failure.

Consequences:
Failing to monitor each and every task and failing to mitigate task execution faults can lead to a task “dying” without being recovered. If a task dies, it no longer performs its functions even though other tasks are still operational. This can leave the system in an uncontrolled state, cause loss of fail-safe functionality (if the fail-safe is in a dead task), cause stale recorded failure data (if the task collecting that data dies), etc.

Accepted Practice:
  • Each task must be continually monitored for completion within a defined period. This ensures that the task is scheduled and that it is completed on time. A properly designed watchdog timer that monitors all tasks is an accepted way to accomplish this.
  • If a task fails to periodically complete execution, fault mitigation must be performed. Common fault mitigation techniques include restarting the task or restarting the system. 
Discussion

Embedded systems typically have one or more periodic software tasks that must complete execution within some maximum period (each task has its own deadline, and by default the deadline equals the period unless some other deadline is specified). In a properly designed system, careful analysis and testing has been performed to ensure that each task will continue operation and meet its deadline even under worst case operating conditions.

However, competent safety-critical embedded system designers realize that faults may occur, resulting in a partial failure of the software system, and plan accordingly. Each task must be monitored to ensure the following: each task executes at least once in every period that it is supposed to execute; each task finishes execution without abnormally terminating; and each task completes execution within its predetermined deadline. This means that the system must not only make sure that once a task is scheduled to run it meets its deadline, but also make sure that the task is scheduled to run in the first place.

As an analogy, consider if you have a set of things you need to do every evening, but have a bad memory. One way you might handle this is to put post-it notes on your fridge to remind yourself what needs to be done every evening before you go to sleep. Let’s say you’re very methodical, so you take each note down in turn, do the action, then put it back on the fridge, in order. One of the notes says “check that basement door is locked” because you’ve found the kids have a habit of forgetting to lock it when they come in from playing. For a while this works fine. But one day you get distracted while locking the basement door and set the note down, forgetting to put it back on the fridge. The next night you might forget to check the door and not even realize it (at least not until you wake up in the middle of the night wondering about it – but embedded CPUs don’t have that feature!). What has happened is that a disruption caused one of your tasks to fall off the scheduling list. Just because you finished the entire list before morning doesn’t mean everything got done, because something was missing from the list. The same thing can happen in a computer if one of its tasks dies and doesn’t get put back on the task to-do list – the computer “thinks” it is running all its tasks, but in reality one of the tasks hasn’t been run because it isn’t even on the to-do list.

Selected Sources

Kleidermacher points out that tasks can die, saying “When a thread faults (for example, due to a stack overflow), the kernel should provide some mechanism whereby notification can be sent to the supervisor thread. If necessary, the supervisor can then make a system call to close down the faulted thread, or the entire process, and restart it. The supervisor might also be hooked into a software ‘watchdog’ setup, whereby thread deadlocks and starvation can be detected as well.” (Kleidermacher 2001, pg. 23).

Ganssle refers to the necessity of checking all tasks by saying “This problem multiplies in a system with an RTOS, as a reliable watchdog monitors all of the tasks. If some of the tasks die but others stay alive – perhaps tickling the [Watchdog Timer] – then the system’s operation is at best degraded.”  (Ganssle 2000, p. 125)

Tasks can be expected to die due to a variety of causes, and this is especially likely to happen in a system without hardware support for memory protection, such as OSEK, because software from one task can accidentally corrupt another task's data. Task death is an expected possibility in such an RTOS. For example, a technical summary of the OSEK operating system shows “Failed” as a task state, indicating that tasks can be expected to fail and need to be restarted to restore system operation. (Feiler 2003, Fig 1, shown below).


OSEK tasks can be expected to fail, requiring a task restart. (Feiler 2003, Fig. 1.)


Another contemporaneous real time operating system, VxWorks, did not support memory protection and did not support automatic task restart. As a result, it was dramatically more prone to subtle but significant partial software failures if software applications misbehaved (Koopman 2008, pp. 220-221, recounting work done several years earlier).  In particular, parts of the system would appear to be working, but key pieces of the system would have failed, giving a superficial appearance of a system that was working (the keyboard accepted commands and displayed them on the computer screen) when in fact the system was mostly dead, and programs could not be run on it. (VxWorks has since added memory protection capabilities, so these experimental results do not necessarily represent current VxWorks systems.)

Safety critical system designers should reasonably expect that a task might die (cease to finish computation in a periodic manner). They should have complete analysis to determine which task deaths might lead to an unsafe system (by default, this is all tasks in the system), and should take proactive measures to detect and recover from one or more tasks dying, regardless of the underlying cause -- even if no specific cause for the task death can be imagined by the designers.

References:
  • Feiler, Real-time application development with OSEK: a review of the OSEK standards, Carnegie Mellon University Software Engineering Institute, CMU/SEI-2003-TN-004, Nov 2003.
  • Ganssle, J., The Art of Designing Embedded Systems, Newnes, 2000.
  • Kleidermacher, D. & Griglock, M., Safety-Critical Operating Systems, Embedded Systems Programming, Sept. 2001, pp. 22-36.
  • Koopman, P., DeVale, K. & DeVale, J., "Interface robustness testing: experiences and lessons learned from the Ballista Project," In: Kanoun, K. & Spainhower, L., Eds., Dependability Benchmarking for Computer Systems, IEEE Press, 2008, pp. 201-226.


No comments:

Post a Comment

Please send me your comments. I read all of them, and I appreciate them. To control spam I manually approve comments before they show up. It might take a while to respond. I appreciate generic "I like this post" comments, but I don't publish non-substantive comments like that.

If you prefer, or want a personal response, you can send e-mail to comments@koopman.us.
If you want a personal response please make sure to include your e-mail reply address. Thanks!

Job and Career Advice

I sometimes get requests from LinkedIn contacts about help deciding between job offers. I can't provide personalize advice, but here are...