Monday, May 5, 2014

Mitigating Data Corruption

As previously discussed, data corruption from bit flips can happen from a variety of sources. The results from such faults can be catastrophic. An effective technique should be used to detect both software- and hardware-caused corruption of critical data. Effective techniques include error coding and data replication in various forms.

Consequences:
Unintentional modification of data values can cause arbitrarily bad system behavior, even if only one bit is modified. Therefore, safety critical systems should take measures to prevent or detect such corruptions, consistent with the estimated risk of hardware-induced data corruption and anticipated software quality. Note that even best practices may not prevent corruption due to software concurrency defects.

Accepted Practices:

  • When hardware data corruption may occur and automatic error correction is desired, use a hardware single error correction/multiple error detection (SECMED) form of Error Detection and Correction (EDAC) circuitry, sometimes just called Error Correcting Code (ECC) circuitry, for all bytes of RAM. This protects against hardware memory corruption, including hardware corruption of operating system variables. However, it does not protect against software memory corruption.
  • Use a software approach such as a cyclic redundancy code (CRC, preferred) or a checksum to detect a corrupted program image, and test for corruption at least when the system is booted.
  • Use a software approach such as keeping redundant copies to detect software data corruption of RAM values.
  • Use fault injection to test data corruption detection and correction mechanisms.
  • Perform memory tests to ensure there are no hard faults.
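For the memory test item above, here is a minimal C sketch of a destructive pattern test for stuck-at (hard) faults. It is illustrative only, under the assumption that the tested region is not currently in use; production systems typically use more thorough March tests and take care to preserve memory contents.

    #include <stdint.h>
    #include <stddef.h>

    /* Write complementary patterns to each word and read them back.
       Returns 0 if no hard faults were found, -1 otherwise. */
    static int ram_test(volatile uint32_t *start, size_t words)
    {
        static const uint32_t patterns[] =
            { 0x55555555u, 0xAAAAAAAAu, 0x00000000u, 0xFFFFFFFFu };

        for (size_t p = 0; p < sizeof(patterns) / sizeof(patterns[0]); p++) {
            for (size_t i = 0; i < words; i++) {
                start[i] = patterns[p];          /* write test pattern */
            }
            for (size_t i = 0; i < words; i++) {
                if (start[i] != patterns[p]) {
                    return -1;                   /* stuck-at fault detected */
                }
            }
        }
        return 0;                                /* no hard faults found */
    }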

Discussion:

Safety critical systems must protect against data corruption because even a small change in data can render the system unsafe. A single bit in memory changing in the wrong way could be enough to move a system from safe to unsafe operation. To guard against this, various schemes for memory corruption detection and prevention are used.


Hardware and Software Are Both Corruption Sources

Hardware memory corruption occurs when a radiation event, voltage fluctuation, source of electrical noise, or other cause makes one or more bits flip from one value to another. In non-volatile memory such as flash memory, wearout, low programming voltage, or electrical charge leakage over time can also cause bits in memory to have an incorrect value. Mitigation techniques for these types of memory errors include the use of hardware error detection/correction codes (sometimes called “EDAC”) for RAM, and typically the use of a “checksum” for flash memory to ensure that all the bytes, when “added up,” give the same total as they did when the program image was stored in flash.
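As a rough illustration of the flash checksum idea, here is a minimal C sketch that adds up the bytes of a stored program image and compares the total against a value recorded when the image was programmed. The function and parameter names are illustrative assumptions; in practice a CRC is preferred because it detects more error patterns, but the check-at-boot structure is the same.

    #include <stdint.h>
    #include <stddef.h>

    /* Simple additive checksum over a program image stored in flash. */
    static uint32_t image_checksum(const uint8_t *image, size_t length)
    {
        uint32_t sum = 0;
        for (size_t i = 0; i < length; i++) {
            sum += image[i];
        }
        return sum;
    }

    /* At boot, recompute the checksum and compare it to the value that was
       recorded when the image was programmed into flash (stored_sum). */
    static int image_is_intact(const uint8_t *image, size_t length,
                               uint32_t stored_sum)
    {
        return image_checksum(image, length) == stored_sum;
    }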

If hardware memory error detection support is not available, RAM can also be protected with some sort of redundant storage. A common practice is to store two copies of a value in two different places in RAM, often with one copy inverted or otherwise manipulated. It's important not to store the two copies next to each other, so that an error which corrupts adjacent bits in memory cannot damage both copies at once. Rather, there should be two entirely different sections of memory for mirrored variables, with each section holding only one copy of each mirrored variable. That way, if a small chunk of memory is arbitrarily corrupted, it can at most affect one of the two copies of any mirrored variable. Error detection codes such as checksums can also be used; compared to simple replication of data, they trade increased computation time whenever a value is changed for less storage space devoted to error detection information.
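One way to picture that tradeoff is a hypothetical sketch that groups a few critical variables together and keeps a checksum over the group, recomputing it on every write and verifying it before every read. The variable names and grouping are invented for illustration; a CRC over the group would give stronger detection than the simple sum shown here.

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical group of critical variables protected by one checksum,
       using less storage than mirroring every variable at the cost of
       recomputing the checksum on each update. */
    struct critical_group {
        uint16_t commanded_speed;
        uint16_t valve_position;
        uint32_t checksum;          /* covers the fields above */
    };

    static uint32_t group_checksum(const struct critical_group *g)
    {
        const uint8_t *p = (const uint8_t *)g;
        uint32_t sum = 0;
        for (size_t i = 0; i < offsetof(struct critical_group, checksum); i++) {
            sum += p[i];            /* sum only the protected bytes */
        }
        return sum;
    }

    /* Recompute after every write ... */
    static void group_update_speed(struct critical_group *g, uint16_t speed)
    {
        g->commanded_speed = speed;
        g->checksum = group_checksum(g);
    }

    /* ... and verify before every read. Nonzero means the group is intact. */
    static int group_is_valid(const struct critical_group *g)
    {
        return group_checksum(g) == g->checksum;
    }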

Software memory corruption occurs when one part of a program mistakenly writes data to a place that should only be written to by another part of the program due to a software defect. This can happen as a result of a defect that produces an incorrect pointer into memory, due to a buffer overflow (e.g., trying to put 17 bytes of data into a 16 byte storage area), due to a stack overflow, or due to a concurrency defect, among other scenarios.

Hardware error detection does not help in detecting software memory corruption, because the hardware will ordinarily assume that software has permission to make any change it likes. (There may be exceptions if hardware has a way to “lock” portions of memory from modifications, which is not the case here.) Software error detection may help if the corruption is random, but may not help if the corruption is a result of defective software following authorized RAM modification procedures that just happen to point to the wrong place when modifications are made. While various approaches to reduce the chance of accidental data corruption can be envisioned, acceptable practice for safety critical systems in the absence of redundant computing hardware calls for, at a minimum, storing redundant copies of data. There must also be a recovery plan such as system reboot or restoration to defaults if a corruption is detected.

Data Mirroring

A common approach to providing data corruption protection is to use a data mirroring approach in which a second copy of a variable having a one’s complement value is stored in addition to the ordinary variable value. A one’s complement representation of a number is computed by inverting all the bits in a number. So this means one copy of the number is stored normally, and the second copy of that same number is stored with all the bits inverted (“complemented”). As an example, if the original number is binary “0000” the one’s complement mirror copy would be “1111.” When the number is read, both the “0000” and the “1111” are read and compared to make sure they are exact complements of each other. Among other things, this approach gives protection against a software defect or hardware corruption that sets a number of RAM locations to all be the same value. That sort of corruption can be detected regardless of the constant corruption value put into RAM, because two mirrored copies can’t have the same value unless at least one of the pair has been corrupted (e.g., if all zeros are written to RAM, both copies in a mirrored pair will have the value “0000,” indicating a data corruption has occurred).

Mirroring can also help detect hardware bit flip corruptions. A bit flip is when a binary value (a 0 or 1) is corrupted to have the opposite value (a 1 or 0 respectively), which in turn corrupts the value of the number stored at the memory location suffering one or more bit flips. So long as only one of the two mirror values suffers a bit flip, that corruption will be detectable because the two copies won’t match properly as exact complements of each other.
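A minimal sketch of what one's complement mirroring might look like in C follows. The variable and function names are illustrative assumptions, and a real implementation would place the two copies in separate memory sections (for example via linker section attributes) rather than as adjacent file-scope variables.

    #include <stdint.h>

    static uint16_t speed_value;        /* normal copy                    */
    static uint16_t speed_value_mirror; /* one's complement (inverted) copy */

    static void speed_write(uint16_t v)
    {
        speed_value        = v;
        speed_value_mirror = (uint16_t)~v;   /* store all bits inverted */
    }

    /* Returns 0 and sets *out if the copies are exact complements; returns
       nonzero if they disagree, in which case the caller must invoke its
       recovery plan (e.g., restore a default value or reset the system). */
    static int speed_read(uint16_t *out)
    {
        if (speed_value != (uint16_t)~speed_value_mirror) {
            return -1;                       /* corruption detected */
        }
        *out = speed_value;
        return 0;
    }

Note that if all of RAM is overwritten with a constant such as zero, both copies become equal rather than complementary, so the read check still reports corruption, as described above.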

A good practice is to ensure that the mirrored values are not adjacent to each other so that an erroneous multi-byte variable update is less likely to modify both the original and mirrored copy. Such mirrored copies are vulnerable to a pair of independent bit flips that just happen to correspond to the same position within each of a pair of complemented stored values. Therefore, for highly critical systems a Cyclic Redundancy Check (CRC) or other more advanced error detection method is recommended.

It is important to realize that all memory values that can conceivably cause a system hazard need to be protected by mirroring, not just a portion of memory. For example, a safety-critical Real Time Operating System will have values in memory that control task scheduling. Corruption of these variables can lead to task death or other problems if the RTOS doesn't protect its data integrity, even if the application software does use mirroring. Note that there are multiple ways for an RTOS to protect its data integrity from software and hardware defects beyond this, such as by using hardware access protection. But if the only mechanism being used in a system to prevent memory corruption is mirroring, the RTOS has to use it too, or the system has a vulnerability.

Selected Sources

Automotive electronics designers knew as early as 1982 that data corruption could be expected in automotive electronics. Seeger writes: “Due to the electrically hostile environment that awaits a microprocessor based system in an automobile, it is necessary to use extra care in the design of software for those systems to ensure that the system is fault tolerant. Common faults that can occur include program counter faults, altered RAM locations, or erratic sensor inputs.” (Seeger 1982, abstract, emphasis added). Automotive designers generally accepted the fact that RAM location disruptions would happen in automotive electronics (due to electromagnetic interference (EMI), radiation events, or other disturbances), and had to ensure that any such disruption would not result in an unsafe system.

Stepner, in a paper on real time operating systems that features a discussion of OSEK (the trade name of an automotive-specific real time operating system), states with regard to avoiding corruption of data: “One technique is the redundant storage of critical variables, and comparison prior to being used. Another is the grouping of critical variables together and keeping a CRC over each group.” (Stepner 1999, pg. 155).

Brown says “We’ve all heard stories of bit flips that were caused by cosmic rays or EMI” and goes on to describe a two-out-of-three voting scheme to recover from variable corruption. (Brown 1998 pp. 48-49). A variation of keeping only two copies permits detection but not correction of corruption. Brown also acknowledges that designers must account for software data corruption, saying “Another, and perhaps more common, cause of memory corruption is a rogue pointer, which can run wild through memory leaving a trail of corrupted variables in its wake. Regardless of the cause, the designer of safety-critical software must consider the threat that sometime, somewhere, a variable will be corrupted.” (id., p. 48).
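A sketch of the kind of two-out-of-three voting Brown describes might look like the following; the names are illustrative and this is not Brown's actual code. Three copies of a value are stored, and as long as only one copy is corrupted the other two still agree, so the fault can be corrected rather than merely detected.

    #include <stdint.h>

    /* Vote over three stored copies of a value. Returns 0 and sets *result
       when at least two copies agree; returns -1 if all three disagree. */
    static int vote_three(uint32_t a, uint32_t b, uint32_t c, uint32_t *result)
    {
        if (a == b || a == c) { *result = a; return 0; }  /* majority includes a */
        if (b == c)           { *result = b; return 0; }  /* majority is b and c */
        return -1;            /* uncorrectable: every copy differs */
    }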

Kleidermacher says: “When all of an application’s threads share the same memory space, any thread could—intentionally or unintentionally— corrupt the code, data, or stack of another thread. A misbehaved thread could even corrupt the kernel’s own code or internal data structures. It’s easy to see how a single errant pointer in one thread could easily bring down the entire system, or at least cause it to behave unexpectedly.” (Kleidermacher 2001, pg. 23). Kleidermacher advocates hardware memory protection, but in the absence of a hardware mechanism, software mechanisms are required to mitigate memory corruption faults.

Fault injection is a way to test systems to see how they respond to faults in memory or elsewhere (see also an upcoming post on that topic). Fault injection can be performed in hardware (e.g., by exposing a hardware circuit to a radiation source or by using hardware test features to modify bit values), or injected via software means (e.g., slightly modifying software to permit flipping bits in memory to simulate a hardware fault). In a research paper, Vinter used a hybrid hardware/software fault injection technique to corrupt bits in a computer running an automotive-style engine control application. The conclusions of this paper start by saying: “We have demonstrated that bit-flips inside a central processing unit executing an engine control program can cause critical failures, such as permanently locking the engine’s throttle at full speed.” (Vinter 2001). Fault injection remains a preferred technique for determining whether there are data corruption vulnerabilities that can result in unsafe system behavior.
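As a rough illustration of the software side of fault injection, a test harness might deliberately flip one bit of a target variable and then exercise the system's corruption checks to see whether they notice. This is a sketch with invented names, not the hybrid setup used by Vinter.

    #include <stdint.h>
    #include <stdio.h>

    /* Flip a chosen bit in a target memory location to simulate a hardware
       bit flip; the detection mechanisms under test are then exercised. */
    static void inject_bit_flip(volatile uint8_t *target, unsigned bit)
    {
        *target ^= (uint8_t)(1u << (bit & 7u));
    }

    int main(void)
    {
        uint8_t protected_value = 0x5A;
        inject_bit_flip(&protected_value, 3);            /* corrupt bit 3 */
        printf("after injection: 0x%02X\n", protected_value);  /* prints 0x52 */
        return 0;
    }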

References:
  • Brown, D., “Solving the software safety paradox,” Embedded Systems Programming, December 1998, pp. 44-52.
  • Kleidermacher, D. & Griglock, M., “Safety-Critical Operating Systems,” Embedded Systems Programming, September 2001, pp. 22-36.
  • Seeger, M., “Fault-Tolerant Software Techniques,” SAE Report 820106, International Congress & Exposition, Society of Automotive Engineers, 1982, pp. 119-125.
  • Stepner, D., Nagarajan, R., & Hui, D., “Embedded application design using a real-time OS,” Design Automation Conference, 1999, pp. 151-156.
  • Vinter, J., Aidemark, J., Folkesson, P. & Karlsson, J., “Reducing critical failures for control algorithms using executable assertions and best effort recovery,” International Conference on Dependable Systems and Networks, 2001, pp. 347-356.


2 comments:

  1. Dear Mr Koopman,

    First of all, thanks a lot for your very interesting blog.
    Regarding this post, I have a question for you about data mirroring and its efficiency/coverage for corruptions due to software faults. Indeed, we can consider this "mechanism" well suited to ensuring tolerance to random hardware failures when, for example, a mechanism such as ECC is not available. Regarding corruption due to software faults, what is your position on the added value/coverage of such a mechanism? For software faults, the strategy could first be based on fault avoidance/removal during the development phase, in particular through code analysis and tools that can detect runtime errors which could lead to memory corruption (pointer issues...). Second, when components with lower integrity levels coexist in the same software, hardware mechanisms such as an MPU could be used. It can be difficult to mirror all the safety-related data of a piece of software, but maybe mirroring could be applied to the most safety-critical data (i.e., data that can lead directly to an unwanted safety-related event, identified through safety analysis).
    Thanks!
    Best regards

  2. Mirroring will help with at least some software faults and some hardware faults. How much it helps depends upon the structure of your software and how you implement mirroring (for example, does the subroutine that implements mirrored variable writing check the task ID to make sure it is being called from a critical task and not a non-critical task?). It also, as you suggest, depends on what fraction of memory values are mirrored.

    I'd be very careful mirroring only some things and claiming that you've covered all the critical data -- it can be very difficult to know what indirect results will happen from corrupting data.

    If you need to go down that path, I'd suggest running some fault injection campaigns to see what happens when variables get corrupted. Don't forget that data on the stack can be corrupted too.

    A significant limitation of mirroring and error detection in general is that some faults happen in the logic and peripherals of the CPU -- not just data stored in RAM.


