Soft errors or transient faults—caused by high-energy particles that lead to an unexpected change in the transistor logic—have long been considered the main reliability challenge for many mission-critical applications. Conventionally, hardware-level soft-error resilience techniques have been employed in mission- and safety-critical systems, like spacecraft and enterprise systems. However, hardware solutions come with high costs owing to the need for system redesign. Software approaches are attractive, as they can deliver flexible and affordable solutions. They can be especially useful for mixed-critical systems, where flexible software techniques can provide reliability based on task requirements.
An ideal error-resilient scheme should provide complete, effective, and timely recovery from soft errors. Some software-level fault tolerance techniques are incomplete because they provide error detection and assume some sort of checkpoint/roll-back for recovery. However, in effort to avoid high error-recovery latency, frequency checkpoints are required, which then impose prohibitive performance overhead. In-application fault-tolerant techniques can potentially eliminate the need for full-system checkpointing and memory replication, while providing efficient and timely error handling by combining both error detection and recovery within the application itself. Unfortunately, existing in-application error-tolerant schemes are significantly weaker than their underlying detection schemes, due to the vulnerability added by complex error recovery routines.
Researchers at Arizona State University have developed a lightweight checkpoint technique for resilience against soft errors. The technique provides effective, safe, and timely soft error detection and recovery using software. Resilience against data flow errors and control flow errors is provided in critical or mixed-critical applications in each basic block or at critical basic blocks. Verified register preservation is provided at each basic block, along with memory preservation checkpoints. In this manner, soft errors are quickly detected and addressed. The register and memory preservation further allows for safe re-execution from recoverable soft errors. Control flow errors can also be detected at the beginning and/or end of each basic block.
This innovation is covered by U.S. Pat. No. 10,997,027.
• Transient fault reduction in digital circuits
• Software-level fault-tolerant techniques
• Mission- and safety-critical systems
Benefits and Advantages
• Verified Register File Preservation: Transformation not only preserves registers’ value into memory (no latent error), but also ensures correct occurrence of preserving process
• Single Memory-Location Checkpointing: Instead of checkpointing entire memory state, the state of each memory location is temporarily preserved before the corresponding writes to those locations
• Safe and Timely Recovery: Instead of performing recovery regardless of error propagation scope, a diagnosis routine is invoked only when safe
Related Publication: InCheck: An In-application Recovery Scheme for Soft Errors