The increasing use of digital systems in everyday life has made reliability a key factor in the design of modern microprocessors. Soft errors, which are caused by high-energy particles, power supply noise, and transistor variability, can modify the logic value stored in microprocessor memory elements and lead to a timing or functional failure. Historically, soft errors were considered a challenge only for high-altitude applications, because most high-energy particles are attenuated by the earth's atmosphere before they reach ground level. However, the problem is now expanding to terrestrial-level particles due to changes in the atmosphere.
Software-level soft-error-tolerant schemes are promising because, unlike hardware-based solutions, they can be applied selectively on commercial off-the-shelf processors: either only to safety- or mission-critical applications, or only to the critical parts of an application.
Researchers at Arizona State University have developed NEMESIS, a novel compiler-level, fine-grained technique for soft error detection, diagnosis, and recovery that can provide a high degree of error resiliency. NEMESIS runs three versions of each computation and detects soft errors by checking the results of all memory write and branch operations. In the case of a mismatch, the NEMESIS recovery routine removes the effect of the error from the architectural state of the program and resumes normal execution.
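The general idea of triplicated computation with checks at stores and branches can be illustrated with a small sketch. This is not the NEMESIS implementation, which operates at the compiler level on real machine instructions. The names `check`, `running_sums`, and `fault_at` are hypothetical, the single-bit flip is an injected test fault, and simple majority voting stands in for the actual diagnosis and recovery routine.

```python
from collections import Counter


class UnrecoverableError(Exception):
    """Raised when no two redundant copies agree, modeling a safe stop."""


def check(copies):
    """Compare three redundant values; return (majority value, mismatch flag).

    Assumption: recovery is modeled as plain majority voting.
    """
    value, count = Counter(copies).most_common(1)[0]
    if count < 2:
        raise UnrecoverableError("no majority among copies; stopping safely")
    return value, count != len(copies)


def running_sums(data, limit, fault_at=None):
    """Accumulate running sums in three copies, checking before each store
    and before each branch. `fault_at` optionally injects one bit flip."""
    acc = [0, 0, 0]   # three redundant versions of the accumulator
    stored = []       # stands in for memory
    recoveries = 0
    for i, x in enumerate(data):
        acc = [a + x for a in acc]
        if i == fault_at:
            acc[1] ^= 1 << 4  # hypothetical soft error in one copy

        # Check before the memory write.
        value, mismatch = check(acc)
        if mismatch:
            acc = [value] * 3  # overwrite the corrupted copy
            recoveries += 1
        stored.append(value)

        # Check the operand again before the branch decision.
        value, mismatch = check(acc)
        if mismatch:
            acc = [value] * 3
            recoveries += 1
        if value > limit:
            break
    return stored, recoveries
```

Calling `running_sums([1, 2, 3, 4], limit=100, fault_at=2)` corrupts one copy on the third iteration. The check before the store catches the mismatch, the majority value is written, and execution continues with all three copies back in agreement. If all three copies disagree, `check` raises `UnrecoverableError` instead of storing a possibly wrong value.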
Potential Applications
- Autonomous vehicles
- Implantable medical devices
- High-performance computing
- Protection against hardware malfunctions for safety/security applications
Benefits & Advantages
- Able to detect all soft errors
- Detection and recovery for both control-flow and data-flow errors
- Can recover from 97% of detected errors
- Software-only reliability solution
- Stops safely if an error is unrecoverable
Related Publication: NEMESIS: A software approach for computing in presence of soft errors | IEEE Conference Publication