Soft Errors Recovery Using Core Surprise Removal.
In this work we address the challenge of soft errors in modern processors. Radiation-induced soft errors have emerged as a key challenge in computer system design. Exponentially increasing transistor counts will drive per-chip fault rates correspondingly higher unless new technologies are employed. The prevailing view is that if the industry is to continue providing customers with the level of reliability they expect, system designers must address this challenge directly. Currently, a single soft error in one core of a thousand-core machine might bring down the whole system. We propose core surprise removal as a technique for overcoming chip-originated unrecoverable soft errors. We implement the technique in the Linux kernel and evaluate the implementation on a real system and in a virtualized environment. We show that adding only 50 lines of kernel code makes the system soft-error tolerant.