University of Surrey


Improving processor reliability using software protection techniques.

Nezzari, Yasser (2020) Improving processor reliability using software protection techniques. Doctoral thesis, University of Surrey.

Yasser PhD Final.pdf - Version of Record
Available under License Creative Commons Attribution Non-commercial Share Alike.


Abstract

The use of Commercial Off-The-Shelf (COTS) processors is increasingly attractive for the space domain, especially with emerging high-demand applications in Earth observation and communications. An order-of-magnitude improvement in on-board processing capability with less size, mass, and power is possible; however, COTS parts still lag in terms of reliability in the space environment, and costly protection techniques are required to ensure resilience to Single Event Effects (SEEs). Whilst current software reliability techniques are only capable of detecting errors and performing partial recovery, our research offers a step change in both error detection and recovery, without degradation in fault coverage, and targets modern multicore processors. This research presents a novel software technique, Automatic Compiler Error-Detection and Recovery (ACEDR), for software error detection and recovery. The technique covers both the CPU and the memory of COTS processing architectures: corruption of data (in RAM and CPU registers) accessed by instructions can be corrected. ACEDR requires no hardware modifications to detect and recover from errors. ACEDR is based on the LLVM compiler framework: it adds redundant instructions to the original code at compile time and inserts check instructions (calls to a voter function) that decide, at run time, the correct outcome among the three redundant instructions. To achieve high coverage, both CPU instructions, such as arithmetic and logic operations, and memory instructions, responsible for reading from and writing to memory, are triplicated and protected. This work does not protect jump or conditional jump instructions, and bit flips that would transform one instruction into another are not considered.
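The triplicate-and-vote idea described above can be illustrated with a minimal sketch. This is not ACEDR itself, which operates on LLVM intermediate code at compile time; the function names `vote` and `protected_add` are hypothetical and chosen only for illustration.

```python
def vote(a, b, c):
    """Return the majority value among three redundant copies.

    Hypothetical voter for illustration; ACEDR's real voter is inserted
    by a compiler pass and is not shown in this abstract.
    """
    if a == b or a == c:
        return a
    if b == c:
        return b
    # All three copies disagree: a single-fault assumption no longer holds.
    raise RuntimeError("no majority: all three copies disagree")


def protected_add(x, y):
    """Compute x + y three times and return the voted result."""
    r1 = x + y  # original instruction
    r2 = x + y  # first redundant copy
    r3 = x + y  # second redundant copy
    return vote(r1, r2, r3)
```

With a single corrupted copy, for example `vote(5, 5, 7)`, the two agreeing copies outvote the faulty one, which is what allows recovery rather than detection alone.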
The LLVM modifications alter the optimization phase of the compilation process by adding two passes: an analysis pass and a transformation pass. The analysis pass provides information about the code, consisting of instruction types and statistics that can later be used to analyse the confidence level. The transformation pass takes the information provided by the analysis pass and uses it to add the appropriate protection code where needed. This research also proposes adaptive protection for multicore processors, motivated by the fact that traditional software protection techniques mostly focus on error detection and ignore recovery. Our research offers a step change not only in its ability to detect and recover from errors, but also in its ability to reduce overhead while keeping error coverage high. It allows the running software to change mode in real time and to apply the appropriate error detection and recovery technique depending on the error rate on orbit. Novel error detection and protection techniques have been implemented to give each operation mode the necessary resilience: Instructions-TMR (ITMR), Threads-TMR (TTMR), and the combination of ITMR and TTMR. Reliability prediction models are presented in order to estimate the reliability. The reliability equations model the whole processing architecture using multiple parameters related to the hardware architecture and the environment. To determine the reliability added by our software protection techniques, all of the benchmarks have been tested using software fault injection, for which an LLVM fault injector has been developed. The reliability model starts from a basic Markov chain model. The reliability predictions depend on many factors specific to the hardware architecture used, including λ, the Single Event Upset (SEU) error rate.
λ changes depending on the cross section of the component. Another factor is the access rate, which depends on the hit/miss rate of the component. Other parameters that affect the reliability predictions are specific to the CPU, such as the number of cores and pipeline stages. Our model also takes into account the sensitivity of the different instruction types found in the benchmarks tested under an SEU rate λ. The prediction model estimates the worst-case scenario and does not consider the case where an error has occurred before writing to memory or loading into CPU registers. This research has been validated by means of fault injection, where both the protected and the unprotected code have been injected. The outcomes of the injected code have been compared to determine the ability of ACEDR and the adaptive solution to detect and recover from SEU errors. The fault injector goes through the code and randomly flips a bit in the data of one of its instructions, covering both CPU and memory instructions. Including the data of both CPU and memory instructions makes the injector more realistic with respect to the behaviour of SEUs in the real world. ACEDR improves the reliability of the system by reducing the error rate of the injection experiment that simulates SEUs. The error rate is defined as the number of injections that caused an error divided by the total number of injections. ACEDR provides up to 99% improvement for some benchmarks. This research has been tested on two machines: an Intel Core i5-3470 running at 3.2 GHz and a Raspberry Pi 3. On the first platform the overhead was less than 15%, and on the second it was less than 17%. Unlike other techniques in the literature that only provide error detection and/or partial recovery, our purely software-based protection technique offers a high rate of both detection and recovery, relative to the high error rate that has been injected.
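The bit-flip injection and the error-rate definition above can be sketched as follows. This is a toy model under stated assumptions, not the thesis's LLVM fault injector: exactly one fault is injected per run, it always hits the result of a single 32-bit addition, and all names (`flip_bit_i32`, `unprotected_add`, `protected_add`, `error_rate`) are hypothetical.

```python
import random


def flip_bit_i32(value, bit):
    """Flip one bit of a 32-bit two's-complement integer."""
    if not 0 <= bit < 32:
        raise ValueError("bit index must be in [0, 32)")
    raw = (value & 0xFFFFFFFF) ^ (1 << bit)
    # Reinterpret the 32-bit pattern as a signed integer.
    return raw - (1 << 32) if raw & 0x80000000 else raw


def majority(a, b, c):
    """Majority of three values, assuming at most one is corrupted."""
    return a if a == b or a == c else b


def unprotected_add(x, y, fault_bit):
    """Addition whose single result is hit by the injected bit flip."""
    return flip_bit_i32(x + y, fault_bit)


def protected_add(x, y, fault_bit):
    """Triplicated addition; only the first copy is hit by the fault."""
    r1 = flip_bit_i32(x + y, fault_bit)
    r2 = x + y
    r3 = x + y
    return majority(r1, r2, r3)


def error_rate(program, x, y, injections, seed=0):
    """Injections that caused a wrong output / total injections."""
    rng = random.Random(seed)
    golden = x + y  # fault-free reference output
    errors = 0
    for _ in range(injections):
        bit = rng.randrange(32)
        if program(x, y, bit) != golden:
            errors += 1
    return errors / injections
```

In this toy setting `error_rate(unprotected_add, 20, 22, 1000)` is 1.0, because every flip lands on the only value the output depends on, while `error_rate(protected_add, 20, 22, 1000)` is 0.0. Real benchmarks mask many faults and can suffer faults outside the protected data, so the rates reported in the thesis lie between these extremes.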
This high detection and recovery rate is due to the variety of data and CPU register types that the technique replicates. This research triplicates i32, i32*, i1, i8, i8*, i64, float, double, float pointer, and double pointer data and instruction types, in addition to replicating both memory and CPU registers. This newly developed software-based technique is notably beneficial when designers do not have the luxury of modifying the hardware but still need resilience against SEUs in a computer system. In ACEDR the overhead was low because the added instructions are redundant but independent, so the CPU pipeline can execute them in parallel without a bottleneck. For the adaptive multicore solution, both predictions and injection experiments confirm that the best reliability results were obtained when the combined protection techniques were used, with error rates dropping to between 0% and 0.60%. The second-best results were obtained when Instructions TMR (ITMR) was used, with error rates dropping to between 0% and 3.97%. Using Threads TMR (TTMR) alone reduced the error rates to between 3.51% and 14.73%, a level we do not recommend for mission-critical systems. The error detection and recovery came with an overhead of 14.97% to 131.70% for the combined protection techniques, 6.44% to 87.54% when ITMR was used, and 10.32% to 41.14% when TTMR was used. In the adaptive solution, the overhead added by the combined technique (TTMR and ITMR) was higher than when only TTMR or only ITMR was used for protection. The main sources of delay were the creation and joining of new threads, the voting function, and the redundant instructions added by ITMR. However, the reliability was dramatically improved when the combined protection techniques were used.
The reliability predictions of ACEDR and the adaptive solution have been compared against the reliability obtained from the fault injection experiments, and the two sets of results were found to correlate. This work would be highly valuable both to satellite and space applications and to general computing, such as aircraft, automotive, server farms, and medical equipment (or anywhere that needs safety-critical performance), as hardware gets smaller and more susceptible to such errors.
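For orientation only, the abstract states that the reliability model starts from a basic Markov chain with SEU rate λ. The thesis's own equations are not reproduced on this page; the textbook starting point, assuming a constant failure rate λ per module, a perfect voter, and no repair, is:

```latex
% Reliability of a single unprotected module with constant failure rate \lambda
R(t) = e^{-\lambda t}

% Triple modular redundancy with a perfect voter: the system survives
% while at least two of the three modules are still correct
R_{\mathrm{TMR}}(t) = 3R(t)^{2} - 2R(t)^{3}
                    = 3e^{-2\lambda t} - 2e^{-3\lambda t}
```

The additional parameters named in the abstract (cross section, access rate, number of cores, pipeline stages, instruction-type sensitivity) would refine these expressions; how they enter is specified in the thesis, not here.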

Item Type: Thesis (Doctoral)
Divisions : Theses
Authors : Nezzari, Yasser
Date : 30 April 2020
Funders : University of Surrey
DOI : 10.15126/thesis.00854093
Contributors :
Contribution | Name | Email | ORCID
http://www.loc.gov/loc.terms/relators/THS | Bridges, Christopher | C.P.Bridges@surrey.ac.uk |
Depositing User : Yasser Nezzari
Date Deposited : 19 May 2020 12:12
Last Modified : 20 May 2020 07:31
URI: http://epubs.surrey.ac.uk/id/eprint/854093


© The University of Surrey, Guildford, Surrey, GU2 7XH, United Kingdom.
+44 (0)1483 300800