FAULT TOLERANCE FOR MULTI-CORE AND MANY-CORE PROCESSORS

FAULT TOLERANCE FOR MULTI-CORE AND MANY-CORE PROCESSORS
Vanessa VARGAS
PhD candidate in Nano Electronics and Nano Technologies, Université de Grenoble Alpes, France
Professor at Universidad de las Fuerzas Armadas ESPE, Department of Electrical and Electronics, Ecuador

OUTLINE
• Introduction
• Motivation
• Background
• Work Done
• Conclusions

INTRODUCTION


INTRODUCTION
[Flowchart: Start, Task 1, Task 2, ..., Task n, End]

OUTLINE
• Introduction
• MOTIVATION
• Background
• Work Done
• Conclusions

MOTIVATION: SUPERCOMPUTERS
Top500 (June 2016)
• 1st in the Top500: Sunway TaihuLight (Sunway MPP, NRCPC), 93.01 Petaflops, Sunway SW26010 260C 1.45 GHz, 10,649,600 cores, 15.31 MW, National Supercomputing Center in Wuxi, China. Many-core.
• 2nd in the Top500: Tianhe-2 (NUDT), 33.86 Petaflops, Ivy Bridge 12 cores per processor at 2.2 GHz + Intel Xeon Phi, 3,120,000 cores, 17.81 MW, TH Express-2, National University of Defense Technology, China.

MOTIVATION
In HPC systems, the use of many-core processors is crucial to satisfy the growing demand for performance and reliability without a critical increase in power consumption.

MOTIVATION
This exponential growth faces many challenges:
• Power: limited power budget
• Space: fit in available floor space
• Cost: fixed financial budget
• Memory technology: feed compute power and cost efficiently
• Network technology: connect nodes power and cost efficiently
• Software: scale to utilize the growing compute capacity
• RELIABILITY: failure rates should not grow with machine size
• And others …

MOTIVATION: CONCERNING RELIABILITY
• Evaluate fault tolerance techniques under radiation and fault injection campaigns.
• Evaluate the impact of fault tolerance techniques on performance and energy consumption.
FIGURE 1. RADIATION EXPERIMENT

OUTLINE
• Introduction
• Motivation
• BACKGROUND
  • Multiprocessing modes
  • Fault Tolerance
• Work Done
• Conclusions

MULTI-PROCESSING MODES
FIGURE 2. SCHEMES OF AMP AND SMP PROCESSING MODES
SMP
• A single OS is responsible for achieving parallelism in the application.
• It dynamically distributes the tasks among the cores, manages the organization of task completion, and controls the shared resources.
AMP
• The cores run independently of each other, with or without an OS.
• They have their own private memory space, although there is a common infrastructure for inter-core communications.

FAULT TOLERANCE
A system is considered fault tolerant when, facing a fault, it continues working correctly.
Fault tolerance can be obtained by redundancy:
1. Spatial redundancy
2. Temporal redundancy
3. Both of them

Spatial vs temporal redundancy
TEMPORAL
• It uses the same physical components.
• It can separate identical data signals in time.
• Advantage: fewer components.
• Disadvantages: latency penalty; it has a maximum operating frequency and is therefore not used in faster commercial processes.
SPATIAL
• It uses different physical components.
• It can separate identical data signals in space.
• Advantage: it lacks an inherent maximum operating frequency.
• Disadvantages: it requires more area and components; penalty in performance.
Source: Radiation Effects and Soft Errors in Integrated Circuits and Electronic Devices
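The contrast between the two kinds of redundancy can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `checksum` is a hypothetical workload, and the thread-pool workers merely stand in for separate physical components (in CPython they do not guarantee separate cores).

```python
from concurrent.futures import ThreadPoolExecutor


def checksum(data):
    """Hypothetical workload: a simple modular checksum."""
    return sum(data) % 65521


def temporal_redundancy(func, arg, runs=2):
    """Repeat the computation on the same unit, one run after another."""
    results = [func(arg) for _ in range(runs)]
    agree = len(set(results)) == 1
    return results[0], agree


def spatial_redundancy(func, arg, copies=2):
    """Run identical copies on separate workers at the same time."""
    with ThreadPoolExecutor(max_workers=copies) as pool:
        results = list(pool.map(func, [arg] * copies))
    agree = len(set(results)) == 1
    return results[0], agree


if __name__ == "__main__":
    print(temporal_redundancy(checksum, [1, 2, 3]))
    print(spatial_redundancy(checksum, [1, 2, 3]))
```

The temporal version reuses one execution unit and pays in latency; the spatial version needs one worker per copy and pays in resources.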

FAULT TOLERANCE IN MULTICORE
Taking advantage of the multiplicity of cores, various redundancy techniques can be considered:
1. Temporal redundancy
2. Data value redundancy
3. Information redundancy for error detection in multicore designs
4. Redundancy in execution
Any technique can be evaluated by fault injection or by radiation test campaigns.

Redundancy in execution
State machine replication is used: replicated copies of a process are executed. The copies follow the same sequence of execution and produce the same result if the inputs are the same.
It must be ensured that redundant processes do not diverge in the absence of failures. Causes of divergence are:
• Nondeterministic functions (gettimeofday)
• Asynchronous signals
• In multi-core: access to shared memory
The record/replay method ensures that accesses to shared memory occur in the same order.
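The record/replay idea can be illustrated with a minimal sketch, assuming all shared-memory accesses go through one guarded entry point. `OrderedAccess` and `run_workers` are hypothetical names for this illustration and do not come from any of the cited work. The first run logs which thread touched the shared data at each step; the replica is then forced to follow the same log.

```python
import threading


class OrderedAccess:
    """Records the order of shared accesses, or enforces a recorded order."""

    def __init__(self, replay_order=None):
        self._cond = threading.Condition()
        self._replay = list(replay_order) if replay_order is not None else None
        self._pos = 0
        self.log = []

    def _my_turn(self, thread_id):
        return self._pos < len(self._replay) and self._replay[self._pos] == thread_id

    def access(self, thread_id, action):
        with self._cond:
            if self._replay is not None:
                # Replay mode: block until the log says it is this thread's turn.
                self._cond.wait_for(lambda: self._my_turn(thread_id))
            result = action()
            self.log.append(thread_id)
            self._pos += 1
            self._cond.notify_all()
            return result


def run_workers(ctl, n_threads=3, steps=2):
    """Each thread appends its id to a shared list through the controller."""
    shared = []

    def worker(tid):
        for _ in range(steps):
            ctl.access(tid, lambda: shared.append(tid))

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared


if __name__ == "__main__":
    recorder = OrderedAccess()
    original = run_workers(recorder)
    replica = run_workers(OrderedAccess(replay_order=recorder.log))
    print(original == replica)
```

The sketch assumes the replica performs exactly the accesses that were recorded; a replica that issues extra accesses would block, which a real implementation would have to detect.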

Redundancy in execution
[Diagram: an unreliable system becomes a reliable system through state machine replication (deterministic multithreading or record/replay) and through error checking and recovery (Double Modular Redundancy with checkpoint/rollback, or Triple Modular Redundancy with fault masking).]

Redundancy in execution
Deterministic multithreading
• Achieved by using locks, barriers and thread creation.
• Problem: it slows down the application.
Double Modular Redundancy (DMR)
• It allows error detection.
Triple Modular Redundancy (TMR)
• It allows error detection and correction by a voter.
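The difference between DMR and TMR can be sketched as follows. The function names are hypothetical and the replica results are assumed to be hashable values; this is an illustration of the principle, not the voter used in the work presented here.

```python
from collections import Counter


def dmr_check(a, b):
    """DMR: two replicas can only detect a mismatch, not tell which one is wrong."""
    return a == b


def tmr_vote(a, b, c):
    """TMR: three replicas let a majority voter mask a single faulty result."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count >= 2:
        return value
    raise ValueError("no majority: all three replicas disagree")


if __name__ == "__main__":
    print(dmr_check(42, 42))     # replicas agree
    print(dmr_check(42, 43))     # error detected, but not corrected
    print(tmr_vote(42, 43, 42))  # the faulty replica is outvoted
```

With DMR, recovery after detection needs an extra mechanism such as checkpoint/rollback; with TMR the voter masks the fault directly, at the cost of a third replica.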

Redundancy in execution
Mixed modelling
• Deterministic multithreading
• DMR
FIGURE 3. EXAMPLE OF REDUNDANCY IN EXECUTION
Source: Hamid Mushtaq, Zaid Al-Ars, Koen Bertels, "Fault Tolerance on Multicore Processors using Deterministic Multithreading"

OUTLINE
• Introduction
• Motivation
• Background
• WORK DONE
  • Freescale P2041RDB
    • TMR in AMP mode
    • Fault Injection in SMP
    • Radiation Tests in AMP and SMP mode
  • KALRAY MPPA-256 (Multi Purpose Processing Array)
    • Fault Injection in AMP mode
    • Radiation Tests in AMP mode
    • Fault Injection in mixed mode
    • Evaluating Fault Tolerance Technique
• Conclusions

FREESCALE P2041
FIGURE 4. QorIQ P2041 MEMORY ARCHITECTURE
• Built on: Power Architecture technology
• Manufactured: 45 nm SOI technology
• Based on: four e500mc cores (32-bit superscalar processor)
• Operating frequency: up to 1.5 GHz

TMR in AMP mode

TMR in AMP mode
FIGURE 5. FAULT INJECTION STRATEGY IN PROCESSOR REGISTER

TMR in AMP mode
EXPERIMENT
• The experiment was run 50,000 times.
• One or two SEUs were injected per execution.
FIGURE 6. FAULT-INJECTION CONSEQUENCES
RESULTS
• 20% of injected faults have no detectable consequences (silent faults).
• If one SEU is injected per execution, the error rate reaches 78% and the TMR corrects 99.99% of them.
• On the other hand, if two SEUs are injected, the error rate reaches 93% while the error correction factor decreases to 85%.
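Why a second SEU hurts TMR can be seen in a toy model. The sketch below is an assumption-laden simplification: it flips bits directly in three stored copies of a 32-bit result and applies a bitwise majority vote, whereas the experiment above injects faults into processor registers during execution. It does not reproduce the reported percentages; `flip_bit`, `bitwise_vote` and `campaign` are hypothetical names.

```python
import random

WIDTH = 32  # assumed register width in bits


def flip_bit(value, bit, width=WIDTH):
    """Model an SEU as a single bit flip in a width-bit word."""
    return (value ^ (1 << bit)) & ((1 << width) - 1)


def bitwise_vote(a, b, c):
    """Bitwise majority of three words: each bit takes the value held by at least two copies."""
    return (a & b) | (a & c) | (b & c)


def run_once(golden, n_seus, rng):
    """Inject n_seus random bit flips into three copies and check the voted result."""
    copies = [golden, golden, golden]
    for _ in range(n_seus):
        idx = rng.randrange(3)
        copies[idx] = flip_bit(copies[idx], rng.randrange(WIDTH))
    return bitwise_vote(*copies) == golden


def campaign(n_runs, n_seus, seed=0):
    """Return the fraction of runs in which the voter still yields the golden value."""
    rng = random.Random(seed)
    golden = 0xDEADBEEF
    corrected = sum(run_once(golden, n_seus, rng) for _ in range(n_runs))
    return corrected / n_runs


if __name__ == "__main__":
    print("1 SEU :", campaign(1000, 1))
    print("2 SEUs:", campaign(1000, 2))
```

In this model a single flip is always outvoted, while two flips defeat the voter only when they hit the same bit position in two different copies.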

TMR in AMP mode
FIGURE 7. FAULT-INJECTION CONSEQUENCES IN PROCESSOR REGISTERS

FAULT INJECTION IN SMP
TABLE I. APPLICATIONS SUMMARY

FAULT INJECTION IN SMP
Two test campaigns were performed on each selected application:
a) Fault injection in processor registers.
b) Fault injection in memory region.
TABLE II. FAULT-INJECTION CAMPAIGNS

FAULT INJECTION IN SMP
FIGURE 8. PROPOSED SOFTWARE FAULT-INJECTION IN MEMORY REGION

FAULT INJECTION IN APPLICATION RUNNING IN SMP
FIGURE 9. FAULT-INJECTION CONSEQUENCES IN PROCESSOR REGISTERS
[Bar chart. Series: Register MM, Register TSP. Categories: silent faults, result errors, exceptions, timeouts. Data labels as extracted: 84.38%, 65.39%, 34.19%, 13.52%, 1.47%, 0.63%, 0.16%, 0.27%]
These campaigns target only the private code memory:
• the initial process stack memory,
• the threads' stack memory, and
• the process heap memory.
FIGURE 10. FAULT-INJECTION CONSEQUENCES IN MEMORY REGION
[Bar chart. Series: Memory MM, Memory TSP. Categories: silent faults, result errors, exceptions, timeouts. Data labels as extracted: 96.59%, 59.82%, 23.32%, 14.25%, 2.60%, 1.49%, 1.92%, 0.02%]
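The four outcome categories used in these charts can be tallied from a campaign log along the following lines. This is a sketch under assumptions: the record fields `timed_out`, `exception` and `result` are hypothetical, and the order in which the checks are applied is a design choice of this illustration, not something stated in the slides.

```python
from collections import Counter


def classify(run, golden):
    """Map one injected run to an outcome category."""
    if run["timed_out"]:
        return "timeout"
    if run["exception"]:
        return "exception"
    if run["result"] != golden:
        return "result error"
    return "silent fault"  # the fault had no observable effect


def summarize(runs, golden):
    """Return the percentage of runs falling in each category."""
    counts = Counter(classify(r, golden) for r in runs)
    total = len(runs)
    return {category: 100.0 * n / total for category, n in counts.items()}


if __name__ == "__main__":
    log = [
        {"timed_out": False, "exception": False, "result": 10},
        {"timed_out": False, "exception": False, "result": 11},
        {"timed_out": False, "exception": True, "result": None},
        {"timed_out": True, "exception": False, "result": None},
    ]
    print(summarize(log, golden=10))
```

Each run lands in exactly one category, so the percentages of a campaign sum to 100%.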

RADIATION TESTS
FIGURE 11. CONSEQUENCES OF RADIATION TEST CAMPAIGNS
From the results, one can see that the reliability of an application depends on the characteristics of the software environment:
• Operating system.
• Multiprocessing mode used.
• Characteristics of the application.

RADIATION TESTS IN SMP MODE
FIGURE 12. ERROR CLASSIFICATION ACCORDING TO OS FAULT
The obtained results revealed that errors may occur in SMP mode, even if the OS is in idle mode.

RADIATION TESTS
FIGURE 13. SEE CONSEQUENCES ACCORDING TO THE SCENARIO IMPLEMENTED. THE CONFIDENCE INTERVALS ARE SHOWN BY MEANS OF THE RED LINES.

KALRAY MPPA-256
FIGURE 14: MPPA-256 MEMORY ARCHITECTURE
• Manufactured: TSMC CMOS 28HP technology.
• Integrates: 256 Processing Engine (PE) cores and 32 Resource Management (RM) cores.
• Clustered architecture: 16 compute clusters (CCs) and 2 I/O clusters per device.
• Compute cluster: multi-banked local static memory (SMEM) of 2 MB shared by the 16 PEs + 1 RM.
• I/O cluster: 2 groups of quad-cores, each with 128 KB shared.
• Based on: VLIW core, 32-bit/64-bit architecture.
• Operating frequency: 100 MHz to 600 MHz.
• Power consumption: 15 W to 25 W.
• Peak performance at 600 MHz: 634 GFLOPS and 316 GFLOPS for single and double precision respectively.

Fault Tolerance Approach on MPPA
Implemented at application level, it uses the 2 I/O clusters to improve the reliability of the application.
1) Core 0 initializes inter-cluster communications.
2) Core 0 generates a pthread per core:
• Cores 1, 2: master of a group of compute clusters.
• Cores 4, 5, 6: voters of the results (TMR arbiter).
• Core 3: arbiter of the final results. It logs the results.
• Core 7 (only on I/O 0): fault injector.
