
FAULT TOLERANCE FOR MULTI-CORE AND MANY-CORE PROCESSORS


  1. FAULT TOLERANCE FOR MULTI-CORE AND MANY-CORE PROCESSORS
     Vanessa VARGAS
     PhD candidate in Nano Electronics and Nano Technologies, Université de Grenoble Alpes, France
     Professor at Universidad de las Fuerzas Armadas ESPE, Department of Electrical and Electronics, Ecuador

  2. OUTLINE
     • Introduction
     • Motivation
     • Background
     • Work Done
     • Conclusions

  3. INTRODUCTION

  4. INTRODUCTION

  5. INTRODUCTION
     [Diagram: Start, Task 1, Task 2, ..., Task n, End]

  6. OUTLINE
     • Introduction
     • MOTIVATION
     • Background
     • Work Done
     • Conclusions

  7. MOTIVATION: SUPERCOMPUTERS, Top500 (June 2016)
     • 1st in the Top500: Sunway TaihuLight (Sunway MPP, NRCPC), many-core. 93.01 petaflops; Sunway SW26010 260C at 1.45 GHz; 10,649,600 cores; 15.31 MW; National Supercomputing Center in Wuxi, China.
     • 2nd in the Top500: Tianhe-2 (NUDT). 33.86 petaflops; Intel Ivy Bridge, 12 cores per processor, at 2.2 GHz, plus Intel Xeon Phi; 3,120,000 cores; 17.81 MW; TH Express-2 interconnect; National University of Defense Technology, China.

  8. MOTIVATION In HPC systems, the use of many-core processors is crucial to satisfy the growing demand for performance and reliability without a critical increase in power consumption.

  9. MOTIVATION This exponential growth faces many challenges:
     • Power: limited power budget
     • Space: fit in the available floor space
     • Cost: fixed financial budget
     • Memory technology: feed the compute units in a power- and cost-efficient way
     • Network technology: connect nodes in a power- and cost-efficient way
     • Software: scale to utilize the growing compute capacity
     • RELIABILITY: failure rates should not grow with machine size
     • And others ...

  10. MOTIVATION: CONCERNING RELIABILITY
      • Evaluate fault tolerance techniques under radiation and fault injection campaigns.
      • Evaluate the impact of fault tolerance techniques on performance and energy consumption.
      Figure 1. Radiation experiment

  11. OUTLINE
      • Introduction
      • Motivation
      • BACKGROUND
        • Multiprocessing modes
        • Fault Tolerance
      • Work Done
      • Conclusions

  12. MULTI-PROCESSING MODES
      Figure 2. Schemes of AMP and SMP processing modes
      SMP
      • A single OS is responsible for achieving parallelism in the application.
      • It dynamically distributes the tasks among the cores, manages the organization of task completion, and controls the shared resources.
      AMP
      • The cores run independently of each other, with or without an OS.
      • They have their own private memory space, although there is a common infrastructure for inter-core communications.

  13. FAULT TOLERANCE A system is considered fault tolerant if it continues to work correctly when a fault occurs. Fault tolerance can be obtained through redundancy:
      1. Spatial redundancy
      2. Temporal redundancy
      3. Both of them

  14. Spatial vs temporal redundancy
      TEMPORAL
      • It uses the same physical components.
      • It separates identical data signals in time.
      • Advantage: fewer components.
      • Disadvantages: latency penalty; it has a maximum operating frequency and is therefore not used in faster commercial processes; penalty in performance.
      SPATIAL
      • It uses different physical components.
      • It separates identical data signals in space.
      • Advantage: it lacks an inherent maximum operating frequency.
      • Disadvantage: it requires more area and components.
      Source: Radiation Effects and Soft Errors in Integrated Circuits and Electronic Devices
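
The contrast between the two forms of redundancy can be sketched in a few lines. This is only an illustration under stated assumptions: `checksum` is a hypothetical workload, and the worker threads merely stand in for separate physical components (Python threads are not distinct hardware).

```python
from concurrent.futures import ThreadPoolExecutor


def checksum(data):
    """Hypothetical workload: a simple additive checksum."""
    return sum(data) % 65521


def temporal_redundancy(task, data, runs=3):
    """Run the task several times, one after another, on the same resource."""
    return [task(data) for _ in range(runs)]


def spatial_redundancy(task, data, replicas=3):
    """Run the task on several workers at once (stand-ins for separate hardware)."""
    with ThreadPoolExecutor(max_workers=replicas) as pool:
        futures = [pool.submit(task, data) for _ in range(replicas)]
        return [f.result() for f in futures]


data = list(range(1000))
temporal = temporal_redundancy(checksum, data)
spatial = spatial_redundancy(checksum, data)

# In a fault-free run, every replica produces the same value.
print(len(set(temporal)) == 1, len(set(spatial)) == 1)
```

The temporal version pays in latency (three runs back to back), while the spatial version pays in resources (three workers), which mirrors the trade-off listed on the slide.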

  15. FAULT TOLERANCE IN MULTICORE Taking advantage of the multiplicity of cores, various redundancy techniques can be considered:
      1. Temporal redundancy
      2. Data value redundancy
      3. Information redundancy for error detection in multicore designs
      4. Redundancy in execution
      Any of these techniques can be evaluated by fault injection or by radiation test campaigns.

  16. Redundancy in execution State machine replication is used: replicated copies of a process are executed. The copies follow the same sequence of execution and produce the same result if the inputs are the same. The technique must ensure that redundant processes do not diverge in the absence of failures. Causes of divergence are:
      • Nondeterministic functions (gettimeofday)
      • In multi-core: access to shared memory
      • Asynchronous signals
      The record/replay method ensures that accesses to shared memory occur in the same order.
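
The record/replay idea can be sketched as follows. This is a minimal illustration, not the method used in the work presented: `OrderLog` and `run` are hypothetical names, and the shared memory is reduced to a Python list guarded by a condition variable.

```python
import threading


class OrderLog:
    """Records, then enforces, the order in which threads access shared data."""

    def __init__(self, recorded=None):
        # recorded is None in record mode, or a list of thread ids in replay mode.
        self.recorded = list(recorded) if recorded is not None else None
        self.log = []
        self.cond = threading.Condition()

    def access(self, thread_id, action):
        with self.cond:
            if self.recorded is not None:
                # Replay mode: block until the recorded order says it is our turn.
                self.cond.wait_for(
                    lambda: self.recorded[len(self.log)] == thread_id
                )
            action()
            self.log.append(thread_id)
            self.cond.notify_all()


def run(order_log, n_threads=3, accesses=2):
    """Each thread appends its id to a shared list through the order log."""
    shared = []

    def worker(tid):
        for _ in range(accesses):
            order_log.access(tid, lambda: shared.append(tid))

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared


primary = OrderLog()                       # record the access order
first = run(primary)
replica = OrderLog(recorded=primary.log)   # replay it on the redundant copy
second = run(replica)
print(first == second)
```

Whatever interleaving the scheduler picks for the first run, the replica is forced to repeat it, so the two copies cannot diverge because of access order alone.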

  17. Redundancy in execution
      [Diagram: Unreliable system → State Machine Replication → Error Checking and Recovery → Reliable system]
      • State Machine Replication: Deterministic Multithreading, Record/Replay
      • Error Checking and Recovery: Double Modular Redundancy with checkpoint/rollback, Triple Modular Redundancy with fault masking

  18. Redundancy in execution
      Deterministic multithreading
      • Achieved by using locks, barriers, and thread creation.
      • Problem: it slows down the application.
      Double Modular Redundancy (DMR)
      • It allows error detection.
      Triple Modular Redundancy (TMR)
      • It allows error detection and correction by a voter.
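
The difference between DMR and TMR comes down to what can be decided from the replica outputs. A minimal sketch, with hypothetical function names `dmr_check` and `tmr_vote`:

```python
from collections import Counter


def dmr_check(a, b):
    """DMR: two replicas can reveal a mismatch, but not which replica is wrong."""
    return a == b


def tmr_vote(a, b, c):
    """TMR: majority vote over three replicas.

    Returns (value, corrected), where corrected is True when one replica
    was outvoted. Raises ValueError when all three replicas disagree.
    """
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        raise ValueError("no majority: all replicas disagree")
    return value, count == 2


print(dmr_check(42, 42))     # True: replicas agree
print(dmr_check(42, 43))     # False: error detected, no correction possible
print(tmr_vote(42, 7, 42))   # (42, True): the faulty replica is outvoted
```

With DMR, a detected mismatch has to be handled by some recovery step such as re-execution, whereas the TMR voter masks a single faulty replica directly.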

  19. Redundancy in execution
      Mixed modelling
      • Deterministic Multithreading
      • DMR
      Figure 3. Example of redundancy in execution
      Source: Hamid Mushtaq, Zaid Al-Ars, Koen Bertels, "Fault Tolerance on Multicore Processors using Deterministic Multithreading"

  20. OUTLINE
      • Introduction
      • Motivation
      • Background
      • WORK DONE
        • Freescale P2041RDB
          • TMR in AMP mode
          • Fault Injection in SMP
          • Radiation Tests in AMP and SMP mode
        • KALRAY MPPA-256 (Multi Purpose Processing Array)
          • Fault Injection in AMP mode
          • Radiation Tests in AMP mode
          • Fault Injection in mixed mode
          • Evaluating Fault Tolerance Technique
      • Conclusions

  21. FREESCALE P2041
      Figure 4. QorIQ P2041 memory architecture
      • Built on: Power Architecture technology
      • Manufactured: 45 nm SOI technology
      • Based on: four e500mc cores (32-bit superscalar processor)
      • Operating frequency: up to 1.5 GHz

  22. TMR in AMP mode

  23. TMR in AMP mode Figure 5. Fault injection strategy in processor register

  24. TMR in AMP mode
      EXPERIMENT
      • The application was run 50,000 times.
      • One or two SEUs were injected per execution.
      Figure 6. Fault-injection consequences
      RESULTS
      • 20% of the injected faults have no detectable consequences (silent faults).
      • If one SEU is injected per execution, the error rate reaches 78% and the TMR corrects 99.99% of the errors.
      • If two SEUs are injected, the error rate reaches 93% while the error correction factor decreases to 85%.
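
Why a second SEU hurts TMR can be seen in a toy software model. The sketch below is an assumption-laden illustration and does not reproduce the measured percentages: it keeps three copies of one 32-bit word, flips random bits, and applies a bitwise majority vote. The names `flip_bit`, `majority`, and `campaign`, and the golden value, are hypothetical.

```python
import random


def flip_bit(word, bit):
    """Model a single-event upset (SEU) as one bit flip in a 32-bit word."""
    return (word ^ (1 << bit)) & 0xFFFFFFFF


def majority(a, b, c):
    """Bitwise majority vote over three 32-bit words."""
    return (a & b) | (a & c) | (b & c)


def campaign(runs, seus, seed=0):
    """Return how many runs end with a wrong voted value."""
    rng = random.Random(seed)
    golden = 0xCAFEF00D
    wrong = 0
    for _ in range(runs):
        replicas = [golden, golden, golden]
        for _ in range(seus):
            target = rng.randrange(3)
            replicas[target] = flip_bit(replicas[target], rng.randrange(32))
        if majority(*replicas) != golden:
            wrong += 1
    return wrong


print(campaign(50000, 1))  # 0: a single flipped bit is always outvoted
print(campaign(50000, 2))  # nonzero: two replicas can be hit on the same bit
```

In this model the vote fails only when two different replicas are corrupted at the same bit position, which is one way a double upset can defeat the voter.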

  25. TMR in AMP mode Figure 7. Fault-injection consequences in processor registers

  26. FAULT INJECTION IN SMP Table I. Applications summary

  27. FAULT INJECTION IN SMP Two test campaigns were performed on each selected application:
      a) Fault injection in processor registers.
      b) Fault injection in memory region.
      Table II. Fault-injection campaigns

  28. FAULT INJECTION IN SMP Figure 8. Proposed software fault-injection in memory region

  29. FAULT INJECTION IN APPLICATION RUNNING IN SMP
      Figure 9. Fault-injection consequences in processor registers. [Bar chart; series: Register MM, Register TSP; categories: silent faults, result errors, exceptions, timeouts. Bar labels as extracted: 84.38%, 65.39%, 34.19%, 13.52%, 1.47%, 0.63%, 0.16%, 0.27%.]
      The memory campaigns target only the private code memory: the initial process stack memory, the threads' stack memory, and the process's heap memory.
      Figure 10. Fault-injection consequences in memory region. [Bar chart; series: Memory MM, Memory TSP; categories: silent faults, result errors, exceptions, timeouts. Bar labels as extracted: 96.59%, 59.82%, 23.32%, 14.25%, 2.60%, 1.49%, 1.92%, 0.02%.]

  30. RADIATION TESTS Figure 11. Consequences of radiation test campaigns
      • From the results, one can see that the reliability of an application depends on the characteristics of the software environment:
        • Operating system.
        • Multiprocessing mode used.
        • Characteristics of the application.

  31. RADIATION TESTS IN SMP MODE Figure 12. Error classification according to OS fault
      The results revealed that errors may occur in SMP mode even if the OS is in idle mode.

  32. RADIATION TESTS Figure 13. SEE consequences according to the scenario implemented. The confidence intervals are shown by means of the red lines.

  33. KALRAY MPPA-256
      • Manufactured: TSMC CMOS 28HP technology.
      • Compute cluster: multi-banked local static memory (SMEM) of 2 MB shared by the 16 PEs + 1 RM.
      • Integrates: 256 Processing Engine (PE) cores and 32 Resource Management (RM) cores.
      • I/O cluster: 2 groups of quad-cores, each with 128 KB shared.
      • Based on: VLIW core, 32-bit/64-bit architecture.
      • Operating frequency: 100 MHz to 600 MHz.
      • Power consumption: 15 W to 25 W.
      • Peak performance at 600 MHz: 634 GFLOPS and 316 GFLOPS for single and double precision, respectively.
      • Clustered architecture: 16 compute clusters (CCs) and 2 I/O clusters per device.
      Figure 14. MPPA-256 memory architecture
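
The core counts on this slide can be cross-checked with a little arithmetic. The sketch assumes that the cores of the I/O clusters are counted among the 32 RM cores, which the slide does not state explicitly; the constant names are illustrative.

```python
COMPUTE_CLUSTERS = 16
PE_PER_CLUSTER = 16
RM_PER_CLUSTER = 1
IO_CLUSTERS = 2
QUAD_CORES_PER_IO_CLUSTER = 2
CORES_PER_QUAD = 4

pe_cores = COMPUTE_CLUSTERS * PE_PER_CLUSTER
io_cores = IO_CLUSTERS * QUAD_CORES_PER_IO_CLUSTER * CORES_PER_QUAD
# Assumption: the I/O cluster cores are counted as RM cores.
rm_cores = COMPUTE_CLUSTERS * RM_PER_CLUSTER + io_cores

print(pe_cores, rm_cores, pe_cores + rm_cores)  # 256 32 288
```

Under that assumption the totals match the 256 PE and 32 RM cores quoted above.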

  34. Fault Tolerance Approach on MPPA Implemented at application level, it uses the 2 I/O clusters to improve the reliability of the application.
      1) Core 0 initializes inter-cluster communications.
      2) Core 0 generates a pthread per core:
         • Core 1, 2: masters of a group of compute clusters
         • Core 3: arbiter of the final results; it logs the results
         • Core 4, 5, 6: voters of the results (TMR arbiter)
         • Core 7 (only on I/O 0): fault injector
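
The thread-per-core start-up can be sketched as below. This is only a rough illustration of the role assignment listed on the slide: Python threads are not pinned to cores as pthreads on the MPPA would be, the role-specific work is left as a placeholder, and `ROLES` and `start_io_cluster` are hypothetical names.

```python
import threading

# Roles per I/O-cluster core, as listed on the slide (core 0 is the spawner).
ROLES = {
    1: "master", 2: "master",
    3: "arbiter",
    4: "voter", 5: "voter", 6: "voter",
    7: "fault_injector",  # only on I/O cluster 0
}


def start_io_cluster(io_cluster_id):
    """Play the part of core 0: create one thread per remaining core."""
    started = {}
    lock = threading.Lock()

    def body(core, role):
        # Placeholder for the real role-specific work.
        with lock:
            started[core] = role

    threads = []
    for core, role in ROLES.items():
        if role == "fault_injector" and io_cluster_id != 0:
            continue  # the fault injector runs only on I/O cluster 0
        t = threading.Thread(target=body, args=(core, role), name=f"core{core}")
        threads.append(t)
        t.start()
    for t in threads:
        t.join()
    return started


print(sorted(start_io_cluster(0)))  # [1, 2, 3, 4, 5, 6, 7]
print(sorted(start_io_cluster(1)))  # [1, 2, 3, 4, 5, 6]
```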
