Containment Domains Resilience Mechanisms and Tools Toward Exascale - - PowerPoint PPT Presentation



SLIDE 1

Mattan Erez The University of Texas at Austin

Containment Domains Resilience

Mechanisms and Tools Toward Exascale Resilience

SLIDE 2
  • Yes, resilience is an exascale concern

– Checkpoint-restart not good enough on its own
– Commercial datacenters face different problems
– Heterogeneity keeps growing
– Correctness also at risk (integrity)

2 (c) Mattan Erez

SLIDE 3
  • Containment Domains (CDs)

– Isolate application resilience from system
– Increase performance and efficiency
– Simplify defensive (resilient) codes
– Adapt hardware and software

  • Portable Performant Resilient Proportional

3 (c) Mattan Erez

SLIDE 4
  • Efficient resilience is an exascale problem

4 (c) Mattan Erez

SLIDE 5
  • Failure rate possibly too high for checkpoint/restart
  • Correctness also at risk

5 (c) Mattan Erez

[Chart: performance efficiency (0–100%) for CDs (NT) vs. h-CPR and g-CPR at 80%]

SLIDE 6
  • Energy also problematic

6 (c) Mattan Erez

[Chart: energy overhead (0–20%) from 2.5PF to 2.5EF for CDs (NT) vs. h-CPR and g-CPR at 80%]

SLIDE 7
  • Something bad every ~minute at exascale
  • Something bad every year commercially

– Smaller units of execution
– Different requirements
– Different ramifications

7 (c) Mattan Erez

SLIDE 8
  • Rapid adoption of new technology and accelerators

– Again, potential mismatch with commercial setting

8 (c) Mattan Erez

SLIDE 9
  • So who’s responsible for resilience?
  • Hardware?
  • Software?
  • Algorithm?

9 (c) Mattan Erez

SLIDE 10
  • Can hardware alone solve the problem?
  • Yes, but costly

– Significant and fixed/hidden overheads
– Different tradeoffs in commercial settings

10 (c) Mattan Erez

SLIDE 11
  • Fixed overhead examples (estimated)

Costs in both energy and/or throughput:

– Up to ~25% for chipkill correct vs. chipkill detect
– 20–40% for pipeline SDC reduction
– >2X for arbitrary correction
– Even greater overhead if approximate units allowed

11 (c) Mattan Erez

SLIDE 12
  • Relaxed reliability and precision

– Some lunacy (rare easy-to-detect errors + parallelism)
– Lunatic fringe: bounded imprecision
– Lunacy: live with real unpredictable errors

[Bar chart: estimated arithmetic headroom under the Today, Scaled, Researchy, Some lunacy, Lunatic fringe, and Lunacy scenarios]

Rough estimated numbers for illustration purposes (c) Mattan Erez 12

SLIDE 13
  • Can software do it alone?

– Detection likely very costly
– Recovery effectiveness depends on error/failure frequency
– Tradeoffs more limited

13 (c) Mattan Erez

SLIDE 14
  • Locality and hierarchy are key

– Hierarchical constructs
– Distributed operation

  • Algorithm is key:

– Correctness is a range

14 (c) Mattan Erez

SLIDE 15
  • Containment Domains elevate resilience to a first-class abstraction

– Program-structure abstractions
– Composable resilient program components
– Regimented development flow
– Supporting tools and mechanisms
– Ideally combined with adaptive hardware reliability

  • Portable Performant Resilient Proportional

15 (c) Mattan Erez

SLIDE 16

16 (c) Mattan Erez

SLIDE 17

17 (c) Mattan Erez

SLIDE 18
  • CDs help bridge the gap

– Help us figure out exactly how
– Open source: lph.ece.utexas.edu/public/CDs, bitbucket.org/cdresilience/cdruntime

18 (c) Mattan Erez

SLIDE 19

CDs Embed Resilience within Application

  • Express resilience as a tree of CDs

– Match CD, task, and machine hierarchies
– Escalation for differentiated error handling

  • Semantics

– Erroneous data never communicated
– Each CD provides recovery mechanism

  • Components of a CD

– Preserve data on domain start
– Compute (domain body)
– Detect faults before domain commits
– Recover from detected errors

19 (c) Mattan Erez

[Diagram: CD tree with a root CD and child CDs]
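The four components above (preserve, compute, detect, recover) amount to a begin/body/detect/commit loop. A minimal C++ sketch of that loop, with a hypothetical `MiniCD` class standing in for the real CD runtime API (the real interface, as on the SpMV slides, uses `CreateAndBegin`/`Preserve`/`Complete`):

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a CD: preserve state on domain start,
// restore it on recovery. Not the published CD runtime API.
class MiniCD {
  std::vector<double> preserved_;  // copy of the protected state
 public:
  void Preserve(const double* data, std::size_t n) {
    preserved_.assign(data, data + n);  // preserve on domain start
  }
  void Restore(double* data) const {   // recovery: roll state back
    std::memcpy(data, preserved_.data(),
                preserved_.size() * sizeof(double));
  }
};

// Run a body inside a CD: re-execute from preserved state until the
// detector (run before the domain commits) accepts the result, so
// erroneous data never escapes the domain.
template <typename Body, typename Detect>
void RunInCD(double* state, std::size_t n, Body body, Detect ok) {
  MiniCD cd;
  cd.Preserve(state, n);
  do {
    cd.Restore(state);  // idempotent re-execution from preserved state
    body(state);
  } while (!ok(state));
}
```

The loop makes the key semantic visible: a CD only commits results that pass its detector, and recovery is local re-execution from the CD's own preserved inputs.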

SLIDE 20

Mapping example: SpMV

20 (c) Mattan Erez

[Figure: matrix M (N) and vector V (W) for SpMV]

void task<inner> SpMV(in M, in Vi, out Ri) {
  cd = GetCurrentCD()->CreateAndBegin();
  cd->Preserve(matrix, size, kCopy);
  forall(…) reduce(…) SpMV(M[…], Vi[…], Ri[…]);
  cd->Complete();
}

void task<leaf> SpMV(…) {
  cd = GetCurrentCD()->CreateAndBegin();
  cd->Preserve(M, sizeof(M), kRef);
  cd->Preserve(Vi, sizeof(Vi), kCopy);
  for r=0..N
    for c=rowS[r]..rowS[r+1]
      resi[r] += data[c]*Vi[cIdx[c]];
  cd->CDAssert(idx > prevIdx, kSoft);
  prevC = c;
  cd->Complete();
}

SLIDE 21

Mapping example: SpMV

21 (c) Mattan Erez

[Figure: matrix M blocked into N11, N12, N21, N22; vector V split into W1, W2]

void task<inner> SpMV(in M, in Vi, out Ri) {
  cd = GetCurrentCD()->CreateAndBegin();
  cd->Preserve(matrix, size, kCopy);
  forall(…) reduce(…) SpMV(M[…], Vi[…], Ri[…]);
  cd->Complete();
}

void task<leaf> SpMV(…) {
  cd = GetCurrentCD()->CreateAndBegin();
  cd->Preserve(M, sizeof(M), kRef);
  cd->Preserve(Vi, sizeof(Vi), kCopy);
  for r=0..N
    for c=rowS[r]..rowS[r+1]
      resi[r] += data[c]*Vi[cIdx[c]];
  cd->CDAssert(idx > prevIdx, kSoft);
  prevC = c;
  cd->Complete();
}

SLIDE 22

Mapping example: SpMV

22 (c) Mattan Erez

[Figure: matrix blocks N11, N12, N21, N22 and vector halves W1, W2 distributed to 4 nodes]

void task<leaf> SpMV(…) {
  cd = GetCurrentCD()->CreateAndBegin();
  cd->Preserve(M, sizeof(M), kRef);
  cd->Preserve(Vi, sizeof(Vi), kCopy);
  for r=0..N
    for c=rowS[r]..rowS[r+1]
      resi[r] += data[c]*Vi[cIdx[c]];
  cd->CDAssert(idx > prevIdx, kSoft);
  prevC = c;
  cd->Complete();
}

SLIDE 23

Mapping example: SpMV

23 (c) Mattan Erez

[Figure: matrix blocks N11, N12, N21, N22 and vector halves W1, W2 distributed to 4 nodes]

void task<leaf> SpMV(…) {
  cd = GetCurrentCD()->CreateAndBegin();
  cd->Preserve(M, sizeof(M), kRef);
  cd->Preserve(Vi, sizeof(Vi), kCopy);
  for r=0..N
    for c=rowS[r]..rowS[r+1]
      resi[r] += data[c]*Vi[cIdx[c]];
  cd->CDAssert(idx > prevIdx, kSoft);
  prevC = c;
  cd->Complete();
}

SLIDE 24

Concise abstraction for complex behavior

24 (c) Mattan Erez

[Diagram: preservation options: local copy or regeneration, sibling, parent (unchanged)]

void task<leaf> SpMV(…) {
  cd = GetCurrentCD()->CreateAndBegin();
  cd->Preserve(M, sizeof(M), kRef);
  cd->Preserve(Vi, sizeof(Vi), kCopy);
  for r=0..N
    for c=rowS[r]..rowS[r+1]
      resi[r] += data[c]*Vi[cIdx[c]];
  cd->CDAssert(idx > prevIdx, kSoft);
  prevC = c;
  cd->Complete();
}

SLIDE 25
  • General abstractions – a “language” for resilience

25 (c) Mattan Erez

[Diagrams: CD trees (parent A with children B and C) illustrating the preservation options: local copy or regeneration, sibling, and parent (unchanged); duplicate-and-compare detection; replicate in space, in time, or not at all?]

SLIDE 26
  • CDs natural fit for:

– Hierarchical SPMD
– Task-based systems

  • CDs still general:

– Opportunistic approaches to add hierarchical resilience
– Always fall back to more checkpoint-like mappings

26 (c) Mattan Erez

SLIDE 27
  • Reminder of why you care

27 (c) Mattan Erez

SLIDE 28
  • CDs enable per-experiment/system “optimality”

– (Portable) Use same resilience abstractions across programming models and implementations

  • MPI ULFM? MPI-Reinit? OpenMP? UPC++? Legion?

– Don’t keep rethinking correctness and recovery

  • CPU, GPU, FPGA accelerator, memory accelerator, … ?

– (Performant) Resilient patterns that scale

  • Hierarchical / local
  • Aware of application semantics
  • Auto-tuned efficiency/reliability tradeoffs

– (Resilient) Defensive coding

  • Algorithms, implementations, and systems
  • Reasonable default schemes
  • Programmer customization

– (Proportional) Adapt hardware and software redundancy

28 (c) Mattan Erez

SLIDE 29

– Annotations, persistence, reporting, recovery, tools

29 (c) Mattan Erez

[Diagram: CD Runtime System Architecture. CD-annotated applications/libraries run on the CD runtime system, hardware, and compiler support. Tooling: debugger, CD-app mapper, user interaction for customized error detection/handling/tolerance/injection, auto-tuner interface (CD Auto Tuner), profiling & visualization interface (Sight), scaling tool (LWM2). Runtime services: error handling (unified runtime error detector), persistence layer (state preservation, communication logging, runtime logging), communication runtime library (Legion + GASNet), Legion + libc, BLCR, CD-storage mapping interface. Hardware: low-level machine check HW/SW interface, error reporting architecture. Storage: PFS, buddy, DRAM, SSD, HDD. Components marked as external tool, internal tool, or future plan.]

SLIDE 30
  • CD usage flow

– Annotate
– Profile and extrapolate CD tree
– Supply machine characteristics
– Analyze and auto-tune

  • Flexible preservation, detection, and recovery

– Refine tradeoffs and repeat
– Execute and monitor

  • CD management and coordination
  • Distributed and hierarchical preservation
  • Distributed and hierarchical recovery

30 (c) Mattan Erez

SLIDE 31
  • CD annotations express intent

– CD hierarchy for scoping and consistency
– Preservation directives and hints exploit locality
– Correctness abstractions

  • Detectors and tolerances

– Recovery customization
– Debug/test interface

  • Work in progress: http://lph.ece.utexas.edu/users/CDAPI

31 (c) Mattan Erez

SLIDE 32
  • State preservation and restoration API
– Hierarchical
  • Per CD (level)
  • Match storage hierarchy
  • Maximize locality and minimize overhead

– Proportional

  • Preserve only when worth it (skip preserve calls)
  • Exploit inherent redundancy
  • Utilize regeneration

32 (c) Mattan Erez
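Proportional preservation ("preserve only when worth it") can be framed as an expected-cost comparison; a sketch under assumed, illustrative cost parameters (not part of the CD API):

```cpp
#include <cassert>

// Hypothetical cost model: preserving a buffer is only worth it when
// re-generating it during recovery would cost more, in expectation,
// than copying it now. All parameters are illustrative, not part of
// the CD runtime interface.
bool WorthPreserving(double copy_cost,            // cost of the Preserve call
                     double regen_cost,           // cost to regenerate on recovery
                     double recovery_probability) // chance this CD must recover
{
  double expected_regen = regen_cost * recovery_probability;
  return expected_regen > copy_cost;  // otherwise skip the preserve call
}
```

A CD mapper or auto-tuner could use a test like this to decide, per CD level, between `kCopy`, `kRef` (rely on a parent's copy), and skipping preservation in favor of regeneration.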

SLIDE 33

33 (c) Mattan Erez

[Diagram: preservation options: local copy or regeneration, sibling, parent (unchanged)]

SLIDE 34

LULESH CD mapping example

34 (c) Mattan Erez

SLIDE 35

Autotuned CDs perform well

35 (c) Mattan Erez

[Charts: performance efficiency (0–100%) vs. peak system performance (2.5PF–2.5EF) for NT, SpMV, and HPCCG; CDs compared against h-CPR and g-CPR at 80%, 50%, and 10% respectively]

SLIDE 36

[Chart: energy overhead (0–20%) vs. peak system performance (2.5PF–2.5EF) for CDs (NT) vs. h-CPR and g-CPR at 80%]

CDs improve energy efficiency at scale

36 (c) Mattan Erez

[Charts: energy overhead (0–20%) vs. peak system performance (2.5PF–2.5EF) for SpMV (h-CPR/g-CPR at 50%) and HPCCG (h-CPR/g-CPR at 10%)]

SLIDE 37

10X failure rate emphasizes CD benefits

37 (c) Mattan Erez

[Charts: performance efficiency and energy overhead vs. peak performance (2.5PF–2.5EF) at 10× failure rate for NT (h-CPR, 80%), SpMV (h-CPR, 50%), and HPCCG (h-CPR, 10%)]

SLIDE 38
  • Can be implicit with right programming model

– For example, Legion

38 (c) Mattan Erez

SLIDE 39
  • Use Legion copies for CD preservation
  • Optimize for efficiency

– When to add copies
– Where to put copies to survive failures
– When to free copies

  • Account for different failure modes and rates

39 (c) Mattan Erez

SLIDE 40

40 (c) Mattan Erez

SLIDE 41
  • Portable correctness

– Resilience perspective

41 (c) Mattan Erez

SLIDE 42
  • Correctness abstractions

– Detectors
– Requirements
– Recovery

42 (c) Mattan Erez

SLIDE 43
  • What can go wrong?

– Application crash
– Process crash
– Process unresponsive
– Failed communication
– Hardware

  • Cache error
  • Memory error
  • TLB error
  • Node offline

43 (c) Mattan Erez

SLIDE 44
  • What can go wrong?

– Lost resource
– Wrong value

  • Specific address?
  • Specific access?
  • Specific computation?

– Degraded resource

  • Who detects?

How reported?

44 (c) Mattan Erez

SLIDE 45
  • System-provided detectors

  • Control response granularity
  • User-specified detectors

  • Consistent and unified reporting & analysis

45 (c) Mattan Erez
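One way to read "consistent and unified reporting" is that system-provided and user-specified detectors share a single registration and report path, so the runtime and auto-tuner see one error stream. A hypothetical sketch (the `DetectorSet`/`Report` names are illustrative, not the published CD API):

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// A detector's verdict, reported through one unified channel
// regardless of whether the detector came from the system or the user.
struct Report {
  std::string detector;  // which detector produced this report
  bool error;            // true if the detector flagged an error
};

class DetectorSet {
  std::vector<std::pair<std::string, std::function<bool()>>> detectors_;
 public:
  // Register either a system-provided or a user-specified check.
  void Register(std::string name, std::function<bool()> check) {
    detectors_.emplace_back(std::move(name), std::move(check));
  }
  // Run every detector before the CD commits; collect uniform reports.
  std::vector<Report> RunAll() const {
    std::vector<Report> out;
    for (const auto& [name, check] : detectors_)
      out.push_back({name, check()});
    return out;
  }
};
```

Because both kinds of detectors produce the same `Report` records, downstream analysis (and response-granularity control) does not need to know who supplied the check.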

SLIDE 46

Express correctness intent

  • Notifies auto-tuner of detection capability
  • Enables error elision

  • Auto-add redundancy to meet requested level of reliability

  • Customize action

46 (c) Mattan Erez

SLIDE 47
  • Bounded Approximate Duplication

+ − × √x2

(c) Mattan Erez 47
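Bounded approximate duplication re-executes an operation (possibly on a cheaper, approximate unit) and flags an error only when the two results diverge beyond an accepted bound; a minimal sketch with an illustrative bound parameter:

```cpp
#include <cassert>
#include <cmath>

// Duplicate-and-compare with a tolerance: results within `bound` of
// each other are accepted, so a bounded-imprecision duplicate unit
// does not trigger false alarms. The bound is illustrative.
bool BoundedDuplicateOk(double precise, double approximate, double bound) {
  return std::fabs(precise - approximate) <= bound;  // within tolerance
}
```

This is the "lunatic fringe" tradeoff from Slide 12 made concrete: the looser the bound, the cheaper the duplicate unit can be, at the cost of letting small deviations pass undetected.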

SLIDE 48
  • Bounded Approximate Duplication

+ − × √x2

(c) Mattan Erez 48

SLIDE 49
  • Analysis/decision-support/tuning

49 (c) Mattan Erez

SLIDE 50
  • Example: integrity tools

– Selective injection by CD and error type
– Integrate with CD-level detectors

  • Only inject “SDCs”

– “Fuzzing” tools for completeness
– Analytical modeling for tradeoffs

  • Energy, memory, performance

[Diagram: injector, error model, and detector model within the error-management framework; iterate until unmasked; attached to a CD]

(c) Mattan Erez 50

SLIDE 51
  • Straightforward bottom-up analysis

– Analytical solutions for simple trees
– More computation for complex graphs

51 (c) Mattan Erez

[Timeline: compute, preserve, and detect segments plus re-execution overhead]
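For a simple tree, the bottom-up analysis reduces to closed form. A sketch for a single leaf CD, assuming a geometric re-execution model in which each attempt pays preserve + compute + detect and fails independently with probability `p_fail` (an illustrative model, not the actual auto-tuner's):

```cpp
#include <cassert>

// Expected runtime of one CD under a geometric re-execution model:
// each attempt costs preserve + compute + detect and fails with
// probability p_fail, so the expected number of attempts is
// 1 / (1 - p_fail). Illustrative only.
double ExpectedCDTime(double preserve, double compute, double detect,
                      double p_fail) {
  double attempt = preserve + compute + detect;
  return attempt / (1.0 - p_fail);  // geometric number of attempts
}
```

Composing such per-CD expectations up the tree is what makes simple CD hierarchies analytically tunable, while complex graphs need more computation, as the slide notes.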

SLIDE 52

52 (c) Mattan Erez

SLIDE 53
  • Example: what-if reliability/resilience tradeoffs
  • Should all memory be heavily ECC protected?

– Much cheaper to recognize anomalies in some apps
– Much cheaper to do detection only
– …

  • Mechanisms for adapting ECC scheme are known

– Though not implemented in any product

53 (c) Mattan Erez

SLIDE 54
  • Another application example

54 (c) Mattan Erez

SLIDE 55
  • TOORSES fault-tolerant hierarchical solver

– Brian Austin, Eric Roman, and Xiaoye Li (LBNL) – Hierarchical semi-separable representation

55 (c) Mattan Erez

SLIDE 56
  • Add CDs at different granularities

– Hierarchical and partial preservation

  • Add algorithmic and cheap detection
  • Compare to:

– Algorithmic recovery with redundant computation

56 (c) Mattan Erez

[Chart: performance efficiency (0–100%) vs. error injection rate (1E-3 to 1E+0 errors/s) for Coarse, Medium, Fine, and Encoded mappings]

SLIDE 57
  • Bottom line – expected benefits significant:

– Isolate application correctness from system
– Use same resilience abstractions across programming models
– Enable efficient resilience patterns that scale

  • Reasonable default schemes
  • Programmer customization

– Auto-tune efficiency/reliability tradeoffs
– Adapt to system and experiment dynamics

  • All open source

– lph.ece.utexas.edu/public/CDs and bitbucket.org/cdresilience/cdruntime
– lph.ece.utexas.edu/users/hamartia and bitbucket.org/lph_tools/hamartia_suite

57 (c) Mattan Erez

SLIDE 58
  • Status

– Mostly-sequential functional CD runtime released
– MPI implementation mostly done (some merging left)

  • bitbucket.org/cdresilience/cdruntime

– cdCUDA prototype – Abstractions, for now, seem sufficient

  • But, not enough users yet

– Rudimentary implementation only

  • Lots of opportunities for improvement

– Storage (object stores across hierarchy?)
– OS/R (better isolation mechanisms, automation)
– Communication (reduce recovery overhead)

– Other programming models coming along

  • UPC++ prototype in progress
  • Legion integration

58 (c) Mattan Erez

SLIDE 59
  • Obviously, I’m just the figurehead

– Former UT students:

  • Jinsuk Chung, Ikhwan Lee, Minsoo Rhu, Michael Sullivan, Doe Hyun Yoon

– Current UT students:

  • Chun-Kai Chang, Seong-Lyong Gong, Chanyong Hu, Tommy Huynh, Dong Wan Kim, Yongkee Kwon, Kyushick Lee, Sangkug Lym, Song Zhang

– Collaborators:

  • LBNL: Brian Austin, Dan Bonachea, Paul Hargrove, Eric Roman
  • Cray: Larry Kaplan and team
  • NVIDIA: Siva Hari, Tim Tsai

– Funding (overall, not just CDs)

  • DOE ASCR: ECRP, FastForward, X-Stack, Resilience, PSAAP II
  • DARPA: UHCP
  • NSF: CAREER
  • DOD: Fellowship
  • TACC and NERSC compute facilities

59 (c) Mattan Erez

SLIDE 60
  • Containment Domains

– Abstract resilience constructs that span system layers
– Hierarchical and distributed operation for locality
– Scalable to large systems with high energy efficiency
– Heterogeneous to match disparate error/failure effects
– Proportional and effectively balanced
– Tunable resilience specialized to application/system
– Analyzable and auto-tuned
– Open source: lph.ece.utexas.edu/public/CDs, bitbucket.org/cdresilience/cdruntime

  • Portable Performant Resilient Proportional

60 (c) Mattan Erez

SLIDE 61
  • Backup

61 (c) Mattan Erez

SLIDE 62
  • An aside on error modeling and injection

62 (c) Mattan Erez

SLIDE 63
  • High-fidelity modeling in Veracity

63 (c) Mattan Erez

SLIDE 64
  • Multiple, detailed, low-level fault models to explore faults at circuit and microarchitectural levels

– Focused on particle strikes and voltage droop
– Circuit-level simulation and analysis is performed both statically (pre-characterization) and dynamically (runtime)

  • Low-level models combined with hierarchical injection and simulation frameworks

– Simulate the effect of faults at points in an application throughout the entire microarchitecture
– Multiple levels of abstraction, providing speedup at higher levels

  • Finally, simulation results will enable production of high-level models

– Characterize error patterns, fault rates, and vulnerable portions of the microarchitecture
– Models will provide insight into software resiliency

(c) Mattan Erez 64

SLIDE 65

Accurate Low-level Injection and Modeling

  • Modeling OpenSPARC T1 microprocessor

– Free, open source
– Existing FPGA version of project
– High-performance, multithreaded

  • Model low-level fault propagation

– Full fidelity with fast hierarchical simulation

  • Circuit-level
  • RTL / microarchitectural-level
  • ISA-level

– Inject errors at the circuit level
– Data and control fault injection

(c) Mattan Erez 65

SLIDE 66

Fast Hierarchical Low-level Inject / Simulation

  • Transition between levels

– Circuit, RTL, ISA levels
– Speed/granularity benefits
– Maintain fidelity across levels

  • Novel, accurate RTL → ISA switching algorithm [1]

– Maintains full fidelity
– Detects whether it can terminate early

  • Framework enables:

– Fast, accurate analysis of fault propagation
– Detect masked/unmasked faults
– Fault injection in any logic (data, control)
– Improvement over timeout-based detection

[1] Y. Yuan, “Exploring Hierarchical Fault Injection Simulation for Evaluating the System-level Impact of Single-Event Upsets”, The University of Texas at Austin, 2015.

[Diagram: simulation hierarchy levels: circuit-level error injection (e.g., SPICE; slow, circuit-level granularity), RTL (e.g., VCS; moderate speed, microarchitectural granularity; transition after 1 cycle to maintain fidelity), and ISA (e.g., Simics, QEMU; fast, SW-visible granularity; early termination); transition only when fidelity can be maintained]

(c) Mattan Erez 66

SLIDE 67

Hierarchical Simulation Advancements

  • Replace RTL simulation with FPGA emulation

– Fast, natural target for RTL
– OpenSPARC includes FPGA support
– Tool NIFD (created by our group)

  • Reads/writes FPGA register state
  • Used to inject errors into FPGA
  • No FPGA recompilation for different tests

  • Support multi-fault / multi-cycle errors

– Already supports particle strike fault injection
– Can mimic voltage droop
– Greater variety of fault scenarios

[Diagram: simulation hierarchy levels with the RTL stage replaced by FPGA emulation: circuit-level error injection (e.g., SPICE; slow), FPGA (fast, microarchitectural granularity; halt FPGA, read state, write state, resume FPGA), and ISA (e.g., Simics, QEMU; fast, SW-visible granularity; run to program completion); transition only when fidelity can be maintained]

(c) Mattan Erez 67

SLIDE 68

High-Fidelity Hardware Faults Modeling

  • On-demand transistor-accurate fault injection with workload-specific distributional properties

  • Use model for fault injector (FI):
  • Higher level tool injects input vector to FI
  • Returns an error-output for inputs with non-zero probability of error

  • Initial fault model
  • Voltage droop
  • Possible methodologies

68

– Two-phase with pre-characterization: (1) error profiling, then (2) look-up; fast simulation; potentially high* memory usage
– Runtime simulation: full run-time error evaluation; slow simulation; low memory usage

*High: O(n) where n is the number of critical paths

[Diagram: fault injector takes an input pair and returns an error-output, or null]

(c) Mattan Erez

SLIDE 69

Fault Injection Methodology

  • Phase 1: Pre-characterization
  • Builds a model of possible error-outputs and their probabilities for every possible input

  • Model outputs

1. Error-outputs and corresponding input pairs
2. Probability of producing any error-output
3. Conditional probability of each error-output

  • Phase 2: Runtime error-output generation
  • For an input with a non-zero probability of producing an error, generates one out of the possible set of error-outputs, with probability relative to the total

69 (c) Mattan Erez
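Phase 2 amounts to a profile lookup plus a probability-weighted draw. A sketch of that step; the `ErrorProfile` type and `GenerateOutput` function below are illustrative stand-ins for the pre-characterized error-profile database, not the tool's actual interface:

```cpp
#include <cassert>
#include <map>
#include <utility>
#include <vector>

// Pre-characterized profile: input pattern -> candidate error-outputs
// with their conditional probabilities (Phase 1's product).
struct ErrorProfile {
  std::map<unsigned, std::vector<std::pair<unsigned, double>>> entries;
};

// Phase 2: for an input with non-zero error probability, pick one
// error-output with probability relative to the total; otherwise (or
// for the residual probability mass) return the correct output.
// `draw` is a uniform random number in [0, 1) supplied by the caller.
unsigned GenerateOutput(const ErrorProfile& db, unsigned input,
                        unsigned correct_output, double draw) {
  auto it = db.entries.find(input);
  if (it == db.entries.end()) return correct_output;  // no error possible
  double acc = 0.0;
  for (const auto& [err_out, p] : it->second) {
    acc += p;
    if (draw < acc) return err_out;  // landed in this output's slice
  }
  return correct_output;  // residual mass: no error this time
}
```

Passing `draw` in explicitly keeps the function deterministic and testable; a real injector would draw it from its RNG per evaluated input.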

SLIDE 70

Pre-characterization

[Flow: synthesize RTL to a gate-level netlist (.v); run STA with Vnominal and Vmin libraries to find paths with slack S < 0; for each such path πi, interpolate the critical voltage V*(πi) and the output error O(πi); use ATPG to identify the input patterns (IiA × IiB) that exercise each path; store the results in an error-profile database (.edb)]

(c) Mattan Erez 70

SLIDE 71
  • Medium-fidelity fast error modeling and injection in

AEDAM

71 (c) Mattan Erez

SLIDE 72

Instruction-level Injection

  • Pin-based Fault Injection

– Generic Error/Detector API for reusable models across injectors and languages
– Instruction-driven injection, lower fidelity

  • Hierarchical fault model execution
  • Selective injection

– Filters based upon instruction & code region

  • Comprehensive coverage

– Open-source tools allows for distributed and large-scale error simulations

  • Allows for high-level model generation

– Fast, application-level injection
– Monte-Carlo injection methodology

[Diagram: Pin injector, error model, and detector model within the error-management framework; iterate until unmasked; attached to the program]

(c) Mattan Erez 72

SLIDE 73

Gate-Level Injection

  • Injection into synthesized, gate-level netlists

– Models transient and stuck-at faults
– Simulates faults at a unit level (e.g. ALU)
– Filter eligible fault locations by gate/latch

  • Verilog simulation

– Iteratively find unmasked error patterns
– Optional detector support for modeling fault detection hardware

  • Connects with Pin Injector for hierarchical simulation and injection

[Diagram: a higher-level injector (e.g. Pin) feeding the gate-level injector within the error-management framework; iterate until unmasked]

(c) Mattan Erez 73

SLIDE 74
  • Modeling DRAM errors

– Aware of ECC options
– Aware of memory architecture
– Aware of memory fault modes

*red box: fault injection point (c) Mattan Erez 74

SLIDE 75

75 (c) Mattan Erez *red box: fault injection point