
EPVF: AN ENHANCED PROGRAM VULNERABILITY FACTOR METHODOLOGY FOR CROSS-LAYER RESILIENCE ANALYSIS. Bo Fang, Qining Lu, Karthik Pattabiraman, Matei Ripeanu (The University of British Columbia, Canada) and Sudhanva Gurumurthi (Cloud Innovation Lab, IBM, USA)


slide-1
SLIDE 1

EPVF: AN ENHANCED PROGRAM VULNERABILITY FACTOR METHODOLOGY FOR CROSS-LAYER RESILIENCE ANALYSIS

Bo Fang☨, Qining Lu☨, Karthik Pattabiraman☨, Matei Ripeanu☨ and Sudhanva Gurumurthi*
☨ The University of British Columbia, Canada
* Cloud Innovation Lab, IBM, USA

slide-2
SLIDE 2

What are we facing?

§ SoC soft error trends: the overall FIT rate per SoC is increasing [DATE 2014, Chandra, AMD]

[Figure: SoC SER FIT rate per node, with separate series for memory SER and logic SER]

slide-3
SLIDE 3

Why Software-based Fault Tolerance

§ Hardware-based techniques
§ Software-based techniques: more cost-effective

[Figure: system layers (Device/Circuit Level, Architectural Level, Operating System Level, Application Level), labeled "Hardware Faults" and "Impactful Errors"]

slide-4
SLIDE 4

Mitigating Silent Data Corruption (SDC): Key to Error Resilience

[Figure: a fault during normal execution is either benign or becomes an error; an error ends in a crash, a hang, or an SDC (incorrect output)]

slide-5
SLIDE 5

Error Resilience Estimation: Accuracy vs Cost

§ Fault injection (FI): high resource consumption, low predictive power
§ AVF/PVF [HPCA2010, MICRO2003]: conservative estimation of error resilience

[Figure: accuracy vs. cost plot positioning FI, AVF/PVF, and the goal]

slide-6
SLIDE 6

Identifying SDC-causing Bits

§ AVF/PVF: identify Architecturally Correct Execution (ACE) bits [MICRO03, HPCA10]
§ e(nhanced)PVF: a methodology that distinguishes crash-causing bits from ACE bits

[Figure: total bits for execution contain the ACE bits, which divide into SDC-causing bits and crash-causing bits]

slide-7
SLIDE 7

PVF Analysis [Sridharan, HPCA10]

§ ACE Bits = $\sum_{i=1}^{8} \text{Bits in } R_i$

§ Total Bits = $\sum_{i=1}^{9} \text{Bits in } R_i$

§ PVF = $\dfrac{\text{ACE Bits}}{\text{Total Bits}} = 88.9\%$

Example instruction sequence:

    R1 = LD R2
    R4 = ADD R1, R3
    R5 = ADD R6*4, R7
    ST R4, R5
    R8 = LD R2

[Figure: dependence graph over ADDR1, R2, R1, R3, R4, ADDR2, R5, R6, R7, R8, with edges labeled LD, ADD, and ST]
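The PVF ratio on this slide can be checked with a few lines of Python. This is only a sketch: it assumes that 8 of 9 equally sized resources are ACE, which is one accounting that reproduces the 88.9% figure, and the 32-bit `REGISTER_WIDTH` is a hypothetical value that cancels out of the ratio.

```python
def pvf(ace_bits: int, total_bits: int) -> float:
    """Fraction of resource bits that are ACE (required for correct execution)."""
    if total_bits <= 0:
        raise ValueError("total_bits must be positive")
    return ace_bits / total_bits


REGISTER_WIDTH = 32  # assumed width; the ratio does not depend on it

# Assumption: 8 ACE resources out of 9 in total, each REGISTER_WIDTH bits wide.
ace = 8 * REGISTER_WIDTH
total = 9 * REGISTER_WIDTH

print(f"PVF = {pvf(ace, total):.1%}")  # PVF = 88.9%
```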

slide-8
SLIDE 8

Our Approach: ePVF

§ Source of crashes
    § Segmentation faults (99% of crashes are due to segfaults)
§ Direct crash-causing bits
    § Crash model
§ Indirect crash-causing bits
    § Propagation model

[Figure: sources of crashes, split into segfaults and others]

slide-9
SLIDE 9

Overall methodology

§ PVF: identify ACE bits
§ Obtaining program trace
§ Crash model: identify bits that cause a program to make an invalid memory access and crash
§ Propagation model: identify bits on the backward slice of bits that directly cause crashes

slide-10
SLIDE 10

Crash model

§ Determines the bits that cause an out-of-bounds memory access
§ Applied to every memory instruction
§ Example: for R1 = LD R2, the access is valid when R2 ∈ [addr_min, addr_max]
§ OS info: vma_start, vma_end, ESP

[Figure: bit pattern of R2 (01110001010010…) checked against the valid address range]
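A minimal sketch of the range check, assuming a single valid address interval and single-bit flips. The function name, the bounds, and the value of R2 are hypothetical; a real process has several valid regions (for example, one per VMA plus the stack), so a flipped address could land in another valid region and not crash.

```python
from typing import List


def crash_causing_bits(value: int, addr_min: int, addr_max: int,
                       width: int = 32) -> List[int]:
    """Bit positions whose single flip moves `value` outside [addr_min, addr_max]."""
    crash_bits = []
    for bit in range(width):
        flipped = value ^ (1 << bit)  # flip exactly one bit of the address
        if not (addr_min <= flipped <= addr_max):
            crash_bits.append(bit)
    return crash_bits


# Hypothetical segment bounds and address held in R2.
addr_min, addr_max = 0x1000, 0x1FFF
r2 = 0x1234

bits = crash_causing_bits(r2, addr_min, addr_max)
print(bits[0], bits[-1], len(bits))  # 12 31 20
```

In this toy case the low 12 bits only move the address within the segment, while flipping any of bits 12 to 31 takes it out of range.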

slide-11
SLIDE 11

Propagation model

§ Identifies all possible bits that can affect the bits identified by the crash model
§ Example: for R5 = ADD R6*4, R7 followed by ST R4, R5, the crash model gives min(R5) and max(R5), which propagate backward to the operands:

    max(R6) = (max(R5) - R7) / 4
    min(R6) = (min(R5) - R7) / 4
    max(R7) = max(R5) - R6*4
    min(R7) = min(R5) - R6*4
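The backward propagation of the address bounds through R5 = R6*4 + R7 can be sketched as follows. The function name and the concrete values are hypothetical, and the sketch rounds inward with integer ceiling and floor division (the slide shows plain division) and ignores register wraparound.

```python
from typing import Tuple


def propagate_bounds(r5_min: int, r5_max: int,
                     r6: int, r7: int) -> Tuple[Tuple[int, int], Tuple[int, int]]:
    """Invert R5 = R6*4 + R7 to get the operand ranges that keep R5 in bounds."""
    # R6 range with R7 held at its observed value (round inward for integers).
    r6_min = -(-(r5_min - r7) // 4)   # ceiling division
    r6_max = (r5_max - r7) // 4       # floor division
    # R7 range with R6 held at its observed value.
    r7_min = r5_min - r6 * 4
    r7_max = r5_max - r6 * 4
    return (r6_min, r6_max), (r7_min, r7_max)


# Hypothetical bounds on the store address R5 and observed operand values.
r6_bounds, r7_bounds = propagate_bounds(0x1000, 0x1FFF, r6=4, r7=0x1000)
print(r6_bounds, r7_bounds)  # (0, 1023) (4080, 8175)
```

Given these operand ranges, the same single-bit check used for the address register can be applied to R6 and R7 to find the indirect crash-causing bits.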

slide-12
SLIDE 12

Overall ePVF methodology

§ PVF: identify ACE bits
§ Obtaining program trace
§ Crash model
§ Propagation model
§ ePVF: bits that potentially lead to SDCs

slide-13
SLIDE 13

Experimental setup

§ Scientific benchmarks
    § 8 from Rodinia [IISWC 09]
    § Matrix Multiplication
    § LULESH: DOE proxy app [IPDPS 2013]
§ Fault Model
§ LLFI [DSN 14]
§ 3,000 runs per benchmark

slide-14
SLIDE 14

Evaluation

§ RQ1: Accuracy of the models
§ RQ2: Effectiveness of the ePVF methodology
§ RQ3: Performance

slide-15
SLIDE 15

RQ1: Accuracy of the models

§ Recall: run FI experiments, take the crash trials, pick the flipped bit of each crash trial, and check whether the model contains that bit
§ Precision: randomly pick a bit from the models, flip that exact bit during execution, and check whether a crash occurs

[Figure: recall of the model and precision of the model, each on a 50% to 100% axis]

Our models achieve 89% recall and 92% precision on average
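The two measurements can be sketched as set checks. Everything concrete here is hypothetical: bits are represented as `(dynamic instruction index, bit position)` pairs, and the toy sets stand in for the model output and the fault-injection results.

```python
from typing import Callable, Iterable, Set, Tuple

Bit = Tuple[int, int]  # (dynamic instruction index, bit position); assumed encoding


def recall(crash_trial_bits: Iterable[Bit], predicted: Set[Bit]) -> float:
    """Share of bits from observed crash trials that the model also predicts."""
    trials = list(crash_trial_bits)
    return sum(1 for b in trials if b in predicted) / len(trials)


def precision(sampled: Iterable[Bit], crashes_when_flipped: Callable[[Bit], bool]) -> float:
    """Share of sampled model-predicted bits that crash when actually flipped."""
    picks = list(sampled)
    return sum(1 for b in picks if crashes_when_flipped(b)) / len(picks)


# Toy data standing in for the model and the fault-injection outcomes.
predicted_bits = {(0, 12), (0, 13), (1, 20), (2, 31)}
crash_trials = [(0, 12), (1, 20), (3, 5), (2, 31)]
observed_crashes = {(0, 12), (2, 31)}

r = recall(crash_trials, predicted_bits)
p = precision([(0, 12), (0, 13), (2, 31)], lambda b: b in observed_crashes)
print(f"recall={r:.2f} precision={p:.2f}")  # recall=0.75 precision=0.67
```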

slide-16
SLIDE 16

RQ1: Accuracy of the models

On average, the ePVF methodology correctly identifies crash-causing bits 90% of the time

slide-17
SLIDE 17

RQ2: Effectiveness of ePVF

§ SDC estimates from PVF analysis, ePVF analysis, and fault injection

[Figure: PVF value, ePVF value, and SDC rate from FI, on a 0% to 100% axis]

ePVF significantly tightens the upper bound of estimated SDCs, by 61% on average

slide-18
SLIDE 18

ePVF-informed Duplication

§ Rank instructions based on their ePVF value
§ ePVF value per instruction = $\dfrac{\text{ACE Bits} - \text{Crash-causing Bits}}{\text{ACE Bits}}$
§ The higher the ePVF value, the higher the chance of leading to SDCs
§ Duplicate the instructions with the highest ePVF rank
§ 30% more SDC coverage than hot-path duplication for the same performance overhead
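A minimal sketch of the ranking step, using the per-instruction ratio (ACE bits minus crash-causing bits, over ACE bits). The `InstrStats` record, the instruction names, the bit counts, and the `budget` parameter are all hypothetical; how the budget maps to a performance overhead is not modeled.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class InstrStats:
    name: str        # hypothetical instruction label
    ace_bits: int    # ACE bits attributed to the instruction
    crash_bits: int  # subset of ACE bits identified as crash-causing


def epvf(stats: InstrStats) -> float:
    """Per-instruction ePVF: share of ACE bits that are not crash-causing."""
    if stats.ace_bits == 0:
        return 0.0
    return (stats.ace_bits - stats.crash_bits) / stats.ace_bits


def rank_for_duplication(instrs: List[InstrStats], budget: int) -> List[InstrStats]:
    """Pick the `budget` instructions with the highest ePVF for duplication."""
    return sorted(instrs, key=epvf, reverse=True)[:budget]


# Toy per-instruction counts.
instrs = [
    InstrStats("load", ace_bits=32, crash_bits=20),   # ePVF 0.375
    InstrStats("add", ace_bits=32, crash_bits=0),     # ePVF 1.0
    InstrStats("store", ace_bits=64, crash_bits=48),  # ePVF 0.25
]
selected = rank_for_duplication(instrs, budget=2)
print([s.name for s in selected])  # ['add', 'load']
```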

slide-19
SLIDE 19

RQ3: Performance

§ Modeling time ranges from 30 s (lavaMD) to ~4 hours (pathfinder)
    § It depends on the size of the DDG, and hence on the number of dynamic instructions
§ Optimization (sampling and extrapolation)
    § Intuition: scientific applications usually have repetitive behaviors

[Figure: predicted ePVF vs. computed ePVF, on a 0% to 45% axis]

ePVF values extrapolated from 10% of the graph differ from the computed values by less than 1% on average

slide-20
SLIDE 20

Conclusion

§ ePVF removes the crash-causing bits from PVF to get a more accurate estimate of the SDC rate
§ A crash model that predicts direct crash-causing bits
§ A propagation model that identifies bits that lead to direct crash-causing bits
§ Implementation with the LLVM compiler
§ Drives selective protection of SDC-causing instructions

Email: bof@ece.ubc.ca
Code: https://github.com/flyree/enhancedPVF