XED: Exposing On-Die Error Detection Information for Strong Memory Reliability


XED: Exposing On-Die Error Detection Information for Strong Memory Reliability. Prashant Nair, Georgia Tech; Vilas Sridharan, AMD Inc.; Moinuddin Qureshi, Georgia Tech. ISCA 43, June 20th, 2016.


slide-1
SLIDE 1

XED: EXPOSING ON-DIE ERROR DETECTION INFORMATION FOR STRONG MEMORY RELIABILITY

Prashant Nair, Georgia Tech
Vilas Sridharan, AMD Inc.
Moinuddin Qureshi, Georgia Tech

ISCA 43, June 20th, 2016
Seoul, Republic of Korea

slide-2
SLIDE 2

INTRODUCTION

DRAM Scaling → High-Capacity Memories
Two types of DRAM faults

Scaling Faults

[Figure 2: Exponential increase in aspect ratio of DRAM cells with scaling to smaller technology nodes (redrawn from [5]). Aspect Ratio = H/b; axes: Aspect Ratio of Storage Node vs. Technology Node (nm). Source: S. J. Hong (Hynix), IEDM 2010]

[ArchShield ISCA'13, CiDRA HPCA'15]

slide-3
SLIDE 3

INTRODUCTION

DRAM Scaling → High-Capacity Memories
Two types of DRAM faults

Runtime Faults (Sridharan et al., SC13)

Fault Mode   Transient Fault Rate (FIT)   Permanent Fault Rate (FIT)
Bit          14.2                         18.6
Word         1.4                          0.3
Column       1.4                          5.6
Row          0.2                          8.2
Bank         0.8                          10
*Total       18                           42.7

Scaling Faults

[Figure 2: Exponential increase in aspect ratio of DRAM cells with scaling to smaller technology nodes (redrawn from [5]). Aspect Ratio = H/b; axes: Aspect Ratio of Storage Node vs. Technology Node (nm). Source: S. J. Hong (Hynix), IEDM 2010]

[ArchShield ISCA'13, CiDRA HPCA'15]
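
As a quick sanity check on the runtime fault table, the per-mode rates can be summed and turned into an expected fault count. The sketch below assumes the usual meaning of FIT (failures per billion device-hours) and borrows the 7-year lifetime used in the evaluation slides; the names `FIT_TABLE`, `total_fit`, and `expected_faults` are illustrative and do not come from the slides.

```python
# Fault rates copied from the table above, in FIT.
# Assumption: FIT means failures per 10^9 device-hours (standard usage).
FIT_TABLE = {
    # mode:   (transient, permanent)
    "Bit":    (14.2, 18.6),
    "Word":   (1.4, 0.3),
    "Column": (1.4, 5.6),
    "Row":    (0.2, 8.2),
    "Bank":   (0.8, 10.0),
}
HOURS_PER_YEAR = 24 * 365


def total_fit(table):
    """Sum the transient and permanent columns separately."""
    transient = sum(t for t, _ in table.values())
    permanent = sum(p for _, p in table.values())
    return transient, permanent


def expected_faults(fit, years):
    """Expected number of faults for one device over the given lifetime."""
    return fit * years * HOURS_PER_YEAR / 1e9


transient, permanent = total_fit(FIT_TABLE)
print(f"transient total: {transient:.1f} FIT, permanent total: {permanent:.1f} FIT")
print(f"expected faults per device over 7 years: "
      f"{expected_faults(transient + permanent, 7):.6f}")
```

Both column sums agree with the *Total row of the table (18 and 42.7 FIT).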

slide-4
SLIDE 4

ON-DIE ECC: MITIGATE SCALING FAULTS

DRAM vendors plan to use "On-Die ECC"

  • Mitigates scaling faults transparently
  • Enables good DIMM with bad chips (yield)
  • Part of: LPDDR4, DDR4, DDR5 (proposed)
slide-5
SLIDE 5

ON-DIE ECC: MITIGATE SCALING FAULTS

[Diagram: x8 DIMM with eight chips; labels: REQUEST-A, DATA]

slide-6
SLIDE 6

ON-DIE ECC: MITIGATE SCALING FAULTS

[Diagram: x8 DIMM with eight chips; labels: REQUEST-A, DATA, ECC]

slide-7
SLIDE 7

ON-DIE ECC: MITIGATE SCALING FAULTS

[Diagram: 64 bits of data plus 8 check bits form a (72,64) ECC codeword]

On-Die ECC: Single Error Correction, Double Error Detection Code (SECDED)
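
To make the (72,64) SECDED idea concrete, here is a minimal sketch using the textbook extended Hamming construction: 7 Hamming check bits plus one overall parity bit over 64 data bits. The slides do not say which SECDED construction vendors use, so the bit layout and the names `encode` and `decode` are assumptions for illustration only.

```python
CHECK_POSITIONS = (1, 2, 4, 8, 16, 32, 64)
N = 71  # Hamming positions 1..71; index 0 holds the overall parity bit


def encode(data_bits):
    """Encode 64 data bits (list of 0/1) into a 72-bit codeword."""
    assert len(data_bits) == 64
    code = [0] * (N + 1)
    bits = iter(data_bits)
    for pos in range(1, N + 1):
        if pos not in CHECK_POSITIONS:
            code[pos] = next(bits)
    for p in CHECK_POSITIONS:
        parity = 0
        for pos in range(1, N + 1):
            if pos & p and pos != p:
                parity ^= code[pos]
        code[p] = parity               # makes every parity group even
    code[0] = sum(code[1:]) % 2        # overall parity over the other 71 bits
    return code


def decode(code):
    """Return (status, data): status is 'ok', 'corrected', or 'detected'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, N + 1):
        if code[pos]:
            syndrome ^= pos
    overall = sum(code) % 2
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1 and syndrome <= N:
        code[syndrome] ^= 1            # syndrome 0: the overall parity bit flipped
        status = "corrected"
    else:
        return "detected", None        # e.g. a double-bit error: no correction
    data = [code[pos] for pos in range(1, N + 1) if pos not in CHECK_POSITIONS]
    return status, data


data = [(i * i + i // 3) % 2 for i in range(64)]
codeword = encode(data)
single = list(codeword)
single[10] ^= 1
double = list(codeword)
double[3] ^= 1
double[40] ^= 1
print(decode(single)[0], decode(double)[0])
```

A single flipped bit is corrected, while two flipped bits are only detected, which matches the "Single Error Correction, Double Error Detection" wording on the slide.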

slide-8
SLIDE 8

ON-DIE ECC: MITIGATE SCALING FAULTS

[Diagram: Detect and Correct stages output 64 bits of correct data]

On-Die ECC fixes scaling faults invisibly

slide-9
SLIDE 9

MITIGATING RUNTIME FAULTS

ECC-DIMM (9 Chips)

[Diagram: eight data chips plus one ECC chip]

Runtime faults

  • Chip faults common
  • Need strong ECC

Fault Mode   Transient Fault Rate (FIT)   Permanent Fault Rate (FIT)
Bit          14.2                         18.6
Word         1.4                          0.3
Column       1.4                          5.6
Row          0.2                          8.2
Bank         0.8                          10
*Total       18                           42.7

slide-10
SLIDE 10

MITIGATING RUNTIME FAULTS

[Diagram: eight data chips plus one ECC chip]

Runtime faults

  • Chip faults common
  • Need strong ECC

Fault Mode   Transient Fault Rate (FIT)   Permanent Fault Rate (FIT)
Bit          14.2                         18.6
Word         1.4                          0.3
Column       1.4                          5.6
Row          0.2                          8.2
Bank         0.8                          10
*Total       18                           42.7

* Sridharan et al., SC13

slide-11
SLIDE 11

MITIGATING RUNTIME FAULTS

[Diagram: a READ accesses 18 DRAM chips, shown as two groups of eight data chips plus one ECC chip]

Runtime chip faults → Chipkill (strong ECC)

slide-12
SLIDE 12

MITIGATING RUNTIME FAULTS

[Diagram: 18 DRAM chips, shown as two groups of eight data chips plus one ECC chip]

Runtime chip faults → Chipkill (strong ECC)
Cost: 18 Chips, Performance and Power Inefficient

slide-13
SLIDE 13

GOAL AND CHALLENGE

GOAL: Use On-Die ECC to mitigate runtime faults
"Chipkill-level reliability using x8 ECC-DIMM"

CHALLENGE: On-Die ECC is invisible; expose it without changing the memory interface

slide-14
SLIDE 14

OUTLINE

  • BACKGROUND
  • XED
  • CASE STUDIES
  • EVALUATION
  • SUMMARY
slide-15
SLIDE 15

USING PARITY + FAILED LOCATION

What if the chip can inform that it failed?

[Diagram: eight data chips (D0 to D7) and one ECC chip connected to the Memory Controller]

slide-16
SLIDE 16

USING PARITY + FAILED LOCATION

What if the chip can inform that it failed?

[Diagram: chips send D0, D1, FAIL, D3 to D7 and the Parity chip sends PA to the Memory Controller]

Parity + Location → Reconstruct Data for Faulty Chip
Fix chip-faults using only 9 Chips
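
The "Parity + Location" step is ordinary XOR erasure recovery. A minimal sketch, assuming each chip contributes one 64-bit word per cache line and the parity chip stores the XOR of the eight data words (the function names are hypothetical):

```python
from functools import reduce
from operator import xor


def parity_word(data_words):
    """Parity chip content: XOR of the eight data words."""
    return reduce(xor, data_words, 0)


def reconstruct(words, parity, failed_index):
    """Rebuild the word of the chip at failed_index from the survivors and parity."""
    survivors = (w for i, w in enumerate(words) if i != failed_index)
    return reduce(xor, survivors, parity)


data = [0x1111111111111111 * (i + 1) for i in range(8)]   # D0..D7
pa = parity_word(data)

received = list(data)
received[2] = 0                 # chip 2 failed; whatever it returned is ignored
print(hex(reconstruct(received, pa, 2)))
```

The recovery only works because the failed location is known; parity alone says that something is wrong, not where.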

slide-17
SLIDE 17

XED: EXPOSED ON-DIE ERROR DETECTION

XED consists of three components

  • Strong detection in addition to SEC
  • Parity-based correction
  • Transparently identifying faulty chip
slide-18
SLIDE 18

XED: ON-DIE ECC AS DETECTION CODE

On-Die Error Correction Code

[Diagram: 64-bit data passes through Detect and Correct stages]

Corrects? Detects?

  • Single-Bit Failures
  • Chip Failures
slide-19
SLIDE 19

XED: ON-DIE ECC AS DETECTION CODE

On-Die Error Strong Detection + Correction Code

[Diagram: 64-bit data passes through Detect and Correct stages]

Corrects? Detects?

  • Single-Bit Failures
  • Chip Failures (99.9%)

On-Die ECC can detect chip-failures
CRC-8 ATM code instead of Hamming code
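
For readers unfamiliar with CRC-8, the sketch below computes an 8-bit CRC over a 64-bit word with the generator polynomial x^8 + x^2 + x + 1 (0x07), the polynomial used by the ATM header check. The slides only name the code; the zero initial value, MSB-first bit order, absence of a final XOR, and the helper `has_error` are assumptions, and the single-error correction that the slides pair with detection is not modeled here.

```python
POLY = 0x07  # x^8 + x^2 + x + 1 (the x^8 term is implicit)


def crc8(data: bytes) -> int:
    """Bitwise CRC-8, MSB first, initial value 0, no final XOR."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                crc = ((crc << 1) ^ POLY) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc


def has_error(data: bytes, stored_crc: int) -> bool:
    """Detection only: does the word still match its stored check byte?"""
    return crc8(data) != stored_crc


word = bytes.fromhex("0123456789abcdef")      # one 64-bit word
check = crc8(word)

corrupted = bytearray(word)
corrupted[3] ^= 0xFF                          # an 8-bit burst inside the word
print(has_error(word, check), has_error(bytes(corrupted), check))
```

A CRC with a degree-8 generator detects every error burst of up to 8 bits, which is one reason a CRC can serve as a stronger detection code than a Hamming code with the same number of check bits.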

slide-20
SLIDE 20

XED: RAID-3 BASED CORRECTION

[Diagram: eight data chips and one Parity chip; On-Die ECC detected the error in one chip]

Reconstruct Data in Failed Chip
If we could expose On-Die Error Detection → Chipkill

slide-21
SLIDE 21

EXPOSE ON-DIE ERROR INFO

OPTION 1: Use additional wires

[Diagram: chips send D0 to D7 and PA to the Memory Controller, with an extra FAIL signal]

slide-22
SLIDE 22

EXPOSE ON-DIE ERROR INFO

OPTION 1: Use additional wires

[Diagram: chips send D0 to D7 and PA to the Memory Controller, with an extra "Failed" signal]

Incompatible with DDR memory standards
Needs a new protocol
Worse for pin-constrained future systems!

slide-23
SLIDE 23

EXPOSE ON-DIE ERROR INFO

OPTION 2: Use additional burst/transaction

[Diagram: chips send D0 to D7 and PA to the Memory Controller]

slide-24
SLIDE 24

EXPOSE ON-DIE ERROR INFO

OPTION 2: Use additional burst/transaction

[Diagram: chips send D0 to D7 and PA to the Memory Controller, followed by a per-chip status: OK OK FAIL OK OK OK OK OK OK]

slide-25
SLIDE 25

EXPOSE ON-DIE ERROR INFO

OPTION 2: Use additional burst/transaction

[Diagram: chips send D0 to D7 and PA to the Memory Controller, followed by a per-chip status: OK OK FAIL OK OK OK OK OK OK]

Additional 12.5% to 100% bandwidth overheads
Performance and Power Inefficient

Expose On-Die error detection with minor changes

slide-26
SLIDE 26

XED: ON-DIE ERROR INFO FOR FREE

On detecting an error, the DRAM chip sends a 64-bit "Catch-Word" (CW) instead of data

[Diagram: chips send D0, D1, CW, D3 to D7 and PA (64 bits each) to the Memory Controller]

slide-27
SLIDE 27

XED: MUX TO SEND CATCH-WORDS

Simple MUX to choose between Data and Catch-Word

[Diagram: 64-bit data passes through Detect and Correct; when an error is detected (Yes), the MUX outputs the CW instead of the data]

slide-28
SLIDE 28

XED: ON-DIE ERROR INFO FOR FREE

On detecting an error, the DRAM chip sends a 64-bit "Catch-Word" (CW) instead of data

[Diagram: chips send D0, D1, CW, D3 to D7 and PA to the Memory Controller]

64-bit Catch-Words identify the faulty chip

Chips provisioned with a unique Catch-Word
No additional wires/bandwidth overheads
Compatible with existing memory protocols

slide-29
SLIDE 29

WHY DO CATCH-WORDS WORK?

Catch-Word (CW) ≠ Valid Data (D2)

[Diagram: chips send D0, D1, CW, D3 to D7 and PA (64 bits each)]

slide-30
SLIDE 30

WHY DO CATCH-WORDS WORK?

Catch-Word (CW) ≠ Valid Data (D2)
Then → PA ≠ D0 ⊕ D1 ⊕ CW ⊕ … ⊕ D7

[Diagram: chips send D0, D1, CW, D3 to D7 and PA]

Location Identified

slide-31
SLIDE 31

WHY DO CATCH-WORDS WORK?

Catch-Word (CW) ≠ Valid Data (D2)
Then → PA ≠ D0 ⊕ D1 ⊕ CW ⊕ … ⊕ D7

[Diagram: chips send D0, D1, CW, D3 to D7 and PA]

Location Identified
D2 = D0 ⊕ D1 ⊕ D3 ⊕ … ⊕ PA

slide-32
SLIDE 32

WHY DO CATCH-WORDS WORK?

Catch-Word (CW) = Valid Data (D2)

[Diagram: chips send D0, D1, CW, D3 to D7 and PA (64 bits each)]

slide-33
SLIDE 33

WHY DO CATCH-WORDS WORK?

Catch-Word (CW) = Valid Data (D2) [Collision]
Then → PA = D0 ⊕ D1 ⊕ CW ⊕ … ⊕ D7

[Diagram: chips send D0, D1, CW, D3 to D7 and PA]

No Error as Parity Matches
Catch-Word collision: Doesn't affect correctness
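
The two cases above (Catch-Word differs from the data, or collides with it) can be folded into one controller-side read routine. This is a sketch under stated assumptions, not the authors' implementation: the controller is assumed to know each chip's Catch-Word, the nine 64-bit words arrive as a list with the parity word last, and `controller_read` and `UncorrectableError` are hypothetical names.

```python
from functools import reduce
from operator import xor


class UncorrectableError(Exception):
    """Raised when parity fails and no single chip can be blamed."""


def controller_read(beats, catch_words):
    """beats: 9 received words (D0..D7, then PA). Returns the 8 data words."""
    if reduce(xor, beats, 0) == 0:
        # Parity matches: no error. This also covers a Catch-Word collision,
        # where valid data happens to equal the chip's Catch-Word.
        return list(beats[:8])
    flagged = [i for i, (w, cw) in enumerate(zip(beats, catch_words)) if w == cw]
    if len(flagged) != 1:
        raise UncorrectableError(f"chips sending a Catch-Word: {flagged}")
    bad = flagged[0]
    fixed = list(beats)
    # Location identified: rebuild the flagged word from the other eight.
    fixed[bad] = reduce(xor, (w for i, w in enumerate(beats) if i != bad), 0)
    return fixed[:8]


data = [0x0123456789ABCDEF ^ ((i + 1) * 0x1111) for i in range(8)]
pa = reduce(xor, data, 0)
catch_words = [0xC0DEC0DE00000000 + i for i in range(9)]

beats = data + [pa]
beats[2] = catch_words[2]            # chip 2 detected an error and sent its CW
print(controller_read(beats, catch_words) == data)
```

The parity check runs first, so a collision never triggers a reconstruction; the Catch-Word comparison only matters once parity has already signalled an error.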

slide-34
SLIDE 34

COLLISIONS: NOT A PROBLEM

  • A chip stores 64 bits/cache-line → 2^64 combinations
  • However even a 16Gb chip has only 2^28 cachelines
  • Even if this entire chip contained different data there are nearly 2^63.99 data combinations free!

The catch-word will most likely not collide
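
The counting argument can be checked with a few lines of integer arithmetic. The sketch assumes, as the slide does, a 16Gb chip that contributes 64 bits to each cache line; the variable names are illustrative.

```python
import math
from fractions import Fraction

WORD_BITS = 64                    # bits a chip stores per cache line
CHIP_BITS = 16 * 2**30            # a 16Gb chip

lines = CHIP_BITS // WORD_BITS    # 64-bit words held by the chip
total = 2**WORD_BITS              # possible 64-bit values
free = total - lines              # values unused even if every word is distinct

print("log2(words per chip) =", math.log2(lines))
print("log2(free values)    =", math.log2(free))
# Upper bound on the chance that one randomly chosen Catch-Word
# equals some value stored in the chip.
print("collision bound      =", Fraction(lines, total))
```

Even in the worst case, where every word in the chip holds a different value, the stored values occupy at most a 2^-36 fraction of the 64-bit space.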

slide-35
SLIDE 35

OUTLINE

  • BACKGROUND
  • XED
  • CASE STUDIES
  • EVALUATION
  • SUMMARY
slide-36
SLIDE 36

XED FOR SCALING ERRORS

On-Die ECC

  • Single Error Correction
  • Always detects scaling errors (single-bit)

slide-37
SLIDE 37

CASE STUDY 1: SINGLE SCALING FAULT

Scaling fault within a single chip

[Diagram: chips send D0, D1, CW, D3 to D7 and PA (64 bits each) to the Memory Controller]

Parity reconstructs data from chip with scaling error

  • No SDC, No DUE
slide-38
SLIDE 38

CASE STUDY 2: MULTIPLE SCALING FAULTS

Scaling faults within multiple chips

[Diagram: two chips send Catch-Words (labelled CW 3 and CW 5, 64 bits each) in place of data; the remaining chips send their data and PA to the Memory Controller]

Disable XED + Retry

  • No SDC, No DUE
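
The "Disable XED + Retry" flow can be pictured with a toy model. The slide does not spell out the mechanism, so the sketch assumes that each chip has an XED enable flag and that, with XED off, the chip returns data already corrected by its on-die single-error correction. The `Chip` class, the `xed_enabled` flag, and `read_line` are all hypothetical.

```python
from functools import reduce
from operator import xor


class Chip:
    """Toy DRAM chip holding one 64-bit word (hypothetical model)."""

    def __init__(self, data, catch_word, has_scaling_fault=False):
        self.data = data
        self.catch_word = catch_word
        self.has_scaling_fault = has_scaling_fault
        self.xed_enabled = True

    def read(self):
        if self.has_scaling_fault and self.xed_enabled:
            return self.catch_word   # XED on: report the detected error
        return self.data             # XED off: assume on-die SEC fixed the bit


def read_line(chips):
    """Read 9 chips (8 data + parity) and return the 8 data words."""
    words = [c.read() for c in chips]
    flagged = [i for i, c in enumerate(chips) if words[i] == c.catch_word]
    if len(flagged) > 1:
        # Parity can rebuild only one word, so retry with XED disabled.
        for c in chips:
            c.xed_enabled = False
        words = [c.read() for c in chips]
        for c in chips:
            c.xed_enabled = True
    elif len(flagged) == 1:
        bad = flagged[0]
        words[bad] = reduce(xor, (w for i, w in enumerate(words) if i != bad), 0)
    return words[:8]


data = [0x0101010101010101 * (i + 1) for i in range(8)]
parity = reduce(xor, data, 0)
catch_words = [0xFACE000000000000 + i for i in range(9)]
chips = [Chip(w, cw) for w, cw in zip(data + [parity], catch_words)]
chips[2].has_scaling_fault = True
chips[5].has_scaling_fault = True
recovered = read_line(chips)
print(recovered == data)
```

Under these assumptions the retry succeeds because each scaling fault is a single-bit error that the on-die code can correct by itself, consistent with the "No SDC, No DUE" outcome on the slide.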
slide-39
SLIDE 39

CASE STUDY 3: CHIP FAULT

Catch-Word identifies the faulty chip

[Diagram: one chip sends a Catch-Word (labelled CW 3, 64 bits) in place of data; the remaining chips send their data and PA to the Memory Controller]

Parity reconstructs data from failed chip

  • No SDC, No DUE

slide-40
SLIDE 40

CASE STUDY 4: CHIP + SCALING FAULT

[Diagram: two chips send Catch-Words (labelled CW 3 and CW 5, 64 bits each) in place of data; the remaining chips send their data and PA to the Memory Controller]

Parity detects error even after retry → Chip Failure
Disable XED + Diagnosis to locate chip failure

  • Very Small SDC (10^-6), DUE (10^-13)

slide-41
SLIDE 41

OUTLINE

  • BACKGROUND
  • XED
  • CASE STUDIES
  • EVALUATION
  • SUMMARY
slide-42
SLIDE 42

EVALUATION

USIMM: 8 Cores, 4 Channels, 2 Ranks, 8 Banks
FaultSim*: Memory Reliability Simulator

  • Real World Fault Data
  • 7 year system lifetime
  • Billion Monte-Carlo Trials
  • Metric: Probability of System Failure
  • Scaling Fault-Rate: 10^-4

* Nair et al., HiPEAC 2016

slide-43
SLIDE 43

RESULTS: RELIABILITY

XED vs Commercial ECC schemes

[Figure: Probability of System Failure (log scale, 10^-5 to 10^-1) vs. Years (1 to 7) for SECDED: ECC-DIMM (9 Chips), ChipKill (18 Chips), and XED (9 Chips); annotated gaps: 43x and 4x]

XED provides strong reliability while using fewer chips

slide-44
SLIDE 44

RESULTS: PERFORMANCE AND EDP

[Figure: Normalized Execution Time (axis 0.9 to 1.3) for SECDED, XED, and CHIPKILL; annotated difference: 21%; lower is better]

slide-45
SLIDE 45

RESULTS: PERFORMANCE AND EDP

[Figure: Normalized Execution Time (axis 0.9 to 1.3) for SECDED, XED, and CHIPKILL; annotated difference: 21%; lower is better]

[Figure: Normalized Memory Energy-Delay Product (EDP) (axis 0.8 to 1.4) for SECDED, XED, and CHIPKILL; annotated difference: 34%; lower is better]

Execution time: 21%, EDP: 34%

slide-46
SLIDE 46

OUTLINE

  • BACKGROUND
  • XED
  • CASE STUDIES
  • EVALUATION
  • SUMMARY
slide-47
SLIDE 47

SUMMARY

  • DRAM Scaling introduces errors → On-Die ECC
  • On-Die ECC is invisible to the memory system
  • Exposing On-Die ECC: Efficient Runtime ECC
  • XED
    – Exposes On-Die Error Detection using Catch-Words
    – 2X fewer chips as compared to Chipkill
    – 4X higher reliability as compared to Chipkill
    – 21% lower execution time as compared to Chipkill
  • XED → No change in memory protocols

slide-48
SLIDE 48

THANK YOU

"You are in a pitiable condition, if you have to conceal what you wish to tell" - Publilius Syrus

On-Die ECC

slide-49
SLIDE 49

BACKUP

slide-50
SLIDE 50

RANDOM DATA?

  1. Lower randomization → Longer time till collision
  2. Current systems anyway scramble data for fidelity

  • What if only half the data is random
    1. Then average time for collision increases by 2x (3.2 Million Years → 6.4 Million Years)
    2. Less random data increases collision time
  • DIMMs today store scrambled (randomized) data
    1. To equalize the number of 1's and 0's
    2. Reduce Bit Error Rate on the bus
    3. Scrambling using address based hash
slide-51
SLIDE 51

MTTF: XED VS CHIPKILL

2-Chip Failures

[Diagram: Chipkill (18 chips) and XED (9 chips), each with two failed chips marked X; outcome labels shown: FAILED, FAILED]

slide-52
SLIDE 52

MTTF: XED VS CHIPKILL

2-Chip Failures

[Diagram: Chipkill and XED (9 chips), each with two failed chips marked X; outcome labels shown: FAILED, FAILED]

slide-53
SLIDE 53

MTTF: XED VS CHIPKILL

2-Chip Failures

[Diagram: Chipkill and XED (9 chips), each with two failed chips marked X; outcome labels shown: PASSED, FAILED]

→ Extend to Multi-Chip Failures

slide-54
SLIDE 54

SDC AND DUE

Table IV: SDC AND DUE RATE OF XED

Source of Vulnerability               Rate over 7 years
XED: Scaling-Related Faults           No SDC or DUE
XED: Row/Column/Bank Failure          1.4×10^-13 (SDC)
XED: Word Failure                     6.1×10^-6 (DUE)
Data Loss from Multi-Chip Failures    5.8×10^-4

slide-55
SLIDE 55

ADDITIONAL BURST/TRANSACTION

[Figure: Normalized Value (axis 0.95 to 1.30) of Memory Power and Execution Time for Chipkill and Double Chipkill; series: "Expose On-Die ECC using Additional Transaction" and "Expose On-Die ECC using an Extra Burst"]

slide-56
SLIDE 56

XED VS LOT-ECC

[Figure: Normalized Execution Time (axis 1.00 to 1.08) of XED and LOTECC (with Write-Coalescing) across SPEC, PARSEC, BIOBENCH, COMM, and GMEAN]