XED: EXPOSING ON-DIE ERROR DETECTION INFORMATION FOR STRONG MEMORY RELIABILITY
Prashant Nair, Georgia Tech
Vilas Sridharan, AMD Inc.
Moinuddin Qureshi, Georgia Tech
ISCA 43, June 20th, 2016
INTRODUCTION
DRAM Scaling → High Capacity Memories
Two types of DRAM faults
Scaling Faults
[Figure: Aspect ratio (H/b) of the DRAM storage node vs. technology node (nm). Source: S. J. Hong (Hynix), IEDM 2010]
Figure 2: Exponential increase in aspect ratio of DRAM cells with scaling to smaller technology nodes (redrawn from [5])
[ArchShield ISCA'13, CiDRA HPCA'15]
Runtime Faults (Sridharan et al., SC13)
Fault Mode   Transient Fault Rate (FIT)   Permanent Fault Rate (FIT)
Bit          14.2                         18.6
Word         1.4                          0.3
Column       1.4                          5.6
Row          0.2                          8.2
Bank         0.8                          10
*Total       18                           42.7
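As a rough worked example of what these FIT rates imply, the sketch below converts the "Total" row into a fault probability over a 7-year lifetime. It assumes, beyond what the slide states, that the rates are per DRAM chip and that faults arrive independently at a constant rate; 1 FIT is taken as one failure per billion device-hours, and the function names are illustrative.

```python
import math

HOURS_PER_YEAR = 24 * 365  # ignores leap days

def fault_probability(fit: float, years: float) -> float:
    """Probability of at least one fault within `years`, for a rate in FIT
    (failures per 10^9 device-hours), assuming a constant (Poisson) rate."""
    expected_faults = fit * 1e-9 * HOURS_PER_YEAR * years
    return 1.0 - math.exp(-expected_faults)

def any_chip_faulty(p_chip: float, chips: int) -> float:
    """Probability that at least one of `chips` independent chips has a fault."""
    return 1.0 - (1.0 - p_chip) ** chips

TRANSIENT_FIT = 18.0   # "Total" row, transient column
PERMANENT_FIT = 42.7   # "Total" row, permanent column

p_chip = fault_probability(TRANSIENT_FIT + PERMANENT_FIT, years=7)
p_dimm = any_chip_faulty(p_chip, chips=9)
print(f"per chip: {p_chip:.3%}, 9-chip DIMM: {p_dimm:.3%}")
```

Under these assumptions a single chip sees a fault with a probability of a few tenths of a percent over 7 years, and a 9-chip DIMM with a probability of a few percent.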
ON-DIE ECC: MITIGATE SCALING FAULTS
DRAM vendors plan to use "On-Die ECC"
- Mitigates scaling faults transparently
- Enables a good DIMM with bad chips (yield)
- Part of: LPDDR4, DDR4, DDR5 (proposed)
[Diagram: x8 DIMM with eight chips; request A returns DATA, with ECC applied inside each chip]
On-Die ECC: Single Error Correction, Double Error Detection code (SECDED)
[Diagram: (72,64) ECC; 64 bits of data protected by 8 check bits → Detect → Correct → 64 bits of correct data]
On-Die ECC fixes scaling faults invisibly
MITIGATING RUNTIME FAULTS
ECC-DIMM (9 Chips)
[Diagram: eight data chips plus one ECC chip]
Runtime faults (fault rates: Sridharan+ SC13)
- Chip faults common
- Need strong ECC
Runtime chip faults → Chipkill (strong ECC)
[Diagram: two 9-chip ECC-DIMMs; a READ accesses 18 DRAM chips]
Cost: 18 chips; performance and power inefficient
GOAL AND CHALLENGE
GOAL: Use On-Die ECC to mitigate runtime faults
"Chipkill-level reliability using x8 ECC-DIMM"
CHALLENGE: On-Die ECC is invisible; expose it without changing the memory interface
OUTLINE
- BACKGROUND
- XED
- CASE STUDIES
- EVALUATION
- SUMMARY
USING PARITY + FAILED LOCATION
What if the chip can inform that it failed?
[Diagram: eight data chips (D0 to D7) and one parity chip (PA) feed the memory controller; the chip holding D2 reports FAIL]
Fix chip faults using only 9 chips
Parity + Location → Reconstruct data for the faulty chip
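The reconstruction step can be sketched in a few lines, assuming the parity chip stores the bitwise XOR of the eight data words (RAID-3 style) and that the location of the failed chip is already known. The function names and data values are illustrative, not taken from the talk.

```python
from functools import reduce
from operator import xor

def make_parity(data_words):
    """XOR of all data words; this is what the parity chip would store."""
    return reduce(xor, data_words, 0)

def reconstruct(received, parity, failed_index):
    """Rebuild the word of the chip at `failed_index` from the other chips and parity."""
    others = (w for i, w in enumerate(received) if i != failed_index)
    return reduce(xor, others, parity)

data = [0x1111, 0x2222, 0x3333, 0x4444, 0x5555, 0x6666, 0x7777, 0x8888]  # D0..D7
pa = make_parity(data)

received = list(data)
received[2] = 0xDEADBEEF            # chip 2 returns garbage and reports FAIL
recovered = reconstruct(received, pa, failed_index=2)
print(hex(recovered))               # 0x3333
```

Parity alone only says that something is wrong; it is the known location that lets a single parity word recover the data.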
XED: EXPOSED ON-DIE ERROR DETECTION
XED consists of three components
- Strong detection in addition to SEC
- Parity-based correction
- Transparently identifying the faulty chip
XED: ON-DIE ECC AS DETECTION CODE
On-Die Error Correction Code
[Diagram: 64-bit data → Detect → Correct]
[Table: Corrects? / Detects? for Single-Bit Failures and Chip Failures]
On-Die Error Strong Detection + Correction Code
On-Die ECC can detect chip failures
- Chip Failures: detected (99.9%)
CRC-8 ATM code instead of Hamming code
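To illustrate detection with a CRC, here is a minimal sketch of a bitwise CRC-8 over one 64-bit word. It assumes the generator x^8 + x^2 + x + 1 (0x07, the polynomial used for the ATM header check) with a zero initial value and no final XOR; the slide names only "CRC-8 ATM", so these parameters, and the omission of the correction half of the code, are assumptions.

```python
CRC8_ATM_POLY = 0x07  # x^8 + x^2 + x + 1 (the x^8 term is implicit)

def crc8(data: bytes, poly: int = CRC8_ATM_POLY) -> int:
    """Bitwise CRC-8, MSB first, initial value 0, no final XOR."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                crc = ((crc << 1) ^ poly) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc

word = (0x0123456789ABCDEF).to_bytes(8, "big")   # one 64-bit data word
stored_check = crc8(word)                        # 8 check bits kept next to the data

# Flip eight adjacent bits at once, as a multi-bit fault might.
corrupted = bytes([word[0] ^ 0xFF]) + word[1:]
error_detected = crc8(corrupted) != stored_check
print(error_detected)                            # True
```

A CRC with an 8-bit generator catches every error burst of up to 8 bits, which is why the flipped byte above is always detected. Longer or scattered error patterns are caught with high probability rather than with certainty.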
XED: RAID-3 BASED CORRECTION
[Diagram: eight data chips plus a parity chip; On-Die ECC detected the error in one chip → reconstruct data in the failed chip]
If we could expose On-Die Error Detection → Chipkill
EXPOSE ON-DIE ERROR INFO
OPTION 1: Use additional wires
[Diagram: chips send D0 to D7 and PA to the memory controller, plus a FAIL signal on an extra wire]
- Incompatible with DDR memory standards
- Needs a new protocol
- Worse for pin-constrained future systems!
EXPOSE ON-DIE ERROR INFO
OPTION 2: Use additional burst/transaction
[Diagram: after D0 to D7 and PA, each chip sends its status (OK or FAIL) to the memory controller in an extra burst or transaction]
- Additional 12.5% to 100% bandwidth overheads
- Performance and power inefficient
Expose On-Die error detection with minor changes
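One plausible reading of the 12.5% to 100% range, sketched below: with a burst length of 8 (an assumption; the slide does not state it), one extra beat per read adds 1/8 of the data traffic, while a full extra transaction doubles it. The function names are illustrative.

```python
BURST_LENGTH = 8  # beats per read transaction (assumed, not stated on the slide)

def overhead_extra_burst(extra_beats: int = 1, burst_length: int = BURST_LENGTH) -> float:
    """Bandwidth overhead of appending status beats to every read burst."""
    return extra_beats / burst_length

def overhead_extra_transaction(extra_transactions: int = 1) -> float:
    """Bandwidth overhead of issuing whole extra transactions per read."""
    return float(extra_transactions)

print(f"{overhead_extra_burst():.1%}")        # 12.5%
print(f"{overhead_extra_transaction():.1%}")  # 100.0%
```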
XED: ON-DIE ERROR INFO FOR FREE
On detecting an error, the DRAM chip sends a 64-bit "Catch-Word" (CW) instead of data
[Diagram: memory controller receives D0 D1 CW D3 D4 D5 D6 D7 PA, 64 bits from each chip]
XED: MUX TO SEND CATCH-WORDS
Simple MUX to choose between Data and Catch-Word
[Diagram: 64-bit word → Detect → Correct; if an error is detected (Yes), the MUX outputs CW instead of data]
64-bit Catch-Words identify the faulty chip
- Chips provisioned with a unique Catch-Word
- No additional wires/bandwidth overheads
- Compatible with existing memory protocols
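The chip-side behavior can be modeled as a two-input selector. The sketch below is a toy model, not the actual circuit: the class name, the catch-word value, and the `xed_enabled` switch (motivated by the "Disable XED" steps in the case studies) are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class XedChip:
    """Toy model of one DRAM chip's read path with the XED output MUX."""
    catch_word: int            # unique 64-bit pattern for this chip (hypothetical value)
    xed_enabled: bool = True   # the controller may turn XED off, e.g. for a retry

    def read(self, data: int, error_detected: bool) -> int:
        # MUX select: catch-word when the on-die code flags an error, data otherwise.
        if self.xed_enabled and error_detected:
            return self.catch_word
        return data

chip2 = XedChip(catch_word=0xC0FFEE00_00000002)
clean = chip2.read(0x1234, error_detected=False)
flagged = chip2.read(0x1234, error_detected=True)
print(hex(clean), hex(flagged))
```

Because the catch-word travels on the same 64 data bits as ordinary data, the bus and the protocol stay unchanged. Only the interpretation of the value at the controller differs.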
WHY DO CATCH-WORDS WORK?
[Diagram: memory controller receives D0 D1 CW D3 D4 D5 D6 D7 PA, 64 bits from each chip]
Catch-Word (CW) ≠ Valid Data (D2)
Then → PA ≠ D0 ⊕ D1 ⊕ CW ⊕ … ⊕ D7
Location Identified
D2 = D0 ⊕ D1 ⊕ D3 ⊕ … ⊕ PA
Catch-Word (CW) = Valid Data (D2) [Collision]
Then → PA = D0 ⊕ D1 ⊕ CW ⊕ … ⊕ D7
No error as parity matches
Catch-Word collision: doesn't affect correctness
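The two cases above can be folded into one controller-side read routine. This is a minimal sketch under stated assumptions: parity is the XOR of the eight data words, only data chips send catch-words, and the function name and status strings are hypothetical. The "retry" status is a placeholder for whatever fallback the controller uses when the situation is ambiguous.

```python
from functools import reduce
from operator import xor

def controller_read(words, parity, catch_words):
    """words[i] is what data chip i sent; catch_words[i] is chip i's catch-word.
    Returns (status, data_words)."""
    if reduce(xor, words, parity) == 0:
        # Parity matches: no error, or a collision where the valid data
        # happens to equal the catch-word. Either way the data is usable.
        return "ok", list(words)
    flagged = [i for i, w in enumerate(words) if w == catch_words[i]]
    if len(flagged) == 1:
        i = flagged[0]
        fixed = list(words)
        fixed[i] = reduce(xor, (w for j, w in enumerate(words) if j != i), parity)
        return "corrected", fixed
    # Parity mismatch with zero or several catch-words: not handled here.
    return "retry", list(words)

cw = [0xA5A5_0000_0000_0000 + i for i in range(8)]   # hypothetical catch-words
data = [10, 20, 30, 40, 50, 60, 70, 80]              # D0..D7
pa = reduce(xor, data, 0)

sent = list(data)
sent[2] = cw[2]                                      # chip 2 detected an error
status, out = controller_read(sent, pa, cw)
print(status, out == data)                           # corrected True
```

Checking parity first is what makes collisions harmless: if the stored data really equals the catch-word, parity still matches and the word is passed through untouched.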
COLLISIONS: NOT A PROBLEM
- A chip stores 64 bits/cache-line → 2^64 combinations
- However, even a 16Gb chip has only 2^28 cache lines
- Even if this entire chip contained different data, there are nearly 2^63.99 data combinations free!
The catch-word will most likely not collide
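The counting argument can be checked directly. The sketch below assumes "16Gb" means 16 × 2^30 bits and that the chip contributes 64 bits to each cache line, as stated above; the variable names are illustrative.

```python
import math

CHIP_BITS = 16 * 2**30      # 16Gb chip
BITS_PER_LINE = 64          # bits this chip contributes per cache line

lines = CHIP_BITS // BITS_PER_LINE        # cache lines the chip can hold
total_patterns = 2**BITS_PER_LINE         # possible 64-bit values
free_patterns = total_patterns - lines    # values the chip cannot all be storing

print(math.log2(lines))                   # 28.0
print(math.log2(free_patterns))           # just under 64
```

Even in the worst case where every line holds a distinct value, the stored values occupy a vanishing fraction of the 64-bit space, so a catch-word picked from it is very unlikely to match stored data.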
XED FOR SCALING ERRORS
On-Die ECC
- Single Error Correction
- Always detects scaling errors (single-bit)
CASE STUDY 1: SINGLE SCALING FAULT
Scaling fault within a single chip
[Diagram: memory controller receives D0 D1 CW D3 D4 D5 D6 D7 PA, 64 bits from each chip]
Parity reconstructs data from chip with scaling error
- No SDC, No DUE
CASE STUDY 2: MULTIPLE SCALING FAULTS
Scaling faults within multiple chips
[Diagram: memory controller receives D0 D1 CW3 D3 D4 CW5 D6 D7 PA, 64 bits from each chip]
Disable XED + Retry
- No SDC, No DUE
CASE STUDY 3: CHIP FAULT
Catch-Word identifies the faulty chip
[Diagram: memory controller receives D0 D1 CW3 D3 D4 D5 D6 D7 PA, 64 bits from each chip]
Parity reconstructs data from failed chip
- No SDC, No DUE
CASE STUDY 4: CHIP + SCALING FAULT
[Diagram: memory controller receives D0 D1 CW3 D3 D4 CW5 D6 D7 PA, 64 bits from each chip]
Parity detects error even after retry → Chip Failure
Disable XED + Diagnosis to locate chip failure
- Very small SDC (10^-6), DUE (10^-13)
EVALUATION
USIMM: 8 Cores, 4 Channels, 2 Ranks, 8 Banks
FaultSim*: Memory Reliability Simulator
- Real World Fault Data
- 7 year system lifetime
- Billion Monte-Carlo Trials
- Metric: Probability of System Failure
- Scaling Fault-Rate: 10^-4
* Nair et al., HiPEAC 2016
RESULTS: RELIABILITY
XED vs Commercial ECC schemes
[Figure: Probability of System Failure (log scale) vs. Years, over a 7-year lifetime, for SECDED: ECC-DIMM (9 Chips), XED (9 Chips), and ChipKill (18 Chips); annotations: 43x, 4x]
XED provides strong reliability while using fewer chips
RESULTS: PERFORMANCE AND EDP
[Figure: Normalized Execution Time for SECDED, XED, and CHIPKILL (lower is better); annotation: 21%]
[Figure: Normalized Memory Energy-Delay Product (EDP) for SECDED, XED, and CHIPKILL (lower is better); annotation: 34%]
Execution time: 21%, EDP: 34%
SUMMARY
- DRAM scaling introduces errors → On-Die ECC
- On-Die ECC is invisible to the memory system
- Exposing On-Die ECC: Efficient Runtime ECC
- XED
  – Exposes On-Die Error Detection using Catch-Words
  – 2X fewer chips as compared to Chipkill
  – 4X higher reliability as compared to Chipkill
  – 21% lower execution time as compared to Chipkill
- XED → No change in memory protocols
THANK YOU
"You are in a pitiable condition, if you have to conceal what you wish to tell" - Publilius Syrus
BACKUP
RANDOM DATA?
1. Lower randomization → Longer time till collision
2. Current systems anyway scramble data for fidelity
- What if only half the data is random?
  1. Then average time for collision increases by 2x (3.2 Million Years → 6.4 Million Years)
  2. Less random data increases collision time
- DIMMs today store scrambled (randomized) data
  1. To equalize the number of 1's and 0's
  2. Reduce Bit Error Rate on the bus
  3. Scrambling using address based hash
MTTF: XED VS CHIPKILL
[Diagram: 2-chip failure scenarios for Chipkill (18 chips) and XED (9 chips); each scenario is marked FAILED or PASSED depending on where the two failed chips (X) fall]
→ Extend to Multi-Chip Failures
SDC AND DUE
Table IV: SDC AND DUE RATE OF XED
Source of Vulnerability               Rate over 7 years
XED: Scaling-Related Faults           No SDC or DUE
XED: Row/Column/Bank Failure          1.4×10^-13 (SDC)
XED: Word Failure                     6.1×10^-6 (DUE)
Data Loss from Multi-Chip Failures    5.8×10^-4
ADDITIONAL BURST/TRANSACTION
[Figure: Normalized Value (axis 0.95 to 1.30) of Memory Power and Execution Time for Chipkill and Double Chipkill; series: Expose On-Die ECC using Additional Transaction, Expose On-Die ECC using an Extra Burst]
XED VS LOT-ECC
[Figure: Normalized Execution Time of LOT-ECC (with Write-Coalescing) for SPEC, PARSEC, BIOBENCH, COMM, and GMEAN]