Citadel: Efficiently Protecting Stacked Memory From Large Granularity Failures

Dec 15th 2014, MICRO-47, Cambridge UK. Prashant Nair (Georgia Tech), David Roberts (AMD Research), Moinuddin Qureshi (Georgia Tech).


SLIDE 1

Citadel: Efficiently Protecting Stacked Memory From Large Granularity Failures

Dec 15th 2014, MICRO-47, Cambridge UK

Prashant Nair - Georgia Tech
David Roberts - AMD Research
Moinuddin Qureshi - Georgia Tech

SLIDE 2

INTRODUCTION TO 3D DRAM

  • DRAM systems face a bandwidth wall
  • Stack DRAM Dies over each other → 3D DRAM
  • Use Through Silicon Vias (TSVs) to connect Dies
  • Higher density of TSVs → Higher Bandwidth

[Figure: 3D-stacked DRAM. Courtesy MICRON, Extremetech]

Go 3D to Scale Bandwidth Wall

SLIDE 3

FAILURES IN 3D DRAM

  • 3D DRAM Dies Communicate using TSVs
  • A New Failure Mode: TSV Failures
  • TSV Failures → Large Granularity Failures

[Figure labels: Memory Dies, Bank 0 ... Bank N, TSVs, Channel 0 ... Channel K, Logic Die]

TSVs Present a New Kind of Large Granularity Failure

SLIDE 4

A NEW FAILURE MODE FROM TSVs

TSVs are the conduit for Address and Data

  • Mainly Two Types of TSV Faults
    – Data (Incorrect Data fetched from DRAM Die)
    – Address (Incorrect address presented to DRAM Die)

[Figure labels: Logic Die, TSVs, Address TSV Fault, Data TSV Fault]

TSV Faults cause unavailability of Data and Addresses

SLIDE 5

EFFECT OF TSV FAULTS

  • Data TSV Fault → Few Columns Faulty
  • Address TSV Fault → 50% Memory Loss

[Figure labels: DRAM Bank, Row Decoder, Column Decoder, Addr. TSVs, Data TSVs, Faulty Data TSV, Faulty Addr. TSV. Address TSV fault: 50% memory unavailable]

TSVs can cause failures at multiple granularities
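
A minimal sketch of why one address TSV fault can make 50% of memory unavailable. It assumes the fault behaves as a single address line stuck at a fixed value; the slide does not state the fault model, and `reachable_fraction` with its parameters is a hypothetical helper, not something from the talk.

```python
def reachable_fraction(addr_bits: int, stuck_bit: int, stuck_value: int) -> float:
    """Fraction of locations still reachable when one address line is stuck.

    Assumption (not from the slides): the faulty TSV forces one address bit
    to `stuck_value` for every request.
    """
    mask = 1 << stuck_bit
    total = 1 << addr_bits
    seen = set()
    for addr in range(total):
        # The DRAM die only ever sees the address with the stuck bit forced.
        seen.add((addr | mask) if stuck_value else (addr & ~mask))
    return len(seen) / total


fraction = reachable_fraction(addr_bits=10, stuck_bit=3, stuck_value=0)
print(fraction)  # 0.5
```

Whichever bit is stuck, every pair of addresses that differ only in that bit collapses onto one location, so half of the address space is lost.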

SLIDE 6

IMPACT OF TSV FAULTS

System: 8GB Stacked Memory (HBM)

[Figure: Prob. System Failure, i.e. Prob(Uncorrectable Error), on a log axis (ticks 1, 10^-1, 10^-2, 10^-3), for TSV Faults: No vs. Yes. Annotation: 22X gap between the two bars]

Efficient Techniques to Mitigate TSV Faults

SLIDE 7

OTHER FAILURES STILL PRESENT

  • Bit
  • Word
  • Column
  • Row
  • Bank

Apart from TSV Faults, 3D DRAM will also continue to have other multi-granularity failures

[Figure labels: Single DRAM Die (Top View), Banks, TSVs, Stacked Memory, DRAM Dies, ECC Die]

SLIDE 8

3D DRAM: FAILURE RATE

Die Failure Mode   Permanent Fault Rate (FIT)*   SECDED
Bit                148.8                         ✔
Word               2.4                           ✖
Column             10.5                          ✖
Row                32.8                          ✖
Bank               80                            ✖

Word + Column + Row + Bank = 125.7 FIT

  • 1. Large Granularity Faults are as likely as Bit Faults
  • 2. Low Cost Solutions Required For Large Faults

*Projected from Sridharan et al.: DRAM Field Study
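
The 125.7 figure can be checked by summing the non-bit fault rates from the table. The sketch below does that, and also converts FIT (failures per 10^9 device-hours) into expected faults per year. The dictionary and function names are illustrative only.

```python
# Permanent fault rates per die, in FIT, copied from the slide.
fit = {"bit": 148.8, "word": 2.4, "column": 10.5, "row": 32.8, "bank": 80.0}

# Everything larger than a single bit counts as a large granularity fault.
large = sum(rate for mode, rate in fit.items() if mode != "bit")
ratio = large / fit["bit"]


def faults_per_year(fit_rate: float) -> float:
    """Expected faults per device-year; 1 FIT = 1 failure per 1e9 device-hours."""
    return fit_rate * 24 * 365 / 1e9


print(round(large, 1))  # 125.7
print(round(ratio, 2))  # 0.84
```

A ratio of about 0.84 is what the slide means by large granularity faults being as likely as bit faults.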

SLIDE 9

CONVENTIONAL SCHEMES

Current Systems Naturally Stripe Data Across Chips

  • ChipKill: Mitigate Large Failures (Whole Chip)

[Figure: Cache Line = 64 Bytes, spread over eight CHIPs]

ChipKill relies on data striping to tolerate large granularity failures

SLIDE 10

CHIPKILL IN STACKED MEMORY

  • A request activates at least 8 Banks or 8 Channels

[Figure labels: Single DRAM Die (Top View), Bank, Data: 8B, Cache Line = 64 Bytes]

At least 8X activation power, 8X DRAM parallelism
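
As an illustration of the striping described here, the sketch below splits a 64-byte cache line into 8-byte pieces, one per bank or channel, and contrasts that with keeping the whole line in one bank. The function names and the contiguous-chunk layout are assumptions made for the example; real ChipKill layouts interleave at a finer grain.

```python
from __future__ import annotations

LINE_BYTES = 64  # cache line size from the slide
UNITS = 8        # banks or channels touched by a striped request


def stripe(line: bytes, units: int = UNITS) -> list[bytes]:
    """Split one cache line into equal pieces, one per bank/channel."""
    if len(line) != LINE_BYTES or LINE_BYTES % units:
        raise ValueError("expected a 64-byte line and a divisor of 64")
    chunk = LINE_BYTES // units
    return [line[i * chunk:(i + 1) * chunk] for i in range(units)]


def units_activated(striped: bool, units: int = UNITS) -> int:
    """Banks/channels that must be activated to serve one request."""
    return units if striped else 1


pieces = stripe(bytes(range(LINE_BYTES)))
print(len(pieces), len(pieces[0]))                    # 8 8
print(units_activated(True), units_activated(False))  # 8 1
```

Each striped request occupies eight units instead of one, which is where the activation-power and parallelism cost on this slide comes from.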

SLIDE 11

COST OF STRIPING IN 3D DRAM

[Figure: Slowdown (axis 1.0 to 1.2) and Norm. Active Power (axis 1 to 5) for Same Bank, Across Banks, and Across Channels. Annotations: 25% More Execution Time, 4.8X More Activation Power]

Striping data across banks/channels in 3D is costly

SLIDE 12

GOAL

Develop Efficient Solutions to Mitigate TSV and Other Large Granularity Faults in Stacked Memory without striping data

SLIDE 13

OUTLINE

  • Introduction and Background
  • Citadel
  • Scheme 1: TSV-SWAP
  • Scheme 2: Three Dimensional Parity (3DP)
  • Scheme 3: Dynamic Dual Grain Sparing (DDS)
  • Summary

SLIDE 14

CITADEL: AN OVERVIEW

  • Runtime TSV Sparing (TSV-SWAP)
  • RAID-5 across 3 dimensions (Tri Dimensional Parity)
  • Spare Faulty Regions (Dual Granularity Sparing)

[Figure labels: DRAM Dies, ECC Die, Tri Dimensional Parity, TSV SWAP, Dual Granularity Sparing]

Enable robust stacked memory at very low overheads

SLIDE 15

OUTLINE

  • Introduction and Background
  • Citadel
  • Scheme 1: TSV-SWAP
  • Scheme 2: Three Dimensional Parity (3DP)
  • Scheme 3: Dynamic Dual Grain Sparing (DDS)
  • Summary

SLIDE 16

DESIGN-TIME TSV SPARING

Designers provision spare TSVs alongside Data TSVs and Address TSVs

[Figure labels: DRAM Bank, Row Decoder, Column Decoder, SPARE TSVs]

Additional Spare TSVs can replace faulty TSVs

SLIDE 17

DESIGN-TIME TSV SPARING: OPERATION

  • Deactivate Broken TSVs
  • Activate SPARE TSVs

[Figure labels: DRAM Bank, Row Decoder, Column Decoder, Faulty Data TSV (✖), Faulty Addr. TSV (✖), SPARE TSVs. Address TSV fault: 50% memory unavailable]

Deactivation of Faulty TSVs and Activation of Spare TSVs is performed at design time

SLIDE 18

DESIGN-TIME TSV SPARING: PROBLEMS

Additional TSVs are required for TSV Sparing, and what happens if TSVs turn faulty at runtime?

SLIDE 19

TSV-SWAP: RUNTIME TSV SPARING

STEP-1: CREATE STANDBY TSVs

  • Few Data TSVs as Standby TSVs
  • Replicate Standby Data in ECC

[Figure labels: DRAM Bank, Row Decoder, Column Decoder, (standby TSV); Data, Cache, ECC, Standby]

Data TSVs reused as Standby TSVs

SLIDE 20

TSV-SWAP: RUNTIME TSV SPARING

STEP-2: DETECTING FAULTY TSVs

Data vs Address TSV Faults Using CRC-32 + BIST

  • CRC-32 over address + data
  • BIST diagnoses faulty TSVs

[Figure labels: DRAM Bank, Row Decoder, Column Decoder, (standby TSV). Address TSV fault: 50% memory unavailable]
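
A rough sketch of how a CRC-32 computed over address plus data can flag both kinds of TSV fault. It uses Python's `zlib.crc32`. The byte layout (8-byte little-endian address followed by the data), the toy `memory` dictionary, and the stuck-bit fault are all assumptions for illustration; the slides do not give the encoding, and the BIST step that pinpoints the faulty TSV is not modeled.

```python
import zlib


def line_crc(address: int, data: bytes) -> int:
    """CRC-32 over the address bytes followed by the data bytes."""
    return zlib.crc32(address.to_bytes(8, "little") + data)


def read_is_clean(requested_addr: int, returned: bytes, stored_crc: int) -> bool:
    """Recompute the checksum with the address the controller asked for."""
    return line_crc(requested_addr, returned) == stored_crc


# Toy memory: address -> (data, crc stored alongside the line).
memory = {}
for addr, fill in ((0x00, b"B"), (0x40, b"A")):
    line = fill * 64
    memory[addr] = (line, line_crc(addr, line))

# Healthy read of 0x40.
data, crc = memory[0x40]
healthy_ok = read_is_clean(0x40, data, crc)

# Address TSV fault: bit 6 stuck at 0, so the die returns line 0x00 instead.
data, crc = memory[0x40 & ~(1 << 6)]
addr_fault_detected = not read_is_clean(0x40, data, crc)

# Data TSV fault: one data bit flips on the way back.
data, crc = memory[0x40]
corrupted = bytes([data[0] ^ 0x01]) + data[1:]
data_fault_detected = not read_is_clean(0x40, corrupted, crc)

print(healthy_ok, addr_fault_detected, data_fault_detected)  # True True True
```

Because the address is part of the checksum, a line fetched from the wrong location fails the check even though its own data and CRC are internally consistent.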

SLIDE 21

TSV-SWAP: RUNTIME TSV SPARING

STEP-3: REDIRECTING FAULTY TSVs

Swap Faulty TSVs with Standby TSVs at runtime

[Figure labels: DRAM Bank, Row Decoder, Column Decoder, (standby TSV), SWAP. Address TSV fault: 50% memory unavailable]

TSV-SWAP is a runtime technique that does not rely on additional spare TSVs
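
The redirection step can be pictured as a small remap table from faulty lanes to standby lanes. The class below is a hypothetical software model, not the repair circuit from the talk; lane numbering, the `TsvSwap` name, and the first-free allocation order are all assumptions.

```python
class TsvSwap:
    """Toy model: route traffic for a faulty TSV lane over a standby lane."""

    def __init__(self, standby_lanes):
        # Standby lanes are ordinary data lanes whose own bits are also
        # replicated elsewhere (the slides place that copy with the ECC data).
        self.free_standby = list(standby_lanes)
        self.remap = {}  # faulty lane -> standby lane now carrying its signal

    def mark_faulty(self, lane: int) -> bool:
        """Record a diagnosed fault; return False if no standby lane is left."""
        if lane in self.remap:
            return True
        if lane in self.free_standby:
            # A broken standby lane simply stops being available.
            self.free_standby.remove(lane)
            return True
        if not self.free_standby:
            return False
        self.remap[lane] = self.free_standby.pop(0)
        return True

    def route(self, lane: int) -> int:
        """Physical lane that carries the signal for a logical lane."""
        return self.remap.get(lane, lane)


swap = TsvSwap(standby_lanes=[6, 7])
swap.mark_faulty(2)
print(swap.route(2), swap.route(0))  # 6 0
```

No extra TSVs appear in this model: the standby lanes come out of the existing data lanes, which is the point the slide makes against design-time sparing.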

SLIDE 22

EFFECTIVENESS OF TSV-SWAP

Rate: One TSV Fault Every 7 years

[Figure: Prob. Of System Failure (log axis, ticks 10^-1, 10^-2, 10^-3) for No TSV Fault, With TSV Faults, and TSV SWAP. Annotation: Almost IDEAL]

TSV-SWAP is Effective at Tolerating TSV Faults

SLIDE 23

OUTLINE

  • Introduction and Background
  • Citadel
  • Scheme 1: TSV-SWAP
  • Scheme 2: Three Dimensional Parity (3DP)
  • Scheme 3: Dynamic Dual Grain Sparing (DDS)
  • Summary

SLIDE 24

TRI DIMENSIONAL PARITY (3DP)

  • Use RAID-5 like scheme over three dimensions
  • Detect using CRC-32
  • Correct using Parity
    – Bank Level (BL) Parity
    – Row Level (RL-H) Parity per die
    – Row Level (RL-V) Parity across dies

[Figure labels: Die 1, Die 2, Die 8, BL Parity (Dimension 1), RL-H Parity (Dimension 2), RL-V Parity (Dimension 3)]

Three Dimensions Help In Multi-Fault Handling

SLIDE 25

3DP: DATA CORRECTION

If Fault → Compute Parity and Correct

  • 1 Small Fault → RL-H or RL-V
  • 2 Small Faults → RL-H and RL-V
  • 2 Small + 1 Large Fault → RL-H and RL-V and BL

[Figure labels: Die 1, Die 2, Die 8, BL Parity, RL-H Parity, RL-V Parity]

Multiple Multi-granularity Faults Are Corrected At Runtime
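
A toy model of XOR parity kept along three dimensions, to show why a second and third dimension help when faults overlap. The exact grouping is an assumption made for the sketch: BL parity is one parity bank over all banks, RL-H is one parity row per die, and RL-V is one parity row per bank position across dies. Sizes and all function names are illustrative.

```python
from functools import reduce
from itertools import product
import random

DIES, BANKS, ROWS = 4, 4, 8  # toy sizes, chosen only to keep the example small


def xor_all(values):
    """XOR-reduce an iterable of ints (0 for an empty iterable)."""
    return reduce(lambda a, b: a ^ b, values, 0)


def build_parity(mem):
    """Return (BL, RL-H, RL-V) parity for a dict keyed by (die, bank, row)."""
    bl = [xor_all(mem[d, b, r] for d in range(DIES) for b in range(BANKS))
          for r in range(ROWS)]
    rl_h = [xor_all(mem[d, b, r] for b in range(BANKS) for r in range(ROWS))
            for d in range(DIES)]
    rl_v = [xor_all(mem[d, b, r] for d in range(DIES) for r in range(ROWS))
            for b in range(BANKS)]
    return bl, rl_h, rl_v


def fix_row_with_rl_h(mem, rl_h, d, b, r):
    """Rebuild one row from its die's parity row and the die's other rows."""
    others = xor_all(v for (dd, bb, rr), v in mem.items()
                     if dd == d and (bb, rr) != (b, r))
    return rl_h[d] ^ others


def fix_row_with_rl_v(mem, rl_v, d, b, r):
    """Rebuild one row from the parity row of its bank position across dies."""
    others = xor_all(v for (dd, bb, rr), v in mem.items()
                     if bb == b and (dd, rr) != (d, r))
    return rl_v[b] ^ others


def fix_bank_with_bl(mem, bl, d, b):
    """Rebuild every row of one bank from the parity bank and all other banks."""
    return [bl[r] ^ xor_all(mem[dd, bb, r] for dd in range(DIES)
                            for bb in range(BANKS) if (dd, bb) != (d, b))
            for r in range(ROWS)]


random.seed(0)
golden = {key: random.getrandbits(32)
          for key in product(range(DIES), range(BANKS), range(ROWS))}
bl, rl_h, rl_v = build_parity(golden)

# Two small faults in the same die: RL-H alone sees two unknowns there,
# but the rows sit in different bank positions, so RL-V can fix each one.
faulty = dict(golden)
faulty[1, 2, 3] ^= 0xFFFF
faulty[1, 0, 5] ^= 0x00FF
fixed_a = fix_row_with_rl_v(faulty, rl_v, 1, 2, 3)
fixed_b = fix_row_with_rl_v(faulty, rl_v, 1, 0, 5)
print(fixed_a == golden[1, 2, 3], fixed_b == golden[1, 0, 5])  # True True
```

In the scheme on the slide, CRC-32 does the detection; this sketch only covers the correction arithmetic once the faulty location is known.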

SLIDE 26

OVERHEADS IN UPDATING PARITY

  • RL-H and RL-V Parity: just 32 KB, stored in SRAM
  • BL Parity is 128 MB, stored in DRAM
  • Updating BL Parity has performance overhead
  • Employ Demand Caching of BL Parity in LLC
  • Mitigate overheads of updating BL Parity

Demand Caching of BL Parity Has 85% Hit Rate And Mitigates Performance Overheads
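
A sketch of the parity update that makes writes expensive, and of how a demand-filled cache absorbs it. The update rule `parity ^= old_data ^ new_data` is the usual RAID-5 read-modify-write; the LRU policy, the `ParityCache` class, and the dictionary standing in for DRAM are assumptions, since the slides only say that BL parity is cached on demand in the LLC.

```python
from collections import OrderedDict


class ParityCache:
    """Demand-filled LRU cache in front of BL parity kept in DRAM."""

    def __init__(self, backing: dict, capacity: int):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.backing = backing      # stands in for the parity bank in DRAM
        self.capacity = capacity
        self.lines = OrderedDict()  # parity address -> cached parity value
        self.hits = 0
        self.misses = 0

    def update(self, parity_addr: int, old_data: int, new_data: int) -> None:
        """Fold a data write into its parity word."""
        if parity_addr in self.lines:
            self.hits += 1
            self.lines.move_to_end(parity_addr)
        else:
            self.misses += 1  # this is the extra DRAM read the cache avoids on a hit
            if len(self.lines) >= self.capacity:
                victim, value = self.lines.popitem(last=False)
                self.backing[victim] = value
            self.lines[parity_addr] = self.backing.get(parity_addr, 0)
        self.lines[parity_addr] ^= old_data ^ new_data

    def flush(self) -> None:
        """Write every cached parity word back to DRAM."""
        self.backing.update(self.lines)
        self.lines.clear()


dram_parity = {}
cache = ParityCache(dram_parity, capacity=2)
data = {0: 0, 1: 0}  # two data words that share parity word 0
for addr, value in ((0, 5), (1, 9), (0, 12)):
    cache.update(0, data[addr], value)
    data[addr] = value
cache.flush()
print(dram_parity[0] == data[0] ^ data[1], cache.hits, cache.misses)  # True 2 1
```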

SLIDE 27

EFFECTIVENESS OF 3DP

[Figure: Prob. Of System Failure (log axis, ticks 10^-2, 10^-3, 10^-4) for 3DP and ChipKill-Like. Annotation: 7X]

3DP is 7X Stronger Than A ChipKill-Like Scheme

SLIDE 28

OUTLINE

  • Introduction and Background
  • Citadel
  • Scheme 1: TSV-SWAP
  • Scheme 2: Three Dimensional Parity (3DP)
  • Scheme 3: Dynamic Dual Grain Sparing (DDS)
  • Summary

SLIDE 29

WHY SPARE FAULTY DATA?

  • Correcting Large Faults Has Performance Overhead
  • To prevent accumulation of faults

Sparing Mitigates Performance Overheads and Enhances Reliability

SLIDE 30

TRACKING STRUCTURES IN SPARING

  • Row Level Tracking
    – Large Indirection Structure
    – Sparing Area Used Efficiently
  • Bank Level Tracking
    – Small Indirection Structure
    – Sparing Area Used Inefficiently

[Figure labels: Indirection Structure (Large) with Spare Area; Indirection Structure (Small) with Spare Area]

Ideally We Need Small Indirection Structures Which Use Spare Area Efficiently

SLIDE 31

BIMODAL FAILURES

  • Observation: Either < 4 or > 4000 row failures

[Figure: Number of Faulty Rows in a Faulty Bank (axis 4, 16, 64, 256, 1K, 4K). Annotations: 66.8%, 33.2%; Affecting less than 4 rows, Affecting more than 4000 rows]

Spare Faulty Regions At Two Granularities

SLIDE 32

DYNAMIC DUAL GRAIN SPARING

  • Provision Spare Area for Two Granularities

[Figure labels: Banks, Faulty Die, Spare Banks, ECC Die (CRC32 + Data of Standby TSVs). Bit Fault, Word Fault: use an entire spare row. Bank fault: use a spare bank]

Dual Grain Sparing Efficiently Uses Spare Area
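
One way to picture sparing at two granularities is a pair of small indirection tables: one for individual rows and one for whole banks. The model below is hypothetical. The threshold of 4 rows is borrowed from the bimodal observation, and the policy of switching a bank to a spare bank once it reaches that threshold is an assumption, not a detail given in the slides.

```python
class DualGrainSparing:
    """Toy model: spare single rows for small faults, whole banks for large ones."""

    def __init__(self, spare_rows: int, spare_banks: int, row_threshold: int = 4):
        self.free_rows = list(range(spare_rows))
        self.free_banks = list(range(spare_banks))
        self.row_map = {}   # (bank, row) -> spare row index
        self.bank_map = {}  # bank -> spare bank index
        self.row_threshold = row_threshold

    def report_faulty_row(self, bank: int, row: int) -> bool:
        """Spare a faulty row; return False if no spare area is left for it."""
        if bank in self.bank_map or (bank, row) in self.row_map:
            return True
        spared_here = [key for key in self.row_map if key[0] == bank]
        if len(spared_here) + 1 < self.row_threshold and self.free_rows:
            self.row_map[(bank, row)] = self.free_rows.pop(0)
            return True
        if self.free_banks:
            # Many faulty rows in one bank: treat it as a large fault.
            self.bank_map[bank] = self.free_banks.pop(0)
            for key in spared_here:
                self.free_rows.append(self.row_map.pop(key))
            return True
        return False

    def translate(self, bank: int, row: int):
        """Where an access to (bank, row) is actually served from."""
        if bank in self.bank_map:
            return ("spare_bank", self.bank_map[bank], row)
        if (bank, row) in self.row_map:
            return ("spare_row", self.row_map[(bank, row)])
        return ("original", bank, row)


dds = DualGrainSparing(spare_rows=8, spare_banks=1)
for faulty_row in (10, 11, 12):
    dds.report_faulty_row(0, faulty_row)
print(dds.translate(0, 10))  # ('spare_row', 0)
dds.report_faulty_row(0, 13)
print(dds.translate(0, 10))  # ('spare_bank', 0, 10)
```

The row table only ever holds a few entries per bank, and the bank table holds one entry per spare bank, so both indirection structures stay small while the spare area is not wasted on whole-bank replacements for isolated faults.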

SLIDE 33

CITADEL: RESULTS

System: 8GB HBM @ DDR3-1600
Baseline: No Protection + Same Bank

[Figure: Prob. Of System Failure (log axis, ticks 10^-3, 10^-4, 10^-5, 10^-6) for 3DP + DDS and ChipKill-Like. Annotation: 700X]

Scheme     Slowdown   Active Power
ChipKill   1.25       3.8X
Citadel    1.01       1.04X

Citadel provides 700X more resilience, consuming only 4% additional power and 1% additional execution time
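
The 4% and 1% figures are the normalized Citadel results expressed as overheads over the baseline. A short check, with illustrative names:

```python
# (slowdown, normalized active power) relative to the unprotected baseline.
results = {"ChipKill": (1.25, 3.8), "Citadel": (1.01, 1.04)}


def overhead_percent(normalized: float) -> float:
    """Convert a value normalized to the baseline into percent overhead."""
    return round((normalized - 1.0) * 100, 1)


citadel_time = overhead_percent(results["Citadel"][0])
citadel_power = overhead_percent(results["Citadel"][1])
chipkill_time = overhead_percent(results["ChipKill"][0])
print(citadel_time, citadel_power, chipkill_time)  # 1.0 4.0 25.0
```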

SLIDE 34

OUTLINE

  • Introduction and Background
  • Citadel
  • Scheme 1: TSV-SWAP
  • Scheme 2: Three Dimensional Parity (3DP)
  • Scheme 3: Dynamic Dual Grain Sparing (DDS)
  • Summary

SLIDE 35

SUMMARY

  • 3D stacking can enable high bandwidth DRAM
  • Newer failure modes like TSV failures
  • Striping data to protect against faults is costly
  • Citadel enables robust and efficient 3D DRAM by:
    – TSV-SWAP runtime TSV SPARING
    – Handling multiple-faults using 3DP
    – Isolating faults using DDS
  • Citadel provides all benefits of stacking at 700X higher resilience without the need for striping data

SLIDE 36

Thank You

Questions?

SLIDE 37

BACKUP SLIDES

SLIDE 38

CAUSES OF TSV FAULTS

Recent papers*+ show that

  • 1. TSVs are prone to EM-induced voiding effects*+
  • 2. Interfacial cracks from thermal-mechanical stress*+
  • 3. EM-induced voids increase TSV resistance, causing path delay faults and TSV open defects*+
  • 4. Micro-Bump faults+

*Li Jiang et al. [DAC 2013]
+Krishnendu C. et al. [IRPS 2012]

SLIDE 39

TSV-SWAP REPAIR CIRCUIT

SLIDE 40

PARITY CACHE: HIT RATE

[Figure: parity cache hit rate across Benchmarks]