VLSID 2016 KOLKATA, INDIA January 4-8, 2016 Massed Refresh: An - - PowerPoint PPT Presentation


VLSID 2016 KOLKATA, INDIA January 4-8, 2016 Massed Refresh: An Energy-Efficient Technique to Reduce Refresh Overhead in Hybrid Memory Cube Architectures Ishan Thakkar , Sudeep Pasricha Department of Electrical and Computer Engineering


slide-1
SLIDE 1

Massed Refresh: An Energy-Efficient Technique to Reduce Refresh Overhead in Hybrid Memory Cube Architectures

Ishan Thakkar, Sudeep Pasricha Department of Electrical and Computer Engineering Colorado State University, Fort Collins, CO, U.S.A. {ishan.thakkar, sudeep}@colostate.edu

VLSID 2016

KOLKATA, INDIA January 4-8, 2016

DOI 10.1109/VLSID.2016.13

slide-2
SLIDE 2

Outline

  • Introduction
  • Background on DRAM Structure and Refresh Operation
  • Related Work
  • Contributions
  • Evaluation Setup
  • Evaluation Results
  • Conclusion

slide-4
SLIDE 4

Introduction

DRAM: Dynamic Random Access Memory

  • Main memory is DRAM
  • It is a critical component of all computing systems: server, desktop, mobile, embedded, sensor
  • DRAM stores data in a cell capacitor
      • Fully charged cell capacitor → logic ‘1’
      • Fully discharged cell capacitor → logic ‘0’
  • A DRAM cell loses data over time, as the cell capacitor leaks charge
      • For temperatures below 85°C, a DRAM cell loses data in 64 ms
      • For higher temperatures, a DRAM cell loses data at a faster rate

[Figure: DRAM cell with word line, bit line, access transistor, and cell capacitor]

To preserve data integrity, the charge on each DRAM cell (cell capacitor) must be periodically restored or refreshed.

slide-6
SLIDE 6

Background on DRAM Structure

  • Based on their structure, DRAMs are classified into two categories:
      • 1. 2D DRAMs: Planar single-layer DRAMs
      • 2. 3D-Stacked DRAMs: Multiple 2D DRAM layers stacked on one another using TSVs (TSV: Through-Silicon Via)
  • 2D DRAM structure hierarchy: Rank → Chip → Bank → Subarray → Bitcell

slide-7
SLIDE 7

2D DRAM: Rank and Chip Structure

[Figure: DRAM rank and DRAM chip organization; labels: DRAM Rank, DRAM Chip, Mux, <N>-bit data paths]

  • 2D DRAM rank:
      • Multiple chips work in tandem
slide-8
SLIDE 8

3D-Stacked DRAM Structure

HMC structure hierarchy: Hybrid Memory Cube → Vault → Bank → Subarray → Bitcell

In this paper, we consider the Hybrid Memory Cube (HMC), which is a standard for 3D-Stacked DRAMs defined by an industry consortium

slide-9
SLIDE 9

DRAM Bank Structure

[Figure: DRAM bank, divided into bank core and bank peripherals; labels: subarray (rows, columns), sense amplifiers, row address decoder, row buffer, column address decoder, column mux, data bits]

3D-Stacked and 2D DRAMs have similar bank structures

slide-10
SLIDE 10

DRAM Subarray Structure

[Figure: DRAM subarray; labels: row address, word lines, bit lines, DRAM cells (access transistor and cell capacitor), one sense amp per bit line]

3D-Stacked and 2D DRAMs have similar subarray structures

slide-11
SLIDE 11

Basic DRAM Operations

All bitlines of the bank are pre-charged to 0.5 VDD

[Figure: DRAM bank diagram; labels: global row decoder, global address latch, subarray decoders (=ID? comparators with EN), sense amplifiers, row buffer, column mux, column address decoder; step shown: PRECHARGE]

slide-12
SLIDE 12

Basic DRAM Operations

The target row is opened

[Figure: DRAM bank diagram; steps shown: PRECHARGE, ACTIVATION; row address: Row 4, Subarray ID: 1]

slide-13
SLIDE 13

Basic DRAM Operations

The target row is opened, then it is captured by the sense amplifiers (SAs)

[Figure: DRAM bank diagram; steps shown: PRECHARGE, ACTIVATION; row address: Row 4, Subarray ID: 1]

slide-14
SLIDE 14

Basic DRAM Operations

SAs drive each bitline fully to either VDD or 0 V, which restores the open row

[Figure: DRAM bank diagram; steps shown: PRECHARGE, ACTIVATION; row address: Row 4, Subarray ID: 1; Row 4 held in the sense amplifiers]

slide-15
SLIDE 15

Basic DRAM Operations

The open row is stored in the global row buffer

[Figure: DRAM bank diagram; steps shown: PRECHARGE, ACTIVATION; row address: Row 4, Subarray ID: 1; Row 4 held in the row buffer]

slide-16
SLIDE 16

Basic DRAM Operations

The target data block is selected, and then multiplexed out from the row buffer

[Figure: DRAM bank diagram; steps shown: PRECHARGE, ACTIVATION, READ; row address: Row 4, Subarray ID: 1; Column 1 selected]

slide-17
SLIDE 17

Basic DRAM Operations

[Figure: DRAM bank diagram; steps shown: PRECHARGE, ACTIVATION, READ; row address: Row 4, Subarray ID: 1; Column 1 selected]

A pair of PRECHARGE-ACTIVATION operations restores (refreshes) the target row → dummy PRECHARGE-ACTIVATION operations are performed to refresh the rows
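
The sequence walked through on the preceding slides can be sketched as a toy software model. This is only an illustration under stated assumptions: the class `ToyBank`, its methods, the normalized `VDD`, and the multiplicative `leak` model are hypothetical and are not taken from the slides or from any simulator.

```python
VDD = 1.0  # normalized supply voltage (illustrative assumption)


class ToyBank:
    """Toy DRAM bank: each cell stores a charge level between 0 and VDD."""

    def __init__(self, rows, cols):
        self.cells = [[0.0] * cols for _ in range(rows)]
        self.open_row = None
        self.row_buffer = None

    def write_bit(self, row, col, bit):
        """Hypothetical helper that seeds a cell with a full '1' or '0'."""
        self.cells[row][col] = VDD if bit else 0.0

    def leak(self, factor):
        """Crude leakage model (assumption): every charge shrinks by `factor`."""
        self.cells = [[q * factor for q in row] for row in self.cells]

    def precharge(self):
        """PRECHARGE: close any open row; bitlines idle at 0.5 * VDD."""
        self.open_row = None
        self.row_buffer = None

    def activate(self, row):
        """ACTIVATION: sense amps resolve each bitline to VDD or 0 V.

        Driving the bitlines to full levels also writes the full level
        back into the cells, which is what restores (refreshes) the row.
        """
        if self.open_row is not None:
            raise RuntimeError("bank must be precharged before activation")
        bits = [1 if q > 0.5 * VDD else 0 for q in self.cells[row]]
        self.cells[row] = [VDD if b else 0.0 for b in bits]
        self.open_row, self.row_buffer = row, bits

    def read(self, col):
        """READ: select one column out of the row buffer."""
        return self.row_buffer[col]

    def refresh(self, row):
        """Refresh: a dummy PRECHARGE-ACTIVATION pair, then close the row."""
        self.precharge()
        self.activate(row)
        self.precharge()
```

In this sketch a cell that has leaked to 0.7 VDD is still sensed as '1' and is driven back to VDD by `refresh`, without any data leaving the bank.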

slide-18
SLIDE 18

Refresh: 2D Vs 3D-Stacked DRAMs

  • 3D-Stacked DRAMs have
      • Higher capacity/density → more rows need to be refreshed
      • Higher power density → higher operating temperature (>85°C) → smaller retention period (time before DRAM cells lose data) of 32 ms, compared to 64 ms for 2D DRAMs
  • Thus, the refresh problem is more critical for 3D-Stacked DRAMs
  • Therefore, in this study, we target a standardized 3D-Stacked DRAM architecture, HMC

Refresh

  • Dummy ACTIVATION-PRECHARGE operations are performed on all rows every retention cycle (32 ms)
  • To prevent long pauses → a JEDEC-standardized Distributed Refresh method is used

slide-19
SLIDE 19

Background: Refresh Operation

  • Distributed Refresh: JEDEC-standardized method
      • A group of n rows is refreshed every 3.9 μs
      • A group of n rows forms a ‘Refresh Bundle (RB)’
      • Size of RB increases with DRAM capacity → increases tRFC

Example Distributed Refresh Operation – 1Gb HMC Vault

[Figure: one retention cycle (32 ms) divided into refresh intervals of tREFI = 3.9 µs; refresh bundles RB1, RB2, ..., RB8192 each take tRFC within their interval. Detail of one tRFC: Row1 to Row16, each taking tRC, with tREC also marked]

  • Size of RB is 16
  • tREFI: Refresh Interval; tRFC: Refresh Cycle Time; tRC: Row Cycle Time

tRFC = time taken to refresh entire RB
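
The numbers in this example can be cross-checked with a few lines of arithmetic. The sketch below is a hedged illustration: the function name `distributed_refresh` is hypothetical, the 50 ns row cycle time is an assumed placeholder rather than a value from the slides, 1 Gb is taken as 2^30 bits, and tRFC is approximated as RB size times tRC with tREC ignored.

```python
def distributed_refresh(retention_ns, num_bundles, rb_size, vault_bits, t_rc_ns):
    """Derive refresh figures for one vault under distributed refresh."""
    t_refi_ns = retention_ns / num_bundles   # one refresh bundle per interval
    total_rows = num_bundles * rb_size       # every row is refreshed once per cycle
    t_rfc_ns = rb_size * t_rc_ns             # rows back-to-back; tREC ignored (assumption)
    return {
        "t_refi_ns": t_refi_ns,
        "total_rows": total_rows,
        "row_bits": vault_bits // total_rows,
        "t_rfc_ns": t_rfc_ns,
        "refresh_busy_fraction": t_rfc_ns / t_refi_ns,
    }


example = distributed_refresh(
    retention_ns=32_000_000,  # 32 ms retention cycle (from the slide)
    num_bundles=8192,         # RB1 .. RB8192 (from the slide)
    rb_size=16,               # rows per refresh bundle (from the slide)
    vault_bits=2**30,         # 1 Gb vault, assuming 1 Gb = 2**30 bits
    t_rc_ns=50.0,             # ASSUMED row cycle time, for illustration only
)
```

With these inputs the interval works out to 3906.25 ns, which the slide rounds to 3.9 µs, and the vault holds 2^17 rows. The busy fraction depends entirely on the assumed tRC, so it should be read as a shape of the calculation rather than a measured overhead.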

slide-20
SLIDE 20

Performance Overhead of Distributed Refresh

Source: J Liu+, ISCA 2012

Performance overhead of refresh increases with increase in device capacity

slide-22
SLIDE 22

Energy Overhead of Distributed Refresh

Source: J Liu+, ISCA 2012

Energy overhead of refresh increases with increase in device capacity

Refresh is a growing problem, which needs to be addressed to realize low-latency, low-energy DRAMs

slide-24
SLIDE 24

Related Work

We improve upon Scattered Refresh
Scattered Refresh improves upon Per-bank Refresh and All-bank Refresh

slide-25
SLIDE 25

All-Bank Refresh Vs Per-Bank Refresh

  • Distributed Refresh can be implemented at two different granularities
  • All-bank Refresh: All banks are refreshed simultaneously, and none of the banks is allowed to serve any request until the refresh is complete
      • Supported by all general-purpose DDRx DRAMs
      • DRAM operation is completely stalled → number of available banks (#AB) is zero
      • Exploits bank-level parallelism (BLP) for refreshing → smaller tRFC
  • Per-bank Refresh: Only one bank is refreshed at a time, so all other banks are allowed to serve other requests
      • Supported by LPDDRx DRAMs
      • #AB > 0
      • No BLP → larger value of tRFC

tRFC: Refresh Cycle Time
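
The trade-off between the two granularities can be made concrete with a very small model. This is a sketch under stated assumptions, not the timing model used in the paper: the function `refresh_granularity` is hypothetical, rows of a refresh bundle are assumed to be spread evenly over the banks being refreshed, and each row is assumed to cost one full tRC.

```python
import math


def refresh_granularity(scheme, rb_size, num_banks, t_rc_ns):
    """Return (tRFC in ns, #AB) for one refresh bundle in a toy model."""
    if scheme == "all-bank":
        banks_refreshing = num_banks   # every bank takes part in the refresh
    elif scheme == "per-bank":
        banks_refreshing = 1           # one bank refreshes, the rest keep serving
    else:
        raise ValueError(f"unknown scheme: {scheme}")

    # Rows mapped to different banks proceed in parallel (BLP);
    # rows mapped to the same bank are refreshed one after another.
    rows_per_bank = math.ceil(rb_size / banks_refreshing)
    t_rfc_ns = rows_per_bank * t_rc_ns
    available_banks = num_banks - banks_refreshing
    return t_rfc_ns, available_banks


# Illustrative inputs: 16 rows per bundle, 8 banks, assumed tRC of 50 ns.
all_bank = refresh_granularity("all-bank", 16, 8, 50.0)
per_bank = refresh_granularity("per-bank", 16, 8, 50.0)
```

Under these assumptions All-bank Refresh finishes sooner but leaves no bank available, while Per-bank Refresh keeps seven banks available at the cost of a longer tRFC.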

slide-26
SLIDE 26

All-Bank Refresh Vs Per-Bank Refresh

[Figure: timing diagrams of the dummy ACTIVATION-PRECHARGE operations for a refresh command under All-Bank Refresh and Per-Bank Refresh; L = Layer ID, B = Bank ID, SA = Subarray ID, R = Row ID]

All-Bank Refresh

  • Smaller value of tRFC
  • Number of available banks (#AB) = 0 → DRAM operation is completely stalled

Per-Bank Refresh

  • #AB > 0
  • No BLP → larger value of tRFC

tRC: Row Cycle Time; tRFC: Refresh Cycle Time

Both All-bank Refresh and Per-bank Refresh have drawbacks, and they can be improved

slide-27
SLIDE 27

Scattered Refresh

Example Scattered Refresh Operation – HMC Vault – Refresh Bundle size of 4

  • Improves upon Per-bank Refresh: uses subarray-level parallelism (SLP) for refresh
  • Each row of RB is mapped to a different subarray
  • SLP gives the opportunity to overlap PRECHARGE with the next ACTIVATE → reduces tRFC

Source: T Kalyan+, ISCA 2012

[Figure: Scattered Refresh timing diagram; L = Layer ID, B = Bank ID, SA = Subarray ID, R = Row ID]

How does Scattered Refresh compare to Per-bank Refresh and All-bank Refresh?

slide-28
SLIDE 28

Scattered Refresh

Example Scattered Refresh Operation – HMC Vault – Refresh Bundle size of 4

[Figure: tRFC timing diagrams for All-Bank, Scattered, and Per-Bank Refresh]

tRFC for All-bank Refresh < tRFC for Scattered Refresh < tRFC for Per-bank Refresh

Room for improvement in Scattered Refresh
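
The effect of overlapping PRECHARGE with the next ACTIVATE can be illustrated by building the timeline explicitly. The sketch below is a simplified assumption-laden model: `refresh_timeline` is a hypothetical helper, each row refresh is split into an activate phase of `t_ras_ns` and a precharge phase of `t_rp_ns`, the timing values are placeholders, and any other inter-command constraints are ignored.

```python
def refresh_timeline(rb_size, t_ras_ns, t_rp_ns, overlap):
    """Return (events, tRFC) for refreshing one bundle inside a single bank.

    Each event is (row_index, activate_start, precharge_start, precharge_end).
    overlap=False models Per-bank Refresh: rows are strictly serialized.
    overlap=True models Scattered Refresh: consecutive rows sit in different
    subarrays, so the next ACTIVATE may start when this PRECHARGE starts.
    """
    events = []
    t = 0.0
    for row in range(rb_size):
        activate_start = t
        precharge_start = activate_start + t_ras_ns
        precharge_end = precharge_start + t_rp_ns
        events.append((row, activate_start, precharge_start, precharge_end))
        t = precharge_start if overlap else precharge_end
    t_rfc_ns = events[-1][3]  # the bundle is done when the last PRECHARGE ends
    return events, t_rfc_ns


# Refresh bundle of 4 rows, with ASSUMED tRAS = 35 ns and tRP = 15 ns.
_, per_bank_t_rfc = refresh_timeline(4, 35.0, 15.0, overlap=False)
scattered_events, scattered_t_rfc = refresh_timeline(4, 35.0, 15.0, overlap=True)
```

In this model the serialized case costs rb_size × (tRAS + tRP), while the overlapped case costs rb_size × tRAS + tRP, so only the final PRECHARGE is exposed.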

slide-30
SLIDE 30

Contributions

  • Crammed Refresh: Per-bank Refresh + All-bank Refresh
      • 2 banks are refreshed in parallel, instead of 1 bank in Per-bank Refresh and all banks in All-bank Refresh
  • Massed Refresh: Crammed Refresh + Scattered Refresh
      • 2 banks are refreshed in parallel
      • Uses SLP in both banks being refreshed

#AB: Number of banks available to serve other requests while remaining banks are being refreshed
#BLP: Bank-level Parallelism
#SLP: Subarray-level Parallelism

  • Only 2 banks are refreshed in parallel, as a proof of concept
  • More than 2 banks can also be chosen
  • The idea is to keep a balance between #AB and BLP for refresh
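
The balance between #AB and refresh parallelism can be sketched by describing each scheme with two knobs: how many banks refresh in parallel and whether SLP is used. This is a hedged toy model, not the authors' timing analysis; the names `t_rfc_ns`, `SCHEMES`, and `compare`, the 8-bank vault, and the tRAS and tRP values are all assumptions made for illustration.

```python
import math


def t_rfc_ns(rb_size, banks_in_parallel, slp, t_ras_ns=35.0, t_rp_ns=15.0):
    """Toy tRFC estimate for one refresh bundle (timing values are assumed)."""
    rows_per_bank = math.ceil(rb_size / banks_in_parallel)
    if slp:
        # PRECHARGE overlaps the next ACTIVATE; only the last one is exposed.
        return rows_per_bank * t_ras_ns + t_rp_ns
    return rows_per_bank * (t_ras_ns + t_rp_ns)


# Each scheme as (banks refreshed in parallel, uses SLP), per the slide.
SCHEMES = {
    "per-bank": (1, False),
    "scattered": (1, True),
    "crammed": (2, False),
    "massed": (2, True),
}


def compare(rb_size, num_banks):
    """Return {scheme: (tRFC in ns, #AB)} for every scheme in SCHEMES."""
    return {
        name: (t_rfc_ns(rb_size, banks, slp), num_banks - banks)
        for name, (banks, slp) in SCHEMES.items()
    }


# Refresh bundle of 4 rows in an ASSUMED 8-bank vault.
results = compare(rb_size=4, num_banks=8)
```

Raising the number of banks refreshed in parallel shortens tRFC but lowers #AB, which is the balance the slide refers to; adding SLP shortens tRFC further without changing #AB.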

slide-31
SLIDE 31

Crammed Refresh – tRFC Timing

Example Crammed Refresh Operation – HMC Vault – Refresh Bundle size of 4

[Figure: tRFC timing diagrams for Scattered, Crammed, and Per-Bank Refresh; L = Layer ID, B = Bank ID, SA = Subarray ID, R = Row ID]

  • Bank-level parallelism (BLP) for refresh
  • Only 2 banks are refreshed in parallel → #AB > 0

tRFC for Crammed Refresh < tRFC for Scattered Refresh

slide-32
SLIDE 32

Massed Refresh – tRFC Timing

Example Massed Refresh Operation – HMC Vault – Refresh Bundle size of 4

[Figure: tRFC timing diagrams for Massed, Crammed, and Per-Bank Refresh; L = Layer ID, B = Bank ID, SA = Subarray ID, R = Row ID]

  • Bank-level parallelism (BLP) + Subarray-level parallelism (SLP) for refresh

tRFC for Massed Refresh < tRFC for Crammed Refresh

How to implement BLP and SLP together?

slide-33
SLIDE 33

Subarray-level Parallelism (SLP)

[Figure: bank with a global row-address latch compared with a bank with per-subarray row-address latches]

Source: Y Kim+, ISCA 2012

Global Row-address Latch hinders SLP

slide-34
SLIDE 34

Bank-level Parallelism (BLP)

[Figure: HMC vault with Memory die 1 to Memory die 4 stacked on the Logic Base (LoB); labels in the vault controller: refresh controller (17-bit address counter, refresh scheduler, address calculator, control), physical address latch (LayerAddr[2], RowAddr[14], BankAddr[1]), physical address decoder, row address latch, LayerID (LID), BankID (BID), Mask, EN, TSV launch pads to banks]

BLP is implemented by masking BankID during refresh
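
One way to picture BankID masking is as an address decode in which the bank field is ignored while a refresh is in progress. The sketch below is an illustration only: the field widths come from the slide (LayerAddr[2], RowAddr[14], BankAddr[1]), but the bit ordering, the function names `decode` and `banks_enabled`, and the enable logic are assumptions rather than the circuit shown in the figure.

```python
LAYER_BITS, ROW_BITS, BANK_BITS = 2, 14, 1  # widths from the slide (17 bits total)


def decode(counter):
    """Split a 17-bit refresh address into (layer, row, bank).

    ASSUMED layout, most to least significant: layer | row | bank.
    """
    bank = counter & ((1 << BANK_BITS) - 1)
    row = (counter >> BANK_BITS) & ((1 << ROW_BITS) - 1)
    layer = (counter >> (BANK_BITS + ROW_BITS)) & ((1 << LAYER_BITS) - 1)
    return layer, row, bank


def banks_enabled(layer, bank, mask_bank_id):
    """Return the (layer, bank) pairs whose enable would be asserted.

    With the BankID masked, the bank comparison is skipped, so every bank
    of the selected layer latches the row address and refreshes in parallel.
    """
    if mask_bank_id:
        return {(layer, b) for b in range(1 << BANK_BITS)}
    return {(layer, bank)}


counter = (2 << 15) | (5 << 1) | 1          # layer 2, row 5, bank 1
layer, row, bank = decode(counter)
normal_access = banks_enabled(layer, bank, mask_bank_id=False)
refresh_access = banks_enabled(layer, bank, mask_bank_id=True)
```

With a 1-bit bank field, masking selects exactly two banks, which is consistent with the two-bank parallelism described for Crammed and Massed Refresh.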

slide-36
SLIDE 36

Evaluation Setup

  • Trace-driven simulation for PARSEC benchmarks
      • Memory access traces extracted from detailed cycle-accurate simulations using gem5
      • These memory traces were then provided as inputs to the DRAM simulator DRAMSim2
  • Energy, timing and area analysis
      • CACTI-3DD based simulation, based on a 4Gb HMC quad model
  • DRAMSim2 configuration
      • Configured DRAMSim2 using CACTI-3DD results
slide-38
SLIDE 38

Results I – Energy, Timing, Area

slide-39
SLIDE 39

Results II – Throughput

[Figure: throughput for PARSEC benchmarks]

Crammed refresh achieves 7.1% and 2.9% more throughput on average over distributed per-bank refresh and scattered refresh, respectively

Massed refresh achieves 8.4% and 4.3% more throughput on average over distributed per-bank refresh and scattered refresh, respectively
slide-40
SLIDE 40

Results III – Energy Delay Product (EDP)

[Figure: energy-delay product for PARSEC benchmarks]

Crammed refresh achieves 6.4% and 2.7% less EDP on average over distributed per-bank refresh and scattered refresh, respectively

Massed refresh achieves 7.5% and 3.9% less EDP on average over distributed per-bank refresh and scattered refresh, respectively

slide-42
SLIDE 42

Conclusions

  • Proposed Massed Refresh technique exploits
      • Bank-level as well as subarray-level parallelism during refresh operations
  • Proposed Crammed Refresh and Massed Refresh techniques
      • Improve throughput and energy-efficiency of DRAM
  • Crammed Refresh improves upon state-of-the-art
      • 7.1% & 6.4% improvements in throughput and EDP, respectively, over distributed per-bank refresh
      • 2.9% & 2.7% improvements in throughput and EDP, respectively, over scattered refresh
  • Massed Refresh improves upon state-of-the-art
      • 8.4% & 7.5% improvements in throughput and EDP, respectively, over distributed per-bank refresh
      • 4.3% & 3.9% improvements in throughput and EDP, respectively, over scattered refresh

slide-43
SLIDE 43
  • Questions / Comments ?

Thank You
