Improving DRAM Performance by Parallelizing Refreshes with Accesses - PowerPoint PPT Presentation



SLIDE 1

Improving DRAM Performance by Parallelizing Refreshes with Accesses

Kevin Chang

Donghyuk Lee, Zeshan Chishti, Alaa Alameldeen, Chris Wilkerson, Yoongu Kim, Onur Mutlu

SLIDE 2

Executive Summary

  • DRAM refresh interferes with memory accesses
    – Degrades system performance and energy efficiency
    – Becomes exacerbated as DRAM density increases
  • Goal: Serve memory accesses in parallel with refreshes to reduce refresh interference on demand requests
  • Our mechanisms:
    – 1. Enable more parallelization between refreshes and accesses across different banks with new per-bank refresh scheduling algorithms
    – 2. Enable serving accesses concurrently with refreshes in the same bank by exploiting DRAM subarrays
  • Improve system performance and energy efficiency for a wide variety of workloads and DRAM densities
    – 20.2% and 9.0% for 8-core systems using 32Gb DRAM
    – Very close to the ideal scheme without refreshes

SLIDE 3

Outline

  • Motivation and Key Ideas
  • DRAM and Refresh Background
  • Our Mechanisms
  • Results

SLIDE 4

Refresh Penalty

[Diagram: Processor → Memory Controller → DRAM; refresh commands and read data share the channel; a DRAM cell stores charge on a capacitor behind an access transistor]

  • Refresh delays requests by 100s of ns
  • Refresh interferes with memory accesses

SLIDE 5

Existing Refresh Modes

  • All-bank refresh in commodity DRAM (DDRx): one command refreshes all banks (Bank 0 … Bank 7) at once
  • Per-bank refresh in mobile DRAM (LPDDRx): refreshes one bank at a time, in round-robin order
  • Per-bank refresh allows accesses to other banks while a bank is refreshing

SLIDE 6

Shortcomings of Per-Bank Refresh

  • Problem 1: Refreshes to different banks are scheduled in a strict round-robin order
    – The static ordering is hardwired into DRAM chips
    – Refreshes busy banks with many queued requests when other banks are idle
  • Key idea: Schedule per-bank refreshes to idle banks opportunistically in a dynamic order

SLIDE 7

Shortcomings of Per-Bank Refresh

  • Problem 2: Banks that are being refreshed cannot concurrently serve memory requests

[Timeline: a read (RD) to Bank 0 is delayed by an ongoing per-bank refresh to the same bank]

SLIDE 8

Shortcomings of Per-Bank Refresh

  • Problem 2: Refreshing banks cannot concurrently serve memory requests
  • Key idea: Exploit subarrays within a bank to parallelize refreshes and accesses across subarrays

[Timeline: within Bank 0, Subarray 0 serves a read (RD) while Subarray 1 performs a subarray refresh in parallel]

SLIDE 9

Outline

  • Motivation and Key Ideas
  • DRAM and Refresh Background
  • Our Mechanisms
  • Results

SLIDE 10

DRAM System Organization

[Diagram: DRAM organized into Rank 0 and Rank 1, each containing Bank 0 … Bank 7]

  • Banks can serve multiple requests in parallel
SLIDE 11

DRAM Refresh Frequency

  • DRAM standard requires memory controllers to send periodic refreshes to DRAM
    – tRefPeriod (tREFI): remains constant
    – tRefLatency (tRFC): varies based on DRAM chip density (e.g., 350ns)
    – Read/Write: roughly 50ns

SLIDE 12

Increasing Performance Impact

  • DRAM is unavailable to serve requests for tRefLatency / tRefPeriod of time
    – 6.7% for today's 4Gb DRAM
  • Unavailability increases with higher density due to higher tRefLatency
    – 23% / 41% for future 32Gb / 64Gb DRAM
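The unavailability fraction on this slide is just tRefLatency divided by tRefPeriod. A minimal sketch (not from the slides; the sample values below are illustrative, using the 350ns tRefLatency mentioned earlier and the standard 7.8μs refresh interval from the backup slides):

```python
def refresh_unavailability(t_ref_latency_ns: float, t_ref_period_ns: float) -> float:
    """Fraction of time DRAM is busy refreshing: tRefLatency / tRefPeriod."""
    return t_ref_latency_ns / t_ref_period_ns

# Illustrative: a 350 ns refresh latency within a 7800 ns refresh period.
print(f"{refresh_unavailability(350, 7800):.1%}")
```

As tRefLatency grows with chip density while tRefPeriod stays fixed, this ratio grows toward the 23%/41% figures quoted for future densities.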

SLIDE 13

All-Bank vs. Per-Bank Refresh

  • All-Bank Refresh: employed in commodity DRAM (DDRx, LPDDRx)
    – Staggered across banks to limit power
  • Per-Bank Refresh: in mobile DRAM (LPDDRx)
    – Shorter tRefLatency than that of all-bank refresh
    – More frequent refreshes (shorter tRefPeriod)
    – Can serve memory accesses (reads) in parallel with refreshes across banks

SLIDE 14

Shortcomings of Per-Bank Refresh

  • 1) Per-bank refreshes are strictly scheduled in round-robin order (as fixed by DRAM's internal logic)
  • 2) A refreshing bank cannot serve memory accesses
  • Goal: Enable more parallelization between refreshes and accesses using practical mechanisms

SLIDE 15

Outline

  • Motivation and Key Ideas
  • DRAM and Refresh Background
  • Our Mechanisms
    – 1. Dynamic Access-Refresh Parallelization (DARP)
    – 2. Subarray Access-Refresh Parallelization (SARP)
  • Results

SLIDE 16

Our First Approach: DARP

  • Dynamic Access-Refresh Parallelization (DARP)
    – An improved scheduling policy for per-bank refreshes
    – Exploits refresh scheduling flexibility in DDR DRAM
  • Component 1: Out-of-order per-bank refresh
    – Avoids poor static scheduling decisions
    – Dynamically issues per-bank refreshes to idle banks
  • Component 2: Write-Refresh Parallelization
    – Avoids refresh interference on latency-critical reads
    – Parallelizes refreshes with a batch of writes

SLIDE 17

1) Out-of-Order Per-Bank Refresh

  • Dynamic scheduling policy that prioritizes refreshes to idle banks
  • Memory controllers decide which bank to refresh

SLIDE 18

1) Out-of-Order Per-Bank Refresh

[Timeline: under the baseline round-robin order, a read to Bank 0 is delayed by that bank's refresh; with DARP, the refresh goes to idle Bank 1 first, saving cycles on both banks]

  • Reduces refresh penalty on demand requests by refreshing idle banks first in a flexible order
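The idle-bank-first idea above can be sketched as a small selection function. This is a hedged simplification, not the paper's full algorithm (which also tracks per-bank refresh credits and deadlines); the function name and queue representation are illustrative:

```python
def pick_bank_to_refresh(request_queues, pending_refresh, round_robin_next):
    """Choose which bank to refresh next, preferring idle banks.

    request_queues: dict bank -> list of queued demand requests
    pending_refresh: set of banks still owing a refresh this round
    round_robin_next: bank the baseline static order would refresh
    Returns a bank id, or None to postpone the refresh.
    """
    # Opportunistically refresh any bank that owes a refresh and is idle.
    for bank in sorted(pending_refresh):
        if not request_queues[bank]:
            return bank
    # Every pending bank is busy: fall back to the static round-robin pick.
    return round_robin_next if round_robin_next in pending_refresh else None

queues = {0: ["RD", "RD"], 1: []}               # Bank 0 busy, Bank 1 idle
print(pick_bank_to_refresh(queues, {0, 1}, 0))  # picks idle Bank 1 first
```

The saved cycles on the timeline come exactly from this reordering: the busy bank keeps serving reads while the idle bank absorbs the refresh.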

SLIDE 19

Outline

  • Motivation and Key Ideas
  • DRAM and Refresh Background
  • Our Mechanisms
    – 1. Dynamic Access-Refresh Parallelization (DARP)
      • 1) Out-of-Order Per-Bank Refresh
      • 2) Write-Refresh Parallelization
    – 2. Subarray Access-Refresh Parallelization (SARP)
  • Results

SLIDE 20

Refresh Interference on Upcoming Requests

  • Problem: A refresh may collide with an upcoming request in the near future

[Timeline: a refresh issued to a bank collides with a read that arrives shortly afterward, delaying it]

SLIDE 21

DRAM Write Draining

  • Observations:
    – 1) Bus-turnaround latency is incurred when transitioning from writes to reads or vice versa
      • To mitigate bus-turnaround latency, writes are typically drained to DRAM in a batch during a period of time
    – 2) Writes are not latency-critical

[Timeline: writes to Banks 0 and 1 are drained as a batch, with a bus turnaround before the next read]

SLIDE 22

2) Write-Refresh Parallelization

  • Proactively schedules refreshes when banks are serving write batches

[Timeline: the baseline issues the refresh before the read, delaying it; write-refresh parallelization 1. postpones the refresh and 2. performs it during the write batch, saving cycles]

  • Avoids stalling latency-critical read requests by refreshing with non-latency-critical writes
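The two steps on the timeline (postpone the refresh, then issue it during the write batch) can be sketched as a toy command generator. This is an illustrative simplification, assuming a single refresh can be hidden behind one bank's write drain; the command tuples and function name are made up for the sketch:

```python
def drain_writes_with_refresh(write_batch_bank, banks, pending_refresh):
    """While one bank drains its write batch, refresh another bank in parallel.

    pending_refresh: set of banks with a postponed refresh (mutated in place).
    Returns the command sequence issued during the drain period.
    """
    commands = [("WR", write_batch_bank)]        # non-latency-critical writes
    for bank in banks:
        if bank != write_batch_bank and bank in pending_refresh:
            commands.append(("REF", bank))       # refresh hidden behind writes
            pending_refresh.discard(bank)
            break
    return commands

pending = {0}
print(drain_writes_with_refresh(1, [0, 1], pending))
# → [('WR', 1), ('REF', 0)]
```

Because writes are not latency-critical, hiding the refresh here avoids stalling a later read, which is the saved-cycles effect shown on the slide.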

SLIDE 23

Outline

  • Motivation and Key Ideas
  • DRAM and Refresh Background
  • Our Mechanisms
    – 1. Dynamic Access-Refresh Parallelization (DARP)
    – 2. Subarray Access-Refresh Parallelization (SARP)
  • Results

SLIDE 24

Our Second Approach: SARP

  • Observations:
    – 1. A bank is further divided into subarrays
      • Each has its own row buffer to perform refresh operations
    – 2. Some subarrays and the bank I/O remain completely idle during refresh

[Diagram: a bank contains multiple subarrays, each with its own row buffer, sharing the bank I/O; the idle subarrays and bank I/O are highlighted]

SLIDE 25

Our Second Approach: SARP

  • Subarray Access-Refresh Parallelization (SARP):
    – Parallelizes refreshes and accesses within a bank

SLIDE 26

Our Second Approach: SARP

  • Subarray Access-Refresh Parallelization (SARP):
    – Parallelizes refreshes and accesses within a bank
  • Very modest DRAM modifications: 0.71% die area overhead

[Timeline: within Bank 1, Subarray 0 serves reads and returns data while Subarray 1 refreshes]

SLIDE 27

Outline

  • Motivation and Key Ideas
  • DRAM and Refresh Background
  • Our Mechanisms
  • Results

SLIDE 28

Methodology

  • Simulator configurations: 8-core processor (L1 $: 32KB, L2 $: 512KB/core), memory controller, DDR3 rank with Bank 0 … Bank 7
  • 100 workloads: SPEC CPU2006, STREAM, TPC-C/H, random access
  • System performance metric: Weighted speedup
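The weighted speedup metric named above is commonly defined (an assumption here; the slide does not spell it out) as the sum of each application's IPC when sharing the system, normalized to its IPC when running alone:

```python
def weighted_speedup(ipc_shared, ipc_alone):
    """Sum over applications of IPC_shared_i / IPC_alone_i (assumed definition)."""
    return sum(s / a for s, a in zip(ipc_shared, ipc_alone))

# Two applications, each running at 80% of its alone-run performance:
print(weighted_speedup([0.8, 1.6], [1.0, 2.0]))  # 1.6
```

With n cores the metric tops out at n (no interference), so reported gains are relative improvements in this sum across the 100 workloads.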

SLIDE 29

Comparison Points

  • All-bank refresh [DDR3, LPDDR3, …]
  • Per-bank refresh [LPDDR3]
  • Elastic refresh [Stuecheli et al., MICRO '10]:
    – Postpones refreshes by a time delay based on the predicted rank idle time to avoid interference on memory requests
    – Proposed to schedule all-bank refreshes without exploiting per-bank refreshes
    – Cannot parallelize refreshes and accesses within a rank
  • Ideal (no refresh)

SLIDE 30

System Performance

[Chart: Weighted Speedup (GeoMean) vs. DRAM chip density (8Gb, 16Gb, 32Gb) for All-Bank, Per-Bank, Elastic, DARP, SARP, DSARP, and Ideal; DSARP improves by 7.9%, 12.3%, and 20.2%]

  • 1. Both DARP & SARP provide performance gains and combining them (DSARP) improves even more
  • 2. Consistent system performance improvement across DRAM densities (within 0.9%, 1.2%, and 3.8% of ideal)

SLIDE 31

Energy Efficiency

[Chart: Energy per Access (nJ) vs. DRAM chip density (8Gb, 16Gb, 32Gb) for All-Bank, Per-Bank, Elastic, DARP, SARP, DSARP, and Ideal; reductions of 3.0%, 5.2%, and 9.0%]

  • Consistent reduction in energy consumption

SLIDE 32

Other Results and Discussion in the Paper

  • Detailed multi-core results and analysis
  • Result breakdown based on memory intensity
  • Sensitivity results on number of cores, subarray counts, refresh interval length, and DRAM parameters
  • Comparisons to DDR4 fine granularity refresh

SLIDE 33

Executive Summary

  • DRAM refresh interferes with memory accesses
    – Degrades system performance and energy efficiency
    – Becomes exacerbated as DRAM density increases
  • Goal: Serve memory accesses in parallel with refreshes to reduce refresh interference on demand requests
  • Our mechanisms:
    – 1. Enable more parallelization between refreshes and accesses across different banks with new per-bank refresh scheduling algorithms
    – 2. Enable serving accesses concurrently with refreshes in the same bank by exploiting DRAM subarrays
  • Improve system performance and energy efficiency for a wide variety of workloads and DRAM densities
    – 20.2% and 9.0% for 8-core systems using 32Gb DRAM
    – Very close to the ideal scheme without refreshes

SLIDE 34

Improving DRAM Performance by Parallelizing Refreshes with Accesses

Kevin Chang

Donghyuk Lee, Zeshan Chishti, Alaa Alameldeen, Chris Wilkerson, Yoongu Kim, Onur Mutlu

SLIDE 35

Backup

SLIDE 36

Comparison to Concurrent Work

  • Zhang et al., HPCA'14
  • Ideas:
    – 1) Sub-rank refresh → refreshes a subset of banks within a rank
    – 2) Subarray refresh → refreshes one subarray at a time
    – 3) Dynamic sub-rank refresh scheduling policies
  • Similarities:
    – 1) Leverage idle subarrays to serve accesses
    – 2) Schedule refreshes to idle banks first
  • Differences:
    – 1) Exploit write draining periods to hide refresh latency
    – 2) We provide detailed analysis on existing per-bank refresh in mobile DRAM
    – 3) Concrete description of our scheduling algorithm

SLIDE 37

Performance Impact of Refreshes

  • Refresh penalty exacerbates as density grows

[Chart: Unavailability (%) vs. gigabits per DRAM chip, following the technology feature trend with a potential range: 6.7% for current chips, rising to 23% and potentially 43% for future densities (by year 2020*)]

*ITRS Roadmap, 2011

SLIDE 38

Temporal Flexibility

  • DRAM standard allows a few refresh commands to be issued early or late

[Timeline: within tRefreshPeriod, the refresh command stream may run delayed by 1 refresh command or ahead by 1 refresh command]

SLIDE 39

Refresh

  • tRetention = 32 ms
  • tRefreshPeriod = 3.9 μs
  • Fixed number of refresh commands to refresh entire DRAM: N = 8192

[Timeline: tRefreshWindow = N * tRefreshPeriod = 31.9488 ms < tRetention, so a refresh to a row can be postponed by t_delay as long as tRetention > tRefreshWindow + t_delay]
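A quick check of the slide's numbers (a sketch, not from the deck): the full refresh window falls just short of the 32 ms retention time, and the difference is the slack by which a refresh command can be safely postponed.

```python
t_retention_ms = 32.0        # cell retention time from the slide
t_refresh_period_us = 3.9    # tRefreshPeriod
n_commands = 8192            # refresh commands to cover the whole DRAM

# tRefreshWindow = N * tRefreshPeriod
t_refresh_window_ms = n_commands * t_refresh_period_us / 1000  # 31.9488 ms

# Slack before tRetention is violated: the maximum safe t_delay.
slack_us = (t_retention_ms - t_refresh_window_ms) * 1000

print(f"window = {t_refresh_window_ms:.4f} ms, slack = {slack_us:.1f} us")
```

This ~51 μs slack is what makes the temporal flexibility on the previous slide (issuing refresh commands early or late) safe.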

SLIDE 40

Unfairness

[Chart: Average Maximum Slowdown (lower is better) vs. DRAM chip density (8Gb, 16Gb, 32Gb) for REFab, Elastic, REFpb, DARP, SARP, and Ideal]

  • Our mechanisms do not unfairly slow down specific applications to gain performance

MaximumSlowdown = max_i (IPC_i^alone / IPC_i^shared)
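The slide's unfairness metric sketched directly (the example IPC values are illustrative):

```python
def maximum_slowdown(ipc_alone, ipc_shared):
    """MaximumSlowdown = max over applications i of IPC_i^alone / IPC_i^shared."""
    return max(a / s for a, s in zip(ipc_alone, ipc_shared))

# App 0 drops from 1.0 to 0.5 IPC when sharing; app 1 barely slows down.
print(maximum_slowdown([1.0, 2.0], [0.5, 1.8]))  # 2.0 (app 0 dominates)
```

A scheme that boosted weighted speedup by starving one application would show up here as a larger maximum slowdown, which is what the chart rules out.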

SLIDE 41

Power Overhead

  • Power overhead to parallelize a refresh operation and accesses over a four-activate window (activate current plus refresh current drawn concurrently)
  • Extend both tFAW and tRRD timing parameters:

PowerOverhead_tFAW = (4 * I_ACT + I_REF) / (4 * I_ACT)
tFAW_SARP = tFAW * PowerOverhead_tFAW
tRRD_SARP = tRRD * PowerOverhead_tFAW
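The formulas above, sketched numerically. Only the formulas come from the slide; the current values (I_ACT, I_REF) and timing values below are made-up placeholders:

```python
def power_overhead_tfaw(i_act, i_ref):
    """PowerOverhead_tFAW = (4 * I_ACT + I_REF) / (4 * I_ACT)."""
    return (4 * i_act + i_ref) / (4 * i_act)

def extended_timings(t_faw, t_rrd, i_act, i_ref):
    """Scale tFAW and tRRD by the power overhead factor (tFAW_SARP, tRRD_SARP)."""
    k = power_overhead_tfaw(i_act, i_ref)
    return t_faw * k, t_rrd * k

# Hypothetical currents and timings: I_ACT = 100 mA, I_REF = 200 mA,
# tFAW = 30 ns, tRRD = 6 ns → overhead factor 1.5.
print(extended_timings(30.0, 6.0, i_act=100.0, i_ref=200.0))  # (45.0, 9.0)
```

Stretching tFAW and tRRD by this factor keeps the peak current within the same four-activate power envelope while a refresh runs in parallel.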

SLIDE 42

Refresh Interval (7.8μs)

[Chart: Weighted Speedup (GeoMean) vs. DRAM chip density (8Gb, 16Gb, 32Gb) for REFab, REFpb, (D+S)ARP, and Ideal; gains of 3.3%, 5.3%, and 9.1%]

SLIDE 43

Die Area Overhead

  • Rambus DRAM model with 55nm process technology
  • SARP area overhead: 0.71% in a 2Gb DRAM chip

SLIDE 44

System Performance

[Chart: Weighted Speedup (GeoMean) vs. DRAM chip density (8Gb, 16Gb, 32Gb) for REFab, Elastic, REFpb, DARP, SARP, (D+S)ARP, and Ideal]

SLIDE 45

Effect of Memory Intensity

[Chart: WS improvement (%) at memory intensities 0–100%, compared to REFab and REFpb, for 8Gb, 16Gb, and 32Gb DRAM]

SLIDE 46

DDR4 FGR

[Chart: Normalized WS vs. DRAM density (8Gb, 16Gb, 32Gb) for REFab, FGR 2x, FGR 4x, AR, and DSARP]

SLIDE 47

Performance Breakdown

  • Out-of-order refresh improves performance by 3.2%/3.9%/3.0% over 8/16/32Gb DRAM
  • Write-refresh parallelization provides additional benefits of 4.3%/5.8%/5.2%

SLIDE 48

tFAW Sweep

tFAW/tRRD:    5/1   10/2   15/3   20/4   25/5   30/6 (baseline)
WS Gain (%):  14.0  13.9   13.5   12.4   11.9   10.3

SLIDE 49

Performance Degradation using Per-Bank Refresh

[Chart: Normalized Weighted Speedup of REFpb across 100 workloads]

  • Pathological per-bank refresh latency = 3.5 * tRefLatency_AllBank

SLIDE 50

Our Second Approach: SARP

  • Subarray Access-Refresh Parallelization (SARP):
    – Parallelizes refreshes and accesses within a bank
  • Problem: Shared address path for refreshes and accesses
  • Solution: Decouple the shared address path

[Diagram: the subarrays and bank I/O share a single address path carrying either an access or a refresh]

SLIDE 51

Our Second Approach: SARP

  • Subarray Access-Refresh Parallelization (SARP):
    – Parallelizes refreshes and accesses within a bank
  • Problem: Shared address path for refreshes and accesses
  • Solution: Decouple the shared address path

[Diagram: decoupled address paths deliver the access address and the refresh address to different subarrays simultaneously]