SLIDE 1

A Comprehensive Analytical Performance Model of DRAM Caches

Authors: Nagendra Gulur*, Mahesh Mehendale*, and R Govindarajan+
Presented by: Sreepathi Pai§

*Texas Instruments, +Indian Institute of Science §University of Texas, Austin

6th ACM/SPEC International Conference on Performance Engineering, 2015

SLIDE 2

Talk Outline

  • Introduction to stacked DRAM Caches
  • Background (An overview of ANATOMY§)
  • ANATOMY-Cache: Modeling Stacked DRAM Cache Organizations

  • Evaluation
  • Insights
  • Conclusions

§ANATOMY: An Analytical Model of Memory System Performance (published in the 2014 ACM International Conference on Measurement and Modeling of Computer Systems)

SLIDE 3

Stacked DRAM

  • DRAM vertically stacked over the processor die.
  • Stacked DRAMs offer:
– High bandwidth
– High capacity
– Moderately low latency
  • Several proposals to organize this large DRAM as a last-level cache.

Picture courtesy Bryan Black (From MICRO 2013 Keynote)
SLIDE 4

Processor Organization with DRAM Cache

[Diagram: Cores 1..N, each with L1D and L1I, share an L2 (LLSC); a vertically stacked DRAM Cache with a tag predictor (Tag-Pred) and Memory Controller serves hits, while misses go to off-chip Main Memory. MetaData may be kept on SRAM or on DRAM.]
SLIDE 5

Talk Outline

  • Introduction to stacked DRAM Caches
  • Background (An overview of ANATOMY)
  • ANATOMY-Cache: Modeling Stacked DRAM Cache Organizations

  • Evaluation
  • Insights
  • Conclusions

SLIDE 6

Overview of a DRAM-based memory

[Diagram: the Memory Controller drives control, address, and data (read and write operations) to a DIMM composed of ranks and devices; each device contains banks with rows, columns, bank logic, and a row buffer]
SLIDE 7

Basic DRAM Operations

  • ACTIVATE → bring data from the DRAM core into the row-buffer
  • READ/WRITE → perform read/write operations on the contents of the row-buffer
  • PRECHARGE → store data back to the DRAM core (ACTIVATE discharges the capacitors), putting the cells back at a neutral voltage

[Timing diagram: a request stream with row-buffer hits (RD) and misses (PRE, ACT, RD)]

Row-buffer hits (RBH) are faster and consume less power.

Bank-Level Parallelism (BLP):
  • Parallelism improves performance
  • Some switching delays hurt performance
SLIDE 8

ANATOMY – Analytical Model of Memory

Two components

1) Queuing Model of Memory

– Organizational and Technological characteristics
– Workload characteristics used as input

2) Use of Workload Characteristics

– Locality and Parallelism in workload’s memory accesses

SLIDE 9

Analytical Model for Memory System Performance

[Queuing network: requests at arrival rate λ pass through an Address Bus Server, one of N Bank Servers, and a Data Bus Server]

  • Address Bus Server (M/D/1): Service Time = (RBH*1 + (1-RBH)*3) * BUS_CYCLE_TIME
  • Bank Servers 1..N (multiple M/D/1): Service Time = tCL*RBH + (tCL+tPRE+tRCD)*(1-RBH)
  • Data Bus Server (M/D/1): Service Time = Burst_Length * BUS_CYCLE_TIME

Latency = Q_addr + Q_bank + Q_data + 1/µ_addr + 1/µ_bank + 1/µ_data,
where Q = ρ/(2µ(1-ρ)) is the waiting time of an M/D/1 queue with utilization ρ = λ/µ.
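To make the queuing relations concrete, here is a minimal Python sketch of the latency estimate using the formulas above. The function and parameter names (anatomy_latency, lam, n_banks, the DRAM timing arguments) are illustrative rather than the paper's code, and it assumes bank traffic spreads evenly across the N banks.

```python
# Minimal sketch of the ANATOMY latency estimate using M/D/1 queues.
# Names and timings are illustrative; bank traffic is assumed to spread evenly.

def md1_wait(lam, mu):
    """Mean waiting time in an M/D/1 queue: Q = rho / (2 * mu * (1 - rho))."""
    rho = lam / mu
    assert rho < 1.0, "server is saturated"
    return rho / (2.0 * mu * (1.0 - rho))

def anatomy_latency(lam, rbh, n_banks, bus_cycle, t_cl, t_rcd, t_pre, burst_len):
    # Service times, matching the expressions on the slide.
    s_addr = (rbh * 1 + (1 - rbh) * 3) * bus_cycle
    s_bank = t_cl * rbh + (t_cl + t_pre + t_rcd) * (1 - rbh)
    s_data = burst_len * bus_cycle
    mu_addr, mu_bank, mu_data = 1 / s_addr, 1 / s_bank, 1 / s_data
    lam_bank = lam / n_banks           # assumption: requests spread evenly across banks
    return (md1_wait(lam, mu_addr) + s_addr +
            md1_wait(lam_bank, mu_bank) + s_bank +
            md1_wait(lam, mu_data) + s_data)

# Example: 0.02 requests/ns, RBH = 0.6, 8 banks, DDR-like timings in ns.
print(anatomy_latency(lam=0.02, rbh=0.6, n_banks=8, bus_cycle=1.25,
                      t_cl=13.75, t_rcd=13.75, t_pre=13.75, burst_len=4))
```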

SLIDE 10
Validation - Model Accuracy

[Chart: % Error in Latency, RBH, and BLP for workloads E1-E15 and their average]

  • Low Errors in RBH, BLP and Latency Estimation

– Average errors of 3.9%, 4.2%, and 4%

  • ANATOMY predicts trends accurately

SLIDE 11

Talk Outline

  • Introduction to stacked DRAM Caches
  • Background (An overview of ANATOMY)
  • ANATOMY-Cache: Modeling Stacked DRAM Cache Organizations

  • Evaluation
  • Insights
  • Conclusions

SLIDE 12

ANATOMY-Cache Model

Key Parameters that govern performance:

[Diagram: processor with a vertically stacked DRAM cache: L2 (LLSC), Tag-Pred, DRAM Cache, Memory Controller, and off-chip Main Memory; hits are served by the cache, misses go to memory]
SLIDE 13

ANATOMY-Cache Model

Key Parameters that govern performance:

  • Arrival Rate

SLIDE 14

ANATOMY-Cache Model

Key Parameters that govern performance:

  • Arrival Rate
  • Tag access time

SLIDE 15

ANATOMY-Cache Model

Key Parameters that govern performance:

  • Arrival Rate
  • Tag access time
  • Cache hit rate

SLIDE 16

ANATOMY-Cache Model

Key Parameters that govern performance:

  • Arrival Rate
  • Tag access time
  • Cache hit rate
  • Cache RBH

SLIDE 17

ANATOMY-Cache Model

Key Parameters that govern performance:

  • Arrival Rate
  • Tag access time
  • Cache hit rate
  • Cache RBH
  • Cache Miss Penalty

SLIDE 18

Extending ANATOMY to DRAM Caches

  • Two ANATOMY instances: one for the DRAM cache (ANATOMYCache) and one for main memory (ANATOMYMem).
SLIDE 19

Extending ANATOMY to DRAM Caches

  • Two ANATOMY instances: one for the DRAM cache and one for main memory.
  • The models are fed by the output of the tag server and each other's outputs.
SLIDE 20

Extending ANATOMY to DRAM Caches

  • Two ANATOMY instances: one for the DRAM cache and one for main memory.
  • The models are fed by the output of the tag server and each other's outputs:
– Predicted Cache Hits
– No Predictions
SLIDE 21

Extending ANATOMY to DRAM Caches

  • Two ANATOMY instances: one for the DRAM cache and one for main memory.
  • The models are fed by the output of the tag server and each other's outputs:
– Predicted Cache Hits
– No Predictions
– Line fills and writeback requests from main memory
SLIDE 22

Extending ANATOMY to DRAM Caches

  • Two ANATOMY instances: one for the DRAM cache and one for main memory.
  • The models are fed by the output of the tag server and each other's outputs:
– Predicted Cache Hits
– No Predictions
– Line fills and writeback requests from main memory
– Predicted Misses
SLIDE 23

Extending ANATOMY to DRAM Caches

  • Two ANATOMY instances: one for the DRAM cache and one for main memory.
  • The models are fed by the output of the tag server and each other's outputs:
– Predicted Cache Hits
– No Predictions
– Line fills and writeback requests from main memory
– Predicted Misses
– Requests from the Cache to memory (misses, line fills, and writebacks)
SLIDE 24

Extending ANATOMY to DRAM Caches

  • Two ANATOMY instances: one for the DRAM cache and one for main memory.
  • The models are fed by the output of the tag server and each other's outputs.
  • We compute the latencies at the cache (LCache) and at memory (LMem) using ANATOMY.
SLIDE 25

Obtaining the average LLSC miss penalty

  • LCache and LMem are combined to estimate the average LLSC miss penalty.
  • But first we discuss the estimation of the key parameters that govern LCache and LMem.
SLIDE 26

Estimating Key Parameters …

  • Arrival Rate
  • Tag access time
  • Cache hit rate
  • Cache RBH
  • Cache Miss Penalty

SLIDE 27

Estimating the Cache Arrival Rate

  • Arrival Rate at the Cache is a sum of several streams of accesses.
SLIDE 28

Estimating the Cache Arrival Rate

  • Arrival Rate at the Cache is a sum of several streams of accesses.
  • Predicted Hits
SLIDE 29

Estimating the Cache Arrival Rate

  • Arrival Rate at the Cache is a sum of several streams of accesses.
  • Predicted Hits
  • No predictions
SLIDE 30

Estimating the Cache Arrival Rate

  • Arrival Rate at the Cache is a sum of several streams of accesses.
  • Predicted Hits
  • No predictions
  • Line fills and writebacks
SLIDE 31

Summarizing the Cache Arrival Rate

Request Stream | Rate | Notes
Predicted Hits | λ*hpred*hcache |
No predictions | λ*(1-hpred) | Sent to the cache for tag look-up
Line Fills | λ*(1-hcache)*Bs | Bs is the cache block size
Writebacks | λ*(1-hcache)*w | w is the fraction of misses that cause write-backs

λcache = λ*hpred*hcache + λ*(1-hpred) + λ*(1-hcache)*Bs + λ*(1-hcache)*w
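The same composition, written out as a small Python sketch for experimentation; the names (lam, h_pred, h_cache, bs, w) are illustrative, and the line-fill term follows the slide's λ*(1-hcache)*Bs expression verbatim.

```python
# Sketch of the cache arrival-rate composition from the table above (names illustrative).

def cache_arrival_rate(lam, h_pred, h_cache, bs, w):
    predicted_hits = lam * h_pred * h_cache      # predicted hits go straight to the cache
    no_prediction  = lam * (1 - h_pred)          # sent to the cache for tag look-up
    line_fills     = lam * (1 - h_cache) * bs    # Bs: cache block size factor, as on the slide
    writebacks     = lam * (1 - h_cache) * w     # w: fraction of misses causing write-backs
    return predicted_hits + no_prediction + line_fills + writebacks

# Example: lam = 0.01 req/cycle, 90% predictor hit rate, 80% cache hit rate.
print(cache_arrival_rate(lam=0.01, h_pred=0.9, h_cache=0.8, bs=1.0, w=0.3))
```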

SLIDE 32

Estimating Tag Predictor Hit Rate and Access Time

Tags-on-SRAM

  • All tags on SRAM.
  • Hit Rate = 100%

Tags-on-DRAM

  • A small set-associative cache.
  • Hit Rate determined by running an access trace through the cache model.

Predictor access time depends on its size. An estimate is obtained using CACTI.

SLIDE 33

Estimating Cache Hit Rate

  • Cache Hit Rate depends on 3 key parameters:
– Cache Size
– Set Associativity
– Block Size
  • Well-studied problem
– A trace-based model and reuse distance analysis (sketch below).
  • We use a trace of accesses from the LLSC.

[Diagram: LLSC Miss Trace → Reuse Distance Analysis → Hit Rate, given Cache Size and Associativity]
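For readers who want to experiment, here is a self-contained sketch of trace-based stack-distance (reuse-distance) analysis for an LRU cache. It is a generic illustration of the technique, not the paper's exact model (which also accounts for set associativity and block size).

```python
# Illustrative reuse-distance (stack-distance) analysis for a fully associative LRU cache.
from collections import OrderedDict

def stack_distances(trace, block_size=64):
    """Return the LRU stack distance of each access in a byte-address trace."""
    stack = OrderedDict()                 # most-recently-used block is last
    dists = []
    for addr in trace:
        blk = addr // block_size
        if blk in stack:
            # Distance = number of distinct blocks touched since the last access to blk.
            dists.append(list(reversed(stack)).index(blk))
            del stack[blk]
        else:
            dists.append(float("inf"))    # cold miss
        stack[blk] = None                 # move (or insert) blk at the MRU position
    return dists

def lru_hit_rate(trace, cache_bytes, block_size=64):
    """Fully associative LRU approximation: an access hits iff its stack distance < #blocks."""
    n_blocks = cache_bytes // block_size
    d = stack_distances(trace, block_size)
    return sum(1 for x in d if x < n_blocks) / len(d)

# Example on a toy LLSC miss trace of byte addresses.
trace = [0, 64, 128, 0, 64, 4096, 0, 8192, 64]
print(lru_hit_rate(trace, cache_bytes=4 * 64))
```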

SLIDE 34
Estimating Cache Hit Rate with Block Size

  • Larger block sizes can capture spatial locality.
  • Bandwidth-neutral model: cache miss rate halves with doubling of the cache block size.
» If this holds, then measuring the hit rate at the smallest block size via trace-based analysis is sufficient.
» For larger block sizes, the miss rate is estimated by scaling from the smallest block size under this assumption (a sketch follows below).

[Chart: Workload Q5 is bandwidth-neutral]
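A minimal sketch of that scaling rule, assuming the bandwidth-neutral property holds exactly (miss rate halves per doubling of block size); the function name and example numbers are made up, and the paper may use a different expression.

```python
# Sketch of bandwidth-neutral scaling: miss rate halves when the block size doubles.

def scaled_miss_rate(miss_rate_base, base_block, block):
    """Estimate miss rate at a larger block size from the smallest measured one."""
    assert block >= base_block
    return miss_rate_base * (base_block / block)   # halves per doubling of block size

# Example: 20% miss rate measured at 64B blocks.
for b in (64, 128, 256, 512, 1024):
    print(b, round(scaled_miss_rate(0.20, 64, b), 4))
```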

SLIDE 35
Not all workloads are bandwidth-neutral

  • For such workloads, the bandwidth-neutral model leads to a lower miss-rate prediction.
  • Use trace-based cache simulations in such cases.

[Chart: Workload Q22 is NOT bandwidth-neutral]

SLIDE 36

Estimating DRAM Cache Row-Buffer Hit Rate

  • The Row-Buffer Hit rate (RBH) of the DRAM cache depends on the access pattern and the data organization on the DRAM.
  • We estimate RBH using the Reuse-Distance framework, similar to ANATOMY.

  • Details are in the paper.

SLIDE 37

Putting them together …

  • LLSC Miss Penalty is obtained from LCache and LMem:

Event | Latency contribution
Predicted Cache Hit | A = hpred*hcache*LCache
Predicted Cache Miss | B = hpred*(1-hcache)*LMem
No Prediction and Cache Hit | C = (1-hpred)*hcache*LCache
No Prediction and Cache Miss | D = (1-hpred)*(1-hcache)*(LCache+LMem)
LAvg | A + B + C + D
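A small Python sketch of this combination; the parameter names and the example latencies are illustrative, not values from the paper.

```python
# Sketch of the average LLSC miss penalty from the table above (names illustrative).

def avg_llsc_miss_penalty(h_pred, h_cache, l_cache, l_mem):
    a = h_pred * h_cache * l_cache                         # predicted cache hit
    b = h_pred * (1 - h_cache) * l_mem                     # predicted cache miss, goes to memory
    c = (1 - h_pred) * h_cache * l_cache                   # no prediction, cache hit
    d = (1 - h_pred) * (1 - h_cache) * (l_cache + l_mem)   # no prediction, miss pays both
    return a + b + c + d

# Example: 90% predictor accuracy, 80% cache hit rate, LCache = 60 ns, LMem = 150 ns.
print(avg_llsc_miss_penalty(0.9, 0.8, 60.0, 150.0))
```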

SLIDE 38

Talk Outline

  • Introduction to stacked DRAM Caches
  • Background (An overview of ANATOMY)
  • ANATOMY-Cache: Modeling Stacked DRAM Cache Organizations

  • Evaluation
  • Insights
  • Conclusions

SLIDE 39

Experimental Evaluation

  • Validation using GEM5 Simulation (with detailed Memory model)
  • Use of Multiprogrammed workloads
  • Workloads comprising SPEC2000/SPEC2006 benchmarks
  • Architecture Configurations

– 4-core and 8-core
– 128MB (4-core) and 256MB (8-core) DRAM caches
– Cache Memory: 1.6GHz DRAM, 2KB page, 128-bit bus
– DRAM Main Memory: 3.2GHz DRAM, 64-bit bus
– Tags-on-DRAM:

  • Direct Mapped 64B block size
  • Tags and Data on the same DRAM rows
  • Tag Predictor: 2-way set associative tag cache

– Tags-on-SRAM:

  • Block Size: 1024B
  • 2 way set associative

SLIDE 40

Validation of the Tags-on-DRAM Model

  • Low errors in estimation of Avg. LLSC Miss Penalty (10.9% in 4-core and 9.3% in 8-core workloads)

SLIDE 41

Validation of the Tags-on-SRAM Model

  • Low errors in estimation of Avg. LLSC Miss Penalty (10.5% in 4-core and 8.2% in 8-core workloads)

SLIDE 42

Talk Outline

  • Introduction to stacked DRAM Caches
  • Background (An overview of ANATOMY)
  • ANATOMY-Cache: Modeling Stacked DRAM Cache Organizations

  • Evaluation
  • Insights
  • Conclusions

SLIDE 43

Insight 1: It is hard to out-perform Tags-on-SRAM designs

Tag Access Time: [chart]

Requires a high predictor hit rate to beat the Tags-on-SRAM latency for tag lookup.
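One hedged way to read this insight: model the expected tag-lookup cost under Tags-on-DRAM as a mix of a fast predictor access and a slow in-DRAM tag access on predictor misses, and compare it with a plain SRAM tag lookup. All latency values below are hypothetical placeholders, not numbers from the paper.

```python
# Illustrative only: expected tag-lookup time under Tags-on-DRAM vs. Tags-on-SRAM.

def tags_on_dram_lookup(h_pred, t_predictor, t_dram_tag):
    # Predictor hit: cheap predictor access; predictor miss: fall back to tags in DRAM.
    return h_pred * t_predictor + (1 - h_pred) * t_dram_tag

t_sram_tag = 4.0       # ns, hypothetical SRAM tag array access
t_predictor = 1.0      # ns, hypothetical tag predictor access
t_dram_tag = 30.0      # ns, hypothetical tag access inside the DRAM cache

for h_pred in (0.70, 0.80, 0.90, 0.95, 0.99):
    t = tags_on_dram_lookup(h_pred, t_predictor, t_dram_tag)
    print(f"h_pred={h_pred:.2f}: Tags-on-DRAM ~{t:4.1f} ns vs Tags-on-SRAM {t_sram_tag} ns")
```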

SLIDE 44

Insight 2 - Motivation

  • The DRAM Cache gets a very high cache hit rate.
  • The Main Memory remains mostly idle!
  • Cache is congested and memory is free!
  • So we consider whether bypassing some cache hits to main memory would give an overall latency benefit …
  • We extend the ANATOMY-Cache model by accounting for a fraction of requests that bypass the cache (details in the paper); a sketch follows below.
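A hedged sketch of that extension, using a generic M/D/1-style server model as a stand-in for the full ANATOMY-Cache equations; the arrival rate, service times, and the resulting sweet spot are illustrative only.

```python
# Offload a fraction f of requests from a congested DRAM cache to the mostly idle main
# memory and look for the f that minimizes average latency. Illustrative model and values.

def server_latency(lam, service_time):
    rho = lam * service_time
    if rho >= 1.0:
        return float("inf")                            # saturated server
    wait = rho * service_time / (2.0 * (1.0 - rho))    # M/D/1 mean waiting time
    return wait + service_time

def avg_latency_with_bypass(f, lam, s_cache, s_mem):
    l_cache = server_latency(lam * (1 - f), s_cache)   # cache sees the non-bypassed share
    l_mem = server_latency(lam * f, s_mem)             # memory sees the bypassed share
    return (1 - f) * l_cache + f * l_mem

lam, s_cache, s_mem = 0.018, 50.0, 120.0               # req/ns and ns, made-up values
sweep = [(avg_latency_with_bypass(f / 100, lam, s_cache, s_mem), f / 100)
         for f in range(0, 60)]
best_latency, best_f = min(sweep)
print(f"sweet spot near f = {best_f:.2f}, average latency ~ {best_latency:.1f} ns")
```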

SLIDE 45

Insight 2: Cache Bypass/Offload Helps!

[Chart: Avg. LLSC Miss Penalty for a congested workload (where misses are expensive) as the fraction of offloaded requests varies]

There is a sweet spot at which the Avg. LLSC Miss Penalty is minimized.

SLIDE 46

Talk Outline

  • Introduction to stacked DRAM Caches
  • Background (An overview of ANATOMY)
  • ANATOMY-Cache: Modeling Stacked DRAM Cache Organizations

  • Evaluation
  • Insights
  • Conclusions

SLIDE 47

ANATOMY-Cache

  • First Analytical Model of Stacked DRAM Caches
  • Covers both Tags-on-DRAM and Tags-on-SRAM organizations
  • We investigated two insights with the help of the model

SLIDE 48

Thank You!
