SLIDE 1

A Comprehensive Analytical Performance Model of DRAM Caches

Authors: Nagendra Gulur*, Mahesh Mehendale*, and R Govindarajan+
Presented by: Sreepathi Pai§

*Texas Instruments, +Indian Institute of Science §University of Texas, Austin

6th ACM/SPEC International Conference on Performance Engineering, 2015

SLIDE 2

Talk Outline

  • Introduction to stacked DRAM Caches
  • Background (An overview of ANATOMY§)
  • ANATOMY-Cache: Modeling Stacked DRAM Cache Organizations

  • Evaluation
  • Insights
  • Conclusions

§ANATOMY: An Analytical Model of Memory System Performance (published in the 2014 ACM International Conference on Measurement and Modeling of Computer Systems)

SLIDE 3

Stacked DRAM

  • DRAM vertically stacked over the processor die.
  • Stacked DRAMs offer:
– High bandwidth
– High capacity
– Moderately low latency
  • Several proposals to organize this large DRAM as a last-level cache.

Picture courtesy Bryan Black (From MICRO 2013 Keynote)
SLIDE 4

Processor Organization with DRAM Cache

[Diagram: Cores 1..N, each with L1D and L1I, share an L2 (LLSC); a vertically stacked DRAM Cache with a tag predictor (Tag-Pred) and Memory Controller serves hits, while misses go to off-chip Main Memory. MetaData may be kept on SRAM or on DRAM.]
SLIDE 5

Talk Outline

  • Introduction to stacked DRAM Caches
  • Background (An overview of ANATOMY)
  • ANATOMY-Cache: Modeling Stacked DRAM Cache Organizations

  • Evaluation
  • Insights
  • Conclusions

SLIDE 6

Overview of a DRAM-based memory

[Diagram: the Memory Controller drives control, address, and data (read and write operations) to a DIMM composed of ranks and devices; each device contains banks with rows, columns, bank logic, and a row buffer]
SLIDE 7

Basic DRAM Operations

  • ACTIVATE → bring data from the DRAM core into the row-buffer
  • READ/WRITE → perform read/write operations on the contents of the row-buffer
  • PRECHARGE → store data back to the DRAM core (ACTIVATE discharges the capacitors), putting the cells back at a neutral voltage

[Timing diagram: a request stream with row-buffer hits (RD) and misses (PRE, ACT, RD)]

Row-buffer hits (RBH) are faster and consume less power.

Bank-Level Parallelism (BLP):
  • Parallelism improves performance
  • Some switching delays hurt performance
SLIDE 8

ANATOMY – Analytical Model of Memory

Two components

1) Queuing Model of Memory

– Organizational and Technological characteristics
– Workload characteristics used as input

2) Use of Workload Characteristics

– Locality and Parallelism in workload’s memory accesses

SLIDE 9

Analytical Model for Memory System Performance

[Queuing network: requests at arrival rate λ pass through an Address Bus Server, one of N Bank Servers, and a Data Bus Server]

  • Address Bus Server (M/D/1): Service Time = (RBH*1 + (1-RBH)*3) * BUS_CYCLE_TIME
  • Bank Servers 1..N (multiple M/D/1): Service Time = tCL*RBH + (tCL+tPRE+tRCD)*(1-RBH)
  • Data Bus Server (M/D/1): Service Time = Burst_Length * BUS_CYCLE_TIME

Latency = Q_addr + Q_bank + Q_data + 1/µ_addr + 1/µ_bank + 1/µ_data,
where Q = ρ/(2µ(1-ρ)) is the waiting time of an M/D/1 queue with utilization ρ = λ/µ.
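To make the queuing relations concrete, here is a minimal Python sketch of the latency estimate using the formulas above. The function and parameter names (anatomy_latency, lam, n_banks, the DRAM timing arguments) are illustrative rather than the paper's code, and it assumes bank traffic spreads evenly across the N banks.

```python
# Minimal sketch of the ANATOMY latency estimate using M/D/1 queues.
# Names and timings are illustrative; bank traffic is assumed to spread evenly.

def md1_wait(lam, mu):
    """Mean waiting time in an M/D/1 queue: Q = rho / (2 * mu * (1 - rho))."""
    rho = lam / mu
    assert rho < 1.0, "server is saturated"
    return rho / (2.0 * mu * (1.0 - rho))

def anatomy_latency(lam, rbh, n_banks, bus_cycle, t_cl, t_rcd, t_pre, burst_len):
    # Service times, matching the expressions on the slide.
    s_addr = (rbh * 1 + (1 - rbh) * 3) * bus_cycle
    s_bank = t_cl * rbh + (t_cl + t_pre + t_rcd) * (1 - rbh)
    s_data = burst_len * bus_cycle
    mu_addr, mu_bank, mu_data = 1 / s_addr, 1 / s_bank, 1 / s_data
    lam_bank = lam / n_banks           # assumption: requests spread evenly across banks
    return (md1_wait(lam, mu_addr) + s_addr +
            md1_wait(lam_bank, mu_bank) + s_bank +
            md1_wait(lam, mu_data) + s_data)

# Example: 0.02 requests/ns, RBH = 0.6, 8 banks, DDR-like timings in ns.
print(anatomy_latency(lam=0.02, rbh=0.6, n_banks=8, bus_cycle=1.25,
                      t_cl=13.75, t_rcd=13.75, t_pre=13.75, burst_len=4))
```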

SLIDE 10
Validation - Model Accuracy

[Chart: % Error in Latency, RBH, and BLP for workloads E1-E15 and their average]

  • Low Errors in RBH, BLP and Latency Estimation

– Average errors of 3.9%, 4.2%, and 4%

  • ANATOMY predicts trends accurately

SLIDE 11

Talk Outline

  • Introduction to stacked DRAM Caches
  • Background (An overview of ANATOMY)
  • ANATOMY-Cache: Modeling Stacked DRAM Cache Organizations

  • Evaluation
  • Insights
  • Conclusions

SLIDE 12

ANATOMY-Cache Model

Key Parameters that govern performance:

[Diagram: processor with a vertically stacked DRAM cache: L2 (LLSC), Tag-Pred, DRAM Cache, Memory Controller, and off-chip Main Memory; hits are served by the cache, misses go to memory]
SLIDE 13

ANATOMY-Cache Model

Key Parameters that govern performance:

  • Arrival Rate

SLIDE 14

ANATOMY-Cache Model

Key Parameters that govern performance:

  • Arrival Rate
  • Tag access time

SLIDE 15

ANATOMY-Cache Model

Key Parameters that govern performance:

  • Arrival Rate
  • Tag access time
  • Cache hit rate

SLIDE 16

ANATOMY-Cache Model

Key Parameters that govern performance:

  • Arrival Rate
  • Tag access time
  • Cache hit rate
  • Cache RBH

SLIDE 17

ANATOMY-Cache Model

Key Parameters that govern performance:

  • Arrival Rate
  • Tag access time
  • Cache hit rate
  • Cache RBH
  • Cache Miss Penalty

SLIDE 18

Extending ANATOMY to DRAM Caches

  • Two ANATOMY instances: one for the DRAM cache (ANATOMYCache) and one for main memory (ANATOMYMem).
SLIDE 19

Extending ANATOMY to DRAM Caches

  • Two ANATOMY instances: one for the DRAM cache and one for main memory.
  • The models are fed by the output of the tag server and each other's outputs.
SLIDE 20

Extending ANATOMY to DRAM Caches

  • Two ANATOMY instances: one for the DRAM cache and one for main memory.
  • The models are fed by the output of the tag server and each other's outputs:
– Predicted Cache Hits
– No Predictions
SLIDE 21

Extending ANATOMY to DRAM Caches

  • Two ANATOMY instances: one for the DRAM cache and one for main memory.
  • The models are fed by the output of the tag server and each other's outputs:
– Predicted Cache Hits
– No Predictions
– Line fills and writeback requests from main memory
SLIDE 22

Extending ANATOMY to DRAM Caches

  • Two ANATOMY instances: one for the DRAM cache and one for main memory.
  • The models are fed by the output of the tag server and each other's outputs:
– Predicted Cache Hits
– No Predictions
– Line fills and writeback requests from main memory
– Predicted Misses
SLIDE 23

Extending ANATOMY to DRAM Caches

  • Two ANATOMY instances: one for the DRAM cache and one for main memory.
  • The models are fed by the output of the tag server and each other's outputs:
– Predicted Cache Hits
– No Predictions
– Line fills and writeback requests from main memory
– Predicted Misses
– Requests from the Cache to memory (misses, line fills, and writebacks)
SLIDE 24

Extending ANATOMY to DRAM Caches

  • Two ANATOMY instances: one for the DRAM cache and one for main memory.
  • The models are fed by the output of the tag server and each other's outputs.
  • We compute the latencies at the cache (LCache) and at memory (LMem) using ANATOMY.
SLIDE 25

Obtaining the average LLSC miss penalty

  • LCache and LMem are combined to estimate the average LLSC miss penalty.
  • But first we discuss the estimation of the key parameters that govern LCache and LMem.
SLIDE 26

Estimating Key Parameters …

  • Arrival Rate
  • Tag access time
  • Cache hit rate
  • Cache RBH
  • Cache Miss Penalty

SLIDE 27

Estimating the Cache Arrival Rate

  • Arrival Rate at the Cache is a sum of several streams of accesses.
SLIDE 28

Estimating the Cache Arrival Rate

  • Arrival Rate at the Cache is a sum of several streams of accesses.
  • Predicted Hits
SLIDE 29

Estimating the Cache Arrival Rate

  • Arrival Rate at the Cache is a sum of several streams of accesses.
  • Predicted Hits
  • No predictions
SLIDE 30

Estimating the Cache Arrival Rate

  • Arrival Rate at the Cache is a sum of several streams of accesses.
  • Predicted Hits
  • No predictions
  • Line fills and writebacks
SLIDE 31

Summarizing the Cache Arrival Rate

Request Stream | Rate | Notes
Predicted Hits | λ*hpred*hcache |
No predictions | λ*(1-hpred) | Sent to the cache for tag look-up
Line Fills | λ*(1-hcache)*Bs | Bs is the cache block size
Writebacks | λ*(1-hcache)*w | w is the fraction of misses that cause write-backs

λcache = λ*hpred*hcache + λ*(1-hpred) + λ*(1-hcache)*Bs + λ*(1-hcache)*w
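The same composition, written out as a small Python sketch for experimentation; the names (lam, h_pred, h_cache, bs, w) are illustrative, and the line-fill term follows the slide's λ*(1-hcache)*Bs expression verbatim.

```python
# Sketch of the cache arrival-rate composition from the table above (names illustrative).

def cache_arrival_rate(lam, h_pred, h_cache, bs, w):
    predicted_hits = lam * h_pred * h_cache      # predicted hits go straight to the cache
    no_prediction  = lam * (1 - h_pred)          # sent to the cache for tag look-up
    line_fills     = lam * (1 - h_cache) * bs    # Bs: cache block size factor, as on the slide
    writebacks     = lam * (1 - h_cache) * w     # w: fraction of misses causing write-backs
    return predicted_hits + no_prediction + line_fills + writebacks

# Example: lam = 0.01 req/cycle, 90% predictor hit rate, 80% cache hit rate.
print(cache_arrival_rate(lam=0.01, h_pred=0.9, h_cache=0.8, bs=1.0, w=0.3))
```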

SLIDE 32

Estimating Tag Predictor Hit Rate and Access Time

Tags-on-SRAM

  • All tags on SRAM.
  • Hit Rate = 100%

Tags-on-DRAM

  • A small set-associative cache.
  • Hit Rate determined by running an access trace through the cache model.

Predictor access time depends on its size. An estimate is obtained using CACTI.

SLIDE 33

Estimating Cache Hit Rate

  • Cache Hit Rate depends on 3 key parameters:
– Cache Size
– Set Associativity
– Block Size
  • Well-studied problem
– A trace-based model and reuse distance analysis (sketch below).
  • We use a trace of accesses from the LLSC.

[Diagram: LLSC Miss Trace → Reuse Distance Analysis → Hit Rate, given Cache Size and Associativity]
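For readers who want to experiment, here is a self-contained sketch of trace-based stack-distance (reuse-distance) analysis for an LRU cache. It is a generic illustration of the technique, not the paper's exact model (which also accounts for set associativity and block size).

```python
# Illustrative reuse-distance (stack-distance) analysis for a fully associative LRU cache.
from collections import OrderedDict

def stack_distances(trace, block_size=64):
    """Return the LRU stack distance of each access in a byte-address trace."""
    stack = OrderedDict()                 # most-recently-used block is last
    dists = []
    for addr in trace:
        blk = addr // block_size
        if blk in stack:
            # Distance = number of distinct blocks touched since the last access to blk.
            dists.append(list(reversed(stack)).index(blk))
            del stack[blk]
        else:
            dists.append(float("inf"))    # cold miss
        stack[blk] = None                 # move (or insert) blk at the MRU position
    return dists

def lru_hit_rate(trace, cache_bytes, block_size=64):
    """Fully associative LRU approximation: an access hits iff its stack distance < #blocks."""
    n_blocks = cache_bytes // block_size
    d = stack_distances(trace, block_size)
    return sum(1 for x in d if x < n_blocks) / len(d)

# Example on a toy LLSC miss trace of byte addresses.
trace = [0, 64, 128, 0, 64, 4096, 0, 8192, 64]
print(lru_hit_rate(trace, cache_bytes=4 * 64))
```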

SLIDE 34
Estimating Cache Hit Rate with Block Size

  • Larger block sizes can capture spatial locality.
  • Bandwidth-neutral model: cache miss rate halves with doubling of the cache block size.
» If this holds, then measuring the hit rate at the smallest block size via trace-based analysis is sufficient.
» For larger block sizes, the miss rate is estimated by scaling from the smallest block size under this assumption (a sketch follows below).

[Chart: Workload Q5 is bandwidth-neutral]
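A minimal sketch of that scaling rule, assuming the bandwidth-neutral property holds exactly (miss rate halves per doubling of block size); the function name and example numbers are made up, and the paper may use a different expression.

```python
# Sketch of bandwidth-neutral scaling: miss rate halves when the block size doubles.

def scaled_miss_rate(miss_rate_base, base_block, block):
    """Estimate miss rate at a larger block size from the smallest measured one."""
    assert block >= base_block
    return miss_rate_base * (base_block / block)   # halves per doubling of block size

# Example: 20% miss rate measured at 64B blocks.
for b in (64, 128, 256, 512, 1024):
    print(b, round(scaled_miss_rate(0.20, 64, b), 4))
```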

SLIDE 35
Not all workloads are bandwidth-neutral

  • For such workloads, the bandwidth-neutral model leads to a lower miss-rate prediction.
  • Use trace-based cache simulations in such cases.

[Chart: Workload Q22 is NOT bandwidth-neutral]

SLIDE 36

Estimating DRAM Cache Row-Buffer Hit Rate

  • The Row-Buffer Hit rate (RBH) of the DRAM cache depends on the access pattern and the data organization on the DRAM.
  • We estimate RBH using the Reuse-Distance framework, similar to ANATOMY.

  • Details are in the paper.

SLIDE 37

Putting them together …

  • LLSC Miss Penalty is obtained from LCache and LMem:

Event | Latency contribution
Predicted Cache Hit | A = hpred*hcache*LCache
Predicted Cache Miss | B = hpred*(1-hcache)*LMem
No Prediction and Cache Hit | C = (1-hpred)*hcache*LCache
No Prediction and Cache Miss | D = (1-hpred)*(1-hcache)*(LCache+LMem)
LAvg | A + B + C + D
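A small Python sketch of this combination; the parameter names and the example latencies are illustrative, not values from the paper.

```python
# Sketch of the average LLSC miss penalty from the table above (names illustrative).

def avg_llsc_miss_penalty(h_pred, h_cache, l_cache, l_mem):
    a = h_pred * h_cache * l_cache                         # predicted cache hit
    b = h_pred * (1 - h_cache) * l_mem                     # predicted cache miss, goes to memory
    c = (1 - h_pred) * h_cache * l_cache                   # no prediction, cache hit
    d = (1 - h_pred) * (1 - h_cache) * (l_cache + l_mem)   # no prediction, miss pays both
    return a + b + c + d

# Example: 90% predictor accuracy, 80% cache hit rate, LCache = 60 ns, LMem = 150 ns.
print(avg_llsc_miss_penalty(0.9, 0.8, 60.0, 150.0))
```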

SLIDE 38

Talk Outline

  • Introduction to stacked DRAM Caches
  • Background (An overview of ANATOMY)
  • ANATOMY-Cache: Modeling Stacked DRAM Cache Organizations

  • Evaluation
  • Insights
  • Conclusions

SLIDE 39

Experimental Evaluation

  • Validation using GEM5 Simulation (with detailed Memory model)
  • Use of Multiprogrammed workloads
  • Workloads comprising SPEC2000/SPEC2006 benchmarks
  • Architecture Configurations

– 4-core and 8-core
– 128MB (4-core) and 256MB (8-core) DRAM caches
– Cache Memory: 1.6GHz DRAM, 2KB page, 128-bit bus
– DRAM Main Memory: 3.2GHz DRAM, 64-bit bus
– Tags-on-DRAM:

  • Direct Mapped 64B block size
  • Tags and Data on the same DRAM rows
  • Tag Predictor: 2-way set associative tag cache

– Tags-on-SRAM:

  • Block Size: 1024B
  • 2 way set associative

SLIDE 40

Validation of the Tags-on-DRAM Model

  • Low errors in estimation of Avg. LLSC Miss Penalty (10.9% in 4-core and 9.3% in 8-core workloads)

SLIDE 41

Validation of the Tags-on-SRAM Model

  • Low errors in estimation of Avg. LLSC Miss Penalty (10.5% in 4-core and 8.2% in 8-core workloads)

SLIDE 42

Talk Outline

  • Introduction to stacked DRAM Caches
  • Background (An overview of ANATOMY)
  • ANATOMY-Cache: Modeling Stacked DRAM Cache Organizations

  • Evaluation
  • Insights
  • Conclusions

SLIDE 43

Insight 1: It is hard to out-perform Tags-on-SRAM designs

Tag Access Time: [chart]

Requires a high predictor hit rate to beat the Tags-on-SRAM latency for tag lookup.
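One hedged way to read this insight: model the expected tag-lookup cost under Tags-on-DRAM as a mix of a fast predictor access and a slow in-DRAM tag access on predictor misses, and compare it with a plain SRAM tag lookup. All latency values below are hypothetical placeholders, not numbers from the paper.

```python
# Illustrative only: expected tag-lookup time under Tags-on-DRAM vs. Tags-on-SRAM.

def tags_on_dram_lookup(h_pred, t_predictor, t_dram_tag):
    # Predictor hit: cheap predictor access; predictor miss: fall back to tags in DRAM.
    return h_pred * t_predictor + (1 - h_pred) * t_dram_tag

t_sram_tag = 4.0       # ns, hypothetical SRAM tag array access
t_predictor = 1.0      # ns, hypothetical tag predictor access
t_dram_tag = 30.0      # ns, hypothetical tag access inside the DRAM cache

for h_pred in (0.70, 0.80, 0.90, 0.95, 0.99):
    t = tags_on_dram_lookup(h_pred, t_predictor, t_dram_tag)
    print(f"h_pred={h_pred:.2f}: Tags-on-DRAM ~{t:4.1f} ns vs Tags-on-SRAM {t_sram_tag} ns")
```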

SLIDE 44

Insight 2 - Motivation

  • The DRAM Cache gets a very high cache hit rate.
  • The Main Memory remains mostly idle!
  • Cache is congested and memory is free!
  • So we consider whether bypassing some cache hits to main memory would give an overall latency benefit …
  • We extend the ANATOMY-Cache model by accounting for a fraction of requests that bypass the cache (details in the paper); a sketch follows below.
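A hedged sketch of that extension, using a generic M/D/1-style server model as a stand-in for the full ANATOMY-Cache equations; the arrival rate, service times, and the resulting sweet spot are illustrative only.

```python
# Offload a fraction f of requests from a congested DRAM cache to the mostly idle main
# memory and look for the f that minimizes average latency. Illustrative model and values.

def server_latency(lam, service_time):
    rho = lam * service_time
    if rho >= 1.0:
        return float("inf")                            # saturated server
    wait = rho * service_time / (2.0 * (1.0 - rho))    # M/D/1 mean waiting time
    return wait + service_time

def avg_latency_with_bypass(f, lam, s_cache, s_mem):
    l_cache = server_latency(lam * (1 - f), s_cache)   # cache sees the non-bypassed share
    l_mem = server_latency(lam * f, s_mem)             # memory sees the bypassed share
    return (1 - f) * l_cache + f * l_mem

lam, s_cache, s_mem = 0.018, 50.0, 120.0               # req/ns and ns, made-up values
sweep = [(avg_latency_with_bypass(f / 100, lam, s_cache, s_mem), f / 100)
         for f in range(0, 60)]
best_latency, best_f = min(sweep)
print(f"sweet spot near f = {best_f:.2f}, average latency ~ {best_latency:.1f} ns")
```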

SLIDE 45

Insight 2: Cache Bypass/Offload Helps!

[Chart: Avg. LLSC Miss Penalty for a congested workload (where misses are expensive) as the fraction of offloaded requests varies]

There is a sweet spot at which the Avg. LLSC Miss Penalty is minimized.

SLIDE 46

Talk Outline

  • Introduction to stacked DRAM Caches
  • Background (An overview of ANATOMY)
  • ANATOMY-Cache: Modeling Stacked DRAM Cache Organizations

  • Evaluation
  • Insights
  • Conclusions

SLIDE 47

ANATOMY-Cache

  • First Analytical Model of Stacked DRAM Caches
  • Covers both Tags-on-DRAM and Tags-on-SRAM organizations
  • We investigated two insights with the help of the model

SLIDE 48

Thank You!
