

SLIDE 1

Identifying Optimal Multicore Cache Hierarchies for Loop-based Parallel Programs via Reuse Distance Analysis

Meng-Ju Wu and Donald Yeung University of Maryland, College Park

SLIDE 2

Motivation

  • Trend → multicore
  • Understanding memory performance is critical but difficult
  • Large design space and slow simulation

– 3,168 simulations → 3 months on a 74-core cluster

[Diagram: cores with private L1 and L2 caches and an LLC going to off-chip memory; design parameters: problem size, core count, L1 size, L2 size, LLC size, private/shared LLC]

SLIDE 3

Reuse Distance (RD)

  • RD: Depth in LRU stack [Mattson70]
  • Architecture independent
  • Provides the application’s memory performance across all cache sizes

Example: for the reference stream A, B, A, C, the second access to A finds A at depth 1 in the LRU stack (only B intervened), so its RD = 1; the first accesses to A, B, and C have RD = ∞.

[Plots: RD profile (reference count vs. RD) and the derived Cache Miss Count (CMC) profile (miss count vs. cache size); a reference hits in a fully associative LRU cache of C blocks exactly when its RD < C]
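As a minimal sketch (not the authors' tool), the RD and CMC profiles can be computed with a list-based LRU stack; the trace and function names here are illustrative:

```python
from collections import Counter

def rd_profile(trace):
    """RD of each reference = its depth in the LRU stack [Mattson70]."""
    stack = []           # index 0 = most recently used
    profile = Counter()  # reuse distance -> reference count
    for addr in trace:
        if addr in stack:
            rd = stack.index(addr)   # depth at which addr is found
            stack.remove(addr)
        else:
            rd = float('inf')        # cold (first-touch) reference
        stack.insert(0, addr)        # addr becomes most recently used
        profile[rd] += 1
    return profile

def cmc(profile, cache_blocks):
    """Cache Miss Count for a fully associative LRU cache:
    every reference with RD >= capacity misses."""
    return sum(n for rd, n in profile.items() if rd >= cache_blocks)

prof = rd_profile(['A', 'B', 'A', 'C'])   # the slide's example stream
assert prof[1] == 1                        # the reuse of A has RD = 1
print(cmc(prof, 2))                        # 3: only the cold misses remain
```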

SLIDE 4
Multicore Reuse Distance

  • Concurrent Reuse Distance: shared caches [Ding09; Jiang10; Schuff09,10]
  • Private-stack Reuse Distance: private caches [Schuff09,10]
  • Provide the complete picture of the design space

– What is the optimal hierarchy configuration?
– What is the performance impact of different configurations?
– What is the scaling impact?
– … and more

[Diagram: design-space parameters: problem size, core count, L1 size, L2 size, LLC size, private/shared LLC]

SLIDE 5

Outline

  • Motivation
  • Multicore Cache Performance Modeling

– Pin tool
– Benchmarks
– Cache performance models

  • Two Cases

– Scaling impact on private vs. shared caches
– Optimal L2/LLC capacity allocation

  • Conclusions


SLIDE 6
CRD and PRD Profiles

  • Concurrent Reuse Distance (CRD)

– Shared cache
– RD across interleaved memory streams

  • Private-stack Reuse Distance (PRD)

– Private cache
– RD on coherent per-thread LRU stacks

  • Profiling: in-house Pin tool

– Uniform interleaving, 64B cache blocks

[Diagram: per-thread streams (e.g., A B C A B E A and D G C F) feeding one global LRU stack for CRD vs. coherent per-thread stacks for PRD; both produce reference count vs. cache size profiles]
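A sketch of the operational difference between the two profiles, reusing rd_profile from the earlier sketch; uniform interleaving is modeled as a round-robin merge, and invalidation effects on the coherent private stacks are omitted for brevity:

```python
from collections import Counter
from itertools import zip_longest

def uniform_interleave(per_thread_traces):
    """Round-robin merge of per-thread address streams (uniform interleaving)."""
    return [addr
            for group in zip_longest(*per_thread_traces)
            for addr in group if addr is not None]

def crd_profile(per_thread_traces):
    """CRD: reuse distance on the single interleaved stream,
    i.e. the stack depth seen by one shared LRU cache."""
    return rd_profile(uniform_interleave(per_thread_traces))

def prd_profile(per_thread_traces):
    """PRD: reuse distance on each thread's own LRU stack, i.e. the depth
    seen by private caches (coherence invalidations omitted here)."""
    profile = Counter()
    for trace in per_thread_traces:
        profile.update(rd_profile(trace))
    return profile

# Two threads touching overlapping data: interleaving changes the
# distances in the CRD profile relative to PRD.
t0, t1 = ['A', 'B', 'A'], ['B', 'C', 'B']
print(crd_profile([t0, t1]), prd_profile([t0, t1]))
```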

SLIDE 7

Benchmarks


Benchmark      S1       S2       Units
FFT            2^20     2^22     elements
LU             1024^2   2048^2   elements
RADIX          2^22     2^24     keys
Barnes         2^17     2^19     particles
FMM            2^17     2^19     particles
Ocean          514^2    1026^2   grid
Water          25^3     40^3     molecules
KMeans         2^20     2^22     objects
BlackScholes   2^20     2^22     options

  • Two problem sizes

– S1 and S2 → problem size scaling

SLIDE 8

Tile CMP


[Diagram: tiled CMP; each tile contains a core, private L1, private L2, a private LLC or a shared LLC slice, a switch, and a slice of the distributed directory]

  • Scalable Architecture

– 2, 4, 8, 16, 32, 64, 128, 256 cores → core count scaling

  • Three-level cache hierarchy

– L1 and L2 cache: private; LLC: private or shared

Latencies (cycles): L1 = 1, L2 = 4, LLC = 10, DRAM = 200, Hop = 3 × (√(core count) − 1)

[Diagram: four memory controllers at the edges of the mesh]
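For later use in the AMAT sketch, the timing parameters can be written down directly; the exact hop-latency scaling with core count is an assumption:

```python
import math

# Latencies in cycles, from the slide's table.
L1_LAT, L2_LAT, LLC_LAT, DRAM_LAT = 1, 4, 10, 200

def hop_lat(core_count):
    # Assumed scaling: 3 cycles per hop times the mesh radius.
    return 3 * (math.sqrt(core_count) - 1)

print(hop_lat(16))  # 9.0 cycles per network traversal
```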

SLIDE 9

Average Memory Access Time (AMAT) Modeling

  • CRD_CMC and PRD_CMC

– Give the number of cache misses at each cache level: read the profile at the L1, L2, and LLC capacities

[Plot: CRD_CMC and PRD_CMC vs. RD (cache size) for FFT, 16 cores at S1, with the L1, L2, and LLC sizes marked; the curves yield # of references, # of L1 misses, # of L2 misses, and # of shared/private LLC misses]

Shared LLC:

AMAT_S = ( # of references × L1_lat
         + # of L1 misses × L2_lat
         + # of L2 misses × (2 × Hop_lat + LLC_lat)
         + # of LLC misses × (2 × Hop_lat + DRAM_lat) ) / # of references

Private LLC:

AMAT_P = ( # of references × L1_lat
         + # of L1 misses × L2_lat
         + # of L2 misses × LLC_lat
         + # of DIR accesses × (Hop_lat + DIR_lat)
         + # of forwardings × (2 × Hop_lat + LLC_lat)
         + # of LLC misses × (2 × Hop_lat + DRAM_lat) ) / # of references
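A direct transcription of the two formulas, reusing the latency constants and hop_lat from the previous sketch; DIR_lat does not appear in the slide's timing table, so the default below is an assumed value. Miss counts come from reading PRD_CMC (private L1/L2/LLC) and CRD_CMC (shared LLC) at the corresponding capacities:

```python
def amat_shared(refs, l1_miss, l2_miss, llc_miss, hop):
    """AMAT_S: an L2 miss crosses the network to a shared LLC slice and back."""
    return (refs * L1_LAT
            + l1_miss * L2_LAT
            + l2_miss * (2 * hop + LLC_LAT)
            + llc_miss * (2 * hop + DRAM_LAT)) / refs

def amat_private(refs, l1_miss, l2_miss, llc_miss,
                 dir_accesses, forwardings, hop, dir_lat=10):
    """AMAT_P: the private LLC is local, but coherence adds directory
    lookups and forwarded requests. dir_lat is an assumed value."""
    return (refs * L1_LAT
            + l1_miss * L2_LAT
            + l2_miss * LLC_LAT
            + dir_accesses * (hop + dir_lat)
            + forwardings * (2 * hop + LLC_LAT)
            + llc_miss * (2 * hop + DRAM_LAT)) / refs
```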

SLIDE 10

Case 1: Private vs. Shared LLC

  • AMAT_S < AMAT_P when:

– (Shared LLC’s on-chip memory stall − Private LLC’s on-chip memory stall) < (Private LLC’s off-chip memory stall − Shared LLC’s off-chip memory stall)

[Plots: FFT, 16 cores at S1; CRD_CMC and PRD_CMC (# of L2 misses and # of LLC misses read at the LLC size), and AMAT_P vs. AMAT_S over LLC sizes 4MB~72MB with L2 = 16KB]

SLIDE 11

Case 1: Private vs. Shared LLC

  • AMAT_S < AMAT_P when:

– (Shared LLC’s on-chip memory stall − Private LLC’s on-chip memory stall) < (Private LLC’s off-chip memory stall − Shared LLC’s off-chip memory stall)

[Plots: same FFT data, now with AMAT_P and AMAT_S curves for both L2 = 16KB and L2 = 64KB]

SLIDE 12

Case 1: Private vs. Shared LLC

  • AMAT_S < AMAT_P when:

– The L2 size exceeds the high-reference-count region
– The CRD_CMC/PRD_CMC gap is large

[Plots: FFT with L2 = 64KB; the AMAT_S curve stays below AMAT_P across LLC sizes once both conditions hold]
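To see the condition in action, here are the two models from the AMAT slide evaluated with hypothetical counts (all numbers made up): the private LLC's replication inflates off-chip misses more than the shared LLC's extra hops cost.

```python
# Hypothetical 16-core configuration: the CRD/PRD gap at the LLC size
# makes the shared LLC win despite its on-chip hop overhead.
refs, l1m, l2m = 10_000_000, 500_000, 200_000
s = amat_shared(refs, l1m, l2m, llc_miss=20_000, hop=hop_lat(16))
p = amat_private(refs, l1m, l2m, llc_miss=80_000,
                 dir_accesses=50_000, forwardings=10_000, hop=hop_lat(16))
print(s < p)  # True: AMAT_S < AMAT_P when the off-chip gap dominates
```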

SLIDE 13

Private vs. Shared LLC Under Scaling

  • Problem scaling → favors the private LLC
  • Core count scaling → favors the private LLC

[Plots: AMAT_S / AMAT_P ratio across core counts for FFT at S1 and S2; ratio < 1.0 means the shared LLC is better, ratio > 1.0 the private LLC]

SLIDE 14

Case 2: L2/LLC Capacity Allocation

  • Optimal L2/LLC allocation at a fixed on-chip capacity

– L2 size + LLC size = 64MB, with LLC > L2 → LLC size ranges 32~64MB
– Balance on-chip memory stalls against off-chip memory stalls

[Plots: CRD_CMC/PRD_CMC and AMAT_S vs. shared LLC size (32MB~64MB); off-chip memory stall dominates at one end and on-chip memory stall at the other, with the optimum, LLC_size,opt, in between]
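A sketch of the sweep, where amat_at_split(l2_mb, llc_mb) is an assumed caller-supplied function that evaluates one of the AMAT models above from the CMC profiles; the toy lambda in the usage line merely mimics the on-chip vs. off-chip trade-off:

```python
def best_llc_size(amat_at_split, total_mb=64, step_mb=4):
    """Sweep the fixed-budget L2/LLC split (with LLC >= L2, the LLC gets
    32..64 MB of a 64 MB budget) and return the split minimizing AMAT."""
    splits = [(total_mb - llc, llc)
              for llc in range(total_mb // 2, total_mb + 1, step_mb)]
    return min(splits, key=lambda s: amat_at_split(*s))

# Toy usage: on-chip stalls grow and off-chip stalls shrink as the LLC
# grows; the minimum balances the two.
print(best_llc_size(lambda l2, llc: 0.05 * llc + 120 / llc))
```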

SLIDE 15

Case 2: Optimal LLC Size


  • Optimal private LLC size > optimal shared LLC size
  • Core count scaling → optimal LLC size decreases
  • Core count scaling prefers the private LLC

[Plots: FFT at S2 with a 64MB budget; optimal LLC size (35~65MB) vs. core count for private and shared LLCs, and the AMAT difference (0%~8%, around 2%) vs. core count]

AMAT difference = (AMAT_S − AMAT_P) / AMAT_S × 100%  (positive → private LLC is better)
SLIDE 16

Case 2: Importance of Design Options


  • AMAT variation across L2/LLC partitionings

– < 76% for the shared LLC, < 30% for the private LLC

  • AMAT variation between private and shared LLC at LLC_size,opt

– < 11%

[Bar charts: per-benchmark AMAT variation (FFT, LU, RADIX, Barnes, FMM, Ocean, Water, KMeans, BlackScholes, Average) for shared-LLC partitioning, private-LLC partitioning, and private vs. shared at LLC_size,opt]

SLIDE 17

Conclusions

  • Multicore RD Analysis

– Fast and insightful, with a complete picture of the design space

  • Architecture Implications

– Shared LLC can outperform private LLC

  • L2 size > high reference count region
  • CRD_CMC/PRD_CMC gap is large

– The optimal L2/LLC partition balances on-chip and off-chip memory stalls
– L2/LLC size allocation is more important than the private/shared LLC selection

  • Ongoing work

– Reconfigurable caches
– Dynamic on-chip resource management
