Identifying Optimal Multicore Cache Hierarchies for Loop-based Parallel Programs via Reuse Distance Analysis
Meng-Ju Wu and Donald Yeung
Department of Electrical and Computer Engineering, University of Maryland, College Park
Motivation
- Trend: multicore processors
- Understanding memory performance is critical but difficult
- Large design space and slow simulation
– e.g., 3,168 simulations took 3 months on a 74-core cluster
[Figure: tiled multicore with per-core L1 and L2 caches and an LLC connected to off-chip memory. Design parameters: problem size, core count, L1 size, L2 size, LLC size, and private vs. shared LLC.]
Reuse Distance (RD)
- RD: Depth in LRU stack [Mattson70]
- Architecture independent
- Provides an application's memory performance across all cache sizes
[Figure: LRU stack example for the reference stream A, B, A, C; the second access to A finds A at depth 1, so RD = 1. The RD profile (reference count vs. RD) yields the cache miss count (CMC) profile as a function of cache size.]
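To make the stack mechanics concrete, here is a minimal sketch (not the authors' Pin tool) of building an RD profile with an LRU stack; the trace of block addresses is a hypothetical input:

```python
from collections import Counter

def rd_profile(trace):
    """Reuse-distance histogram of an address trace, via an LRU stack
    kept as a list ordered from most- to least-recently used."""
    stack, profile = [], Counter()
    for addr in trace:
        if addr in stack:
            depth = stack.index(addr)  # RD = depth in the LRU stack
            profile[depth] += 1
            stack.pop(depth)
        else:
            profile["inf"] += 1        # cold (first-touch) reference
        stack.insert(0, addr)          # addr becomes most recent
    return profile

# The slide's stream A, B, A, C: the reuse of A is at depth 1, so RD = 1.
print(rd_profile(["A", "B", "A", "C"]))  # Counter({'inf': 3, 1: 1})
```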
Multicore Reuse Distance
- Concurrent Reuse Distance: shared caches [Ding09; Jiang10; Schuff09,10]
- Private-stack Reuse Distance: private caches [Schuff09,10]
- Together they provide the complete picture of the design space
– What is the optimal hierarchy configuration?
– What is the performance impact of different configurations?
– What is the impact of scaling?
– … and more
Outline
- Motivation
- Multicore Cache Performance Modeling
– Pin tool
– Benchmarks
– Cache Performance Models
- Two Cases
– Scaling Impact on Private vs. Shared Caches
– Optimal L2/LLC Capacity Allocation
- Conclusions
CRD and PRD Profiles
- Concurrent Reuse Distance (CRD)
– Shared cache: RD across the interleaved memory streams
- Private-stack Reuse Distance (PRD)
– Private cache: RD on coherent per-thread LRU stacks
- Profiling: in-house Pin tool
– Uniform interleaving, 64B cache blocks
[Figure: CRD is measured on a single LRU stack over the interleaved reference stream, PRD on coherent per-thread LRU stacks; both profiles plot reference count against cache size.]
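As a rough illustration of the uniform-interleaving assumption, the sketch below merges per-thread traces round-robin and measures RD on the merged stream, reusing rd_profile from the earlier sketch; it models CRD only and omits the coherence actions that PRD's per-thread stacks require:

```python
from itertools import chain, zip_longest

def crd_profile(thread_traces):
    """CRD: interleave the per-thread address streams uniformly
    (round-robin) and measure reuse distance on the merged stream."""
    merged = [a for a in chain.from_iterable(zip_longest(*thread_traces))
              if a is not None]
    return rd_profile(merged)

# Two threads that share block A: the inter-thread reuse of A shows up
# at a short distance in the CRD profile.
print(crd_profile([["A", "B", "A"], ["A", "C", "A"]]))
```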
Benchmarks
Benchmark      S1      S2      Units
FFT            2^20    2^22    elements
LU             1024^2  2048^2  elements
RADIX          2^22    2^24    keys
Barnes         2^17    2^19    particles
FMM            2^17    2^19    particles
Ocean          514^2   1026^2  grid points
Water          25^3    40^3    molecules
KMeans         2^20    2^22    objects
BlackScholes   2^20    2^22    options

- Two problem sizes
– S1 and S2 capture problem-size scaling
Tiled CMP
[Figure: tiled CMP. Each tile contains a core, private L1, private L2, and a private LLC or a shared LLC slice, connected by a switch; the directory is distributed across tiles, and memory controllers sit at the chip edges.]
- Scalable Architecture
– 2, 4, 8, 16, 32, 64, 128, and 256 cores: core-count scaling
- Three-level cache hierarchy
– L1 and L2 cache: private; LLC: private or shared
Latency    L1   L2   LLC   DRAM   Hop
Cycles     1    4    10    200    3 × (core count)^(1/2)
Average Memory Access Time (AMAT) Modeling
- CRD_CMC and PRD_CMC profiles
– The number of cache misses at each cache level, read off the profile at the corresponding RD (cache size)
[Figure: CRD_CMC and PRD_CMC (cache miss count, log scale) versus cache size for FFT with 16 cores at S1, with the L1, L2, and LLC sizes marked.]
Shared LLC:
AMAT_S = ( # of references × L1_lat
+ # of L1 misses × L2_lat
+ # of L2 misses × (2 × Hop_lat + LLC_lat)
+ # of LLC misses × (2 × Hop_lat + DRAM_lat) ) / # of references

Private LLC:
AMAT_P = ( # of references × L1_lat
+ # of L1 misses × L2_lat
+ # of L2 misses × LLC_lat
+ # of DIR accesses × (Hop_lat + DIR_lat)
+ # of forwardings × (2 × Hop_lat + LLC_lat)
+ # of LLC misses × (2 × Hop_lat + DRAM_lat) ) / # of references
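As a sanity check on the formulas, here is a minimal sketch of both models in code; the latencies come from the table above, the hop-latency scaling is my reading of that table, and dir_lat is an assumed value that the slide does not give:

```python
import math

def hop_lat(cores):
    """Hop latency from the latency table: 3 x (core count)^(1/2) cycles."""
    return 3 * math.sqrt(cores)

def amat_shared(refs, l1_miss, l2_miss, llc_miss, cores=16,
                l1=1, l2=4, llc=10, dram=200):
    """AMAT_S: the shared LLC is reached over the network (2-hop round trip)."""
    hop = hop_lat(cores)
    return (refs * l1
            + l1_miss * l2
            + l2_miss * (2 * hop + llc)
            + llc_miss * (2 * hop + dram)) / refs

def amat_private(refs, l1_miss, l2_miss, llc_miss, dir_acc, fwd,
                 cores=16, l1=1, l2=4, llc=10, dram=200, dir_lat=10):
    """AMAT_P: private LLC hits are local; misses consult the distributed
    directory and may be forwarded to a remote LLC.
    dir_lat is an assumption (no value appears on the slide)."""
    hop = hop_lat(cores)
    return (refs * l1
            + l1_miss * l2
            + l2_miss * llc
            + dir_acc * (hop + dir_lat)
            + fwd * (2 * hop + llc)
            + llc_miss * (2 * hop + dram)) / refs
```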
Case 1: Private vs. Shared LLC
- AMAT_S < AMAT_P when:
– Shared LLC's on-chip memory stall − private LLC's on-chip memory stall < private LLC's off-chip memory stall − shared LLC's off-chip memory stall
– This holds when the L2 size exceeds the high-reference-count region and the CRD_CMC/PRD_CMC gap is large
[Figure: FFT with 16 cores at S1: CRD_CMC and PRD_CMC versus cache size, and AMAT_P versus AMAT_S across LLC sizes from 4MB to 72MB for L2 sizes of 16KB and 64KB.]
Scaling Impact on Private vs. Shared LLC
- Problem scaling → favors the private LLC
- Core-count scaling → favors the private LLC
[Figure: AMAT_S/AMAT_P ratio across core counts for FFT at S1 and S2; a ratio below 1.0 means the shared LLC is better, above 1.0 the private LLC is better.]
Case 2: L2/LLC Capacity Allocation
- Optimal L2/LLC allocation at a fixed on-chip capacity
– L2 size + LLC size = 64MB, with LLC > L2, so the LLC size ranges from 32MB to 64MB
– Balances on-chip and off-chip memory stalls (see the sketch below)
[Figure: AMAT_S variation versus shared LLC size from 32MB to 64MB; on-chip memory stalls dominate at one extreme of the allocation and off-chip memory stalls at the other, and the minimum marks the shared LLC_size,opt.]
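A minimal sketch of the allocation sweep, assuming amat_shared from the modeling slide and a hypothetical misses_at(size_mb) lookup into the CMC profiles:

```python
def best_llc_size(refs, l1_miss, misses_at, total_mb=64, cores=16):
    """Sweep the fixed 64MB budget: each candidate LLC size leaves
    total_mb - llc_mb to split evenly among the per-core L2s, and
    AMAT_S is evaluated with miss counts read off the CMC profiles."""
    best = None
    for llc_mb in range(total_mb // 2, total_mb + 1):  # keeps LLC > L2
        l2_mb = (total_mb - llc_mb) / cores            # per-core L2 size
        a = amat_shared(refs, l1_miss,
                        l2_miss=misses_at(l2_mb),
                        llc_miss=misses_at(llc_mb),
                        cores=cores)
        if best is None or a < best[1]:
            best = (llc_mb, a)
    return best  # (optimal shared LLC size in MB, its AMAT_S)
```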
Case 2: Optimal LLC Size
- Optimal private LLC size > optimal shared LLC size
- Core-count scaling → optimal LLC size ↓
- Core-count scaling prefers the private LLC
[Figure: FFT at S2 with a 64MB budget. Optimal LLC size (MB) versus core count for private and shared LLCs, and the AMAT difference, (AMAT_S − AMAT_P)/AMAT_S × 100%, versus core count; a positive difference means the private LLC is better.]
Case 2: Importance of Design Options
- AMAT variation for L2/LLC partitioning
– < 76% in shared LLC, < 30% in private LLC
- AMAT variation for private vs. shared LLC at LLC_size,opt
– < 11%
[Figure: bar charts of AMAT variation for FFT, LU, RADIX, Barnes, FMM, Ocean, Water, KMeans, BlackScholes, and the average: for L2/LLC partitioning under shared and private LLCs, and for private vs. shared LLC at LLC_size,opt.]
Conclusions
- Multicore RD Analysis
– Fast and insightful; provides a complete picture of the design space
- Architecture Implications
– Shared LLC can outperform private LLC when:
- L2 size > high-reference-count region
- CRD_CMC/PRD_CMC gap is large
– The optimal L2/LLC partition balances on-chip and off-chip memory stalls
– L2/LLC size allocation is more important than private/shared LLC selection
- Ongoing work
– Reconfigurable caches
– Dynamic on-chip resource management