

  1. Identifying Optimal Multicore Cache Hierarchies for Loop-based Parallel Programs via Reuse Distance Analysis
     Meng-Ju Wu and Donald Yeung
     University of Maryland, College Park
     Department of Electrical and Computer Engineering

  2. Motivation
     • Trend → multicore
     • Understanding memory performance is critical but difficult
     • Large design space and slow simulation
       – 3,168 simulations → 3 months on a 74-core cluster
     [Figure: tiled multicore with per-core private L1 and L2 caches, a private or shared LLC, and off-chip memory; design-space axes: L1 size, L2 size, LLC size, private/shared LLC, problem size, core count]

  3. Reuse Distance (RD)
     • RD: depth in the LRU stack [Mattson70]
     • Architecture independent
     • Provides the application's memory performance across all cache sizes
     [Figure: example reference stream A, B, A, C — the second access to A finds A at LRU-stack depth 1, so RD = 1; the RD profile (reference count vs. RD) converts into a cache miss count (CMC) profile vs. cache size]
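     The deck contains no code, but the LRU-stack measurement on this slide is straightforward to sketch. Below is a minimal Python illustration, not the authors' Pin tool; a production profiler would use a faster data structure than a linear list, and the CMC profile then follows by summing the histogram entries whose RD is at least the cache size in blocks.

```python
from collections import Counter

def rd_profile(trace):
    """Reuse-distance histogram for a single reference stream.

    The RD of an access is the depth of its block in the LRU stack
    (0 = most recently used); first-time accesses count as RD = infinity.
    """
    stack = []              # index 0 holds the most recently used block
    hist = Counter()
    for block in trace:
        if block in stack:
            depth = stack.index(block)      # reuse distance of this access
            hist[depth] += 1
            stack.pop(depth)
        else:
            hist[float("inf")] += 1         # cold (compulsory) reference
        stack.insert(0, block)              # block becomes the new MRU entry
    return hist

# The slide's example: in the stream A B A C, the second A is found at
# LRU-stack depth 1, so its RD is 1.
print(rd_profile(["A", "B", "A", "C"]))     # Counter({inf: 3, 1: 1})
```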

  4. Multicore Reuse Distance
     • Concurrent Reuse Distance: shared caches [Ding09; Jiang10; Schuff09,10]
     • Private-stack Reuse Distance: private caches [Schuff09,10]
     • Provides the complete picture of the design space
       – What is the optimal hierarchy configuration?
       – What is the performance impact of different configurations?
       – What is the scaling impact?
       – … and more
     [Figure: design-space parameters — L1 size, L2 size, LLC size, private/shared LLC, problem size, core count]

  5. Outline
     • Motivation
     • Multicore cache performance modeling
       – Pin tool
       – Benchmarks
       – Cache performance models
     • Two cases
       – Scaling impact on private vs. shared caches
       – Optimal L2/LLC capacity allocation
     • Conclusions

  6. CRD and PRD Profiles
     • Concurrent Reuse Distance (CRD)
       – Shared cache
       – RD across the interleaved memory streams, measured on a single LRU stack
     • Private-stack Reuse Distance (PRD)
       – Private cache
       – RD measured on coherent per-thread LRU stacks
     • Profiling: in-house Pin tool
       – Uniform interleaving, 64 B cache blocks
     [Figure: per-thread reference streams merged into one LRU stack (CRD) vs. kept on separate coherent stacks (PRD), yielding CRD and PRD profiles of reference count vs. cache size]
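     Building on the `rd_profile` sketch above, the two multicore profiles can be illustrated as follows. The uniform interleaving mirrors the slide's profiling methodology; the PRD part is deliberately simplified and ignores coherence effects (invalidations by other threads' writes), which the authors' tool does model.

```python
from collections import Counter
from itertools import zip_longest

def crd_profile(per_thread_traces):
    """CRD sketch: uniformly interleave the threads' reference streams and
    measure reuse distance on a single, shared LRU stack (rd_profile above)."""
    interleaved = [blk
                   for group in zip_longest(*per_thread_traces)
                   for blk in group if blk is not None]
    return rd_profile(interleaved)

def prd_profile(per_thread_traces):
    """PRD sketch: one private LRU stack per thread, histograms summed.
    NOTE: this ignores coherence; the real tool also accounts for writes
    made by other threads when maintaining the per-thread stacks."""
    hist = Counter()
    for trace in per_thread_traces:
        hist.update(rd_profile(trace))
    return hist
```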

  7. Benchmarks
     • Two problem sizes
       – S1 and S2 → problem-size scaling

     Benchmark      S1       S2       Units
     FFT            2^20     2^22     elements
     LU             1024^2   2048^2   elements
     RADIX          2^22     2^24     keys
     Barnes         2^17     2^19     particles
     FMM            2^17     2^19     particles
     Ocean          514^2    1026^2   grid
     Water          25^3     40^3     molecules
     KMeans         2^20     2^22     objects
     BlackScholes   2^20     2^22     options

  8. Tiled CMP
     • Scalable architecture
       – 2, 4, 8, 16, 32, 64, 128, 256 cores → core-count scaling
     • Three-level cache hierarchy
       – L1 and L2 caches: private; LLC: private or shared
     [Figure: tiled CMP — each tile has a core, a private L1, a private L2, an LLC slice (private, or shared and distributed), a switch, and a directory; memory controllers sit at the chip edges]

               L1 lat   L2 lat   LLC lat   DRAM lat   Hop lat
     Cycles    1        4        10        200        3 × (√core_count − 1)

  9. Average Memory Access Time (AMAT) Modeling
     • CRD_CMC and PRD_CMC
       – Give the number of cache misses at each cache level
     [Figure: FFT, 16 cores, S1 — CRD_CMC and PRD_CMC profiles vs. cache size; reading the profiles at the L1, L2, and LLC sizes yields the number of L1 misses, L2 misses, and shared or private LLC misses]

     Private LLC:
       AMAT_P = ( # of references × L1_lat
                + # of L1 misses × L2_lat
                + # of L2 misses × LLC_lat
                + # of DIR accesses × (Hop_lat + DIR_lat)
                + # of forwardings × (2·Hop_lat + LLC_lat)
                + # of LLC misses × (2·Hop_lat + DRAM_lat) ) / # of references

     Shared LLC:
       AMAT_S = ( # of references × L1_lat
                + # of L1 misses × L2_lat
                + # of L2 misses × (2·Hop_lat + LLC_lat)
                + # of LLC misses × (2·Hop_lat + DRAM_lat) ) / # of references
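     A direct transcription of the two formulas into Python is sketched below, using the latency table from the Tiled CMP slide. The miss counts would come from reading the PRD_CMC profile (private L1/L2 levels) and the CRD_CMC or PRD_CMC profile (shared or private LLC) at the configured capacities; the hop-latency formula and the directory latency `dir_lat` are assumptions, since the slide does not state them cleanly.

```python
import math

# Latencies (cycles) from the tiled-CMP table; hop_lat is an assumed reading
# of the slide's network-latency expression.
L1_LAT, L2_LAT, LLC_LAT, DRAM_LAT = 1, 4, 10, 200

def hop_lat(cores):
    return 3 * (math.sqrt(cores) - 1)

def amat_shared(refs, l1_miss, l2_miss, llc_miss, cores):
    """AMAT_S: shared LLC reached over the on-chip network (2 hops round trip)."""
    h = hop_lat(cores)
    cycles = (refs * L1_LAT
              + l1_miss * L2_LAT
              + l2_miss * (2 * h + LLC_LAT)
              + llc_miss * (2 * h + DRAM_LAT))
    return cycles / refs

def amat_private(refs, l1_miss, l2_miss, llc_miss, dir_accesses, forwardings,
                 cores, dir_lat=10):
    """AMAT_P: private LLCs with directory coherence; dir_lat is an assumed value."""
    h = hop_lat(cores)
    cycles = (refs * L1_LAT
              + l1_miss * L2_LAT
              + l2_miss * LLC_LAT
              + dir_accesses * (h + dir_lat)
              + forwardings * (2 * h + LLC_LAT)
              + llc_miss * (2 * h + DRAM_LAT))
    return cycles / refs
```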

  10. Case 1: Private vs. Shared LLC
      • AMAT_S < AMAT_P when
        (shared LLC's on-chip memory stall − private LLC's on-chip memory stall)
          < (private LLC's off-chip memory stall − shared LLC's off-chip memory stall)
      [Figure: FFT, 16 cores, S1 — CRD_CMC/PRD_CMC profiles (L2 and LLC miss counts) and AMAT_P vs. AMAT_S for a 16 KB L2, swept over LLC sizes from 4 MB to 72 MB]
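      The inequality can be encoded directly; this small helper is only a restatement of the slide's condition, with the four stall totals assumed to be computed elsewhere (e.g., from the AMAT breakdowns above).

```python
def shared_llc_wins(on_chip_shared, on_chip_private,
                    off_chip_shared, off_chip_private):
    """Slide's criterion: the shared LLC yields lower AMAT when its extra
    on-chip stall is smaller than the off-chip stall it saves."""
    return (on_chip_shared - on_chip_private) < (off_chip_private - off_chip_shared)
```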

  11. Case 1: Private vs. Shared LLC (continued)
      • Same condition as above: AMAT_S < AMAT_P when the shared LLC's extra on-chip stall is smaller than the off-chip stall it saves
      [Figure: FFT profiles with AMAT_P and AMAT_S for both 16 KB and 64 KB L2s, swept over LLC sizes from 4 MB to 72 MB]

  12. Case 1: Private vs. Shared LLC (continued)
      • AMAT_S < AMAT_P when
        – the L2 size is larger than the high-reference-count region, and
        – the CRD_CMC/PRD_CMC gap is large
      [Figure: FFT profiles with AMAT_P and AMAT_S for a 64 KB L2, swept over LLC sizes from 4 MB to 72 MB]

  13. Private vs. Shared LLC Under Scaling Impact
      • Problem scaling → private LLC
      • Core-count scaling → private LLC
      [Figure: AMAT_P / AMAT_S ratio for FFT at S1 and S2 across core counts; ratio > 1.0 means the shared LLC is better, < 1.0 means the private LLC is better]

  14. Case 2: L2/LLC Capacity Allocation
      • Optimal L2/LLC allocation at a fixed on-chip capacity
        – L2 size + LLC size = 64 MB, LLC > L2 → LLC size ranges from 32 to 64 MB
        – Balance on-chip memory stalls against off-chip memory stalls
      [Figure: CRD_CMC/PRD_CMC profiles and AMAT_S vs. shared LLC size from 32 to 64 MB; on-chip memory stalls dominate at small LLC sizes, off-chip memory stalls dominate at large ones, with the optimal shared LLC size (LLC_size,opt) and the AMAT_S variation marked]
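      A sketch of this sweep, reusing `amat_shared` from the AMAT sketch above: the L2 budget is whatever capacity the LLC does not take, and the profile lookups `l2_miss_at` / `llc_miss_at` as well as the 4 MB step are illustrative assumptions.

```python
def sweep_llc_allocation(refs, l1_miss, l2_miss_at, llc_miss_at, cores,
                         budget_mb=64, step_mb=4):
    """With L2 + LLC capacity fixed at budget_mb and LLC > L2, return the
    shared-LLC size (MB) that minimizes AMAT_S, plus that AMAT_S value.

    l2_miss_at(mb) and llc_miss_at(mb) are hypothetical lookups into the
    PRD_CMC (private L2s) and CRD_CMC (shared LLC) profiles, respectively.
    """
    candidates = range(budget_mb // 2, budget_mb + 1, step_mb)   # 32..64 MB
    return min(
        ((llc_mb, amat_shared(refs, l1_miss,
                              l2_miss_at(budget_mb - llc_mb),    # remaining L2 budget
                              llc_miss_at(llc_mb),
                              cores))
         for llc_mb in candidates),
        key=lambda pair: pair[1])
```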

  15. Case 2: Optimal LLC Size
      • Optimal private LLC size > optimal shared LLC size
      • Core-count scaling → optimal LLC size decreases
      • Core-count scaling prefers the private LLC
      [Figure: FFT, S2, 64 MB budget — left: optimal private and shared LLC sizes (MB) vs. core count; right: AMAT difference (AMAT_S − AMAT_P) / AMAT_S × 100% vs. core count, where positive values mean the private LLC is better]

  16. Case 2: Importance of Design Options
      • AMAT variation for private vs. shared LLC at LLC_size,opt
        – < 11%
      [Figure: per-benchmark AMAT variation — FFT, LU, RADIX, Barnes, FMM, Ocean, Water, KMeans, BlackScholes, Average]
      • AMAT variation for L2/LLC partitioning
        – < 76% with a shared LLC, < 30% with a private LLC
      [Figure: per-benchmark AMAT variation of L2/LLC partitionings, shared-LLC and private-LLC cases]

  17. Conclusions
      • Multicore RD analysis
        – Fast, insightful, and gives a complete picture of the design space
      • Architecture implications
        – A shared LLC can outperform private LLCs when:
          • the L2 size is larger than the high-reference-count region, and
          • the CRD_CMC/PRD_CMC gap is large
        – The optimal L2/LLC partition balances on-chip and off-chip memory stalls
        – L2/LLC size allocation is more important than the private/shared LLC choice
      • Ongoing work
        – Reconfigurable caches
        – Dynamic on-chip resource management
