Identifying Optimal Multicore Cache Hierarchies for Loop-based Parallel Programs via Reuse Distance Analysis
Meng-Ju Wu and Donald Yeung
Department of Electrical and Computer Engineering, University of Maryland, College Park
Motivation
- Trend: multicore processors
- Understanding memory performance is critical but difficult
- Large design space and slow simulation
– e.g., 3,168 simulations took 3 months on a 74-core cluster
[Figure: tiled multicore with per-core L1 and L2 caches and an LLC connected to off-chip memory. Design parameters: problem size, core count, L1 size, L2 size, LLC size, and private vs. shared LLC.]
Reuse Distance (RD)
- RD: Depth in LRU stack [Mattson70]
- Architecture independent
- Provides an application's memory performance across all cache sizes
[Figure: LRU stack example for the reference stream A, B, A, C; the second access to A finds A at depth 1, so RD = 1. The RD profile (reference count vs. RD) yields the cache miss count (CMC) profile as a function of cache size.]
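To make the stack mechanics concrete, here is a minimal sketch (not the authors' Pin tool) of building an RD profile with an LRU stack; the trace of block addresses is a hypothetical input:

```python
from collections import Counter

def rd_profile(trace):
    """Reuse-distance histogram of an address trace, via an LRU stack
    kept as a list ordered from most- to least-recently used."""
    stack, profile = [], Counter()
    for addr in trace:
        if addr in stack:
            depth = stack.index(addr)  # RD = depth in the LRU stack
            profile[depth] += 1
            stack.pop(depth)
        else:
            profile["inf"] += 1        # cold (first-touch) reference
        stack.insert(0, addr)          # addr becomes most recent
    return profile

# The slide's stream A, B, A, C: the reuse of A is at depth 1, so RD = 1.
print(rd_profile(["A", "B", "A", "C"]))  # Counter({'inf': 3, 1: 1})
```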
Multicore Reuse Distance
- Concurrent Reuse Distance: shared caches [Ding09; Jiang10; Schuff09,10]
- Private-stack Reuse Distance: private caches [Schuff09,10]
- Together they provide the complete picture of the design space
– What is the optimal hierarchy configuration?
– What is the performance impact of different configurations?
– What is the impact of scaling?
– … and more
Outline
- Motivation
- Multicore Cache Performance Modeling
– Pin tool
– Benchmarks
– Cache Performance Models
- Two Cases
– Scaling Impact on Private vs. Shared Caches
– Optimal L2/LLC Capacity Allocation
- Conclusions
CRD and PRD Profiles
- Concurrent Reuse Distance (CRD)
– Shared cache: RD across the interleaved memory streams
- Private-stack Reuse Distance (PRD)
– Private cache: RD on coherent per-thread LRU stacks
- Profiling: in-house Pin tool
– Uniform interleaving, 64B cache blocks
[Figure: CRD is measured on a single LRU stack over the interleaved reference stream, PRD on coherent per-thread LRU stacks; both profiles plot reference count against cache size.]
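As a rough illustration of the uniform-interleaving assumption, the sketch below merges per-thread traces round-robin and measures RD on the merged stream, reusing rd_profile from the earlier sketch; it models CRD only and omits the coherence actions that PRD's per-thread stacks require:

```python
from itertools import chain, zip_longest

def crd_profile(thread_traces):
    """CRD: interleave the per-thread address streams uniformly
    (round-robin) and measure reuse distance on the merged stream."""
    merged = [a for a in chain.from_iterable(zip_longest(*thread_traces))
              if a is not None]
    return rd_profile(merged)

# Two threads that share block A: the inter-thread reuse of A shows up
# at a short distance in the CRD profile.
print(crd_profile([["A", "B", "A"], ["A", "C", "A"]]))
```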
Benchmarks
Benchmark      S1      S2      Units
FFT            2^20    2^22    elements
LU             1024^2  2048^2  elements
RADIX          2^22    2^24    keys
Barnes         2^17    2^19    particles
FMM            2^17    2^19    particles
Ocean          514^2   1026^2  grid points
Water          25^3    40^3    molecules
KMeans         2^20    2^22    objects
BlackScholes   2^20    2^22    options

- Two problem sizes
– S1 and S2 capture problem-size scaling
Tiled CMP
[Figure: tiled CMP. Each tile contains a core, private L1, private L2, and a private LLC or a shared LLC slice, connected by a switch; the directory is distributed across tiles, and memory controllers sit at the chip edges.]
- Scalable Architecture
– 2, 4, 8, 16, 32, 64, 128, and 256 cores: core-count scaling
- Three-level cache hierarchy
– L1 and L2 cache: private; LLC: private or shared
Latency    L1   L2   LLC   DRAM   Hop
Cycles     1    4    10    200    3 × (core count)^(1/2)
Average Memory Access Time (AMAT) Modeling
- CRD_CMC and PRD_CMC profiles
– The number of cache misses at each cache level, read off the profile at the corresponding RD (cache size)
[Figure: CRD_CMC and PRD_CMC (cache miss count, log scale) versus cache size for FFT with 16 cores at S1, with the L1, L2, and LLC sizes marked.]
Shared LLC:
AMAT_S = ( # of references × L1_lat
+ # of L1 misses × L2_lat
+ # of L2 misses × (2 × Hop_lat + LLC_lat)
+ # of LLC misses × (2 × Hop_lat + DRAM_lat) ) / # of references

Private LLC:
AMAT_P = ( # of references × L1_lat
+ # of L1 misses × L2_lat
+ # of L2 misses × LLC_lat
+ # of DIR accesses × (Hop_lat + DIR_lat)
+ # of forwardings × (2 × Hop_lat + LLC_lat)
+ # of LLC misses × (2 × Hop_lat + DRAM_lat) ) / # of references
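As a sanity check on the formulas, here is a minimal sketch of both models in code; the latencies come from the table above, the hop-latency scaling is my reading of that table, and dir_lat is an assumed value that the slide does not give:

```python
import math

def hop_lat(cores):
    """Hop latency from the latency table: 3 x (core count)^(1/2) cycles."""
    return 3 * math.sqrt(cores)

def amat_shared(refs, l1_miss, l2_miss, llc_miss, cores=16,
                l1=1, l2=4, llc=10, dram=200):
    """AMAT_S: the shared LLC is reached over the network (2-hop round trip)."""
    hop = hop_lat(cores)
    return (refs * l1
            + l1_miss * l2
            + l2_miss * (2 * hop + llc)
            + llc_miss * (2 * hop + dram)) / refs

def amat_private(refs, l1_miss, l2_miss, llc_miss, dir_acc, fwd,
                 cores=16, l1=1, l2=4, llc=10, dram=200, dir_lat=10):
    """AMAT_P: private LLC hits are local; misses consult the distributed
    directory and may be forwarded to a remote LLC.
    dir_lat is an assumption (no value appears on the slide)."""
    hop = hop_lat(cores)
    return (refs * l1
            + l1_miss * l2
            + l2_miss * llc
            + dir_acc * (hop + dir_lat)
            + fwd * (2 * hop + llc)
            + llc_miss * (2 * hop + dram)) / refs
```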
Case 1: Private vs. Shared LLC
- AMAT_S < AMAT_P when:
– Shared LLC's on-chip memory stall − private LLC's on-chip memory stall < private LLC's off-chip memory stall − shared LLC's off-chip memory stall
– This holds when the L2 size exceeds the high-reference-count region and the CRD_CMC/PRD_CMC gap is large
[Figure: FFT with 16 cores at S1: CRD_CMC and PRD_CMC versus cache size, and AMAT_P versus AMAT_S across LLC sizes from 4MB to 72MB for L2 sizes of 16KB and 64KB.]
Scaling Impact on Private vs. Shared LLC
- Problem scaling → favors the private LLC
- Core-count scaling → favors the private LLC
[Figure: AMAT_S/AMAT_P ratio across core counts for FFT at S1 and S2; a ratio below 1.0 means the shared LLC is better, above 1.0 the private LLC is better.]
Case 2: L2/LLC Capacity Allocation
- Optimal L2/LLC allocation at a fixed on-chip capacity
– L2 size + LLC size = 64MB, with LLC > L2, so the LLC size ranges from 32MB to 64MB
– Balances on-chip and off-chip memory stalls (see the sketch below)
[Figure: AMAT_S variation versus shared LLC size from 32MB to 64MB; on-chip memory stalls dominate at one extreme of the allocation and off-chip memory stalls at the other, and the minimum marks the shared LLC_size,opt.]
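A minimal sketch of the allocation sweep, assuming amat_shared from the modeling slide and a hypothetical misses_at(size_mb) lookup into the CMC profiles:

```python
def best_llc_size(refs, l1_miss, misses_at, total_mb=64, cores=16):
    """Sweep the fixed 64MB budget: each candidate LLC size leaves
    total_mb - llc_mb to split evenly among the per-core L2s, and
    AMAT_S is evaluated with miss counts read off the CMC profiles."""
    best = None
    for llc_mb in range(total_mb // 2, total_mb + 1):  # keeps LLC > L2
        l2_mb = (total_mb - llc_mb) / cores            # per-core L2 size
        a = amat_shared(refs, l1_miss,
                        l2_miss=misses_at(l2_mb),
                        llc_miss=misses_at(llc_mb),
                        cores=cores)
        if best is None or a < best[1]:
            best = (llc_mb, a)
    return best  # (optimal shared LLC size in MB, its AMAT_S)
```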
Case 2: Optimal LLC Size
- Optimal private LLC size > optimal shared LLC size
- Core-count scaling → optimal LLC size ↓
- Core-count scaling prefers the private LLC
[Figure: FFT at S2 with a 64MB budget. Optimal LLC size (MB) versus core count for private and shared LLCs, and the AMAT difference, (AMAT_S − AMAT_P)/AMAT_S × 100%, versus core count; a positive difference means the private LLC is better.]
Case 2: Importance of Design Options
- AMAT variation for L2/LLC partitioning
– < 76% in shared LLC, < 30% in private LLC
- AMAT variation for private vs. shared LLC at LLC_size,opt
– < 11%
[Figure: bar charts of AMAT variation for FFT, LU, RADIX, Barnes, FMM, Ocean, Water, KMeans, BlackScholes, and the average: for L2/LLC partitioning under shared and private LLCs, and for private vs. shared LLC at LLC_size,opt.]
Conclusions
- Multicore RD Analysis
– Fast and insightful; provides a complete picture of the design space
- Architecture Implications
– Shared LLC can outperform private LLC when:
- L2 size > high-reference-count region
- CRD_CMC/PRD_CMC gap is large
– The optimal L2/LLC partition balances on-chip and off-chip memory stalls
– L2/LLC size allocation is more important than private/shared LLC selection
- Ongoing work
– Reconfigurable caches
– Dynamic on-chip resource management