SLIDE 1

Evaluating and Mitigating Bandwidth Bottlenecks Across the Memory Hierarchy in GPUs

ISPASS 2017

25th April

Santa Rosa, California

Saumay Dublish, Vijay Nagarajan, Nigel Topham The University of Edinburgh

SLIDE 2

Multithreading on GPUs

2 Evaluating and Mitigating Bandwidth Bottlenecks Across the Memory Hierarchy in GPUs 25/04/2017

Core Core Core Core

DRAM Kernel

Host CPU to GPU Hardware Scheduler
Cores hide memory latencies with concurrent execution
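The latency-hiding idea above can be sketched with Little's law: the concurrency a core needs is roughly the memory latency times the rate at which it issues memory requests. A minimal illustration (the 400-cycle latency and 1-request-per-cycle issue rate are hypothetical, not from the talk):

```python
def requests_to_hide_latency(mem_latency_cycles, issue_rate_per_cycle):
    """Little's law: in-flight requests needed so the core never waits on memory."""
    return mem_latency_cycles * issue_rate_per_cycle

# A hypothetical 400-cycle roundtrip at 1 memory request per cycle needs
# ~400 requests in flight, i.e. enough concurrent warps to supply them.
print(requests_to_hide_latency(400, 1.0))  # -> 400.0
```

This is why GPUs schedule many warps per core: as long as enough independent requests are in flight, memory latency stays off the critical path.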

SLIDE 3

Multithreading on GPUs

Core Core Core Core

DRAM

Bandwidth Bottleneck

Kernel

Memory-intensive applications

Host CPU to GPU Hardware Scheduler

Latencies grow

Appear in critical path

SLIDE 4

Deeper Memory Hierarchy

Core Core Core Core

DRAM L2 L1 L1 L1 L1

Bandwidth filtering Bandwidth filtering

High cache miss rates (small caches, high multithreading) → Distributed Bandwidth Bottleneck

SLIDE 5

Deeper Memory Hierarchy

Core Core Core Core

DRAM L2 L1 L1 L1 L1

Bandwidth filtering Bandwidth filtering

L2 roundtrip latency: ~300 cycles (2-3x higher)

High cache miss rates (small caches, high multithreading) → Distributed Bandwidth Bottleneck

SLIDE 6

Deeper Memory Hierarchy

Core Core Core Core

DRAM L2 L1 L1 L1 L1

Bandwidth filtering Bandwidth filtering

L2 roundtrip latency: ~200 cycles

Identify and mitigate bottlenecks across the memory hierarchy

SLIDE 7

Goals

  • Characterize: Understand the bandwidth bottlenecks across different levels of the memory hierarchy such as L1, L2 and DRAM
  • Cause: Investigate the architectural causes for congestion
  • Effect: Design-space exploration to evaluate the effect of mitigating congestion
  • Proposal: Use cause-and-effect analysis to present cost-effective configurations of the memory hierarchy

SLIDE 8

Experimental Environment

Platform

  • GPGPU-Sim (v3.2.2)
  • GPUWattch (McPAT)

Benchmark Suites

  • Rodinia
  • Parboil
  • MapReduce
SLIDE 9

Baseline Configuration

  • NVIDIA GTX 480 GPU
  • 15 SMs
  • Private L1 Data Cache (16 KB; 32 MSHRs)
  • Shared L2 Cache (768 KB; 32 MSHRs/bank)
  • L1-L2 Interconnect (Crossbar; 32+32 bytes)
  • DRAM (384-bit bus width)
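For reference, the baseline can be captured as data; the peak-DRAM-bandwidth arithmetic below assumes the GTX 480's ~3696 MT/s effective GDDR5 data rate, which the slide does not state (it gives only the 384-bit bus width):

```python
# Baseline configuration from the slide, as a dict (a sketch, not GPGPU-Sim's format).
baseline = {
    "sms": 15,
    "l1_kb": 16, "l1_mshrs": 32,
    "l2_kb": 768, "l2_mshrs_per_bank": 32,
    "crossbar_flit_bytes": (32, 32),  # request + response
    "dram_bus_bits": 384,
}

def dram_peak_bandwidth_gbs(bus_bits, mega_transfers_per_s):
    # bytes per transfer x transfers per second
    return (bus_bits / 8) * mega_transfers_per_s / 1000

# Assumed 3696 MT/s effective data rate for the GTX 480's GDDR5.
print(dram_peak_bandwidth_gbs(baseline["dram_bus_bits"], 3696))  # -> 177.408 (GB/s)
```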
SLIDE 10

Latency Tolerance

[Figure: performance versus latency curve for memory-intensive benchmarks; the performance plateau marks the latency tolerance, beyond which latency appears in the critical path.]

SLIDE 11

Latency Tolerance

Performance plateau: [120 cycles, 220 cycles]

Ideal L2 access latency / ideal DRAM access latency

Added latencies due to increasing congestion

SLIDE 12

Latency Tolerance

Observations about “baseline memory latencies”:

  1. Baseline memory latencies are critically higher than the performance plateau latencies [120 cycles, 220 cycles].
  2. Baseline memory latencies are critically higher than the ideal access latencies to L2/DRAM.

Far from saturation (theoretically possible): practically possible to improve performance.

SLIDE 13

Infinite Bandwidth

2.37x

SLIDE 14

Infinite Bandwidth

2.37x 1.15x

Significant congestion in the cache hierarchy

SLIDE 15

Understanding Bandwidth Bottleneck

  • While the bandwidth provided decreases in the lower levels of the memory hierarchy, bandwidth demand does not reduce proportionally.
  • This leads to a bandwidth skew between adjacent levels.
  • As a result, requests queue up in the memory hierarchy for long durations, causing congestion.
  • L2 access queues are full for 46% of their usage lifetime.
  • DRAM access queues are full for 39% of their usage lifetime.

Core

DRAM L2 L1

L1 access queue L2 access queue DRAM access queue
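The queueing claim can be reproduced with a toy discrete-time model (all numbers hypothetical, not measured in the talk): when arrival bandwidth exceeds service bandwidth even modestly, the access queue spends almost its entire lifetime full.

```python
def queue_full_fraction(arrival_rate, service_rate, cycles, capacity):
    """Toy model of one access queue: fraction of cycles it is full."""
    occupancy, full_cycles = 0.0, 0
    for _ in range(cycles):
        occupancy += arrival_rate          # requests arriving from the level above
        if occupancy > capacity:           # queue full -> back pressure upstream
            occupancy = capacity
            full_cycles += 1
        occupancy = max(0.0, occupancy - service_rate)  # requests serviced
    return full_cycles / cycles

# Demand only 25% above supply keeps a 32-entry queue full ~99% of the time,
# while demand below supply never fills it.
print(queue_full_fraction(1.25, 1.0, 10_000, 32))  # -> 0.9876
print(queue_full_fraction(0.75, 1.0, 10_000, 32))  # -> 0.0
```

The point matches the slide: a persistent bandwidth skew, not a transient burst, is what keeps the L2 and DRAM queues full for large fractions of their lifetime.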

SLIDE 16

Core

DRAM L2 L1

Causes of congestion

  • Structural Hazards
  • Back Pressure

L1 MSHR  L2 MSHR

SLIDE 17

Causes of congestion

Core

DRAM L2 L1

Structural Hazards

  • Prolonged contention for cache resources such as MSHRs or replaceable cache lines.
  • Pending requests must complete and relinquish the resources.
  • Therefore, new miss requests get serialized, increasing the memory latencies even more.

[Diagram: a MISS finds the L1/L2 MSHRs FULL (structural hazard), leading to high cache hit latencies.]
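A minimal sketch of the MSHR mechanism described above (illustrative, not GPGPU-Sim's implementation): misses hold MSHR entries until their fills return, and once every entry is occupied, further misses must serialize.

```python
class MSHRFile:
    """Toy miss-status holding register file: tracks outstanding cache misses."""
    def __init__(self, entries):
        self.entries = entries
        self.outstanding = set()

    def on_miss(self, line_addr):
        """True if the miss can proceed; False if it stalls (structural hazard)."""
        if line_addr in self.outstanding:   # secondary miss merges into an entry
            return True
        if len(self.outstanding) >= self.entries:
            return False                    # all entries held by pending requests
        self.outstanding.add(line_addr)
        return True

    def on_fill(self, line_addr):
        self.outstanding.discard(line_addr)  # request completed; entry relinquished

mshr = MSHRFile(entries=2)
assert mshr.on_miss(0x100) and mshr.on_miss(0x200)
assert not mshr.on_miss(0x300)  # full: the new miss serializes behind older ones
mshr.on_fill(0x100)
assert mshr.on_miss(0x300)      # an entry freed up, so the miss can now proceed
```

When fills are slow to return (congestion below), entries stay held longer, so the structural hazard appears more often: latency and bandwidth problems feed each other.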
SLIDE 18

Core

DRAM L2 L1

Cascading effect of structural hazards

  • Higher level gets throttled (L2 MSHRs full → L1 MSHRs full)
  • Eventually throttles core performance: cores STALL, restricting parallelism even for independent compute
SLIDE 19

L1 cache stalls

L1 MSHR  L2 MSHR  11%  41%

Causes of congestion

  • Structural Hazards
  • Back Pressure
SLIDE 20

L1 cache stalls

Major causes of stalls at L1:

  1. L1 MSHR: 41% (Structural Hazards)
  2. L2 back pressure: 48% (Back Pressure)

SLIDE 21

L2 cache stalls

Major causes of stalls at L2:

  1. Crossbar (response path): 42% (Back Pressure)
  2. DRAM: 35% (Back Pressure)
SLIDE 22

Mitigating congestion

Core

DRAM L2 L1

Classifying the Design Space

  • Category 1: Operate at peak throughput. Minimize stalls by exploiting the existing peak throughput (e.g. MSHRs, access queue size).
  • Category 2: Increase peak throughput. Minimize stalls by increasing the peak throughput (e.g. crossbar flit size, DRAM bus width).
SLIDE 23

Identifying the Design Space

Core

DRAM L2 L1

L1 MSHR L2 MSHR

L1 parameters

  • L1 Miss Queue
  • L1 MSHR
  • Memory pipeline width

L2 parameters

  • L2 Miss/Response Queue
  • L2 MSHR
  • L2 Data Port Width
  • L2 Banks
  • Flit Size (Crossbar)

DRAM parameters

  • Scheduler Queue
  • Banks
  • Bus width
SLIDE 24

Mitigating congestion

Scaling L1 parameters by 4x: 4% speedup

SLIDE 25

Mitigating congestion

Scaling L1 parameters by 4x: 4% speedup

[Chart values: 33%, 25%, 13%, 7%]

Improving bandwidth in isolation can lead to even more congestion at the lower levels

SLIDE 26

Mitigating congestion

Core frequency scaling on a real GTX 480: up to 23% slowdown

Improving bandwidth in isolation can lead to even more congestion at the lower levels

SLIDE 27

Mitigating congestion

Scaling L2 parameters by 4x: 59% speedup

Shows the criticality of the L2 bandwidth

SLIDE 28

Mitigating congestion

Scaling DRAM parameters by 4x (HBM): 11% speedup

SLIDE 29

Mitigating congestion

Scaling L1 and L2 parameters by 4x: 69% speedup (vs 4% for L1 alone and 59% for L2 alone)

[Chart values: 13%, 212%, 226%]

A case for synergistic scaling!

SLIDE 30

Mitigating congestion

Scaling L1 and L2 parameters by 4x: 69% speedup, vs 11% for DRAM scaling

Higher speedup on mitigating congestion in the cache hierarchy compared to DRAM (as done in HBM)

SLIDE 31

Mitigating congestion

Scaling L2 and DRAM parameters by 4x: 76% speedup

SLIDE 32

Mitigating congestion

Scaling the entire memory hierarchy by 4x: 90% speedup

SLIDE 33

Pruning the Design Space

  • Scaling all architectural parameters by 4x is impractical; need to prune the design space.
  • We now know the causes of congestion (at each memory level) and the effects of reducing congestion (at different memory levels).

Cost-effective configuration: mitigate causes where the effect is maximum. Boost bandwidth resources where it hurts most!

SLIDE 34

Cost-effective Design Space

L1 parameters

  • L1 Miss Queue
  • L1 MSHR
  • Memory pipeline width

L2 parameters

  • L2 Miss/Response Queue
  • L2 MSHR
  • L2 Data Port Width
  • L2 Banks
  • Flit Size (Crossbar)

DRAM parameters

  • Scheduler Queue
  • Banks
  • Bus width
SLIDE 35

Cost-effective Design Space

L1 parameters

  • L1 Miss Queue
  • L1 MSHR
  • Memory pipeline width

L2 parameters

  • L2 Miss/Response Queue
  • Flit Size (Crossbar)

Simple buffers: minimal cost of scaling → scale by 4x

SLIDE 36

Cost-effective design-space

L1 parameters

  • L1 Miss Queue
  • L1 MSHR
  • Memory pipeline width

L2 parameters

  • L2 Miss/Response Queue
  • Flit Size (Crossbar)

Fully associative array (MSHRs): moderate cost of scaling → scale by 1.5x

SLIDE 37

Cost-effective design-space

L1 parameters

  • L1 Miss Queue
  • L1 MSHR
  • Memory pipeline width

L2 parameters

  • L2 Miss/Response Queue
  • Flit Size (Crossbar)

32+32 baseline crossbar: cost scales quadratically with flit size → use an “Asymmetric Crossbar”

SLIDE 38

Asymmetric Crossbar

Symmetric crossbar (baseline): 32+32 = 64 bytes (32-byte request flit, 32-byte response flit). No wiring overhead.

Asymmetric crossbar: 16+48 = 64 bytes (16-byte request flit, 48-byte response flit). No wiring overhead.

Wider variants: 16+68 / 32+52 = 84 bytes. Wiring overhead of 20 bytes.

Point-to-point wiring (bytes): control (8 bytes), cache line (128 bytes). Reads >> Writes.
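The response-flit arithmetic behind these choices (cache-line and flit sizes from the slide): a 128-byte line returned over the crossbar needs ceil(128 / flit) response flits, so widening the response side directly cuts serialization on the read-dominated return path.

```python
from math import ceil

LINE_BYTES = 128  # cache line size from the slide

def response_flits_per_line(response_flit_bytes):
    """Flits needed to return one cache line over the crossbar."""
    return ceil(LINE_BYTES / response_flit_bytes)

print(response_flits_per_line(32))  # symmetric 32+32 baseline -> 4 flits/line
print(response_flits_per_line(48))  # asymmetric 16+48         -> 3 flits/line
print(response_flits_per_line(68))  # asymmetric 16+68         -> 2 flits/line
```

Since reads far outnumber writes, shrinking the request flit to 16 bytes costs little, while the wider response flit shortens every cache-line return.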

SLIDE 39

Cost-effective Design Space: Summary

L1 Cache

  • L1 Miss Queue: 8 entries → 32 entries
  • Memory pipeline width: 10 wide → 40 wide
  • L1 MSHR: 32 entries → 48 entries

L2 Cache

  • L2 Miss/Response Queue: 8 entries → 32 entries
  • Flit Size (Crossbar): 32+32 → 16+48 (=64); 16+68 / 32+52 (=84)

Evaluate 3 cost-effective configurations: 16+48, 16+68, 32+52

SLIDE 40

Results

Cost-effective configurations: 23% speedup

Area overhead: 1.1%; point-to-point wires remain the same as baseline

SLIDE 41

Results

Cost-effective configurations: 25% and 29% speedup; area overhead: 1.6%

Investing in the response path gives better returns

SLIDE 42

Results

Cost-effective configurations: 25%, 25%, 29% speedup (vs 11% for DRAM scaling)

Higher speedup on resolving the bandwidth bottleneck in the cache hierarchy

Configuration with synergistic scaling (of L1 and L2) and asymmetric crossbar with higher response bandwidth (16+68) performs best

SLIDE 43

Conclusion

Problem

  • High congestion across the memory hierarchy
  • Congestion leads to high memory latencies (both to L2 and DRAM)
  • High latencies appear in the critical path for memory-intensive applications, causing slowdown

Observation

  • Characterize stalls and develop insights about the bandwidth bottleneck
  • Significant bandwidth bottleneck in the cache hierarchy
  • Addressing the bandwidth problem in isolation can even lead to slowdown

Proposal

  • Synergistic scaling of bandwidth of L1 and L2 cache
  • Asymmetric scaling of bandwidth of crossbar
  • 23% speedup with 1.1% area overhead (no additional wires in crossbar)
  • 29% speedup with 1.6% area overhead (additional wiring in crossbar)
SLIDE 44

Questions?

Saumay Dublish saumay.dublish@ed.ac.uk http://homepages.inf.ed.ac.uk/s1433370/