
Poise: Balancing Thread-Level Parallelism and Memory System Performance in GPUs using Machine Learning
Saumay Dublish*, Vijay Nagarajan‡, Nigel Topham‡
* Synopsys Inc. ‡ The University of Edinburgh
HPCA 2019, Washington D.C., USA, 19th February 2019

GPU Architecture Overview
• GPUs are throughput-oriented systems
• Focus on overall system throughput
• Rely on high levels of multithreading
• Implemented by switching across warps
• Overlap latency with useful execution
[Figure: memory hierarchy with several SMs, each with its own L1 cache, sharing an L2 cache and DRAM]

GPU Architecture: Consequence of increasing TLP
• Increasing TLP is not always useful
• Leads to cache thrashing
• Leads to bandwidth bottlenecks
• Results in high levels of congestion
• Latencies tend to be very high!
Can such high latencies be hidden?
[Figure: memory hierarchy with several SMs, each with its own L1 cache, sharing an L2 cache and DRAM]

Hiding Latencies in GPUs: Harnessing concurrency
• Instruction concurrency (intra-warp concurrency)
• Warp concurrency (inter-warp concurrency)
• Works well in compute-intensive applications
[Figure: execution timelines. Within a warp, independent instructions issued after a LOAD overlap the load latency until a DEPENDENCY is reached; across warps, the LOADs and independent instructions of other warps fill the remaining load latency]

The Case of Limited Parallelism: Fewer independent operations
• Higher load latency due to congestion
• Impractically large number of warps required to completely hide latency
[Figure: execution timelines for instruction (intra-warp) and warp (inter-warp) concurrency with fewer independent instructions after each LOAD; many more warps are stacked to cover the longer load latency]
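The point about needing an impractically large number of warps can be illustrated with a back-of-the-envelope model. This is only a sketch under assumptions that are not in the slides: each warp issues one load, then executes a fixed number of single-cycle independent instructions before stalling on the dependency, and the scheduler switches warps at no cost. The function name and all cycle counts are hypothetical.

```python
def warps_to_hide_latency(load_latency_cycles: int, independent_cycles: int) -> int:
    """Warps needed so that some warp always has work while loads are in flight.

    Toy model (assumption): each warp contributes `independent_cycles` of
    useful work per load, so ceil(latency / work) warps cover the latency.
    """
    if load_latency_cycles <= 0 or independent_cycles <= 0:
        raise ValueError("cycle counts must be positive")
    # Integer ceiling division, with at least one warp.
    return max(1, -(-load_latency_cycles // independent_cycles))


# Illustrative numbers only, not measurements from the talk.
compute_intensive = warps_to_hide_latency(400, 20)   # plenty of independent work
memory_intensive = warps_to_hide_latency(400, 2)     # little independent work
congested = warps_to_hide_latency(800, 2)            # latency inflated by congestion
print(compute_intensive, memory_intensive, congested)
```

In this model, halving the independent work per load or doubling the latency both double the number of warps required, which is why limited parallelism and congestion compound each other.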

Need For Balance: Tension between TLP and memory system performance
• Increase TLP to improve concurrency, but latency worsens (✓ concurrency, ☓ memory performance)
• Reduce TLP to reduce latency, but concurrency worsens (☓ concurrency, ✓ memory performance)
• Optimal system throughput with balanced TLP and memory performance (✓ concurrency, ✓ memory performance)
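The tension can be made concrete with a toy model in which load latency grows with the number of active warps. The quadratic congestion term, the constants, and the function names below are assumptions chosen purely for illustration; they are not taken from the talk and do not describe real hardware.

```python
def load_latency(num_warps: int, base_latency: int = 400, congestion: int = 1) -> int:
    """Assumed latency model: a fixed base plus a congestion term that grows with TLP."""
    return base_latency + congestion * num_warps ** 2


def issue_utilization(num_warps: int, work_per_warp: int = 4) -> float:
    """Fraction of cycles with useful work: total independent work over load latency."""
    return min(1.0, num_warps * work_per_warp / load_latency(num_warps))


# Sweep the degree of multithreading and pick the best point of the toy model.
best_tlp = max(range(1, 49), key=issue_utilization)
print(best_tlp, round(issue_utilization(best_tlp), 3))
```

With these assumed constants, utilization rises while extra warps add concurrency, then falls once the congestion term dominates, so the best point lies strictly between the minimum and maximum TLP.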

Outline
• Problem Statement: Balancing TLP and memory performance
• Prior state-of-the-art: CCWS and PCAL warp schedulers
• Pitfalls in prior techniques: Iterative search, prone to local optima
• Goals: Computing the best warp scheduling decisions
• Proposal: Poise
• Results: Experimental results
• Conclusion: Key takeaways

Prior state-of-the-art
[Figure: many warps sharing the L1 cache, leading to cache thrashing and memory congestion]

Prior state-of-the-art: Cache-conscious wavefront scheduling (CCWS)
• Limits the degree of multithreading
• Reduces cache thrashing
• Relieves congestion
Shortcomings
• Restricted coupling of warps with cache performance
• Underutilization of shared memory resources
• Dynamic policy has significant performance and cost overheads
• Static policy burdens the user with the task of profiling every workload
[Figure: warps and the L1 cache, with some warps blocked (☓)]

Prior state-of-the-art: Priority-based cache allocation (PCAL)
• Alters parallelism independently of memory system performance
• Vital warps (N), e.g. W1, W2, W3: determine the degree of multithreading
• Cache-polluting warps (p), e.g. W1, W2: a subset of the vital warps with the ability to allocate and evict lines in the L1 cache; reduce cache contention
• Warp-tuple: {N, p}
[Figure: warps and the L1 cache; warp-tuple space with vital warps on one axis and cache-polluting warps on the other]
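A warp-tuple {N, p} can be sketched as a small data structure. The class and method names, the bound 1 <= p <= N, the assumption that warps are indexed from 0 in priority order, and the limit of 24 warps are all hypothetical choices for illustration; the slides state only that the cache-polluting warps are a subset of the vital warps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WarpTuple:
    n_vital: int       # N: warps allowed to be scheduled
    p_polluting: int   # p: vital warps allowed to allocate and evict in L1

    def __post_init__(self) -> None:
        # Assumed constraint: at least one polluting warp, never more than N.
        if not 1 <= self.p_polluting <= self.n_vital:
            raise ValueError("require 1 <= p <= N")

    def may_schedule(self, warp_index: int) -> bool:
        """Assumes warps are indexed from 0 in priority order."""
        return 0 <= warp_index < self.n_vital

    def may_allocate_l1(self, warp_index: int) -> bool:
        return 0 <= warp_index < self.p_polluting


def search_space(max_warps: int) -> list[WarpTuple]:
    """Every valid warp-tuple for a given (assumed) hardware warp limit."""
    return [WarpTuple(n, p) for n in range(1, max_warps + 1) for p in range(1, n + 1)]


print(len(search_space(24)))
```

Under these assumptions the number of candidate tuples grows quadratically with the warp limit, which gives a sense of why sampling every tuple at run time is unattractive.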

Limitations of PCAL
• Heuristic-based iterative searches are slow in hardware
• Prone to local optima in the presence of multiple performance peaks
• These two limitations lead to sub-optimal solutions
[Figure: performance over the warp-tuple space (vital warps vs. cache-polluting warps), with a local optimum marked]
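The local-optimum problem can be reproduced on a synthetic performance surface with two peaks. The sketch below uses plain greedy hill climbing as a stand-in for an iterative heuristic search; it is not PCAL's actual search procedure, and the surface, the starting point, and the limit of 24 warps are invented for illustration.

```python
MAX_WARPS = 24  # assumed hardware limit


def perf(t: tuple[int, int]) -> int:
    """Synthetic surface: a low peak at (4, 2) and a higher peak at (20, 10)."""
    n, p = t
    low_peak = 10 - 2 * (abs(n - 4) + abs(p - 2))
    high_peak = 15 - 2 * (abs(n - 20) + abs(p - 10))
    return max(low_peak, high_peak)


def hill_climb(start: tuple[int, int]) -> tuple[int, int]:
    """Greedy search: move to the best neighbour until none improves."""
    current = start
    while True:
        n, p = current
        neighbours = [
            (n + dn, p + dp)
            for dn, dp in ((1, 0), (-1, 0), (0, 1), (0, -1))
            if 1 <= p + dp <= n + dn <= MAX_WARPS
        ]
        best = max(neighbours, key=perf)
        if perf(best) <= perf(current):
            return current
        current = best


found = hill_climb((1, 1))
optimum = max(
    ((n, p) for n in range(1, MAX_WARPS + 1) for p in range(1, n + 1)), key=perf
)
print(found, perf(found), optimum, perf(optimum))
```

Starting from a low warp count, the greedy search stops at the nearer, lower peak, while the exhaustive sweep that finds the higher peak has to evaluate every tuple in the space.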

Goals: How to find the best warp-tuple?
• Balance TLP and memory performance
• Avoid local optima
• Converge expeditiously
• Low sampling and hardware overhead
• Avoid burdening the user
[Figure: warp-tuple space (vital warps vs. cache-polluting warps), with the best warp-tuple marked]

Proposal: Poise
A technique to dynamically balance TLP and memory system performance
• Machine Learning Framework (supervised learning): profiled kernels form the training dataset, with the feature set as sample input and the best warp-tuple as sample output; training the regression model yields feature weights
• Feature weights are passed on via the compiler
• Hardware Inference Engine (runtime prediction): for an unseen user application, the prediction stage turns the runtime input into the Poise prediction, and a local search then arrives at the best warp-tuple
[Figure: Poise: A System Overview]
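The two runtime steps named in the overview, a regression-based prediction followed by a local search, might be sketched as follows. Everything concrete here is an assumption: the slide does not list the features, the weights, the model form, or the search radius, so the linear model, the two-element feature vector, and the stand-in performance function are hypothetical placeholders rather than Poise's actual design.

```python
MAX_WARPS = 24  # assumed hardware limit


def predict(features, weights, bias, lo, hi):
    """Linear model output, rounded and clamped to a legal warp count."""
    raw = bias + sum(f * w for f, w in zip(features, weights))
    return min(hi, max(lo, round(raw)))


def predict_warp_tuple(features, model):
    """Predict N first, then p constrained to 1 <= p <= N."""
    n = predict(features, model["w_n"], model["b_n"], 1, MAX_WARPS)
    p = predict(features, model["w_p"], model["b_p"], 1, n)
    return n, p


def local_search(center, measure, radius=2):
    """Sample only tuples near the prediction and keep the best one."""
    n0, p0 = center
    candidates = [
        (n, p)
        for n in range(n0 - radius, n0 + radius + 1)
        for p in range(p0 - radius, p0 + radius + 1)
        if 1 <= p <= n <= MAX_WARPS
    ]
    return max(candidates, key=measure)


# Hypothetical weights (as if produced offline) and hypothetical runtime features.
model = {"w_n": [4.0, 8.0], "b_n": 1.0, "w_p": [2.0, 3.0], "b_p": 2.0}
features = [0.5, 2.0]


def measured_perf(t):
    """Stand-in for sampled performance, peaking at (20, 10)."""
    return -(abs(t[0] - 20) + abs(t[1] - 10))


predicted = predict_warp_tuple(features, model)
best = local_search(predicted, measured_perf)
print(predicted, best)
```

The design point this illustrates is that a prediction landing near the right region lets the search sample a small window instead of the whole warp-tuple space.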
