
SLIDE 1

ACACES 2018 Summer School GPU Architectures: Basic to Advanced Concepts

Adwait Jog, Assistant Professor, College of William & Mary (http://adwaitjog.github.io/)

SLIDE 2

Course Outline

❑ Lectures 1 and 2: Basic Concepts

  • Basics of GPU Programming
  • Basics of GPU Architecture

❑ Lecture 3: GPU Performance Bottlenecks

  • Memory Bottlenecks
  • Compute Bottlenecks
  • Possible Software and Hardware Solutions

❑ Lecture 4: GPU Security Concerns

  • Timing channels
  • Possible Software and Hardware Solutions
SLIDE 3

Key GPU Performance Concerns

Memory Concerns: Data transfers between SMs and global memory are costly.

Compute Concerns: Threads that do not take the same control path lead to serialization in the GPU compute pipeline.

[Figure: execution timeline of one warp through blocks A–E (threads T1–T4 split between C and D), plus a GPU (device) block diagram: SMs with scratchpad, registers, and local memory, all contending on the link to GPU global memory, marked "Bottleneck!"]

SLIDE 4

We need intelligent hardware solutions!

❑ Re-writing software to use "shared memory" and avoid un-coalesced global accesses is difficult for the GPU programmer (see the sketch below).

❑ Recent GPUs introduce hardware-managed caches (L1/L2), but the large number of threads leads to thrashing.

❑ General-purpose code, now being ported to GPUs, has branches and irregular accesses. It is not always possible to fix them in the code.

Reducing Off-Chip Access
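As a concrete instance of the software rewrite mentioned above, here is a minimal CUDA sketch (hypothetical kernels, not from the lecture) contrasting a strided, un-coalesced access pattern with a version where consecutive threads touch consecutive addresses and a tile is staged in on-chip shared memory:

    // Un-coalesced: consecutive threads read addresses 32 floats apart,
    // so one warp's loads are scattered across many DRAM bursts.
    __global__ void gather_strided(float *out, const float *in, int n) {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        int i = t * 32;
        if (i < n) out[t] = in[i];
    }

    // Coalesced + shared memory: consecutive threads read consecutive
    // addresses (one burst per warp), and the tile lives on-chip so any
    // reuse within the block never goes back to global memory.
    __global__ void gather_tiled(float *out, const float *in, int n) {
        __shared__ float tile[256];              // assumes blockDim.x == 256
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) tile[threadIdx.x] = in[i];
        __syncthreads();
        if (i < n) out[i] = tile[threadIdx.x];
    }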

SLIDE 5

I) Alleviating the Memory Bottlenecks

[Figure: GPU (device) diagram: SMs with per-SM L1 caches and a shared L2 cache in front of GPU global memory; both the L1s and the global-memory link are marked "Bottleneck!"]

– Memory concerns: Thousands of threads are running; n SMs all need data from DRAM, but DRAM bandwidth is limited, and increasing it is very costly.

– Q1. How can we use caches effectively to reduce the bandwidth demand?
– Q2. How can we apply data compression effectively to reduce data consumption?
– Q3. How can we effectively/fairly allocate memory bandwidth across concurrent streams/apps?

SLIDE 6

Quantifying Memory Bottlenecks

Percentage of total execution cycles wasted waiting for data to come back from memory.

[Chart: per-application stall percentage (0–100%) for the HPC applications SAD, PVC, SSC, BFS, MUM, CFD, KMN, SCP, FWT, IIX, SPMV, JPEG, BFSR, SC, FFT, SD2, WP, PVR, BP, CON, AES, SD1, BLK, HS, SLA, DN, LPS, NN, PFN, LYTE, LUD, MM, STO, CP, NQU, CUTP, HW, TPAF; average 32%, with individual applications up to about 55%.] [Jog et al., ASPLOS 2013]

SLIDE 7

Strategies

❑ Cache-Aware Warp Scheduling Techniques

  • Effective caching → Less Pressure on Memory

❑ Employing Assist Warps for Helping Data Compression

  • Bandwidth Preserved

❑ Bandwidth Allocation Strategies for Multi-Application Execution on GPUs

  • Better System Throughput and Fairness
SLIDE 8

Application-Architecture Co-Design

❑ Architecture: GPUs typically employ smaller caches compared to CPUs.

❑ Scheduler: Many warps concurrently access the small caches in a round-robin manner, leading to thrashing.

SLIDE 9

Cache-Aware Scheduling

❑ Philosophy: "One work at a time"
❑ Working (sketched in code below):

  • Select a "group" (work) of warps
  • Always prioritize it over other groups
  • Group switching is not round-robin

❑ Benefits:

  • Preserves locality
  • Fewer cache misses
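The selection logic can be written in a few lines of host-side C++ (a simplification for illustration, not the hardware implementation): round-robin rotates across every ready warp, while the cache-aware policy keeps issuing from the prioritized group and switches only when that group has no ready warp (e.g., all of its warps are waiting on memory):

    #include <vector>

    struct Warp { int group; bool ready; };

    // Round-robin: rotate through all warps, regardless of group.
    int pick_round_robin(const std::vector<Warp>& w, int& next) {
        for (int k = 0; k < (int)w.size(); ++k) {
            int i = (next + k) % (int)w.size();
            if (w[i].ready) { next = (i + 1) % (int)w.size(); return i; }
        }
        return -1;  // nothing ready this cycle
    }

    // Cache-aware: stick with the prioritized group; switch groups only
    // when no warp in it is ready, so few groups share the L1 at a time.
    int pick_cache_aware(const std::vector<Warp>& w, int& grp, int ngroups) {
        for (int g = 0; g < ngroups; ++g) {
            int cur = (grp + g) % ngroups;
            for (int i = 0; i < (int)w.size(); ++i)
                if (w[i].group == cur && w[i].ready) { grp = cur; return i; }
        }
        return -1;  // nothing ready this cycle
    }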
SLIDE 10

Improve L1 Cache Hit Rates

[Figure: warps (W) organized into Grp.1–Grp.4. In round-robin order, the groups cycle Grp.1, Grp.2, Grp.3, Grp.4; when Grp.1's data arrives there is no prioritization, and 4 groups touch the cache within time T. In cache-aware order, Grp.1 is prioritized as soon as its data arrives, so only 3 groups touch the cache within time T.]

Round-Robin: 4 groups in time T. Prioritization: 3 groups in time T.
Fewer warp groups access the cache concurrently → less cache contention.

SLIDE 11

Reduction in L1 Miss Rates

[Chart: L1 miss rates under the cache-aware scheduler, normalized to the round-robin scheduler, for SAD, SSC, BFS, KMN, IIX, SPMV, BFSR, and AVG; annotated 34%.] [Jog et al., ASPLOS 2013]

❑ 25% improvement in IPC across 19 applications
❑ Limited benefits for cache-insensitive applications
❑ Software support (e.g., specify data structures that should be "uncacheable")

SLIDE 12

Other Sophisticated Mechanisms

❑ Rogers et al., Cache-Conscious Wavefront Scheduling, MICRO'12

❑ Kayiran et al., Neither More Nor Less: Optimizing Thread-level Parallelism for GPGPUs, PACT'13

❑ Chen et al., Adaptive Cache Management for Energy-efficient GPU Computing, MICRO'14

❑ Lee et al., CAWS: Criticality-Aware Warp Scheduling for GPGPU Workloads

SLIDE 13

Strategies

❑ Cache-Aware Warp Scheduling Techniques

  • Effective caching → Less Pressure on Memory

❑ Employing Assist Warps for Helping Data Compression

  • Bandwidth Preserved

❑ Bandwidth Allocation Strategies for Multi-Application Execution on GPUs

  • Better System Throughput and Fairness
SLIDE 14

Challenges in GPU Efficiency

[Figure: a GPU streaming multiprocessor with its cores, register file, and memory hierarchy; threads 1–3 fill the register file ("Full!") while cores sit "Idle!" waiting on memory.]

Thread limits lead to an under-utilized register file.
The memory bandwidth bottleneck leads to idle cores.

SLIDE 15

Motivation: Unutilized On-chip Memory

❑ 24% of the register file is unallocated on average
❑ Similar trends for on-chip scratchpad memory

[Chart: % unallocated registers (0–100%) per application.]

SLIDE 16

Motivation: Idle Pipelines

[Charts: breakdown of active vs. stall cycles (0–100%). Memory-bound applications (CONS, JPEG, LPS, MUM, RAY, SCP, PVC, PVR, bfs): 67% of cycles idle. Compute-bound applications (NN, STO, bp, hs, dmr, NQU, SLA, lc, pt, mc): 35% of cycles idle.]

SLIDE 17

Motivation: Summary

Heterogeneous application requirements lead to:

❑ Bottlenecks in execution
❑ Idle resources

SLIDE 18

Our Goal

❑ Use idle resources to do something useful: accelerate bottlenecks using helper threads.

❑ A flexible framework to enable helper threading in GPUs: Core-Assisted Bottleneck Acceleration (CABA).

[Figure: helper threads running on the cores, alongside the register file and memory hierarchy.]

SLIDE 19

Helper Threads in GPUs

❑ Large body of work in CPUs …

  • [Chappell+ ISCA '99, MICRO '02], [Yang+ USC TR '98], [Dubois+ CF '04], [Zilles+ ISCA '01], [Collins+ ISCA '01, MICRO '01], [Aamodt+ HPCA '04], [Lu+ MICRO '05], [Luk+ ISCA '01], [Moshovos+ ICS '01], [Kamruzzaman+ ASPLOS '11], etc.

❑ However, there are new challenges with GPUs…

SLIDE 20

Challenge

How do you efficiently manage and use helper threads in a throughput-oriented architecture?

SLIDE 21

Managing Helper Threads in GPUs

Where do we add helper threads?

[Figure: the design space: thread / warp / block granularity, managed in software vs. hardware.]

SLIDE 22

Approach #1: Software-only

[Figure: regular threads and helper threads coexisting in the same kernel.]

✓ No hardware changes
✗ Coarse-grained
✗ Not aware of runtime program behavior
✗ Synchronization is difficult

SLIDE 23

Where Do We Add Helper Threads?

[Figure: the thread / warp / block, software vs. hardware design space, revisited.]

SLIDE 24

Other functionality

In the paper:

❑ More details on the hardware structures
❑ Data communication and synchronization
❑ Enforcing priorities

SLIDE 25

CABA: Applications

❑ Data compression
❑ Memoization
❑ Prefetching
❑ Encryption
❑ …

SLIDE 26

A Case for CABA: Data Compression

❑ Data compression can help alleviate the memory bandwidth bottleneck: it transmits data in a more condensed form.

❑ CABA employs idle compute pipelines to perform compression.

[Figure: data moves compressed through the memory hierarchy and uncompressed at the cores; the otherwise idle pipelines do the (de)compression work.]

SLIDE 27

Data Compression with CABA

❑ Use assist warps to:

  • Compress cache blocks before writing to memory
  • Decompress cache blocks before placing them into the cache

❑ CABA flexibly enables various compression algorithms

❑ Example: BDI Compression [Pekhimenko+ PACT '12]

  • Parallelizable across the SIMT width (see the sketch below)
  • Low latency

❑ Others: FPC [Alameldeen+ TR '04], C-Pack [Chen+ VLSI '10]
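To see why BDI maps well onto the SIMT width, here is a hedged CUDA sketch of one BDI case (4-byte base, 1-byte deltas) for a block of 32 words; the real algorithm tries several base/delta sizes in parallel and also handles all-zero blocks:

    // Launch with one 32-thread warp per 128-byte block.
    __global__ void bdi_check(const int *block, bool *compressible) {
        long long delta = (long long)block[threadIdx.x]
                        - (long long)block[0];        // lane-parallel deltas
        bool fits = (delta >= -128 && delta <= 127);  // fits in one byte?
        // Warp vote: the block is compressible only if every lane's
        // delta fits (all 32 lanes must participate in the vote).
        bool all_fit = __all_sync(0xffffffffu, fits);
        if (threadIdx.x == 0) *compressible = all_fit;
    }

If the vote passes, the block can be stored as a 4-byte base plus 32 one-byte deltas (36 bytes instead of 128), which is the bandwidth saving the assist warps harvest.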

SLIDE 28

Walkthrough of Decompression

[Figure: components: scheduler, cores, L1D, L2 + memory, Assist Warp Store, Assist Warp Controller. An L1D hit needs no work; an L1D miss returns a compressed block from L2/memory, which triggers the Assist Warp Controller to fetch the decompression routine from the Assist Warp Store and schedule it onto the cores before the block is placed in the cache.]

SLIDE 29

Walkthrough of Compression

[Figure: same components; when a block is written back from L1D toward L2/memory, the Assist Warp Controller is triggered, fetches the compression routine from the Assist Warp Store, and runs it on the cores before the write-back proceeds.]

SLIDE 30

Effect on Performance

[Chart: performance normalized to baseline (1.0–2.8) for CABA-BDI and an ideal No-Overhead-BDI design.]

❑ CABA provides a 41.7% performance improvement.
❑ CABA achieves performance close to that of designs with no overhead for compression.

SLIDE 31

Effect on Bandwidth Consumption

[Chart: memory bandwidth consumption (percent of peak) for the baseline vs. CABA-BDI.]

Data compression with CABA alleviates the memory bandwidth bottleneck.

SLIDE 32

Conclusion

❑ Observation: Imbalances in execution leave GPU resources underutilized.

❑ Goal: Employ underutilized GPU resources to do something useful: accelerate bottlenecks using helper threads.

❑ Challenge: How do you efficiently manage and use helper threads in a throughput-oriented architecture?

❑ Solution: CABA (Core-Assisted Bottleneck Acceleration, ISCA'15)

  • A new framework to enable helper threading in GPUs
  • Enables flexible data compression to alleviate the memory bandwidth bottleneck
  • A wide set of use cases (e.g., prefetching, memoization)

SLIDE 33

Strategies

❑ Cache-Aware Warp Scheduling Techniques

  • Effective caching → Less Pressure on Memory

❑ Employing Assist Warps for Helping Data Compression

  • Bandwidth Preserved

❑ Bandwidth Allocation Strategies for Multi-Application Execution on GPUs

  • Better System Throughput and Fairness
SLIDE 34

Discrete GPU Cards --- Scaling Trends

  2008: GTX 275 (Tesla)      240 CUDA cores,  127 GB/sec
  2010: GTX 480 (Fermi)      448 CUDA cores,  139 GB/sec
  2012: GTX 680 (Kepler)    1536 CUDA cores,  192 GB/sec
  2014: GTX 980 (Maxwell)   2048 CUDA cores,  224 GB/sec
  2016: GP100 (Pascal)      3584 CUDA cores,  720 GB/sec
  2018: GV100 (Volta)       5120 CUDA cores,  900 GB/sec

SLIDE 35

Multi-Application Execution

❑ Not all applications have enough parallelism

  • GPU resources can be under-utilized

❑ Multiple CPUs send requests to GPUs
❑ Multiple players concurrently play games on the cloud

[Figure: CPU-1, CPU-2, CPU-3, …, CPU-N all submitting work to one GPU.]

SLIDE 36

❑ HIST+DGEMM: 40% improvement in system throughput (jobs/sec) over running alone

[Chart: weighted speedup (0.2–1.6) for HIST and DGEMM, run alone vs. concurrently.] [Jog et al., GPGPU 2014]

SLIDE 37

❑ GAUSS+GUPS: Only 2% improvement in system throughput (jobs/sec) over running alone

[Chart: weighted speedup (0.2–1.6) for GAUSS and GUPS, run alone vs. concurrently.] [Jog et al., GPGPU 2014]

SLIDE 38

Memory Bandwidth Allocation

[Chart: percentage of peak bandwidth attained by the 1st app, the 2nd app, wasted BW, and idle BW, for each pairing of HIST, GAUSS, GUPS, BFS, 3DS, and DGEMM as the 1st app against the others, plus alone_30 / alone_60 baselines.] [Jog et al., GPGPU 2014]

GUPS (a heavy application) hurts other, lighter applications.

SLIDE 39

Fairness

❑ Unpredictable performance impact
❑ Fairness problems in the system

  • Unequal performance impact

[Chart: HIST's normalized IPC when co-run with DGEMM vs. with GUPS; co-running with GUPS degrades HIST far more.] [Jog et al., GPGPU 2014]

What is the best way to allocate bandwidth to different applications?

SLIDE 40

1. Infrastructure Development

❑ Many existing CUDA applications do not employ "CUDAStreams" to enable multi-programmed execution (see the stream sketch below)

❑ Developed a GPU concurrent-application framework to enable multi-programming in GPUs

❑ Available at https://github.com/adwaitjog/mafia

[Jog et al., MEMSYS 2015]
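For reference, this is the plain CUDA-stream launch pattern such a framework has to retrofit into existing applications (kernelA/kernelB are stand-ins for two applications' kernels; everything else is the standard CUDA runtime API):

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Kernels launched into different streams may execute concurrently
    // on one GPU, subject to SM and memory-resource availability.
    kernelA<<<gridA, blockA, 0, s1>>>(/* args */);
    kernelB<<<gridB, blockB, 0, s2>>>(/* args */);

    cudaStreamSynchronize(s1);   // wait for each application's work
    cudaStreamSynchronize(s2);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);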

SLIDE 41

2. Application Performance Modeling

[Chart: normalized IPC across GUPS, MUM, QTC, BFS2, NW, LUH, RED, SCAN, SCP, CFD, FWT, BLK, SRAD, LIB, JPEG, 3DS, CONS, HISTO, MM, BP, HS, SAD, NN, RAY, TRD, comparing the simulator against a model driven by attained bandwidth (BW) and misses per instruction (MPI).]

Also, on real hardware (NVIDIA K20), the absolute relative error is less than 10%, averaged across 22 applications.

How can we utilize this model to develop a better memory scheduler?

[Jog et al., MEMSYS 2015]

SLIDE 42

Bandwidth Sharing Mechanisms

❑ Prioritize the application with the least BW (alone) to optimize for weighted speedup.

❑ In the paper, we show that prioritizing the application with the least attained bandwidth can improve weighted speedup. Intuition: shifting a small amount of bandwidth ε from application 2 to application 1 improves weighted speedup when

    \frac{(BW_1+\epsilon)/MPI_1}{BW_1^{alone}/MPI_1} + \frac{(BW_2-\epsilon)/MPI_2}{BW_2^{alone}/MPI_2} > \frac{BW_1/MPI_1}{BW_1^{alone}/MPI_1} + \frac{BW_2/MPI_2}{BW_2^{alone}/MPI_2}

which reduces to \epsilon/BW_1^{alone} > \epsilon/BW_2^{alone}, i.e., BW_1^{alone} < BW_2^{alone}.

[Jog et al., MEMSYS 2015]
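Since each weighted-speedup term above reduces to BW_i / BW_i^{alone} when MPI is unchanged, the rule becomes a one-line comparison; a toy C++ sketch of my reading of the model (not code from the paper):

    struct App {
        double bw;        // bandwidth attained while sharing the GPU
        double bw_alone;  // bandwidth attained when run alone
    };

    // Moving epsilon of bandwidth to app a gains epsilon / a.bw_alone
    // of weighted speedup and costs epsilon / b.bw_alone, so favoring
    // the application with the smaller alone-bandwidth is a net win.
    const App& prioritize(const App& a, const App& b) {
        return (a.bw_alone < b.bw_alone) ? a : b;
    }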

SLIDE 43

Results

❑ The Misses Per Instruction (MPI) metric is not a good proxy for GPU performance.

❑ The Attained Bandwidth (BW) and Misses Per Instruction (MPI) metrics can drive memory scheduling decisions for better throughput and fairness.

❑ 10% improvement in weighted speedup and fairness

  • over 25 representative 2-app workloads

❑ More results: scalability; application-to-core mapping mechanisms.

[Jog et al., MEMSYS 2015]

SLIDE 44

Conclusions

❑ Data movement and bandwidth are major bottlenecks.
❑ Three issues we discussed today:

  • High cache miss rates → warp scheduling!
  • Bandwidth is critical → data compression!
  • Sub-optimal memory bandwidth allocation → memory scheduling!

❑ Other avenues and directions?

  • Processing Near/In Memory (PIM)
  • Value Prediction and Approximations
SLIDE 45

Other Sophisticated Mechanisms

❑ Wang et al., Efficient and Fair Multi-programming in GPUs via Effective Bandwidth Management, HPCA'18

❑ Park et al., Dynamic Resource Management for Efficient Utilization of Multitasking GPUs, ASPLOS'17

❑ Xu et al., Warped-Slicer: Efficient Intra-SM Slicing through Dynamic Resource Partitioning for GPU Multiprogramming, ISCA'16

SLIDE 46

Key GPU Performance Concerns

Memory Concerns: Data transfers between SMs and global memory are costly.

Compute Concerns: Threads that do not take the same control path lead to serialization in the GPU compute pipeline.

[Figure: the same serialization timeline (blocks A–E, threads T1–T4) and SM / global-memory bottleneck diagram from Slide 3.]

SLIDE 47

Compute Concerns

❑ Challenge: How do we handle branch operations when different threads in a warp follow different paths through the program?

❑ Solution: Serialize the different paths.

    foo[] = {4, 8, 12, 16};

    A: v = foo[threadIdx.x];
    B: if (v < 10)
    C:     v = 0;
       else
    D:     v = 10;
    E: w = bar[threadIdx.x] + v;

[Figure: execution timeline: all four threads execute A and B; T1 and T2 execute C while T3 and T4 idle, then T3 and T4 execute D, and all four reconverge at E.]

SLIDE 48

Control Divergence

– Control divergence occurs when threads in a warp take different control-flow paths by making different control decisions.

  • Some take the then-path and others take the else-path of an if-statement.
  • Some threads execute different numbers of loop iterations than others.

– The execution of threads taking different paths is serialized on current GPUs:

  • The control paths taken by the threads in a warp are traversed one at a time until there are no more.
  • During the execution of each path, all threads taking that path execute in parallel.
  • The number of different paths can be large when considering nested control-flow statements.

SLIDE 49

Control Divergence Examples

– Divergence can arise when branch or loop condition is a function of thread indices – Example kernel statement with divergence:

– if (threadIdx.x > 2) { } – This creates two different control paths for threads in a block – Decision granularity < warp size; threads 0, 1 and 2 follow different path than the rest of the threads in the first warp

– Example without divergence:

– If (blockIdx.x > 2) { } – Decision granularity is a multiple of blocks size; all threads in any given warp follow the same path
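Expanded into complete kernels (illustrative, not from the slides), the two cases look like this; only the first serializes lanes within a warp:

    // Divergent: the condition depends on threadIdx.x, so lanes of the
    // first warp split across the two paths and run them one at a time.
    __global__ void divergent(int *out) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (threadIdx.x > 2) out[i] = 1;
        else                 out[i] = 2;
    }

    // Non-divergent: the condition depends only on blockIdx.x, so every
    // thread in a given warp makes the same decision.
    __global__ void uniform(int *out) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (blockIdx.x > 2) out[i] = 1;
        else                out[i] = 2;
    }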

SLIDE 50

SIMT Hardware Stack

[Figure: a warp of four threads (T1–T4) with a common PC executes the control-flow graph A → B → {C, D} → E → G. Each stack entry holds a reconvergence PC, a next PC, and an active mask. At the divergent branch B (mask 1111), entries (E, D, 0110) and (E, C, 1001) are pushed with reconvergence point E; the top-of-stack (TOS) path executes first and is popped at E, then the other path runs, and the full mask 1111 resumes at E and G. Execution order over time: A, then C and D serially, then E, then G.]

Potential for significant loss of throughput when control flow diverges!
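A minimal C++ model of that bookkeeping (my simplification of the mechanism in the figure: block labels stand in for PCs, and masks are 4-bit for the 4-thread warp):

    #include <cstdint>
    #include <stack>

    struct Entry {
        char     reconv_pc;  // where the split paths re-join ('E')
        char     next_pc;    // next block to run for this entry
        uint32_t mask;       // active lanes, e.g., 0b1001
    };

    // Divergent branch at B: replace the running entry with one entry
    // for the reconvergence point and one per path; TOS runs first.
    void diverge(std::stack<Entry>& s, uint32_t taken, uint32_t fallthru) {
        s.pop();                                // B has executed
        s.push({'G', 'E', taken | fallthru});   // resume at E with 1111
        s.push({'E', 'D', fallthru});           // else-path, mask 0110
        s.push({'E', 'C', taken});              // then-path, mask 1001
    }
    // Each cycle the SM executes s.top().next_pc under s.top().mask;
    // when next_pc reaches reconv_pc the entry is popped, so C (1001)
    // and D (0110) run serially, then E and G run with the full mask.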

SLIDE 51

Performance vs. Warp Size

❑ 165 applications

[Chart: per-application IPC with warp size 4, normalized to warp size 32 (0.2–1.8). Applications fall into three classes: convergent applications, warp-size-insensitive applications, and divergent applications.]

SLIDE 52

Dynamic Warp Formation (Fung, MICRO'07)

[Figure: three 4-wide warps (threads 1–4, 5–8, 9–12) each diverge between paths C and D after block A. While warps wait on reissue/memory latency, threads at the same PC are repacked across warps: threads 1, 2 (Warp 0) merge with threads 7, 8 (Warp 1) into one full warp at C, and threads 5 (Warp 1) and 11, 12 (Warp 2) form another; SIMD efficiency reaches 88%.]

How to pick threads to pack into warps?
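One plausible packing heuristic, sketched in C++ (my own illustration, not Fung's exact mechanism): group ready threads by PC, then fill warps subject to the constraint that each thread keeps its home SIMD lane; lane conflicts are exactly what kept the example above at 88% rather than 100% efficiency.

    #include <map>
    #include <vector>

    struct Thread { int id; int lane; int pc; };  // lane fixed by reg-file bank

    // Greedy DWF-style packing: threads at the same PC merge into a warp
    // as long as no two of them need the same lane.
    std::vector<std::vector<Thread>> pack(const std::vector<Thread>& ready,
                                          int warp_width) {
        std::map<int, std::vector<std::vector<Thread>>> by_pc;
        for (const Thread& t : ready) {
            bool placed = false;
            for (auto& w : by_pc[t.pc]) {         // first warp with lane free
                bool lane_taken = false;
                for (const Thread& u : w)
                    if (u.lane == t.lane) { lane_taken = true; break; }
                if (!lane_taken && (int)w.size() < warp_width) {
                    w.push_back(t); placed = true; break;
                }
            }
            if (!placed) by_pc[t.pc].push_back({t});  // start a new warp
        }
        std::vector<std::vector<Thread>> warps;   // flatten for issue
        for (auto& [pc, ws] : by_pc)
            for (auto& w : ws) warps.push_back(w);
        return warps;
    }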

SLIDE 53

More References

❑ Intel [MICRO 2011]: Thread Frontiers – early reconvergence for unstructured control flow.

❑ UT-Austin/NVIDIA [MICRO 2011]: Large Warps – similar to TBC, except it decouples the size of the thread stack from the thread block size.

❑ NVIDIA [ISCA 2012]: Simultaneous branch and warp interweaving – enables SIMD to execute two paths at once.

❑ Intel [ISCA 2013]: Intra-warp compaction – extends the Xeon Phi uarch to enable compaction.

❑ NVIDIA: Temporal SIMT [described briefly in an IEEE Micro article and in more detail in a CGO 2013 paper].

❑ NVIDIA [ISCA 2015]: Variable Warp-Size Architecture – merges small warps (4 threads) into "gangs".