slide-1
SLIDE 1

Diogenes: A tool for exposing Hidden GPU Performance Opportunities

Benjamin Welton and Barton Miller

2019 Performance Tools Workshop

July 29th, Tahoe, CA.

slide-2
SLIDE 2

Overview of Diogenes

Automatically detect performance issues with CPU-GPU interactions (synchronizations, memory transfers)

  • Unnecessary interactions
  • Misplaced interactions
  • We do not do GPU kernel profiling, general CPU profiling, etc.

Output is a list of unnecessary or misplaced interactions

  • Including an estimate of the potential benefit (in terms of application runtime) of fixing these issues.

2

slide-3
SLIDE 3

Features of Diogenes

Binary instrumentation of the application and CUDA user space driver for data collection

  • Collect information not available from other methods
  • Use (or non-use) of data from the GPU by the CPU
  • Identify hidden interactions
  • Conditional interactions (e.g., a synchronous cuMemcpyAsync call).
  • Detect and measure interactions on the private API.
  • Directly measure synchronization time
  • Look at the contents of memory transfers

Analysis method to show only problematic interactions.
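The "directly measure synchronization time" point can be illustrated with a minimal sketch. The helper below is hypothetical: Diogenes inserts equivalent timing through binary instrumentation of libcuda, not through source edits like this.

```cpp
#include <chrono>
#include <functional>

// Illustrative trampoline body: time an arbitrary synchronizing call,
// the way instrumentation would wrap e.g. cuStreamSynchronize.
double timed_call_ms(const std::function<void()>& sync_call) {
    auto start = std::chrono::steady_clock::now();
    sync_call();  // the wrapped synchronization
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count();
}
```

Measuring the call directly, rather than inferring it from API timestamps, is what lets hidden synchronizations show up in the data.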

3

slide-4
SLIDE 4

Current Status of Diogenes

Prototype is working on Power 8/9 architectures

  • Including on the current GPU driver versions used on LLNL/ORNL machines

What Works:

  • Identifying unnecessary transfers (non-unified memory transfers only)
  • Identifying unnecessary/misplaced synchronizations that occur at a single point (types 1 and 2 below)

Type 1: No use of GPU computed data

Synchronization();
for (…) { // Work with no GPU dependencies }
Synchronization();

Type 2: Misplaced synchronization

Synchronization();
for (…) { // Work with no GPU dependencies }
result = GPUData[0] + …

4
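A hedged, CPU-only sketch of the type 2 fix (the stubs below stand in for real GPU calls and data; none of this is Diogenes' code): moving the synchronization down to just before the first use of GPU data lets the independent CPU work overlap GPU execution.

```cpp
#include <vector>

// Stubs standing in for GPU state (illustrative, not real CUDA calls).
static int sync_count = 0;
void Synchronization() { ++sync_count; }      // e.g., a stream synchronize
static std::vector<int> GPUData(16, 7);       // data the GPU produced

// Type 2 pattern: synchronize, then do work with no GPU dependencies.
int misplaced_sync() {
    Synchronization();                        // blocks before independent work
    int acc = 0;
    for (int i = 0; i < 1000; i++) acc += i;  // work with no GPU dependencies
    return acc + GPUData[0];                  // first real use of GPU data
}

// Fix: run the independent work first, synchronize only before the use.
int moved_sync() {
    int acc = 0;
    for (int i = 0; i < 1000; i++) acc += i;  // overlaps with GPU execution
    Synchronization();                        // now placed just before the use
    return acc + GPUData[0];
}
```

Both variants compute the same result; the second lets the CPU loop run while the GPU is still busy.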

slide-5
SLIDE 5

Current Status of Diogenes

Ncurses interface for exploring Diogenes analysis

5

slide-6
SLIDE 6

Diogenes Predictive Accuracy Overview

6

App Name | App Type | Diogenes Estimated Benefit (Top N, % of Exec) | Actual Benefit by Manual Fix (Top N, % of Exec)
cumf_als | Matrix Factorization | 10.0% | 8.3%
AMG | Algebraic Solver | 6.8% | 5.8%
Rodinia Gaussian | Benchmark | 2.2% | 2.1%
cuIBM | CFD | 10.8% | 17.6%

  • Estimates for the top 1-3 most prominent problems in each application.
  • Tried to be as careful as possible to alter only the problematic operation.

slide-7
SLIDE 7

Diogenes Collection and Analysis Techniques

  • 1. Identify and time interactions
  • Including hidden synchronizations and memory transfers
  • Binary instrumentation of libcuda to identify and time calls performing synchronizations and/or data transfers

  • 2. Determine the necessity of the interaction
  • If the interaction is necessary for correctness, is it placed in an efficient location?
  • Synchronizations: a combination of memory tracing, CPU profiling, and program slicing
  • Duplicate data transfers: a content-based data deduplication approach

  • 3. Provide an estimate of the benefit of fixing the bad interactions
  • Diogenes uses a new feed-forward instrumentation workflow for data collection, combined with a new model, to produce the estimate
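The duplicate-transfer check can be sketched as content hashing of each transfer's payload. This is an illustrative reduction: the class and method names are invented, and a real implementation would byte-compare buffers on a hash match rather than trust the hash alone.

```cpp
#include <cstddef>
#include <functional>
#include <string_view>
#include <unordered_set>

// Sketch: hash each host-to-device transfer's contents; seeing the same
// hash again suggests the same data was copied more than once.
class TransferDedup {
    std::unordered_set<std::size_t> seen_;
public:
    // Returns true if an identical payload was already transferred.
    bool is_duplicate(const void* buf, std::size_t len) {
        std::string_view bytes(static_cast<const char*>(buf), len);
        std::size_t h = std::hash<std::string_view>{}(bytes);
        return !seen_.insert(h).second;  // insert fails => seen before
    }
};
```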

7

slide-8
SLIDE 8

Diogenes – Workflow

Diogenes uses a newly developed technique called feed forward instrumentation:

  • The results of previous instrumentation guide the insertion of new instrumentation.

8

Diogenes performs each step automatically (via a launcher)
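The feed-forward idea can be sketched in a few lines. This is purely illustrative: the struct and threshold are invented, not Diogenes' actual interfaces. Each round's measurements select which call sites receive heavier instrumentation in the next round.

```cpp
#include <map>
#include <string>
#include <vector>

// One instrumentation round's output: per-call-site measured cost.
struct Round {
    std::map<std::string, double> site_time_ms;
};

// Feed-forward step: only sites that proved expensive in the previous
// round are instrumented (more deeply) in the next one.
std::vector<std::string> sites_to_instrument_next(const Round& prev,
                                                  double threshold_ms) {
    std::vector<std::string> next;
    for (const auto& [site, ms] : prev.site_time_ms)
        if (ms >= threshold_ms) next.push_back(site);
    return next;
}
```

The payoff of this staging is that the expensive analyses (memory tracing, slicing) only run on call sites the cheaper timing rounds have already flagged.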

slide-9
SLIDE 9

Diogenes – Workflow

Diogenes uses a newly developed technique called feed forward instrumentation:

  • The results of previous instrumentation guide the insertion of new instrumentation.

9

Application

Step 1 Measure execution time of the application (without instrumentation) libcuda.so

Diogenes performs each step automatically (via a launcher)

slide-10
SLIDE 10

Diogenes – Workflow

Diogenes uses a newly developed technique called feed forward instrumentation:

  • The results of previous instrumentation guide the insertion of new instrumentation.

10

Application

Step 1 Measure execution time of the application (without instrumentation) libcuda.so

Application

libcuda.so Diogenes Step 2 Instrument libcuda to identify and time synchronizations and Memory Transfers

Diogenes performs each step automatically (via a launcher)

slide-11
SLIDE 11

Diogenes – Workflow

Diogenes uses a newly developed technique called feed forward instrumentation:

  • The results of previous instrumentation guide the insertion of new instrumentation.

11

Application

Step 1 Measure execution time of the application (without instrumentation) libcuda.so

Application

libcuda.so Diogenes Step 2 Instrument libcuda to identify and time synchronizations and Memory Transfers

Diogenes performs each step automatically (via a launcher)

Application

libcuda.so Diogenes Step 3 Instrument application to determine necessity of the operation.
slide-12
SLIDE 12

Diogenes – Workflow

Diogenes uses a newly developed technique called feed forward instrumentation:

  • The results of previous instrumentation guide the insertion of new instrumentation.

12

Application

Step 1 Measure execution time of the application (without instrumentation) libcuda.so

Application

libcuda.so Diogenes Step 2 Instrument libcuda to identify and time synchronizations and Memory Transfers

Diogenes performs each step automatically (via a launcher)

Application

libcuda.so Diogenes Step 3 Instrument application to determine necessity of the operation.

Step 4 Model potential benefit using data from Steps 1-3 to identify problematic calls and potential savings

Output table columns: Call, Type, Potential Savings

slide-13
SLIDE 13

Diogenes – Overhead/Limitations

Overhead:

  • 6x-20x application run time (down from 30-70x)
  • Dyninst parsing overhead on very large binaries (e.g., >40 minutes for a 1.5 GB binary)
  • Parse overhead is now in the few-minute range for large binaries, thanks to parallel parsing

Limited to programs with a single user thread

13

slide-14
SLIDE 14

The Gap in Performance Tools

Existing tools (CUPTI, etc.) have collection and analysis gaps that prevent detection of these issues:

  • They don't collect performance data on hidden interactions:
  • Conditional interactions
  • Implicitly synchronizing API calls
  • Private API calls

14

slide-15
SLIDE 15

Conditional Interaction

Conditional Interactions are unreported (and undocumented) synchronizations/transfers performed by a CUDA call.

15

Diagram: libcuda.so internals (Driver API, internal synchronization implementation, internal memory copy implementation)

dest = malloc(size);
cuMemcpyDtoHAsync_v2(dest, gpuMem, size, stream);


slide-19
SLIDE 19

Conditional Interaction

Conditional Interactions are unreported (and undocumented) synchronizations/transfers performed by a CUDA call.

19

Diagram: libcuda.so internals (Driver API, internal synchronization implementation, internal memory copy implementation)

dest = malloc(size);
cuMemcpyDtoHAsync_v2(dest, gpuMem, size, stream);

Synchronous due to the way dest was allocated (malloc returns pageable host memory).

slide-21
SLIDE 21

Conditional Interaction Collection Gap

CUPTI doesn’t report when undocumented interactions are performed by a call.

21

Diagram: libcuda.so internals, with CUPTI attached

dest = malloc(size);
cuMemcpyDtoHAsync_v2(dest, gpuMem, size, stream);

CUPTI reports: cuMemcpyDtoHAsync_v2 memory transfer time

The callback to CUPTI does not contain information about whether a synchronization occurred.

slide-22
SLIDE 22

Conditional Interaction Collection Gap

Hard to detect with library interposition approaches due to:

22

Diagram: an interposition layer sits between the application and libcuda.so

dest = malloc(size);
cuMemcpyDtoHAsync_v2(dest, gpuMem, size, stream);

  • 1. Need to know under what undocumented conditions a call can perform an interaction.
  • 2. Need to capture operations potentially unrelated to CUDA to see if the call meets those conditions.
  • 3. Hope that a driver update doesn't change behavior.

slide-23
SLIDE 23

Implicit Synchronization Collection Gap

CUPTI does not collect synchronization performance data for implicitly synchronizing CUDA calls

  • Examples include cudaMemcpy, cudaFree, etc.

We believe CUPTI collects performance data for synchronizations only for the following calls:

  • cudaDeviceSynchronize
  • cudaStreamSynchronize

[Unconfirmed] A change in the way synchronizations are performed in CUDA 10 affects all CUDA calls:

  • It now appears all calls check whether a synchronization should be performed
  • This is a change from the previous behavior, where only potentially synchronous calls performed this check

23


slide-25
SLIDE 25

The Private API

A large private API is used by the Nvidia compute libraries (cufft, cublas, cudnn, etc.); it has all the capabilities of the public API (and many more).

25

Diagram: Nvidia compute libraries call into libcuda.so through the Private API, reaching the same internal synchronization and memory copy implementations as the Driver API

Calls on the private API are not reported by CUPTI* and are not captured by library interposition.

*Fun Fact: CUPTI sets its callbacks through the Private API

slide-26
SLIDE 26

Diogenes Predictive Accuracy Overview

26

App Name | App Type | Diogenes Estimated Benefit (% of Exec) | Actual Benefit by Manual Fix (% of Exec)
cumf_als | Matrix Factorization | 10.0% | 8.3%
AMG | Algebraic Solver | 6.8% | 5.8%
Rodinia Gaussian | Benchmark | 2.2% | 2.1%
cuIBM | CFD | 10.8% | 17.6%

cuIBM and cumf_als had synchronization issues that were symptoms of larger problems:

  • Memory management issues (cudaMalloc/cudaFree)
  • Asynchronous transfer issues (synchronous cudaMemcpyAsync)

Fixing the cause of these issues can result in a much larger benefit:

  • Removing the malloc, using cudaMallocHost to allocate memory to be used with cudaMemcpyAsync, etc.


slide-28
SLIDE 28

Identifying Larger Synchronization Problems

Extend Diogenes to determine the potential remedy of the synchronization issue:

  • Remove the synchronization
  • Move the synchronization
  • Fix the memory management issue
  • Fix the asynchronous transfer issue

28

for (int i = 0; i < 100000; i++) {
    cudaMalloc(&A, ...);
    cudaFree(A);
}

Memory Management Issue

The synchronization at cudaFree is unnecessary; it could be corrected by fixing this malloc/free pair.

slide-29
SLIDE 29

Identifying Larger Synchronization Problems

Implemented an autocorrect feature that can apply a remedy for memory management and asynchronous transfer issues

  • No modeling; the number reported is the actual benefit.

29

for (int i = 0; i < 100000; i++) {
    cudaMalloc(&A, ...);
    cudaFree(A);
}

Memory Management Issue

slide-30
SLIDE 30

Identifying Larger Synchronization Problems

Implemented an autocorrect feature that can apply a remedy for memory management and asynchronous transfer issues

  • No modeling; the number reported is the actual benefit.

30

for (int i = 0; i < 100000; i++) {
    DIOGENES_CudaMalloc(A, ...);
    DIOGENES_CudaFree(A);
}

Memory Management Issue

Use Dyninst to rewrite cudaFree calls (and their associated cudaMalloc operations) with calls to a memory pool that does not synchronize.
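The non-synchronizing pool can be sketched as a size-keyed free list. This is a minimal host-side sketch under stated assumptions: plain `operator new` stands in for cudaMalloc, and the class name is illustrative, not the DIOGENES_ wrappers' actual design. Releasing a block caches it for reuse instead of performing a real (synchronizing) free.

```cpp
#include <cstddef>
#include <map>

// Sketch of a pool that avoids the synchronizing free: released blocks
// go to a size-keyed free list and are handed back to later allocations.
class MemoryPool {
    std::multimap<std::size_t, void*> free_blocks_;
public:
    void* allocate(std::size_t size) {
        auto it = free_blocks_.find(size);
        if (it != free_blocks_.end()) {      // reuse a cached block: no
            void* p = it->second;            // real allocation needed
            free_blocks_.erase(it);
            return p;
        }
        return ::operator new(size);         // fall back to a real allocation
    }
    void release(std::size_t size, void* p) {
        free_blocks_.emplace(size, p);       // defer the real free entirely
    }
    std::size_t cached() const { return free_blocks_.size(); }
};
```

In the malloc/free loop above, only the first iteration would perform a real allocation; every later iteration reuses the cached block, so the synchronizing free never runs.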

slide-31
SLIDE 31

Diogenes Autocorrect Preliminary Results

31

App Name | App Type | Diogenes Estimated Benefit (% of Exec) | AutoFix Reduction in Exec Time (% of Exec)
cumf_als | Matrix Factorization | 17.3% | 43%
cuIBM | CFD | 22.0% | 47%

Note: this is still in-progress research; numbers may change.

slide-32
SLIDE 32

Questions?

Papers:

  • Diogenes: Looking For An Honest CPU/GPU Performance Measurement Tool
  • To appear at SC19, Available now on http://paradyn.org/
  • Autocorrect/Remedy Identification with Diogenes
  • Available soon

Diogenes Github: http://github.com/bwelton/diogenes

32