S9347: Performance Analysis for Large Scale GPU Applications and DL Frameworks

SLIDE 1

www.hzdr.de

S9347: Performance Analysis for Large Scale GPU Applications and DL Frameworks

  • Dr. Guido Juckeland / Robert Henschel

Head of Computational Science Department / Director of Science Community Tools at Indiana University

SLIDE 2

Agenda

What to expect from the next 80 minutes

  • Motivation
  • Generating profiles and trace files with Score-P
  • Visualizing trace files with Vampir
  • Looking into Deep Learning Frameworks
SLIDE 3

Disclaimer

It’s extremely easy to waste performance

  • Poor/no GPU usage (80–90% wasted)
  • Bad MPI communication (50–90% wasted)
  • Total: 1% of peak performance (or worse)
  • Performance tools will not “automagically” make your code faster – they just point to “areas of interest”

SLIDE 4

Motivation

Performance Tuning 101

SLIDE 5

Profiling vs. Tracing

[Figure: a profile keeps statistics – number of invocations and execution time per function (main, foo, bar) – while a trace preserves the details as timelines of the same run.]

SLIDE 6

Sampling

Periodic observations of your application (Pull)

  • Running program is periodically interrupted to take measurement
  • Statistical inference of program behavior
  • Not very detailed information on highly volatile metrics
  • Requires long-running applications
  • Works with unmodified executables

[Figure: timeline of main, foo, and bar; measurements are taken at periodic sample points t1…t9.]

SLIDE 7

Instrumentation

Modify application to deliver information (Push)

  • Measurement code is inserted such that every event of interest is captured directly
  • Advantage: much more detailed information

[Figure: timeline of main, foo, and bar; a measurement event is recorded at every function entry and exit, t1…t14.]

  • Disadvantages:
    • Processing of source code / executable necessary
    • Large relative overheads for small functions
SLIDE 8

Sampling vs. Tracing

Comparing both approaches visually

[Figure: the same run of main, calculate, add, and f recorded twice – function instrumentation pushes every call into the trace buffer, while sampling records only periodic observations.]

SLIDE 9

Sampling + Instrumentation

Combining the best of both worlds

  • Long-running applications:
    • Require large buffers or heavy filtering
    • Creating a filter requires runs in advance
  • Codes with many small functions (e.g., C++):
    • Function instrumentation is a challenge

[Figure: user code (main, calculate, add, f) is sampled while MPI calls are instrumented; both kinds of events feed the same trace buffer.]

  • Score-P: Sampling+Tracing
SLIDE 10

Terms and How They Relate

Making sure we use the same words

  • Data acquisition: sampling or event-based instrumentation (either technique)
  • Data recording: summarization (profiling) or logging (tracing)
  • Data presentation: profiles (from profiling) or timelines (from tracing)

SLIDE 11

Summary

Making the “right” choices

SLIDE 12

Generating Traces and Profiles with Score-P

SLIDE 13

Overall workflow

Recording and studying performance data

  • Attach Score-P to application
  • Run with attached monitor ==> trace/profile data
  • Study trace with Vampir / profile with Cube
  • Repeat to:
  • Adapt instrumentation (“what you measure”)
  • Evaluate result of a change

[Diagram: application core + attached Score-P monitor → trace data → performance visualization.]
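As a concrete pass through this loop, a first measurement cycle might look like this (a sketch; app.c and the file names are hypothetical):

  $ scorep mpicc -O2 app.c -o app               # attach Score-P at build time
  $ export SCOREP_EXPERIMENT_DIRECTORY=profile
  $ mpirun -np 4 ./app                          # run with the attached monitor
  $ cube profile/profile.cubex                  # study the resulting profile with Cube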

SLIDE 14

Attaching Score-P

a.k.a. instrumenting your source code

Original Makefile variables:

  CC    = pgcc
  CXX   = pgCC
  F90   = pgf90
  MPICC = mpicc
  NVCC  = nvcc

Prefixed with the Score-P instrumenter:

  CC    = scorep <options> pgcc
  CXX   = scorep <options> pgCC
  F90   = scorep <options> pgf90
  MPICC = scorep <options> mpicc
  NVCC  = scorep <options> nvcc

$ scorep --help
This is the Score-P instrumentation tool. The usage is:
scorep <options> <original command>
Common options are: ...

  • --instrument-filter=<file>
    Specifies the filter file for filtering functions at compile time. It applies the same syntax as the one used by Score-P at run time (see the sketch below).
  • --user
    Enables user instrumentation.
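For illustration, a minimal filter file might look like this (a sketch using the documented Score-P filter syntax; the function names are hypothetical):

  # filter.cfg – instrument nothing except main and the compute kernels
  SCOREP_REGION_NAMES_BEGIN
    EXCLUDE *
    INCLUDE main
            compute_*
  SCOREP_REGION_NAMES_END

  $ scorep --instrument-filter=filter.cfg mpicc -O2 app.c -o app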
SLIDE 15

Attaching Score-P

Instrument once – change measurement via runtime variables

$ scorep-info config-vars --full
  SCOREP_ENABLE_PROFILING
    [...]
  SCOREP_ENABLE_TRACING
    [...]
  SCOREP_TOTAL_MEMORY
    Description: Total memory in bytes for the measurement system
    [...]
  SCOREP_EXPERIMENT_DIRECTORY
    Description: Name of the experiment directory
    [...]

$ export SCOREP_ENABLE_PROFILING=true
$ export SCOREP_ENABLE_TRACING=false
$ export SCOREP_EXPERIMENT_DIRECTORY=profile
$ mpirun <instrumented binary>

Profiling Example
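Because measurement is controlled at run time, the same binary can record a trace instead – no recompilation needed (a minimal sketch; the directory name is arbitrary):

  $ export SCOREP_ENABLE_PROFILING=false
  $ export SCOREP_ENABLE_TRACING=true
  $ export SCOREP_EXPERIMENT_DIRECTORY=trace
  $ mpirun <instrumented binary>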

SLIDE 16

Combined Sampling+Tracing

Available since Score-P 2.0

  • User code is sampled (pull)
  • Runtime libraries with tracing support use events (push):
  • MPI
  • OpenMP / OpenACC / pthreads
  • CUDA / OpenCL
  • I/O

$ export SCOREP_ENABLE_TRACING=true
$ export SCOREP_ENABLE_UNWINDING=true
$ export SCOREP_SAMPLING_EVENTS=perf_cycles@2000000

SLIDE 17

Things to look at

What can Score-P record?

Score-P runs alongside the application on the HPC system and records a performance measurement (profile/trace) from four kinds of sources:

  • User functions: C/C++/Fortran, sampling (new), custom regions, Java, Python (experimental)
  • Parallel paradigms: MPI, Pthreads, OpenMP, Xeon Phi native (new), CUDA, OpenACC/OpenCL (new), OpenSHMEM (+Cray), I/O (experimental)
  • Operating system: resource usage
  • Hardware: performance counters (PAPI), plugin counters
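As an example of the hardware source, PAPI counters are requested through an environment variable (a sketch; which counters are available depends on the machine):

  $ export SCOREP_METRIC_PAPI=PAPI_TOT_INS,PAPI_FP_OPS
  $ mpirun <instrumented binary>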

SLIDE 18

GPU Tracing

Example CUDA and OpenACC

  • Can be used in combination
  • Also supports CUPTI counters

$ export SCOREP_ENABLE_TRACING=yes
$ export SCOREP_TIMER=clock_gettime
$ export SCOREP_CUDA_ENABLE=driver,kernel,memcpy,flushatexit
$ export SCOREP_OPENACC_ENABLE=yes
$ export ACC_PROFLIB=$SCOREP_LIB/libscorep_adapter_openacc_event.so
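Putting this together for a pure CUDA code, a session might look as follows (a sketch; jacobi.cu is a hypothetical source file, and a Score-P installation built with CUDA support is assumed):

  $ scorep nvcc -O2 jacobi.cu -o jacobi        # wrap the compile to attach the monitor
  $ export SCOREP_ENABLE_TRACING=yes
  $ export SCOREP_CUDA_ENABLE=driver,kernel,memcpy
  $ export SCOREP_EXPERIMENT_DIRECTORY=trace_gpu
  $ ./jacobi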

SLIDE 19

Limitations

Why tracing is hard

  • Event tracing requires trade-offs:
    • Only add the data sources you need
    • Limit granularity (i.e., filtering)
  • Score-P’s default measurement mode is a profiling experiment – run that first to gauge what a trace would cost (see the sketch below)

[Diagram: application + attached Score-P monitor → trace data → performance visualization.]

Score-P adds overhead at run time, and that overhead must stay low for the performance analysis to be meaningful. Trace data is temporarily stored in main memory, which is limited in size.
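A sketch of that pre-check with scorep-score, assuming the default profile name inside the experiment directory:

  $ scorep-score -r profile/profile.cubex              # per-region trace buffer estimate
  $ scorep-score -f filter.cfg profile/profile.cubex   # preview the effect of a filter file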

SLIDE 20

DEMO: Generating Traces and Profiles with Score-P

SLIDE 21

Visualizing Profiles with CUBE and Traces with Vampir

SLIDE 22

Bringing it all together

Score-P + Analysis Tools

[Diagram: Score-P architecture. On top, the analysis tools Vampir, Scalasca, Periscope, and TAU consume event traces (OTF2), call-path profiles (CUBE4, TAU), or an online interface; profiles are browsed with CUBE or stored in TAUdb. Below, the Score-P measurement infrastructure collects hardware counters (PAPI, rusage) and attaches to the application through instrumentation wrappers for accelerator-based parallelism (CUDA, OpenCL, OpenACC), process-level parallelism (MPI, SHMEM), and thread-level parallelism (OpenMP, Pthreads), plus user and source-code instrumentation.]

SLIDE 23

CUBE

Interactive profile analysis

A CUBE profile answers linked questions: What kind of performance metric? Where is it in the source code, and in what context? How is it distributed across the processes/threads?

SLIDE 24

Vampir

Interactive trace analysis – a large imbalance is instantly visible

[Screenshot annotation: >50% of the time is wasted.]

SLIDE 25

Vampir

Performance data visualization in a complex environment

[Diagram: many-core compute nodes (batch jobs) and login nodes on the HPC side, with the trace file (OTF2) on the I/O system; a small desktop system on the user side.]

SLIDE 26

Simplest Approach

Use your desktop system

[Diagram: the trace file (OTF2) is copied from the HPC I/O system to the desktop system, where Vampir performs both visualization and analysis.]

+ Minimal setup (no installations, no batch job)
− Copying of traces to the desktop required
− Only feasible for small traces
SLIDE 27

(Re)Using the HPC Resources

Run analysis engine on compute nodes, GUI on desktop

[Diagram: VampirServer runs as the parallel analysis engine on the compute nodes, reading the trace file (OTF2) from the I/O system; the Vampir GUI on the desktop connects via a TCP socket.]

+ Best performance, low response time
− Tunneling needed to connect to the batch job (see the sketch below)
− Installation on the desktop system needed
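A typical session might look like this (a sketch; the launcher arguments, host name, and port are site-specific assumptions):

  # on the cluster: start the parallel analysis engine inside a batch job
  $ vampirserver start -n 16       # prints the host and port it listens on

  # on the desktop: forward that port through the login node,
  # then point the Vampir GUI at localhost:30000
  $ ssh -L 30000:node042:30000 user@cluster.example.org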
SLIDE 28

Vampir GUI

What do the fancy colors mean?

  • Timeline charts: Master Timeline, Summary Timeline, Process Timeline, Counter Data Timeline
  • Summary/profile charts: Function Summary, Process Summary, Communication Matrix View

SLIDE 29

Vampir GUI

Timeline Charts

  • Master Timeline – all threads’ activities over time, per thread
  • Summary Timeline – all threads’ activities over time, per activity
  • Performance Radar – all threads’ performance metric over time
  • Process Timeline – a single thread’s activities over time
  • Counter Data Timeline – a single thread’s performance metric over time

SLIDE 30

Vampir GUI

Summary/Profile Charts

  • Function Summary – runtime/invocation summaries
  • Message Summary – data-transfer statistics
  • I/O Summary – I/O statistics
  • Process Summary – clustering of similar event streams
  • Communication Matrix View – pairwise communication statistics

SLIDE 31

Vampir Performance Charts in Detail

Master Timeline

Detailed information about functions, communication, and synchronization events for a collection of processes.

SLIDE 32

Vampir Performance Charts in Detail

Summary Timeline

Fractions of the number of processes that are actively involved in given activities at a certain point in time.

SLIDE 33

Vampir Performance Charts in Detail

Process Timeline

Detailed information about different levels of function calls in a stacked bar chart for an individual process.

SLIDE 34

Vampir Performance Charts in Detail

Counter Timeline

Detailed counter information over time for an individual process.

SLIDE 35

Vampir Performance Charts in Detail

Performance Radar

Detailed counter information over time for a collection of processes.

SLIDE 36

Vampir Performance Metrics

Where do they come from?

SLIDE 37

Vampir Performance Charts in Detail

Function Summary

Overview of the accumulated information across all functions and for a collection of processes.

SLIDE 38

Vampir Performance Charts in Detail

Process Summary

Overview of the accumulated information across all functions and for every process independently. Clustering: Grouping of similar processes by using summarized function information.

SLIDE 39

Vampir Performance Charts in Detail

Communication Matrix View

SLIDE 40

Vampir at Scale

Fit to chart height (feat. 200,000+ event streams)

SLIDE 41

Comparing Traces with Vampir

SLIDE 42

Seeing the differences

SLIDE 43

Zooming in

[Screenshot annotations: one iteration of solution 1 vs. one iteration of solution 2; computation/communication overlap for solution 3.]

SLIDE 44

DEMO: Visualizing Trace Files with Vampir

SLIDE 45

Looking into DL Frameworks

SLIDE 46

Score-P Python Bindings

Tracing/profiling for all Python programs

$ export SCOREP_ENABLE_PROFILING=true
$ export SCOREP_ENABLE_TRACING=false
$ export SCOREP_EXPERIMENT_DIRECTORY=profile
$ python -m scorep --mpi <script.py>

Profiling Example

  • Not yet included in the main Score-P release
  • Available on GitHub: https://github.com/score-p/scorep_binding_python
  • Nsight/nvvp is still the better choice for single-node DL frameworks (user instrumentation)
  • Score-P is the only choice for MPI-parallel DL frameworks (see the sketch below)
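For an MPI-parallel training run, the same pattern scales out under mpirun (a sketch; train.py stands in for a hypothetical MPI-parallel training script):

  $ export SCOREP_ENABLE_PROFILING=false
  $ export SCOREP_ENABLE_TRACING=true
  $ export SCOREP_EXPERIMENT_DIRECTORY=trace_dl
  $ mpirun -np 4 python -m scorep --mpi train.py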
SLIDE 47

Vampir with Python Traces

It all looks the same as with compiled codes

SLIDE 48

  • Vampir is available at http://www.vampir.eu
  • Vampir at IU: https://kb.iu.edu/d/awbv
  • Get support via vampirsupport@zih.tu-dresden.de
  • Score-P: http://www.vi-hps.org/projects/score-p