
S9347: Performance Analysis for Large Scale GPU Applications and DL Frameworks - PowerPoint PPT Presentation



  1. S9347: Performance Analysis for Large Scale GPU Applications and DL Frameworks. Dr. Guido Juckeland, Head of the Computational Science Dept., HZDR (www.hzdr.de) / Robert Henschel, Director of Science Community Tools at Indiana University

  2. Agenda: What to expect from the next 80 minutes
     • Motivation
     • Generating profiles and trace files with Score-P
     • Visualizing trace files with Vampir
     • Looking into Deep Learning frameworks

  3. Disclaimer: It's extremely easy to waste performance
     • Poor/no GPU usage (80-90%)
     • Bad MPI (50-90%)
     • Total: 1% of peak (or worse)
     • Performance tools will not "automagically" make your code faster; they just point to "areas of interest"

  4. Motivation: Performance Tuning 101

  5. Profiling vs. Tracing: Preserving the details. [Figure: a profile summarizes per-function statistics (number of invocations, execution time for main, foo, bar), while a timeline preserves every call of main, foo and bar in order over time.]

  6. Sampling: Periodic observations of your application (pull). [Figure: measurements are taken at regular sample points t1 ... t9 along the timeline of main, foo and bar.]
     • The running program is periodically interrupted to take a measurement
     • Statistical inference of program behavior
     • Not very detailed information on highly volatile metrics
     • Requires long-running applications
     • Works with unmodified executables

  7. Instrumentation: Modify the application to deliver information (push). [Figure: a measurement event t1 ... t14 is recorded at every entry and exit of main, foo and bar.]
     • Measurement code is inserted such that every event of interest is captured directly
     • Advantage: much more detailed information
     • Disadvantages: processing of the source code / executable is necessary; large relative overheads for small functions

  8. Sampling vs. Tracing: Comparing both approaches visually. [Figure: the same run of main, calculate and add recorded into a trace buffer twice; function instrumentation captures every entry and exit of every small function f, while sampling only records the call stack seen at each sample point.]

  9. Sampling + Instrumentation: Combining the best of both worlds. [Figure: user functions (main, calculate, add, f) are sampled while the MPI calls are captured by instrumentation, and both feed the same trace buffer.]
     • Long-running applications require large buffers or heavy filtering, and creating a filter requires runs in advance
     • Codes with many small functions (e.g. C++) make function instrumentation a challenge
     • Score-P: sampling + tracing

  10. Terms and How They Relate: Making sure we use the same words

      Analysis layer      | Profiling (technique) | Tracing (technique)
      --------------------+-----------------------+-----------------------------
      Data presentation   | Profiles              | Timelines
      Data recording      | Summarization         | Logging
      Data acquisition    | Sampling              | Event-based instrumentation

      (Either acquisition technique can feed either recording approach.)

  11. Summary: Making the "right" choices

  12. Generating Traces and Profiles with Score-P

  13. Overall workflow: Recording and studying performance data. [Figure: the application runs with Score-P attached and writes performance data (a trace or profile), which is then loaded into a visualization tool.]
     • Attach Score-P to the application
     • Run with the attached monitor ==> trace/profile data
     • Study the trace with Vampir / the profile with Cube
     • Repeat to adapt the instrumentation ("what you measure") and to evaluate the result of a change (a command-line sketch of this loop follows below)
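
     A minimal end-to-end sketch of that loop. The source file name (app.c) and the experiment directory names are placeholders; the commands assume an MPI code and Score-P's default output file names (profile.cubex, traces.otf2):

        $ scorep mpicc -O2 -o app app.c                # build with the Score-P wrapper

        $ export SCOREP_ENABLE_PROFILING=true          # first run: profile only
        $ export SCOREP_ENABLE_TRACING=false
        $ export SCOREP_EXPERIMENT_DIRECTORY=profile
        $ mpirun -np 4 ./app
        $ scorep-score -r profile/profile.cubex        # hot regions, estimated trace size

        $ export SCOREP_ENABLE_PROFILING=false         # second run: record a trace
        $ export SCOREP_ENABLE_TRACING=true
        $ export SCOREP_EXPERIMENT_DIRECTORY=trace
        $ mpirun -np 4 ./app
        $ vampir trace/traces.otf2                     # study the timeline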

  14. Attaching Score-P, a.k.a. instrumenting your source code. Prefix the compiler and linker commands in the build system with the scorep wrapper:

        CC    = pgcc    -->  CC    = scorep <options> pgcc
        CXX   = pgCC    -->  CXX   = scorep <options> pgCC
        F90   = pgf90   -->  F90   = scorep <options> pgf90
        MPICC = mpicc   -->  MPICC = scorep <options> mpicc
        NVCC  = nvcc    -->  NVCC  = scorep <options> nvcc

        $ scorep --help
        This is the Score-P instrumentation tool. The usage is:
        scorep <options> <original command>
        Common options are:
        ...
        --instrument-filter=<file>  Specifies the filter file for filtering functions during
                                    compile-time. It applies the same syntax as the one used
                                    by Score-P during run-time.
        --user                      Enables user instrumentation.
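
     If the project's Makefile allows variables to be overridden on the command line (an assumption about the build system, not something stated in the talk), the same prefixing can be done per build without editing any files:

        $ make CC="scorep pgcc" CXX="scorep pgCC" F90="scorep pgf90" \
               MPICC="scorep mpicc" NVCC="scorep nvcc"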

  15. Attaching Score-P: Instrument once, change the measurement via runtime variables.

        $ scorep-info config-vars --full
        SCOREP_ENABLE_PROFILING [...]
        SCOREP_ENABLE_TRACING [...]
        SCOREP_TOTAL_MEMORY
          Description: Total memory in bytes for the measurement system [...]
        SCOREP_EXPERIMENT_DIRECTORY
          Description: Name of the experiment directory [...]

        Profiling example:
        $ export SCOREP_ENABLE_PROFILING=true
        $ export SCOREP_ENABLE_TRACING=false
        $ export SCOREP_EXPERIMENT_DIRECTORY=profile
        $ mpirun <instrumented binary>
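
     The tracing counterpart is symmetric; a sketch, assuming the same instrumented binary (the buffer size below is an illustrative value, not a recommendation from the talk):

        $ export SCOREP_ENABLE_PROFILING=false
        $ export SCOREP_ENABLE_TRACING=true
        $ export SCOREP_TOTAL_MEMORY=256M              # per-process event buffer (assumed size)
        $ export SCOREP_EXPERIMENT_DIRECTORY=trace
        $ mpirun <instrumented binary>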

  16. Combined Sampling + Tracing: Available since Score-P 2.0.

        $ export SCOREP_ENABLE_TRACING=true
        $ export SCOREP_ENABLE_UNWINDING=true
        $ export SCOREP_SAMPLING_EVENTS=perf_cycles@2000000

     • User code is sampled (pull)
     • Runtime libraries with tracing support use events (push): MPI, OpenMP / OpenACC / Pthreads, CUDA / OpenCL, I/O

  17. Things to look at: What can Score-P record? The application runs on the HPC system; Score-P sits between the application and the operating system and delivers the results as a performance measurement (profile/trace). It can record:
     • User functions: C/C++/Fortran, sampling *NEW*, custom regions, Java, Python (*experimental*)
     • Parallel paradigms: MPI, Pthreads, OpenMP, CUDA, OpenACC/OpenCL *NEW*, OpenSHMEM (+Cray), Xeon Phi native *NEW*
     • Hardware: performance counters (PAPI), plugin counters, resource usage, I/O (*experimental*)
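
     Hardware counters are likewise requested purely through the environment; a sketch, assuming PAPI is available and that these two preset counters exist on the machine:

        $ export SCOREP_METRIC_PAPI=PAPI_TOT_INS,PAPI_L2_DCM   # recorded alongside every region
        $ mpirun <instrumented binary>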

  18. GPU Tracing Example: CUDA and OpenACC

        $ export SCOREP_ENABLE_TRACING=yes
        $ export SCOREP_TIMER=clock_gettime
        $ export SCOREP_CUDA_ENABLE=driver,kernel,memcpy,flushatexit
        $ export SCOREP_OPENACC_ENABLE=yes
        $ export ACC_PROFLIB=$SCOREP_LIB/libscorep_adapter_openacc_event.so

     • The CUDA and OpenACC adapters can be used in combination
     • CUPTI counters are also supported
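
     Putting the GPU settings together for a single-source CUDA code; a sketch with a placeholder file name (vecadd.cu) and an assumed CUPTI activity buffer size:

        $ scorep --cuda nvcc -O3 -o vecadd vecadd.cu
        $ export SCOREP_ENABLE_TRACING=yes
        $ export SCOREP_CUDA_ENABLE=driver,kernel,memcpy,flushatexit
        $ export SCOREP_CUDA_BUFFER=48M                # device activity buffer (assumed size)
        $ ./vecadd
        $ vampir scorep-*/traces.otf2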

  19. Limitations: Why tracing is hard. [Figure: the event trace is buffered in main memory while the application runs; the monitor adds overhead at runtime and the buffer size is limited, so the overhead must be low for a meaningful performance analysis.]
     • Event tracing requires trade-offs: only add the data sources you need, and limit the granularity (i.e., filtering; see the filter sketch below)
     • Score-P's default is a profiling experiment
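
     A minimal filter sketch (the region names below are placeholders, not from the talk); the filter can be previewed against an existing profile before re-running:

        $ cat scorep.filter
        SCOREP_REGION_NAMES_BEGIN
          EXCLUDE *
          INCLUDE main
          INCLUDE calculate*
        SCOREP_REGION_NAMES_END
        $ scorep-score -r -f scorep.filter profile/profile.cubex   # estimate the effect
        $ export SCOREP_FILTERING_FILE=scorep.filter               # apply at run time
        $ mpirun <instrumented binary>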

  20. DEMO: Generating Traces and Profiles with Score-P

  21. Visualizing Profiles with CUBE and Traces with Vampir

  22. Bringing it all together: Score-P + analysis tools. [Figure: the Score-P measurement infrastructure wraps the application via user instrumentation, source-code instrumentation, process-level parallelism (MPI, SHMEM), thread-level parallelism (OpenMP, Pthreads) and accelerator-based parallelism (CUDA, OpenCL, OpenACC), and reads hardware counters (PAPI, rusage). It produces call-path profiles (CUBE4, TAU) for CUBE and TAUdb, event traces (OTF2) for Vampir and Scalasca, and an online interface for Periscope.]

  23. CUBE: Interactive profile analysis. [Screenshot: the three CUBE panes answer, from left to right: What kind of performance metric? Where is it in the source code, and in what context? How is it distributed across the processes/threads?]
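
     Opening a profile in the GUI is a single command; a sketch assuming CUBE is installed on the workstation and the experiment directory from the profiling run above:

        $ cube profile/profile.cubex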

  24. Vampir: Interactive trace analysis. [Screenshot: a large imbalance is instantly visible in the timeline; more than 50% of the time is wasted.]

  25. Vampir: Performance data visualization in a complex environment. [Figure: an HPC environment consisting of an I/O system, many-core compute nodes running batch jobs, login nodes and a desktop system; the compute nodes write one trace file (OTF2) that has to be analyzed somewhere in this environment.]

  26. Simplest Approach: Use your desktop system. [Figure: the trace file (OTF2) is copied from the HPC system to the desktop, where Vampir does the visualization and analysis; a command sketch follows below.]
     + Minimal setup (no installations, no batch job)
     - Traces must be copied to the desktop
     - Only works for small traces
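
     A sketch of that desktop workflow, with placeholder host and path names and assuming Vampir is installed locally:

        $ scp -r cluster:/path/to/run/trace ./          # copy the OTF2 experiment directory
        $ vampir trace/traces.otf2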
