S9347: Performance Analysis for Large Scale GPU Applications and DL Frameworks

SLIDE 1

www.hzdr.de

S9347: Performance Analysis for Large Scale GPU Applications and DL Frameworks

  • Dr. Guido Juckeland / Robert Henschel

Head of Computational Science Department / Director of Science Community Tools at Indiana University

SLIDE 2

Agenda

What to expect from the next 80 minutes

  • Motivation
  • Generating profiles and trace files with Score-P
  • Visualizing trace files with Vampir
  • Looking into Deep Learning Frameworks
SLIDE 3

Disclaimer

It’s extremely easy to waste performance

  • Poor/no GPU usage (80–90% wasted)
  • Bad MPI communication (50–90% wasted)
  • Total: 1% of peak performance (or worse)
  • Performance tools will not “automagically” make your code faster – they just point to “areas of interest”

SLIDE 4

Motivation

Performance Tuning 101

SLIDE 5

Profiling vs. Tracing

[Figure: a profile keeps statistics – number of invocations and execution time per function (main, foo, bar) – while a trace preserves the details as timelines of the same run.]

SLIDE 6

Sampling

Periodic observations of your application (Pull)

  • Running program is periodically interrupted to take measurement
  • Statistical inference of program behavior
  • Not very detailed information on highly volatile metrics
  • Requires long-running applications
  • Works with unmodified executables

[Figure: timeline of main, foo, and bar; measurements are taken at periodic sample points t1…t9.]

SLIDE 7

Instrumentation

Modify application to deliver information (Push)

  • Measurement code is inserted such that every event of interest is captured directly
  • Advantage: much more detailed information

[Figure: timeline of main, foo, and bar; a measurement event is recorded at every function entry and exit, t1…t14.]

  • Disadvantages:
    • Processing of source code / executable necessary
    • Large relative overheads for small functions
SLIDE 8

Sampling vs. Tracing

Comparing both approaches visually

[Figure: the same run of main, calculate, add, and f recorded twice – function instrumentation pushes every call into the trace buffer, while sampling records only periodic observations.]

SLIDE 9

Sampling + Instrumentation

Combining the best of both worlds

  • Long-running applications:
    • Require large buffers or heavy filtering
    • Creating a filter requires runs in advance
  • Codes with many small functions (e.g., C++):
    • Function instrumentation is a challenge

[Figure: user code (main, calculate, add, f) is sampled while MPI calls are instrumented; both kinds of events feed the same trace buffer.]

  • Score-P: Sampling+Tracing
SLIDE 10

Terms and How They Relate

Making sure we use the same words

  • Data acquisition: sampling or event-based instrumentation (either technique)
  • Data recording: summarization (profiling) or logging (tracing)
  • Data presentation: profiles (from profiling) or timelines (from tracing)

SLIDE 11

Summary

Making the “right” choices

SLIDE 12

Generating Traces and Profiles with Score-P

SLIDE 13

Overall workflow

Recording and studying performance data

  • Attach Score-P to application
  • Run with attached monitor ==> trace/profile data
  • Study trace with Vampir / profile with Cube
  • Repeat to:
  • Adapt instrumentation (“what you measure”)
  • Evaluate result of a change

[Diagram: application core + attached Score-P monitor → trace data → performance visualization.]
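As a concrete pass through this loop, a first measurement cycle might look like this (a sketch; app.c and the file names are hypothetical):

  $ scorep mpicc -O2 app.c -o app               # attach Score-P at build time
  $ export SCOREP_EXPERIMENT_DIRECTORY=profile
  $ mpirun -np 4 ./app                          # run with the attached monitor
  $ cube profile/profile.cubex                  # study the resulting profile with Cube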

SLIDE 14

Attaching Score-P

a.k.a. instrumenting your source code

Original Makefile variables:

  CC    = pgcc
  CXX   = pgCC
  F90   = pgf90
  MPICC = mpicc
  NVCC  = nvcc

Prefixed with the Score-P instrumenter:

  CC    = scorep <options> pgcc
  CXX   = scorep <options> pgCC
  F90   = scorep <options> pgf90
  MPICC = scorep <options> mpicc
  NVCC  = scorep <options> nvcc

$ scorep --help
This is the Score-P instrumentation tool. The usage is:
scorep <options> <original command>
Common options are: ...

  • --instrument-filter=<file>
    Specifies the filter file for filtering functions at compile time. It applies the same syntax as the one used by Score-P at run time (see the sketch below).
  • --user
    Enables user instrumentation.
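For illustration, a minimal filter file might look like this (a sketch using the documented Score-P filter syntax; the function names are hypothetical):

  # filter.cfg – instrument nothing except main and the compute kernels
  SCOREP_REGION_NAMES_BEGIN
    EXCLUDE *
    INCLUDE main
            compute_*
  SCOREP_REGION_NAMES_END

  $ scorep --instrument-filter=filter.cfg mpicc -O2 app.c -o app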
SLIDE 15

Attaching Score-P

Instrument once – change measurement via runtime variables

$ scorep-info config-vars --full
  SCOREP_ENABLE_PROFILING
    [...]
  SCOREP_ENABLE_TRACING
    [...]
  SCOREP_TOTAL_MEMORY
    Description: Total memory in bytes for the measurement system
    [...]
  SCOREP_EXPERIMENT_DIRECTORY
    Description: Name of the experiment directory
    [...]

$ export SCOREP_ENABLE_PROFILING=true
$ export SCOREP_ENABLE_TRACING=false
$ export SCOREP_EXPERIMENT_DIRECTORY=profile
$ mpirun <instrumented binary>

Profiling Example
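Because measurement is controlled at run time, the same binary can record a trace instead – no recompilation needed (a minimal sketch; the directory name is arbitrary):

  $ export SCOREP_ENABLE_PROFILING=false
  $ export SCOREP_ENABLE_TRACING=true
  $ export SCOREP_EXPERIMENT_DIRECTORY=trace
  $ mpirun <instrumented binary>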

SLIDE 16

Combined Sampling+Tracing

Available since Score-P 2.0

  • User code is sampled (pull)
  • Runtime libraries with tracing support use events (push):
  • MPI
  • OpenMP / OpenACC / pthreads
  • CUDA / OpenCL
  • I/O

$ export SCOREP_ENABLE_TRACING=true
$ export SCOREP_ENABLE_UNWINDING=true
$ export SCOREP_SAMPLING_EVENTS=perf_cycles@2000000

SLIDE 17

Things to look at

What can Score-P record?

Score-P runs alongside the application on the HPC system and records a performance measurement (profile/trace) from four kinds of sources:

  • User functions: C/C++/Fortran, sampling (new), custom regions, Java, Python (experimental)
  • Parallel paradigms: MPI, Pthreads, OpenMP, Xeon Phi native (new), CUDA, OpenACC/OpenCL (new), OpenSHMEM (+Cray), I/O (experimental)
  • Operating system: resource usage
  • Hardware: performance counters (PAPI), plugin counters
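As an example of the hardware source, PAPI counters are requested through an environment variable (a sketch; which counters are available depends on the machine):

  $ export SCOREP_METRIC_PAPI=PAPI_TOT_INS,PAPI_FP_OPS
  $ mpirun <instrumented binary>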

SLIDE 18

GPU Tracing

Example CUDA and OpenACC

  • Can be used in combination
  • Also supports CUPTI counters

$ export SCOREP_ENABLE_TRACING=yes
$ export SCOREP_TIMER=clock_gettime
$ export SCOREP_CUDA_ENABLE=driver,kernel,memcpy,flushatexit
$ export SCOREP_OPENACC_ENABLE=yes
$ export ACC_PROFLIB=$SCOREP_LIB/libscorep_adapter_openacc_event.so
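Putting this together for a pure CUDA code, a session might look as follows (a sketch; jacobi.cu is a hypothetical source file, and a Score-P installation built with CUDA support is assumed):

  $ scorep nvcc -O2 jacobi.cu -o jacobi        # wrap the compile to attach the monitor
  $ export SCOREP_ENABLE_TRACING=yes
  $ export SCOREP_CUDA_ENABLE=driver,kernel,memcpy
  $ export SCOREP_EXPERIMENT_DIRECTORY=trace_gpu
  $ ./jacobi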

SLIDE 19

Limitations

Why tracing is hard

  • Event tracing requires trade-offs:
    • Only add the data sources you need
    • Limit granularity (i.e., filtering)
  • Score-P’s default measurement mode is a profiling experiment – run that first to gauge what a trace would cost (see the sketch below)

[Diagram: application + attached Score-P monitor → trace data → performance visualization.]

Score-P adds overhead at run time, and that overhead must stay low for the performance analysis to be meaningful. Trace data is temporarily stored in main memory, which is limited in size.
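A sketch of that pre-check with scorep-score, assuming the default profile name inside the experiment directory:

  $ scorep-score -r profile/profile.cubex              # per-region trace buffer estimate
  $ scorep-score -f filter.cfg profile/profile.cubex   # preview the effect of a filter file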

SLIDE 20

DEMO: Generating Traces and Profiles with Score-P

SLIDE 21

Visualizing Profiles with CUBE and Traces with Vampir

SLIDE 22

Bringing it all together

Score-P + Analysis Tools

[Diagram: Score-P architecture. On top, the analysis tools Vampir, Scalasca, Periscope, and TAU consume event traces (OTF2), call-path profiles (CUBE4, TAU), or an online interface; profiles are browsed with CUBE or stored in TAUdb. Below, the Score-P measurement infrastructure collects hardware counters (PAPI, rusage) and attaches to the application through instrumentation wrappers for accelerator-based parallelism (CUDA, OpenCL, OpenACC), process-level parallelism (MPI, SHMEM), and thread-level parallelism (OpenMP, Pthreads), plus user and source-code instrumentation.]

SLIDE 23

CUBE

Interactive profile analysis

A CUBE profile answers linked questions: What kind of performance metric? Where is it in the source code, and in what context? How is it distributed across the processes/threads?

SLIDE 24

Vampir

Interactive trace analysis – a large imbalance is instantly visible

[Screenshot annotation: >50% of the time is wasted.]

SLIDE 25

Vampir

Performance data visualization in a complex environment

[Diagram: many-core compute nodes (batch jobs) and login nodes on the HPC side, with the trace file (OTF2) on the I/O system; a small desktop system on the user side.]

SLIDE 26

Simplest Approach

Use your desktop system

[Diagram: the trace file (OTF2) is copied from the HPC I/O system to the desktop system, where Vampir performs both visualization and analysis.]

+ Minimal setup (no installations, no batch job)
− Copying of traces to the desktop required
− Only feasible for small traces
SLIDE 27

(Re)Using the HPC Resources

Run analysis engine on compute nodes, GUI on desktop

[Diagram: VampirServer runs as the parallel analysis engine on the compute nodes, reading the trace file (OTF2) from the I/O system; the Vampir GUI on the desktop connects via a TCP socket.]

+ Best performance, low response time
− Tunneling needed to connect to the batch job (see the sketch below)
− Installation on the desktop system needed
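A typical session might look like this (a sketch; the launcher arguments, host name, and port are site-specific assumptions):

  # on the cluster: start the parallel analysis engine inside a batch job
  $ vampirserver start -n 16       # prints the host and port it listens on

  # on the desktop: forward that port through the login node,
  # then point the Vampir GUI at localhost:30000
  $ ssh -L 30000:node042:30000 user@cluster.example.org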
SLIDE 28

Vampir GUI

What do the fancy colors mean?

  • Timeline charts: Master Timeline, Summary Timeline, Process Timeline, Counter Data Timeline
  • Summary/profile charts: Function Summary, Process Summary, Communication Matrix View

SLIDE 29

Vampir GUI

Timeline Charts

  • Master Timeline – all threads’ activities over time, per thread
  • Summary Timeline – all threads’ activities over time, per activity
  • Performance Radar – all threads’ performance metric over time
  • Process Timeline – a single thread’s activities over time
  • Counter Data Timeline – a single thread’s performance metric over time

SLIDE 30

Vampir GUI

Summary/Profile Charts

  • Function Summary – runtime/invocation summaries
  • Message Summary – data-transfer statistics
  • I/O Summary – I/O statistics
  • Process Summary – clustering of similar event streams
  • Communication Matrix View – pairwise communication statistics

SLIDE 31

Vampir Performance Charts in Detail

Master Timeline

Detailed information about functions, communication, and synchronization events for a collection of processes.

SLIDE 32

Vampir Performance Charts in Detail

Summary Timeline

Fractions of the number of processes that are actively involved in given activities at a certain point in time.

SLIDE 33

Vampir Performance Charts in Detail

Process Timeline

Detailed information about different levels of function calls in a stacked bar chart for an individual process.

SLIDE 34

Vampir Performance Charts in Detail

Counter Timeline

Detailed counter information over time for an individual process.

SLIDE 35

Vampir Performance Charts in Detail

Performance Radar

Detailed counter information over time for a collection of processes.

SLIDE 36

Vampir Performance Metrics

Where do they come from?

SLIDE 37

Vampir Performance Charts in Detail

Function Summary

Overview of the accumulated information across all functions and for a collection of processes.

SLIDE 38

Vampir Performance Charts in Detail

Process Summary

Overview of the accumulated information across all functions and for every process independently. Clustering: Grouping of similar processes by using summarized function information.

SLIDE 39

Vampir Performance Charts in Detail

Communication Matrix View

SLIDE 40

Vampir at Scale

Fit to chart height (feat. 200,000+ event streams)

SLIDE 41

Comparing Traces with Vampir

SLIDE 42

Seeing the differences

SLIDE 43

Zooming in

[Screenshot annotations: one iteration of solution 1 vs. one iteration of solution 2; computation/communication overlap for solution 3.]

SLIDE 44

DEMO: Visualizing Trace Files with Vampir

SLIDE 45

Looking into DL Frameworks

SLIDE 46

Score-P Python Bindings

Tracing/profiling for all Python programs

$ export SCOREP_ENABLE_PROFILING=true
$ export SCOREP_ENABLE_TRACING=false
$ export SCOREP_EXPERIMENT_DIRECTORY=profile
$ python -m scorep --mpi <script.py>

Profiling Example

  • Not yet included in the main Score-P release
  • Available on GitHub: https://github.com/score-p/scorep_binding_python
  • Nsight/nvvp is still the better choice for single-node DL frameworks (user instrumentation)
  • Score-P is the only choice for MPI-parallel DL frameworks (see the sketch below)
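For an MPI-parallel training run, the same pattern scales out under mpirun (a sketch; train.py stands in for a hypothetical MPI-parallel training script):

  $ export SCOREP_ENABLE_PROFILING=false
  $ export SCOREP_ENABLE_TRACING=true
  $ export SCOREP_EXPERIMENT_DIRECTORY=trace_dl
  $ mpirun -np 4 python -m scorep --mpi train.py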
SLIDE 47

Vampir with Python Traces

It all looks the same as with compiled codes

SLIDE 48

  • Vampir is available at http://www.vampir.eu
  • Vampir at IU: https://kb.iu.edu/d/awbv
  • Get support via vampirsupport@zih.tu-dresden.de
  • Score-P: http://www.vi-hps.org/projects/score-p