SLIDE 1

XSP: Across-Stack Profiling and Analysis of Machine Learning Models on GPUs

Cheng Li*1, Abdul Dakkak*1, Jinjun Xiong2, Wei Wei3, Lingjie Xu3, Wen-mei Hwu1

University of Illinois Urbana-Champaign1, IBM Research2, Alibaba Group3

{cli99, dakkak, w-hwu}@illinois.edu, jinjun@us.ibm.com, {w.wei, lingjie.xu}@alibaba-inc.com

Video: https://youtu.be/v95JfmM66eE

SLIDE 2

Background

§ Machine Learning (ML) models are used in many application domains
§ Understanding ML inference performance is an increasingly pressing but challenging task

Slow adoption of DL innovations

SLIDE 3

ML Model

Example: ResNet50

§ A graph where each vertex is a layer (or operator) and an edge represents data transfer
§ Layer types include Convolution, BatchNorm (BN), Relu, Padding, Pooling, Fully Connected, and Softmax

[Figure: the ResNet50 layer graph, organized into Modules 1-8; per-layer tensor dimensions omitted]

SLIDE 4

ML Inference Pipeline

Pre-processing (Input Image → Input Tensor):
  • Image decoding
  • Resizing
  • Normalization
  • Type conversion

Prediction (Input Tensor → Output Tensor): model prediction using the framework API

Post-processing (Output Tensor → predictions): unpacking into (label, probability) pairs and sorting, e.g. Top1 = (dog, 0.99)
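The three pipeline stages above can be sketched as plain functions; a minimal illustration in which the "model" is a stub callable standing in for a framework predict API, and all names are hypothetical:

```python
# Minimal sketch of the three-stage inference pipeline.

def pre_process(image):
    # decode / resize / normalize / convert; here reduced to scaling to [0, 1]
    return [pixel / 255.0 for pixel in image]

def predict(tensor, model):
    # model prediction using the framework API (stubbed as a callable)
    return model(tensor)

def post_process(output, labels):
    # unpack into (label, probability) pairs and sort by probability
    return sorted(zip(labels, output), key=lambda p: p[1], reverse=True)

# Usage: a toy "model" that returns fixed class probabilities.
toy_model = lambda tensor: [0.99, 0.01]
tensor = pre_process([128, 255, 0])
pairs = post_process(predict(tensor, toy_model), ["dog", "cat"])
top1 = pairs[0]  # the Top1 prediction, as on the slide
```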

SLIDE 5

XSP Motivation

§ A holistic view of the model execution is needed
§ Existing profiling tools are disjoint
  – Profiling at different granularities means switching between tools
  – No correlation between profiles

[Figure: disjoint profiles at the levels of the HW/SW stack: model level (pre-process, predict, post-process), framework level (Conv, Bias, Relu, Concat, FC, Data, Malloc, Free layers), and system level (cudaMalloc, cuDNN kernels, SP flop count, DRAM reads/writes, cudaFree)]

SLIDE 6

XSP Motivation

§ Inference is impacted by the interplay between levels of the HW/SW stack
§ Any of them can be a bottleneck

[Figure: the same model/framework/system view of the HW/SW stack as the previous slide]

SLIDE 7

Current DL Profiling on GPUs

§ Using code insertion
§ Using the framework profiler
§ Using nvprof or Nsight

One has to manually perform the difficult task of correlating these disjoint profiles.

[Figure: model- (1), layer- (2), and GPU kernel-level (3) profiles of MLPerf ResNet50 v1.5 with batch size 256 on a Volta GPU. Example kernels: ShuffleTensor, OffsetComp, VoltaCUDNN_128x64. Example GPU metrics: SP flop count = 62 GFlop, DRAM read bytes = 12.1 MB, DRAM write bytes = 296 MB, achieved occupancy = 13.2%]

SLIDE 8

An Approach - Modifying Frameworks

§ NGC frameworks (TensorFlow, PyTorch, etc.) are instrumented with NVTX markers
  – GPU profile with layer annotations, but lacks framework profiling
  – May inhibit frameworks from performing some optimizations
  – Does not work for DL models that use customized frameworks
§ TensorFlow profiler
  – Framework profile with some GPU profiling
  – Does not work for other frameworks
§ Vendor lock-in & limited applicability

SLIDE 9

XSP: Across-stack Profiling

§ Incorporates profile data from different sources to obtain a holistic and hierarchical view of DL workloads
  – Innovatively leverages distributed tracing
§ Accurately captures the profiles at each HW/SW stack level despite the profiling overhead
  – Leveled experimentation methodology
§ Coupled with an automated analysis pipeline
§ Reveals insights that would otherwise be difficult to discern

SLIDE 10

Distributed Tracing

§ Designed to monitor distributed applications (e.g. microservices)
§ Key concepts
  – Span: a named, timed operation representing a piece of the workflow
    • Start & end timestamps
    • Tags & Logs: key-value pairs of user-defined annotations or logging messages for spans
    • SpanContext: a state that refers to a distinct span
  – Trace: a tree of spans
  – Tracer: an object that creates and publishes spans
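These concepts can be sketched in a few lines of Python; a minimal illustration in which the class and method names are ours, not a real tracing API, and the "server" is just a local list:

```python
import time
from dataclasses import dataclass, field

@dataclass
class Span:
    # a named, timed operation with user-defined tags
    name: str
    start: float = 0.0
    end: float = 0.0
    tags: dict = field(default_factory=dict)

class Tracer:
    """Creates spans and publishes finished ones (here: into a local list)."""
    def __init__(self):
        self.published = []  # stand-in for a tracing server

    def start_span(self, name, **tags):
        return Span(name=name, start=time.time(), tags=tags)

    def finish_span(self, span):
        span.end = time.time()
        self.published.append(span)  # the trace (a tree of spans) is rebuilt server-side

# Usage: wrap a profiled code section in a span.
tracer = Tracer()
span = tracer.start_span("pre_process", batch_size=256)
# ... the profiled code section runs here ...
tracer.finish_span(span)
```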

SLIDE 11

An Example

[Figure: an application with services (A, B, C, D, E, F) that have causal relationships. Each host's tracer publishes spans, with their contexts, to a tracing server; the server assembles the spans from all hosts onto one application timeline]

SLIDE 12

Leveraging Distributed Tracing in XSP

§ Observe the similarity between profiling and distributed tracing
§ Turn profilers into tracers
§ Convert profiled events into spans
§ Multiple tracers can exist within a stack level
§ Tracers can be enabled/disabled

[Figure: XSP design; tracers 0 through M publish events from every level of the HW/SW stack, from Level 0 (user code) down to Level N]

SLIDE 13

Constructing Parent/Child Relationships

§ Tracers use the system clock
§ Spans are time intervals and are assigned levels
§ During profile analysis, check interval inclusion
  – If interval s1 contains interval s2 and s1 is a level higher than s2, then s1 is a parent of s2

[Figure: spans on a shared timeline; time interval inclusion determines nesting]
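The inclusion rule above can be sketched as follows; spans are modeled as (level, start, end) records, and a span's parent is the tightest enclosing span one level above it. This is a simplified illustration, not XSP's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class Span:
    name: str
    level: int   # 0 = user code, larger = deeper in the stack
    start: float
    end: float

def find_parent(span, spans):
    """Parent = tightest span one level above that contains this span's interval."""
    candidates = [s for s in spans
                  if s.level == span.level - 1
                  and s.start <= span.start and span.end <= s.end]
    # pick the tightest enclosing interval if several qualify
    return min(candidates, key=lambda s: s.end - s.start, default=None)

# Usage: a model-level span encloses a layer span, which encloses a kernel span.
spans = [
    Span("predict", 0, 0.0, 10.0),
    Span("conv_layer", 1, 1.0, 5.0),
    Span("conv_kernel", 2, 1.5, 4.5),
]
parent = find_parent(spans[2], spans)  # the conv layer encloses the kernel
```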

SLIDE 14

Capturing Asynchronous Events

§ E.g. asynchronous GPU kernel launches
§ Capture both the kernel launch and execution spans
  – Use the kernel launch span to figure out the parent span
  – Use the kernel execution span to get performance information or to figure out its children spans

[Figure: a cudaLaunchKernel span on the CPU timeline, followed later by the Conv kernel execution span on the GPU timeline]
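One way to realize this pairing is via the correlation IDs that CUPTI attaches to both the runtime API activity record and the kernel activity record; a simplified sketch in which the data classes and field names are illustrative:

```python
from dataclasses import dataclass

@dataclass
class LaunchSpan:          # recorded on the CPU timeline (e.g. cudaLaunchKernel)
    correlation_id: int
    start: float
    end: float

@dataclass
class KernelSpan:          # recorded on the GPU timeline, possibly much later
    correlation_id: int
    name: str
    start: float
    end: float

def attach_kernels(launches, kernels):
    """Pair each asynchronous kernel execution with its launch call.
    The launch span determines where the kernel hangs in the span tree;
    the execution span carries the timing/performance information."""
    by_id = {l.correlation_id: l for l in launches}
    return [(by_id[k.correlation_id], k) for k in kernels
            if k.correlation_id in by_id]

# Usage: the launch finishes at t=1.01, the kernel runs from t=1.50 to t=4.50.
launches = [LaunchSpan(7, 1.00, 1.01)]
kernels = [KernelSpan(7, "conv_kernel", 1.50, 4.50)]
pairs = attach_kernels(launches, kernels)
```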

SLIDE 15

Capturing Parallel Events

§ E.g. two conv layers overlap, and each invokes GPU kernels
§ Serialize the conv layers to get their correlations to GPU kernels
§ Or use more complex post-processing

[Figure: conv1 and conv2 overlap on the model timeline, each launching its own kernels]

SLIDE 16

XSP for ML Inference on GPUs

Global Tracer: the user inserts tracing APIs (startSpan & finishSpan) to capture code sections

No change to DL frameworks or libraries

Framework Tracer: built on top of the framework profiling capability to capture layer-level information

GPU Tracer: built on top of CUPTI to capture CUDA runtime API calls, GPU activities, and GPU metrics

[Figure: model- (1), layer- (2), and GPU kernel-level (3) profiles of MLPerf ResNet50 v1.5 with batch size 256 on a Volta GPU, as on Slide 7]

SLIDE 17

Dealing with Profiling Overhead

§ Profiling always comes with overhead
§ XSP uses leveled experimentation to get accurate timing for all levels

[Figure: the same run profiled at increasing detail (M: model-level profiling, L: layer-level profiling, G: GPU kernel-level profiling).
M only: Model Prediction = 275.1 ms.
M/L: Model Prediction = 275.1 ms plus 157 ms profiling overhead; layer timings e.g. BN 4.2 ms, Data 1.2 ms, SoftMax 0.1 ms, Relu 2.1 ms, Conv 5.1 ms.
M/L/G: additional profiling overheads of 215.2 ms and 0.24 ms; kernel timings under Conv e.g. ShuffleTensor 0.1 ms, OffsetComp ~0 ms, VoltaCUDNN_128x64 4.9 ms]

SLIDE 18

Leveled Experimentation

§ Profilers at level n accurately capture events at level n
§ Use traces from runs with different profiling levels enabled
  – Overheadn = Profile0/…/n - Profile0/…/n-1
  – E.g. the layer-level overhead is ProfileM/L - ProfileM, and the GPU kernel-level overhead is ProfileM/L/G - ProfileM/L

[Figure: the same M, M/L, and M/L/G timing breakdown as the previous slide]
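The subtraction above can be checked numerically against the timings shown on the slide; a simplified illustration of the leveled-experimentation bookkeeping:

```python
# End-to-end "Model Prediction" time observed under each profiling
# configuration, in milliseconds (numbers from the slide).
profile = {
    "M": 275.1,            # model-level profiling only
    "M/L": 275.1 + 157.0,  # model + layer-level profiling enabled
}

# Overhead_n = Profile_{0/../n} - Profile_{0/../n-1}: the overhead that
# enabling level n's profiler introduces is the difference between the run
# with levels 0..n enabled and the run with levels 0..n-1 enabled.
layer_overhead = profile["M/L"] - profile["M"]
```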

SLIDE 19

Automated Across-stack Analysis

The 15 analyses performed by XSP use profiles from one or more levels (M: model-level profiling, L: framework-level profiling, G: GPU-level profiling)

SLIDE 20

Example Analysis

https://ipdps20.netlify.com/tensorflow/mlperf_resnet50_v1.5/

[Figure: example analyses. A13: the top 5 most time-consuming GPU kernel invocations. A14: layer roofline analysis. A8: GPU vs non-GPU normalized latency per layer index]

SLIDE 21

XSP Extensibility

§ Other profiling tools or methods can be integrated
  – More tracers at each stack level, e.g. CPU+GPU
  – Capture more stack levels, e.g. the ML library level and the application level
  – Work with accelerators and simulators
§ Add more types of analyses
§ Add ML training support

SLIDE 22

Conclusion

§ XSP is an across-stack profiling design that aggregates profile data from different sources and correlates them to construct a holistic and hierarchical view of ML model execution
  – A smooth hierarchical step-through of model performance at different levels within the HW/SW stack to identify bottlenecks
  – Systematic comparisons of models, frameworks, and hardware through consistent profiling and automated analysis workflows
  – Extensible to accommodate different use cases

SLIDE 23

Thank you

Cheng Li*1, Abdul Dakkak*1, Jinjun Xiong2, Wei Wei3, Lingjie Xu3, Wen-mei Hwu1
University of Illinois Urbana-Champaign1, IBM Research2, Alibaba Group3

More information in the paper