

SLIDE 1

End to End Optimization Stack for Deep Learning

Presenter: Tianqi Chen, Paul G. Allen School of Computer Science & Engineering, University of Washington

SLIDE 2

Collaborators

Tianqi Chen, Thierry Moreau, Haichen Shen, Ziheng Jiang, Carlos Guestrin, Luis Ceze, Arvind Krishnamurthy
(ML, software stack, hardware stack, GPU, ARM, NNVM pipeline)

University of Washington and AWS AI Team

and many more contributors in the DMLC community

SLIDE 3

Deep Learning System Research is Exciting but Hard

Diagram: Frameworks (e.g. CNTK) → Computational graph → Operator libraries (cuDNN, NNPack, MKL-DNN) → Hardware


SLIDE 7

Deep Learning System Research is Exciting but Hard

Diagram: Frameworks (e.g. CNTK) → Computational graph → Operator libraries (cuDNN, NNPack, MKL-DNN) → Hardware

Built a new accelerator? You need the entire software stack on top of it: layout transformation, quantization, operator kernel optimization, benchmarking, and more.

SLIDE 13

Deep Learning System Research is Exciting but Hard

Diagram: Frameworks (e.g. CNTK) → Computational graph → Operator libraries (cuDNN, NNPack, MKL-DNN) → Hardware → Serving

Data layout optimization and operator fusion multiply the set of operator variants: you need an optimized hardware kernel for each variant, on each hardware target!

SLIDE 14

The End to End System Challenge

Diagram: many frameworks on one side, many hardware back-ends on the other.

SLIDE 21

The End to End System Challenge

Diagram: many frameworks on one side, many hardware back-ends on the other, connected through a shared intermediate representation.

SLIDE 22

Computational Graph IR and Remaining Gap

Computational graph IR between frameworks and back-ends, with graph-level passes: auto differentiation, memory planning, operator fusion.

Examples: NGraph, XLA, NNVM, DLVM, …
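Operator fusion, one of the graph-level passes named above, can be illustrated with a small sketch. This is hypothetical plain Python, not the NNVM API: a chain of elementwise operators is composed into one kernel, so intermediate results are never materialized as full tensors.

```python
# Hypothetical sketch of operator fusion (not the NNVM API):
# compose a chain of elementwise ops into a single kernel.

def make_fused(ops):
    """Return one kernel that applies every op per element in turn."""
    def fused(xs):
        out = []
        for x in xs:
            for op in ops:       # all ops run on one element before moving on
                x = op(x)
            out.append(x)
        return out
    return fused

# Unfused, scale -> add bias -> relu would materialize two temporaries.
scale    = lambda x: x * 2.0
add_bias = lambda x: x + 1.0
relu     = lambda x: max(x, 0.0)

kernel = make_fused([scale, add_bias, relu])
print(kernel([-3.0, 0.5]))   # [0.0, 2.0]
```

The win on real hardware is memory traffic: the fused loop reads and writes each element once instead of once per operator.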

SLIDE 24

Computational Graph IR and Remaining Gap

Computational graph IR between frameworks and back-ends, with graph-level passes: auto differentiation, memory planning, operator fusion.

The remaining gap: too many possible choices (precision, layout, fused pattern, device, threading, …). A low-level IR is needed to express them explicitly.

SLIDE 25

TVM: Low Level IR

Diagram: Framework → NNVM Graph (auto differentiation, memory planning) → TVM → Hardware back-ends
  • Concise and compact description
  • Explicit control on codegen
  • Ease of deployment
  • Support new hardware backends
SLIDE 26

Tensor Index Expression Declaration

import tvm
m, n, h = tvm.var('m'), tvm.var('n'), tvm.var('h')

# Inputs
A = tvm.placeholder((m, h), name='A')
B = tvm.placeholder((n, h), name='B')
k = tvm.reduce_axis((0, h), name='k')

# Computation rule: C = dot(A, B.T), with shape of C = (m, n)
C = tvm.compute((m, n), lambda i, j: tvm.sum(A[i, k] * B[j, k], axis=k))
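For readers new to the notation, the declaration can be read in plain Python (an illustration of the semantics only, with nested lists standing in for tensors, not TVM code): C[i, j] is the sum over k of A[i, k] * B[j, k], i.e. C = dot(A, B.T).

```python
# Plain-Python reading of the tensor index expression:
# C[i, j] = sum_k A[i, k] * B[j, k], i.e. C = dot(A, B.T).

def compute_C(A, B):
    m, h = len(A), len(A[0])   # A has shape (m, h)
    n = len(B)                 # B has shape (n, h)
    return [[sum(A[i][k] * B[j][k] for k in range(h))
             for j in range(n)]
            for i in range(m)]

A = [[1, 2], [3, 4]]   # (m=2, h=2)
B = [[5, 6], [7, 8]]   # (n=2, h=2)
print(compute_C(A, B)) # [[17, 23], [39, 53]]
```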

SLIDE 31

Challenge: Hardware Diversities

The same IR must target CPUs, GPUs, and accelerators.

Diagram: CPU cache hierarchy (L1I/L1D, L2, L3), GPU streaming multiprocessors with register files and TX/L1, accelerator with unified buffer and accumulator FIFO.

Memory subsystem: implicitly managed (CPU) / mixed (GPU) / explicitly managed (accelerators)
Compute primitives: scalar (CPU) / vector (GPU) / tensor (accelerators)
Data type: fp32 / fp16 / int8

SLIDE 38

Unified Schedule Optimizations for Hardware

Scheduling optimizations:

  • Data layout
  • Tiling
  • Thread cooperation
  • Latency hiding
  • Tensorization

Pipeline: Algorithm described in IR → Scheduling optimization → Lowering → Generated code (LLVM, CUDA, OpenCL, …)
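Tiling, one of the scheduling optimizations listed, can be sketched in plain Python (illustrative only, not TVM schedule primitives). The algorithm stays fixed while the loop structure changes, which is the algorithm/schedule separation the IR makes explicit:

```python
# Tiled matrix multiply, C = A @ B: outer loops walk cache-sized blocks,
# inner loops stay inside one block. Same result as the naive triple loop,
# different schedule.

def matmul_tiled(A, B, tile=2):
    m, h, n = len(A), len(A[0]), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile):                  # tile loops
        for j0 in range(0, n, tile):
            for k0 in range(0, h, tile):
                for i in range(i0, min(i0 + tile, m)):      # intra-tile loops
                    for j in range(j0, min(j0 + tile, n)):
                        for k in range(k0, min(k0 + tile, h)):
                            C[i][j] += A[i][k] * B[k][j]
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul_tiled(A, B))   # [[19.0, 22.0], [43.0, 50.0]]
```

On real hardware the tile size is chosen so a block of A, B, and C fits in cache or shared memory; TVM exposes that choice as a tunable schedule parameter.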

SLIDE 39

Separation of Compilation and Deployment

Compilation stack (framework frontends → NNVM → TVM): performs the heavy optimizations.
TVM runtime (TVM graph module): lightweight, 300 to 600 KB, what actually gets deployed to devices.
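The split can be illustrated with a toy sketch in plain Python (hypothetical, not the TVM runtime API): a "compiler" flattens a graph into an instruction list offline, and the "runtime" a device ships only needs to execute that list, with no graph logic on board.

```python
# Toy compile/deploy split (not the TVM API): heavy work happens once at
# compile time; the deployed runtime is just a loop over instructions.

def compile_graph(graph):
    """'Heavy' stage: topologically order nodes into a flat program."""
    order, seen = [], set()
    def visit(name):
        if name in seen:
            return
        seen.add(name)
        for dep in graph[name][1]:
            visit(dep)
        order.append(name)
    for name in graph:
        visit(name)
    return [(name, graph[name][0], graph[name][1]) for name in order]

def run(program, inputs):
    """'Lightweight' stage: execute instructions, no graph logic."""
    env = dict(inputs)
    for name, op, deps in program:
        if op is not None:
            env[name] = op(*[env[d] for d in deps])
    return env

# Toy graph: y = relu(x * w)
graph = {
    "x":    (None, []),
    "w":    (None, []),
    "mul":  (lambda a, b: a * b, ["x", "w"]),
    "relu": (lambda a: max(a, 0.0), ["mul"]),
}
program = compile_graph(graph)
print(run(program, {"x": -2.0, "w": 3.0})["relu"])   # 0.0
```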

SLIDE 40

Remote Execution and Profiling

A server running the TVM compiler drives devices running the TVM runtime over TVM RPC.

SLIDE 41

Performance Portable Against the State of the Art

NVIDIA K80 (baseline: MXNet with cuDNN, auto-tune enabled; one grad-student month of tuning): the NNVM compiler is 1.2x faster than MXNet on both ResNet18 and MobileNet (time cost in ms).

Raspberry Pi 3 (baseline: MXNet with OpenBLAS and NNPack; two undergrad weeks of tuning): 2.2x faster on ResNet18 and 11.5x faster on MobileNet (time cost in ms).

Credit: Leyuan Wang (AWS/UC Davis), Yuwei Hu (TuSimple), Zheng Jiang (AWS/FDU)

SLIDE 42

Coming Soon: Target New Accelerators

Tensorization and latency hiding, plus an FPGA example for building a new hardware back-end. Open-source soon.

SLIDE 43

NNVM Compiler: Open Compiler for AI Systems

Frontends: MXNet, Keras, CoreML, PyTorch, Caffe2, CNTK, ONNX, Caffe (a mix of supported, work-in-progress, and externally supported)

NNVM (graph optimizations) on top of TVM (TVM primitives)

Backends: LLVM (x86, ARM, Javascript/WASM), CUDA, OpenCL, Metal, AMDGPUs, and more hardware backends in progress

Joint Work with AWS AI Team and DMLC community

SLIDE 44

Deep Learning System Research is Exciting but Hard

Diagram: Frameworks (e.g. CNTK) → Computational graph → Operator libraries (cuDNN, NNPack, MKL-DNN) → Hardware

SLIDE 46

Deep Learning System Research is Just Exciting

Diagram: NNVM graph optimizations and TVM primitives bridging frameworks (e.g. CNTK) and hardware.

You can be part of it!

"I can program my new accelerators from Python :)" "My new optimizations work on all platforms!"