End to End Optimization Stack for Deep Learning
Presenter: Tianqi Chen, Paul G. Allen School of Computer Science & Engineering, University of Washington
Collaborators (spanning ML, the software stack, and the hardware stack):
Tianqi Chen, Thierry Moreau, Haichen Shen, Ziheng Jiang, Carlos Guestrin, Luis Ceze, Arvind Krishnamurthy
University of Washington and the AWS AI Team,
and many more contributors in the DMLC community.
The status quo stack:
Frameworks (e.g., CNTK) → Computational graph → Operator libraries (cuDNN, NNPack, MKL-DNN) → Hardware

Built a new accelerator? You need the entire software stack on top of it:
layout transformation, quantization, operator kernel optimization, benchmarking, …
Frameworks also apply graph-level transformations on top of this stack:
data layout optimization and operator fusion.
The catch: each transformation produces new operator variants, and you need an optimized hardware kernel for each variant, on each hardware! A fused add-then-ReLU sketch follows.
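To make operator fusion concrete, here is a minimal sketch in TVM's expression language (introduced later in this deck); the names are illustrative and tvm.select is assumed from the 2017-era API:

import tvm

n = tvm.var('n')
A = tvm.placeholder((n,), name='A')
B = tvm.placeholder((n,), name='B')
zero = tvm.const(0, 'float32')

# Unfused: two operators, two kernels, one intermediate buffer T.
T = tvm.compute((n,), lambda i: A[i] + B[i], name='T')
R1 = tvm.compute((n,), lambda i: tvm.select(T[i] > zero, T[i], zero), name='R1')

# Fused add+ReLU: one kernel and no intermediate buffer -- but also a
# brand-new operator variant that needs its own optimized kernel per target.
R2 = tvm.compute((n,), lambda i: tvm.select(A[i] + B[i] > zero, A[i] + B[i], zero), name='R2')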
The same gap repeats across the ecosystem: many frameworks above, many hardware back-ends below, plus serving scenarios on top of both.
The emerging answer: the computational graph as a shared intermediate representation.
Examples: NGraph, XLA, NNVM, DLVM, …
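As a taste of the graph IR, a minimal sketch against the 2017-era NNVM Python API (the operator choice and parameters are illustrative):

import nnvm.symbol as sym

# Build a small framework-independent computational graph.
x = sym.Variable('x')
w = sym.Variable('w')
y = sym.dense(data=x, weight=w, units=128, use_bias=False)
z = sym.relu(y)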
The TVM stack: Framework → NNVM Graph → TVM
import tvm

# Inputs, with symbolic shapes.
m, n, h = tvm.var('m'), tvm.var('n'), tvm.var('h')
A = tvm.placeholder((m, h), name='A')
B = tvm.placeholder((n, h), name='B')

# Computation rule for C = dot(A, B.T); (m, n) is the shape of C.
k = tvm.reduce_axis((0, h), name='k')
C = tvm.compute((m, n), lambda i, j: tvm.sum(A[i, k] * B[j, k], axis=k))
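A default schedule is enough to lower this description and inspect the resulting IR; a minimal sketch with the 2017-era API:

# Lower the matmul above to low-level IR and print it.
s = tvm.create_schedule(C.op)
print(tvm.lower(s, [A, B, C], simple_mode=True))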
IR: the same description must lower to CPUs, GPUs, and accelerators, which differ along three axes.

[Figure: memory hierarchies — CPU L1/L2/L3 caches; GPU register files, TX/L1, shared memory (SM), L2; accelerator unified buffer and accumulator FIFO]

                      CPU                  GPU      Accelerators
Memory subsystem      implicitly managed   mixed    explicitly managed
Compute primitives    scalar               vector   tensor
Data type             fp32                 fp16     int8
Lowering: an algorithm described in the IR is turned into generated code (LLVM, CUDA, OpenCL, …), with scheduling optimizations applied along the way:
(✔) Data layout
(✔) Tiling
(✔) Thread cooperation
(✔) Latency hiding
(✔) Tensorization
A scheduling sketch follows.
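A minimal scheduling sketch for the matmul above, assuming the 2017-era schedule primitives (tile, bind) and a CUDA target; the tiling factors are illustrative:

s = tvm.create_schedule(C.op)

# Tiling: split the output space into 32x32 blocks for data reuse.
xo, yo, xi, yi = s[C].tile(C.op.axis[0], C.op.axis[1], x_factor=32, y_factor=32)

# Thread cooperation: map outer tiles to GPU blocks, inner ones to threads.
s[C].bind(xo, tvm.thread_axis('blockIdx.x'))
s[C].bind(yo, tvm.thread_axis('blockIdx.y'))
s[C].bind(xi, tvm.thread_axis('threadIdx.x'))
s[C].bind(yi, tvm.thread_axis('threadIdx.y'))

fcuda = tvm.build(s, [A, B, C], target='cuda')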
Two halves of the stack: the NNVM + TVM compilation stack does the heavy optimizations ahead of time, while the TVM runtime you deploy stays lightweight, 300 to 600 KB.
Framework frontends feed the compiler; the deployed artifact is a TVM graph module, as sketched below.
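A minimal deployment sketch against the 2017-era NNVM compiler and graph runtime APIs; net and params come from a framework frontend (see the frontend example near the end of this deck), and all shapes are illustrative:

import nnvm.compiler
import tvm
from tvm.contrib import graph_runtime

# Heavy compilation happens here, on the developer's machine.
graph, lib, params = nnvm.compiler.build(
    net, target='llvm', shape={'data': (1, 3, 224, 224)}, params=params)

# The deployed side only needs the lightweight TVM runtime.
module = graph_runtime.create(graph, lib, tvm.cpu(0))
module.set_input(**params)
module.run()
out = module.get_output(0, tvm.nd.empty((1, 1000)))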
Remote deployment and profiling: a server running the TVM compiler drives devices running only the TVM runtime, over TVM RPC, as sketched below.
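A minimal RPC sketch, assuming the 2017-era tvm.contrib.rpc API; the address and module name are illustrative, and fadd stands for a module built earlier with tvm.build:

from tvm.contrib import rpc, util

# Connect to a device (e.g., a Raspberry Pi) running the TVM RPC server.
remote = rpc.connect('192.168.1.42', 9090)

# Cross-compile locally, upload the object file, and load it on the device.
tmp = util.tempdir()
path = tmp.relpath('deploy.o')
fadd.save(path)
remote.upload(path)
f = remote.load_module('deploy.o')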
[Chart: inference time (ms) on an NVIDIA K80 for ResNet18 and MobileNet. Baseline: MXNet with cuDNN auto-tuning enabled. The NNVM compiler is 1.2x faster on both models, after one grad-student month of effort.]
[Chart: inference time (ms) on a Raspberry Pi 3 for ResNet18 and MobileNet. Baseline: MXNet with OpenBLAS and NNPack. The NNVM compiler is 2.2x faster on ResNet18 and 11.5x faster on MobileNet, after two undergrad weeks of effort.]
Credit: Leyuan Wang (AWS/UC Davis), Yuwei Hu (TuSimple), Ziheng Jiang (AWS/FDU)
Tensorization and latency hiding make an FPGA back-end possible: an example of building a new hardware back-end. Open-source soon.
The full stack today:
Framework frontends: MXNet, Keras, CoreML, PyTorch, Caffe2, CNTK, ONNX, Caffe
NNVM: graph optimizations
TVM: primitives
Hardware back-ends: LLVM (x86, ARM, Javascript/WASM), CUDA, OpenCL, Metal, AMDGPUs, and more
(Components range from supported, to work in progress, to externally supported.) A frontend import sketch follows.
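A minimal frontend sketch, assuming the 2017-era nnvm.frontend API; the checkpoint name is illustrative:

import mxnet as mx
import nnvm.frontend

# Load a pretrained MXNet model and convert it into an NNVM graph.
mx_sym, args, auxs = mx.model.load_checkpoint('resnet18', 0)
net, params = nnvm.frontend.from_mxnet(mx_sym, args, auxs)
# net and params feed nnvm.compiler.build, as in the deployment sketch above.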
Joint work with the AWS AI Team and the DMLC community.
The takeaway, before and after. Before: frameworks (e.g., CNTK) → computational graph → operator libraries (cuDNN, NNPack, MKL-DNN) → hardware. After: frameworks → NNVM (graph optimizations) → TVM (primitives) → hardware.

"I can program my new accelerators from Python :)"
"My new optimizations work on all platforms!"