End to End Optimization Stack for Deep Learning
Presenter: Tianqi Chen, Paul G. Allen School of Computer Science & Engineering, University of Washington
Collaborators (spanning ML, the software stack, and the hardware stack):
Tianqi Chen, Thierry Moreau, Haichen Shen, Ziheng Jiang, Carlos Guestrin, Luis Ceze, Arvind Krishnamurthy
University of Washington and the AWS AI Team,
and many more contributors in the DMLC community.
The status quo stack:
Frameworks (e.g., CNTK) → Computational graph → Operator libraries (cuDNN, NNPack, MKL-DNN) → Hardware

Built a new accelerator? You need the entire software stack on top of it:
layout transformation, quantization, operator kernel optimization, benchmarking, …
Frameworks also apply graph-level transformations on top of this stack:
data layout optimization and operator fusion.
The catch: each transformation produces new operator variants, and you need an optimized hardware kernel for each variant, on each hardware! A fused add-then-ReLU sketch follows.
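To make operator fusion concrete, here is a minimal sketch in TVM's expression language (introduced later in this deck); the names are illustrative and tvm.select is assumed from the 2017-era API:

import tvm

n = tvm.var('n')
A = tvm.placeholder((n,), name='A')
B = tvm.placeholder((n,), name='B')
zero = tvm.const(0, 'float32')

# Unfused: two operators, two kernels, one intermediate buffer T.
T = tvm.compute((n,), lambda i: A[i] + B[i], name='T')
R1 = tvm.compute((n,), lambda i: tvm.select(T[i] > zero, T[i], zero), name='R1')

# Fused add+ReLU: one kernel and no intermediate buffer -- but also a
# brand-new operator variant that needs its own optimized kernel per target.
R2 = tvm.compute((n,), lambda i: tvm.select(A[i] + B[i] > zero, A[i] + B[i], zero), name='R2')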
The same gap repeats across the ecosystem: many frameworks above, many hardware back-ends below, plus serving scenarios on top of both.
The emerging answer: the computational graph as a shared intermediate representation.
Examples: NGraph, XLA, NNVM, DLVM, …
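As a taste of the graph IR, a minimal sketch against the 2017-era NNVM Python API (the operator choice and parameters are illustrative):

import nnvm.symbol as sym

# Build a small framework-independent computational graph.
x = sym.Variable('x')
w = sym.Variable('w')
y = sym.dense(data=x, weight=w, units=128, use_bias=False)
z = sym.relu(y)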
The TVM stack: Framework → NNVM Graph → TVM
import tvm

# Inputs, with symbolic shapes.
m, n, h = tvm.var('m'), tvm.var('n'), tvm.var('h')
A = tvm.placeholder((m, h), name='A')
B = tvm.placeholder((n, h), name='B')

# Computation rule for C = dot(A, B.T); (m, n) is the shape of C.
k = tvm.reduce_axis((0, h), name='k')
C = tvm.compute((m, n), lambda i, j: tvm.sum(A[i, k] * B[j, k], axis=k))
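A default schedule is enough to lower this description and inspect the resulting IR; a minimal sketch with the 2017-era API:

# Lower the matmul above to low-level IR and print it.
s = tvm.create_schedule(C.op)
print(tvm.lower(s, [A, B, C], simple_mode=True))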
IR: the same description must lower to CPUs, GPUs, and accelerators, which differ along three axes.

[Figure: memory hierarchies — CPU L1/L2/L3 caches; GPU register files, TX/L1, shared memory (SM), L2; accelerator unified buffer and accumulator FIFO]

                      CPU                  GPU      Accelerators
Memory subsystem      implicitly managed   mixed    explicitly managed
Compute primitives    scalar               vector   tensor
Data type             fp32                 fp16     int8
Lowering: an algorithm described in the IR is turned into generated code (LLVM, CUDA, OpenCL, …), with scheduling optimizations applied along the way:
(✔) Data layout
(✔) Tiling
(✔) Thread cooperation
(✔) Latency hiding
(✔) Tensorization
A scheduling sketch follows.
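A minimal scheduling sketch for the matmul above, assuming the 2017-era schedule primitives (tile, bind) and a CUDA target; the tiling factors are illustrative:

s = tvm.create_schedule(C.op)

# Tiling: split the output space into 32x32 blocks for data reuse.
xo, yo, xi, yi = s[C].tile(C.op.axis[0], C.op.axis[1], x_factor=32, y_factor=32)

# Thread cooperation: map outer tiles to GPU blocks, inner ones to threads.
s[C].bind(xo, tvm.thread_axis('blockIdx.x'))
s[C].bind(yo, tvm.thread_axis('blockIdx.y'))
s[C].bind(xi, tvm.thread_axis('threadIdx.x'))
s[C].bind(yi, tvm.thread_axis('threadIdx.y'))

fcuda = tvm.build(s, [A, B, C], target='cuda')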
Two halves of the stack: the NNVM + TVM compilation stack does the heavy optimizations ahead of time, while the TVM runtime you deploy stays lightweight, 300 to 600 KB.
Framework frontends feed the compiler; the deployed artifact is a TVM graph module, as sketched below.
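A minimal deployment sketch against the 2017-era NNVM compiler and graph runtime APIs; net and params come from a framework frontend (see the frontend example near the end of this deck), and all shapes are illustrative:

import nnvm.compiler
import tvm
from tvm.contrib import graph_runtime

# Heavy compilation happens here, on the developer's machine.
graph, lib, params = nnvm.compiler.build(
    net, target='llvm', shape={'data': (1, 3, 224, 224)}, params=params)

# The deployed side only needs the lightweight TVM runtime.
module = graph_runtime.create(graph, lib, tvm.cpu(0))
module.set_input(**params)
module.run()
out = module.get_output(0, tvm.nd.empty((1, 1000)))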
Remote deployment and profiling: a server running the TVM compiler drives devices running only the TVM runtime, over TVM RPC, as sketched below.
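A minimal RPC sketch, assuming the 2017-era tvm.contrib.rpc API; the address and module name are illustrative, and fadd stands for a module built earlier with tvm.build:

from tvm.contrib import rpc, util

# Connect to a device (e.g., a Raspberry Pi) running the TVM RPC server.
remote = rpc.connect('192.168.1.42', 9090)

# Cross-compile locally, upload the object file, and load it on the device.
tmp = util.tempdir()
path = tmp.relpath('deploy.o')
fadd.save(path)
remote.upload(path)
f = remote.load_module('deploy.o')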
[Chart: inference time (ms) on an NVIDIA K80 for ResNet18 and MobileNet. Baseline: MXNet with cuDNN auto-tuning enabled. The NNVM compiler is 1.2x faster on both models, after one grad-student month of effort.]
[Chart: inference time (ms) on a Raspberry Pi 3 for ResNet18 and MobileNet. Baseline: MXNet with OpenBLAS and NNPack. The NNVM compiler is 2.2x faster on ResNet18 and 11.5x faster on MobileNet, after two undergrad weeks of effort.]
Credit: Leyuan Wang (AWS/UC Davis), Yuwei Hu (TuSimple), Ziheng Jiang (AWS/FDU)
Tensorization and latency hiding make an FPGA back-end possible: an example of building a new hardware back-end. Open-source soon.
The full stack today:
Framework frontends: MXNet, Keras, CoreML, PyTorch, Caffe2, CNTK, ONNX, Caffe
NNVM: graph optimizations
TVM: primitives
Hardware back-ends: LLVM (x86, ARM, Javascript/WASM), CUDA, OpenCL, Metal, AMDGPUs, and more
(Components range from supported, to work in progress, to externally supported.) A frontend import sketch follows.
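A minimal frontend sketch, assuming the 2017-era nnvm.frontend API; the checkpoint name is illustrative:

import mxnet as mx
import nnvm.frontend

# Load a pretrained MXNet model and convert it into an NNVM graph.
mx_sym, args, auxs = mx.model.load_checkpoint('resnet18', 0)
net, params = nnvm.frontend.from_mxnet(mx_sym, args, auxs)
# net and params feed nnvm.compiler.build, as in the deployment sketch above.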
Joint work with the AWS AI Team and the DMLC community.
The takeaway, before and after. Before: frameworks (e.g., CNTK) → computational graph → operator libraries (cuDNN, NNPack, MKL-DNN) → hardware. After: frameworks → NNVM (graph optimizations) → TVM (primitives) → hardware.

"I can program my new accelerators from Python :)"
"My new optimizations work on all platforms!"