End to End Optimization Stack for Deep Learning


  1. End to End Optimization Stack for Deep Learning. Presenter: Tianqi Chen, Paul G. Allen School of Computer Science & Engineering, University of Washington.

  2. Collaborators from the University of Washington and the AWS AI Team: Ziheng Jiang, Tianqi Chen, Thierry Moreau, Haichen Shen, Carlos Guestrin, Luis Ceze, Arvind Krishnamurthy, and many more contributors in the DMLC community, spanning the NNVM pipeline, ML, the software stack, and the hardware stack (GPU, ARM).

  3. Deep Learning System Research is Exciting but Hard. Today's stack: frameworks such as CNTK build a computational graph, which calls into operator libraries (cuDNN, NNPACK, MKL-DNN) that target the hardware.

  7. Deep Learning System Research is Exciting but Hard. Built a new accelerator? You need an entire software stack on top of it: layout transformation, quantization, operator kernel optimization, benchmarking, and more.

  13. Deep Learning System Research is Exciting but Hard. At the graph level you want data layout optimization, operator fusion, and serving, and underneath the operator libraries (cuDNN, NNPACK, MKL-DNN) you need an optimized hardware kernel for each operator variant, on each hardware target!


  21. The End to End System Challenge: connect many frameworks to many hardware back-ends, with an intermediate representation as the bridge.

  22. Computational Graph IR and the Remaining Gap. Examples of graph IRs: nGraph, XLA, NNVM, DLVM. A computational graph IR handles auto differentiation, memory planning, and operator fusion before handing off to the backends.

  24. The remaining gap: below the graph IR there are too many possible choices of precision, layout, fused pattern, device, and threading for the graph alone to express. We need a low-level IR that expresses them explicitly to the backends.

  25. TVM: Low-Level IR Framework. TVM sits between the NNVM graph (auto differentiation, memory planning) and the hardware backends. It offers a concise and compact description, explicit control over codegen, ease of deployment, and support for new hardware backends.

  26. Tensor Index Expression. Declaring the computation C = dot(A, B.T):

      import tvm

      # Inputs
      m, n, h = tvm.var('m'), tvm.var('n'), tvm.var('h')
      A = tvm.placeholder((m, h), name='A')
      B = tvm.placeholder((n, h), name='B')

      # Computation rule; (m, n) is the shape of C
      k = tvm.reduce_axis((0, h), name='k')
      C = tvm.compute((m, n), lambda i, j: tvm.sum(A[i, k] * B[j, k], axis=k))
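
TVM can build a schedule for this expression and print the loop nest it will generate; a minimal sketch, assuming the same 2017-era tvm.* API as above:

      # Minimal sketch (2017-era API): default schedule, then inspect the IR.
      s = tvm.create_schedule(C.op)
      print(tvm.lower(s, [A, B, C], simple_mode=True))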

  31. Challenge: Hardware Diversities. One IR must target very different machines:
  • Memory subsystem: CPUs have an implicitly managed cache hierarchy (L3, L2, L1D/L1I, register files); GPUs mix implicit and explicit management (L2, TX/L1 per SM, register files); accelerators are explicitly managed (FIFO, unified buffer, accumulator register file).
  • Compute primitives: scalar (CPU), vector (GPU), tensor (accelerators).
  • Data types: fp32, fp16, int8.

  38. Unified Schedule Optimizations Across Hardware. The algorithm is described in the IR, scheduling optimizations are applied, and lowering emits generated code (LLVM, CUDA, OpenCL, …). Scheduling optimizations include: (✔) data layout, (✔) tiling, (✔) thread cooperation, (✔) latency hiding, (✔) tensorization.
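
To make tiling and thread cooperation concrete, here is a hedged sketch that tiles the matmul from slide 26 and binds the tiles to CUDA threads, assuming the same 2017-era tvm.* API; the factor 32 is an illustrative choice:

      # Sketch only: tile C's (i, j) loops and map tiles to CUDA blocks/threads.
      s = tvm.create_schedule(C.op)
      bx, tx = s[C].split(C.op.axis[0], factor=32)   # tile rows
      by, ty = s[C].split(C.op.axis[1], factor=32)   # tile columns
      s[C].reorder(bx, by, tx, ty)                   # blocks outer, threads inner
      s[C].bind(bx, tvm.thread_axis("blockIdx.x"))
      s[C].bind(by, tvm.thread_axis("blockIdx.y"))
      s[C].bind(tx, tvm.thread_axis("threadIdx.x"))
      s[C].bind(ty, tvm.thread_axis("threadIdx.y"))
      fcuda = tvm.build(s, [A, B, C], target="cuda") # lower + code generation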

  39. Separation of Compilation and Deployment. The compilation stack (framework frontends → NNVM → TVM) does the heavy optimizations; deployment ships the resulting graph module to the TVM runtime, which is lightweight: 300 to 600 KB.
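
What deployment can look like on the runtime side; a hedged sketch assuming the 2017-era graph_runtime API, with hypothetical file names and output shape:

      # Sketch only (2017-era API); deploy.so / deploy.json are hypothetical names.
      import numpy as np
      import tvm
      from tvm.contrib import graph_runtime

      lib = tvm.module.load("deploy.so")            # compiled operator library
      graph_json = open("deploy.json").read()       # serialized NNVM graph
      module = graph_runtime.create(graph_json, lib, tvm.cpu(0))
      module.set_input("data", tvm.nd.array(
          np.zeros((1, 3, 224, 224), dtype="float32")))
      module.run()
      out = module.get_output(0, tvm.nd.empty((1, 1000), "float32"))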

  40. Remote Execution and Profiling. Devices run only the TVM runtime; a server with the TVM compiler drives them over TVM RPC.
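
A hedged sketch of the RPC workflow, assuming the 2017-era tvm.contrib.rpc API; the device address and library name are hypothetical:

      # Sketch only: profile a cross-compiled kernel on a remote device.
      from tvm.contrib import rpc

      remote = rpc.connect("192.168.1.99", 9090)    # device runs an RPC server
      remote.upload("mylib.so")                     # cross-compiled TVM module
      f = remote.load_module("mylib.so")
      ctx = remote.cpu(0)                           # context on the remote device
      timer = f.time_evaluator(f.entry_name, ctx, number=10)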

  41. Performance: portable against the state of the art. Bar charts of time cost (ms) compare MXNet baselines with the NNVM compiler. Raspberry Pi 3 (baseline: MXNet with OpenBLAS and NNPACK; roughly two undergrad-weeks of effort): 2.2x faster on ResNet18 and 11.5x faster on MobileNet. K80 GPU (baseline: MXNet with cuDNN autotune enabled; roughly one grad-student-month of effort): 1.2x faster on both ResNet18 and MobileNet. Credit: Leyuan Wang (AWS/UC Davis), Yuwei Hu (TuSimple), Zheng Jiang (AWS/FDU).

  42. Coming Soon: Target New Accelerators. Tensorization, latency hiding, and an FPGA example for building a new hardware backend. Open source soon.

  43. NNVM Compiler: Open Compiler for AI Systems. Frontends (supported, via external support, or in progress): Caffe, Keras, MXNet, PyTorch, Caffe2, CNTK, CoreML, ONNX. NNVM performs the graph optimizations and TVM provides the primitives. Backends: LLVM (x86, ARM, JavaScript/WASM), CUDA, OpenCL, Metal, AMD GPUs, with more hardware backends in progress. Joint work with the AWS AI Team and the DMLC community.
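
End to end, compiling a framework model through NNVM might look like this hedged sketch, assuming the 2017-era nnvm APIs; the checkpoint name and input shape are hypothetical:

      # Sketch only: import an MXNet model and compile it with NNVM/TVM.
      import mxnet as mx
      import nnvm.frontend
      import nnvm.compiler

      mx_sym, args, auxs = mx.model.load_checkpoint("resnet18", 0)
      sym, params = nnvm.frontend.from_mxnet(mx_sym, args, auxs)
      graph, lib, params = nnvm.compiler.build(
          sym, target="llvm", shape={"data": (1, 3, 224, 224)}, params=params)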

  44. Recall where we started: Deep Learning System Research is Exciting but Hard, with frameworks, a computational graph, operator libraries (cuDNN, NNPACK, MKL-DNN), and hardware to stitch together.

  46. Deep Learning System Research is Just Exciting. With NNVM graph optimizations, "my new optimizations work on all platforms!" With TVM primitives, "I can program my new accelerators from Python :)" You can be part of it!
