1st TVM and Deep Learning Compilation Conference
December 12, 2018

Luis Ceze: Welcome to the 1st TVM and Deep Learning Compilation Conference! 180+ attendees.


  1. TVM + AWS Vin Sharma, Amazon SageMaker Neo Amazon: vinarm@ | Twitter: ciphr@

  2. How is AWS using TVM?

  3. How is AWS using TVM?
     • As a back-end for Apache MXNet
       • To deploy easily onto edge devices
       • To improve performance on target hardware

  4. How is AWS using TVM?
     • As a back-end for Apache MXNet
       • To deploy easily onto edge devices
       • To improve performance on target hardware
     • As an optimizer for Amazon AI services
       • Amazon Rekognition: to improve end-to-end latency
       • Amazon Alexa: to increase resource efficiency on Echo/Dot

  5. How is AWS using TVM?
     • As a back-end for Apache MXNet
       • To deploy easily onto edge devices
       • To improve performance on target hardware
     • As an optimizer for Amazon AI services
       • Amazon Rekognition: to improve end-to-end latency
       • Amazon Alexa: to increase resource efficiency on Echo/Dot
     • In a tool chain for Amazon Inferentia

  6. We’re Hiring! How is AWS using TVM?
     • As a back-end for Apache MXNet
       • To deploy easily onto edge devices
       • To improve performance on target hardware
     • As an optimizer for Amazon AI services
       • Amazon Rekognition: to improve end-to-end latency
       • Amazon Alexa: to increase resource efficiency on Echo/Dot
     • In a tool chain for Amazon Inferentia

  7. How is AWS enabling adoption of TVM? In a new service called Amazon SageMaker Neo

  8. How is AWS enabling adoption of TVM? In a new service called Amazon SageMaker Neo. Service inputs:
     • Framework
     • Model input files (MXNet: .json & .params)
     • Name and shape of the input node, e.g. {“data”:[1,3,227,277]}
     • Output location
     • Target platform: cloud instance type | edge device

  9. We’re Hiring! How is AWS enabling adoption of TVM? In a new service called Amazon SageMaker Neo. Service inputs:
     • Framework
     • Model input files (MXNet: .json & .params)
     • Name and shape of the input node, e.g. {“data”:[1,3,227,277]}
     • Output location
     • Target platform: cloud instance type | edge device
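These same inputs map onto the SageMaker API. Below is a minimal sketch of starting a Neo compilation job from Python with boto3; the job name, IAM role, and S3 paths are placeholders, and the target device is only an example.

import boto3

# Minimal sketch: kick off a Neo compilation job. The request fields mirror
# the slide's inputs (framework, model artifact, input name/shape, output
# location, target platform). Names and ARNs below are placeholders.
sm = boto3.client("sagemaker")

sm.create_compilation_job(
    CompilationJobName="vgg-mxnet-to-rasp3b",           # hypothetical job name
    RoleArn="arn:aws:iam::123456789012:role/NeoRole",    # placeholder role
    InputConfig={
        "S3Uri": "s3://my-bucket/model.tar.gz",          # archive of .json & .params
        "DataInputConfig": '{"data":[1,3,227,277]}',     # input node name and shape
        "Framework": "MXNET",
    },
    OutputConfig={
        "S3OutputLocation": "s3://my-bucket/compiled/",
        "TargetDevice": "rasp3b",                        # cloud instance type or edge device
    },
    StoppingCondition={"MaxRuntimeInSeconds": 900},
)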

  10. How is AWS contributing to TVM? Releasing all TVM modifications and enhancements in Neo to open source:
     • Frameworks: TensorFlow, MXNet, PyTorch, ONNX
     • Models: ResNet, VGG, Inception, MobileNet, DenseNet, SqueezeNet
     • Operators: several new ops in NNVM/TVM
     • Optimizations: Node Annotation, Graph Partitioning, Ring Buffer, NHWC, Graph Tuning
     • Acceleration library: Nvidia TensorRT
     • Hardware: cross-compilation to ARM, Intel, Nvidia; more coming soon

  11. We’re Hiring! How is AWS contributing to TVM? Releasing all TVM modifications and enhancements in Neo to open source:
     • Frameworks: TensorFlow, MXNet, PyTorch, ONNX
     • Models: ResNet, VGG, Inception, MobileNet, DenseNet, SqueezeNet
     • Operators: several new ops in NNVM/TVM
     • Optimizations: Node Annotation, Graph Partitioning, Ring Buffer, NHWC, Graph Tuning
     • Acceleration library: Nvidia TensorRT
     • Hardware: cross-compilation to ARM, Intel, Nvidia; more coming soon
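To make the framework and cross-compilation bullets concrete, here is a hedged sketch of importing an MXNet model and cross-compiling it for an ARM device with TVM's Relay front end. The model, shapes, and target triple are illustrative, and API details (relay.build_config vs. PassContext, the exact LLVM target flag) vary across TVM versions.

import mxnet as mx
import tvm
from tvm import relay

# Load a pretrained Gluon model (placeholder choice) and import it into Relay.
block = mx.gluon.model_zoo.vision.resnet18_v1(pretrained=True)
shape_dict = {"data": (1, 3, 224, 224)}
mod, params = relay.frontend.from_mxnet(block, shape_dict)

# Cross-compile for an ARM CPU, e.g. a Raspberry Pi class board.
target = "llvm -mtriple=armv7l-linux-gnueabihf"

with relay.build_config(opt_level=3):
    graph, lib, params = relay.build(mod, target=target, params=params)

# Package the compiled operators for deployment with the TVM runtime on the device.
lib.export_library("deploy.tar")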

  12. Chen Tian, Technical VP

  13. TVM on Huawei’s AI portfolio (top to bottom):
     • AI applications: general and advanced applications, HiAI Service, pre-integrated solutions
     • Application enablement: full-pipeline services (ModelArts), hierarchical APIs, application APIs, pre-integrated solutions; HiAI Engine
     • Framework: MindSpore, a unified training and inference framework for device, edge, and cloud (both standalone and cooperative), alongside TensorFlow, PyTorch, PaddlePaddle, and others
     • Chip enablement: CANN (Compute Architecture for Neural Networks), the chip operator library and highly automated operator development toolkit, including Tensor Engine / TVM and CCE lib/extensions
     • IP & chip: the Ascend AI chip series based on a unified, scalable architecture (Ascend-Nano, -Tiny, -Lite, -Mini, -Max)
     • Deployment: consumer device, public cloud, private cloud, edge computing, industrial IoT device

  14. How do we use TVM? In the flow from frameworks through model conversion to model execution, third-party operators are implemented with TE/TVM. During model conversion we use TE/TVM to customize operators for completeness and performance. 70+ operators are written with TVM, bringing us a ~3x improvement in development efficiency.
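As an illustration of what "customizing operators with TE/TVM" looks like, here is a toy sketch; the operator (a broadcast bias-add), shapes, and schedule are made up for the example, and module paths are tvm.* in older releases versus tvm.te.* in newer ones.

import tvm
from tvm import te

# Declare a toy custom operator with the tensor expression (TE) language:
# a broadcast bias-add over an (n, c) input.
n = te.var("n")
c = te.var("c")
x = te.placeholder((n, c), name="x")
b = te.placeholder((c,), name="b")
y = te.compute((n, c), lambda i, j: x[i, j] + b[j], name="y")

# Attach a simple schedule and compile to native code.
s = te.create_schedule(y.op)
s[y].parallel(y.op.axis[0])
f = tvm.build(s, [x, b, y], target="llvm", name="bias_add")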

  15. Successful practice with Audi in Level 4 autonomous driving: a complete city commute record covering driving in the evening, high-speed cruise, Traffic Jam Pilot (TJP), traffic light identification, pedestrian identification, and automatic parking. The jointly developed autonomous driving algorithm achieves leading scores on the authoritative KITTI 2D/3D/BEV benchmarks.

  16. TVM works on the Atlas series products:
     • Atlas 200 Developer Kit: 16 TOPS INT8 @ 24 W; 1 USB Type-C, 2 CCM interfaces, 1 GE network port, 1 SD card slot; 8 GB memory
     • Atlas 300 AI Accelerator Card: 64 TOPS INT8 @ 75 W; 64-channel HD video real-time analysis and JPEG decoding; 32 GB memory with 204.8 GB/s memory bandwidth; PCIe 3.0 x16 half-height half-length card
     • Atlas 500 AI Edge Station: processes 16 channels of HD video in a device the size of a set-top box (STB); delivers 4x higher performance than counterparts
     • Atlas 800 AI Appliance: provides an optimized AI environment based on standard frameworks and programming environments; leverages high-performance GPU scheduling algorithms, improving resource utilization by over 15%
     Application scenarios: smart transportation (traffic light tuning, intelligent traffic guiding), smart manufacturing (intelligent quality inspection and flexible manufacturing), intelligent care (kindergarten and elderly care)

  17. Huawei’s contributions to TVM
     8 contributors: kun-zh, sgrechanik-h, libing4752, derisavi-huawei, solin319, ehsanmok, gaoxiong-1, jiacunjiang1215
     4 reviewers: Srkreddy1238, PariksheetPinjari909, siju-Samuel, Xqdan
     We are working on: 1. Huawei Ascend ASIC support. 2. Front ends for Darknet and ONNX. 3. Optimizations for AutoTVM and IR extensions. 4. Tensorize, cache read/write, and the access_ptr API.
     In the future we will work on: 1. Codegen for fused operators. 2. NLP support. 3. More optimizations. 4. Training operators.

  18. Meghan Cowan

  19. VGG11 on Raspberry Pi 3B: TensorFlow Lite, 32-bit floating point, 66% top-1 ImageNet accuracy, 1.42 fps.

  20. VGG11 on Raspberry Pi 3B: trained binarized model with operators implemented in TVM. Baseline: TensorFlow Lite, 32-bit floating point, 66% top-1 ImageNet accuracy, 1.42 fps.

  21. VGG11 on Raspberry Pi 3B: trained binarized model with operators implemented in TVM.
     • TensorFlow Lite: 32-bit floating point, 66% top-1 ImageNet accuracy, 1.42 fps
     • TVM: 2-bit activations, 1-bit weights, 62% top-1 ImageNet accuracy, 4.67 fps
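The speedup comes from replacing floating-point multiply-accumulates with bitwise operations. Below is a toy NumPy sketch of the underlying primitive, not the speaker's kernels (which are bit-packed, bitserial, and written in TVM); the 1-bit/1-bit dot product shown here generalizes to 2-bit activations by summing two weighted bit planes.

import numpy as np

def binarize(x):
    # Map real values to sign bits: 1 for >= 0, 0 for < 0.
    return (x >= 0).astype(np.uint8)

def binary_dot(a_bits, w_bits):
    # XNOR marks positions where the signs agree; its popcount counts matches.
    # In {-1, +1} arithmetic each match contributes +1 and each mismatch -1,
    # so the dot product equals 2 * matches - n.
    xnor = (~(a_bits ^ w_bits)) & 1
    matches = int(xnor.sum())
    return 2 * matches - a_bits.size

rng = np.random.default_rng(0)
a = rng.standard_normal(128)
w = rng.standard_normal(128)
assert binary_dot(binarize(a), binarize(w)) == int(np.sign(a) @ np.sign(w))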

  22. Further down the stack…

  23. Thierry Moreau

  24. Open Source Stack Overview: the Versatile Tensor Accelerator (VTA) stack, from top to bottom: high-level differentiable IR, tensor expression IR, VTA runtime & JIT compiler, VTA hardware/software interface (ISA), VTA micro-architecture, and the VTA simulator.

  25. Open Source Stack Overview, VTA backends:
     • Simulator: out-of-the-box testing for writing compiler passes

  26. Open Source Stack Overview, VTA backends:
     • Simulator: out-of-the-box testing for writing compiler passes
     • FPGA: fast design iteration, quick deployment, flexibility

  27. Open Source Stack Overview, VTA backends:
     • Simulator: out-of-the-box testing for writing compiler passes
     • FPGA: fast design iteration, quick deployment, flexibility
     • ASIC: industrial-strength efficiency
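A hedged sketch of how code typically selects among these backends in the open-source VTA stack: vta.get_env() reads the active hardware configuration, and env.TARGET distinguishes the simulator from an FPGA target. The host address, port, and bitstream handling follow the public VTA tutorials as I recall them, so treat the details as illustrative.

import vta
from tvm import rpc

env = vta.get_env()  # reads the active VTA configuration (vta_config.json)

if env.TARGET == "sim":
    # Simulator backend: runs out of the box, ideal for testing compiler passes.
    remote = rpc.LocalSession()
else:
    # FPGA backend (e.g. a Pynq board): connect over RPC, then reprogram
    # the fabric and the runtime before running compiled kernels.
    remote = rpc.connect("192.168.2.99", 9091)   # placeholder address/port
    vta.reconfig_runtime(remote)
    vta.program_fpga(remote, bitstream=None)     # None selects the prebuilt bitstream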

  28. Hardware Exploration with VTA. HW/SW constraints:
     • FPGA: channel width, number of BRAMs, DRAM channels, logic resources
     • Model: batch size, data types

  29. VTA Design Space. The HW/SW constraints feed two sets of knobs:
     • Architecture knobs: GEMM intrinsic shape, e.g. (1,32) x (32,32) vs. (4,16) x (16,16); number of units in the tensor ALU, e.g. 32 vs. 16; BRAM allocation between buffers, register file, and micro-op cache
     • Circuit knobs: circuit pipelining, e.g. between 11 and 20 stages for the GEMM core; PLL frequency sweeps, e.g. 250 vs. 300 vs. 333 MHz
     Together these knobs span a design space of 1000s of candidates.
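In the open-source stack, many of these architecture knobs correspond to fields in VTA's hardware configuration file (vta_config.json). The sketch below lists the fields as I understand them, with log2-encoded values; exact names and defaults may differ between VTA releases.

# Assumed field names modeled on vta_config.json; values are log2-encoded.
vta_config = {
    "TARGET": "pynq",          # or "sim" for the simulator backend
    # GEMM intrinsic shape: here (batch=1) x (in=16) x (out=16).
    "LOG_BATCH": 0,
    "LOG_BLOCK_IN": 4,
    "LOG_BLOCK_OUT": 4,
    # Data types: 8-bit inputs/weights, 32-bit accumulators.
    "LOG_INP_WIDTH": 3,
    "LOG_WGT_WIDTH": 3,
    "LOG_ACC_WIDTH": 5,
    # BRAM allocation between buffers, register file, and micro-op cache.
    "LOG_UOP_BUFF_SIZE": 15,   # 32 KiB micro-op cache
    "LOG_INP_BUFF_SIZE": 15,   # 32 KiB input buffer
    "LOG_WGT_BUFF_SIZE": 18,   # 256 KiB weight buffer
    "LOG_ACC_BUFF_SIZE": 17,   # 128 KiB accumulator / register file
}
# Changing LOG_BATCH / LOG_BLOCK_* trades, say, a (1,16)x(16,16) GEMM intrinsic
# for a (4,16)x(16,16) one; the buffer sizes re-balance the BRAM budget.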

  30. VTA Design Space: of the 1000s of candidate configurations, only ~10 survive as VTA candidate designs, since each needs to pass place & route and timing closure. Example shortlist:
     #1 Design AAA @ 307 GOPs
     #2 Design BBB @ 307 GOPs
     #3 Design CCC @ 307 GOPs
     #4 Design DDD @ 256 GOPs

  31. Schedule Exploration with VTA. Starting point: the VTA candidate designs (#1 Design AAA @ 307 GOPs, #2 Design BBB @ 307 GOPs, #3 Design CCC @ 307 GOPs, #4 Design DDD @ 256 GOPs), each of which has passed place & route and timing closure.

  32. Schedule Exploration with VTA. For each candidate design, operator performance is autotuned: throughput rises over the autotuning steps toward each design's peak (e.g. 307 GOPs and 256 GOPs).

  33. Schedule Exploration with VTA. Deliverable: for a given model, a graph optimizer and a tuned operator library targeting the chosen custom VTA design (e.g. Design BBB) deployed on the FPGA.
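The schedule exploration itself is what AutoTVM automates. Below is a hedged sketch of that loop; mod, params, and target are assumed to come from a Relay front-end import (as in the MXNet sketch earlier), the tuner choice, trial count, and log file name are placeholders, and task-extraction signatures vary across TVM versions.

from tvm import autotvm, relay
from tvm.autotvm.tuner import XGBTuner

# Extract tunable tasks (e.g. conv2d schedules) from the imported model.
tasks = autotvm.task.extract_from_program(mod["main"], target=target, params=params)

measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(),
    runner=autotvm.LocalRunner(number=5, repeat=3, timeout=10),
)

for task in tasks:
    tuner = XGBTuner(task)  # cost-model-guided search over the schedule space
    tuner.tune(
        n_trial=1000,
        measure_option=measure_option,
        callbacks=[autotvm.callback.log_to_file("vta_tuning.log")],
    )

# Rebuild the model with the best schedules found during tuning.
with autotvm.apply_history_best("vta_tuning.log"):
    graph, lib, params = relay.build(mod, target=target, params=params)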

  34. TVM+VTA Stack Goals

  35. TVM+VTA Stack Goals • Blueprint for a complete deep learning acceleration stack

  36. TVM+VTA Stack Goals
     • Blueprint for a complete deep learning acceleration stack
     • Experimentation framework for cross-stack deep learning optimizations

  37. TVM+VTA Stack Goals
     • Blueprint for a complete deep learning acceleration stack
     • Experimentation framework for cross-stack deep learning optimizations
     • Open-source community for industrial-strength deep learning acceleration
