1st TVM and Deep Learning Compilation Conference
December 12, 2018
Luis Ceze
Welcome to the 1st TVM and Deep Learning Compilation Conference!
180+ people!
Machine learning is amazing…
superhuman accuracy, self-driving cars, automated scientific discoveries…
wow!
Software era:
Problem to solve → Write code → Run on fast machine

Machine learning era:
Problem to solve → Data + model templates → Train on fastest machine → Inference on fast & cheap machine

Model size and compute cost are growing fast (chart by Eugenio Culurciello).
Training costs are growing exponentially (chart by OpenAI).
Popularity and computational cost of
Fundamental trade-off between specialization and performance/efficiency:
More General/Programmable ↔ Better Performance/Energy Efficiency
General Purpose CPUs → GPUs → FPGAs → Fixed Function Chips

Machine learning algorithms are relatively simple to implement in HW… great!

Machine Learning Makes Computer Architecture Cool Again!
… plus ~50 startups
Models: CNN, GAN, RNN, MLP, DQNN
Frameworks: …

Challenge: Efficiently deploying deep learning everywhere
Gaurav Kapoor, Core Machine Learning
HW+SW optimization is key for efficiency
Lots of hand-tuning; full automation would be a holy grail.
Academic group focused on Systems + Computer Architecture + Machine Learning + Programming Languages
Computer Architecture: Extensible, energy-efficient hardware designs for inference and training
Compilers: Extensible support for future models
PL: High-level support for future ML applications
Systems: On-device and cloud-based training, distributed systems for ML
ML for Systems: Automatic Learning-Based Design and Optimizations (ML for better ML systems!)
Open Source Deployment
Provides infrastructure; open source when ready.
First major open source compiler collection
Open source compilers have transformed our industry
LLVM: Higher-Level IR, new
In the age of domain-specialized systems…
Specialized compiler stack for Deep Learning
End the tyranny of closed deep learning systems!
Tianqi Chen
TVM stack:
High-Level Differentiable IR
Tensor Expression IR
LLVM, CUDA, Metal | VTA (Edge FPGA, Cloud FPGA, ASIC)
Compile:

    import tvm
    from tvm import relay
    graph, params = frontend.from_keras(keras_resnet50)
    graph, lib, params = relay.build(graph, target)

Deploy, on languages and platforms you choose:

    module = runtime.create(graph, lib, tvm.gpu(0))
    module.set_input(**params)
    module.run(data=data_array)

Deployable Module output: "tabby, tabby cat"
Automated by Machine Learning
Optimization: AutoTVM, AutoVTA, Hardware Fleet
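The "Automated by Machine Learning" step can be pictured as a measure-and-select loop over schedule configurations. A minimal, purely illustrative sketch: the config space and cost function below are made-up stand-ins, not AutoTVM's API (real AutoTVM searches far larger spaces with an ML cost model and on-device measurement).

```python
# Illustrative sketch of schedule autotuning: exhaustively measure a small
# config space and keep the fastest. The knob ("tile") and the latency model
# are hypothetical stand-ins, not TVM internals.
def autotune(config_space, measure):
    """Return the config with the lowest measured time."""
    best_cfg, best_time = None, float("inf")
    for cfg in config_space:
        t = measure(cfg)  # on real hardware: compile and time the kernel
        if t < best_time:
            best_cfg, best_time = cfg, t
    return best_cfg, best_time

# Hypothetical knob: a tiling factor for a loop of size 1024.
space = [{"tile": t} for t in (1, 2, 4, 8, 16, 32)]
toy_cost = lambda cfg: 1024 / cfg["tile"] + 4 * cfg["tile"]  # toy latency model
best, best_time = autotune(space, toy_cost)  # picks tile=16 under this model
```

Exhaustive search works for six points; the reason AutoTVM needs a learned cost model is that realistic schedule spaces have billions of points.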
Diverse Hardware backends:
LLVM: ARM, x86, AMDGPU, NVPTX, Javascript, WASM
CUDA, Metal, Vulkan, C
VTA
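The idea behind this slide, one tensor IR feeding many backends, can be illustrated with a toy dispatch table that lowers the same abstract op to different target-specific code. This is a conceptual sketch only; the emitter strings are hypothetical stubs, not TVM's codegen internals.

```python
# Toy "one IR, many backends" dispatch: lower one abstract op per target.
# Target names echo the slide; the emitted snippets are hypothetical stubs.
def lower_add(target: str) -> str:
    emitters = {
        "llvm": "define float @add(float %a, float %b) { ... }",       # LLVM IR
        "cuda": "__global__ void add(float *a, float *b, float *c) { ... }",
        "c":    "void add(float *a, float *b, float *c, int n) { ... }",
    }
    if target not in emitters:
        raise ValueError(f"no backend registered for target {target!r}")
    return emitters[target]
```

In TVM proper this role is played by the target-aware code generators selected by the `target` argument to `relay.build`.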
TVM Open Source Community
Apache governance model: grant project ownership by merit. 11 committers, 29 reviewers, 166 contributors. Contributed by the community, for the community.
Industrial Impact
Vin Sharma, Amazon SageMaker Neo
Amazon: vinarm@ | Twitter: ciphr@
TVM + AWS
How is AWS using TVM?
We’re Hiring!
How is AWS enabling adoption of TVM?
In a new service called Amazon SageMaker Neo

Model input files (MXNet: .json & .params)
Framework
Output Location
Name and shape of input node: {"data":[1,3,227,277]}
Target Platform: Cloud Instance Type | Edge Device
How is AWS contributing to TVM?
Releasing all TVM modifications and enhancements in Neo to open source
Tuning
Chen Tian, Technical VP
TVM on Huawei’s AI portfolio
CANN (Compute Architecture for Neural Networks): Ascend series … Ascend-Max, Ascend-Mini, Ascend-Tiny, Ascend-Lite, Ascend-Nano
AI Applications
Deployment targets: Consumer Device, Public Cloud, Private Cloud, Industrial IoT Device, Edge Computing
Application Enablement: General APIs, Advanced APIs, Pre-integrated Solutions; HiAI Service, HiAI Engine, ModelArts
Framework: MindSpore, TensorFlow, PyTorch, PaddlePaddle
Chip Enabler: CCE lib/extensions, Tensor Engine / TVM
IP & Chip
Application enabling: full-pipeline services (ModelArts), hierarchical APIs, and pre-integrated solutions
MindSpore: unified training and inference framework for device, edge, and cloud (both standalone and cooperative)
CANN: chip operators library and highly automated operator development toolkit
Ascend: AI chip series based on a unified, scalable architecture
How do we use TVM?
Model Conversion, Third-Party Operators: during model conversion we use TE/TVM to customize model execution.
70+ operators are written in TVM, bringing us a ~3x improvement in development efficiency.
Successful Practice with Audi in Level 4 Autonomous Driving
~ A Complete City Commute Record ~
Driving in the evening, traffic light identification, pedestrian identification, high-speed cruise, Traffic Jam Pilot (TJP), automatic parking. The jointly developed autonomous driving algorithm gains leading scores in the industry-authoritative KITTI 2D/3D/BEV tests!
TVM is in use on the Atlas series products.
Huawei’s Contributions on TVM
8 contributors: kun-zh, sgrechanik-h, libing4752, derisavi-huawei, solin319, ehsanmok, gaoxiong-1, jiacunjiang1215
4 reviewers: Srkreddy1238, PariksheetPinjari909, siju-Samuel, Xqdan
We are working on:
1. Huawei Ascend ASIC support
2. Frontends for Darknet and ONNX
3. Optimizations in AutoTVM, IR extensions
4. Tensorize, cache read/write, access_ptr API
In the future we will try:
1. Codegen for fused operators
2. NLP support
3. More optimizations
4. Training operators
Meghan Cowan
VGG11 on Raspberry Pi 3B
TensorFlow Lite, 32-bit fp: 66% top-1 ImageNet accuracy, 1.42 fps
TVM, 2-bit activation / 1-bit weight: 62% top-1 ImageNet accuracy, 4.67 fps
Trained binarized model; operators implemented with TVM
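The speedup above comes from replacing floating-point multiply-accumulate with bitwise ops: for vectors of +1/-1 values packed into machine words, a dot product reduces to XOR plus popcount. A pure-Python sketch of that identity (an illustration of the arithmetic, not TVM's bitserial schedules):

```python
def binary_dot(a_bits: int, b_bits: int, n: int) -> int:
    """Dot product of two {-1,+1} vectors packed as n-bit integers (bit 1 = +1).

    For +/-1 vectors, dot = n - 2 * popcount(a XOR b): matching bits
    contribute +1, mismatching bits contribute -1.
    """
    mask = (1 << n) - 1
    mismatches = bin((a_bits ^ b_bits) & mask).count("1")
    return n - 2 * mismatches

# a = [+1, +1, -1, +1] (LSB first) -> 0b1011; b = [+1, -1, +1, +1] -> 0b1101
# elementwise products: +1, -1, -1, +1 -> sum = 0
assert binary_dot(0b1011, 0b1101, 4) == 0
```

On real hardware this processes 32 or 64 "multiplies" per XOR instruction, which is why 1-bit weights can outrun 32-bit floats on an ARM CPU.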
Further down the stack…
Thierry Moreau
Open Source Stack Overview

Versatile Tensor Accelerator (VTA) stack:
High-Level Differentiable IR
Tensor Expression IR
VTA Runtime & JIT Compiler
VTA Hardware/Software Interface (ISA)
VTA MicroArchitecture
VTA Simulator

VTA Backends: out-of-the-box testing to write compiler passes; iteration, quick deployment, flexibility; strength: efficiency
Hardware Exploration with VTA

HW/SW Constraints:
FPGA: # BRAMs, DRAM channels, logic resources
Model: batch size, data types, channel width

VTA Design Space:
Architecture Knobs: GEMM intrinsic, e.g. (1,32) x (32,32) vs. (4,16) x (16,16); # of units in tensor ALU, e.g. 32 vs. 16; BRAM allocation between buffers, register file, and micro-op cache
Circuit Knobs: circuit pipelining, e.g. for GEMM core between [11, 20] stages; PLL frequency sweeps, e.g. 250 vs. 300 vs. 333 MHz

1000s of points in the design space → ~10 candidate designs

VTA Candidate Designs (need to pass place & route and timing closure):
#1 Design AAA @ 307 GOPs
#2 Design BBB @ 307 GOPs
#3 Design CCC @ 307 GOPs
#4 Design DDD @ 256 GOPs
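The winnowing from thousands of design points down to a handful of candidates can be sketched as brute-force enumeration of the knobs followed by a resource filter. All knob values, the BRAM cost formula, and the budget below are illustrative stand-ins, not actual VTA numbers:

```python
from itertools import product

# Brute-force sketch of winnowing a hardware design space by resource limits.
# Knob values loosely echo the slide; the cost model and budget are made up.
gemm_shapes = [(1, 32, 32), (2, 16, 32), (4, 16, 16)]  # GEMM intrinsic (batch, in, out)
alu_units   = [16, 32]                                  # units in the tensor ALU
pipe_stages = range(11, 21)                             # GEMM core pipeline depth
pll_mhz     = [250, 300, 333]                           # PLL frequency sweep

def fits(design, max_bram=30):
    """Hypothetical resource check: reject designs over the BRAM budget."""
    b, i, o = design["gemm"]
    brams = b * i * o // 256 + design["alu"]  # made-up cost formula
    return brams <= max_bram

space = [
    {"gemm": g, "alu": a, "stages": s, "mhz": f}
    for g, a, s, f in product(gemm_shapes, alu_units, pipe_stages, pll_mhz)
]
candidates = [d for d in space if fits(d)]  # 180 points -> the subset that fits
```

A real flow would then push the surviving candidates through place & route and timing closure, as the slide notes.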
Schedule Exploration with VTA

Operator Performance AutoTuning: throughput vs. autotuning steps for the 307 GOPs and 256 GOPs designs (chart)

Operator Performance Deliverable:
Tuned Operator Lib + VTA Design BBB on FPGA
Graph Optimizer, custom Model
TVM+VTA Stack Goals: acceleration stack; stack deep learning optimizations; strength deep learning acceleration
Carlos Guestrin
Training Deep Learning Models with TVM
Jared Roesch
Model → High-Level Differentiable IR → Tensor Expression IR → LLVM, CUDA, Metal | VTA (Edge FPGA, Cloud FPGA, ASIC)

Standalone inference deployment

Automatic Differentiation → Gradient Program for Training → Standalone training deployment

Training integration with technology such as PHub (see Liang's talk). More details in the Relay talk later today!
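The "Automatic Differentiation" box transforms a model into a gradient program. The core of reverse-mode AD fits in a few lines; this toy scalar version is only an illustration of the idea, not Relay's whole-program implementation:

```python
class Var:
    """Scalar with taped reverse-mode gradient (toy illustration)."""
    def __init__(self, value):
        self.value, self.grad, self.grad_fn = value, 0.0, []

    def __mul__(self, other):
        out = Var(self.value * other.value)
        # local derivatives: d(a*b)/da = b, d(a*b)/db = a
        out.grad_fn = [(self, other.value), (other, self.value)]
        return out

    def __add__(self, other):
        out = Var(self.value + other.value)
        out.grad_fn = [(self, 1.0), (other, 1.0)]
        return out

    def backward(self, upstream=1.0):
        # accumulate the chain rule backwards through the tape
        self.grad += upstream
        for parent, local in self.grad_fn:
            parent.backward(upstream * local)

x, y = Var(3.0), Var(4.0)
z = x * y + x          # dz/dx = y + 1 = 5, dz/dy = x = 3
z.backward()
```

Relay applies the same transformation at the IR level, so the resulting gradient program can itself be optimized and compiled for any backend, including standalone training deployment.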
Road ahead…

On the horizon…
Automation: auto quantization; full-program
Training: AutoDiff with Relay; tradeoff accuracy/throughput/Joules
Hardware: automated HW design; VTA Chisel design; ASIC flow; training on VTA
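Of the horizon items, auto quantization is the easiest to sketch: map float tensors to low-bit integers plus a scale, trading a little accuracy for throughput and energy. A minimal symmetric int8 scheme (illustrative only, not TVM's quantization pass):

```python
def quantize_int8(values):
    """Symmetric per-tensor int8 quantization: x ~= scale * q, q in [-127, 127]."""
    scale = max(abs(v) for v in values) / 127.0 or 1.0  # avoid scale 0 for all-zero input
    q = [round(v / scale) for v in values]
    return q, scale

def dequantize(q, scale):
    return [scale * v for v in q]

vals = [0.5, -1.27, 0.0, 1.0]
q, s = quantize_int8(vals)
approx = dequantize(q, s)  # close to vals, within one quantization step
```

Automating this means choosing scales (and which ops to quantize) from calibration data so that the accuracy/throughput/Joules tradeoff is made per model rather than by hand.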
Big THANKS to our sponsors!
9:00 Keynote, TVM Overview, TVM @ Amazon
11:05-11:25 Break
Automation, HW Specialization, Security
12:30 Boxed lunches
13:30 Training, Programming Systems, Hardware
15:20-15:50 Break, contributors meetup
15:50 Compilers, FPGAs
16:30 Lightning talks
17:35 Community formation
18:10 Social (food, drinks)
20:00 Adjourn
Keynote (SAMPL, Apple, Amazon, Huawei)
TVM Overview – Tianqi Chen, UW
Deep Learning Compilation at Amazon – Yida Wang, Amazon
AutoTVM & Device Fleet – Eddie Yan, UW
VTA Open Source Deep Learning Accelerator – Thierry Moreau, UW
Secure Enclaves for Deep Learning – Nick Hynes, UC Berkeley/Oasis Labs
Kunle Olukotun/Raghu Prabhakar, Stanford & SambaNova
Machine Programming – Justin Gottschlich, Intel
PlaidML Stripe: Polyhedral IR + Model-guided Optimization – Brian Retford, Intel
The Relay Differentiable IR for TVM – Jared Roesch, UW
Scalable Distributed Training with Parameter Hub – Liang Luo, UW
The HammerBlade ML Supercomputer – Michael Taylor, UW
Andrew Tulloch, Facebook
Graham Schelle, Xilinx
Markus Weimer, Microsoft and Apache Software Foundation