Welcome to the 1st TVM and Deep Learning Compilation Conference (December 12, 2018, Luis Ceze) - PowerPoint PPT Presentation



SLIDE 1

1st TVM and Deep Learning Compilation Conference

December 12, 2018

SLIDE 2

Luis Ceze

SLIDE 3

Welcome to the 1st TVM and Deep Learning Compilation Conference!

SLIDE 4

180+ ppl!

SLIDES 5-7

Machine learning is amazing… superhuman accuracy, self-driving cars, automated scientific discoveries… wow!

SLIDES 8-12

Software era: problem to solve → write code → run on a fast machine.

Machine learning era: problem to solve → data + model templates → train on the fastest machine → inference on a fast & cheap machine.

Model size and compute cost are growing fast (chart by Eugenio Culurciello). Training costs are growing exponentially (chart by OpenAI).
SLIDES 13-18

Popularity and computational cost of ML. Oops.

Fundamental trade-off between specialization and performance/efficiency: general purpose CPUs → GPUs → FPGAs → fixed-function chips trade generality/programmability for better performance and energy efficiency.

Machine learning algorithms are relatively simple to implement in HW… great!

Machine Learning Makes Computer Architecture Cool Again!

SLIDES 19-25

… + ~50 startups

Models: CNN, GAN, RNN, MLP, DQNN

Frameworks:

Challenge: efficiently deploying deep learning everywhere.

SLIDE 26

Gaurav Kapoor, Core Machine Learning

SLIDES 27-28

HW+SW optimization is key for efficiency.

Lots of hand-tuning; full automation would be a holy grail.

SLIDES 29-31

Academic group focused on Systems + Computer Architecture + Machine Learning + Programming Languages

SLIDES 32-36

Computer Architecture: extensible, energy-efficient hardware designs for inference and training.
Compilers: extensible support for future models, optimizations, and hardware architectures.
PL: high-level support for future ML applications.
Systems: on-device and cloud-based training, distributed systems for ML.
ML for Systems: automatic learning-based design and optimizations. ML for better ML systems!

Open Source Deployment: provides infrastructure; open source when ready.

SLIDES 37-41

Open source compilers have transformed our industry: the first major open source compiler collection, then LLVM (higher-level IR, new optimizations, easier extensibility). In the age of domain-specialized systems: a specialized compiler stack for deep learning.

End the tyranny of closed deep learning systems!

SLIDE 42

Tianqi Chen
SLIDES 44-47

High-Level Differentiable IR → Tensor Expression IR → LLVM, CUDA, Metal | VTA → Edge FPGA, Cloud FPGA, ASIC

SLIDES 48-51

Compile:

    import tvm
    from tvm import relay

    graph, params = relay.frontend.from_keras(keras_resnet50)
    graph, lib, params = relay.build(graph, target)

Deploy, on languages and platforms you choose:

    module = runtime.create(graph, lib, tvm.gpu(0))
    module.set_input(**params)
    module.run(data=data_array)
    output = tvm.nd.empty(out_shape, ctx=tvm.gpu(0))
    module.get_output(0, output)

Deployable module: input image → prediction "tabby, tabby cat".

SLIDES 52-55

High-Level Differentiable IR → Tensor Expression IR → LLVM, CUDA, Metal | VTA → Edge FPGA, Cloud FPGA, ASIC

Optimization: AutoTVM, AutoVTA, Hardware Fleet. Automated by machine learning.
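The "automated by machine learning" step can be pictured with a toy search loop in plain Python. Everything below (the knob space, the cost function) is invented for illustration; real AutoTVM compiles each candidate schedule, measures it on the target device, and guides the search with a learned cost model.

```python
import random

# Invented stand-in for on-device measurement: pretend 32x8 tiles are optimal.
def measure_cost(tile_x, tile_y):
    return abs(tile_x - 32) + abs(tile_y - 8)

def autotune(n_trials=1000, seed=0):
    """Random search over a toy schedule-knob space (tile sizes)."""
    random.seed(seed)
    space = [(tx, ty) for tx in (1, 2, 4, 8, 16, 32, 64)
                      for ty in (1, 2, 4, 8, 16, 32, 64)]
    best_cfg, best_cost = None, float("inf")
    for _ in range(n_trials):
        cfg = random.choice(space)  # real tuners pick candidates by model
        cost = measure_cost(*cfg)
        if cost < best_cost:
            best_cfg, best_cost = cfg, cost
    return best_cfg

print(autotune())  # converges to the (32, 8) optimum of the toy cost
```

The point of the hardware fleet on the slide is that the expensive step, `measure_cost`, runs on many real devices in parallel.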

SLIDES 56-59

Diverse hardware backends:

LLVM: ARM, x86, AMDGPU, NVPTX, JavaScript, WASM
CUDA, Metal, Vulkan, C
VTA

SLIDES 60-61

TVM Open Source Community

Apache governance model: project ownership is granted by merit. 11 committers, 29 reviewers, 166 contributors. Contributed by the community, for the community.

SLIDE 62

Industrial Impact

SLIDE 63

TVM + AWS

Vin Sharma, Amazon SageMaker Neo (Amazon: vinarm@ | Twitter: ciphr@)

SLIDES 64-68

How is AWS using TVM?

  • As a back-end for Apache MXNet
  • To deploy easily onto edge devices
  • To improve performance on target hardware
  • As an optimizer for Amazon AI services
  • Amazon Rekognition: to improve end-to-end latency
  • Amazon Alexa: to increase resource efficiency on Echo/Dot
  • In a tool chain for Amazon Inferentia

We’re Hiring!

SLIDES 69-71

How is AWS enabling adoption of TVM? In a new service called Amazon SageMaker Neo.

Neo compilation inputs: framework; model input files (MXNet: .json & .params); name and shape of the input node, e.g. {“data”: [1,3,227,277]}; target platform (cloud instance type | edge device); output location.

We’re Hiring!

SLIDES 72-73

How is AWS contributing to TVM?

Releasing all TVM modifications and enhancements in Neo to open source:

  • Frameworks: TensorFlow, MXNet, PyTorch, ONNX
  • Models: ResNet, VGG, Inception, MobileNet, DenseNet, SqueezeNet
  • Operators: several new ops in NNVM/TVM
  • Optimizations: node annotation, graph partitioning, ring buffer, NHWC, graph tuning
  • Acceleration library: Nvidia TensorRT
  • Hardware: cross-compilation to ARM, Intel, Nvidia; more coming soon

We’re Hiring!

SLIDES 74-76 (Huawei Confidential)

TVM on Huawei’s AI portfolio

Chen Tian, Technical VP

Portfolio layers, top to bottom: AI applications (consumer device, public cloud, private cloud, industrial IoT device, edge computing); application enablement (pre-integrated solutions, ModelArts, HiAI Service / HiAI Engine, advanced and general APIs); frameworks (MindSpore, TensorFlow, PyTorch, PaddlePaddle); chip enabler CANN (Compute Architecture for Neural Networks: CCE, lib/extensions, Tensor Engine / TVM); IP & chips: the Ascend series (Ascend-Max, Ascend-Mini, Ascend-Tiny, Ascend-Lite, Ascend-Nano).

Application enabling: full-pipeline services (ModelArts), hierarchical APIs, and pre-integrated solutions. MindSpore: unified training and inference framework for device, edge, and cloud (both standalone and cooperative). CANN: chip operator library and a highly automated operator development toolkit. Ascend: AI chip series based on a unified, scalable architecture.
SLIDE 77

How do we use TVM?

During model conversion we use TE/TVM to customize operators (including third-party operators) for completeness and performance before model execution in the frameworks. 70+ operators are written with TVM, bringing us a ~3x development-efficiency improvement.
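The TE (tensor expression) style mentioned above declares an operator as a lambda over its output indices and leaves the loop structure to the compiler. A toy pure-Python imitation of that idea, not the actual TVM `te.compute` API:

```python
def compute(shape, f):
    """Materialize a tensor declared as a function of its indices (toy TE)."""
    if len(shape) == 1:
        return [f(i) for i in range(shape[0])]
    # 2-D case: the "schedule" here is simply two nested Python loops.
    return [[f(i, j) for j in range(shape[1])] for i in range(shape[0])]

# An elementwise-add "operator" declared once, then evaluated.
A = [[1, 2, 3], [4, 5, 6]]
B = [[10, 20, 30], [40, 50, 60]]
C = compute((2, 3), lambda i, j: A[i][j] + B[i][j])
print(C)  # [[11, 22, 33], [44, 55, 66]]
```

Separating what to compute (the lambda) from how to loop over it is what lets one operator definition be retargeted and tuned per backend, which is the source of the development-efficiency gain claimed on the slide.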

SLIDE 78

Successful practice with Audi in Level 4 autonomous driving: a complete city commute record. Driving in the evening, traffic-light identification, pedestrian identification, high-speed cruise, Traffic Jam Pilot (TJP), and automatic parking. The jointly developed autonomous-driving algorithm gains leading scores in the industry-authoritative KITTI 2D/3D/BEV tests!

SLIDE 79

TVM is working on the Atlas series products, targeting Smart Manufacturing (intelligent quality inspection and flexible manufacturing), Intelligent Care (kindergarten and elderly care), and Smart Transportation (traffic-light tuning, intelligent traffic guiding).

  • Atlas 200 Developer Kit: 16 TOPS INT8 @ 24 W; 1 USB Type-C, 2 CCM interfaces, 1 GE network port, 1 SD card slot; 8 GB memory
  • Atlas 300 AI Accelerator Card: 64 TOPS INT8 @ 75 W; 64-channel HD video real-time analysis and JPEG decoding; 32 GB memory, 204.8 GB/s memory bandwidth; PCIe 3.0 x16, half-height half-length card
  • Atlas 500 AI Edge Station: processes 16-channel HD video in the size of a set-top box (STB); delivers 4x higher performance than counterparts
  • Atlas 800 AI Appliance: provides an optimized AI environment based on the standard framework and programming environment; leverages high-performance GPU scheduling algorithms, improving resource utilization by over 15%

SLIDE 80

Huawei’s contributions to TVM

8 contributors: kun-zh, sgrechanik-h, libing4752, derisavi-huawei, solin319, ehsanmok, gaoxiong-1, jiacunjiang1215. 4 reviewers: Srkreddy1238, PariksheetPinjari909, siju-Samuel, Xqdan.

We are working on: 1. Huawei Ascend ASIC support. 2. Front ends for Darknet and ONNX. 3. Optimizations in AutoTVM, IR extensions. 4. Tensorize, cache read/write, the access_ptr API.

In the future we will try: 1. Codegen for fused operators. 2. NLP support. 3. More optimizations. 4. Training operators.

SLIDES 81-84

Meghan Cowan

VGG11 on Raspberry Pi 3B:

  • TensorFlow Lite, 32-bit floating point: 66% top-1 ImageNet accuracy, 1.42 fps
  • TVM, trained binarized model (2-bit activations, 1-bit weights), operators implemented with TVM: 62% top-1 ImageNet accuracy, 4.67 fps
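The speedup in the numbers above comes from replacing floating-point multiply-accumulates with bit operations. A hand-rolled sketch of the core trick for 1-bit (±1) values, not the actual TVM operator code: pack each vector into a machine word, then a dot product becomes an XOR followed by a popcount.

```python
def popcount(x):
    """Number of set bits in x."""
    return bin(x).count("1")

def binary_dot(a_bits, w_bits, n):
    """Dot product of two ±1 vectors packed as n-bit integers.

    Bit i = 1 encodes +1 and bit i = 0 encodes -1, so each position
    contributes +1 when the bits agree and -1 when they differ:
    dot = n - 2 * popcount(a XOR w).
    """
    return n - 2 * popcount((a_bits ^ w_bits) & ((1 << n) - 1))

# Cross-check against the unpacked ±1 vectors.
a = [1, -1, 1, 1, -1, -1, 1, -1]
w = [1, 1, -1, 1, -1, 1, 1, 1]
pack = lambda v: sum(1 << i for i, x in enumerate(v) if x == 1)
assert binary_dot(pack(a), pack(w), len(a)) == sum(x * y for x, y in zip(a, w))
```

On real hardware the popcount maps to a single instruction, which is where most of the throughput gain over 32-bit floating point comes from.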

SLIDE 85

Further down the stack…

SLIDE 86

Thierry Moreau

SLIDES 87-90

Open Source Stack Overview

Versatile Tensor Accelerator (VTA) stack: High-Level Differentiable IR → Tensor Expression IR → VTA Runtime & JIT Compiler → VTA Hardware/Software Interface (ISA) → VTA MicroArchitecture / VTA Simulator

VTA backends:

  • Simulator: out-of-the-box testing to write compiler passes
  • FPGA: fast design iteration, quick deployment, flexibility
  • ASIC: industrial-strength efficiency

SLIDES 91-93

Hardware Exploration with VTA

HW/SW constraints: from the FPGA (# BRAMs, DRAM channels, logic resources) and from the model (batch size, data types, channel width).

VTA design space (1000s of candidates):

  • Architecture knobs: GEMM intrinsic, e.g. (1,32) x (32,32) vs. (4,16) x (16,16); number of units in the tensor ALU, e.g. 32 vs. 16; BRAM allocation between buffers, register file, and micro-op cache
  • Circuit knobs: circuit pipelining, e.g. between 11 and 20 stages for the GEMM core; PLL frequency sweeps, e.g. 250 vs. 300 vs. 333 MHz

VTA candidate designs (~10): #1 Design AAA @ 307 GOPs, #2 Design BBB @ 307 GOPs, #3 Design CCC @ 307 GOPs, #4 Design DDD @ 256 GOPs. Each needs to pass place & route and timing closure.
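The 1000s → ~10 funnel on these slides can be sketched as a constraint filter plus a ranking. All numbers below (the BRAM budget, the resource and throughput models) are invented placeholders, not real VTA or FPGA data:

```python
from itertools import product

BRAM_BUDGET = 140  # hypothetical block-RAM budget for the target FPGA

def bram_cost(batch, block, stages):
    # Invented model: bigger GEMM tiles and deeper pipelines need more BRAM.
    return batch * block * block // 64 + 4 * stages

def throughput_gops(batch, block, freq_mhz):
    # Invented model: 2 * batch * block^2 MACs per cycle at freq_mhz MHz.
    return 2 * batch * block * block * freq_mhz / 1000

# Enumerate the design space: GEMM intrinsic shape, pipeline depth, PLL freq.
candidates = [
    ((batch, block, stages, freq), throughput_gops(batch, block, freq))
    for batch, block, stages, freq in product(
        (1, 4), (16, 32), range(11, 21), (250, 300, 333))
    if bram_cost(batch, block, stages) <= BRAM_BUDGET  # must fit the FPGA
]

# Keep only the top designs; these then go through place & route and timing.
top = sorted(candidates, key=lambda c: c[1], reverse=True)[:4]
```

The real flow swaps the invented models for synthesis reports and measured throughput, but the shape of the search is the same.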

SLIDES 94-96

Schedule Exploration with VTA

For each candidate design that passes place & route and timing closure, AutoTVM tunes operator performance: throughput climbs over the autotuning steps (e.g. a 307 GOPs design pulling ahead of a 256 GOPs one).

Deliverable: a tuned operator library and the chosen VTA design (e.g. Design BBB) on the FPGA, together with the graph optimizer and a custom model.

SLIDES 97-100

TVM+VTA Stack Goals

  • Blueprint for a complete deep learning acceleration stack
  • Experimentation framework for cross-stack deep learning optimizations
  • Open-source community for industrial-strength deep learning acceleration

SLIDE 101

Carlos Guestrin

SLIDE 102

Training Deep Learning Models with TVM

SLIDE 103

Jared Roesch

SLIDES 104-108

Model → High-Level Differentiable IR → Tensor Expression IR → LLVM, CUDA, Metal | VTA → Edge FPGA, Cloud FPGA, ASIC: standalone inference deployment.

Adding automatic differentiation to the high-level IR turns a model into a gradient program for training, enabling standalone training deployment:

  • Automatic generation of gradient programs
  • Support for customized data types and FPGA training
  • Support for distributed execution, and integration with technology such as PHub (see Liang’s talk)

More details in the Relay talk later today!
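The automatic-differentiation step can be illustrated with a minimal reverse-mode tape in plain Python; this toy is only a sketch of the idea, not how Relay's AD is implemented:

```python
class Var:
    """Scalar node that records how it was computed, for reverse-mode AD."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # pairs of (input Var, local partial)
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

    def backward(self, seed=1.0):
        """Propagate d(output)/d(self) back through the recorded graph."""
        self.grad += seed
        for parent, local in self.parents:
            parent.backward(seed * local)

x, y = Var(3.0), Var(4.0)
z = x * y + x        # forward program
z.backward()         # derive and run the gradient program
print(x.grad, y.grad)  # dz/dx = y + 1 = 5.0, dz/dy = x = 3.0
```

A compiler-based AD, as described for Relay, performs this transformation on the IR ahead of time instead of tracing at runtime, so the resulting gradient program can itself be optimized and deployed standalone.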

SLIDE 109

Road ahead…

SLIDES 110-113

On the horizon… Automation, Training, Hardware

  • Training: AutoDiff with Relay; on-device training; trading off accuracy, throughput, and Joules
  • Automation: auto-quantization; full-program optimization; automated HW design
  • Hardware: VTA Chisel design; ASIC flow; training on VTA
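Of the items above, auto-quantization reduces to choosing scales that map float tensors onto low-precision integers. A hand-written symmetric int8 sketch of that idea (illustrative only; an automatic pass would pick scales per layer from calibration data):

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: floats -> int8 codes plus a scale."""
    scale = max(abs(v) for v in values) / 127.0 or 1.0  # avoid scale == 0
    q = [max(-128, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.50, -1.27, 0.003, 1.00]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Rounding error per element is at most half a quantization step.
max_err = max(abs(a - b) for a, b in zip(weights, restored))
assert max_err <= scale / 2 + 1e-9
```

The "automation" is in choosing these scales (and which tensors tolerate them) across a whole program while holding accuracy, which ties back to the accuracy/throughput/Joules trade-off on the training slide.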

SLIDE 114

Big THANKS to our sponsors!

SLIDE 115

Program:

  9:00  Keynotes, TVM Overview, TVM @ Amazon
  11:05 Break
  11:25 Automation, HW Specialization, Security
  12:30 Boxed lunches
  13:30 Training, Programming Systems, Hardware
  15:20 Break, contributors meetup
  15:50 Compilers, FPGAs
  16:30 Lightning talks
  17:35 Community formation
  18:10 Social (food, drinks)
  20:00 Adjourn

Talks:

  • Keynotes (SAMPL, Apple, Amazon, Huawei)
  • TVM Overview – Tianqi Chen, UW
  • Deep Learning Compilation at Amazon – Yida Wang, Amazon
  • AutoTVM & Device Fleet – Eddie Yan, UW
  • VTA Open Source Deep Learning Accelerator – Thierry Moreau, UW
  • Secure Enclaves for Deep Learning – Nick Hynes, UC Berkeley/Oasis Labs
  • Kunle Olukotun/Raghu Prabhakar, Stanford & SambaNova
  • Machine Programming – Justin Gottschlich, Intel
  • PlaidML Stripe: Polyhedral IR + Model-guided Optimization – Brian Retford, Intel
  • The Relay Differentiable IR for TVM – Jared Roesch, UW
  • Scalable Distributed Training with Parameter Hub – Liang Luo, UW
  • The HammerBlade ML Supercomputer – Michael Taylor, UW
  • Andrew Tulloch, Facebook
  • Graham Schelle, Xilinx
  • Markus Weimer, Microsoft and Apache Software Foundation