AutoTVM & Device Fleet: Learning to Optimize Tensor Programs



slide-1
SLIDE 1

AutoTVM & Device Fleet


slide-2
SLIDE 2

Learning to Optimize Tensor Programs

Frameworks -> High-level data flow graph and optimizations -> Hardware

slide-3
SLIDE 3

Learning to Optimize Tensor Programs

Frameworks -> High-level data flow graph and optimizations -> Hardware

slide-4
SLIDE 4

Machine Learning based Program Optimizer

Learning to Optimize Tensor Programs

Frameworks -> High-level data flow graph and optimizations -> Hardware

slide-5
SLIDE 5

Machine Learning based Program Optimizer

Learning to Optimize Tensor Programs

Frameworks -> High-level data flow graph and optimizations -> Hardware

Learning to generate optimized programs for new operator workloads and hardware

slide-6
SLIDE 6

Search over Possible Program Transformations

Hardware

  • Loop Transformations
  • Thread Bindings
  • Cache Locality
  • Thread Cooperation
  • Tensorization
  • Latency Hiding

C = tvm.compute((m, n),
    lambda y, x: tvm.sum(A[k, y] * B[k, x], axis=k))

Compute Description
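
The compute description only states what is computed: C[y, x] is the sum over k of A[k, y] * B[k, x]. It says nothing about loop order, tiling, or threading. As a rough illustration, here is a plain-Python reference of the same arithmetic. This is not TVM code; the function name `matmul_reference` and the small inputs are made up for the example.

```python
def matmul_reference(A, B):
    """Reference semantics: C[y][x] = sum over k of A[k][y] * B[k][x]."""
    K = len(A)       # reduction extent
    m = len(A[0])    # rows of C
    n = len(B[0])    # columns of C
    return [[sum(A[k][y] * B[k][x] for k in range(K)) for x in range(n)]
            for y in range(m)]


A = [[1, 2], [3, 4]]  # indexed as A[k][y]
B = [[5, 6], [7, 8]]  # indexed as B[k][x]
C = matmul_reference(A, B)
```

Any schedule the optimizer picks has to produce the same C; only the running time changes.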

slide-7
SLIDE 7

Search over Possible Program Transformations

Hardware

  • Loop Transformations
  • Thread Bindings
  • Cache Locality
  • Thread Cooperation
  • Tensorization
  • Latency Hiding

C = tvm.compute((m, n),
    lambda y, x: tvm.sum(A[k, y] * B[k, x], axis=k))

Compute Description

slide-8
SLIDE 8

Search over Possible Program Transformations

Hardware

  • Loop Transformations
  • Thread Bindings
  • Cache Locality
  • Thread Cooperation
  • Tensorization
  • Latency Hiding

Billions of possible optimization choices

C = tvm.compute((m, n), 
 lambda y, x: tvm.sum(A[k, y] * B[k, x], axis=k))

Compute Description
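
To get a feel for how quickly the number of choices grows, the sketch below counts only one kind of decision: how to split each of the three loops into nested tiles. The 1024-sized loops and the four tile levels are assumptions chosen for illustration, and the helper names are ours, not part of TVM.

```python
def ordered_factorizations(n, parts):
    """Count ways to write n as an ordered product of `parts` positive integers."""
    if parts == 1:
        return 1
    return sum(ordered_factorizations(n // d, parts - 1)
               for d in range(1, n + 1) if n % d == 0)


def count_tilings(extents, levels):
    """Number of ways to split every loop into `levels` nested tile sizes."""
    total = 1
    for extent in extents:
        total *= ordered_factorizations(extent, levels)
    return total


# Assumed workload: three loops of extent 1024, each split into 4 tile levels.
choices = count_tilings((1024, 1024, 1024), 4)
print(choices)  # 286 ** 3 = 23393656
```

Tiling alone already gives tens of millions of candidates here; loop ordering, thread binding, unrolling and the other transformations listed on the slide multiply that number further.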

slide-9
SLIDE 9

Learning-based Program Optimizer

[Diagram: Program Optimizer, Program, Code Generator]

slide-10
SLIDE 10

Learning-based Program Optimizer

[Diagram: Program Optimizer, Program, Code Generator, Runtime Measurements]

slide-11
SLIDE 11

Learning-based Program Optimizer

[Diagram: Program Optimizer, Program, Code Generator, Runtime Measurements]

High experiment cost: each trial costs ~1 second

slide-12
SLIDE 12

Learning-based Program Optimizer

[Diagram: Program Optimizer, Program, Code Generator]

slide-13
SLIDE 13

Learning-based Program Optimizer

[Diagram: Program Optimizer, Program, Code Generator, Cost Model]

slide-14
SLIDE 14

Learning-based Program Optimizer

Need a reliable cost model per hardware

[Diagram: Program Optimizer, Program, Code Generator, Cost Model]

slide-15
SLIDE 15

Learning-based Program Optimizer

[Diagram: Program Optimizer, Program, Code Generator]

slide-16
SLIDE 16

Learning-based Program Optimizer

[Diagram: Program Optimizer, Program, Code Generator, Training data D]

slide-17
SLIDE 17

Learning-based Program Optimizer

[Diagram: Program Optimizer, Program, Code Generator, Training data D, Learning, Statistical Cost Model]

slide-18
SLIDE 18

Learning-based Program Optimizer

Unique Problem Characteristics

  • Relatively low experiment cost
  • Domain-specific problem structure
  • Large quantity of similar tasks

[Diagram: Program Optimizer, Program, Code Generator, Training data D, Learning, Statistical Cost Model]
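
The loop in the diagram can be sketched roughly as follows: measure a few programs, add the results to the training data D, fit a statistical cost model on D, and use the model to decide which programs to measure next. Everything in this sketch is an assumption made for illustration: `measure` is a synthetic stand-in for running on hardware, the search space is a toy grid of two tile sizes, and a k-nearest-neighbour average replaces the boosted trees or TreeGRU models used in the talk.

```python
import math
import random


def measure(config):
    """Stand-in for a hardware measurement (hypothetical synthetic cost)."""
    ty, tx = config
    # Pretend a 16x32 tile is ideal; cost grows with distance in log2 space.
    return 1.0 + abs(math.log2(ty) - 4) + abs(math.log2(tx) - 5)


def features(config):
    return [math.log2(v) for v in config]


def predict(D, config, k=3):
    """k-nearest-neighbour estimate of the cost, fitted on training data D."""
    f = features(config)
    nearest = sorted(D, key=lambda item: math.dist(f, features(item[0])))[:k]
    return sum(cost for _, cost in nearest) / len(nearest)


def tune(space, rounds=5, batch=4, seed=0):
    rng = random.Random(seed)
    D = [(c, measure(c)) for c in rng.sample(space, batch)]  # initial trials
    for _ in range(rounds):
        seen = {c for c, _ in D}
        candidates = [c for c in space if c not in seen]
        if not candidates:
            break
        # Rank unmeasured configs by predicted cost; measure the most promising.
        candidates.sort(key=lambda c: predict(D, c))
        D.extend((c, measure(c)) for c in candidates[:batch])
    return min(D, key=lambda item: item[1]), len(D)


space = [(2 ** i, 2 ** j) for i in range(8) for j in range(8)]
best, trials = tune(space)
```

Only `batch` programs per round are actually measured; the model's predictions for the rest are cheap, which is what makes the search affordable when every real trial takes on the order of a second.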

slide-19
SLIDE 19

Program-aware Cost Modeling

High-Level Configuration

slide-20
SLIDE 20

Program-aware Cost Modeling

High-Level Configuration

for y in range(8):
    for x in range(8):
        C[y][x] = 0
        for k in range(8):
            C[y][x] += A[k][y] * B[k][x]

Low-level Abstract Syntax Tree (shared between tasks)

slide-21
SLIDE 21

Program-aware Cost Modeling

High-Level Configuration

for y in range(8):
    for x in range(8):
        C[y][x] = 0
        for k in range(8):
            C[y][x] += A[k][y] * B[k][x]

Low-level Abstract Syntax Tree (shared between tasks)

touched memory

       C    A    B
  y   64   64   64
  x    8    8   64
  k    1    8    8

outer loop length

  y    1
  x    8
  k   64

statistical features -> Boosted Tree Ensembles
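
The numbers on this slide can be reproduced with a short sketch. It assumes that "touched memory" means the number of distinct elements of each buffer accessed by one execution of a loop together with everything nested inside it, and that "outer loop length" is the product of the extents of the enclosing loops. That reading is ours; the function name `loop_features` and the data layout are hypothetical.

```python
from math import prod

loops = [("y", 8), ("x", 8), ("k", 8)]  # outermost loop first
buffers = {"C": ("y", "x"), "A": ("k", "y"), "B": ("k", "x")}  # index variables


def loop_features(loops, buffers):
    extent = dict(loops)
    rows = {}
    for depth, (var, _) in enumerate(loops):
        # Variables of this loop and deeper loops vary; outer ones are fixed.
        inner = {v for v, _ in loops[depth:]}
        touched = {name: prod(extent[v] for v in idx if v in inner)
                   for name, idx in buffers.items()}
        outer_len = prod(e for _, e in loops[:depth])
        rows[var] = {"touched": touched, "outer_loop_length": outer_len}
    return rows


feats = loop_features(loops, buffers)
for var, row in feats.items():
    print(var, row["touched"], row["outer_loop_length"])
```

Under these assumptions the output matches the slide: 64/64/64 at y, 8/8/64 at x, 1/8/8 at k, with outer loop lengths 1, 8 and 64. Vectors like these are the kind of per-loop statistics that can be fed to a boosted-tree model.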

slide-22
SLIDE 22

Program-aware Cost Modeling

High-Level Configuration

for y in range(8):
    for x in range(8):
        C[y][x] = 0
        for k in range(8):
            C[y][x] += A[k][y] * B[k][x]

Low-level Abstract Syntax Tree (shared between tasks)

[Figure: TreeGRU. Labels: for (context vec of y), for (context vec of x), for (context vec of k); soft scatter; +; final embedding]

touched memory

       C    A    B
  y   64   64   64
  x    8    8   64
  k    1    8    8

outer loop length

  y    1
  x    8
  k   64

statistical features -> Boosted Tree Ensembles

slide-23
SLIDE 23

Effectiveness of ML based Model


slide-24
SLIDE 24

Effectiveness of ML based Model

[Plot: One Conv2D Layer of ResNet18 on Titan X]

slide-25
SLIDE 25

Effectiveness of ML based Model

[Plot: One Conv2D Layer of ResNet18 on Titan X. x-axis: Number of Trials]

slide-26
SLIDE 26

Effectiveness of ML based Model

[Plot: One Conv2D Layer of ResNet18 on Titan X. x-axis: Number of Trials; y-axis: Relative Speedup. Series: Baseline: cuDNN]

slide-27
SLIDE 27

Effectiveness of ML based Model

[Plot: One Conv2D Layer of ResNet18 on Titan X. x-axis: Number of Trials; y-axis: Relative Speedup. Series: Baseline: cuDNN; TVM: Random Search]

slide-28
SLIDE 28

Effectiveness of ML based Model

[Plot: One Conv2D Layer of ResNet18 on Titan X. x-axis: Number of Trials; y-axis: Relative Speedup. Series: Baseline: cuDNN; TVM: Random Search; TVM: ML-based Model]

slide-29
SLIDE 29

Transfer Learning Among Different Workloads

Historical Optimization Tasks

slide-30
SLIDE 30

Transfer Learning Among Different Workloads

Historical Optimization Tasks Domain Invariant Program Representations

slide-31
SLIDE 31

Transfer Learning Among Different Workloads

Historical Optimization Tasks Domain Invariant Program Representations Transferable Models to speedup new tasks

slide-32
SLIDE 32

Transfer Learning Among Different Workloads

Historical Optimization Tasks Domain Invariant Program Representations Transferable Models to speedup new tasks

slide-33
SLIDE 33

NVIDIA GPU Optimization (GTX 1080 Ti)

[Bar chart: Latency (ms) on ResNet-50, MobileNet, VGG-19, Inception V3, DenseNet-121. Series: MXNet + TensorRT 4.0; AutoTVM]

slide-34
SLIDE 34

AMD GPU Optimization (Vega FE)

[Bar chart: Latency (ms) on ResNet-50, MobileNet, DenseNet-121. Series: MIOpen; AutoTVM]

slide-35
SLIDE 35

Bonus (INT8, GTX 1080)

[Bar chart: Latency (ms) on workloads 1-7-512-512-1, 4-7-512-512-1, 1-7-512-512-3, 4-7-512-512-3, 1-14-256-256-1, 4-14-256-256-1. Series: cuDNN; AutoTVM]

slide-36
SLIDE 36

High Level: Scaling Automatic Performance Profiling


Fleet Tracker

slide-37
SLIDE 37

Low Level: Portable RPC Tracker + Server

Resource Manager (Tracker)

Shared cluster of heterogeneous devices:

  • Nvidia GPU Server: RPC RT, CUDA tasks
  • Android Phone: RPC RT, OpenCL tasks
  • AMD GPU Server: RPC RT, ROCm tasks
  • Zynq FPGA board: RPC RT, JIT driver
  • Raspberry Pi: RPC RT, ARM tasks

Running optimization services:

  • Optimization Service: RPC client, cross compiler
  • Optimization Service: RPC client, cross compiler
  • Prioritizer: Workload 1, Workload 2, Workload 3
  • ML-based cost model
  • …

Connections: Resource token, Resource Allocation, RPC Session Data Path

Other labels: Hardware bitstream

Red modules can be reconfigured remotely in each session.

slide-38
SLIDE 38

RPC Communication Flow

[Diagram: Client, Tracker, Device]

  • Client -> Device: upload code
  • Client -> Device: run code
  • Device -> Client: return run time

slide-39
SLIDE 39

RPC Communication Flow

[Diagram: Client, Tracker, Device]

  • Client -> Device: upload code
  • Client -> Device: run code
  • Device -> Client: return run time
  • Device -> Tracker: device free

slide-40
SLIDE 40

RPC Communication Flow

[Diagram: Client, Tracker, Device]

  • Client -> Device: upload code
  • Client -> Device: run code
  • Device -> Client: return run time
  • Device -> Tracker: device free

slide-41
SLIDE 41

RPC Communication Flow

[Diagram: Client, Tracker, Device]

  • Client -> Tracker: request device
  • Client -> Device: upload code
  • Client -> Device: run code
  • Device -> Client: return run time
  • Device -> Tracker: device free

slide-42
SLIDE 42

RPC Communication Flow

[Diagram: Client, Tracker, Device]

  • Client -> Tracker: request device
  • Client -> Device: upload code
  • Client -> Device: run code
  • Device -> Client: return run time
  • Device -> Tracker: device free

slide-43
SLIDE 43

RPC Communication Flow

[Diagram: Client, Tracker, Device]

  • Client -> Tracker: request device
  • Client -> Device: upload code
  • Client -> Device: run code
  • Device -> Client: return run time
  • Device -> Tracker: device free

slide-44
SLIDE 44

RPC Communication Flow

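[Diagram: Client, Tracker, Device]

  • Client -> Tracker: request device
  • Tracker -> Client: return handle
  • Client -> Device: upload code
  • Client -> Device: run code
  • Device -> Client: return run time
  • Device -> Tracker: device free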
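
The message sequence on this slide can be mimicked with a small in-process simulation. No real RPC happens here, and none of this is TVM's API: the classes `Tracker` and `Device`, the function `measure_once`, and the fake run time are all hypothetical names and values chosen for the sketch.

```python
from collections import deque


class Device:
    def __init__(self, name):
        self.name = name
        self.code = None

    def upload(self, code):
        self.code = code

    def run(self):
        # Stand-in for executing the uploaded code and timing it.
        return self.code()


class Tracker:
    def __init__(self, devices):
        self.free = deque(devices)

    def request(self):
        """Return a handle to a free device, or None if all are busy."""
        return self.free.popleft() if self.free else None

    def release(self, device):
        """Called when a device reports that it is free again."""
        device.code = None
        self.free.append(device)


def measure_once(tracker, code):
    device = tracker.request()      # request device / return handle
    if device is None:
        raise RuntimeError("no free device")
    try:
        device.upload(code)         # upload code
        return device.run()         # run code / return run time
    finally:
        tracker.release(device)     # device free


tracker = Tracker([Device("board-0")])
run_time = measure_once(tracker, lambda: 0.0012)  # fake run time in seconds
```

The point of routing requests through a tracker is that clients never need to know which physical device they get; the tracker only hands out devices that are currently free.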

slide-45
SLIDE 45

Model to Tuned Implementation

Model -> (operator extraction) -> Bag of Operators -> (AutoTVM tuning) -> Tuned Model
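
A rough sketch of this pipeline, under the assumption that a model can be viewed as a list of (operator, workload) layers: identical workloads are collapsed into one tuning task, each task is tuned once, and the results are mapped back onto the layers. The model contents, `extract_operators`, and the placeholder `tune` function are hypothetical and do not reflect AutoTVM's actual interfaces.

```python
from collections import Counter

# Hypothetical model: a list of (operator name, workload shape) layers.
model = [
    ("conv2d", (1, 64, 56, 56, 3)),
    ("conv2d", (1, 64, 56, 56, 3)),
    ("conv2d", (1, 128, 28, 28, 3)),
    ("dense", (1, 512, 1000)),
]


def extract_operators(model):
    """Collapse layers into a bag of unique (operator, workload) tasks."""
    return Counter(model)


def tune(task):
    """Placeholder for tuning one task; returns a fake 'best config' record."""
    op, workload = task
    return {"op": op, "workload": workload, "config": "best-found"}


bag = extract_operators(model)
tuned = {task: tune(task) for task in bag}        # tune each unique task once
tuned_model = [tuned[layer] for layer in model]   # reuse results per layer
```

Because repeated layers share a workload, the number of tuning tasks (3 here) can be noticeably smaller than the number of layers (4 here).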

slide-46
SLIDE 46

Next: Autoscheduler, Lianmin @ 16:30

Handcrafted Schedule Templates

  • conv2d, x86
  • conv2d, GPU, winograd
  • conv2d, ARM, spatial packing