AutoTVM & Device Fleet
Learning to Optimize Tensor Programs
Frameworks
High-level data flow graph and optimizations
Machine Learning based Program Optimizer
Learning to generate optimized programs for new operator workloads and hardware
Hardware
Compute Description
C = tvm.compute((m, n), lambda y, x: tvm.sum(A[k, y] * B[k, x], axis=k))

Schedule optimizations: Loop Transformations, Thread Bindings, Cache Locality, Thread Cooperation, Tensorization, Latency Hiding

Hardware
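The compute description says only what to calculate; the schedule decides how. A minimal pure-Python sketch (not the actual TVM API) of one such schedule optimization, loop tiling, applied to the matmul above, assuming fixed 8x8 matrices:

```python
# Hypothetical illustration: the same matmul computed two ways.
# A naive triple loop vs. a tiled version that works on TY x TX
# blocks of C to improve cache locality.

N = 8          # matrix extent (assumption for this sketch)
TY, TX = 4, 4  # tile sizes: tunable knobs a schedule would expose

def matmul_naive(A, B):
    C = [[0] * N for _ in range(N)]
    for y in range(N):
        for x in range(N):
            for k in range(N):
                C[y][x] += A[k][y] * B[k][x]
    return C

def matmul_tiled(A, B):
    C = [[0] * N for _ in range(N)]
    for yo in range(0, N, TY):              # outer tile loops
        for xo in range(0, N, TX):
            for yi in range(yo, yo + TY):   # inner loops stay in-cache
                for xi in range(xo, xo + TX):
                    for k in range(N):
                        C[yi][xi] += A[k][yi] * B[k][xi]
    return C
```

Both versions produce identical results; only the iteration order, and hence the memory behavior, differs. The space of such reorderings is exactly what the optimizer searches.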
Program Optimizer
Program
Code Generator
Runtime Measurements
Program Optimizer
Program
Code Generator
Training data
Statistical Cost Model
High-Level Configuration

for y in range(8):
    for x in range(8):
        C[y][x] = 0
        for k in range(8):
            C[y][x] += A[k][y] * B[k][x]

Extracted features per loop level:

touched memory     C    A    B
  y               64   64   64
  x                8    8   64
  k                1    8    8

loop length: y 1, x 8, k 64

[Figure: context vectors for the y, x, and k loops are combined by a soft scatter into a final embedding]
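The "touched memory" column for each buffer can be read as: the number of distinct elements accessed while the loops at that level and below vary, with the outer loops held fixed. A pure-Python sketch of that feature extraction (names and the loop-length definition as product of outer extents are assumptions for illustration):

```python
# Hypothetical reconstruction of the touched-memory / loop-length
# features for C[y][x] += A[k][y] * B[k][x] with 8x8x8 loops.

N = 8
LOOPS = ["y", "x", "k"]  # outermost to innermost

# Index expression of each buffer access in the loop body.
ACCESSES = {
    "C": lambda y, x, k: (y, x),
    "A": lambda y, x, k: (k, y),
    "B": lambda y, x, k: (k, x),
}

def touched_memory(level):
    """Distinct elements touched per buffer when the loops at
    `level` and deeper vary and the outer loops are fixed to 0."""
    free = LOOPS[LOOPS.index(level):]
    ranges = {v: range(N) if v in free else range(1) for v in LOOPS}
    touched = {name: set() for name in ACCESSES}
    for y in ranges["y"]:
        for x in ranges["x"]:
            for k in ranges["k"]:
                for name, fn in ACCESSES.items():
                    touched[name].add(fn(y, x, k))
    return {name: len(s) for name, s in touched.items()}

def loop_length(level):
    """How many times the loop at `level` is entered: the product
    of the extents of the loops outside it."""
    return N ** LOOPS.index(level)
```

Running it reproduces the table: `touched_memory("x")` gives C 8, A 8, B 64, and `loop_length` gives 1, 8, 64 for y, x, k.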
[Figure: One Conv2D Layer of ResNet18 on Titan X — relative speedup (baseline: cuDNN) versus number of trials, comparing TVM random search against the TVM ML-based model]
Historical Optimization Tasks
Domain Invariant Program Representations
Transferable Models to speed up new tasks
[Chart: latency (ms), MXNet + TensorRT 4.0 vs. AutoTVM on ResNet-50, MobileNet, VGG-19, Inception V3, DenseNet-121]
[Chart: latency (ms), MIOpen vs. AutoTVM on ResNet-50, MobileNet, DenseNet-121]
[Chart: latency (ms), cuDNN vs. AutoTVM on individual conv2d workloads, e.g. 1-7-512-512-1, 4-7-512-512-3, 1-14-256-256-1]
Fleet Tracker

Resource Manager (Tracker): shared cluster of heterogeneous devices
- Nvidia GPU server: RPC runtime, CUDA tasks
- AMD GPU server: RPC runtime, ROCm tasks
- Android phone: RPC runtime, OpenCL tasks
- Raspberry Pi: RPC runtime, ARM tasks
- Zynq FPGA board: RPC runtime, JIT driver, hardware bitstream
Optimization services run as RPC clients with cross compilers; they obtain resource tokens from the tracker (resource allocation) and open RPC sessions to devices (data path)
A prioritizer schedules workloads (Workload 1, 2, 3, …) using the ML-based cost model
Red modules can be reconfigured remotely in each session
Client → Tracker: request device
Tracker → Client: return handle
Client → Device: upload code
Client → Device: run code
Device → Client: return run time
Device → Tracker: device free
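The request/release protocol above can be sketched as an in-process toy. All class and method names (`Tracker`, `Device`, `request_device`, `measure`) are illustrative, not the actual TVM RPC API:

```python
# Hypothetical in-process sketch of the tracker protocol.

class Device:
    """Stand-in for a remote device behind an RPC runtime."""
    def __init__(self, name):
        self.name = name
        self.code = None

    def upload(self, code):
        self.code = code

    def run(self):
        # Stand-in for executing the uploaded code and timing it:
        # return a fake "run time" in seconds.
        return 0.001 * len(self.code)

class Tracker:
    """Hands out device handles and takes them back when free."""
    def __init__(self, devices):
        self.free = list(devices)

    def request_device(self):
        if not self.free:
            raise RuntimeError("no device free")
        return self.free.pop()          # return handle

    def release(self, device):
        self.free.append(device)        # device reports itself free

def measure(tracker, code):
    dev = tracker.request_device()      # Client -> Tracker
    try:
        dev.upload(code)                # Client -> Device
        return dev.run()                # Device -> Client: run time
    finally:
        tracker.release(dev)            # Device free -> Tracker
```

The try/finally mirrors the last step of the protocol: the device is always returned to the tracker's free pool, even if the run fails, so the shared fleet never leaks devices.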
Model → Bag of Operators → AutoTVM tuning → Tuned Model
Handcrafted Schedule Templates
- conv2d, x86
- conv2d, GPU, winograd
- conv2d, ARM, spatial packing
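A handcrafted template's job is to define, per operator and hardware target, the space of tunable knobs that the tuner searches. A minimal sketch of such a config space (the knob names and values are illustrative assumptions, not a real conv2d template):

```python
# Hypothetical knob space for a conv2d schedule template.
import itertools

KNOBS = {
    "tile_y": [1, 2, 4, 8],
    "tile_x": [1, 2, 4, 8],
    "unroll_k": [0, 1],
    "vectorize": [0, 1],
}

def config_space(knobs):
    """Enumerate every concrete configuration in the template's space."""
    names = sorted(knobs)
    for values in itertools.product(*(knobs[n] for n in names)):
        yield dict(zip(names, values))

space = list(config_space(KNOBS))
# 4 * 4 * 2 * 2 = 64 candidate schedules for the tuner to search
```

Each target gets its own template (winograd on GPU, spatial packing on ARM) because the knobs that matter, and their profitable ranges, differ per hardware; the tuner then searches each space with the cost model described earlier.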