Woodpecker-DL: an efficient compiler for accelerating deep learning on heterogeneous computing architectures
Yong Chen, Ant Financial, China
Dec. 19, 2019
Outline
- Woodpecker-DL Overview
- Key Components and Technology
  - Graph Optimization
  - Auto Search with GA, RL Algorithm
  - DSL Compiler (Halide)
  - Integration with TensorRT and Experiment Figures
Accelerating model training and inference is crucial to deep learning:
- Graph-level optimizations to increase per-node compute efficiency and reduce inter-node data movement overhead
- Operator-level optimizations to speed up mathematical function execution
- Exploration of specialized hardware targeting deep learning
- Aims to accelerate deep learning on heterogeneous computing architectures
- Explores multi-level optimizations from the compute graph down to hardware
- Exploits machine learning-based fast hyper-parameter search approaches to yield better math function kernels
- Supports diverse hardware including CPUs, GPUs, FPGAs, etc.
Woodpecker-DL is part of Woodpecker, a generic compiler framework for heterogeneous computing developed at Ant Financial.
[Architecture diagram: the Woodpecker Frontend (Custom TF Ops, Custom PyTorch Extensions, TensorRT Plugins, Woodpecker Addons, Model Safeguard) performs shape inference and graph optimization (in-place, pruning, fusion); math functions in the optimized graph are expressed as tensor expressions and compiled by the DSL compiler under the Woodpecker AutoSearch Optimizer (ordinary and composite functions, auto-tuners, expert-optimized libraries), targeting software backends (LLVM, CUDA, Metal) and hardware backends (Verilog, HLS, Spatial); the generated CUDA assembly codes execute in the Woodpecker Runtime Engine, TensorFlow, PyTorch, TensorRT, or a proprietary engine.]
Key Components and Technology: Graph Optimization
- Supports multiple deep learning frameworks: TensorFlow, PyTorch, Caffe, CoreML
- Compute graph optimization: simplification, removal, and fusion; horizontal or vertical compositional transformation
- Shape inference of operators
[Diagram: a well-known graph optimization example. Simplify: Convolution + BiasAdd + Activation are fused into a single smart operator. Fuse: Convolution followed by Batch Normalization is merged into one convolution (merge conv and bn).]
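To make the conv-bn merge concrete, here is a minimal NumPy sketch of folding a BatchNorm layer into the weights and bias of the preceding convolution. The function name and layout assumptions (OIHW weights, per-output-channel BN statistics) are illustrative, not Woodpecker-DL's actual API.

```python
import numpy as np

def fold_batchnorm_into_conv(W, b, gamma, beta, mean, var, eps=1e-5):
    """Fold BatchNorm into the preceding convolution.

    W: conv weights, shape (C_out, C_in, kH, kW)
    b: conv bias, shape (C_out,)
    gamma, beta, mean, var: per-channel BN parameters, shape (C_out,)
    Returns (W', b') such that conv(x, W', b') == bn(conv(x, W, b)).
    """
    scale = gamma / np.sqrt(var + eps)          # per-output-channel scale
    W_folded = W * scale[:, None, None, None]   # scale each output filter
    b_folded = (b - mean) * scale + beta        # shift the bias accordingly
    return W_folded, b_folded
```

After folding, the BN node is removed from the graph and inference pays for a single convolution instead of two operators.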
Key Components and Technology: Auto Search with GA, RL Algorithm
Woodpecker AutoSearch: a machine learning-based framework for automated mathematical kernel optimization
[Diagram: AutoSearch overview. Algorithms from various domains (deep learning, graph computing, math optimizations, data analysis) are written as parameterized programs (Halide, GraphIt, Weld, Spatial, CUDA, ...). Optimization algorithms (Bayesian, RL, MCMC, Genetic, SA, ...) search the parameter space, guided by a performance model built from measurement, profiling, and historical data, with feedback from the target hardware (CPU, GPU, FPGA, Ali-NPU, Plasticine, mobile/embedded), to produce an efficient program.]
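The feedback loop implied by the diagram can be sketched as follows; `propose`, `compile`, `benchmark`, and `feedback` are hypothetical interface names standing in for whichever concrete optimizer and measurement harness are plugged in, not AutoSearch's actual API.

```python
def autosearch(program, optimizer, hardware, budget=1600):
    """Generic auto-tuning loop: propose -> compile -> measure -> feedback."""
    best, best_time = None, float("inf")
    for _ in range(budget):
        params = optimizer.propose()          # next candidate (GA, RL, Bayesian, ...)
        binary = program.compile(params)      # instantiate the parameterized program
        elapsed = hardware.benchmark(binary)  # measurement / profiling on the target
        optimizer.feedback(params, elapsed)   # update the performance model
        if elapsed < best_time:
            best, best_time = params, elapsed
    return best
```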
Genetic Algorithm
- Varies the population size according to the scale of the real search space
- Joins all hyper-parameters (genes) to form a chromosome
- Uses roulette-wheel selection
Take convolution as an example:
- Image size (1, 64, 56, 56), filter size (64, 64, 1, 1)
- 9 optimizing dimensions: data splitting dimension, granularity, processing order, caching or not
- 56 × 56 × 64 × 8 × 8 × 8 × 6 × 4 × 6 ≈ 14 billion choices
- Brute force: 14 billion × 100 ms per iteration → 22 years
- Brute force with pruning: 230 thousand choices → 1.35 days
- Genetic search: 1600 choices → 12 minutes
Search-space options and ranges:
  ThreadX (1, 56)   ThreadY (1, 56)    ThreadZ (1, 64)
  TileX (1, 8)      TileY (1, 8)       TileZ (1, 8)
  TileRZ (1, 6)     Shared Mem (1, 4)  Thread LoopOrder (1, 6)
- Converges in 10 minutes with a population size of 64
- 2.8× faster than NVIDIA cuDNN, 1.5× faster than TVM
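A minimal sketch of the genetic search described above, assuming the slide's nine schedule knobs and a placeholder fitness function (the real fitness compiles the candidate schedule and benchmarks it on the GPU). With a population of 64 and 25 generations it evaluates roughly the 1600 candidates quoted above.

```python
import random

# Schedule knobs and ranges from the convolution example above.
RANGES = {
    "ThreadX": (1, 56), "ThreadY": (1, 56), "ThreadZ": (1, 64),
    "TileX": (1, 8), "TileY": (1, 8), "TileZ": (1, 8),
    "TileRZ": (1, 6), "SharedMem": (1, 4), "LoopOrder": (1, 6),
}
GENES = list(RANGES)

def random_chromosome():
    # A chromosome joins all hyper-parameters (genes) in a fixed order.
    return [random.randint(lo, hi) for lo, hi in RANGES.values()]

def fitness(chrom):
    # Placeholder: the real system compiles this schedule and measures
    # kernel runtime on the target GPU, returning e.g. 1 / runtime.
    return 1.0 / (1.0 + sum(abs(v - hi // 2)
                            for v, (_, hi) in zip(chrom, RANGES.values())))

def roulette_select(pop, scores):
    # Roulette-wheel selection: probability proportional to fitness.
    pick, acc = random.uniform(0, sum(scores)), 0.0
    for chrom, s in zip(pop, scores):
        acc += s
        if acc >= pick:
            return chrom
    return pop[-1]

def crossover(a, b):
    cut = random.randrange(1, len(a))  # single-point crossover
    return a[:cut] + b[cut:]

def mutate(chrom, rate=0.1):
    return [random.randint(*RANGES[g]) if random.random() < rate else v
            for g, v in zip(GENES, chrom)]

def genetic_search(pop_size=64, generations=25):
    pop = [random_chromosome() for _ in range(pop_size)]
    for _ in range(generations):
        scores = [fitness(c) for c in pop]
        pop = [mutate(crossover(roulette_select(pop, scores),
                                roulette_select(pop, scores)))
               for _ in range(pop_size)]
    return max(pop, key=fitness)

print(dict(zip(GENES, genetic_search())))
```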
Reinforcement Learning
- Customized environment and policy graph
- Uses the RLlib scalable reinforcement learning framework
- Operations taken from a convolutional model for the Ant Financial payment business
- RL finds better performance than GA in some cases (within the same search time)
[Bar chart: relative speedup over cuDNN (= 1.0) on conv1a, conv1b, conv2, conv3, conv4, comparing Woodpecker GA against Woodpecker RL; values range from 0.75 to 2.67, with each method winning on some operations.]
RL does not always outperform GA.
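A minimal sketch of what a customized environment for schedule search might look like, reusing the nine knobs from the GA example. The environment class, action encoding, and reward stub are all illustrative assumptions, not the authors' code; the RLlib hookup in the trailing comment uses Ray's standard PPO configuration API.

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces

# Same nine schedule knobs as the GA example (illustrative).
RANGES = np.array([[1, 56], [1, 56], [1, 64], [1, 8], [1, 8],
                   [1, 8], [1, 6], [1, 4], [1, 6]])

class ScheduleEnv(gym.Env):
    """Hypothetical custom environment: the agent nudges one schedule
    knob per step; the reward would be the measured kernel speedup."""

    def __init__(self, config=None):  # RLlib passes an env_config dict
        n = len(RANGES)
        self.observation_space = spaces.Box(0.0, 1.0, shape=(n,), dtype=np.float32)
        self.action_space = spaces.MultiDiscrete([n, 2])  # (knob index, -1/+1)
        self.params = RANGES[:, 0].copy()
        self.steps = 0

    def _obs(self):
        lo, hi = RANGES[:, 0], RANGES[:, 1]
        return ((self.params - lo) / (hi - lo)).astype(np.float32)

    def _reward(self):
        # Placeholder: the real reward compiles the schedule and benchmarks it.
        return -float(np.sum((self.params - RANGES[:, 1] // 2) ** 2))

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.params = np.array([self.np_random.integers(lo, hi + 1)
                                for lo, hi in RANGES])
        self.steps = 0
        return self._obs(), {}

    def step(self, action):
        knob, up = int(action[0]), int(action[1])
        self.params[knob] = np.clip(self.params[knob] + (1 if up else -1),
                                    RANGES[knob, 0], RANGES[knob, 1])
        self.steps += 1
        return self._obs(), self._reward(), False, self.steps >= 100, {}

# RLlib usage sketch (assumes `pip install "ray[rllib]"`):
#   from ray.rllib.algorithms.ppo import PPOConfig
#   algo = PPOConfig().environment(ScheduleEnv).build()
#   result = algo.train()
```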
Key Components and Technology: DSL Compiler (Halide)
Halide: a Domain-Specific Language (DSL) and compiler for image processing pipelines
- Separates the algorithm from the schedule
- Enables more efficient and flexible optimizations
- Open source: https://github.com/halide/Halide
Example of an algorithm and its schedule (see the sketch below):
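A version of the classic two-stage blur from the Halide paper, written with Halide's Python bindings, illustrating the algorithm/schedule split; the specific tiling and vectorization factors are illustrative choices, not the slide's.

```python
import halide as hl

x, y = hl.Var("x"), hl.Var("y")
xi, yi = hl.Var("xi"), hl.Var("yi")
input = hl.ImageParam(hl.UInt(16), 2)
blur_x, blur_y = hl.Func("blur_x"), hl.Func("blur_y")

# Algorithm: *what* is computed (a separable 3x3 box blur).
blur_x[x, y] = (input[x - 1, y] + input[x, y] + input[x + 1, y]) / 3
blur_y[x, y] = (blur_x[x, y - 1] + blur_x[x, y] + blur_x[x, y + 1]) / 3

# Schedule: *how* it is computed (tiling, vectorization, parallelism);
# changing these lines changes performance but never the result.
blur_y.tile(x, y, xi, yi, 256, 32).vectorize(xi, 8).parallel(y)
blur_x.compute_at(blur_y, x).vectorize(x, 8)
```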
Drawbacks
- Still needs domain-specific knowledge and skill to get good performance
- Given a specific architecture, there is a considerable number of schedules to explore
- Some schedules are architecture-aware, so different hardware needs to exploit different schedules
Example schedules
- Loop: split, reorder, unroll, tile, storage layout, etc.
- Stage granularity:
  - Coarse-grained: insufficient shared memory, limiting other schedules (storage granularity)
  - Fine-grained: insufficient data reuse and inefficient load/store
Schedules are crucial for achieving high performance for a given math function; this motivated the development of automated search approaches for optimal schedules. A sketch of some loop schedules follows below.
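A small illustration of the loop schedules listed above, again using Halide's Python bindings; the split factor and loop ordering are arbitrary examples.

```python
import halide as hl

x, y = hl.Var("x"), hl.Var("y")
xo, xi = hl.Var("xo"), hl.Var("xi")
f = hl.Func("f")
f[x, y] = x + y  # trivial algorithm; the scheduling calls are the point here

f.split(x, xo, xi, 8)  # split loop x into an outer loop and an inner loop of width 8
f.reorder(xi, y, xo)   # change the loop nesting order
f.unroll(xi)           # fully unroll the inner loop
```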
Storage transform (put Co inner-most): (N, Co, H, W) → (N, H, W, Co)
[Performance chart: with layout optimization, relative performance is 1.625× versus 1.0 without it.]
[Profiling chart (layout optimize vs. w/o layout optimize) over Global Load Efficiency, Global Store Efficiency, Shared Efficiency, and Occupancy: the layout-optimized kernel reaches 96.30% global load efficiency and 100% global store efficiency versus 9.10% and 8.90% without it; the remaining reported values are 14.10%, 50.00%, 0.00%, 4.50%.]
A NumPy illustration of the transform follows below.
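A minimal NumPy illustration of the storage transform, assuming the slide's (1, 64, 56, 56) activation: moving the channel dimension (Co) inner-most makes accesses along Co contiguous, which enables the coalesced loads/stores seen in the profile.

```python
import numpy as np

# NCHW -> NHWC: move channels (Co) inner-most so that consecutive threads
# reading along Co touch contiguous memory (coalesced access on GPU).
x_nchw = np.random.rand(1, 64, 56, 56).astype(np.float32)
x_nhwc = np.ascontiguousarray(x_nchw.transpose(0, 2, 3, 1))
assert x_nhwc.shape == (1, 56, 56, 64)
```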
Key Components and Technology: Integration with TensorRT and Experiment Figures
Supports multiple inference engines:
- Proprietary engine
- External serving (TensorFlow, PyTorch, and TensorRT)
[Diagram showing Woodpecker-DL integration with TensorRT]
For separate convolution operations (ResNet-18; relative speedup over cuDNN = 1.00, higher is better):
  Conv:       c1    c2    c3    c4    c5    c6    c7    c8    c9    c10   c11   c12
  TVM:        0.79  0.53  1.81  2.17  3.46  0.76  2.66  3.67  0.90  2.36  3.89  1.69
  Woodpecker: 0.86  0.61  2.56  2.56  5.40  0.84  3.60  4.15  1.46  3.18  3.41  1.80
Accumulated relative speedup on ResNet-18 (higher is better), summing the runtimes of all convolution operations: cuDNN 1.00, TVM 1.67, Woodpecker 2.04.
Ant Financial payment business model (relative speedup of Woodpecker over TensorRT = 1.0, higher is better), with dynamic batching enabled:
  Batch size: 1     2     3     4     5     6     7     8     9     10    11    12    13    14    15    16
  Speedup:    2.12  2.00  1.77  1.52  1.48  1.43  1.40  1.50  1.31  1.26  1.34  1.23  1.20  1.23  1.33  1.24
References
- S. Chetlur et al. (2014). cuDNN: Efficient Primitives for Deep Learning.
- T. Chen et al. (2018). TVM: An Automated End-to-End Optimizing Compiler for Deep Learning.
- E. Liang et al. (2018). RLlib: Abstractions for Distributed Reinforcement Learning.
- J. Ragan-Kelley et al. (2018). Halide: Decoupling Algorithms from Schedules for High-Performance Image Processing.
- NVIDIA (2019). TensorRT: Programmable Inference Accelerator.
Yue Jin, Yao Zhang, Yongchao Liu, Teng Teng, Hang Ou, Yong Chen, Rui Zhao