Woodpecker-DL: an efficient compiler for accelerating deep learning on heterogeneous computing architectures


SLIDE 1

Woodpecker-DL: an efficient compiler for accelerating deep learning on heterogeneous computing architectures

Yong Chen, Ant Financial, China
Dec. 19, 2019
SLIDE 2

Outline

 Woodpecker-DL Overview
 Key Components and Technology
    Graph Optimization
    Auto Search with GA and RL Algorithms
    DSL Compiler (Halide)
 Integration with TensorRT and Experiment Figures

SLIDE 3

Introduction

 Accelerating model training and inference is crucial to deep learning
    Graph-level optimizations increase intra-node compute efficiency and reduce inter-node data-movement overhead
    Operator-level optimizations speed up the execution of mathematical functions
    Specialized hardware targeting deep learning is being explored
 Woodpecker-DL aims to accelerate deep learning on heterogeneous computing architectures by means of compiling techniques
    Explores multi-level optimizations, from the compute graph down to the hardware
    Exploits machine-learning-based fast hyper-parameter search to yield better math-function kernels
    Supports diverse hardware, including CPUs, GPUs and FPGAs
 Woodpecker-DL is part of Woodpecker, a generic compiler framework for heterogeneous computing developed at Ant Financial.

SLIDE 4

[Diagram: the deep learning optimization landscape, from expert-optimized libraries to deep learning compilers. An optimization framework lowers the compute graph through graph optimization and tensor expressions to a DSL compiler, which emits software targets (LLVM, CUDA, Metal) and hardware targets (Verilog, HLS, Spatial), guided by auto-tuners.]

SLIDE 5

Woodpecker-DL Architecture

[Diagram: Woodpecker-DL architecture. The Woodpecker frontend performs shape inference and graph optimization (in-place, pruning, fusion) to produce an optimized graph. The Woodpecker AutoSearch optimizer generates CUDA assembly code for the ordinary and composite math functions in the optimized graph. Integration points include TensorRT plugins and Woodpecker add-ons (Model Safeguard, custom TF ops, custom PyTorch extensions); execution runs on TensorRT, TensorFlow, PyTorch, or a proprietary engine through the Woodpecker runtime engine.]

SLIDE 6

Outline

 Woodpecker-DL Overview
 Key Components and Technology
    Graph Optimization
    Auto Search with GA and RL Algorithms
    DSL Compiler (Halide)
 Integration with TensorRT and Experiment Figures

SLIDE 7

Graph Optimization

 Supports multiple deep learning frameworks
    TensorFlow, PyTorch, Caffe, CoreML
 Compute graph optimization
    Simplification, removal and fusion (see the folding sketch below)
    Horizontal or vertical compositional transformation
 Shape inference of operators

[Diagram: a well-known graph optimization example. A Convolution followed by Batch Norm, BiasAdd and Activation is first simplified by merging the Batch Normalization into the Convolution, and the remaining Convolution, BiasAdd and Activation are then fused into a single operator.]
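The merge works because batch normalization at inference time is a per-output-channel affine transform, so it can be folded into the convolution's weights and bias ahead of time. Below is a minimal NumPy sketch of this standard folding rule; it illustrates the well-known transform with our own names rather than Woodpecker-DL's actual code.

```python
import numpy as np

def fold_bn_into_conv(W, b, gamma, beta, mean, var, eps=1e-5):
    """Fold an inference-time BatchNorm into the preceding convolution.

    W: conv weights, shape (C_out, C_in, kH, kW); b: conv bias, shape (C_out,)
    gamma, beta, mean, var: BN parameters, each of shape (C_out,)
    Returns (W', b') such that conv(x, W', b') == bn(conv(x, W, b)).
    """
    scale = gamma / np.sqrt(var + eps)         # per-output-channel scale
    W_folded = W * scale[:, None, None, None]  # scale each output filter
    b_folded = (b - mean) * scale + beta       # shift the bias to match
    return W_folded, b_folded
```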

SLIDE 8

Outline

 Woodpecker-DL Overview
 Key Components and Technology
    Graph Optimization
    Auto Search with GA and RL Algorithms
    DSL Compiler (Halide)
 Integration with TensorRT and Experiment Figures

SLIDE 9

AutoSearch Optimizer

 A machine-learning-based framework for automated mathematical kernel optimization

[Diagram: the AutoSearch loop. Optimization algorithms (Bayesian, RL, MCMC, genetic, SA, ...) tune parameterized programs written in Halide, GraphIt, Weld, Spatial or CUDA. Candidates are evaluated either by measurement on the target hardware (CPU, GPU, FPGA, Ali-NPU, Plasticine, mobile/embedded) or by a performance model built from profiling and historical data, and the results are fed back into the search until an efficient program for the hardware is found. Target algorithms come from various domains: deep learning, graph computing, math optimizations, data analysis.]

SLIDE 10

AutoSearch: Genetic Algorithm

 Genetic Algorithm
    Varies the population size according to the scale of the actual search space
    Joins all hyper-parameters (genes) to form a chromosome
    Uses roulette-wheel selection (sketched below)
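A minimal sketch of such a genetic search, assuming the convolution search space shown on the next slide; measure() stands in for compiling and timing a candidate kernel, and all names are ours, not Woodpecker-DL's.

```python
import random

# Hyper-parameter ranges, taken from the search-space table on the next slide.
SPACE = {
    "ThreadX": (1, 56), "ThreadY": (1, 56), "ThreadZ": (1, 64),
    "TileX": (1, 8), "TileY": (1, 8), "TileZ": (1, 8),
    "TileRZ": (1, 6), "SharedMem": (1, 4), "LoopOrder": (1, 6),
}

def random_chromosome():
    # Join all hyper-parameters (genes) into one chromosome.
    return {k: random.randint(lo, hi) for k, (lo, hi) in SPACE.items()}

def roulette_select(population, fitness):
    # Roulette wheel: pick a parent with probability proportional to fitness.
    r = random.uniform(0, sum(fitness))
    acc = 0.0
    for chrom, fit in zip(population, fitness):
        acc += fit
        if acc >= r:
            return chrom
    return population[-1]

def crossover(a, b):
    # Single-point crossover on the gene order.
    keys = list(SPACE)
    cut = random.randrange(1, len(keys))
    return {k: (a if i < cut else b)[k] for i, k in enumerate(keys)}

def mutate(chrom, rate=0.1):
    for k, (lo, hi) in SPACE.items():
        if random.random() < rate:
            chrom[k] = random.randint(lo, hi)
    return chrom

def search(measure, pop_size=64, generations=25):
    population = [random_chromosome() for _ in range(pop_size)]
    for _ in range(generations):
        fitness = [1.0 / measure(c) for c in population]  # faster => fitter
        population = [
            mutate(crossover(roulette_select(population, fitness),
                             roulette_select(population, fitness)))
            for _ in range(pop_size)
        ]
    return min(population, key=measure)
```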

SLIDE 11

AutoSearch: Search Space

 Take convolution as an example:
    Image size (1, 64, 56, 56), filter size (64, 64, 1, 1)
    9 optimization dimensions: data splitting dimensions, granularity, processing order, caching or not
    56 × 56 × 64 × 8 × 8 × 8 × 6 × 4 × 6 ≈ 14 billion choices (see the snippet below)
    Brute force: 14 billion × 100 ms per iteration → about 22 years
    Brute force with pruning: 230 thousand choices → about 1.35 days
    Genetic search: 1,600 choices → about 12 minutes

Option            Range
ThreadX           (1, 56)
ThreadY           (1, 56)
ThreadZ           (1, 64)
TileX             (1, 8)
TileY             (1, 8)
TileZ             (1, 8)
TileRZ            (1, 6)
Shared Mem        (1, 4)
Thread LoopOrder  (1, 6)
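For reference, a two-line check (ours, not the authors') that reproduces the size of this search space from the table's ranges:

```python
import math

# Upper bounds of the nine option ranges in the table above.
sizes = [56, 56, 64, 8, 8, 8, 6, 4, 6]
print(f"{math.prod(sizes):,}")  # 14,797,504,512 ≈ 14 billion candidates
```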

SLIDE 12

AutoSearch Performance: Genetic Algorithm

 Converges in 10 minutes with a population size of 64
 2.8× faster than NVIDIA cuDNN, 1.5× faster than TVM

SLIDE 13

AutoSearch: Reinforcement Learning

 Reinforcement Learning
    Customized environment and policy graph (a sketch of such an environment follows)
    Uses the RLlib scalable reinforcement learning framework
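The slides do not show the environment itself; below is a minimal sketch of what a schedule-search environment could look like as a classic Gym environment usable with RLlib. All names are ours, and measure() is again a placeholder for compiling and timing a candidate kernel.

```python
import gym
import numpy as np
from gym import spaces

def measure(params):
    # Placeholder: compile the schedule described by params and time it.
    return float(sum(params))

class ScheduleEnv(gym.Env):
    """Toy environment: each step fixes one schedule hyper-parameter;
    the terminal reward is the inverse of the measured kernel runtime."""

    def __init__(self, config=None):
        # Option sizes from the convolution search space (slide 11).
        self.sizes = [56, 56, 64, 8, 8, 8, 6, 4, 6]
        self.action_space = spaces.Discrete(max(self.sizes))
        self.observation_space = spaces.Box(
            0, max(self.sizes), shape=(len(self.sizes),), dtype=np.float32)

    def reset(self):
        self.params = []
        return np.zeros(len(self.sizes), dtype=np.float32)

    def step(self, action):
        # Clamp the action into the current option's valid range.
        i = len(self.params)
        self.params.append(1 + action % self.sizes[i])
        done = len(self.params) == len(self.sizes)
        obs = np.zeros(len(self.sizes), dtype=np.float32)
        obs[:len(self.params)] = self.params
        reward = 1.0 / measure(self.params) if done else 0.0
        return obs, reward, done, {}
```

An RLlib trainer (for example PPO) can then be pointed at ScheduleEnv, with RLlib handling the distributed rollouts and policy optimization.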

SLIDE 14

AutoSearch Performance: Reinforcement Learning

 Operations taken from a convolutional model used in Ant Financial's payment business
 RL finds better schedules than GA in some cases (within the same search time)

[Bar chart: relative speedup over cuDNN (= 1) of Woodpecker GA and Woodpecker RL on five convolutions (conv1a, conv1b, conv2, conv3, conv4); measured speedups range from 0.75× to 2.67×. RL does not always outperform GA.]

SLIDE 15

Outline

 Woodpecker-DL Overview
 Key Components and Technology
    Graph Optimization
    Auto Search with GA and RL Algorithms
    DSL Compiler (Halide)
 Integration with TensorRT and Experiment Figures

SLIDE 16

DSL Compiler: Halide

 A Domain-Specific Language (DSL) and compiler for image processing pipelines
    Separates the algorithm from the schedule
    Enables more efficient and flexible optimizations
    Open source: https://github.com/halide/Halide
 Algorithm:
    g(x, y) = x + y
    f(x, y) = (g(x, y - 1) + g(x, y) + g(x, y + 1)) / 3
 Schedule:
    f.gpu_tile(x, y, xo, yo, xi, yi, 8, 8)
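Written out with Halide's Python bindings, the slide's algorithm and schedule look roughly as follows (a sketch assuming a CUDA-enabled Halide build; the 64×64 output size is our choice):

```python
import halide as hl

x, y = hl.Var("x"), hl.Var("y")
xo, yo, xi, yi = hl.Var("xo"), hl.Var("yo"), hl.Var("xi"), hl.Var("yi")

# Algorithm: what to compute.
g = hl.Func("g")
g[x, y] = x + y
f = hl.Func("f")
f[x, y] = (g[x, y - 1] + g[x, y] + g[x, y + 1]) / 3

# Schedule: how to compute it. Map f onto the GPU in 8x8 thread tiles.
f.gpu_tile(x, y, xo, yo, xi, yi, 8, 8)

# JIT-compile and run; requires a GPU-enabled Halide target.
out = f.realize([64, 64], hl.Target("host-cuda"))
```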
SLIDE 17

Intermediate Codes Generated by Halide

SLIDE 18

Halide Schedules

 Drawbacks
    Still requires domain-specific knowledge and skill to get good performance
    For a given architecture, there is a considerable number of schedules to explore
    Some schedules are architecture-aware, so different hardware needs to exploit different schedules
 Example schedules (see the sketch below)
    Loop: split, reorder, unroll, tile, storage layout, and so on
    Stage granularity
       Coarse-grained: insufficient shared memory, limiting other schedules (storage granularity)
       Fine-grained: insufficient data reuse and inefficient loads/stores
 Schedules are crucial for achieving high performance for a given math function
    This motivated the development of automated search approaches for optimal schedules
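A small sketch (ours, not from the slides) of the loop and granularity schedules named above, again using Halide's Python bindings, this time on a CPU target:

```python
import halide as hl

x, y, xo, xi = hl.Var("x"), hl.Var("y"), hl.Var("xo"), hl.Var("xi")

g = hl.Func("g")
g[x, y] = x * y
f = hl.Func("f")
f[x, y] = g[x, y] + 1

# Loop schedules: split x into 8-wide chunks, reorder the nest, unroll.
f.split(x, xo, xi, 8)
f.reorder(xi, y, xo)
f.unroll(xi)

# Stage granularity: coarse-grained materializes g once for the whole image...
g.compute_root()
# ...fine-grained would instead recompute g per chunk of f:
# g.compute_at(f, xo)

out = f.realize([128, 128])
```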

SLIDE 19

An Example Schedule without Layout Optimization

Storage transform (put Co inner-most): (N, Co, H, W) → (N, H, W, Co), sketched below.

  • N: batch size
  • Co: output channels
  • H: output height
  • W: output width

[Bar charts: with the layout optimization the kernel reaches 1.625× the performance of the version without it (normalized to 1). Profiling shows the layout-optimized kernel is far better on all four metrics: global load efficiency, global store efficiency, shared-memory efficiency, and occupancy.]
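A sketch of this storage transform with Halide's reorder_storage primitive (the Func body here is a placeholder; note that in Halide the leftmost argument is the innermost storage dimension):

```python
import halide as hl

n, co, h, w = hl.Var("n"), hl.Var("co"), hl.Var("h"), hl.Var("w")

conv = hl.Func("conv")
# Placeholder body; a real convolution reduction would go here.
conv[w, h, co, n] = hl.cast(hl.Float(32), w + h + co + n)  # (N, Co, H, W): W inner-most

# Storage transform: make Co the inner-most (fastest-varying) dimension,
# i.e. an (N, H, W, Co) layout, so adjacent channels are contiguous and
# GPU threads reading neighboring channels get coalesced accesses.
conv.reorder_storage(co, w, h, n)
```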

SLIDE 20

Outline

 Woodpecker-DL Overview
 Key Components and Technology
    Graph Optimization
    Auto Search with GA and RL Algorithms
    DSL Compiler (Halide)
 Integration with TensorRT and Experiment Figures

SLIDE 21

Runtime Engines

 Supports multiple inference engines
    Proprietary engine, external serving (TensorFlow, PyTorch, and TensorRT; a build sketch follows)

[Diagram: Woodpecker-DL integration with TensorRT.]
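As background for the TensorRT path, here is a minimal engine-build sketch using the era-appropriate TensorRT 6 Python API (model.onnx is a placeholder). Per the architecture slide, Woodpecker-generated kernels would enter such a network as custom plugin layers; the plugin code itself is not shown in the deck.

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_engine(onnx_path="model.onnx"):
    builder = trt.Builder(TRT_LOGGER)
    flags = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    network = builder.create_network(flags)
    parser = trt.OnnxParser(network, TRT_LOGGER)
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            raise RuntimeError(str(parser.get_error(0)))
    builder.max_workspace_size = 1 << 30  # 1 GiB of scratch space
    # Woodpecker's custom kernels would be injected as IPluginV2 layers here.
    return builder.build_cuda_engine(network)
```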

SLIDE 22

Performance: ResNet-18 (Breakdown)

 For the separate convolution operations:

[Bar chart: ResNet-18 (higher is better), relative speedup over cuDNN for the twelve convolutions c1 to c12, comparing cuDNN, TVM and Woodpecker; per-operation speedups range from 0.53× to 5.40×.]

SLIDE 23

Performance: ResNet-18 (Summation)

 Summing the runtimes of all convolution operations:

[Bar chart: ResNet-18 accumulated relative speedup (higher is better): cuDNN 1.00, TVM 1.67, Woodpecker 2.04.]

SLIDE 24

Performance: Ant Financial Payment Model

[Bar chart: Ant Financial payment business model (higher is better) with dynamic batching enabled. Woodpecker's relative speedup over TensorRT (= 1) for batch sizes 1 to 16 ranges from 2.12× at batch size 1 down to roughly 1.2× to 1.3× at larger batch sizes.]

SLIDE 25

References

 S. Chetlur et al. (2014). cuDNN: Efficient Primitives for Deep Learning. arXiv:1410.0759v3.
 T. Chen et al. (2018). TVM: An Automated End-to-End Optimizing Compiler for Deep Learning. In Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI '18).
 E. Liang et al. (2018). RLlib: Abstractions for Distributed Reinforcement Learning. In Proceedings of the 35th International Conference on Machine Learning (ICML), PMLR 80:3053-3062.
 J. Ragan-Kelley et al. (2018). Halide: Decoupling Algorithms from Schedules for High-Performance Image Processing. Communications of the ACM, 61(1):106-115.
 NVIDIA TensorRT (2019). Programmable Inference Accelerator. https://developer.nvidia.com/tensorrt

SLIDE 26

Team Members

Jin, Yue; Zhang, Yao; Liu, Yongchao; Teng, Teng; Ou, Hang; Chen, Yong; Zhao, Rui

Thank You!