AI Compiler @ Alibaba Xiaoyong Liu Presenting the work of many - - PowerPoint PPT Presentation

▶

Sep 08, 2023 1.72k likes •1.97k views

AI Compiler @ Alibaba Xiaoyong Liu Presenting the work of many people PAI (Platform of AI) Alibaba Cloud Intelligence AI Compiler Stack PAI PAI EAS PAI Tensorflow PAI Blade AI Compiler High Performance libraries TVM PAI TAO CPU

SLIDE 1

AI Compiler @ Alibaba

Xiaoyong Liu

PAI (Platform of AI) Alibaba Cloud Intelligence Presenting the work of many people

SLIDE 2

AI Compiler Stack

PAI

AI Compiler

PAI EAS TVM High Performance libraries PAI TAO PAI Tensorflow CPU GPU ASIC FPGA PAI Blade

SLIDE 3

How TVM is used @ Alibaba

An End-to-End Deep Learning Compiler

Ø Empower AI service Ø Generate high performance operators

subgraph & kernel
heterogenous computing
An Optimizer & Compiler

Ø Enable chips such as CPU, GPU, DSP & etc. potentially FPGA, AI Chips. Ø Deploy algorithm automatically

All Scenarios

Ø Cloud, Edge & IoT Ø Training & Inference

SLIDE 4

TVM + AIServicePAI-Blade

SLIDE 5

Things We Experienced

Current approach is too much engineering effort , difficult for platform service
TVM is good at

Ø To generate high-performance computing intensive kernels

Automatic is the key

Ø Heterogenous hardware friendly, if ISA is provided

Performance portability

Ø Software Architect friendly to Auto TVM / Schedule… Ø Whole-Graph Optimization

Challenges

Ø Easy of deployment, including coverage, quality & compatibility

Correctness, Performance & easy of new device enabling

Ø Systems don’t interop Ø Maturity/Standardization …

SLIDE 6

Contributed to TVM Community

Automatic Tensor Core Scheduling

Ø Nvidia tensor core in V100/T4

Schedule Algorithm enablement, as Batch Matmul and etc.

Ø Know what/how

Support TFLite models

Ø Automatically

C++ RPC Server

Ø Tuning your program in embedded environment without python

NMT

SLIDE 7

Ongoing Effort to Community

Automatic Tensor Core Scheduling Enhancement

Ø vthread supporting

Operators: New Ops in Relay / TVM

Ø HashTable Embedding…

Standardize GraphRuntime Exports Into a Single DLL

Ø A way to unify runtime models exports

SLIDE 8

Product-driven TVM enhancement

Brings online inference service
Compiles heterogenous hardware at cloud & edge

Ø Nvidia server GPU

V100/T4 on FP16/INT8/INT4/INT1

Ø Intel X86 Server CPU

on INT8/FP32/BF16

Ø ARM64 CPU

on INT8 / FP32

Ø ARM32 CPU

Any general solution is planed to contribute back to TVM Community!

Enhances Infrastructure

Ø HIFI4 DSP Ø Hexagon DSP Ø PowerVR GPU Ø Intel GPU

SLIDE 9

TVM produced more performance

Ø Chip supplier’s latest manually optimized high performance library Ø Assembly-level optimized edge machine learning framework

Optimized to gain decent performance on various products

Ø Server Nvidia GPU Ø Edge Arm64 Ø IoT Arm32

Automatic TC Scheduling + tensorization + tensorcore

SLIDE 10

Performance on V100 (FP16)

M, N, K cuBLAS TensorCore TVM TensorCore speedup 512, 16, 512 7.7470us 5.2570us 1.47X 512, 32, 512 8.0140us 6.0220us 1.33X 512, 64, 512 8.7530us 6.2390us 1.40X 512, 128, 512 9.0290us 7.1610us 1.26X 256, 256, 256 6.9380us 4.5930us 1.51X 1024, 32, 512 8.3320us 6.3770us 1.30X 2048, 32, 512 9.0640us 7.5070us 1.21X

SLIDE 11

Performance on T4

SLIDE 12

AliOS enhances TVM on vehicles

To accelerate NLU and AR-Nav models
ARM64 CPU performance on INT8 / FP32

Ø NHWC/img2col+pack/no tensorize&co-optimized with llvm Ø Planning to contribute back to community

Hexagon DSP

Ø vrmpy tensorize/llvm-codegen Ø Could run end-to-end Mobilenet V2 INT8 model

Intel GPU

Ø Schedule algorithm Ø Boost 1.6X performance of Lanenet model

SLIDE 13

Performance on ARM64 INT8

1.00 1.00 1.00 1.00 1.00 1.00 1.11 1.32 1.29 1.60 1.23 1.46 3.83 1.61 2.08 8.87

0.00 1.00 2.00 3.00 4.00 5.00 6.00 7.00 8.00 9.00 10.00

MobileNetV1 MobileNetV2 LaneNet Performance Comparison @ rasp 3b+ AARCH64 TFLite 1 core TFLite 4 core QNNPACK 1 core QNNPACK 4 core TVM 1 core TVM 4 core

SLIDE 14

Performance on ARM64 FP32

1.07 1.03 1.17 1.13

0.95 1 1.05 1.1 1.15 1.2

Mobilenet V1 Mobilenet V2 Performance Comparison AARCH64 TVM / MNN @ A53 TVM / MNN @ A72

SLIDE 15

AI Labs Compiles TMallGenie Models

ARM32 CPU

Ø Overflow-Aware Quantization (INT16 = INT8 * INT8) Ø GEMM Tensorize

HIFI4 DSP

Ø GEMM Tensorize, 10X speed up

PowerVR GPU

Ø Schedule Algorithm

SLIDE 16

Performance

322 336 230 214 140 50 100 150 200 250 300 350 400 1 TF Lite 8bit NCNN 8bit QNNPACK 8bit MNN 8bit ACE Overflow-aware (Assembly)

CPUMTK8167SARM32 A35 1.5GHz ModelMobileNetV2_1.0_224

SLIDE 17

A DL Compiler in T-HEAD SoC

TVM has been integrated into WuJian() SoC toolchain
Support Caffe Frontend

Ø Tested pass alexnet / resnet 50 / mobilenet v1 / mobilenet v2 / …

Tensorflow Caffe

TVM

T-HEAD NN LLVM

WuJian SoC

Customized AI Accelerator

SLIDE 18

TVM Roadmap @ Alibaba

Keep contributing general effort back to community
Auto Schedule (“with Berkeley team”)

Ø Auto* is the key to build machine-learning-powered system

Interpolate with top frameworks
Auto heterogenous hardware placement in system level
Infra Maturity

Ø Completeness & Seamless Deployment, as quantization, model compatibility

Workload Characterization

Ø To improve the key workloads within community

AI Service & Operators

Ø more chips, more models

SLIDE 19

Alibaba & OpenSource

Embrace OpenSource Contribute OpenSource Win-Win OpenSource

SLIDE 20

Takeaways

A golden age of deep learning compiler
Industry-grade deep-learning compilation solution is still in evolution
We are working to contribute to TVM

Ø Development & Research

Welcome to join us to contribute to TVM together

Ø Xiaoyong Liu (xiaoyong.liu@Alibaba-inc.com)

SLIDE 21

Thank you!