End to End Deep Learning Solution on Arm Architecture, Jan. 14 2019 - PowerPoint PPT Presentation




SLIDE 1

End to End Deep Learning Solution on Arm Architecture

Jan. 14 2019, Jammy Zhou
SLIDE 2

HPC and AI convergence

TOP500 Trend

More than 50 percent of the additional flops in the latest TOP500 rankings came from Nvidia Tesla GPUs, according to the TOP500 report. Half of the TOP10 systems use Nvidia GPUs, and 122 of the TOP500 systems do (64 systems use P100 GPUs, 46 use V100 GPUs, 12 use Kepler GPUs). More AI/ML/DL workloads are being added to HPC applications with the wide adoption of Nvidia GPUs.

Arm on the road

Astra at Sandia National Laboratories in the US is the first Arm-based supercomputer to enter the TOP500 list, ranked 203 in the latest rankings. There is good momentum for Arm-based supercomputers around the world: Post-K from Japan, Tianhe-3 from China, and Catalyst UK, GW4 Isambard and the CEA system from Europe. Arm SVE is enabled by Post-K together with the Tofu D interconnect and HBM2 memory, and will be used for some AI workloads. Besides Nvidia GPUs, there are other accelerator options on the market, for example AMD's MI60/MI50 Radeon Instinct GPUs, Xilinx and Intel FPGAs, customized ASIC products, etc.

SLIDE 3

HPC and AI in the Cloud

Cloud building blocks: CPU, accelerators, network and storage, with AI & ML services and HPC services on top.

Networking: 100 Gbps Ethernet, InfiniBand, Omni-Path, RDMA and RoCE. Storage: fast and scalable, such as NVMe-based local SSDs.

Arm on the road

  • Science Cloud with Arm-based HPC from HPC Systems (supporting HiSilicon Hi1616 and Marvell ThunderX2)
  • Amazon EC2 A1 instances based on the AWS Graviton 64-bit Arm processor for scale-out and Arm-based workloads
  • Continuous improvement of Arm Neoverse
  • Accelerators (GPUs, FPGAs, ASICs)
  • HPC & AI software stack (languages, frameworks, libraries, drivers, compilers, etc), multi-node distributed support and MPI

SLIDE 4

HW Diversity & SW Fragmentation

The stack spans DL frameworks (TensorFlow, Caffe, Caffe2, MXNet, Theano, CNTK, PyTorch, Keras, PaddlePaddle, Chainer, ...), big data analytics integrations (TensorFlowOnSpark, CaffeOnSpark, SparkFlow, ...), libraries (BLAS, FFT, RNG, SPARSE, Eigen, cuDNN, MIOpen, CMSIS-NN, ACL, ...), HAL and drivers, and hardware (CPU, GPU, FPGA, ASIC, DSP). Model formats are either framework specific or standardized (ONNX, NNEF), with deep learning compilers (TVM, Glow, XLA, ONNC, etc) in between, and framework support for multiple accelerators.

1. Difficult for application and algorithm developers to switch between frameworks
2. Different backends to maintain by framework developers for various accelerators
3. Multiple frameworks to support by chip and IP vendors, with duplicated efforts and out-of-tree support by forking the upstream
4. Multiple configurations to support by OEMs/ODMs and cloud vendors

SLIDE 5

Open Neural Network eXchange Ecosystem

Framework interoperability & hardware optimizations: the ONNX Format, ONNX Models, ONNXIFI, ONNX Runtime and ONNX Tools cover the create, convert, deploy and optimize stages.

SLIDE 6

ONNX Specifications

Neural-network-only ONNX

Defines an extensible computation graph model, built-in operators and standard data types. Supports only tensors as input/output data types.

ONNX-ML Extension

A classical machine learning extension. Also supports sequence and map data types, and extends the ONNX operator set with ML algorithms not based on neural networks.

ONNX v1.3 Released on Sep. 1st 2018

  • Control flow support
  • Functions (composable operators, experimental)
  • Enhanced shape inference
  • Additional optimization passes
  • ONNXIFI 1.0 (C backend interface for accelerators)

More to come...

  • Quantization
  • Test/compliance
  • Data pipelines
  • Edge/Mobile/IoT

SLIDE 7

ONNX Interface for Framework Integration

ONNXIFI

Standardized interface for NN inference on different accelerators. Runtime discovery and selection of execution backends, as well as of the ONNX operators supported on each backend. Supports the ONNX format and online model conversion.

ONNXIFI Backend

A combination of a software layer and a hardware device used to run an ONNX graph. The same software layer can expose multiple backends. A heterogeneous backend can distribute work across multiple device types internally.

Applications and frameworks load ONNX models through libonnxifi.so, which dispatches to vendor backend libraries, for example libonnxifi-glow.so (Glow), libonnxifi-a.so (Library A), libonnxifi-b.so (Library B), libonnxifi-c.dll (Library C), libonnxifi-d.dylib (Library D), and so on.
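The runtime discovery-and-selection flow that ONNXIFI standardizes can be sketched in a few lines of plain Python. The `Backend` class, the registry and `select_backend` below are hypothetical stand-ins for the real ONNXIFI C API and its vendor libraries, not actual ONNXIFI symbols; the sketch only illustrates the pattern of enumerating backends and matching them against a graph's operator set.

```python
from dataclasses import dataclass, field

@dataclass
class Backend:
    """Hypothetical stand-in for an ONNXIFI backend: a software layer
    plus the device it drives, advertising the ONNX ops it supports."""
    name: str
    device: str
    supported_ops: set = field(default_factory=set)

    def compatible(self, graph_ops):
        # A backend can run the graph if it covers every operator in it
        return graph_ops <= self.supported_ops

# Runtime discovery: real ONNXIFI enumerates vendor libraries
# (libonnxifi-*.so/.dll/.dylib); here we just register two in-process.
registry = [
    Backend("glow-cpu", "cpu", {"Conv", "Relu", "MaxPool", "Gemm"}),
    Backend("npu-vendor-a", "npu", {"Conv", "Relu"}),
]

def select_backend(graph_ops, prefer="npu"):
    """Pick a compatible backend, preferring a device type if possible."""
    candidates = [b for b in registry if b.compatible(graph_ops)]
    preferred = [b for b in candidates if b.device == prefer]
    return (preferred or candidates or [None])[0]

chosen = select_backend({"Conv", "Relu"})
```

With the registry above, a Conv+Relu graph lands on the NPU backend, while a graph that also needs Gemm falls back to the CPU backend.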

SLIDE 8

ONNX Runtime

A high-performance, cross-platform inference engine for ONNX models. It fully implements the ONNX specification, including the ONNX-ML extension. Arm platforms are supported on both Linux (experimental) and Windows.

Diagram from https://github.com/Microsoft/onnxruntime/blob/master/docs/HighLevelDesign.md. TensorRT and nGraph support are work in progress.

SLIDE 9

Machine Intelligence

A Linaro Strategic Initiative

Provide best-in-class deep learning performance by leveraging neural network acceleration in IP and SoCs from the Arm ecosystem, through seamless, collaborative integration with the ecosystem of AI/ML software frameworks and libraries.

SLIDE 10

Scope from HPC to microcontroller

HPC, Data Center & Cloud *

  • SVE-based optimization for DL frameworks & libraries
  • PCIe/CCIX-based heterogeneous accelerator support on Arm servers (drivers, compilers, framework integration, etc)
  • Scale-out support for distributed training

Edge node & device

  • Initial focus on inference support for Cortex-A SoCs
  • Common model description format and APIs to the runtime
  • Common optimized runtime inference engine for Arm-based SoCs
  • Plug-in framework to support multiple 3rd-party IPs (NPU, GPU, DSP, FPGA)
  • Continuous integration testing and benchmarking

Microcontroller *

  • CMSIS-NN optimized frameworks/libraries on RTOS
  • Frameworks like uTensor and TensorFlow Lite (quantization, footprint reduction, etc)
  • IP-based accelerator support & optimization

* under discussion
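The footprint-reduction work called out above (e.g. in TensorFlow Lite) typically starts with post-training quantization. The stdlib-only sketch below illustrates symmetric per-tensor int8 quantization in general, not TensorFlow Lite's actual implementation; the weight values are made up for the example.

```python
def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization: w is approximated by scale * q,
    with q clamped to [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 codes."""
    return [scale * v for v in q]

# Example: four float32 weights shrink to four int8 codes plus one scale
w = [0.5, -1.27, 0.0, 1.0]
q, s = quantize_int8(w)
```

Storing int8 codes instead of float32 weights cuts model size by roughly 4x, which is what makes microcontroller-class deployment practical.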

SLIDE 11

ArmNN-based collaborations - ongoing

https://developer.arm.com/products/processors/machine-learning/arm-nn https://community.arm.com/tools/b/blog/posts/arm-nn-the-easy-way-to-deploy-edge-ml

A good base for future collaborations:

  • 100 man-years of effort, 340,000 lines of code
  • Shipping in over 200 million Android devices, based on estimation
  • Impressive performance uplift from software-only improvements over a period of 6 months

SLIDE 12

Thanks!