End to End Deep Learning Solution
- on Arm Architecture
- Jan. 14 2019, Jammy Zhou
HPC and AI Convergence
TOP500 Trend
- More than 50 percent of the additional FLOPS in the latest TOP500 rankings came from Nvidia Tesla GPUs, according to the TOP500 report
- Half of the TOP10 systems and 122 of the TOP500 systems use Nvidia GPUs (64 systems use P100, 46 use V100, 12 use Kepler)
- More AI/ML/DL workloads are being added to HPC applications, along with wide adoption of Nvidia GPUs
Arm on the Road
- Astra at Sandia National Laboratories (US) is the first Arm-based supercomputer to enter the TOP500 list, ranked 203 in the latest ranking
- Good momentum for Arm-based supercomputers around the world: Post-K from Japan, Tianhe-3 from China, and Catalyst UK, GW4 Isambard and the CEA system from Europe
- Arm SVE is enabled on Post-K together with the Tofu D interconnect and HBM2 memory, and will be used for some AI workloads
- Besides Nvidia GPUs, other accelerator options are on the market, e.g. AMD Radeon Instinct MI60/MI50 GPUs, Xilinx and Intel FPGAs, and custom ASIC products
- Infrastructure layers: CPU, Accelerator, Network, Storage; services: AI & ML Services, HPC Services
- Network: 100 Gbps Ethernet, InfiniBand, Omni-Path, RDMA and RoCE
- Storage: fast and scalable, such as NVMe-based local SSDs
- Science Cloud with Arm-based HPC from HPC Systems (supporting HiSilicon Hi1616 and Marvell ThunderX2)
- Amazon EC2 A1 instances based on the AWS Graviton 64-bit Arm processor, for scale-out and Arm-based workloads
- Continuous improvement of Arm Neoverse
- Accelerators (GPUs, FPGAs, ASICs)
- HPC & AI software stack (languages, frameworks, libraries, drivers, compilers, etc.), multi-node distributed support and MPI
DL Software Stack
- DL Frameworks: TensorFlow, Caffe, Caffe2, MXNet, Theano, CNTK, PaddlePaddle, PyTorch, Keras, Chainer...
- Big Data Analytics integrations: TensorFlowOnSpark, CaffeOnSpark, SparkFlow...
- Libraries: BLAS, FFT, RNG, SPARSE, Eigen, cuDNN, MIOpen, CMSIS-NN, ACL...
- HAL and Drivers
- Hardware: CPU, GPU, FPGA, ASIC, DSP
- Model Formats: framework-specific, ONNX, NNEF
- Deep Learning Compilers: TVM, Glow, XLA, ONNC, etc.
- Framework support for multiple accelerators

Challenges:
1. Difficult for application and algorithm developers to switch between frameworks
2. Framework developers must maintain different backends for various accelerators
3. Chip and IP vendors must support multiple frameworks with duplicated effort, and maintain out-of-tree support by forking upstream
4. OEMs/ODMs and cloud vendors must support multiple configurations
ONNX: Framework Interoperability & Hardware Optimizations
- Components: ONNX Format, ONNX Models, ONNXIFI, ONNX Runtime, ONNX Tools
- Workflow: Create, Convert, Deploy, Optimize
ONNX Format
- Defines an extensible computation graph model, built-in operators and standard data types
- Supports only tensors as input/output data types
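The computation-graph idea can be illustrated with a minimal sketch in plain Python. This is not the ONNX API; the names here (`TensorInfo`, `Node`, `Graph`, the tiny operator registry) are hypothetical stand-ins showing how a model is a list of operator nodes wired together by named tensor values:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class TensorInfo:          # a named graph input/output with a declared shape
    name: str
    shape: List[int]

@dataclass
class Node:                # one operator invocation in the graph
    op_type: str
    inputs: List[str]
    outputs: List[str]

@dataclass
class Graph:
    nodes: List[Node]      # assumed to be in topological order
    inputs: List[TensorInfo]
    outputs: List[TensorInfo]

# "Built-in operator set": op_type -> implementation (toy 1-D tensors as lists)
OPS: Dict[str, Callable] = {
    "Relu": lambda x: [v if v > 0 else 0.0 for v in x],
    "Neg":  lambda x: [-v for v in x],
}

def run(graph: Graph, feeds: Dict[str, list]) -> Dict[str, list]:
    """Evaluate the graph by walking nodes in order, passing values by name."""
    env = dict(feeds)
    for node in graph.nodes:
        args = [env[name] for name in node.inputs]
        env[node.outputs[0]] = OPS[node.op_type](*args)
    return {t.name: env[t.name] for t in graph.outputs}

# X -> Relu -> T -> Neg -> Y
g = Graph(
    nodes=[Node("Relu", ["X"], ["T"]), Node("Neg", ["T"], ["Y"])],
    inputs=[TensorInfo("X", [3])],
    outputs=[TensorInfo("Y", [3])],
)
print(run(g, {"X": [-1.0, 0.5, 2.0]}))  # {'Y': [-0.0, -0.5, -2.0]}
```

The point of the standard format is that any backend implementing the agreed operator set can execute the same serialized graph.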
ONNX-ML (classical machine learning extension)
- Also supports the sequence and map data types
- Extends the ONNX operator set with ML algorithms not based on neural networks
- Control flow support
- Functions (composable operators, experimental)
- Enhanced shape inference
- Additional optimization passes
- ONNXIFI 1.0 (C backend interface for accelerators)
- Quantization
- Test/Compliance
- Data pipelines
- Edge/Mobile/IoT
ONNXIFI
- Standardized interface for NN inference on different accelerators
- Runtime discovery and selection of execution backends, as well as the ONNX operators they support
- Supports the ONNX format & online model conversion
- A backend is a combination of a software layer and a hardware device used to run an ONNX graph
- The same software layer can expose multiple backends
- A heterogeneous backend can distribute work across multiple device types internally
- Dispatch: applications, frameworks and ONNX models call into libonnxifi.so, which routes to per-vendor backend libraries: libonnxifi-glow.so (Glow), libonnxifi-a.so (Library A), libonnxifi-b.so (Library B), libonnxifi-c.dll (Library C), libonnxifi-d.dylib (Library D), ...
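The discover-then-select pattern behind this dispatch can be sketched in Python. This is only an illustration of the idea, not the real ONNXIFI C API: the library file names follow the diagram above, but `discover_backend_libs`, `select_backend` and the capability dictionary are invented for the sketch (a real backend reports its supported operators through the ONNXIFI interface itself):

```python
import fnmatch
import os

# Platform-specific naming for backend libraries, as in the diagram
LIB_PATTERNS = ("libonnxifi-*.so", "libonnxifi-*.dll", "libonnxifi-*.dylib")

def discover_backend_libs(search_dirs):
    """Scan directories for vendor backend libraries (hypothetical loader step)."""
    found = []
    for d in search_dirs:
        try:
            names = os.listdir(d)
        except OSError:
            continue  # skip missing or unreadable directories
        for name in sorted(names):
            if any(fnmatch.fnmatch(name, p) for p in LIB_PATTERNS):
                found.append(os.path.join(d, name))
    return found

def select_backend(capabilities, required_ops):
    """Pick the first backend whose reported operator set covers the graph.

    `capabilities` maps backend name -> set of supported ONNX operators.
    """
    for name, ops in capabilities.items():
        if required_ops <= ops:
            return name
    return None  # no accelerator fits; fall back to a reference/CPU path

caps = {"glow": {"Conv", "Relu", "Gemm"}, "library-a": {"Relu"}}
print(select_backend(caps, {"Conv", "Relu"}))  # glow
```

Runtime discovery is what lets one application binary use whichever accelerator backends happen to be installed on the machine.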
ONNX Runtime
- Diagram from https://github.com/Microsoft/onnxruntime/blob/master/docs/HighLevelDesign.md
- TensorRT and nGraph support are work in progress
- SVE-based optimization for DL frameworks & libraries
- PCIe/CCIX-based heterogeneous accelerator support (integration, etc.)
- Scale-out support for distributed training
Arm NN
- Initial focus on inference support for Cortex-A SoCs
- Common model description format and APIs to the runtime
- Common optimized runtime inference engine for Arm-based SoCs
- Plug-in framework to support multiple 3rd-party IPs (NPU, GPU, DSP, FPGA)
- Continuous integration testing and benchmarking
- CMSIS-NN optimized frameworks/libraries on RTOS
- Frameworks like uTensor and TensorFlow Lite (quantization, footprint reduction, etc.)
- IP-based accelerator support & optimization (* under discussion)
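The quantization mentioned here means running inference in low-precision integers instead of float32. As a minimal sketch of the TensorFlow Lite-style affine uint8 scheme (the helper names are made up for illustration; CMSIS-NN itself works on fixed-point q7/q15 data):

```python
def choose_qparams(xmin, xmax, qmin=0, qmax=255):
    """Pick scale/zero-point so [xmin, xmax] maps onto the uint8 range."""
    xmin, xmax = min(xmin, 0.0), max(xmax, 0.0)   # range must include 0.0
    scale = (xmax - xmin) / (qmax - qmin)
    zero_point = qmin + round(-xmin / scale)
    return scale, zero_point

def quantize(xs, scale, zero_point, qmin=0, qmax=255):
    # real value x  ->  integer q = clamp(round(x / scale) + zero_point)
    return [max(qmin, min(qmax, round(x / scale) + zero_point)) for x in xs]

def dequantize(qs, scale, zero_point):
    # integer q  ->  approximate real value (q - zero_point) * scale
    return [(q - zero_point) * scale for q in qs]

xs = [-1.0, 0.0, 0.5, 2.0]
scale, zp = choose_qparams(min(xs), max(xs))
qs = quantize(xs, scale, zp)
print(qs, dequantize(qs, scale, zp))
```

Each value is recovered to within half a quantization step (scale/2), which is why 8-bit inference works well on MCUs while shrinking both model footprint and compute cost.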
https://developer.arm.com/products/processors/machine-learning/arm-nn https://community.arm.com/tools/b/blog/posts/arm-nn-the-easy-way-to-deploy-edge-ml
A good base for future collaborations:
- 100 man-years of effort, 340,000 lines of code
- Shipping in over 200 million Android devices
- Impressive performance uplift from software-only improvements over a period of 6 months