
Early Experience in Benchmarking Edge AI Processors with Object Detection Workloads



  1. Early Experience in Benchmarking Edge AI Processors with Object Detection Workloads
     Bench 2019
     Yujie Hui (1), Jeffrey Lien (2), and Xiaoyi Lu (1)
     (1) Department of Computer Science and Engineering, The Ohio State University, {hui.82, lu.932}@osu.edu
     (2) NovuMind Inc., jlien@novumind.com
     The Ohio State University

  2. Overview
     • Introduction
     • Overview of Edge AI Processors
     • Benchmarking Methodology
     • Evaluation
     • Conclusion

  3. Edge Computing
     • Store and process data closer to the location where it is needed
     • Deliver low latency to end users
     [Figure: edge computing diagram with DATA and APP nodes placed around the network]

  4. Artificial Intelligence at the Edge
     • Inference is moving to the edge
       ❖ Heavy workloads in datacenters
       ❖ Less computationally demanding
       ❖ Low power consumption
       ❖ Low cost
     [Figure: the Data, Features, Training, Evaluation, Inference pipeline, labeled Datacenter (e.g., GPU) and Edge Devices]

  5. Killer Applications for AI@Edge – Object Detection
     • Object Detection:
       ❖ Higher resolution of input images
       ❖ Larger output tensors
       ❖ More complicated tasks
     [Figure: pie chart "Machine Learning Use Cases in Facebook" with slices for Recommendation, Face ID, RNN ASR, RNN Translator, Image Classification, Object Detection, and Object Segmentation]
     Wu et al., Machine Learning at Facebook: Understanding Inference at Edge, HPCA-2019
     C. Wu, At-Scale Infrastructure Challenges for Machine Learning, IISWC-2019 (Invited Talk)

  6. Object Detection Workloads - Demo
     Real-life applications:
       ❖ Self-driving cars
       ❖ Tracking objects
       ❖ Face detection
       ❖ Pedestrian detection
       ❖ Medical imaging
       ❖ Robotics
     Low-latency, high-accuracy inference needs high-performance edge devices!

  7. Overview
     • Introduction
     • Overview of Edge AI Processors
       • Edge TPU
       • NVIDIA Xavier
       • NovuTensor
     • Benchmarking Methodology
     • Evaluation
     • Conclusion

  8. Edge AI Processors - EdgeTPU
     • A single-board computer
     • On-board Edge TPU coprocessor capable of performing 4 TOPS
     • 1 GB LPDDR4 memory
     • Precision: INT8
     • Power: 2.5 watts
     • Supports TensorFlow Lite models
     https://coral.withgoogle.com/products/dev-board

  9. Edge AI Processors - Xavier
     • Volta GPU with 512 CUDA cores
     • TOPS: 22.6/11.3/1.3
     • 16 GB LPDDR4X memory
     • Precision: INT8/FP16/FP32
     • Power: 10/15/30 watts
     • Supports CUDA, cuDNN, TensorRT
     https://developer.nvidia.com/embedded/jetson-agx-xavier-developer-kit

  10. Edge AI Processors – NovuTensor
      • Domain-specific architecture focused on 3D tensor computation
      • 2 GB DDR4 memory
      • Precision: INT8
      • TOPS: 15
      • Power: 20 watts
      • Supports PyTorch
      [Figure: NovuTensor's 3D operation [1], with labels Data Tensor, Weight Tensor, Tensor Convolution, and Output Tensor]
      [1] https://patentscope.wipo.int/search/en/detail.jsf?docId=US225521272&tab=NATIONALBIBLIO
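The peak figures quoted on the three device slides can be put on a common footing by dividing TOPS by power. The sketch below does only that arithmetic. It assumes that Xavier's 22.6 INT8 TOPS goes with its 30-watt mode, which the slides do not state, and the function names are illustrative. These are nominal peak ratios, not measured results.

```python
# Peak figures as quoted on the device slides.
# Assumption: Xavier's 22.6 INT8 TOPS is paired with its 30 W mode.
SPECS = {
    "EdgeTPU": {"tops": 4.0, "watts": 2.5},
    "Xavier": {"tops": 22.6, "watts": 30.0},
    "NovuTensor": {"tops": 15.0, "watts": 20.0},
}


def tops_per_watt(tops: float, watts: float) -> float:
    """Return the nominal peak TOPS delivered per watt."""
    if watts <= 0:
        raise ValueError("watts must be positive")
    return tops / watts


def rank_by_tops_per_watt(specs):
    """Return (device, TOPS/W) pairs sorted from highest to lowest."""
    rows = [(name, tops_per_watt(s["tops"], s["watts"])) for name, s in specs.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


for name, value in rank_by_tops_per_watt(SPECS):
    print(f"{name}: {value:.2f} TOPS/W")
```

Peak TOPS per watt says nothing about how well a given model maps onto the hardware, which is why the talk measures images/sec/watt on real workloads instead.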

  11. Challenges of Benchmarking Edge AI Processors
      • Challenge-1: Workload Selection
        ❖ What are the representative models and datasets for benchmarking edge AI processors with object detection workloads?
      • Challenge-2: Deployment
        ❖ How to deploy deep neural networks on edge devices, given that each edge device needs a specific framework?
      • Challenge-3: Metrics and Dimensions
        ❖ How to select an essential set of metrics and dimensions to comprehensively evaluate edge AI devices?

  12. Overview
      • Introduction
      • Overview of Edge AI Processors
      • Benchmarking Methodology
        • Workload and Dataset Selection
        • Deployment Experience
        • Metrics and Dimensions Selection
      • Evaluation
      • Conclusion

  13. Object Detection Workloads – YOLOv2
      • A real-time object detection system that reports which objects appear in the input
      • Tiny-YOLO is a lightweight version of YOLOv2
      • Based on the Darknet framework; can detect objects in an image or a video
      • Built on the Darknet-19 neural network
      [Figure: Darknet-19]
      https://pjreddie.com/darknet/yolov2/
      YOLO9000: Better, Faster, Stronger. Joseph Redmon, Ali Farhadi

  14. Object Detection Workloads – MS COCO
      • 330K images (>200K labeled)
      • 1.5 million object instances
      • 80 object categories
      ❖ Images contain rich information, with many objects per image
      ❖ Large number of instances per category
      [Figure: Microsoft COCO dataset examples]
      http://cocodataset.org/#home
      Microsoft COCO: Common Objects in Context. Lin et al.

  15. Deployment Experience
      • EdgeTPU
        ❖ Retrain the model using the ReLU activation function
        ❖ Modify the weights of the first convolutional layer
        ❖ DarkFlow [1]: 32-bit float numbers to a TensorFlow model
        ❖ Post-training integer quantization [2] with calibration data: TensorFlow Lite model with 8-bit fixed numbers
        ❖ EdgeTPU compiler: Edge TPU model for the Edge TPU device
      • Xavier
        ❖ NVIDIA's deepstream reference applications [3]
        ❖ TensorRT 5.0.3
        ❖ 15-watt and 30-watt modes
      • NovuTensor
        ❖ NovuSDK
      [1] https://github.com/thtrieu/darkflow
      [2] https://medium.com/tensorflow/tensorflow-model-optimization-toolkit-post-training-integer-quantization-b4964a1ea9ba
      [3] https://github.com/NVIDIA-AI-IOT/deepstream_reference_apps
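To give a feel for what post-training integer quantization with calibration data involves, here is a minimal pure-Python sketch of generic affine int8 quantization. It is not the TensorFlow Lite, TensorRT, or NovuSDK implementation; the function names and the calibration values are hypothetical, and real toolchains quantize per tensor or per channel with more elaborate range estimation.

```python
def calibrate(calibration_values):
    """Derive an affine int8 scale and zero point from calibration data."""
    # The representable range is widened to include 0.0 so that zero maps
    # exactly onto an integer code.
    lo = min(min(calibration_values), 0.0)
    hi = max(max(calibration_values), 0.0)
    scale = (hi - lo) / 255.0
    if scale == 0.0:
        scale = 1.0  # constant all-zero data: any scale works
    zero_point = round(-128 - lo / scale)
    return scale, zero_point


def quantize(x, scale, zero_point):
    """Map a float to an int8 code, clamping to [-128, 127]."""
    q = round(x / scale) + zero_point
    return max(-128, min(127, q))


def dequantize(q, scale, zero_point):
    """Map an int8 code back to its approximate float value."""
    return (q - zero_point) * scale


# Hypothetical calibration samples, for illustration only.
scale, zero_point = calibrate([-1.0, 0.5, 3.0])
for value in (0.0, 1.0, 2.5):
    restored = dequantize(quantize(value, scale, zero_point), scale, zero_point)
    print(f"{value} -> {restored:.5f} (error {abs(restored - value):.5f})")
```

The round-trip error of up to half a quantization step per value is one source of the small accuracy differences between the 8-bit devices and the floating-point baseline.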

  16. Metrics and Dimensions
      • Accuracy: Mean Average Precision
        $AP = \int_0^1 p(r)\,dr$
        $mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i$
      • Latency (ms): execution time
        ❖ Preprocess
        ❖ Execution
        ❖ Postprocess
      • Energy Efficiency (images/sec/watt): number of input images that can be fully processed per second per unit of power

  17. Overview
      • Introduction
      • Overview of Edge AI Processors
      • Benchmarking Methodology
      • Evaluation
      • Conclusion

  18. Accuracy Dimension
      • Accuracy: Mean Average Precision
        $AP = \int_0^1 p(r)\,dr$
        $mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i$
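The mean average precision metric named on this slide can be sketched in a few lines. The code below approximates the area under the precision-recall curve with a plain Riemann sum over recall steps and then averages the per-class values. It is an illustrative sketch with hypothetical function names; evaluation toolkits commonly interpolate precision before integrating, which this sketch does not do.

```python
def average_precision(recalls, precisions):
    """Approximate AP, the integral of p(r) over recall in [0, 1].

    `recalls` must be non-decreasing; each precision is paired with the
    recall at the same index.
    """
    if len(recalls) != len(precisions):
        raise ValueError("recalls and precisions must have the same length")
    ap = 0.0
    prev_recall = 0.0
    for recall, precision in zip(recalls, precisions):
        if recall < prev_recall:
            raise ValueError("recalls must be non-decreasing")
        # Width of the recall step times the precision reached at its end.
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


def mean_average_precision(ap_per_class):
    """Average the per-class AP values: mAP = (1/N) * sum of AP_i."""
    if not ap_per_class:
        raise ValueError("need at least one class")
    return sum(ap_per_class) / len(ap_per_class)


# Hypothetical two-class example.
ap_a = average_precision([0.5, 1.0], [1.0, 0.5])
ap_b = average_precision([0.5, 1.0], [0.5, 0.0])
print(ap_a, ap_b, mean_average_precision([ap_a, ap_b]))
```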

  19. Evaluation Results - Accuracy
      [Figure: bar chart of mAP (axis from 0 to 0.6) for Tiny-YOLO and YOLOv2 on Edge TPU, Xavier 15w, Xavier MAXW, NovuTensor, 1080Ti+TensorRT, and 1080Ti. Performance running YOLOv2 and Tiny-YOLO with 416x416 input images]
      • The edge AI processors provide accurate results, with a 1% to 3% accuracy difference due to lower-precision arithmetic
      • Accuracy degradation differs across devices because their quantization implementations differ

  20. Latency Dimension
      • Latency (ms): execution time
        ❖ Preprocess
        ❖ Execution
        ❖ Postprocess
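Since the latency metric is split into preprocess, execution, and postprocess stages, a measurement harness needs a timer around each stage. The following is a minimal sketch of such a harness using only the Python standard library. The function name `time_stages`, the warm-up count, and the stand-in stage callables are assumptions for illustration; the slides do not describe the authors' actual harness.

```python
import time
from statistics import mean


def time_stages(preprocess, execute, postprocess, inputs, warmup=1):
    """Return the mean per-image latency in ms for each stage and in total."""
    inputs = list(inputs)
    if not inputs:
        raise ValueError("need at least one input")

    # Warm-up runs are executed but not timed.
    for item in inputs[:warmup]:
        postprocess(execute(preprocess(item)))

    samples = {"preprocess": [], "execution": [], "postprocess": []}
    for item in inputs:
        t0 = time.perf_counter()
        prepared = preprocess(item)
        t1 = time.perf_counter()
        raw_output = execute(prepared)
        t2 = time.perf_counter()
        postprocess(raw_output)
        t3 = time.perf_counter()
        samples["preprocess"].append((t1 - t0) * 1e3)
        samples["execution"].append((t2 - t1) * 1e3)
        samples["postprocess"].append((t3 - t2) * 1e3)

    result = {stage: mean(values) for stage, values in samples.items()}
    result["total"] = sum(result.values())
    return result


# Stand-in stages; a real run would resize the image, invoke the device
# runtime, and decode bounding boxes.
report = time_stages(lambda x: x, lambda x: x * 2, lambda y: y, [1, 2, 3])
print(report)
```

On an accelerator with asynchronous execution, the `execute` callable would also have to wait for the device to finish before returning, otherwise the execution stage is under-reported.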

  21. Evaluation Results - Latency
      [Figure: bar chart of latency in ms (axis from 0 to 100) for Tiny-YOLO and YOLOv2 on Edge TPU, Xavier 15w, Xavier MAXW, NovuTensor, 1080Ti+TensorRT, and 1080Ti. Performance running YOLOv2 and Tiny-YOLO with 416x416 input images]
      ❖ EdgeTPU is 9.5X and 14.79X slower than the GPU when running Tiny-YOLO and YOLOv2, respectively
      ❖ NovuTensor and Xavier are 4.66X - 6.08X slower than the GPU
      ❖ Xavier in the max power mode is 2X and 5.28X faster than EdgeTPU
      ❖ NovuTensor is 2.04X and 3.8X faster than EdgeTPU for YOLOv2 and Tiny-YOLO, respectively

  22. Energy Efficiency Dimension
      • Energy Efficiency (images/sec/watt): number of input images that can be fully processed per second per unit of power
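The images/sec/watt metric follows directly from per-image latency and power draw. The sketch below works through that arithmetic. It assumes throughput is derived from single-image latency with no pipelining and that power is a nominal figure rather than a measurement. The function name and the example numbers are hypothetical and are not taken from the results.

```python
def images_per_sec_per_watt(latency_ms: float, power_watts: float,
                            batch_size: int = 1) -> float:
    """Energy efficiency as images processed per second per watt."""
    if latency_ms <= 0 or power_watts <= 0:
        raise ValueError("latency and power must be positive")
    throughput = batch_size * 1000.0 / latency_ms  # images per second
    return throughput / power_watts


# Hypothetical devices: a slow low-power part versus a fast high-power part.
low_power = images_per_sec_per_watt(latency_ms=50.0, power_watts=2.5)
high_power = images_per_sec_per_watt(latency_ms=10.0, power_watts=250.0)
print(low_power, high_power)
```

With these made-up numbers the slower device is 20 times more energy efficient, which illustrates how a processor can lose on latency yet win on this metric.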

  23. Evaluation Results – Energy Efficiency
      [Figure: bar chart of energy efficiency in image/sec/watt (axis from 0 to 15) for Tiny-YOLO and YOLOv2 on Edge TPU, Xavier 15w, Xavier MAXW, NovuTensor, 1080Ti+TensorRT, and 1080Ti. Performance running YOLOv2 and Tiny-YOLO with 416x416 input images]
      ❖ All edge AI processors have higher energy efficiency than the GPU due to their low power consumption
      ❖ EdgeTPU delivers 2.9X and 1.13X higher energy efficiency than Xavier, and 1.96X and 1.04X higher than NovuTensor

  24. Evaluation Results – Large Images
      [Figure: (a) Latency (ms) and (b) Energy Efficiency (image/sec/watt). Performance running YOLOv2 and Tiny-YOLO with 1024x1024 input images]
      • Xavier in the 15-watt mode delivers the best energy efficiency
      • 1080Ti using TensorRT achieves the lowest latency
