Deep Learning Acceleration via Low Precision Computing

SLIDE 1

Zhaoxia (Summer) Deng
AI System Co-design @ Facebook

Deep Learning Acceleration via Low Precision Computing

SLIDE 2

Team Introduction

  • AI System Co-design team mission:
    • AI application-driven SW & HW co-design through:
      • High performance numerical and architectural optimizations
      • HW performance modeling and simulations
  • Expertise:
    • HPC and parallel algorithms
    • Computer architecture
    • Performance optimization and modeling
    • Numerical linear algebra, ML, and graph analytics

SLIDE 3

Agenda

  • Facebook AI workload characteristics
  • Low precision computing
    • Reduced precision floating point optimization
    • Fixed point quantization
  • AI system co-design for low precision computing
    • Model co-design
    • Hardware co-design

SLIDE 4

Agenda

  • Facebook AI workload characteristics
  • Low precision computing
    • Reduced precision floating point optimization
    • Fixed point quantization
  • AI system co-design for low precision computing
    • Model co-design
    • Hardware co-design

SLIDE 5

AI Growth and Its Drivers

  • Big and better data
  • Better algorithms
  • More compute

SLIDE 6

AI Driven Services at Facebook

Figure credit: Misha Smelyanskiy

SLIDE 7

AI Execution Flow

[Flow diagram: Data → Features → Training → Eval → Model → Inference → Predictions]

SLIDE 8

AI Inference in Facebook Datacenters

SLIDE 9

Workload characteristics

| Category        | Model Types                   | Model Size (# params) | Max. Live Activations | Op. Intensity (w.r.t. weights) | Op. Intensity (w.r.t. act & weights) |
|-----------------|-------------------------------|-----------------------|-----------------------|--------------------------------|--------------------------------------|
| Recommendation  | FCs                           | 1-10M                 | > 10K                 | 20-200                         | 20-200                               |
| Recommendation  | Embeddings                    | > 10 Billion          | > 10K                 | 1-2                            | 1-2                                  |
| Computer Vision | ResNeXt101-32x4-48            | 43-829M               | 2-29M                 | avg. 380, min. 100             | avg. 188, min. 28                    |
| Computer Vision | Faster-RCNN (with ShuffleNet) | 6M                    | 13M                   | avg. 3.5K, min. 2.5K           | avg. 145, min. 4                     |
| Computer Vision | ResNeXt3D-101                 | 21M                   | 58M                   | avg. 22K, min. 2K              | avg. 172, min. 6                     |
| Language        | seq2seq                       | 100M-1B               | > 100K                | 2-20                           | 2-20                                 |

Source: Deep Learning Inference in Facebook Data Centers: Characterization, Performance Optimizations and Hardware Implications, https://arxiv.org/abs/1811.09886

SLIDE 10

Efficient AI inference challenges

  • Capacity crunch
  • Realtime model serving efficiency
  • Scale to billions of users

[Figures: increase of server capacity (credit: Xiaodong Wang); accuracy vs. capacity]

SLIDE 11

Agenda

  • Facebook AI workload characteristics
  • Low precision computing
    • Reduced precision floating point optimization
    • Fixed point quantization
  • AI system co-design for low precision computing
    • Model co-design
    • Hardware co-design

SLIDE 12

Low-precision computing

  • Default precision: fp32
  • Reduced-precision floating point: fp16, bf16, fp8, etc.
  • Fixed point quantization: int8, int4, etc.
  • Others: posits (Gustafson 2016), logarithmic, k-means, etc.

Example of reduced precision representations:

| Format | Sign  | Exponent | Fraction |
|--------|-------|----------|----------|
| fp32   | 1 bit | 8 bits   | 23 bits  |
| bf16   | 1 bit | 8 bits   | 7 bits   |
| fp16   | 1 bit | 5 bits   | 10 bits  |
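
Since the slide only shows the bit layouts, here is a minimal sketch (assuming NumPy; not part of the talk) of why bf16 is cheap to derive from fp32: the two formats share the sign and exponent bits, so masking off the low 16 fraction bits of an fp32 value emulates bf16 with round-toward-zero.

```python
import numpy as np

# Minimal sketch (not from the talk): bf16 keeps fp32's sign and 8 exponent
# bits plus the top 7 fraction bits, so truncating the low 16 bits of the
# fp32 bit pattern emulates bf16 (round-toward-zero).
def to_bf16(x: np.ndarray) -> np.ndarray:
    bits = x.astype(np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

x = np.array([3.14159265], dtype=np.float32)
print(to_bf16(x))  # ~3.140625: same dynamic range as fp32, coarser precision
```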

SLIDE 13

Performance modeling

Roofline: An Insightful Visual Performance Model for Floating-Point Programs and Multicore Architectures. Williams et al.

Given an FC layer (m, n, k), assume T = max(cpu_t, mem_t), where C is the peak compute throughput (FLOP/s), B is the memory bandwidth (bytes/s), and S is the element size in bytes:

  • cpu_t = 2 * m * n * k / C
  • mem_t = S * (m * n + m * k + n * k) / B

System performance is:

  • memory bandwidth bound when cpu_t <= mem_t;
  • otherwise, compute bound (see the sketch after the examples below).
Compute bound scenarios:

  • CV

Memory bound scenarios:

  • Language translation, recommendation
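
To make the bound check concrete, here is a minimal sketch of the slide's model in Python; the default C, B, and S values are illustrative assumptions, not measured hardware numbers.

```python
# Minimal sketch of the slide's FC roofline model. The defaults for C (peak
# FLOP/s), B (memory bandwidth, bytes/s), and S (bytes per element) are
# illustrative assumptions, not measured hardware numbers.
def fc_bound(m, n, k, C=3e12, B=1e11, S=4):
    cpu_t = 2 * m * n * k / C                  # time for the multiply-adds
    mem_t = S * (m * n + m * k + n * k) / B    # time to stream the matrices
    bound = "memory-bound" if cpu_t <= mem_t else "compute-bound"
    return bound, max(cpu_t, mem_t)

# Skinny recommendation-style FC: low operational intensity, memory-bound.
print(fc_bound(m=32, n=512, k=512))
# Large square GEMM, as in CV workloads: high intensity, compute-bound.
print(fc_bound(m=4096, n=4096, k=4096))
```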

SLIDE 14

Reduced precision optimizations

  • fp16:
    • Good programmability and negligible accuracy loss
  • Use cases:
    • Prepack the weights in NNs into fp16
    • Convert dense and sparse features to fp16 for end-to-end performance optimizations (sketched below)

[Figure: recommendation systems; credit: Maxim Naumov]
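
A minimal sketch of the fp16 use case (assuming NumPy; this is not Facebook's actual prepacking code): store the weights in fp16 to halve their footprint and memory traffic, then upcast at use so the arithmetic stays in fp32.

```python
import numpy as np

# Minimal sketch (not Facebook's prepacking code): keep weights in fp16 to
# halve their memory footprint, upcast at use, and compute in fp32.
w_fp16 = np.random.randn(512, 512).astype(np.float16)  # "prepacked" weights
x = np.random.randn(32, 512).astype(np.float32)        # dense input features
y = x @ w_fp16.astype(np.float32)                      # fp32 math at use time
```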

SLIDE 15

Int8 quantization

  • Dequantization: x = scale · (x_q - offset)
  • Quantization: x_q = clip(round(x / scale) + offset, -128, 127)
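
A minimal NumPy sketch of these two formulas (the example scale and offset are illustrative, not from the talk):

```python
import numpy as np

# Minimal sketch of the slide's affine int8 quantization formulas.
def quantize(x, scale, offset):
    return np.clip(np.round(x / scale) + offset, -128, 127).astype(np.int8)

def dequantize(xq, scale, offset):
    return scale * (xq.astype(np.float32) - offset)

x = np.array([-1.0, 0.0, 0.5, 1.0], dtype=np.float32)
scale, offset = 2.0 / 255, 0           # illustrative: map [-1, 1] onto int8
xq = quantize(x, scale, offset)
print(dequantize(xq, scale, offset))   # ~x, up to rounding error
```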

SLIDE 16

Challenges

  • Accuracy requirements (tolerable accuracy loss):
    • 0.02% for recommendation systems
    • 0.5% for computer vision models
  • Performance optimizations

SLIDE 17

Accuracy improving techniques (1)

  • Symmetric vs. asymmetric
    • Symmetric preserves sparsity; no nasty handling of offsets during matmul
    • Slight loss of accuracy if using int8 for both weights and activations
  • Unsigned vs. signed
    • Including 0 or not
  • Channel-wise quantization
    • Assign a separate scale and offset to each channel (see the sketch below)
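
A minimal sketch of channel-wise symmetric quantization (NumPy; the rows-as-output-channels layout is an assumption for illustration):

```python
import numpy as np

# Minimal sketch: one scale per output channel (rows here, an assumed
# layout), with the offset fixed at 0 (symmetric quantization).
def quantize_per_channel(w):
    scale = np.maximum(np.abs(w).max(axis=1, keepdims=True), 1e-8) / 127.0
    wq = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return wq, scale

w = np.random.randn(8, 64).astype(np.float32)
wq, scale = quantize_per_channel(w)
w_hat = wq.astype(np.float32) * scale  # dequantized approximation of w
```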

SLIDE 18

Accuracy improving techniques (2)

  • L2 error minimization vs. min-max
    • Minimizes the quantization error for the more common values while allowing relatively large errors for outliers (see the sketch below)
    • Requires offline profiling of activation histograms
  • Outlier-aware quantization
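
A minimal sketch contrasting min-max with an L2-error threshold search (NumPy; the brute-force search is for illustration only, not the production algorithm):

```python
import numpy as np

# Minimal sketch: brute-force search for the saturation threshold that
# minimizes L2 quantization error, vs. min-max (threshold = max |x|).
def l2_best_threshold(x, n_candidates=100):
    t_max = np.abs(x).max()
    best_t, best_err = t_max, np.inf
    for t in np.linspace(t_max / n_candidates, t_max, n_candidates):
        scale = t / 127.0
        xq = np.clip(np.round(x / scale), -127, 127)   # saturate outliers
        err = np.sum((x - xq * scale) ** 2)            # L2 reconstruction error
        if err < best_err:
            best_t, best_err = t, err
    return best_t

acts = np.random.randn(100_000).astype(np.float32)
acts[:10] *= 50                   # a few outliers inflate the min-max range
print(np.abs(acts).max())         # min-max threshold, dominated by outliers
print(l2_best_threshold(acts))    # L2-optimal threshold is much tighter
```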

SLIDE 19

FBGEMM

  • Facebook's high performance linear algebra library
  • Optimized on-CPU performance for low precision calculations
  • Supports accuracy-loss-minimizing techniques
  • Dynamically generates matrix-shape-specific vectorized code

https://code.fb.com/ml-applications/fbgemm/

[Figure: FBGEMM performance for compute-bound scenarios]

SLIDE 20

Int8 quantization for CV models

  • OCR text recognition using Rosetta
    • 2x speedups using int8 with int32 accumulation
    • 2x speedups using int8 with int16 accumulation
    • Outlier-aware quantization
    • Model adjustments
  • Int8 quantization workflow
    • Activation histogram profiling, graph transformation, kernel optimization, quantization space exploration

Rosetta: Large scale system for text detection and recognition in images, Fedor Borisyuk et al.

SLIDE 21

Agenda

  • Facebook AI workload characteristics
  • Low precision computing
    • Reduced precision floating point optimization
    • Fixed point quantization
  • AI system co-design for low precision computing
    • Model co-design
    • Hardware co-design

SLIDE 22

Model co-design

  • Int8 quantization on Rosetta
    • +0.5% accuracy in both fp32 and int8 models
  • Int8 quantization on recommendation systems
    • Wider FC layers to compensate for accuracy loss

[Figure: ShuffleNet block with ReLU, https://arxiv.org/pdf/1707.01083.pdf]

SLIDE 23

Hardware co-design

  • Low-precision computing can achieve 2x ~ 4x performance improvements on today's hardware
  • How do we meet the fast-growing AI demand of tomorrow?

SLIDE 24

AI Inference Hardware

[Figures: technology scaling (nanometers), energy, and Dennard scaling; inference ASIC hardware. Credit: Future of Computing, John Hennessy]

SLIDE 25

AI Inference Hardware

  • Facebook has designed its own hardware since 2010
    • All designs released through the Open Compute Project!
  • Facebook is partnering with HW vendors to build inference ASICs
    • Done via co-design with FB workloads in mind
    • Simulate performance with production models
    • Advise on the quantization support needed from hardware

SLIDE 26

Thanks!

  • Q&A