My Current Dream: Managing Machine Learning on Modern Hardware - PowerPoint PPT Presentation


  1. My Current Dream: Managing Machine Learning on Modern Hardware. Ce Zhang

  2. ML: a fancy UDF breaking into the traditional data workflow. [Figure: Input vs. Recovered vs. Ground Truth.] Two requirements: (1) play well with existing data ecosystems (e.g., SciDB); (2) be fast (< 20 TB/s of data), which seems to be a hardware problem. MNRAS'17

  3. ML on Modern Hardware: how to manage this messy cross-product? APPLICATIONS (Enterprise Analytics, Tomographic Reconstruction, Image Classification, Speech Recognition) x MODELS (Linear Models, Deep Learning, Decision Trees, Bayesian Models) x HARDWARE (FPGA, GPU, Xeon Phi, CPU). How? And will it actually help?

  4. Hasn't Stochastic Gradient Descent already solved the whole thing?

  5. Hasn't Stochastic Gradient Descent already solved the whole thing? Logically, not tooooooo far off (I can live with it, unhappily). Physically, things get sophisticated.

  6. Goal: how to manage this messy cross-product? Applications differ in error tolerance, cost ($$$), performance expectations, and data/model shape (e.g., A: 20 GB with x: 4 MB vs. A: 4 MB with x: 240 GB), and the hardware targets (FPGA, GPU, Xeon Phi, CPU) have very different system architectures. We need an "optimizer" for machine learning on modern hardware.

  7. No idea about the full answer

  8. How many bits do you need to represent a single number in machine learning systems?

  9. Data Flow in Machine Learning Systems. Three channels: (1) Data Source (sensor, database) to Storage Device (DRAM, CPU cache), carrying the data A_r; (2) Storage Device to Computation Device (FPGA, GPU, CPU), carrying the model x; (3) the gradient computation itself, dot(A_r, x) A_r. Takeaway: you can do all three channels with low precision, with some care.

  10. ZipML. Naive solution: nearest rounding (0.7 becomes 1), which converges to a different solution. Stochastic rounding: round 0.7 to 0 with probability 0.3 and to 1 with probability 0.7, so the expectation matches the original value, which is OK! (Over-simplified: one needs to be careful about the variance!) NIPS'15
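The rounding rule above can be sketched in a few lines of Python (a toy sketch; the function name and the two-level grid are illustrative, not from the ZipML code):

```python
import random

def stochastic_round(v, lo, hi):
    """Round v in [lo, hi] to lo or hi with probabilities chosen so that
    the rounded value equals v in expectation (unbiased)."""
    p_hi = (v - lo) / (hi - lo)          # closer to hi => more likely hi
    return hi if random.random() < p_hi else lo

# The slide's example: 0.7 goes to 1 with probability 0.7, to 0 with 0.3
random.seed(0)
samples = [stochastic_round(0.7, 0.0, 1.0) for _ in range(100_000)]
mean = sum(samples) / len(samples)       # close to 0.7: expectation matches
```

Nearest rounding would map 0.7 to 1 every single time, a systematic bias that accumulates over many SGD steps; stochastic rounding trades that bias for variance.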

  11. ZipML. For the loss (ax - b)^2, the gradient is 2a(ax - b). With stochastic rounding ([... 0.7 ...] to 0 with p = 0.3, to 1 with p = 0.7), the expectation matches, so this looks OK!

  12. ZipML. Expectation matches, so OK? NO!! [Figure: training loss vs. # iterations; naive sampling fails to reach the 32-bit solution.] Why? The gradient 2a(ax - b) is not linear in a.
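The failure is easy to reproduce numerically. A minimal sketch (scalar a quantized to {0, 1}; the values a = 0.7, x = 2, b = 0.5 are made up for illustration): reusing one stochastic sample of a in both factors of 2a(ax - b) inflates the expectation by 2 x Var[q(a)], because E[q(a)^2] = a^2 + Var[q(a)], not a^2.

```python
import random

random.seed(1)

def q(v):
    """Unbiased stochastic rounding of v in [0, 1] to {0, 1}."""
    return 1.0 if random.random() < v else 0.0

a, x, b = 0.7, 2.0, 0.5
true_grad = 2 * a * (a * x - b)       # 2a(ax - b) = 1.26

# Naive: ONE quantized sample reused in BOTH occurrences of a
N = 200_000
naive_sum = 0.0
for _ in range(N):
    s = q(a)
    naive_sum += 2 * s * (s * x - b)
naive_mean = naive_sum / N            # about 2.10 = 1.26 + 2*x*Var[q(a)]
```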

  13. ZipML: "Double Sampling". How to generate samples of a to get an unbiased estimator of 2a(ax - b)? Use TWO independent samples: 2 a_1 (a_2 x - b). How many more bits do we need to store the second sample? Not 2x overhead! 3 bits store the first sample; the second sample has only 3 choices relative to the first (up, down, same), so 2 bits suffice. We can do even better, since the samples are symmetric: 15 distinct possibilities, so 4 bits store both samples, i.e., only 1 bit of overhead. [Figure: training loss vs. # iterations; double sampling tracks 32-bit, unlike naive sampling.] arXiv'16
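Double sampling is just as easy to check. A sketch with the same toy scalar values as above (a = 0.7, x = 2, b = 0.5; illustrative, not from the paper's code): two independent quantized samples a1, a2 make 2 a1 (a2 x - b) unbiased, since the expectation factorizes over independent samples.

```python
import random

random.seed(2)

def q(v):
    """Unbiased stochastic rounding of v in [0, 1] to {0, 1}."""
    return 1.0 if random.random() < v else 0.0

a, x, b = 0.7, 2.0, 0.5
true_grad = 2 * a * (a * x - b)       # 1.26

# Two INDEPENDENT samples: E[2 a1 (a2 x - b)] = 2 E[a1] (E[a2] x - b)
N = 200_000
total = 0.0
for _ in range(N):
    a1, a2 = q(a), q(a)
    total += 2 * a1 * (a2 * x - b)
est = total / N                       # close to the true gradient 1.26
```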

  14. It works!

  15. Experiments: Tomographic Reconstruction. [Images: 32-bit floating point vs. 12-bit fixed point.] Linear regression with fancy regularization (but a 240 GB model).

  16. Just to make it a little bit more "Swiss". [Images: 32-bit vs. 8-bit.]

  17. There is no magic; it is tradeoff, tradeoff, & TRADEOFF. RMSE vs. ground truth, with a fixed step size:

      32-bit  16-bit  8-bit  4-bit  2-bit  1-bit
      0.000   0.000   0.000  0.007  0.172  0.929

      The variance gets larger, so you can still converge, but you need smaller step sizes, and hence potentially more iterations to reach the same quality.
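The variance-versus-precision tradeoff behind the table can be measured directly. A sketch (hypothetical uniform levels on [0, 1], not the slide's setup): halving the grid spacing per extra bit shrinks the per-sample rounding variance by roughly 4x, so fewer bits mean noisier gradients and hence smaller usable step sizes.

```python
import random

random.seed(3)

def stochastic_round(v, num_bits):
    """Round v in [0, 1] stochastically to a uniform grid of 2**num_bits levels."""
    step = 1.0 / (2 ** num_bits - 1)
    lo = int(v / step) * step
    hi = min(lo + step, 1.0)
    p_hi = (v - lo) / (hi - lo)
    return hi if random.random() < p_hi else lo

def rounding_variance(v, num_bits, n=100_000):
    samples = [stochastic_round(v, num_bits) for _ in range(n)]
    m = sum(samples) / n
    return sum((s - m) ** 2 for s in samples) / n

var_1bit = rounding_variance(0.3, 1)   # grid {0, 1}: variance 0.3 * 0.7 = 0.21
var_4bit = rounding_variance(0.3, 4)   # 16 levels: orders of magnitude smaller
```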

  18. More "classic" analytics is easier. Setup: billions of rows, thousands of columns; quantize (1) the data only, (2) the data and gradient, (3) the data, gradient, and model. [Figure: training loss vs. # epochs for each setting.] (a) Linear regression: 2-bit quantization lands within 0.89%, 0.21%, and 0.06% of the original-data loss across the three settings. (b) LS-SVM: the 2-bit and 3-bit curves land within 3.82%, 0.88%, and 1.50% of the original-data loss.

  19. It works, but is what we are doing optimal?

  20. Not Really: Data-Optimal Quantization Strategy. Consider a point between markers A and B, at distance a from A and b from B. Probability of quantizing to A: P_A = b / (a+b). Probability of quantizing to B: P_B = a / (a+b). Expected quantization error (variance): a P_A + b P_B = 2ab / (a+b). Intuitively, shouldn't we put more markers where the data is dense?
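The 2ab/(a+b) formula is easy to verify by simulation (a toy sketch; the markers at 0 and 1 and the point 0.3 are made-up values):

```python
import random

random.seed(4)

A, B, v = 0.0, 1.0, 0.3
a, b = v - A, B - v                   # distances to the two markers

expected_err = 2 * a * b / (a + b)    # the slide's formula: 0.42

N = 100_000
total_err = 0.0
for _ in range(N):
    if random.random() < b / (a + b):     # quantize to A with prob b/(a+b)
        total_err += a                    # error is the distance to A
    else:                                 # quantize to B with prob a/(a+b)
        total_err += b
mean_err = total_err / N                  # close to 0.42
```

Closely spaced markers make both a and b small, which is exactly why markers should crowd into the dense regions of the data.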

  21. Not Really: Data-Optimal Quantization Strategy. Adding a marker B' beyond B (at extra distance b') changes the picture: the expected quantization error for A, B' becomes 2a(b + b') / (a + b + b'). A P-TIME dynamic program finds a set of markers c_1 < ... < c_s minimizing

      min_{c_1, ..., c_s}  sum_j  sum_{a in [c_j, c_{j+1}]}  (a - c_j)(c_{j+1} - a) / (c_{j+1} - c_j)

      where the inner sum runs over all data points falling into each interval, placing dense markers where the data is dense.
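The dynamic program can be sketched directly from this objective. Assumptions not on the slide: markers are restricted to data-point positions, and the first and last points are always markers so every point is covered; this gives an O(s n^2) program consistent with the P-TIME claim, not necessarily the paper's exact algorithm.

```python
def interval_cost(pts, lo, hi):
    """Sum of (a - lo)(hi - a)/(hi - lo) over data points a in [lo, hi]:
    total stochastic-quantization variance for one interval."""
    if hi == lo:
        return 0.0
    return sum((p - lo) * (hi - p) / (hi - lo) for p in pts if lo <= p <= hi)

def optimal_markers(data, s):
    """Choose s markers c_1 < ... < c_s from the data points, keeping the
    minimum and maximum, minimizing total quantization variance."""
    pts = sorted(set(data))
    n = len(pts)
    cost = [[interval_cost(pts, pts[i], pts[j]) for j in range(n)]
            for i in range(n)]
    INF = float("inf")
    f = [[INF] * n for _ in range(s + 1)]   # f[k][j]: best cost, k markers, last at pts[j]
    back = [[-1] * n for _ in range(s + 1)]
    f[1][0] = 0.0                           # first marker fixed at the minimum
    for k in range(2, s + 1):
        for j in range(1, n):
            for i in range(j):
                c = f[k - 1][i] + cost[i][j]
                if c < f[k][j]:
                    f[k][j], back[k][j] = c, i
    markers, j = [], n - 1                  # last marker fixed at the maximum
    for k in range(s, 1, -1):
        markers.append(pts[j])
        j = back[k][j]
    markers.append(pts[j])
    return markers[::-1], f[s][n - 1]

# Clustered toy data: the middle marker lands inside the dense right cluster
data = [0.0, 0.01, 0.02, 0.03, 0.9, 0.91, 0.92, 1.0]
markers, total_var = optimal_markers(data, 3)
```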

  22. Experiments. [Figure: training loss vs. # epochs for uniform levels (5 bits), uniform levels (3 bits), data-optimal levels (3 bits), and the original 32-bit data.]

  23. Enough "Theory"! Let's Build Something REAL!

  24. Data Flow in Machine Learning Systems, revisited: apply low-precision quantization plus optimal quantization on all three channels, i.e., (1) the data A_r from source (sensor, database) to storage (DRAM, CPU cache), (2) the model x from storage to the computation device (FPGA, GPU, CPU), and (3) the gradient dot(A_r, x) A_r itself.

  25. Two Implementations. (1) FPGA: the gradient is computed on an FPGA sitting next to the storage devices (DRAM, CPU cache), fed from the data source (sensor, database). (2) GPU: a parameter-server architecture, with gradients computed on the GPU.

  26. "Fancy" things first :) Deep Learning

  27. GPU: Quantization. [Figure: (b) Deep Learning, training loss vs. # epochs. Left: impact of quantization (32-bit full precision vs. XNOR5). Right: impact of optimal quantization (32-bit vs. Optimal5 vs. 2-bit).]

  28. Full-precision SGD on FPGA. A 64 B cache line holds 16 32-bit floats; processing rate 12.8 GB/s from the data source (sensor, database). Pipeline stages: (1) 16 float multipliers; (2) 16 float-to-fixed and 16 fixed-to-float converters; (3) 16 fixed-point adders; (4) dot product ax; (5) c = ax - b; (6) one fixed-to-float converter, one float adder, one float multiplier; (7) gradient calculation, gamma (ax - b) a; (8) check whether the batch size is reached; (9) model update x <- x - gamma (ax - b) a, using 16 float-to-fixed converters and 16 fixed-point adders, with the model held in FPGA BRAM (custom logic on the computation/storage device).
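The pipeline's stages map one-to-one onto a software sketch. Everything below is a hypothetical mock-up (a Q8.8-style fixed-point format with 8 fractional bits, made-up data), not the FPGA design itself:

```python
FRAC_BITS = 8                         # hypothetical fixed-point format
SCALE = 1 << FRAC_BITS

def float2fixed(v):                   # stage 2: float-to-fixed converters
    return round(v * SCALE)

def fixed2float(v):                   # stage 6: fixed-to-float converter
    return v / SCALE

def sgd_step(a_vec, b, x_vec, gamma):
    """One linear-regression SGD step following the slide's stages:
    multiply, add, dot product, residual, gradient, model update."""
    a_fx = [float2fixed(v) for v in a_vec]
    x_fx = [float2fixed(v) for v in x_vec]
    ax = sum(ai * xi for ai, xi in zip(a_fx, x_fx)) // SCALE   # stages 1-4
    c = ax - float2fixed(b)                                    # stage 5: ax - b
    g = fixed2float(c) * gamma                                 # stages 6-7: gamma*(ax - b)
    # stages 8-9: model update x <- x - gamma*(ax - b)*a
    return [fixed2float(xi - float2fixed(g * av))
            for xi, av in zip(x_fx, a_vec)]

# Repeating the step on one example drives the residual ax - b toward zero
a, b, x = [0.5, 0.25], 1.0, [0.0, 0.0]
for _ in range(50):
    x = sgd_step(a, b, x, gamma=0.5)
pred = sum(ai * xi for ai, xi in zip(a, x))   # close to b = 1.0
```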

  29. Challenge + Solution. With quantized data, a 64 B cache line holds more values: 8-bit ZipML, 32 values (Q8); 4-bit, 64 values (Q4); 2-bit, 128 values (Q2); 1-bit, 256 values (Q1). So scaling the pipeline out is not trivial! Two observations help: (1) we can get rid of floating-point arithmetic entirely; (2) we can further simplify the integer arithmetic for the lowest-precision data.
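Observation (2) can be made concrete. A toy sketch (not the FPGA logic): multiplying by a 2-bit value needs only a shift and an add, no general-purpose multiplier.

```python
def mul_2bit(val, x):
    """Multiply integer x by a 2-bit value in {0, 1, 2, 3} using only a
    shift and an add: the two bits select x << 1 and x."""
    hi, lo = (val >> 1) & 1, val & 1
    return (x << 1 if hi else 0) + (x if lo else 0)

# Agrees with ordinary multiplication for every 2-bit value
products = [mul_2bit(v, 13) for v in range(4)]   # [0, 13, 26, 39]
```

For 1-bit data the multiplier disappears entirely: the product is just x or 0 (or +/-x for a signed encoding), which is what makes the 256-values-per-cache-line designs feasible.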

  30. FPGA: Speed. [Figures: training loss vs. time for Q4 and Q8 FPGA designs against float CPU (1 thread), float CPU (10 threads), and float FPGA, showing 7.2x and 2x speedups.] This is when the data are already stored in the specific low-precision format.

  31. VISION | ZipML: The Precision Manager for Machine Learning, spanning applications (Enterprise Analytics, Tomographic Reconstruction, Image Classification, Speech Recognition), models (Linear Models, Deep Learning, Decision Trees, Bayesian Models), and hardware (FPGA, GPU, Xeon Phi, CPU).
