My Current Dream: Managing Machine Learning on Modern Hardware - PowerPoint PPT Presentation


  1. My Current Dream: Managing Machine Learning on Modern Hardware. Ce Zhang

  2. ML: a fancy UDF breaking into the traditional data workflow. [Figure: Input vs. Recovered vs. Ground Truth.] Two requirements: (1) play well with existing data ecosystems (e.g., SciDB); (2) be fast (< 20 TB/s of data), which seems to be a hardware problem. MNRAS'17

  3. ML on Modern Hardware: how to manage this messy cross-product? APPLICATIONS (Enterprise Analytics, Tomographic Reconstruction, Image Classification, Speech Recognition) x MODELS (Linear Models, Deep Learning, Decision Trees, Bayesian Models) x HARDWARE (FPGA, GPU, Xeon Phi, CPU). How? And will it actually help?

  4. Hasn't Stochastic Gradient Descent already solved the whole thing?

  5. Hasn't Stochastic Gradient Descent already solved the whole thing? Logically, not tooooooo far off (I can live with it, unhappily). Physically, things get sophisticated.

  6. Goal: how to manage this messy cross-product? Applications differ in error tolerance, cost ($$$), performance expectations, and data/model shape (e.g., A: 20 GB with x: 4 MB vs. A: 4 MB with x: 240 GB), and the hardware targets (FPGA, GPU, Xeon Phi, CPU) have very different system architectures. We need an "optimizer" for machine learning on modern hardware.

  7. No idea about the full answer

  8. How many bits do you need to represent a single number in machine learning systems?

  9. Data Flow in Machine Learning Systems. Three channels: (1) Data Source (sensor, database) to Storage Device (DRAM, CPU cache), carrying the data A_r; (2) Storage Device to Computation Device (FPGA, GPU, CPU), carrying the model x; (3) the gradient computation itself, dot(A_r, x) A_r. Takeaway: you can do all three channels with low precision, with some care.

  10. ZipML. Naive solution: nearest rounding (0.7 becomes 1), which converges to a different solution. Stochastic rounding: round 0.7 to 0 with probability 0.3 and to 1 with probability 0.7, so the expectation matches the original value, which is OK! (Over-simplified: one needs to be careful about the variance!) NIPS'15
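The rounding rule above can be sketched in a few lines of Python (a toy sketch; the function name and the two-level grid are illustrative, not from the ZipML code):

```python
import random

def stochastic_round(v, lo, hi):
    """Round v in [lo, hi] to lo or hi with probabilities chosen so that
    the rounded value equals v in expectation (unbiased)."""
    p_hi = (v - lo) / (hi - lo)          # closer to hi => more likely hi
    return hi if random.random() < p_hi else lo

# The slide's example: 0.7 goes to 1 with probability 0.7, to 0 with 0.3
random.seed(0)
samples = [stochastic_round(0.7, 0.0, 1.0) for _ in range(100_000)]
mean = sum(samples) / len(samples)       # close to 0.7: expectation matches
```

Nearest rounding would map 0.7 to 1 every single time, a systematic bias that accumulates over many SGD steps; stochastic rounding trades that bias for variance.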

  11. ZipML. For the loss (ax - b)^2, the gradient is 2a(ax - b). With stochastic rounding ([... 0.7 ...] to 0 with p = 0.3, to 1 with p = 0.7), the expectation matches, so this looks OK!

  12. ZipML. Expectation matches, so OK? NO!! [Figure: training loss vs. # iterations; naive sampling fails to reach the 32-bit solution.] Why? The gradient 2a(ax - b) is not linear in a.
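The failure is easy to reproduce numerically. A minimal sketch (scalar a quantized to {0, 1}; the values a = 0.7, x = 2, b = 0.5 are made up for illustration): reusing one stochastic sample of a in both factors of 2a(ax - b) inflates the expectation by 2 x Var[q(a)], because E[q(a)^2] = a^2 + Var[q(a)], not a^2.

```python
import random

random.seed(1)

def q(v):
    """Unbiased stochastic rounding of v in [0, 1] to {0, 1}."""
    return 1.0 if random.random() < v else 0.0

a, x, b = 0.7, 2.0, 0.5
true_grad = 2 * a * (a * x - b)       # 2a(ax - b) = 1.26

# Naive: ONE quantized sample reused in BOTH occurrences of a
N = 200_000
naive_sum = 0.0
for _ in range(N):
    s = q(a)
    naive_sum += 2 * s * (s * x - b)
naive_mean = naive_sum / N            # about 2.10 = 1.26 + 2*x*Var[q(a)]
```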

  13. ZipML: "Double Sampling". How to generate samples of a to get an unbiased estimator of 2a(ax - b)? Use TWO independent samples: 2 a_1 (a_2 x - b). How many more bits do we need to store the second sample? Not 2x overhead! 3 bits store the first sample; the second sample has only 3 choices relative to the first (up, down, same), so 2 bits suffice. We can do even better, since the samples are symmetric: 15 distinct possibilities, so 4 bits store both samples, i.e., only 1 bit of overhead. [Figure: training loss vs. # iterations; double sampling tracks 32-bit, unlike naive sampling.] arXiv'16
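Double sampling is just as easy to check. A sketch with the same toy scalar values as above (a = 0.7, x = 2, b = 0.5; illustrative, not from the paper's code): two independent quantized samples a1, a2 make 2 a1 (a2 x - b) unbiased, since the expectation factorizes over independent samples.

```python
import random

random.seed(2)

def q(v):
    """Unbiased stochastic rounding of v in [0, 1] to {0, 1}."""
    return 1.0 if random.random() < v else 0.0

a, x, b = 0.7, 2.0, 0.5
true_grad = 2 * a * (a * x - b)       # 1.26

# Two INDEPENDENT samples: E[2 a1 (a2 x - b)] = 2 E[a1] (E[a2] x - b)
N = 200_000
total = 0.0
for _ in range(N):
    a1, a2 = q(a), q(a)
    total += 2 * a1 * (a2 * x - b)
est = total / N                       # close to the true gradient 1.26
```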

  14. It works!

  15. Experiments: Tomographic Reconstruction. [Images: 32-bit floating point vs. 12-bit fixed point.] Linear regression with fancy regularization (but a 240 GB model).

  16. Just to make it a little bit more "Swiss". [Images: 32-bit vs. 8-bit.]

  17. There is no magic; it is tradeoff, tradeoff, & TRADEOFF. RMSE vs. ground truth, with a fixed step size:

      32-bit  16-bit  8-bit  4-bit  2-bit  1-bit
      0.000   0.000   0.000  0.007  0.172  0.929

      The variance gets larger, so you can still converge, but you need smaller step sizes, and hence potentially more iterations to reach the same quality.
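The variance-versus-precision tradeoff behind the table can be measured directly. A sketch (hypothetical uniform levels on [0, 1], not the slide's setup): halving the grid spacing per extra bit shrinks the per-sample rounding variance by roughly 4x, so fewer bits mean noisier gradients and hence smaller usable step sizes.

```python
import random

random.seed(3)

def stochastic_round(v, num_bits):
    """Round v in [0, 1] stochastically to a uniform grid of 2**num_bits levels."""
    step = 1.0 / (2 ** num_bits - 1)
    lo = int(v / step) * step
    hi = min(lo + step, 1.0)
    p_hi = (v - lo) / (hi - lo)
    return hi if random.random() < p_hi else lo

def rounding_variance(v, num_bits, n=100_000):
    samples = [stochastic_round(v, num_bits) for _ in range(n)]
    m = sum(samples) / n
    return sum((s - m) ** 2 for s in samples) / n

var_1bit = rounding_variance(0.3, 1)   # grid {0, 1}: variance 0.3 * 0.7 = 0.21
var_4bit = rounding_variance(0.3, 4)   # 16 levels: orders of magnitude smaller
```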

  18. More "classic" analytics is easier. Setup: billions of rows, thousands of columns; quantize (1) the data only, (2) the data and gradient, (3) the data, gradient, and model. [Figure: training loss vs. # epochs for each setting.] (a) Linear regression: 2-bit quantization lands within 0.89%, 0.21%, and 0.06% of the original-data loss across the three settings. (b) LS-SVM: the 2-bit and 3-bit curves land within 3.82%, 0.88%, and 1.50% of the original-data loss.

  19. It works, but is what we are doing optimal?

  20. Not Really: Data-Optimal Quantization Strategy. Consider a point between markers A and B, at distance a from A and b from B. Probability of quantizing to A: P_A = b / (a+b). Probability of quantizing to B: P_B = a / (a+b). Expected quantization error (variance): a P_A + b P_B = 2ab / (a+b). Intuitively, shouldn't we put more markers where the data is dense?
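The 2ab/(a+b) formula is easy to verify by simulation (a toy sketch; the markers at 0 and 1 and the point 0.3 are made-up values):

```python
import random

random.seed(4)

A, B, v = 0.0, 1.0, 0.3
a, b = v - A, B - v                   # distances to the two markers

expected_err = 2 * a * b / (a + b)    # the slide's formula: 0.42

N = 100_000
total_err = 0.0
for _ in range(N):
    if random.random() < b / (a + b):     # quantize to A with prob b/(a+b)
        total_err += a                    # error is the distance to A
    else:                                 # quantize to B with prob a/(a+b)
        total_err += b
mean_err = total_err / N                  # close to 0.42
```

Closely spaced markers make both a and b small, which is exactly why markers should crowd into the dense regions of the data.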

  21. Not Really: Data-Optimal Quantization Strategy. Adding a marker B' beyond B (at extra distance b') changes the picture: the expected quantization error for A, B' becomes 2a(b + b') / (a + b + b'). A P-TIME dynamic program finds a set of markers c_1 < ... < c_s minimizing

      min_{c_1, ..., c_s}  sum_j  sum_{a in [c_j, c_{j+1}]}  (a - c_j)(c_{j+1} - a) / (c_{j+1} - c_j)

      where the inner sum runs over all data points falling into each interval, placing dense markers where the data is dense.
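The dynamic program can be sketched directly from this objective. Assumptions not on the slide: markers are restricted to data-point positions, and the first and last points are always markers so every point is covered; this gives an O(s n^2) program consistent with the P-TIME claim, not necessarily the paper's exact algorithm.

```python
def interval_cost(pts, lo, hi):
    """Sum of (a - lo)(hi - a)/(hi - lo) over data points a in [lo, hi]:
    total stochastic-quantization variance for one interval."""
    if hi == lo:
        return 0.0
    return sum((p - lo) * (hi - p) / (hi - lo) for p in pts if lo <= p <= hi)

def optimal_markers(data, s):
    """Choose s markers c_1 < ... < c_s from the data points, keeping the
    minimum and maximum, minimizing total quantization variance."""
    pts = sorted(set(data))
    n = len(pts)
    cost = [[interval_cost(pts, pts[i], pts[j]) for j in range(n)]
            for i in range(n)]
    INF = float("inf")
    f = [[INF] * n for _ in range(s + 1)]   # f[k][j]: best cost, k markers, last at pts[j]
    back = [[-1] * n for _ in range(s + 1)]
    f[1][0] = 0.0                           # first marker fixed at the minimum
    for k in range(2, s + 1):
        for j in range(1, n):
            for i in range(j):
                c = f[k - 1][i] + cost[i][j]
                if c < f[k][j]:
                    f[k][j], back[k][j] = c, i
    markers, j = [], n - 1                  # last marker fixed at the maximum
    for k in range(s, 1, -1):
        markers.append(pts[j])
        j = back[k][j]
    markers.append(pts[j])
    return markers[::-1], f[s][n - 1]

# Clustered toy data: the middle marker lands inside the dense right cluster
data = [0.0, 0.01, 0.02, 0.03, 0.9, 0.91, 0.92, 1.0]
markers, total_var = optimal_markers(data, 3)
```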

  22. Experiments. [Figure: training loss vs. # epochs for uniform levels (5 bits), uniform levels (3 bits), data-optimal levels (3 bits), and the original 32-bit data.]

  23. Enough "Theory"! Let's Build Something REAL!

  24. Data Flow in Machine Learning Systems, revisited: apply low-precision quantization plus optimal quantization on all three channels, i.e., (1) the data A_r from source (sensor, database) to storage (DRAM, CPU cache), (2) the model x from storage to the computation device (FPGA, GPU, CPU), and (3) the gradient dot(A_r, x) A_r itself.

  25. Two Implementations. (1) FPGA: the gradient is computed on an FPGA sitting next to the storage devices (DRAM, CPU cache), fed from the data source (sensor, database). (2) GPU: a parameter-server architecture, with gradients computed on the GPU.

  26. "Fancy" things first :) Deep Learning

  27. GPU: Quantization. [Figure: (b) Deep Learning, training loss vs. # epochs. Left: impact of quantization (32-bit full precision vs. XNOR5). Right: impact of optimal quantization (32-bit vs. Optimal5 vs. 2-bit).]

  28. Full-precision SGD on FPGA. A 64 B cache line holds 16 32-bit floats; processing rate 12.8 GB/s from the data source (sensor, database). Pipeline stages: (1) 16 float multipliers; (2) 16 float-to-fixed and 16 fixed-to-float converters; (3) 16 fixed-point adders; (4) dot product ax; (5) c = ax - b; (6) one fixed-to-float converter, one float adder, one float multiplier; (7) gradient calculation, gamma (ax - b) a; (8) check whether the batch size is reached; (9) model update x <- x - gamma (ax - b) a, using 16 float-to-fixed converters and 16 fixed-point adders, with the model held in FPGA BRAM (custom logic on the computation/storage device).
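The pipeline's stages map one-to-one onto a software sketch. Everything below is a hypothetical mock-up (a Q8.8-style fixed-point format with 8 fractional bits, made-up data), not the FPGA design itself:

```python
FRAC_BITS = 8                         # hypothetical fixed-point format
SCALE = 1 << FRAC_BITS

def float2fixed(v):                   # stage 2: float-to-fixed converters
    return round(v * SCALE)

def fixed2float(v):                   # stage 6: fixed-to-float converter
    return v / SCALE

def sgd_step(a_vec, b, x_vec, gamma):
    """One linear-regression SGD step following the slide's stages:
    multiply, add, dot product, residual, gradient, model update."""
    a_fx = [float2fixed(v) for v in a_vec]
    x_fx = [float2fixed(v) for v in x_vec]
    ax = sum(ai * xi for ai, xi in zip(a_fx, x_fx)) // SCALE   # stages 1-4
    c = ax - float2fixed(b)                                    # stage 5: ax - b
    g = fixed2float(c) * gamma                                 # stages 6-7: gamma*(ax - b)
    # stages 8-9: model update x <- x - gamma*(ax - b)*a
    return [fixed2float(xi - float2fixed(g * av))
            for xi, av in zip(x_fx, a_vec)]

# Repeating the step on one example drives the residual ax - b toward zero
a, b, x = [0.5, 0.25], 1.0, [0.0, 0.0]
for _ in range(50):
    x = sgd_step(a, b, x, gamma=0.5)
pred = sum(ai * xi for ai, xi in zip(a, x))   # close to b = 1.0
```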

  29. Challenge + Solution. With quantized data, a 64 B cache line holds more values: 8-bit ZipML, 32 values (Q8); 4-bit, 64 values (Q4); 2-bit, 128 values (Q2); 1-bit, 256 values (Q1). So scaling the pipeline out is not trivial! Two observations help: (1) we can get rid of floating-point arithmetic entirely; (2) we can further simplify the integer arithmetic for the lowest-precision data.
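Observation (2) can be made concrete. A toy sketch (not the FPGA logic): multiplying by a 2-bit value needs only a shift and an add, no general-purpose multiplier.

```python
def mul_2bit(val, x):
    """Multiply integer x by a 2-bit value in {0, 1, 2, 3} using only a
    shift and an add: the two bits select x << 1 and x."""
    hi, lo = (val >> 1) & 1, val & 1
    return (x << 1 if hi else 0) + (x if lo else 0)

# Agrees with ordinary multiplication for every 2-bit value
products = [mul_2bit(v, 13) for v in range(4)]   # [0, 13, 26, 39]
```

For 1-bit data the multiplier disappears entirely: the product is just x or 0 (or +/-x for a signed encoding), which is what makes the 256-values-per-cache-line designs feasible.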

  30. FPGA: Speed. [Figures: training loss vs. time for Q4 and Q8 FPGA designs against float CPU (1 thread), float CPU (10 threads), and float FPGA, showing 7.2x and 2x speedups.] This is when the data are already stored in the specific low-precision format.

  31. VISION | ZipML: The Precision Manager for Machine Learning, spanning applications (Enterprise Analytics, Tomographic Reconstruction, Image Classification, Speech Recognition), models (Linear Models, Deep Learning, Decision Trees, Bayesian Models), and hardware (FPGA, GPU, Xeon Phi, CPU).
