Faster Machine Learning via Low-Precision Communication & Computation


  1. Faster Machine Learning via Low-Precision Communication & Computation. Dan Alistarh (IST Austria & ETH Zurich), Hantian Zhang (ETH Zurich)

  2. How many bits do you need to represent a single number in machine learning systems? Takeaways: for training neural networks, 4 bits is enough for communication; for training linear models, 4 bits is enough end-to-end (versus standard 32-bit floating point). Beyond empirical: rigorous theoretical guarantees. (Slide figure: number of bits on a log scale from 1 to 100.)

  3. First Example: GPUs. What happens in practice with a regular model? GPUs have plenty of compute, yet bandwidth is relatively limited (PCIe or, newer, NVLINK). Each minibatch is processed as: compute gradient, exchange gradients, update parameters. There is a trend towards large models and datasets. Vision: ImageNet (1.8M images); ResNet-152 [He+15]: 60M parameters (~240 MB). Speech: NIST2000 (2000 hours); LACEA [Yu+16]: 65M parameters (~300 MB). Gradient transmission is expensive.

  4. First Example: GPUs (animation build of the previous slide). Same setup with a bigger model: the compute/exchange/update cycle is unchanged, but the gradient exchange grows with the number of parameters. Gradient transmission is expensive.

  5. First Example: GPUs (animation build). With an even bigger model, the gradient exchange takes up an even larger share of each iteration. Gradient transmission is expensive.

  6. First Example: GPUs. One remedy: gradient compression [Seide et al., Microsoft CNTK]. The setting is as before: plenty of GPU compute, relatively limited bandwidth (PCIe or, newer, NVLINK), and a general trend towards large models (ResNet-152: ~240 MB of parameters; LACEA: ~300 MB). Gradient transmission is expensive.

  7. The Key Question: can lossy compression provide speedup while preserving convergence? Yes. Quantized SGD (QSGD) can converge as fast as SGD, with considerably less bandwidth: more than 2x faster to the same Top-1 accuracy. (Figure: Top-1 accuracy vs. time for AlexNet on ImageNet.)

  8. Why does QSGD work?

  9. Notation in One Slide. A task (e.g., image classification), data (N examples d_1, ..., d_N), and a model y. The notion of "quality" is the loss g(y) = (1/N) Σ_j loss(y, d_j). Training solves argmin_y g(y) via an optimization procedure.

  10. Background on Stochastic Gradient Descent. Goal: find argmin_y g(y). Let g̃(y) be the gradient of the loss at a randomly chosen data point, so that E[g̃(y)] = ∇g(y) (unbiased) and E||g̃(y) − ∇g(y)||² ≤ σ² (variance bound). Iteration: y_{t+1} = y_t − η_t g̃(y_t). Theorem [informal]: if g is nice (e.g., convex and smooth) and R² = ||y_0 − y*||², then T = O(R² σ² / ε²) iterations suffice to converge within ε of optimal. Higher variance = more iterations to convergence.
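
To make the iteration concrete, here is a minimal, illustrative SGD loop for a least-squares objective (the example and all names in it are mine, not from the talk): each step picks a random data point, computes an unbiased stochastic gradient, and applies y_{t+1} = y_t − η_t g̃(y_t).

```python
import numpy as np

def sgd_least_squares(A, b, steps=2000, lr=0.01, seed=0):
    """Minimal SGD sketch for g(y) = (1/N) * sum_j (a_j . y - b_j)^2."""
    rng = np.random.default_rng(seed)
    n_examples, dim = A.shape
    y = np.zeros(dim)
    for _ in range(steps):
        j = rng.integers(n_examples)            # randomly chosen data point
        grad = 2.0 * (A[j] @ y - b[j]) * A[j]   # unbiased: E[grad] = gradient of g at y
        y = y - lr * grad                       # y_{t+1} = y_t - eta_t * grad
    return y

# Toy usage: recover a planted model from noisy linear measurements.
rng = np.random.default_rng(1)
A = rng.normal(size=(500, 10))
y_true = rng.normal(size=10)
b = A @ y_true + 0.01 * rng.normal(size=500)
print(np.linalg.norm(sgd_least_squares(A, b) - y_true))
```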

  11. Data Flow: Data-Parallel Training (e.g., GPUs). Standard SGD step at time t: each GPU computes a stochastic gradient on its own data, the gradients are exchanged, and both replicas apply y_{t+1} = y_t − η_t (g̃_1(y_t) + g̃_2(y_t)). Quantized SGD step: exchange quantized gradients instead, y_{t+1} = y_t − η_t (Q(g̃_1(y_t)) + Q(g̃_2(y_t))). (Slide figure: histogram of gradient values over roughly [−5, 5].)
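
A sketch of the data-parallel step, under assumed names of my own: each simulated worker computes a stochastic gradient on its shard, the (optionally quantized) gradients are summed, and every replica applies the same update. Passing quantize=None gives the standard step; plugging in a quantizer such as the one sketched under slide 12 gives the quantized step. The point of the abstraction is that quantization only touches what is exchanged, not the update rule itself.

```python
import numpy as np

def data_parallel_step(y, shards, lr, rng, quantize=None):
    """One data-parallel SGD step on a least-squares model.

    Each "worker" holds one shard (A_k, b_k), computes a stochastic gradient
    on a random local example, and exchanges it (optionally quantized).
    """
    exchanged = []
    for A_k, b_k in shards:                       # simulate the workers
        j = rng.integers(len(b_k))
        g = 2.0 * (A_k[j] @ y - b_k[j]) * A_k[j]  # local stochastic gradient
        exchanged.append(g if quantize is None else quantize(g, rng))
    return y - lr * sum(exchanged)                # y_{t+1} = y_t - eta_t * (g_1 + g_2)

# Toy usage with two shards and no quantization (standard SGD).
rng = np.random.default_rng(0)
A = rng.normal(size=(200, 5))
b = A @ np.ones(5)
y = np.zeros(5)
for _ in range(1000):
    y = data_parallel_step(y, [(A[:100], b[:100]), (A[100:], b[100:])], 0.01, rng)
print(np.round(y, 2))   # approaches the planted all-ones model
```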

  12. How Do We Quantize? The gradient is a vector v of dimension n, normalized; consider a component v_i = 0.7. Quantization function: Q[v_i] = ||v|| · sgn(v_i) · ξ_i(v), where ξ_i(v) = 1 with probability |v_i| / ||v|| and 0 otherwise (here, Pr[1] = 0.7 and Pr[0] = 0.3). Quantization is an unbiased estimator: E[Q(v)] = v. Why do this? Instead of n floats, we send one float (the scaling) plus n signs and indicator bits, e.g. (+1, 0, 0, −1, −1, 0, 1, ...). Compression rate > 15x.
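
A minimal sketch of the quantization function on this slide (my own code, not the authors' implementation): Q keeps the norm and the signs and turns each magnitude into a biased coin flip, so E[Q(v)] = v while the payload shrinks to one float plus n sign/indicator entries.

```python
import numpy as np

def quantize_ternary(v, rng):
    """Q[v_i] = ||v|| * sgn(v_i) * xi_i, with xi_i = 1 w.p. |v_i| / ||v||."""
    norm = np.linalg.norm(v)
    if norm == 0.0:
        return np.zeros_like(v)
    xi = rng.random(v.shape) < np.abs(v) / norm   # Bernoulli(|v_i| / ||v||)
    return norm * np.sign(v) * xi

# Empirical unbiasedness check: averaging many quantizations approaches v.
rng = np.random.default_rng(0)
v = rng.normal(size=8)
avg = np.mean([quantize_ternary(v, rng) for _ in range(20000)], axis=0)
print(np.max(np.abs(avg - v)))   # small, since E[Q(v)] = v
```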

  13. Gradient Compression. We apply stochastic rounding to gradients, so the SGD iteration becomes y_{t+1} = y_t − η_t Q(g̃(y_t)). Theorem [QSGD: Alistarh, Grubic, Li, Tomioka, Vojnovic, 2016]: given dimension n, QSGD guarantees the following. (1) Convergence: if SGD converges, then QSGD converges. (2) Convergence speed: if SGD converges in T iterations, QSGD converges in at most √n · T iterations. (3) Bandwidth cost: each gradient can be coded using at most 2√n log n bits. The scheme generalizes to arbitrarily many quantization levels. The gamble: the benefit of reduced communication will outweigh the performance hit from extra iterations/variance and from coding/decoding. (Slide figure: evenly spaced quantization levels 0, 0.143, ..., 0.857 on the unit interval.)
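
The theorem notes that the scheme generalizes to arbitrarily many quantization levels. A hedged sketch of that generalization (structure and names are mine, following the s-level stochastic rounding described in the QSGD paper): each magnitude |v_i|/||v|| is rounded stochastically to one of the evenly spaced levels 0, 1/s, ..., 1, which stays unbiased and trades bandwidth against variance.

```python
import numpy as np

def quantize_levels(v, s, rng):
    """Stochastically round |v_i| / ||v|| to a multiple of 1/s; unbiased."""
    norm = np.linalg.norm(v)
    if norm == 0.0:
        return np.zeros_like(v)
    scaled = np.abs(v) / norm * s            # position in units of 1/s
    lower = np.floor(scaled)
    prob_up = scaled - lower                 # round up with this probability
    level = lower + (rng.random(v.shape) < prob_up)
    return norm * np.sign(v) * level / s

rng = np.random.default_rng(0)
v = rng.normal(size=1000)
for s in (1, 4, 16):
    q = quantize_levels(v, s, rng)
    print(s, np.linalg.norm(q - v))          # quantization error shrinks as s grows
```

With s = 1 this reduces to the single-level scheme of slide 12; larger s costs more bits per coordinate but shrinks the extra variance.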

  14. Does it actually work?

  15. Experimental Setup. Where? Amazon p16xLarge (16 x NVIDIA K80 GPUs); Microsoft CNTK v2.0, with MPI-based communication (no NVIDIA NCCL). What? Tasks: image classification (ImageNet) and speech recognition (CMU AN4). Nets: ResNet, VGG, Inception, AlexNet, and LSTM, respectively, with default parameters. Why? Accuracy vs. speed/scalability. Open-source implementation, as well as Docker containers.

  16. Experiments: Communication Cost. AlexNet x ImageNet-1K x 2 GPUs. SGD: 60% compute, 40% communicate. QSGD: 95% compute, 5% communicate. (Figure: time breakdown, SGD vs. QSGD on AlexNet.)

  17. Experiments: "Strong" Scaling. (Figure: per-network speedups of 2.3x, 3.5x, 1.6x, and 1.3x.)

  18. Experiments: A Closer Look at Accuracy. 4-bit: −0.2% accuracy; 8-bit: +0.3%; 2.5x speedup. (Figures: ResNet50 on ImageNet and a 3-layer LSTM on CMU AN4, speech.) Across all networks we tried, 4 bits are sufficient for full accuracy. (The QSGD arXiv tech report contains full numbers and comparisons.)

  19. Takeaways (recap): How many bits do you need to represent a single number in machine learning systems? Training neural networks: 4 bits is enough for communication; training linear models: 4 bits is enough end-to-end (versus standard 32-bit floating point).

  20. Data Flow in Machine Learning Systems. (1) A data source (sensor, database) provides the data; (2) a storage device (DRAM, CPU cache) holds the data A_r and the model x; (3) a computation device (GPU, CPU, FPGA) computes the gradient: dot(A_r, x) A_r.

  21. ZipML [NIPS'15]: quantizing the data A_r in the same pipeline (gradient: dot(A_r, x) A_r). Naive solution: nearest rounding (0.7 ↦ 1) makes training converge to a different solution. Stochastic rounding: round 0.7 to 0 with probability 0.3 and to 1 with probability 0.7; the expectation matches, so this is OK. (Over-simplified: one needs to be careful about variance!)

  22. ZipML. Loss: (ax − b)²; gradient: 2a(ax − b). Quantize the data value a = 0.7 stochastically: 0 with probability 0.3, 1 with probability 0.7. The expectation of the quantized value matches a, so this looks OK.
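
An illustrative comparison of the two rounding choices (toy code of my own): nearest rounding of a = 0.7 is off by 0.3 every single time, while stochastic rounding is correct on average, E[Q(a)] = a.

```python
import numpy as np

a = 0.7
rng = np.random.default_rng(0)

nearest = round(a)                                  # always 1: biased by +0.3
samples = (rng.random(100000) < a).astype(float)    # 1 w.p. 0.7, 0 w.p. 0.3
print("nearest rounding:", nearest)
print("stochastic rounding mean:", samples.mean())  # ~0.7, matches E[Q(a)] = a
```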

  23. ZipML. The expectation of the quantized a matches a, so is the gradient unbiased? NO! Why? The gradient 2a(ax − b) is not linear in a, so plugging a quantized a into it gives a biased estimate.

  24. ZipML: "Double Sampling". How do we generate samples of a that yield an unbiased estimator of 2a(ax − b)? Use TWO independent samples a_1, a_2 and form 2a_1(a_2x − b). How many extra bits does the second sample cost? Not a 2x overhead: 3 bits store the first sample, and relative to it the independent second sample has only 3 choices (up, down, same), so 2 bits suffice. We can do even better because the samples are symmetric: there are only 15 distinct possibilities, so 4 bits store both samples, i.e., a 1-bit overhead. [arXiv'16]
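
A sketch of the double-sampling idea as I read it from this slide (variable names are mine): a single quantized sample a1 makes 2·a1·(a1·x − b) biased, because E[a1²] ≠ a², while two independent samples give E[2·a1·(a2·x − b)] = 2a(ax − b).

```python
import numpy as np

def stochastic_round(a, rng, n):
    """Round a in [0, 1] to 0/1 stochastically, unbiased: E = a."""
    return (rng.random(n) < a).astype(float)

a, x, b = 0.7, 1.5, 0.2
true_grad = 2 * a * (a * x - b)

rng = np.random.default_rng(0)
n = 200000
a1 = stochastic_round(a, rng, n)
a2 = stochastic_round(a, rng, n)        # independent second sample

single = 2 * a1 * (a1 * x - b)          # biased: E[a1 * a1] = a, not a^2
double = 2 * a1 * (a2 * x - b)          # unbiased: E[a1 * a2] = a^2
print(true_grad, single.mean(), double.mean())
```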

  25. It works!

  26. Experiments. Tomographic reconstruction, i.e., linear regression with fancy regularization (but a 240 GB model): 32-bit floating point vs. 12-bit fixed point.

  27. It works, but is what we are doing optimal?

  28. Not really: towards a data-optimal quantization strategy. Consider a point at distance a from marker A and distance b from marker B. Probability of quantizing to A: P_A = b / (a + b); probability of quantizing to B: P_B = a / (a + b). Expected quantization error: a·P_A + b·P_B = 2ab / (a + b). Intuitively, shouldn't we put more markers where the data is dense?
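
A small illustrative computation (my own, using only the per-point error formula from this slide): for a fixed budget of markers, the average of 2ab/(a+b) over a dataset depends on where the markers sit, which is the motivation for placing more markers where the data is dense.

```python
import numpy as np

def expected_error(point, markers):
    """Per-point expected quantization error 2ab/(a+b) between the two
    markers that bracket the point (0 if it sits on a marker)."""
    markers = np.sort(markers)
    hi = np.searchsorted(markers, point)
    if hi == 0 or point == markers[min(hi, len(markers) - 1)]:
        return 0.0
    a = point - markers[hi - 1]          # distance down to marker A
    b = markers[hi] - point              # distance up to marker B
    return 2 * a * b / (a + b)

rng = np.random.default_rng(0)
data = rng.beta(2, 8, size=10000)        # data concentrated near 0

uniform_markers = np.linspace(0, 1, 5)               # evenly spaced
dense_markers = np.array([0.0, 0.1, 0.2, 0.4, 1.0])  # more markers where data lives
for name, m in [("uniform", uniform_markers), ("data-aware", dense_markers)]:
    print(name, np.mean([expected_error(p, m) for p in data]))
```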
