SLIDE 1

Faster Machine Learning via Low-Precision Communication & Computation

Dan Alistarh (IST Austria & ETH Zurich), Hantian Zhang (ETH Zurich)

SLIDE 2


How many bits do you need to represent a single number in machine learning systems?

[Figure: bits per number on a scale from 1 to 100, contrasting 32-bit floating point with 3-, 4-, and 8-bit representations]

Takeaways

Training neural networks: 4 bits is enough for communication.
Training linear models: 4 bits is enough end-to-end.
Beyond empirical: rigorous theoretical guarantees.
SLIDE 3


First Example: GPUs

  • GPUs have plenty of compute
  • Yet, bandwidth relatively limited
  • PCIe or (newer) NVLINK

Trend towards large models and datasets

  • Vision: ImageNet (1.8M images)
  • ResNet-152 model [He+15]:

60M parameters (~240 MB)

  • Speech: NIST2000 (2,000 hours)
  • LACEA [Yu+16]:

65M parameters (~300 MB)

What happens in practice?

Regular model

[Timeline: Minibatch 1, Minibatch 2: compute gradient, exchange gradient, update params]

Gradient transmission is expensive.

SLIDE 4

First Example: GPUs

What happens in practice? Bigger model.

[Timeline: Minibatch 1, Minibatch 2: compute gradient, exchange gradient, update params]


Gradient transmission is expensive.

SLIDE 5

First Example: GPUs

What happens in practice? Even bigger model.

[Timeline: compute gradient, exchange gradient]


Gradient transmission is expensive.

SLIDE 6

First Example: GPUs

Compression [Seide et al., Microsoft CNTK]

[Timeline: Minibatch 1, with compressed gradient exchange]


Gradient transmission is expensive.

SLIDE 7

The Key Question

Can lossy compression provide speedup, while preserving convergence?

[Figures: Top-1 accuracy for AlexNet (ImageNet); Top-1 accuracy vs. time for AlexNet (ImageNet): > 2x faster]

  • Yes. Quantized SGD (QSGD) can converge as fast as SGD, with considerably less bandwidth.
SLIDE 8


Why does QSGD work?

SLIDE 9

Notation in One Slide

Task: $\arg\min_{x} f(x)$, e.g., image classification; solved via an optimization procedure.

Data: $M$ examples $e_1, \dots, e_M$.

Model: $x$.

Notion of "quality": $f(x) = \frac{1}{M} \sum_{j=1}^{M} \mathrm{loss}(x, e_j)$.
SLIDE 10

Background on Stochastic Gradient Descent

▪ Stochastic Gradient Descent:

Goal: find $\arg\min_{x} f(x)$. Let $\tilde{g}(x)$ be the gradient of the loss at $x$ on a randomly chosen data point.

Iteration: $x_{t+1} = x_t - \eta_t \, \tilde{g}(x_t)$, where $\mathbb{E}[\tilde{g}(x_t)] = \nabla f(x_t)$.

Theorem [Informal]: Given $f$ nice (e.g., convex and smooth), let $R^2 = \|x_0 - x^*\|^2$ and assume the variance bound $\mathbb{E}\,\|\tilde{g}(x) - \nabla f(x)\|^2 \le \sigma^2$. To converge within $\varepsilon$ of the optimum, it is sufficient to run $T = \mathcal{O}\!\left(R^2 \sigma^2 / \varepsilon^2\right)$ iterations.

Higher variance = more iterations to convergence.
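To make the iteration concrete, here is a minimal SGD sketch for least-squares regression in Python/NumPy; the synthetic data, step size, and iteration count are illustrative assumptions, not values from the talk.

```python
import numpy as np

def sgd(A, b, eta=0.01, T=2000, seed=0):
    """Minimal SGD for f(x) = (1/M) * sum_j (a_j^T x - b_j)^2."""
    rng = np.random.default_rng(seed)
    M, n = A.shape
    x = np.zeros(n)
    for _ in range(T):
        j = rng.integers(M)                  # pick a random data point
        g = 2.0 * (A[j] @ x - b[j]) * A[j]   # stochastic gradient g~(x_t)
        x = x - eta * g                      # x_{t+1} = x_t - eta_t * g~(x_t)
    return x

# Toy usage: recover a planted model from noiseless measurements.
rng = np.random.default_rng(1)
A = rng.standard_normal((500, 10))
x_star = rng.standard_normal(10)
b = A @ x_star
print("distance to planted model:", np.linalg.norm(sgd(A, b) - x_star))
```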

SLIDE 11

Data Flow: Data-Parallel Training (e.g. GPUs)

GPU 1 and GPU 2, each with its own data.

Standard SGD step: $x_{t+1} = x_t - \eta_t \, \tilde{g}(x_t)$ at step $t$. With two GPUs, each computes a local stochastic gradient $\tilde{g}_1(x_t)$ and $\tilde{g}_2(x_t)$, the gradients are exchanged, and the model is updated as $x_{t+1} = x_t - \eta_t \, (\tilde{g}_1(x_t) + \tilde{g}_2(x_t))$.

Quantized SGD step: $x_{t+1} = x_t - \eta_t \, R(\tilde{g}(x_t))$ at step $t$, where each gradient is passed through a quantization function $R(\cdot)$ before it is exchanged.
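A minimal sketch of one such data-parallel step with two workers (the shard layout, loss, and step size are illustrative assumptions); quantizing the exchanged gradients with R(·) is sketched after Slide 13.

```python
import numpy as np

def local_gradient(A, b, x, rng):
    """Stochastic least-squares gradient on one worker's data shard."""
    j = rng.integers(A.shape[0])
    return 2.0 * (A[j] @ x - b[j]) * A[j]

def data_parallel_step(shards, x, eta, rng):
    """One step: each worker computes a gradient; gradients are exchanged and summed."""
    grads = [local_gradient(A, b, x, rng) for (A, b) in shards]  # compute on GPU 1, GPU 2
    return x - eta * np.sum(grads, axis=0)                       # exchange, aggregate, update

# Toy usage with two shards of a synthetic regression problem.
rng = np.random.default_rng(0)
A = rng.standard_normal((400, 10))
x_star = rng.standard_normal(10)
b = A @ x_star
shards = [(A[:200], b[:200]), (A[200:], b[200:])]
x = np.zeros(10)
for _ in range(2000):
    x = data_parallel_step(shards, x, eta=0.005, rng=rng)
print("distance to planted model:", np.linalg.norm(x - x_star))
```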
SLIDE 12

How Do We Quantize?

▪ Gradient = vector $w$ of dimension $n$, normalized.

▪ Quantization function: $R[w_j] = \xi_j(w) \cdot \mathrm{sgn}(w_j)$, where $\xi_j(w) = 1$ with probability $|w_j|$, and $0$ otherwise.

Example: for $w_j = 0.7$, $\Pr[\,1\,] = |w_j| = 0.7$ and $\Pr[\,0\,] = 1 - |w_j| = 0.3$.

▪ Quantization is an unbiased estimator: $\mathbb{E}[R(w)] = w$.

▪ Why do this? Instead of $n$ floats ($v_1, v_2, v_3, \dots, v_n$), we transmit one scaling float plus $n$ bits and signs (e.g., +1 0 0 -1 -1 0 1 1 1 0 0 0 -1). Compression rate > 15x.
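A minimal sketch of this stochastic quantizer in NumPy, together with an empirical check that it is unbiased; the function names are illustrative, not taken from the QSGD code.

```python
import numpy as np

def quantize(w, rng):
    """Stochastic 1-level quantization: keep ||w||_2 as one scaling float and,
    per coordinate, send sgn(w_j) with probability |w_j| / ||w||_2, else 0."""
    norm = np.linalg.norm(w)
    if norm == 0.0:
        return 0.0, np.zeros_like(w)
    xi = (rng.random(w.shape) < np.abs(w) / norm).astype(w.dtype)
    return norm, np.sign(w) * xi          # (scaling float, n "bits and signs")

def dequantize(norm, q):
    return norm * q

# Unbiasedness check: averaging many quantizations approaches w.
rng = np.random.default_rng(0)
w = rng.standard_normal(8)
est = np.mean([dequantize(*quantize(w, rng)) for _ in range(100_000)], axis=0)
print("max deviation from w:", np.max(np.abs(est - w)))   # close to 0
```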

SLIDE 13


Gradient Compression

[Figure: quantization levels 0.143, 0.286, 0.429, 0.571, 0.714, 0.857, 1 on the interval [0, 1]]

  • We apply stochastic rounding to gradients
  • The SGD iteration becomes:

π’šπ’–D𝟐 = π’šπ’– βˆ’ πœ½π’–π‘Ή(𝒉 A π’šπ’– ) at step t.

Theorem [QSGD: Alistarh, Grubic, Li, Tomioka, Vojnovic, 2016] Given dimension n, QSGD guarantees the following:

  • 1. Convergence: If SGD converges, then QSGD converges.
  • 2. Convergence speed: If SGD converges in T iterations, QSGD converges in at most $\sqrt{n} \cdot T$ iterations.
  • 3. Bandwidth cost: Each gradient can be coded using at most $2\sqrt{n} \log n$ bits.

Generalizes to arbitrarily many quantization levels.

The Gamble: the benefit of reduced communication will outweigh the performance hit from the extra iterations/variance and from coding/decoding.
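Putting the pieces together, a sketch of one quantized data-parallel step in the spirit of the iteration above: each worker quantizes its stochastic gradient with the 1-level scheme from Slide 12 before the exchange. The data, shard split, and step size are illustrative assumptions.

```python
import numpy as np

def quantize(w, rng):
    """1-level stochastic quantization (scaling norm + signs), as on Slide 12."""
    norm = np.linalg.norm(w)
    if norm == 0.0:
        return 0.0, np.zeros_like(w)
    xi = (rng.random(w.shape) < np.abs(w) / norm).astype(w.dtype)
    return norm, np.sign(w) * xi

def qsgd_step(shards, x, eta, rng):
    """x_{t+1} = x_t - eta_t * sum_k R(g~_k(x_t)): quantize before transmission."""
    total = np.zeros_like(x)
    for A, b in shards:
        j = rng.integers(A.shape[0])
        g = 2.0 * (A[j] @ x - b[j]) * A[j]   # local stochastic gradient
        norm, q = quantize(g, rng)           # compress before the exchange
        total += norm * q                    # decompress and aggregate on receipt
    return x - eta * total

# Toy usage: QSGD still drives the least-squares loss down, despite the extra variance.
rng = np.random.default_rng(0)
A = rng.standard_normal((400, 10))
x_star = rng.standard_normal(10)
b = A @ x_star
shards = [(A[:200], b[:200]), (A[200:], b[200:])]
x = np.zeros(10)
for _ in range(4000):
    x = qsgd_step(shards, x, eta=0.003, rng=rng)
print("distance to planted model:", np.linalg.norm(x - x_star))
```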

SLIDE 14

Does it actually work?
SLIDE 15

Experimental Setup

Where?
▪ Amazon p16xLarge (16 x NVIDIA K80 GPUs)
▪ Microsoft CNTK v2.0, with MPI-based communication (no NVIDIA NCCL)

What?
▪ Tasks: image classification (ImageNet) and speech recognition (CMU AN4)
▪ Nets: ResNet, VGG, Inception, AlexNet, and an LSTM, respectively
▪ With default parameters

Why?
▪ Accuracy vs. Speed/Scalability

Open-source implementation, as well as docker containers.
SLIDE 16

Experiments: Communication Cost

▪ AlexNet x ImageNet-1K x 2 GPUs

SGD vs QSGD on AlexNet.

[Charts: per-minibatch time breakdown; SGD: 60% compute, 40% communicate; QSGD: 95% compute, 5% communicate]
SLIDE 17

Experiments: β€œStrong” Scaling

[Chart: speedups of 2.3x, 3.5x, 1.6x, and 1.3x]
SLIDE 18

Experiments: A Closer Look at Accuracy

[Charts: 3-layer LSTM on CMU AN4 (speech), 2.5x speedup; ResNet50 on ImageNet, 4-bit: -0.2%, 8-bit: +0.3% accuracy]

Across all networks we tried, 4 bits are sufficient for full accuracy. (The QSGD arXiv tech report contains full numbers and comparisons.)
SLIDE 19


How many bits do you need to represent a single number in machine learning systems?

[Figure: bits per number on a scale from 1 to 100, contrasting 32-bit floating point with 3-, 4-, and 8-bit representations]

Takeaways

Training neural networks: 4 bits is enough for communication.
Training linear models: 4 bits is enough end-to-end.
SLIDE 20


Data Flow in Machine Learning Systems

1. Data Source: sensor, database
2. Computation Device: GPU, CPU, FPGA
3. Storage Device: DRAM, CPU cache

Data: Ar; Model: x; Gradient: dot(Ar, x) Ar
SLIDE 21


ZipML

The stored data Ar contains a value to quantize, e.g., [.... 0.7 ....].

Naive solution: nearest rounding (0.7 -> 1) => converges to a different solution.
Stochastic rounding [NIPS'15]: 0 with prob 0.3, 1 with prob 0.7.

Expectation matches => OK! (Over-simplified: we need to be careful about variance!)
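A minimal sketch of stochastic rounding of data values to a fixed grid; the grid spacing and function name are illustrative assumptions.

```python
import numpy as np

def stochastic_round(a, delta, rng):
    """Round each entry of a to a multiple of delta: up with probability equal
    to the fractional part, down otherwise, so that E[rounded] = a (unbiased)."""
    scaled = np.asarray(a, dtype=float) / delta
    low = np.floor(scaled)
    frac = scaled - low                         # e.g. 0.7 for a = 0.7, delta = 1
    up = rng.random(np.shape(scaled)) < frac
    return (low + up) * delta

rng = np.random.default_rng(0)
samples = stochastic_round(np.full(100_000, 0.7), delta=1.0, rng=rng)
print(samples.mean())   # ~0.7: rounds to 1 with prob 0.7 and to 0 with prob 0.3
print(np.round(0.7))    # nearest rounding always gives 1.0 (biased)
```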

SLIDE 22


ZipML

The value 0.7 is stochastically rounded to 0 (p = 0.3) or 1 (p = 0.7); the expectation matches => OK!

Loss: (ax - b)²; Gradient: 2a(ax - b)
SLIDE 23


ZipML


Expectation matches => OK?

NO!!

Why? Gradient 2a(ax-b) is not linear in a.
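The bias is easy to check numerically: if ã is an unbiased stochastic rounding of a, then E[ã] = a, but E[ã²] ≠ a², so E[2ã(ãx - b)] ≠ 2a(ax - b). A small sketch with illustrative values:

```python
import numpy as np

rng = np.random.default_rng(0)
a, x, b = 0.7, 1.5, 0.2

# Stochastic rounding of a to {0, 1}: unbiased for a, but not for a^2.
a_tilde = (rng.random(1_000_000) < a).astype(float)

print(a_tilde.mean())                      # ~0.70 = a            (unbiased)
print((a_tilde ** 2).mean(), a ** 2)       # ~0.70 vs 0.49        (biased!)
grad = 2 * a_tilde * (a_tilde * x - b)
print(grad.mean(), 2 * a * (a * x - b))    # ~1.82 vs 1.19: the gradient is biased
```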

SLIDE 24


ZipML: "Double Sampling"

How do we generate samples of a to get an unbiased estimator of 2a(ax-b)? [arXiv'16]

TWO independent samples!

2a1(a2x - b), where a1 is the first sample and a2 the second.

How many more bits do we need to store the second sample? Not 2x overhead!

  • 3 bits store the first sample.
  • The 2nd sample only has 3 choices relative to the first: up, down, same => 2 bits to store.
  • We can do even better: the samples are symmetric, so there are only 15 distinct possibilities => 4 bits to store both samples.

=> 1 bit overhead.
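A sketch of the double-sampling estimator with illustrative values: two independent stochastic roundings of the same data value make the gradient estimate unbiased.

```python
import numpy as np

rng = np.random.default_rng(0)
a, x, b = 0.7, 1.5, 0.2
N = 1_000_000

# Two INDEPENDENT stochastic roundings of the same value a.
a1 = (rng.random(N) < a).astype(float)
a2 = (rng.random(N) < a).astype(float)

grad = 2 * a1 * (a2 * x - b)               # double-sampling gradient estimate
print(grad.mean(), 2 * a * (a * x - b))    # both ~1.19: unbiased
```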

SLIDE 25


It works!

SLIDE 26


Experiments

[Images: reconstruction with 32-bit floating point vs. 12-bit fixed point]

Tomographic reconstruction: linear regression with fancy regularization (but a 240 GB model).

SLIDE 27
SLIDE 28


It works, but is what we are doing optimal?

SLIDE 29

Not Really


Intuitively, shouldn’t we put more markers here?

Data-optimal Quantization Strategy

Let a and b be the distances from a data point to the two nearest markers A and B.
Probability of quantizing to A: PA = b / (a + b)
Probability of quantizing to B: PB = a / (a + b)
Expected quantization error for A, B (variance) = a PA + b PB = 2ab / (a + b)
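Spelled out, the expected error is

$$
a \, P_A + b \, P_B \;=\; \frac{ab}{a+b} + \frac{ba}{a+b} \;=\; \frac{2ab}{a+b},
$$

which is largest when the data point sits midway between the two markers and vanishes as it approaches either one.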

SLIDE 30


Data-optimal Quantization Strategy

If the right marker is moved from B to B', an extra distance b' further from the data point, the expected quantization error for A, B' (variance) becomes 2a(b + b') / (a + b + b'): putting markers far from the data increases the error.

Find a set of markers c1 < ... < cs minimizing the total expected error over all data points falling into each interval. With a dense grid of candidate markers, this can be solved in P-TIME by dynamic programming.
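A sketch of a dynamic program in this spirit: pick s markers from a dense candidate grid so as to minimize the total expected error 2ab/(a+b) summed over the data. The candidate grid, the convention that the data-range endpoints are always chosen, and the simple O(s·K²) formulation are illustrative assumptions, not necessarily the exact algorithm from the paper.

```python
import numpy as np

def interval_cost(points, lo, hi):
    """Total expected rounding error for the points in [lo, hi], each rounded
    stochastically to the markers lo and hi: sum over points of 2ab/(a+b)."""
    p = points[(points >= lo) & (points <= hi)]
    if len(p) == 0:
        return 0.0
    a, b = p - lo, hi - p
    return float(np.sum(2.0 * a * b / (a + b)))   # a + b == hi - lo > 0

def optimal_markers(points, candidates, s):
    """Choose s markers from a sorted candidate grid (containing the data-range
    endpoints) minimizing total expected error, via dynamic programming."""
    cand = np.sort(np.asarray(candidates, dtype=float))
    K, INF = len(cand), float("inf")
    dp = [[INF] * K for _ in range(s + 1)]     # dp[k][j]: best cost, k markers, last at cand[j]
    parent = [[-1] * K for _ in range(s + 1)]
    dp[1][0] = 0.0                             # first marker at the left endpoint
    for k in range(2, s + 1):
        for j in range(1, K):
            for i in range(j):
                if dp[k - 1][i] == INF:
                    continue
                c = dp[k - 1][i] + interval_cost(points, cand[i], cand[j])
                if c < dp[k][j]:
                    dp[k][j], parent[k][j] = c, i
    markers, j = [], K - 1                     # last marker at the right endpoint
    for k in range(s, 0, -1):
        markers.append(cand[j])
        j = parent[k][j]
    return sorted(markers), dp[s][K - 1]

# Toy usage: data concentrated near 0 pulls the optimal markers toward 0.
rng = np.random.default_rng(0)
data = rng.beta(2, 8, size=2000)
grid = np.linspace(0.0, 1.0, 41)               # "dense markers" candidate grid
print(optimal_markers(data, grid, s=4))
```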

SLIDE 31

Experiments

SLIDE 32


Beyond Linear Regression

SLIDE 33


General Loss Function

Linear regression (or LS-SVM) vs. logistic regression.

Challenge: non-linear terms.

Approximation via polynomials: Chebyshev polynomials.
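One way to make this concrete: fit a low-degree Chebyshev polynomial to the logistic loss on a bounded interval using NumPy's Chebyshev utilities. The interval, degree, and use of a least-squares fit are illustrative assumptions, not the exact construction from the talk.

```python
import numpy as np
from numpy.polynomial import chebyshev as C

def logistic_loss(z):
    """Logistic loss log(1 + exp(-z)); the non-linear term we want to approximate."""
    return np.log1p(np.exp(-z))

z = np.linspace(-5.0, 5.0, 2001)                 # bounded interval for the fit
coeffs = C.chebfit(z, logistic_loss(z), deg=8)   # least-squares fit in the Chebyshev basis
approx = C.chebval(z, coeffs)

print("max abs error on [-5, 5]:", np.max(np.abs(approx - logistic_loss(z))))
```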

SLIDE 34

Experiments


Even for non-linear models, we can do end-to-end quantization at 8 bits, with guaranteed convergence.

SLIDE 35

Summary & Implementation

▪ We can do end-to-end quantized linear regression with guarantees
▪ We can do arbitrary classification losses via approximation
▪ Implemented on a Xeon + FPGA platform (Intel-Altera HARP)
▪ Quantized data -> 8-16x more data per transfer

[Plot: training loss vs. time (s); float CPU 1-thread, float CPU 10-threads, float FPGA, Q2 FPGA; 6.5x speedup. Classification (gisette dataset, n = 5K).]

[Plot: training loss vs. time (s); float CPU 1-thread, float CPU 10-threads, float FPGA, Q4 FPGA; 7.2x speedup. Regression (synthetic dataset, n = 100).]
SLIDE 36

Questions?

[Figure: bits per number on a scale from 1 to 100, contrasting 32-bit floating point with 3-, 4-, and 8-bit representations]

Takeaways

Training neural networks: 4 bits is enough for communication.
Training linear models: 4 bits is enough end-to-end.
Beyond empirical: rigorous theoretical guarantees.