My Current Dream: Managing Machine Learning on Modern Hardware - PowerPoint PPT Presentation



SLIDE 1

My Current Dream: Managing Machine Learning on Modern Hardware

Ce Zhang

SLIDE 2

ML: fancy UDF breaking into traditional data workflow

[Figure: tomographic reconstruction example (MNRAS'17): Input, Recovered, Ground Truth]

  • 1. Play well with existing data ecosystems (e.g., SciDB)
  • 2. Fast! (< 20TB/s data) — seems to be a hardware problem
SLIDE 3

ML on Modern Hardware: How to manage this messy cross-product?

APPLICATIONS (Tomographic Reconstruction, Enterprise Analytics, Image Classification, Speech Recognition) x MODELS (Linear Models, Deep Learning, Decision Tree, Bayesian Models) x HARDWARE (CPU, GPU, FPGA, Xeon Phi)

How? & Will it actually help?

SLIDE 4

Hasn’t Stochastic Gradient Descent already solved the whole thing?

SLIDE 5

Hasn’t Stochastic Gradient Descent already solved the whole thing?

Logically: not too far off (I can live with it, unhappily). Physically: things get sophisticated.

SLIDE 6

Goal: How to manage this messy cross-product?

APPLICATIONS (Tomographic Reconstruction, Enterprise Analytics, Image Classification, Speech Recognition) x MODELS (Linear Models, Deep Learning, Decision Tree, Bayesian Models) x HARDWARE (CPU, GPU, FPGA, Xeon Phi)

Different corners look very different: e.g., A: 20GB with x: 4MB vs. A: 4MB with x: 240GB. Very different system architecture, different error tolerance, different $$$, different performance expectation.

We need an “optimizer” for machine learning on modern hardware

SLIDE 7

No idea about the full answer

SLIDE 8

How many bits do you need to represent a single number in machine learning systems?

SLIDE 9

Data Flow in Machine Learning Systems

Data Source (Sensor, Database) -> Storage Device (DRAM, CPU Cache) -> Computation Device (FPGA, GPU, CPU)

Three channels: (1) Data Ar, (2) Model x, (3) Gradient: dot(Ar, x) Ar

Takeaway: you can do all three channels with low precision, with some care.

SLIDE 10

ZipML

Data flow as before: Data Source (Sensor, Database) -> Storage Device (DRAM, CPU Cache) -> Computation Device (FPGA, GPU, CPU); channels (1) Data Ar, (2) Model x, (3) Gradient: dot(Ar, x) Ar.

A stored value to quantize: [.... 0.7 ....]

Expectation matches => OK! (Over-simplified, need to be careful about variance!)

Naive solution: nearest rounding (0.7 -> 1) => converges to a different solution.
Stochastic rounding (NIPS'15): 0 with prob 0.3, 1 with prob 0.7.
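To make the rounding operator concrete, here is a minimal NumPy sketch (my illustration, not from the deck); the helper name stochastic_round is made up:

```python
import numpy as np

def stochastic_round(v, rng):
    """Round each entry of v down or up so that the expectation equals v."""
    lo = np.floor(v)
    frac = v - lo                                   # e.g. 0.7 for v = 0.7
    return lo + (rng.random(np.shape(v)) < frac)    # round up with probability frac

rng = np.random.default_rng(0)
samples = stochastic_round(np.full(100_000, 0.7), rng)
print(samples.mean())   # ~0.7: the expectation matches the original value
```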

SLIDE 11

ZipML

Data flow as before; channels (1) Data Ar, (2) Model x, (3) Gradient: dot(Ar, x) Ar.

[.... 0.7 ....] is quantized to 0 (p=0.3) or 1 (p=0.7).

Expectation matches => OK!

Loss: (ax - b)^2
Gradient: 2a(ax - b)

SLIDE 12

ZipML

Data flow as before; channels (1) Data Ar, (2) Model x, (3) Gradient: dot(Ar, x) Ar.

[.... 0.7 ....] is quantized to 0 (p=0.3) or 1 (p=0.7).

Expectation matches => OK?

NO!!

[Plot: training loss vs. # iterations; Naive Sampling converges to a different solution than 32-bit.]

Why? The gradient 2a(ax - b) is not linear in a.
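A small numeric sketch (my illustration, with toy values a = 0.7, x = 2, b = 0.5) of why reusing one quantized sample of a in both places of 2a(ax - b) is biased:

```python
import numpy as np

rng = np.random.default_rng(0)
a, x, b = 0.7, 2.0, 0.5
true_grad = 2 * a * (a * x - b)                    # = 1.26

# naive: one stochastically rounded copy of a, reused in both places
q = (rng.random(1_000_000) < a).astype(float)      # 0/1 samples with E[q] = a
naive_grad = (2 * q * (q * x - b)).mean()

print(true_grad, naive_grad)   # ~1.26 vs ~2.1: biased, since E[q*q] = a, not a*a
```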

SLIDE 13

ZipML: “Double Sampling”

How do we generate samples of a to get an unbiased estimator for 2a(ax - b)? (arXiv'16)

TWO Independent Samples!

2a1(a2x - b), where a1 (first sample) and a2 (second sample) are independent quantizations of a

[Plot: training loss vs. # iterations; Double Sampling tracks the 32-bit curve, Naive Sampling does not.]
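Continuing the same toy numbers, a sketch (again mine) of the double-sampling estimator 2a1(a2x - b) with two independent quantized copies of a:

```python
import numpy as np

rng = np.random.default_rng(1)
a, x, b = 0.7, 2.0, 0.5
true_grad = 2 * a * (a * x - b)                    # = 1.26

q1 = (rng.random(1_000_000) < a).astype(float)     # first sample
q2 = (rng.random(1_000_000) < a).astype(float)     # second, independent sample
double_grad = (2 * q1 * (q2 * x - b)).mean()

print(true_grad, double_grad)  # both ~1.26: unbiased in expectation
```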

How many more bits do we need to store the second sample?

Not 2x Overhead!

3 bits to store the first sample (8 uniform levels: 0, 0.143, 0.286, 0.429, 0.571, 0.714, 0.857, 1).

The 2nd sample only has 3 choices relative to the first: up, down, or same (both samples land on one of the two levels bracketing the original value) => 2 bits to store it.

We can do even better: the samples are symmetric! 15 different possibilities => 4 bits to store 2 samples.

=> 1-bit overhead
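One encoding consistent with the slide's counting (8 equal pairs + 7 adjacent pairs = 15 possibilities, hence 4 bits for two samples); this is my sketch, and the actual ZipML encoding may differ:

```python
import numpy as np

LEVELS = 7  # 3-bit grid: 0, 1/7, ..., 7/7

def sample_pair(v, rng):
    """Two independent stochastic quantizations of v in [0, 1] on the 3-bit grid."""
    u = v * LEVELS
    lo = int(np.floor(u))
    return tuple(int(lo + s) for s in (rng.random(2) < (u - lo)))  # each in {lo, lo+1}

def encode(s1, s2):
    """Unordered pair of equal or adjacent levels -> one of 15 codes (4 bits)."""
    return s1 if s1 == s2 else 8 + min(s1, s2)

def decode(code, rng):
    """Code -> a pair of levels; order is assigned randomly (samples are exchangeable)."""
    if code < 8:
        return code, code
    pair = (code - 8, code - 7)
    return pair if rng.random() < 0.5 else (pair[1], pair[0])

rng = np.random.default_rng(0)
s1, s2 = sample_pair(0.7, rng)
print(s1, s2, encode(s1, s2), decode(encode(s1, s2), rng))
```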

SLIDE 14

It works!

SLIDE 15

Experiments

[Figure: reconstruction with 32-bit floating point vs. 12-bit fixed point]

Tomographic Reconstruction

Linear regression with fancy regularization (but a 240GB model)

SLIDE 16

Just to make it a little bit more “Swiss”

[Figure: 32-bit vs. 8-bit comparison]

SLIDE 17

There is no magic, it is tradeoff, tradeoff, & TRADEOFF

RMSE vs. Ground Truth: 32-bit: 0.000, 16-bit: 0.000, 8-bit: 0.000, 4-bit: 0.007, 2-bit: 0.172, 1-bit: 0.929

Variance gets larger => you can still converge, but you need smaller step sizes => potentially more iterations to reach the same quality.

Fixed Stepsize

SLIDE 18

More “classic” analytics is easier

[Plots: training loss vs. # epochs (25 to 100), on a dataset with billions of rows and thousands of columns, for (a) Linear Regression and (b) LS-SVM under three settings: (1) quantized data only, (2) quantized data and gradient, (3) quantized data, gradient and model. The 2-bit (and, where shown, 3-bit) curves end within 0.21%, 0.06%, and 0.89% of the original 32-bit data in one set of panels, and within 1.50%, 0.88%, and 3.82% in the other.]

SLIDE 19

It works, but is what we are doing optimal?

SLIDE 20

Not Really

[Figure: a data distribution on [0, 1] with uniformly spaced quantization markers at 0.25, 0.5, 0.75, 1]

Intuitively, shouldn't we put more markers here (in the region where the data is dense)?

Data-optimal Quantization Strategy

A point between markers A and B, at distance a from A and b from B:
Probability of quantizing to A: PA = b / (a + b)
Probability of quantizing to B: PB = a / (a + b)
Expected quantization error for A, B (variance): a·PA + b·PB = 2ab / (a + b)
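For completeness, the expected-error line follows from a one-step computation in the slide's notation (a and b are the point's distances to A and B):

$$a \cdot P_A + b \cdot P_B = a\cdot\frac{b}{a+b} + b\cdot\frac{a}{a+b} = \frac{2ab}{a+b}$$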

SLIDE 21

Not Really

[Figure: the same data distribution and uniform markers as on the previous slide]

Data-optimal Quantization Strategy

Probability of quantizing to A: PA = b / (a + b); probability of quantizing to B: PB = a / (a + b). If the nearer marker B is replaced by a farther marker B' (an extra distance b' beyond B), the expected quantization error for A, B' grows to 2a(b + b') / (a + b + b').

Find a set of markers c1 < ... < cs that minimizes the total expected quantization error over all data points:

$$\min_{c_1,\ldots,c_s} \; \sum_j \sum_{a \in [c_j, c_{j+1}]} \frac{(a - c_j)(c_{j+1} - a)}{c_{j+1} - c_j}$$

where the inner sum runs over all data points falling into an interval. This can be solved in P-TIME with dynamic programming, placing dense markers where the data is dense.

(Recap of the two-marker picture: expected quantization error for A, B (variance) = a·PA + b·PB = 2ab / (a + b).)
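To make the objective concrete, here is a rough Python sketch (my own illustration, not the authors' code). It evaluates the slide's objective per interval and searches for markers with a simple dynamic program; for simplicity, markers are restricted to observed data values, so it may differ from the paper's algorithm:

```python
import numpy as np

def interval_cost(points, lo, hi):
    """Slide's objective on one interval: sum of (a - lo)(hi - a) / (hi - lo)."""
    if hi == lo:
        return 0.0
    a = points[(points >= lo) & (points <= hi)]
    return float(np.sum((a - lo) * (hi - a) / (hi - lo)))

def optimal_markers(points, s):
    """Pick s markers (restricted to data values) minimizing total expected error."""
    vals = np.unique(points)
    n, INF = len(vals), float("inf")
    # best[k][i]: minimal cost of covering vals[0..i] with k markers, the k-th at vals[i]
    best = [[INF] * n for _ in range(s + 1)]
    back = [[-1] * n for _ in range(s + 1)]
    best[1][0] = 0.0                       # the first marker sits at the minimum
    for k in range(2, s + 1):
        for i in range(1, n):
            for h in range(i):
                c = best[k - 1][h] + interval_cost(points, vals[h], vals[i])
                if c < best[k][i]:
                    best[k][i], back[k][i] = c, h
    markers, i = [], n - 1                 # the last marker sits at the maximum
    for k in range(s, 0, -1):
        markers.append(float(vals[i]))
        if k > 1:
            i = back[k][i]
    return sorted(markers), best[s][n - 1]

rng = np.random.default_rng(0)
data = rng.beta(2, 8, size=200)            # skewed data: markers should cluster low
print(optimal_markers(data, 4))
```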

SLIDE 22

Experiments

[Plot: training loss vs. # epochs, comparing Original Data (32 bits), Uniform Levels (3 bits), Uniform Levels (5 bits), and Data-Optimal Levels (3 bits)]

SLIDE 23

Enough “Theory”! Let’s Build Something REAL!

SLIDE 24

Data Flow in Machine Learning Systems

Data Source (Sensor, Database) -> Storage Device (DRAM, CPU Cache) -> Computation Device (FPGA, GPU, CPU)

Apply low-precision quantization + optimal quantization to all three channels: (1) Data Ar, (2) Model x, (3) Gradient: dot(Ar, x) Ar

SLIDE 25

Two Implementations

1. FPGA: Data Source (Sensor, Database) -> Storage Device (DRAM, CPU Cache) -> FPGA; channels: Data Ar, Model x, Gradient.
2. GPU + Param. Server: Data Source (Sensor, Database) -> Storage Device (DRAM, CPU Cache) -> GPU with a parameter server; channels: Data Ar, Model x, Gradient.

SLIDE 26

“Fancy” things first :) Deep Learning

SLIDE 27

GPU - Quantization

[Plot: training loss vs. # epochs for (b) deep learning, comparing 32-bit Full Precision, XNOR5, and Optimal5; shows the impact of quantization (32-bit vs. 2-bit) and the impact of optimal quantization]

SLIDE 28

Full precision SGD on FPGA

[Diagram: full-precision SGD pipeline on the FPGA. Each 64B cache line carries 16 32-bit floating-point values; processing rate 12.8 GB/s. The pipeline is split into a dot-product stage (16 float multipliers, 16 float-to-fixed converters, 16 fixed adders, with a and b FIFOs), a gradient-calculation stage (γ(ax - b), using 1 fixed-to-float converter, 1 float multiplier, 1 float adder), and a model-update stage (x - γ(ax - b)a, using 16 float multipliers, 16 fixed adders, 16 float-to-fixed and 16 fixed-to-float converters), applied once the batch size is reached. Data Source (Sensor, Database) -> Storage Device (FPGA BRAM) -> Computation Device (Custom Logic).]
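For readers without the figure, a minimal NumPy sketch (mine, not the FPGA implementation) of what the three pipeline stages compute for one sample, assuming the least-squares update x <- x - γ(ax - b)a:

```python
import numpy as np

def sgd_step(x, a, b, gamma):
    """One least-squares SGD step, mirroring the pipeline stages:
    dot product -> gradient scale gamma * (ax - b) -> model update."""
    ax = np.dot(a, x)              # stage 1: dot product (16 floats per cache line)
    scale = gamma * (ax - b)       # stage 2: gradient calculation
    return x - scale * a           # stage 3: model update x - gamma*(ax - b)*a

rng = np.random.default_rng(0)
dim = 16                           # toy dimension: one 64B line of 16 32-bit floats
x = np.zeros(dim, dtype=np.float32)
a = rng.standard_normal(dim).astype(np.float32)
b = np.float32(1.0)
x = sgd_step(x, a, b, gamma=np.float32(0.1))
print(x[:4])
```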

SLIDE 29

Challenge + Solution

[Diagram: the same SGD pipeline, now fed with quantized data.]

8-bit ZipML (Q8): 32 values per 64B cache line; 4-bit (Q4): 64 values; 2-bit (Q2): 128 values; 1-bit (Q1): 256 values => scaling out the compute is not trivial!

1) We can get rid of floating-point arithmetic.
2) We can further simplify integer arithmetic for the lowest-precision data.

SLIDE 30

FPGA - Speed

[Plots: training loss vs. time (s), comparing float CPU 1-thread, float CPU 10-threads, float FPGA, and quantized FPGA: Q4 FPGA gives a 7.2x speedup, Q8 FPGA a 2x speedup.]

When data are already stored in a specific format.

SLIDE 31

VISION | ZipML: The Precision Manager for Machine Learning

APPLICATIONS (Tomographic Reconstruction, Enterprise Analytics, Image Classification, Speech Recognition) x MODELS (Linear Models, Deep Learning, Decision Tree, Bayesian Models) x HARDWARE (CPU, GPU, FPGA, Xeon Phi)