My Current Dream: Managing Machine Learning on Modern Hardware
Ce Zhang

ML is a fancy UDF breaking into the traditional data workflow. Two requirements:
1. Play well with existing data ecosystems (e.g., SciDB)
2. Fast! (< 20 TB/s)
[Figure: tomographic reconstruction (MNRAS’17): input, recovered, ground truth.]
The design space:
- Models: linear models, deep learning, decision trees, Bayesian models
- Hardware: FPGA, GPU, Xeon Phi, CPU
- Applications: tomographic reconstruction, enterprise analytics, image classification, speech recognition
The same operation, Ax, looks very different across these workloads: one has A: 20 GB and x: 4 MB, another has A: 4 MB and x: 240 GB. That means very different system architectures, different error tolerances, different $$$, and different performance expectations.
[Figure: the hardware pipeline: (1) data source (sensor, database) -> (2) storage device (DRAM, CPU cache) -> (3) computation device (FPGA, GPU, CPU).]
Suppose we need to move the value 0.7 along this path but can only store it with a single bit (NIPS’15):
- Naive solution: nearest rounding (0.7 -> 1) => SGD converges to a different solution.
- Stochastic rounding: round to 0 with probability 0.3 and to 1 with probability 0.7.
The expectation matches the original value => OK! (Over-simplified; we also need to be careful about the variance!)
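A minimal sketch of stochastic rounding in Python (the helper below is my own illustration; the slide's 1-bit example corresponds to levels {0, 1}):

```python
import numpy as np

def stochastic_round(v, levels):
    """Round each value to one of its two neighboring quantization levels,
    rounding up with probability proportional to the distance from the
    lower level, so that E[stochastic_round(v)] == v."""
    levels = np.asarray(levels, dtype=float)
    idx = np.clip(np.searchsorted(levels, v) - 1, 0, len(levels) - 2)
    lo, hi = levels[idx], levels[idx + 1]
    p_up = (v - lo) / (hi - lo)                 # probability of rounding up
    return np.where(np.random.random(np.shape(v)) < p_up, hi, lo)

# The slide's example: store 0.7 with one bit (levels 0 and 1).
samples = stochastic_round(np.full(100_000, 0.7), levels=[0.0, 1.0])
print(samples.mean())  # ~0.7: the expectation matches the original value
```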
The rounded value (0 with p=0.3, 1 with p=0.7) matches the original in expectation => OK! Now use it for learning: for least squares, the loss is (ax - b)^2 and the gradient is 2a(ax - b).
If we train on stochastically rounded data, each value still matches in expectation => OK?

[Figure: training loss vs. #iterations; naive sampling converges to a visibly higher loss than 32-bit.]

Why? The gradient 2a(ax - b) is not linear in a, so plugging in a single sample â gives a biased estimate: E[â^2] != a^2 (e.g., for a 1-bit sample of a = 0.7, E[â^2] = 0.7 != 0.49).
How do we generate samples of a to get an unbiased estimator of 2a(ax - b)? (arXiv’16)

Double sampling: draw two independent samples â1 and â2 of a and estimate the gradient as 2â1(â2 x - b). Independence gives E[â1 â2] = a^2, so the estimator is unbiased.

[Figure: training loss vs. #iterations; double sampling (first sample, second sample) tracks the 32-bit curve, while naive sampling plateaus higher.]
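A sketch of why this works, in Python (setup and names are mine, not the paper's code): a single 1-bit sample of a = 0.7 has E[â^2] = 0.7 rather than 0.49, biasing the gradient, while two independent samples fix it.

```python
import numpy as np

rng = np.random.default_rng(0)

def one_bit_samples(a, n):
    """n independent 1-bit stochastic roundings of a value a in [0, 1]."""
    return (rng.random(n) < a).astype(float)

a, x, b = 0.7, 2.0, 1.0
true_grad = 2 * a * (a * x - b)        # 2a(ax - b) = 0.56

a1 = one_bit_samples(a, 1_000_000)
a2 = one_bit_samples(a, 1_000_000)

naive = 2 * a1 * (a1 * x - b)          # same sample twice: E[a1^2] = 0.7 != a^2
double = 2 * a1 * (a2 * x - b)         # independent samples: E[a1*a2] = a^2

print(true_grad, naive.mean(), double.mean())
# 0.56 (true), ~1.4 (biased), ~0.56 (unbiased)
```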
Storing two samples is cheap:
- 3 bits store the first sample (8 uniform levels: 0, 0.143, 0.286, 0.429, 0.571, 0.714, 0.857, 1).
- Given the first sample, the second sample has only 3 choices: the same level, or the level directly above or below it.
- We can do even better: the samples are symmetric (the pair is unordered), so there are only 15 different possibilities => 4 bits to store 2 samples => 2 bits per sample.
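A toy encoder for this pair trick (my own illustration, not the paper's layout): with 8 levels, two stochastic samples of the same value are always equal or adjacent, and the pair is unordered, so 8 equal pairs + 7 adjacent pairs = 15 codes fit in 4 bits.

```python
def encode_pair(i, j, num_levels=8):
    """Pack two sample indices (equal or adjacent, order irrelevant) into
    one code: 0..7 for equal pairs, 8..14 for adjacent pairs."""
    lo, hi = min(i, j), max(i, j)
    assert hi - lo <= 1, "two samples of one value are equal or adjacent"
    return lo if lo == hi else num_levels + lo

def decode_pair(code, num_levels=8):
    """Recover the (unordered) pair of sample indices."""
    if code < num_levels:
        return code, code
    return code - num_levels, code - num_levels + 1

# 8 equal + 7 adjacent pairs = 15 codes => 4 bits for two 3-bit samples
codes = {encode_pair(i, j) for i in range(8) for j in (i, min(i + 1, 7))}
assert len(codes) == 15 and max(codes) < 16
```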
From 32-bit floating point to 12-bit fixed point: linear regression with fancy regularization (but a 240 GB model).
[Figure: training loss (x10^9) vs. #epochs, original data (32-bit) vs. quantized, for (1) quantized data only, on (a) linear regression and (b) LS-SVM; (2) quantized data and gradient; (3) quantized data, gradient, and model. With 2-bit data the final loss stays within 0.06%-0.89% of the 32-bit loss; quantizing the gradient and model as well (2-3 bits) stays within 0.88%-3.82%.]
Datasets: billions of rows, thousands of columns.
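A rough sketch of what the three settings mean inside one SGD step (a hypothetical reconstruction in Python with toy normalization to [-1, 1], not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(v, bits):
    """Stochastic rounding to a uniform grid of 2^bits levels on [-1, 1]."""
    n = 2 ** bits - 1
    scaled = (np.clip(v, -1.0, 1.0) + 1.0) / 2.0 * n
    lo = np.floor(scaled)
    q = lo + (rng.random(np.shape(v)) < (scaled - lo))  # round up w.p. frac
    return q / n * 2.0 - 1.0

def sgd_step(x, a, b, lr, q_data=None, q_grad=None, q_model=None):
    """One least-squares SGD step; each stage is optionally quantized."""
    a1 = quantize(a, q_data) if q_data else a   # (1) data, double sampling:
    a2 = quantize(a, q_data) if q_data else a   #     two independent samples
    g = 2 * (a1 @ x - b) * a2                   # unbiased gradient estimate
    if q_grad:
        g = quantize(g, q_grad)                 # (2) gradient
    x = x - lr * g
    return quantize(x, q_model) if q_model else x  # (3) model
```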
Where should the quantization markers go? Uniform markers (0.25, 0.5, 0.75 on [0, 1]) ignore the data. Intuitively, shouldn't we put more markers where the data is dense?

For a value at distance a from marker A and distance b from marker B:
- Probability of quantizing to A: PA = b / (a+b)
- Probability of quantizing to B: PB = a / (a+b)
- Expectation of quantization error (variance) = a PA + b PB = 2ab / (a+b)
If the nearer marker B is replaced by a farther one B' (an extra distance b' away), the variance for the same value grows to 2a(b+b') / (a+b+b'). Denser markers where values concentrate therefore mean lower variance.
Data-optimal markers: choose levels c_1, ..., c_s to minimize the total expected quantization variance over the data:

$$\min_{c_1,\dots,c_s} \sum_j \sum_{a \in [c_j, c_{j+1}]} \frac{(a - c_j)(c_{j+1} - a)}{c_{j+1} - c_j}$$

The inner sum runs over all data points falling into the interval [c_j, c_{j+1}]; each point contributes its variance 2ab/(a+b) up to a factor of 2. The problem can be solved in P-TIME with dynamic programming, and it places dense markers where the data is dense.
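A small dynamic program for this objective (a sketch under the common simplification that candidate markers sit on the sorted data points; the paper's algorithm may differ):

```python
import numpy as np

def interval_cost(pts, lo, hi):
    """Variance contributed by points strictly between two consecutive
    markers pts[lo] and pts[hi]: sum of (a-c_lo)(c_hi-a)/(c_hi-c_lo)."""
    c_lo, c_hi = pts[lo], pts[hi]
    if c_hi == c_lo:
        return 0.0
    a = pts[lo + 1:hi]
    return float(np.sum((a - c_lo) * (c_hi - a) / (c_hi - c_lo)))

def optimal_levels(points, s):
    """Place s markers (first and last data point forced) minimizing total
    variance; O(s * n^2) dynamic program over sorted points."""
    pts = np.sort(np.asarray(points, dtype=float))
    n, INF = len(pts), float("inf")
    best = [[INF] * n for _ in range(s + 1)]  # best[m][k]: m markers, last at k
    back = [[-1] * n for _ in range(s + 1)]
    best[1][0] = 0.0
    for m in range(2, s + 1):
        for k in range(1, n):
            for j in range(k):
                c = best[m - 1][j] + interval_cost(pts, j, k)
                if c < best[m][k]:
                    best[m][k], back[m][k] = c, j
    idx, m, k = [], s, n - 1                  # backtrack the chosen markers
    while k >= 0 and m >= 1:
        idx.append(k)
        k, m = back[m][k], m - 1
    return pts[idx[::-1]]

# e.g., skewed data pulls markers toward the dense region:
print(optimal_levels(np.random.default_rng(0).exponential(size=60), s=4))
```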
[Figure: training loss vs. #epochs for original data (32 bits), uniform levels (3 bits and 5 bits), and data-optimal levels (3 bits).]
[Figure: the data source (sensor, database) -> storage device (DRAM, CPU cache) -> computation device (FPGA, GPU, CPU) pipeline revisited, with the numbered data-movement points highlighted.]
[Figure (b): deep learning, training loss vs. #epochs, 32-bit vs. 2-bit.]
[Figure: FPGA design for low-precision SGD. Data streams in as 64 B cache lines (16 floats). Dot product: 16 float2fixed converters and 16 float multipliers feed 16 fixed adders to compute ax, with a and b buffered in FIFOs. Gradient calculation: a fixed2float converter plus 1 float adder and 1 float multiplier produce γ(ax - b) and then γ(ax - b)a. Model update: x - γ(ax - b)a is written back once the batch size is reached. Here the computation device is custom logic on the FPGA, the storage device is BRAM, and data comes straight from the source (sensor, database).]
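In software terms, the pipeline implements the following loop (a Python sketch of the computation the custom logic performs, not a hardware description):

```python
import numpy as np

def fpga_style_sgd(A, b, lr, batch_size, epochs=10):
    """Least-squares SGD organized like the three pipeline stages:
    dot product, gradient calculation, model update."""
    n_rows, n_cols = A.shape
    x = np.zeros(n_cols)
    for _ in range(epochs):
        update = np.zeros(n_cols)
        for i in range(n_rows):
            a = A[i]                       # stream one row (16 floats per cache line)
            dot = a @ x                    # stage 1: dot product ax
            scale = lr * (dot - b[i])      # stage 2: gamma * (ax - b)
            update += scale * a            # stage 3: accumulate gamma*(ax-b)*a
            if (i + 1) % batch_size == 0:  # "batch size is reached?"
                x -= update                # model update, write back to BRAM
                update[:] = 0.0
    return x
```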
[Figure: training loss vs. time (s) for float CPU (1 thread), float CPU (10 threads), float FPGA, and quantized FPGA. Left: Q4 FPGA, 7.2x speedup. Right: Q8 FPGA, 2x speedup.]
Recap of the design space: models (linear models, deep learning, decision trees, Bayesian models) x hardware (FPGA, GPU, Xeon Phi, CPU) x applications (tomographic reconstruction, enterprise analytics, image classification, speech recognition).