

SLIDE 1

Accurate, Fast and Scalable Kernel Ridge Regression on Parallel and Distributed Systems

speaker: Yang You

PhD student at UC Berkeley, advised by James Demmel

with James Demmel1, Cho-Jui Hsieh2, and Richard Vuduc3

1 Professor at UC Berkeley
2 Assistant Professor at UCLA
3 Associate Professor at Georgia Tech

Yang You (youyang@cs.berkeley.edu)
  • J. Demmel, C. Hsieh, R. Vuduc

UC Berkeley Computer Sci 1 / 48

slide-2
SLIDE 2

Outline

• Introduction
• Existing Approaches
• Our Approach
• Analysis and Results

SLIDE 3

Kernel Ridge Regression (KRR)

Given n samples (x_1, y_1), ..., (x_n, y_n), find the empirical minimizer4

$$\hat{\alpha} = \arg\min_{\alpha} \frac{1}{n} \sum_{i=1}^{n} (f_i - y_i)^2 + \lambda \|f\|_H^2$$

where

$$f_i = \sum_{j=1}^{n} \alpha_j \Phi(x_j, x_i) = \sum_{j=1}^{n} \alpha_j \exp\left(-\|x_i - x_j\|^2 / (2\sigma^2)\right)$$

This problem has a closed-form solution5:

$$(K + \lambda n I)\,\alpha = y$$

with $f \in \mathbb{R}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$, $\alpha \in \mathbb{R}^n$, $\lambda \in \mathbb{R}$, $\Phi : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$

4 H is a Reproducing Kernel Hilbert Space
5 K is an n-by-n matrix where K[i][j] = Φ(x_j, x_i), I is the identity matrix
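The direct method above can be sketched in a few lines of NumPy. This is a toy single-node illustration, not the paper's distributed implementation; the function names and the bandwidth/regularizer values are assumptions chosen for the demo.

```python
import numpy as np

def gaussian_kernel(X, Z, sigma):
    """K[i][j] = exp(-||X_i - Z_j||^2 / (2 sigma^2))."""
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2 * sigma ** 2))

def krr_fit(X, y, lam, sigma):
    """Direct method: solve (K + lambda*n*I) alpha = y."""
    n = X.shape[0]
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * n * np.eye(n), y)

def krr_predict(X_train, alpha, X_test, sigma):
    """f(x) = sum_j alpha_j * Phi(x_j, x)."""
    return gaussian_kernel(X_test, X_train, sigma) @ alpha

# tiny demo: fit y = sin(x) on 50 random points
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0])
alpha = krr_fit(X, y, lam=1e-6, sigma=1.0)
mse = np.mean((krr_predict(X, alpha, X, sigma=1.0) - y) ** 2)
```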

SLIDE 4

KRR by Direct Method

MSE: correctness metric, lower is better

the mean of squared differences between the predicted labels and the true labels

SLIDE 5

Bottleneck: solve a large linear equation (K + λnI)α = y

n-by-n dense kernel matrix K

machine learning input dataset: an n-by-d matrix
• n: num of samples (e.g. num of users on Facebook: ∼2.2 billion)
• d: num of features (e.g. num of movies a user rated: ∼1000)
• n >> d, so a small input dataset can generate a huge kernel matrix

a 357 MB dataset (a 520,000 × 90 matrix) generates a 2 TB kernel matrix

Θ(n³) to solve the linear equation directly

very expensive in practice
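The 357 MB → 2 TB blow-up is simple arithmetic to check, assuming 8-byte double-precision entries:

```python
# sizes assume 8-byte double-precision entries
n, d = 520_000, 90

dataset_bytes = n * d * 8      # the n-by-d input matrix
kernel_bytes = n * n * 8       # the n-by-n dense kernel matrix

dataset_mb = dataset_bytes / 2**20   # about 357 MB
kernel_tb = kernel_bytes / 2**40     # about 2 TB
```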

SLIDE 6

Weak Scaling Issue

1 million users 2 million users 4 million users 8 million users

primary interest for machine learning at scale

keep each machine fully loaded (more users, buy more servers)

keep d and n/p fixed as p grows (p is # nodes)

KRR: memory grows as Θ(p) and flops as Θ(p²) per node

perfect scaling: memory and flops are constant per node
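The Θ(p) and Θ(p²) growth follows from keeping m = n/p fixed: the kernel has n² = m²p² entries and the direct solve costs Θ(n³) = Θ(m³p³) flops, each spread over p nodes. A tiny sketch of this idealized count (constants and communication ignored):

```python
# weak scaling: samples per node m = n/p stays fixed as p grows
m = 1_000_000

def per_node_kernel_entries(p):
    """Kernel entries stored per node: n^2 / p = m^2 * p, i.e. Theta(p)."""
    n = m * p
    return n * n // p

def per_node_flops(p):
    """Direct-solve flops per node: Theta(n^3) / p = m^3 * p^2, i.e. Theta(p^2)."""
    n = m * p
    return n ** 3 // p
```

Doubling p doubles the per-node memory and quadruples the per-node flops, whereas perfect weak scaling would keep both constant.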

SLIDE 7

Outline

• Introduction
• Existing Approaches
• Our Approach
• Analysis and Results

SLIDE 8

Bottleneck: solve a large linear equation (K + λnI)α = y

Low-rank matrix approximation
• Kernel PCA (Schölkopf et al., 1998)
• Incomplete Cholesky Decomposition (Fine and Scheinberg, 2002)
• Nyström Sampling (Williams and Seeger, 2001)

Iterative optimization algorithms
• Gradient Descent (Raskutti et al., 2011)
• Conjugate Gradient Methods (Blanchard and Kramer, 2010)

None of these methods can achieve the same level of accuracy as the direct method does6

We reserve them for future study

  • 6Y. Zhang, J. Duchi, M. Wainwright, Divide and Conquer Kernel Ridge Regression, COLT’13

SLIDE 9

DKRR: Straightforward Implementation by ScaLAPACK

K + λnI is symmetric positive definite, so we can solve with Cholesky decomposition. Weak scaling efficiency drops to 0.32% when we increase to 64 nodes.
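Since K + λnI is symmetric positive definite, a Cholesky factorization applies. A minimal single-node NumPy sketch (a random PSD matrix stands in for a real kernel matrix; this is not ScaLAPACK's distributed routine):

```python
import numpy as np

def spd_solve(A, y):
    """Solve A x = y for a symmetric positive definite A via Cholesky."""
    # np.linalg.solve is used for clarity; a real implementation would call
    # triangular solvers (e.g. ScaLAPACK's distributed Cholesky routines)
    L = np.linalg.cholesky(A)       # A = L L^T
    z = np.linalg.solve(L, y)       # forward solve  L z = y
    return np.linalg.solve(L.T, z)  # backward solve L^T x = z

# a random symmetric PSD matrix stands in for the kernel matrix K
rng = np.random.default_rng(1)
n, lam = 200, 0.1
B = rng.normal(size=(n, n))
K = B @ B.T / n
A = K + lam * n * np.eye(n)     # the ridge term guarantees positive definiteness
y = rng.normal(size=n)
x = spd_solve(A, y)
```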

SLIDE 10

Divide-and-Conquer KRR (DC-KRR)7

communication overhead is low, good scaling!

  • 7Y. Zhang, J. Duchi, M. Wainwright, Divide and Conquer Kernel Ridge Regression, COLT’13

SLIDE 11

DC-KRR key idea: block-diagonal matrix approximation

figure from the DC-KRR authors (Y. Zhang, J. Duchi, M. Wainwright)
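DC-KRR's scheme — random partition into p parts, one local KRR solve per part, predictions averaged — can be sketched as follows. This is a toy single-process version with assumed hyperparameters, not the authors' code:

```python
import numpy as np

def gaussian_kernel(X, Z, sigma=1.0):
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2 * sigma ** 2))

def dckrr_fit(X, y, p, lam):
    """Randomly split the data into p parts and solve a local KRR on each."""
    idx = np.random.default_rng(0).permutation(len(X))
    models = []
    for part in np.array_split(idx, p):
        Xp, yp = X[part], y[part]
        m = len(part)
        alpha = np.linalg.solve(gaussian_kernel(Xp, Xp) + lam * m * np.eye(m), yp)
        models.append((Xp, alpha))
    return models

def dckrr_predict(models, X_test):
    """Average the local models' predictions."""
    return np.mean([gaussian_kernel(X_test, Xp) @ a for Xp, a in models], axis=0)

X = np.random.default_rng(2).uniform(-3, 3, size=(120, 1))
y = np.sin(X[:, 0])
models = dckrr_fit(X, y, p=4, lam=1e-4)
pred = dckrr_predict(models, X)
mse = np.mean((pred - y) ** 2)
```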

SLIDE 12

DC-KRR beats previous methods on tens of nodes

figure from the DC-KRR authors (Y. Zhang, J. Duchi, M. Wainwright), based on a dataset from a music recommendation system

SLIDE 13

Weak scaling in accuracy (i.e. MSE)

Table 1: MSE (lower is better), 2k samples per node

Methods            8k samples   32k samples   128k samples
DKRR (baseline)    90.9         85.0          0.002
DCKRR              88.9         85.5          81.0

when we scale DC-KRR to many nodes, it loses accuracy

SLIDE 14

Outline

• Introduction
• Existing Approaches
• Our Approach
• Analysis and Results

SLIDE 15

Why does DC-KRR not work at scale?

It is not safe to ignore the off-diagonal parts

there are many nonzero entries in the off-diagonal parts

a 5k-by-5k Gaussian kernel matrix from the UCI Covertype dataset, visualized with Matlab's spy

SLIDE 16

How to block-diagonalize the kernel matrix?

k-means clustering algorithm

cluster the samples based on Euclidean distance
• xi and xj in the same cluster: ||xi − xj|| is small
• xi and xj in different clusters: ||xi − xj|| is large
• ||xi − xj|| → ∞ means K[i][j] → 0, since K[i][j] = Φ(xj, xi) = exp(−||xi − xj||²/(2σ²))

[Figure 1.1: Original Kernel | Figure 1.2: After K-means]

nonzero threshold: entries larger than 10−6
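The reordering effect can be reproduced on a toy matrix: grouping rows and columns by cluster pushes essentially all entries above the 10−6 threshold into the diagonal blocks. A NumPy-only sketch with a minimal Lloyd's k-means (the blob positions and kernel bandwidth are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
# two well-separated blobs in 2-D, shuffled so the structure is hidden
X = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(8, 0.5, (30, 2))])
X = rng.permutation(X)

def gaussian_kernel(X, sigma=1.0):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2 * sigma ** 2))

def kmeans_labels(X, k=2, iters=20):
    """Minimal Lloyd's k-means with farthest-point initialization (k=2)."""
    far = int(np.argmax(((X - X[0]) ** 2).sum(axis=1)))
    centers = X[[0, far]].astype(float)
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for c in range(k):
            centers[c] = X[labels == c].mean(axis=0)
    return labels

def offdiag_fraction(K, n0, thresh=1e-6):
    """Fraction of entries above thresh lying outside the two diagonal blocks."""
    nz = K > thresh
    off = nz[:n0, n0:].sum() + nz[n0:, :n0].sum()
    return off / nz.sum()

labels = kmeans_labels(X)
order = np.argsort(labels, kind="stable")   # group the samples by cluster
n0 = int((labels == 0).sum())

frac_before = offdiag_fraction(gaussian_kernel(X), 30)
frac_after = offdiag_fraction(gaussian_kernel(X[order]), n0)
```

For well-separated clusters, `frac_after` drops to zero: the reordered kernel is (numerically) block diagonal.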

SLIDE 17

K-means KRR (KKRR)

we expect KKRR to achieve low MSE!

SLIDE 18

KKRR performs poorly

  • our system tries different hyperparameters iteratively, until it finds the lowest MSE

dataset from a music recommendation system, on 96 CPU processors

SLIDE 19

Why does KKRR perform poorly?

• different clusters are very different from each other
• they generate different models, so averaging them is a bad idea

SLIDE 20

KKRR2

we expect KKRR2 to achieve low MSE!

SLIDE 21

KKRR2 performs much better than KKRR

  • our system tries different hyperparameters iteratively, until it finds the lowest MSE

dataset from a music recommendation system, on 96 CPU processors

SLIDE 22

How good can this be in the best case?

suppose we can select the best model (try each one-by-one)

SLIDE 23

KKRR3: error lower bound for block diagonal method

we believe KKRR3 will achieve the lowest MSE!

SLIDE 24

Block diagonal works well given an optimal model-selection algorithm

  • our system tries different hyperparameters iteratively, until it finds the lowest MSE

dataset from a music recommendation system, on 96 CPU processors

SLIDE 25

However, the KKRR family is slow

  • our system tries different hyperparameters iteratively, until it finds the lowest MSE

dataset from a music recommendation system, on 96 CPU processors

SLIDE 26

K-means clustering: imbalanced partitioning

the sizes of different blocks are different

SLIDE 27

K-means clustering: imbalanced partitioning

[Figure 1.3: Load Balance for Data Size | Figure 1.4: Load Balance for Time]

different nodes have different numbers of samples (n); memory: Θ(n²), flops: Θ(n³)

SLIDE 28

Basic Idea of K-balance algorithm

• Run k-means to get all the cluster centers
• Find the closest center (CC) for a given sample
• If CC still has room, assign the sample to it; if CC is already balanced, try the next-closest center
• When every center has n/p samples, done
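The steps above can be sketched as a greedy assignment. This is a simplified reading of K-balance (samples processed in a fixed order, each taking its nearest center that still has room); the authors' released code is the reference implementation:

```python
import numpy as np

def k_balance(X, centers):
    """Greedily assign each sample to its nearest center that still has room,
    so every center ends up with exactly n/p samples."""
    n, p = len(X), len(centers)
    cap = n // p                          # balanced size: n/p samples per center
    # d[i][j] = distance between i-th center and j-th sample
    d = np.sqrt(((centers[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    counts = np.zeros(p, dtype=int)
    assign = np.empty(n, dtype=int)
    for j in range(n):                    # for each sample
        for c in np.argsort(d[:, j]):     # try centers from nearest to farthest
            if counts[c] < cap:           # skip centers that are already balanced
                assign[j] = c
                counts[c] += 1
                break
    return assign

# the slides' setting: 8 samples, 4 centers, balanced size 2
rng = np.random.default_rng(4)
X = rng.normal(size=(8, 2))
centers = rng.normal(size=(4, 2))
assign = k_balance(X, centers)
```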

SLIDE 29

K-balance

distance matrix: 8 samples, 4 centers

d[i][j] = the distance between i-th center and j-th sample

balanced case: each center has 2 samples

SLIDE 30

The center for S0 ⇒ C2

underload: C0, C1, C2, C3 balanced: None

SLIDE 31

The center for S1 ⇒ C3

underload: C0, C1, C2, C3 balanced: None

SLIDE 32

The center for S2 ⇒ C0

underload: C0, C1, C2, C3 balanced: None

SLIDE 33

The center for S3 ⇒ C3

underload: C0, C1, C2, C3 balanced: None

SLIDE 34

The center for S4 ⇒ C0

underload: C0, C1, C2 balanced: C3

SLIDE 35

The center for S5 ⇒ C0 ⇒ C3 ⇒ C1

underload: C1, C2 balanced: C0, C3

SLIDE 36

The center for S6 ⇒ C3 ⇒ C2

underload: C1, C2 balanced: C0, C3

SLIDE 37

The center for S7 ⇒ C0 ⇒ C2 ⇒ C1

underload: C1 balanced: C0, C2, C3

SLIDE 38

Done

underload: None balanced: C0, C1, C2, C3

SLIDE 39

Optimization Flow: Balanced KRR (BKRR)

By changing k-means to k-balance, we get BKRR, BKRR2, and BKRR3

SLIDE 40

Outline

• Introduction
• Existing Approaches
• Our Approach
• Analysis and Results

SLIDE 41

The Datasets

name              MSD       cadata    MG        space-ga
# Train Samples   463,715   18,432    1,024     2,560
# Test Samples    51,630    2,208     361       547
# Features        90        8         6         6
Application       Music     Housing   Dynamics  Politics

SLIDE 42

BKRR vs DCKRR

based on the MSD dataset

  • J. Demmel, C. Hsieh, R. Vuduc

UC Berkeley Computer Sci 42 / 48

slide-43
SLIDE 43

BKRR achieves a good speedup over KKRR

[Figure 1.5: Load Balance for Data Size | Figure 1.6: Load Balance for Time]

different nodes have different numbers of samples (n); memory: Θ(n²), flops: Θ(n³)

based on the MSD dataset

  • J. Demmel, C. Hsieh, R. Vuduc

UC Berkeley Computer Sci 43 / 48

slide-44
SLIDE 44

BKRR2 performs better than DCKRR

[Figure 1.7: 1024 train, 361 test | Figure 1.8: 2560 train, 547 test | Figure 1.9: 18432 train, 2208 test]

SLIDE 45

Weak Scaling on MSD dataset

[Figure 1.10: Time | Figure 1.11: Efficiency | Figure 1.12: Accuracy]

SLIDE 46

Key idea: difference between BKRR2 and DCKRR

• BKRR2: partition the dataset into p different parts, generate p different models, and select the best model.
• DCKRR: partition the dataset into p similar parts, generate p similar models, and use the average of all the models.
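The contrast can be made concrete with a toy sketch: given p candidate predictions on a validation set, BKRR2-style selection keeps the single best model, while DCKRR-style averaging blends them all. The validation labels and the three models' predictions below are purely illustrative:

```python
import numpy as np

def average_models(preds):
    """DCKRR-style: average the p models' predictions."""
    return np.mean(preds, axis=0)

def select_best(preds, y_val):
    """BKRR2-style: keep only the model with the lowest validation MSE."""
    mses = [np.mean((p - y_val) ** 2) for p in preds]
    best = int(np.argmin(mses))
    return preds[best], mses[best]

# illustrative validation labels and three partition models' predictions:
# one local model matches this validation set well, the others do not
y_val = np.array([1.0, 2.0, 3.0])
preds = [np.array([1.1, 2.0, 2.9]),   # good local model
         np.array([3.0, 0.0, 5.0]),   # models trained on very different
         np.array([0.0, 4.0, 1.0])]   # partitions fit poorly here

best_pred, best_mse = select_best(preds, y_val)
avg_mse = np.mean((average_models(preds) - y_val) ** 2)
```

When the partitions (and therefore the models) are heterogeneous, selecting beats averaging, which is the intuition behind BKRR2.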

SLIDE 47

Tradeoff between accuracy and scaling

SLIDE 48

Thanks for your time!

check out our source code:

https://www.cs.berkeley.edu/~youyang/cakrr.zip
