

SLIDE 1

Accurate, Fast and Scalable Kernel Ridge Regression on Parallel and Distributed Systems

speaker: Yang You

PhD student at UC Berkeley, advised by James Demmel

with James Demmel1, Cho-Jui Hsieh2, and Richard Vuduc3

1 Professor at UC Berkeley
2 Assistant Professor at UCLA
3 Associate Professor at Georgia Tech

Yang You (youyang@cs.berkeley.edu)
  • J. Demmel, C. Hsieh, R. Vuduc

UC Berkeley Computer Sci 1 / 48

slide-2
SLIDE 2

Outline

• Introduction
• Existing Approaches
• Our Approach
• Analysis and Results

SLIDE 3

Kernel Ridge Regression (KRR)

Given n samples (x_1, y_1), ..., (x_n, y_n), find the empirical minimizer4

$$\hat{\alpha} = \arg\min_{\alpha} \frac{1}{n} \sum_{i=1}^{n} (f_i - y_i)^2 + \lambda \|f\|_H^2$$

where

$$f_i = \sum_{j=1}^{n} \alpha_j \Phi(x_j, x_i) = \sum_{j=1}^{n} \alpha_j \exp\left(-\|x_i - x_j\|^2 / (2\sigma^2)\right)$$

This problem has a closed-form solution5:

$$(K + \lambda n I)\,\alpha = y$$

with $f \in \mathbb{R}^n$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$, $\alpha \in \mathbb{R}^n$, $\lambda \in \mathbb{R}$, $\Phi : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$

4 H is a Reproducing Kernel Hilbert Space
5 K is an n-by-n matrix where K[i][j] = Φ(x_j, x_i), I is the identity matrix
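The direct method above can be sketched in a few lines of NumPy. This is a toy single-node illustration, not the paper's distributed implementation; the function names and the bandwidth/regularizer values are assumptions chosen for the demo.

```python
import numpy as np

def gaussian_kernel(X, Z, sigma):
    """K[i][j] = exp(-||X_i - Z_j||^2 / (2 sigma^2))."""
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2 * sigma ** 2))

def krr_fit(X, y, lam, sigma):
    """Direct method: solve (K + lambda*n*I) alpha = y."""
    n = X.shape[0]
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * n * np.eye(n), y)

def krr_predict(X_train, alpha, X_test, sigma):
    """f(x) = sum_j alpha_j * Phi(x_j, x)."""
    return gaussian_kernel(X_test, X_train, sigma) @ alpha

# tiny demo: fit y = sin(x) on 50 random points
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0])
alpha = krr_fit(X, y, lam=1e-6, sigma=1.0)
mse = np.mean((krr_predict(X, alpha, X, sigma=1.0) - y) ** 2)
```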

SLIDE 4

KRR by Direct Method

MSE: correctness metric, lower is better

the mean of squared differences between the predicted labels and the true labels

SLIDE 5

Bottleneck: solve a large linear equation (K + λnI)α = y

n-by-n dense kernel matrix K

machine learning input dataset: an n-by-d matrix
• n: num of samples (e.g. num of users on Facebook: ∼2.2 billion)
• d: num of features (e.g. num of movies a user rated: ∼1000)
• n >> d, so a small input dataset can generate a huge kernel matrix

a 357 MB dataset (a 520,000 × 90 matrix) generates a 2 TB kernel matrix

Θ(n³) to solve the linear equation directly

very expensive in practice
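The 357 MB → 2 TB blow-up is simple arithmetic to check, assuming 8-byte double-precision entries:

```python
# sizes assume 8-byte double-precision entries
n, d = 520_000, 90

dataset_bytes = n * d * 8      # the n-by-d input matrix
kernel_bytes = n * n * 8       # the n-by-n dense kernel matrix

dataset_mb = dataset_bytes / 2**20   # about 357 MB
kernel_tb = kernel_bytes / 2**40     # about 2 TB
```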

SLIDE 6

Weak Scaling Issue

1 million users 2 million users 4 million users 8 million users

primary interest for machine learning at scale

keep each machine fully loaded (more users, buy more servers)

keep d and n/p fixed as p grows (p is # nodes)

KRR: memory grows as Θ(p) and flops as Θ(p²) per node

perfect scaling: memory and flops are constant per node
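The Θ(p) and Θ(p²) growth follows from keeping m = n/p fixed: the kernel has n² = m²p² entries and the direct solve costs Θ(n³) = Θ(m³p³) flops, each spread over p nodes. A tiny sketch of this idealized count (constants and communication ignored):

```python
# weak scaling: samples per node m = n/p stays fixed as p grows
m = 1_000_000

def per_node_kernel_entries(p):
    """Kernel entries stored per node: n^2 / p = m^2 * p, i.e. Theta(p)."""
    n = m * p
    return n * n // p

def per_node_flops(p):
    """Direct-solve flops per node: Theta(n^3) / p = m^3 * p^2, i.e. Theta(p^2)."""
    n = m * p
    return n ** 3 // p
```

Doubling p doubles the per-node memory and quadruples the per-node flops, whereas perfect weak scaling would keep both constant.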

SLIDE 7

Outline

• Introduction
• Existing Approaches
• Our Approach
• Analysis and Results

SLIDE 8

Bottleneck: solve a large linear equation (K + λnI)α = y

Low-rank matrix approximation
• Kernel PCA (Schölkopf et al., 1998)
• Incomplete Cholesky Decomposition (Fine and Scheinberg, 2002)
• Nyström Sampling (Williams and Seeger, 2001)

Iterative optimization algorithms
• Gradient Descent (Raskutti et al., 2011)
• Conjugate Gradient Methods (Blanchard and Kramer, 2010)

None of these methods can achieve the same level of accuracy as the direct method does6

We reserve them for future study

  • 6Y. Zhang, J. Duchi, M. Wainwright, Divide and Conquer Kernel Ridge Regression, COLT’13

SLIDE 9

DKRR: Straightforward Implementation by ScaLAPACK

K + λnI is symmetric positive definite, so we can solve with Cholesky decomposition. Weak scaling efficiency drops to 0.32% when we increase to 64 nodes.
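Since K + λnI is symmetric positive definite, a Cholesky factorization applies. A minimal single-node NumPy sketch (a random PSD matrix stands in for a real kernel matrix; this is not ScaLAPACK's distributed routine):

```python
import numpy as np

def spd_solve(A, y):
    """Solve A x = y for a symmetric positive definite A via Cholesky."""
    # np.linalg.solve is used for clarity; a real implementation would call
    # triangular solvers (e.g. ScaLAPACK's distributed Cholesky routines)
    L = np.linalg.cholesky(A)       # A = L L^T
    z = np.linalg.solve(L, y)       # forward solve  L z = y
    return np.linalg.solve(L.T, z)  # backward solve L^T x = z

# a random symmetric PSD matrix stands in for the kernel matrix K
rng = np.random.default_rng(1)
n, lam = 200, 0.1
B = rng.normal(size=(n, n))
K = B @ B.T / n
A = K + lam * n * np.eye(n)     # the ridge term guarantees positive definiteness
y = rng.normal(size=n)
x = spd_solve(A, y)
```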

SLIDE 10

Divide-and-Conquer KRR (DC-KRR)7

communication overhead is low, good scaling!

  • 7Y. Zhang, J. Duchi, M. Wainwright, Divide and Conquer Kernel Ridge Regression, COLT’13

SLIDE 11

DC-KRR key idea: block-diagonal matrix approximation

figure from the DC-KRR authors (Y. Zhang, J. Duchi, M. Wainwright)
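DC-KRR's scheme — random partition into p parts, one local KRR solve per part, predictions averaged — can be sketched as follows. This is a toy single-process version with assumed hyperparameters, not the authors' code:

```python
import numpy as np

def gaussian_kernel(X, Z, sigma=1.0):
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2 * sigma ** 2))

def dckrr_fit(X, y, p, lam):
    """Randomly split the data into p parts and solve a local KRR on each."""
    idx = np.random.default_rng(0).permutation(len(X))
    models = []
    for part in np.array_split(idx, p):
        Xp, yp = X[part], y[part]
        m = len(part)
        alpha = np.linalg.solve(gaussian_kernel(Xp, Xp) + lam * m * np.eye(m), yp)
        models.append((Xp, alpha))
    return models

def dckrr_predict(models, X_test):
    """Average the local models' predictions."""
    return np.mean([gaussian_kernel(X_test, Xp) @ a for Xp, a in models], axis=0)

X = np.random.default_rng(2).uniform(-3, 3, size=(120, 1))
y = np.sin(X[:, 0])
models = dckrr_fit(X, y, p=4, lam=1e-4)
pred = dckrr_predict(models, X)
mse = np.mean((pred - y) ** 2)
```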

SLIDE 12

DC-KRR beats previous methods on tens of nodes

figure from the DC-KRR authors (Y. Zhang, J. Duchi, M. Wainwright), based on a dataset from a music recommendation system

SLIDE 13

Weak scaling in accuracy (i.e. MSE)

Table 1: MSE (lower is better), 2k samples per node

Methods            8k samples   32k samples   128k samples
DKRR (baseline)    90.9         85.0          0.002
DCKRR              88.9         85.5          81.0

when we scale DC-KRR to many nodes, it loses accuracy

SLIDE 14

Outline

• Introduction
• Existing Approaches
• Our Approach
• Analysis and Results

SLIDE 15

Why does DC-KRR not work at scale?

It is not safe to ignore the off-diagonal parts

there are many nonzero entries in the off-diagonal parts

a 5k-by-5k Gaussian kernel matrix from the UCI Covertype dataset, visualized with Matlab's spy

SLIDE 16

How to block-diagonalize the kernel matrix?

k-means clustering algorithm

cluster the samples based on Euclidean distance
• xi and xj in the same cluster: ||xi − xj|| is small
• xi and xj in different clusters: ||xi − xj|| is large
• ||xi − xj|| → ∞ means K[i][j] → 0, since K[i][j] = Φ(xj, xi) = exp(−||xi − xj||²/(2σ²))

[Figure 1.1: Original Kernel | Figure 1.2: After K-means]

nonzero threshold: entries larger than 10−6
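The reordering effect can be reproduced on a toy matrix: grouping rows and columns by cluster pushes essentially all entries above the 10−6 threshold into the diagonal blocks. A NumPy-only sketch with a minimal Lloyd's k-means (the blob positions and kernel bandwidth are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
# two well-separated blobs in 2-D, shuffled so the structure is hidden
X = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(8, 0.5, (30, 2))])
X = rng.permutation(X)

def gaussian_kernel(X, sigma=1.0):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2 * sigma ** 2))

def kmeans_labels(X, k=2, iters=20):
    """Minimal Lloyd's k-means with farthest-point initialization (k=2)."""
    far = int(np.argmax(((X - X[0]) ** 2).sum(axis=1)))
    centers = X[[0, far]].astype(float)
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for c in range(k):
            centers[c] = X[labels == c].mean(axis=0)
    return labels

def offdiag_fraction(K, n0, thresh=1e-6):
    """Fraction of entries above thresh lying outside the two diagonal blocks."""
    nz = K > thresh
    off = nz[:n0, n0:].sum() + nz[n0:, :n0].sum()
    return off / nz.sum()

labels = kmeans_labels(X)
order = np.argsort(labels, kind="stable")   # group the samples by cluster
n0 = int((labels == 0).sum())

frac_before = offdiag_fraction(gaussian_kernel(X), 30)
frac_after = offdiag_fraction(gaussian_kernel(X[order]), n0)
```

For well-separated clusters, `frac_after` drops to zero: the reordered kernel is (numerically) block diagonal.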

SLIDE 17

K-means KRR (KKRR)

we expect KKRR to achieve low MSE!

SLIDE 18

KKRR performs poorly

  • our system tries different hyperparameters iteratively, until it finds the lowest MSE

dataset from a music recommendation system, on 96 CPU processors

SLIDE 19

Why does KKRR perform poorly?

• different clusters are very different from each other
• they generate different models, so averaging them is a bad idea

SLIDE 20

KKRR2

we expect KKRR2 to achieve low MSE!

SLIDE 21

KKRR2 performs much better than KKRR

  • our system tries different hyperparameters iteratively, until it finds the lowest MSE

dataset from a music recommendation system, on 96 CPU processors

SLIDE 22

How good can this be in the best case?

suppose we can select the best model (try each one-by-one)

SLIDE 23

KKRR3: error lower bound for block diagonal method

we believe KKRR3 will achieve the lowest MSE!

SLIDE 24

Block diagonal works well given an optimal model-selection algorithm

  • our system tries different hyperparameters iteratively, until it finds the lowest MSE

dataset from a music recommendation system, on 96 CPU processors

SLIDE 25

However, the KKRR family is slow

  • our system tries different hyperparameters iteratively, until it finds the lowest MSE

dataset from a music recommendation system, on 96 CPU processors

SLIDE 26

K-means clustering: imbalanced partitioning

the sizes of different blocks are different

SLIDE 27

K-means clustering: imbalanced partitioning

[Figure 1.3: Load Balance for Data Size | Figure 1.4: Load Balance for Time]

different nodes have different numbers of samples (n); memory: Θ(n²), flops: Θ(n³)

SLIDE 28

Basic Idea of K-balance algorithm

• Run k-means to get all the cluster centers
• Find the closest center (CC) for a given sample
• If CC still has room, assign the sample to it; if CC is already balanced, try the next-closest center
• When every center has n/p samples, done
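The steps above can be sketched as a greedy assignment. This is a simplified reading of K-balance (samples processed in a fixed order, each taking its nearest center that still has room); the authors' released code is the reference implementation:

```python
import numpy as np

def k_balance(X, centers):
    """Greedily assign each sample to its nearest center that still has room,
    so every center ends up with exactly n/p samples."""
    n, p = len(X), len(centers)
    cap = n // p                          # balanced size: n/p samples per center
    # d[i][j] = distance between i-th center and j-th sample
    d = np.sqrt(((centers[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    counts = np.zeros(p, dtype=int)
    assign = np.empty(n, dtype=int)
    for j in range(n):                    # for each sample
        for c in np.argsort(d[:, j]):     # try centers from nearest to farthest
            if counts[c] < cap:           # skip centers that are already balanced
                assign[j] = c
                counts[c] += 1
                break
    return assign

# the slides' setting: 8 samples, 4 centers, balanced size 2
rng = np.random.default_rng(4)
X = rng.normal(size=(8, 2))
centers = rng.normal(size=(4, 2))
assign = k_balance(X, centers)
```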

SLIDE 29

K-balance

distance matrix: 8 samples, 4 centers

d[i][j] = the distance between i-th center and j-th sample

balanced case: each center has 2 samples

SLIDE 30

The center for S0 ⇒ C2

underload: C0, C1, C2, C3 balanced: None

SLIDE 31

The center for S1 ⇒ C3

underload: C0, C1, C2, C3 balanced: None

SLIDE 32

The center for S2 ⇒ C0

underload: C0, C1, C2, C3 balanced: None

SLIDE 33

The center for S3 ⇒ C3

underload: C0, C1, C2, C3 balanced: None

SLIDE 34

The center for S4 ⇒ C0

underload: C0, C1, C2 balanced: C3

SLIDE 35

The center for S5 ⇒ C0 ⇒ C3 ⇒ C1

underload: C1, C2 balanced: C0, C3

SLIDE 36

The center for S6 ⇒ C3 ⇒ C2

underload: C1, C2 balanced: C0, C3

SLIDE 37

The center for S7 ⇒ C0 ⇒ C2 ⇒ C1

underload: C1 balanced: C0, C2, C3

SLIDE 38

Done

underload: None balanced: C0, C1, C2, C3

SLIDE 39

Optimization Flow: Balanced KRR (BKRR)

By changing k-means to k-balance, we get BKRR, BKRR2, and BKRR3

SLIDE 40

Outline

• Introduction
• Existing Approaches
• Our Approach
• Analysis and Results

SLIDE 41

The Datasets

name              MSD       cadata    MG        space-ga
# Train Samples   463,715   18,432    1,024     2,560
# Test Samples    51,630    2,208     361       547
# Features        90        8         6         6
Application       Music     Housing   Dynamics  Politics

SLIDE 42

BKRR vs DCKRR

based on the MSD dataset

  • J. Demmel, C. Hsieh, R. Vuduc

UC Berkeley Computer Sci 42 / 48

slide-43
SLIDE 43

BKRR achieves a good speedup over KKRR

[Figure 1.5: Load Balance for Data Size | Figure 1.6: Load Balance for Time]

different nodes have different numbers of samples (n); memory: Θ(n²), flops: Θ(n³)

based on the MSD dataset

  • J. Demmel, C. Hsieh, R. Vuduc

UC Berkeley Computer Sci 43 / 48

slide-44
SLIDE 44

BKRR2 performs better than DCKRR

[Figure 1.7: 1024 train, 361 test | Figure 1.8: 2560 train, 547 test | Figure 1.9: 18432 train, 2208 test]

SLIDE 45

Weak Scaling on MSD dataset

[Figure 1.10: Time | Figure 1.11: Efficiency | Figure 1.12: Accuracy]

SLIDE 46

Key idea: difference between BKRR2 and DCKRR

• BKRR2: partition the dataset into p different parts, generate p different models, and select the best model.
• DCKRR: partition the dataset into p similar parts, generate p similar models, and use the average of all the models.
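The contrast can be made concrete with a toy sketch: given p candidate predictions on a validation set, BKRR2-style selection keeps the single best model, while DCKRR-style averaging blends them all. The validation labels and the three models' predictions below are purely illustrative:

```python
import numpy as np

def average_models(preds):
    """DCKRR-style: average the p models' predictions."""
    return np.mean(preds, axis=0)

def select_best(preds, y_val):
    """BKRR2-style: keep only the model with the lowest validation MSE."""
    mses = [np.mean((p - y_val) ** 2) for p in preds]
    best = int(np.argmin(mses))
    return preds[best], mses[best]

# illustrative validation labels and three partition models' predictions:
# one local model matches this validation set well, the others do not
y_val = np.array([1.0, 2.0, 3.0])
preds = [np.array([1.1, 2.0, 2.9]),   # good local model
         np.array([3.0, 0.0, 5.0]),   # models trained on very different
         np.array([0.0, 4.0, 1.0])]   # partitions fit poorly here

best_pred, best_mse = select_best(preds, y_val)
avg_mse = np.mean((average_models(preds) - y_val) ** 2)
```

When the partitions (and therefore the models) are heterogeneous, selecting beats averaging, which is the intuition behind BKRR2.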

SLIDE 47

Tradeoff between accuracy and scaling

SLIDE 48

Thanks for your time!

check out our source code:

https://www.cs.berkeley.edu/~youyang/cakrr.zip
