Predictive low-rank decomposition for kernel methods (PowerPoint PPT presentation)


SLIDE 1

Predictive low-rank decomposition for kernel methods

Francis Bach (Ecole des Mines de Paris), Michael Jordan (UC Berkeley), ICML 2005

SLIDE 2

Predictive low-rank decomposition for kernel methods

  • Kernel algorithms and low-rank decompositions
  • Incomplete Cholesky decomposition
  • Cholesky with side information
  • Simulations – code online
SLIDE 3

Kernel matrices

  • Given

– data points x_1, …, x_n
– kernel function k(x, y)

  • Kernel methods work with the kernel matrix K

– defined as a Gram matrix: K_ij = k(x_i, x_j)
– symmetric: K = K^T
– positive semi-definite: α^T K α >= 0 for all α
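To make the definitions concrete, here is a small NumPy sketch (not from the slides) that builds a Gaussian-RBF Gram matrix and checks the three properties above; the data and the bandwidth sigma are arbitrary illustrative choices.

```python
import numpy as np

def rbf_kernel_matrix(X, sigma=1.0):
    """Gram matrix K_ij = k(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))            # n = 100 data points in R^5
K = rbf_kernel_matrix(X)

assert np.allclose(K, K.T)                   # symmetric
assert np.linalg.eigvalsh(K).min() > -1e-8   # positive semi-definite
```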

SLIDE 4

Kernel algorithms

  • Kernel algorithms: usually O(n^3) or worse

– Eigenvalue problems: kernel PCA, CCA, FDA
– Matrix inversion: LS-SVM
– Convex optimization problems: SOCP, QP, SDP

  • Requires speed-up techniques for medium/large-scale problems

  • General-purpose matrix decomposition algorithms:

– Linear in n (not even touching all n^2 kernel entries!)

  • Nyström method (Williams & Seeger, 2000)
  • Sparse greedy approximations (Smola & Schölkopf, 2000)
  • Incomplete Cholesky decomposition (Fine & Scheinberg, 2001)
SLIDE 5

Incomplete Cholesky decomposition

  • Approximate the kernel matrix by a low-rank factorization K ≈ G G^T, with G an n × m matrix

– m is the rank of the approximation G G^T
– Most algorithms then become O(m^2 n) instead of O(n^3)
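As an illustration of the speed-up (my sketch, not from the slides): once K ≈ G G^T is available, a regularized system such as the LS-SVM one, (K + λI)α = y, can be solved through an m × m system via the matrix-inversion (Woodbury) lemma; the regularization lam and the data are arbitrary.

```python
import numpy as np

def solve_low_rank(G, y, lam):
    """Solve (G G^T + lam I) alpha = y in O(m^2 n) via the
    matrix-inversion (Woodbury) lemma, instead of O(n^3)."""
    n, m = G.shape
    # (G G^T + lam I)^{-1} y = (y - G (lam I_m + G^T G)^{-1} G^T y) / lam
    small = lam * np.eye(m) + G.T @ G          # only an m x m system
    return (y - G @ np.linalg.solve(small, G.T @ y)) / lam

rng = np.random.default_rng(0)
n, m, lam = 500, 10, 0.1
G = rng.standard_normal((n, m))
y = rng.standard_normal(n)

alpha = solve_low_rank(G, y, lam)
direct = np.linalg.solve(G @ G.T + lam * np.eye(n), y)  # O(n^3) reference
assert np.allclose(alpha, direct)
```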

SLIDE 6

Kernel matrices and ranks

  • Kernel matrices may have full rank, i.e., rank n …
  • … but eigenvalues decay (at least) exponentially fast for a wide variety of kernels (Williams & Seeger, 2000; Bach & Jordan, 2002)

– Good approximation by low-rank matrices with small m

  • “Data live near a low-dimensional subspace in feature space”
  • In practice, m is very small
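A quick numerical illustration of the decay claim (my example, with arbitrary data and bandwidth): the spectrum of a Gaussian-RBF Gram matrix on random 1-D data is dominated by a handful of eigenvalues.

```python
import numpy as np

# Eigen-spectrum of a Gaussian-RBF Gram matrix on random 1-D data.
rng = np.random.default_rng(0)
x = rng.standard_normal((200, 1))
K = np.exp(-((x - x.T) ** 2) / 2.0)

eigs = np.sort(np.linalg.eigvalsh(K))[::-1]
print(eigs[:10] / eigs[0])   # normalized leading eigenvalues: rapid decay
```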
SLIDE 7

Incomplete Cholesky decomposition

  • Approximate the full matrix from a subset of selected columns (use the data points in the subset to approximate all of them)

  • Use the diagonal to characterize the behavior of the unknown block
SLIDE 8

Lemma

  • Given a positive semi-definite matrix K and a subset J of columns
  • There exists a unique matrix L such that

– L is symmetric
– The column space of L is spanned by the columns in J
– L agrees with K on the columns in J
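Under the usual reading of this lemma, the matrix can be written explicitly in Nyström form, L = K[:, J] K[J, J]^+ K[J, :]; the following sketch (my construction, on an arbitrary rank-3 test matrix) checks the stated properties.

```python
import numpy as np

def column_approximation(K, J):
    """Nystrom-type approximation built from the columns in J:
    L = K[:, J] @ pinv(K[J, J]) @ K[J, :]."""
    C = K[:, J]                        # the selected columns
    W = K[np.ix_(J, J)]                # the corresponding square block
    return C @ np.linalg.pinv(W) @ C.T

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 3))
K = A @ A.T                            # rank-3 positive semi-definite matrix
J = [0, 1]                             # approximate from two columns only

L = column_approximation(K, J)
assert np.allclose(L, L.T)             # symmetric
assert np.allclose(L[:, J], K[:, J])   # agrees with K on the columns in J
# with |J| = rank(K) columns, the approximation becomes exact
```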

SLIDE 9

Incomplete Cholesky decomposition

  • Two main issues:

– Selection of columns (pivots)
– Computation of the factor G

  • Incomplete Cholesky decomposition

– Efficient update of G with linear cost
– Pivoting: greedy choice of pivot with linear cost

SLIDE 10

Incomplete Cholesky decomposition (no pivoting)

[Figure: successive decomposition steps, k = 1, 2, 3]

SLIDE 11

Pivot selection

  • G_k: approximation after the k-th iteration
  • Error: tr(K − G_k G_k^T)
  • Gain between iterations k−1 and k = tr(K − G_{k−1} G_{k−1}^T) − tr(K − G_k G_k^T)
  • Exact computation of the gain for every candidate pivot is O(n^2)
  • Lower bound: the candidate’s entry of the residual diagonal, available at no extra cost
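A minimal NumPy sketch of the resulting algorithm (my implementation, not the authors' code): pivoted incomplete Cholesky where the greedy pivot is the largest entry of the residual diagonal, i.e., the lower bound above. The stopping tolerance and the test matrix are illustrative.

```python
import numpy as np

def incomplete_cholesky(K, m, tol=1e-12):
    """Pivoted incomplete Cholesky: returns G (n x m') with m' <= m
    and the chosen pivots, such that K ~= G @ G.T.
    Each iteration reads one column of K and costs O(n m)."""
    n = K.shape[0]
    G = np.zeros((n, m))
    d = np.diag(K).copy()       # residual diagonal: tr(K - G G^T) = d.sum()
    pivots = []
    for k in range(m):
        i = int(np.argmax(d))   # greedy pivot = largest lower bound on the gain
        if d[i] < tol:
            break               # residual negligible: stop early
        pivots.append(i)
        G[:, k] = (K[:, i] - G[:, :k] @ G[i, :k]) / np.sqrt(d[i])
        d -= G[:, k] ** 2       # O(n) update of the residual diagonal
    return G[:, :len(pivots)], pivots

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 8))
K = A @ A.T                                       # rank-8 PSD test matrix
G, pivots = incomplete_cholesky(K, m=20)
print(len(pivots), np.linalg.norm(K - G @ G.T))   # 8 pivots, ~0 error
```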
SLIDE 12

Incomplete Cholesky decomposition with pivoting

[Figure: pivot selection and pivot permutation at steps k = 1, 2, 3]

SLIDE 13

Incomplete Cholesky decomposition: what’s missing?

  • Complexity after m steps: O(m^2 n)
  • What’s wrong with incomplete Cholesky (and other decomposition algorithms)?

– They don’t take into account the classification labels or regression variables
– cf. PCA vs. LDA

SLIDE 14

Incomplete Cholesky decomposition: what’s missing?

  • Two questions:

– Can we exploit side information to lower the needed rank of the approximation?
– Can we do it in time linear in n?

SLIDE 15

Using side information

(classification labels, regression variables)

  • Given

– kernel matrix K (n × n)
– side information Y (n × d)

  • Multiple regression with d response variables
  • Classification with d classes

– Y_ni = 1 if the n-th data point belongs to class i
– Y_ni = 0 otherwise

  • Use Y to select pivots
SLIDE 16

Prediction criterion

  • Square loss: ℓ(y, ŷ) = ||y − ŷ||^2
  • Representer theorem: prediction using kernels leads to prediction error ||Y_i − (Kα)_i||^2 for the i-th data point, where the predictions are Ŷ = Kα
  • Minimum total prediction error: min over α of ||Y − Kα||_F^2
  • If K = G G^T, equal to min over W of ||Y − G W||_F^2
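As a sketch of the criterion (my code; prediction_error is a name introduced for illustration): the minimum total error is an ordinary least-squares residual, and when K = G G^T the kernel version gives the same value because both predictions range over the column space of G.

```python
import numpy as np

def prediction_error(G, Y):
    """Minimum total square-loss prediction error with features G:
    min over W of ||Y - G W||_F^2."""
    W, *_ = np.linalg.lstsq(G, Y, rcond=None)
    return np.linalg.norm(Y - G @ W) ** 2

rng = np.random.default_rng(0)
G = rng.standard_normal((100, 5))
Y = rng.standard_normal((100, 2))
K = G @ G.T

alpha, *_ = np.linalg.lstsq(K, Y, rcond=None)
err_kernel = np.linalg.norm(Y - K @ alpha) ** 2
assert np.isclose(prediction_error(G, Y), err_kernel)
```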
SLIDE 17

Computing/updating criterion

  • Requirements: efficient to add one column at a time

– (cf. the linear regression setting: add one variable at a time)

  • QR decomposition of G:

– G = Q R
– Q orthogonal, R upper triangular
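One way to see why the QR factors make the criterion cheap to maintain (my sketch, not from the slides): with G = QR and Q having orthonormal columns, the minimum prediction error equals ||Y||^2 − ||Q^T Y||^2, so appending a column to G only requires orthogonalizing one new vector against the existing columns of Q.

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.standard_normal((100, 5))
Y = rng.standard_normal((100, 2))

Q, R = np.linalg.qr(G)            # reduced QR: Q is 100 x 5, R is 5 x 5
err_qr = np.linalg.norm(Y) ** 2 - np.linalg.norm(Q.T @ Y) ** 2

W, *_ = np.linalg.lstsq(G, Y, rcond=None)
assert np.isclose(err_qr, np.linalg.norm(Y - G @ W) ** 2)
```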

SLIDE 18
Cholesky with side information (CSI)

  • Parallel Cholesky and QR decomposition
  • Selection of pivots?

SLIDE 19

Criterion for selection of pivots

  • Criterion: approximation error + prediction error
  • Gain in criterion after the k-th iteration
  • Cannot compute the gain exactly for each remaining pivot, because it requires the entire matrix K

  • Main idea: compute a few “look-ahead” decomposition steps and use that decomposition to estimate the gains (sketched below)

– enough look-ahead steps to gain enough information about K
– few enough to incur little additional cost
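The following sketch implements only the “semi-naïve” baseline discussed on the running-time slide, rebuilt from scratch at each step, so that the selection rule is easy to read; the unweighted sum of the two error terms is my simplification, and the efficient look-ahead updates of the actual CSI algorithm are in the paper and code.

```python
import numpy as np

def total_cost(K, Y, J):
    """Approximation error tr(K - L) plus prediction error
    min over B of ||Y - K[:, J] B||_F^2, for the pivot subset J."""
    C, W = K[:, J], K[np.ix_(J, J)]
    L = C @ np.linalg.pinv(W) @ C.T
    approx_err = np.trace(K) - np.trace(L)
    B, *_ = np.linalg.lstsq(C, Y, rcond=None)
    pred_err = np.linalg.norm(Y - C @ B) ** 2
    return approx_err + pred_err

def csi_seminaive(K, Y, m):
    """Greedy pivot selection against the combined criterion,
    re-evaluating every remaining candidate in full at each step
    (deliberately slow; CSI estimates these gains from a few
    look-ahead Cholesky/QR steps instead)."""
    pivots = []
    for _ in range(m):
        candidates = [i for i in range(K.shape[0]) if i not in pivots]
        pivots.append(min(candidates,
                          key=lambda i: total_cost(K, Y, pivots + [i])))
    return pivots

rng = np.random.default_rng(0)
A = rng.standard_normal((60, 6))
K = A @ A.T
Y = A @ rng.standard_normal((6, 1))    # targets driven by the same features
print(csi_seminaive(K, Y, m=3))        # indices of the selected pivots
```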

SLIDE 20

Incomplete Cholesky decomposition with pivoting and look-ahead

[Figure: pivot selection with look-ahead and pivot permutation at steps k = 1, 2, 3]

SLIDE 21

Running time complexity

  • “Semi-naïve” computation of the look-ahead decompositions (i.e., start again from scratch at each iteration)

– Decompositions and criterion gains are recomputed in full at every iteration

  • Efficient implementation (see paper/code)

– Steps of Cholesky/QR and computation of the criterion gains stay linear in n

SLIDE 22

Simulations

  • UCI datasets
  • Gaussian-RBF kernels, least-squares SVM
  • Width and regularization parameters chosen by cross-validation

  • Compare the minimal ranks for which the average performance is within one standard deviation of the performance with the full kernel matrix

[Table: test-set accuracy with the full-rank matrix vs. with each matrix decomposition]

SLIDE 23

Simulations

SLIDE 24

Conclusion

  • Discriminative kernel methods and … discriminative matrix decomposition algorithms

  • Same complexity as the non-discriminative version (linear in n)
  • Matlab/C code available online