Predictive low-rank decomposition for kernel methods (PowerPoint PPT presentation)


SLIDE 1

Predictive low-rank decomposition for kernel methods

Francis Bach (Ecole des Mines de Paris), Michael Jordan (UC Berkeley), ICML 2005

SLIDE 2

Predictive low-rank decomposition for kernel methods

  • Kernel algorithms and low-rank decompositions
  • Incomplete Cholesky decomposition
  • Cholesky with side information
  • Simulations – code online
SLIDE 3

Kernel matrices

  • Given

– data points x_1, …, x_n
– kernel function k(x, y)

  • Kernel methods work with the kernel matrix K

– defined as a Gram matrix: K_ij = k(x_i, x_j)
– symmetric: K = K^T
– positive semi-definite: α^T K α >= 0 for all α
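To make the definitions concrete, here is a small NumPy sketch (not from the slides) that builds a Gaussian-RBF Gram matrix and checks the three properties above; the data and the bandwidth sigma are arbitrary illustrative choices.

```python
import numpy as np

def rbf_kernel_matrix(X, sigma=1.0):
    """Gram matrix K_ij = k(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))            # n = 100 data points in R^5
K = rbf_kernel_matrix(X)

assert np.allclose(K, K.T)                   # symmetric
assert np.linalg.eigvalsh(K).min() > -1e-8   # positive semi-definite
```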

SLIDE 4

Kernel algorithms

  • Kernel algorithms: usually O(n^3) or worse

– Eigenvalue problems: kernel PCA, CCA, FDA
– Matrix inversion: LS-SVM
– Convex optimization problems: SOCP, QP, SDP

  • Requires speed-up techniques for medium/large-scale problems

  • General-purpose matrix decomposition algorithms:

– Linear in n (not even touching all n^2 kernel entries!)

  • Nyström method (Williams & Seeger, 2000)
  • Sparse greedy approximations (Smola & Schölkopf, 2000)
  • Incomplete Cholesky decomposition (Fine & Scheinberg, 2001)
SLIDE 5

Incomplete Cholesky decomposition

  • Approximate the kernel matrix by a low-rank factorization K ≈ G G^T, with G an n × m matrix

– m is the rank of the approximation G G^T
– Most algorithms then become O(m^2 n) instead of O(n^3)
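As an illustration of the speed-up (my sketch, not from the slides): once K ≈ G G^T is available, a regularized system such as the LS-SVM one, (K + λI)α = y, can be solved through an m × m system via the matrix-inversion (Woodbury) lemma; the regularization lam and the data are arbitrary.

```python
import numpy as np

def solve_low_rank(G, y, lam):
    """Solve (G G^T + lam I) alpha = y in O(m^2 n) via the
    matrix-inversion (Woodbury) lemma, instead of O(n^3)."""
    n, m = G.shape
    # (G G^T + lam I)^{-1} y = (y - G (lam I_m + G^T G)^{-1} G^T y) / lam
    small = lam * np.eye(m) + G.T @ G          # only an m x m system
    return (y - G @ np.linalg.solve(small, G.T @ y)) / lam

rng = np.random.default_rng(0)
n, m, lam = 500, 10, 0.1
G = rng.standard_normal((n, m))
y = rng.standard_normal(n)

alpha = solve_low_rank(G, y, lam)
direct = np.linalg.solve(G @ G.T + lam * np.eye(n), y)  # O(n^3) reference
assert np.allclose(alpha, direct)
```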

SLIDE 6

Kernel matrices and ranks

  • Kernel matrices may have full rank, i.e., rank n …
  • … but eigenvalues decay (at least) exponentially fast for a wide variety of kernels (Williams & Seeger, 2000; Bach & Jordan, 2002)

– Good approximation by low-rank matrices with small m

  • “Data live near a low-dimensional subspace in feature space”
  • In practice, m is very small
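A quick numerical illustration of the decay claim (my example, with arbitrary data and bandwidth): the spectrum of a Gaussian-RBF Gram matrix on random 1-D data is dominated by a handful of eigenvalues.

```python
import numpy as np

# Eigen-spectrum of a Gaussian-RBF Gram matrix on random 1-D data.
rng = np.random.default_rng(0)
x = rng.standard_normal((200, 1))
K = np.exp(-((x - x.T) ** 2) / 2.0)

eigs = np.sort(np.linalg.eigvalsh(K))[::-1]
print(eigs[:10] / eigs[0])   # normalized leading eigenvalues: rapid decay
```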
SLIDE 7

Incomplete Cholesky decomposition

  • Approximate the full matrix from a subset of selected columns (use the data points in the subset to approximate all of them)

  • Use the diagonal to characterize the behavior of the unknown block
SLIDE 8

Lemma

  • Given a positive semi-definite matrix K and a subset J of columns
  • There exists a unique matrix L such that

– L is symmetric
– The column space of L is spanned by the columns in J
– L agrees with K on the columns in J
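Under the usual reading of this lemma, the matrix can be written explicitly in Nyström form, L = K[:, J] K[J, J]^+ K[J, :]; the following sketch (my construction, on an arbitrary rank-3 test matrix) checks the stated properties.

```python
import numpy as np

def column_approximation(K, J):
    """Nystrom-type approximation built from the columns in J:
    L = K[:, J] @ pinv(K[J, J]) @ K[J, :]."""
    C = K[:, J]                        # the selected columns
    W = K[np.ix_(J, J)]                # the corresponding square block
    return C @ np.linalg.pinv(W) @ C.T

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 3))
K = A @ A.T                            # rank-3 positive semi-definite matrix
J = [0, 1]                             # approximate from two columns only

L = column_approximation(K, J)
assert np.allclose(L, L.T)             # symmetric
assert np.allclose(L[:, J], K[:, J])   # agrees with K on the columns in J
# with |J| = rank(K) columns, the approximation becomes exact
```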

SLIDE 9

Incomplete Cholesky decomposition

  • Two main issues:

– Selection of columns (pivots)
– Computation of the factor G

  • Incomplete Cholesky decomposition

– Efficient update of G with linear cost
– Pivoting: greedy choice of pivot with linear cost

SLIDE 10

Incomplete Cholesky decomposition (no pivoting)

[Figure: successive decomposition steps, k = 1, 2, 3]

SLIDE 11

Pivot selection

  • G_k: approximation after the k-th iteration
  • Error: tr(K − G_k G_k^T)
  • Gain between iterations k−1 and k = tr(K − G_{k−1} G_{k−1}^T) − tr(K − G_k G_k^T)
  • Exact computation of the gain for every candidate pivot is O(n^2)
  • Lower bound: the candidate’s entry of the residual diagonal, available at no extra cost
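A minimal NumPy sketch of the resulting algorithm (my implementation, not the authors' code): pivoted incomplete Cholesky where the greedy pivot is the largest entry of the residual diagonal, i.e., the lower bound above. The stopping tolerance and the test matrix are illustrative.

```python
import numpy as np

def incomplete_cholesky(K, m, tol=1e-12):
    """Pivoted incomplete Cholesky: returns G (n x m') with m' <= m
    and the chosen pivots, such that K ~= G @ G.T.
    Each iteration reads one column of K and costs O(n m)."""
    n = K.shape[0]
    G = np.zeros((n, m))
    d = np.diag(K).copy()       # residual diagonal: tr(K - G G^T) = d.sum()
    pivots = []
    for k in range(m):
        i = int(np.argmax(d))   # greedy pivot = largest lower bound on the gain
        if d[i] < tol:
            break               # residual negligible: stop early
        pivots.append(i)
        G[:, k] = (K[:, i] - G[:, :k] @ G[i, :k]) / np.sqrt(d[i])
        d -= G[:, k] ** 2       # O(n) update of the residual diagonal
    return G[:, :len(pivots)], pivots

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 8))
K = A @ A.T                                       # rank-8 PSD test matrix
G, pivots = incomplete_cholesky(K, m=20)
print(len(pivots), np.linalg.norm(K - G @ G.T))   # 8 pivots, ~0 error
```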
SLIDE 12

Incomplete Cholesky decomposition with pivoting

[Figure: pivot selection and pivot permutation at steps k = 1, 2, 3]

SLIDE 13

Incomplete Cholesky decomposition: what’s missing?

  • Complexity after m steps: O(m^2 n)
  • What’s wrong with incomplete Cholesky (and other decomposition algorithms)?

– They don’t take into account the classification labels or regression variables
– cf. PCA vs. LDA

SLIDE 14

Incomplete Cholesky decomposition: what’s missing?

  • Two questions:

– Can we exploit side information to lower the needed rank of the approximation?
– Can we do it in time linear in n?

SLIDE 15

Using side information

(classification labels, regression variables)

  • Given

– kernel matrix K (n × n)
– side information Y (n × d)

  • Multiple regression with d response variables
  • Classification with d classes

– Y_ni = 1 if the n-th data point belongs to class i
– Y_ni = 0 otherwise

  • Use Y to select pivots
SLIDE 16

Prediction criterion

  • Square loss: ℓ(y, ŷ) = ||y − ŷ||^2
  • Representer theorem: prediction using kernels leads to prediction error ||Y_i − (Kα)_i||^2 for the i-th data point, where the predictions are Ŷ = Kα
  • Minimum total prediction error: min over α of ||Y − Kα||_F^2
  • If K = G G^T, equal to min over W of ||Y − G W||_F^2
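As a sketch of the criterion (my code; prediction_error is a name introduced for illustration): the minimum total error is an ordinary least-squares residual, and when K = G G^T the kernel version gives the same value because both predictions range over the column space of G.

```python
import numpy as np

def prediction_error(G, Y):
    """Minimum total square-loss prediction error with features G:
    min over W of ||Y - G W||_F^2."""
    W, *_ = np.linalg.lstsq(G, Y, rcond=None)
    return np.linalg.norm(Y - G @ W) ** 2

rng = np.random.default_rng(0)
G = rng.standard_normal((100, 5))
Y = rng.standard_normal((100, 2))
K = G @ G.T

alpha, *_ = np.linalg.lstsq(K, Y, rcond=None)
err_kernel = np.linalg.norm(Y - K @ alpha) ** 2
assert np.isclose(prediction_error(G, Y), err_kernel)
```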
SLIDE 17

Computing/updating criterion

  • Requirements: efficient to add one column at a time

– (cf. the linear regression setting: add one variable at a time)

  • QR decomposition of G:

– G = Q R
– Q orthogonal, R upper triangular
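One way to see why the QR factors make the criterion cheap to maintain (my sketch, not from the slides): with G = QR and Q having orthonormal columns, the minimum prediction error equals ||Y||^2 − ||Q^T Y||^2, so appending a column to G only requires orthogonalizing one new vector against the existing columns of Q.

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.standard_normal((100, 5))
Y = rng.standard_normal((100, 2))

Q, R = np.linalg.qr(G)            # reduced QR: Q is 100 x 5, R is 5 x 5
err_qr = np.linalg.norm(Y) ** 2 - np.linalg.norm(Q.T @ Y) ** 2

W, *_ = np.linalg.lstsq(G, Y, rcond=None)
assert np.isclose(err_qr, np.linalg.norm(Y - G @ W) ** 2)
```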

SLIDE 18
Cholesky with side information (CSI)

  • Parallel Cholesky and QR decomposition
  • Selection of pivots?

SLIDE 19

Criterion for selection of pivots

  • Criterion: approximation error + prediction error
  • Gain in criterion after the k-th iteration
  • Cannot compute the gain exactly for each remaining pivot, because it requires the entire matrix K

  • Main idea: compute a few “look-ahead” decomposition steps and use that decomposition to estimate the gains (sketched below)

– enough look-ahead steps to gain enough information about K
– few enough to incur little additional cost
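The following sketch implements only the “semi-naïve” baseline discussed on the running-time slide, rebuilt from scratch at each step, so that the selection rule is easy to read; the unweighted sum of the two error terms is my simplification, and the efficient look-ahead updates of the actual CSI algorithm are in the paper and code.

```python
import numpy as np

def total_cost(K, Y, J):
    """Approximation error tr(K - L) plus prediction error
    min over B of ||Y - K[:, J] B||_F^2, for the pivot subset J."""
    C, W = K[:, J], K[np.ix_(J, J)]
    L = C @ np.linalg.pinv(W) @ C.T
    approx_err = np.trace(K) - np.trace(L)
    B, *_ = np.linalg.lstsq(C, Y, rcond=None)
    pred_err = np.linalg.norm(Y - C @ B) ** 2
    return approx_err + pred_err

def csi_seminaive(K, Y, m):
    """Greedy pivot selection against the combined criterion,
    re-evaluating every remaining candidate in full at each step
    (deliberately slow; CSI estimates these gains from a few
    look-ahead Cholesky/QR steps instead)."""
    pivots = []
    for _ in range(m):
        candidates = [i for i in range(K.shape[0]) if i not in pivots]
        pivots.append(min(candidates,
                          key=lambda i: total_cost(K, Y, pivots + [i])))
    return pivots

rng = np.random.default_rng(0)
A = rng.standard_normal((60, 6))
K = A @ A.T
Y = A @ rng.standard_normal((6, 1))    # targets driven by the same features
print(csi_seminaive(K, Y, m=3))        # indices of the selected pivots
```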

SLIDE 20

Incomplete Cholesky decomposition with pivoting and look-ahead

[Figure: pivot selection with look-ahead and pivot permutation at steps k = 1, 2, 3]

SLIDE 21

Running time complexity

  • “Semi-naïve” computation of the look-ahead decompositions (i.e., start again from scratch at each iteration)

– Decompositions and criterion gains are recomputed in full at every iteration

  • Efficient implementation (see paper/code)

– Steps of Cholesky/QR and computation of the criterion gains stay linear in n

SLIDE 22

Simulations

  • UCI datasets
  • Gaussian-RBF kernels, least-squares SVM
  • Width and regularization parameters chosen by cross-validation

  • Compare the minimal ranks for which the average performance is within one standard deviation of the performance with the full kernel matrix

[Table: test-set accuracy with the full-rank matrix vs. with each matrix decomposition]

SLIDE 23

Simulations

SLIDE 24

Conclusion

  • Discriminative kernel methods and … discriminative matrix decomposition algorithms

  • Same complexity as the non-discriminative version (linear in n)
  • Matlab/C code available online