On the Nyström Method for Approximating a Gram Matrix for Improved Kernel-Based Learning (PowerPoint PPT presentation)


SLIDE 1

On the Nyström Method for Approximating a Gram Matrix for Improved Kernel-Based Learning

Michael W. Mahoney

(joint work with P. Drineas; thanks to R. Kannan)

Yale University

  • Dept. of Mathematics

http://cs-www.cs.yale.edu/homes/mmahoney

COLT, June 2005

SLIDE 2

Motivation (1 of 3)

Methods to extract linear structure from the data:

  • Support Vector Machines (SVMs).
  • Gaussian Processes (GPs).
  • Singular Value Decomposition (SVD) and the related PCA.

Kernel-based learning methods to extract non-linear structure:

  • Choose features to define a (dot product) space F.
  • Map the data, X, to F by φ: X→F.
  • Do classification, regression, and clustering in F with linear methods.

SLIDE 3

Motivation (2 of 3)

  • Use dot products for information about mutual positions.
  • Define the kernel or Gram matrix: G_ij = k_ij = (φ(X^(i)), φ(X^(j))).
  • Algorithms that are expressed in terms of dot products can be given the Gram matrix G instead of the data covariance matrix X^T X.
  • Note: Isomap, LLE, graph Laplacian eigenmaps, Hessian eigenmaps, and SDE (dimensionality-reduction methods for nonlinear manifolds) are kernel PCA for particular Gram matrices.
  • Note: for Mercer kernels, G is symmetric positive semidefinite (SPSD).
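
As a concrete illustration (not from the slides): a minimal numpy sketch that builds the Gram matrix for a Gaussian (RBF) kernel, one common Mercer kernel, and checks that it is SPSD. The kernel choice and bandwidth are arbitrary assumptions made for the example.

    # Minimal sketch: Gram matrix for an assumed RBF kernel.
    import numpy as np

    def rbf_gram_matrix(X, sigma=1.0):
        """G_ij = k(X^(i), X^(j)) = exp(-||X^(i) - X^(j)||^2 / (2 sigma^2))."""
        sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
        return np.exp(-sq_dists / (2.0 * sigma**2))

    X = np.random.randn(100, 5)                    # 100 toy points in 5 dimensions
    G = rbf_gram_matrix(X)
    assert np.allclose(G, G.T)                     # symmetric
    assert np.linalg.eigvalsh(G).min() > -1e-8     # PSD up to roundoff (Mercer)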

SLIDE 4

Motivation (3 of 3)

If the Gram matrix G (G_ij = k_ij = (φ(X^(i)), φ(X^(j)))) is dense but (nearly) low-rank, then calculations of interest still need O(n^2) space and O(n^3) time:

  • matrix inversion in GP prediction,
  • quadratic programming problems in SVMs,
  • computation of the eigendecomposition of G.

Relevant recent work using low-rank methods:

  • Achlioptas, McSherry, and Schölkopf (2002), "randomized kernels".
  • Williams and Seeger (2001), the "Nyström method".

SLIDE 5

Overview

Our main algorithm:

  • Randomized algorithm to approximate a Gram matrix.
  • Low-rank approximation in terms of columns (and rows) of G = X^T X.

Our main quality-of-approximation theorem:

  • Provably good approximation if nonuniform probabilities are used.

Discussion of the Nyström method:

  • Nyström method for integral equations and matrix problems.
  • Relationship to randomized SVD and CUR algorithms.

SLIDE 6

Review of Linear Algebra

SLIDE 7

Our Main Algorithm

Input: an n x n SPSD matrix G, probabilities {p_i, i = 1, …, n}, c ≤ n, and k ≤ c.
Output: an n x c matrix C and a c x c matrix W_k^+ (s.t. C W_k^+ C^T ≈ G).

Algorithm:

  • Pick c columns of G in i.i.d. trials, with replacement and with respect to the probabilities {p_i}; let I be the set of indices of the sampled columns.
  • Scale each sampled column (with index i ∈ I) by dividing it by √(c p_i).
  • Let C be the n x c matrix containing the rescaled sampled columns.
  • Let W be the c x c submatrix of G with entries G_ij / (c √(p_i p_j)), i, j ∈ I.
  • Compute W_k^+, the Moore-Penrose pseudoinverse of W_k, the best rank-k approximation to W.
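
A minimal numpy sketch of this algorithm (the function name nystrom_approx and the toy data are my own illustrative choices, not from the talk; the sketch assumes the top k eigenvalues of W are positive):

    # Sketch of the Main Algorithm: G ≈ C @ Wk_pinv @ C.T for an SPSD matrix G.
    import numpy as np

    def nystrom_approx(G, p, c, k, rng=np.random.default_rng(0)):
        n = G.shape[0]
        idx = rng.choice(n, size=c, replace=True, p=p)    # i.i.d., with replacement
        scale = 1.0 / np.sqrt(c * p[idx])                 # divide by sqrt(c * p_i)
        C = G[:, idx] * scale                             # n x c rescaled columns
        W = G[np.ix_(idx, idx)] * np.outer(scale, scale)  # W_ij = G_ij/(c sqrt(p_i p_j))
        evals, evecs = np.linalg.eigh(W)                  # W is SPSD
        top = np.argsort(evals)[::-1][:k]                 # assumes top-k evals > 0
        Wk_pinv = (evecs[:, top] / evals[top]) @ evecs[:, top].T  # pseudoinverse of W_k
        return C, Wk_pinv

    X = np.random.randn(500, 20)
    G = X @ X.T                                   # SPSD, rank <= 20
    d2 = np.diag(G) ** 2
    C, Wk_pinv = nystrom_approx(G, p=d2 / d2.sum(), c=100, k=20)
    print(np.linalg.norm(G - C @ Wk_pinv @ C.T) / np.linalg.norm(G))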

SLIDE 8

Our Main Theorem

Let ε > 0 and η = 1 + √(8 log(1/δ)). Construct the approximation C W_k^+ C^T with our Main Algorithm by sampling c columns of G with probabilities p_i = G_ii^2 / Σ_i G_ii^2.

If c ≥ 64 k η^2 / ε^4, then w.h.p.: ||G - C W_k^+ C^T||_F ≤ ||G - G_k||_F + ε Σ_i G_ii^2.

If c ≥ 4 η^2 / ε^2, then w.h.p.: ||G - C W_k^+ C^T||_2 ≤ ||G - G_k||_2 + ε Σ_i G_ii^2.
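
To make the constants concrete, a small arithmetic sketch (the values of δ, ε, and k are arbitrary choices for illustration; note how strongly the required c scales with 1/ε):

    # How many columns do the two bounds ask for? (illustrative numbers only)
    import math

    delta, eps, k = 0.1, 0.5, 10
    eta = 1 + math.sqrt(8 * math.log(1 / delta))   # eta = 1 + sqrt(8 log(1/delta))
    c_frob = 64 * k * eta**2 / eps**4              # Frobenius-norm bound
    c_spec = 4 * eta**2 / eps**2                   # spectral-norm bound
    print(math.ceil(c_frob), math.ceil(c_spec))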

SLIDE 9

Notes About Our Main Result (1 of 2)

Note the structural simplicity of our main result:

  • C consists of a small number of representative data points.
  • W consists of the submatrix (induced subgraph) defined by those points.

Computational resource requirements:

  • Assume the data X (or Gram matrix G) are stored externally.
  • Algorithm performs two passes over the data.
  • Algorithm uses O(n) additional scratch space and additional computation time.

SLIDE 10

Notes About Our Main Result (2 of 2)

How to interpret the sampling probabilities? If the sampling probabilities were p_i = ||G^(i)||^2 / ||G||_F^2:

  • they would provide a bias towards data points that are more "important": longer and/or more representative;
  • the additional error would be ε ||G||_F and not ε Σ_i G_ii^2.

Our sampling probabilities ignore such correlations: since G = X^T X, we have G_ii = ||X^(i)||^2, so

p_i = G_ii^2 / Σ_i G_ii^2 = ||X^(i)||^4 / Σ_i ||X^(i)||^4.
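
A short numpy sketch contrasting the two distributions on a toy Gram matrix (illustrative only; the variable names are mine):

    # Column-norm probabilities vs. the diagonal-based probabilities used here.
    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.standard_normal((8, 50))     # 50 data points in 8 dimensions
    G = X.T @ X                          # Gram matrix, G_ii = ||X^(i)||^2

    p_colnorm = np.sum(G**2, axis=0) / np.sum(G**2)   # ||G^(i)||^2 / ||G||_F^2
    p_diag = np.diag(G)**2 / np.sum(np.diag(G)**2)    # G_ii^2 / sum_i G_ii^2
    # p_colnorm accounts for correlations between points; p_diag depends only
    # on the lengths ||X^(i)||, i.e., it ignores correlations.
    print(np.round(p_colnorm[:5], 4), np.round(p_diag[:5], 4))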

SLIDE 11

Proof of Our Main Theorem (1 of 4)

SLIDE 12

Proof of Our Main Theorem (2 of 4)

First, bound the spectral norm. Note: if k ≥ r = rank(W), then W_k = W, so C W_k^+ C^T = C W^+ C^T.

SLIDE 13

Proof of Our Main Theorem (3 of 4)

Next, bound the Frobenius norm.

SLIDE 14

Proof of Our Main Theorem (4 of 4)

Goal: Approximate the product of two (or more) matrices. (DK, DKM, DM)

Input: an m x n matrix A, a number c ≤ n, and probabilities {p_i, i = 1, …, n}.
Output: an m x c matrix C (s.t. C C^T ≈ A A^T).

Algorithm:

  • Randomly sample c columns from A according to {p_i}.
  • Rescale each sampled column by 1/√(c p_i) to form C.

Theorem: Let η = 1 + √(8 log(1/δ)). If p_i = ||A^(i)||^2 / ||A||_F^2 and c ≥ 4 η^2 / ε^2, then w.h.p.:

  • ||A A^T - C C^T|| ≤ ε ||A||_F^2
  • ||A A^T A A^T - C C^T C C^T|| ≤ ε ||A||_F^4
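
A minimal numpy sketch of this column-sampling primitive (the function name is mine; the estimator itself follows the algorithm above, and E[C C^T] = A A^T):

    # Approximate A @ A.T by sampling c rescaled columns of A.
    import numpy as np

    def sampled_outer_product(A, c, rng=np.random.default_rng(0)):
        n = A.shape[1]
        p = np.sum(A**2, axis=0) / np.sum(A**2)   # p_i = ||A^(i)||^2 / ||A||_F^2
        idx = rng.choice(n, size=c, replace=True, p=p)
        C = A[:, idx] / np.sqrt(c * p[idx])       # rescale by 1/sqrt(c p_i)
        return C

    A = np.random.randn(30, 2000)
    C = sampled_outer_product(A, c=200)
    err = np.linalg.norm(A @ A.T - C @ C.T) / np.linalg.norm(A)**2
    print(f"||AA^T - CC^T||_F / ||A||_F^2 = {err:.3f}")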

SLIDE 15

The Nyström Method (1 of 3)

SLIDE 16

The Nyström Method (2 of 3)

SLIDE 17

The Nyström Method (3 of 3)

Randomized SVD Algorithms (Frieze, Kannan, and Vempala; Drineas, Kannan, and Mahoney)

  • Randomly sample columns (xor rows).
  • Compute/approximate low-dimensional singular vectors.
  • Nyström-extend to approximate H_k, the high-dimensional singular vectors.
  • Bound: ||A - H_k H_k^T A||_{2,F} ≤ ||A - A_k||_{2,F} + ε ||A||_F.

Randomized CUR Algorithms (Drineas, Kannan, and Mahoney)

  • Randomly sample columns and rows; see the sketch after this list.
  • Bound: ||A - C U R||_{2,F} ≤ ||A - A_k||_{2,F} + ε ||A||_F.
  • Does not need or use the SPSD property.
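
A heavily simplified CUR-style sketch (hedged: the pseudoinverse-based U below is a common textbook variant chosen for brevity, not the specific U constructed in the DKM CUR algorithm):

    # CUR-style approximation A ≈ C @ U @ R from sampled columns and rows.
    import numpy as np

    rng = np.random.default_rng(2)
    A = rng.standard_normal((300, 40)) @ rng.standard_normal((40, 200))  # low rank

    def length_squared_probs(M, axis):
        sq = np.sum(M**2, axis=axis)   # squared column (axis=0) or row (axis=1) norms
        return sq / sq.sum()

    cols = rng.choice(A.shape[1], size=80, replace=True, p=length_squared_probs(A, 0))
    rows = rng.choice(A.shape[0], size=80, replace=True, p=length_squared_probs(A, 1))
    C, R = A[:, cols], A[rows, :]
    U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)   # best U for these C, R (not DKM's U)
    print(np.linalg.norm(A - C @ U @ R) / np.linalg.norm(A))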

SLIDE 18

Conclusion

Main Result: We randomly sample columns of a Gram matrix G (biased towards longer data points) to get an approximation s.t.:

||G - C W_k^+ C^T||_{2,F} ≤ ||G - G_k||_{2,F} + ε Σ_i G_ii^2.

Open problem: Sample with respect to probabilities that include correlations, preserve the SPSD property, and obtain bounds with an additional error of ε ||G||_F. (Probably a corollary of general CUR.)