Low Rank Approximation Lecture 3
Daniel Kressner Chair for Numerical Algorithms and HPC Institute of Mathematics, EPFL daniel.kressner@epfl.ch
Basic randomized algorithm for low-rank approximation
Must read: [Halko/Martinsson/Tropp’2010], Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions.

Randomized algorithm: draw an n × r Gaussian random matrix Ω, compute Y = AΩ, compute an ONB Q of range(Y), and set B = Q^T A. Then

    Â = QB (in factorized form).

Exact recovery: If A has rank r, we recover Â = A with probability 1.
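As a concrete illustration, the scheme above can be sketched in a few lines of NumPy (a minimal sketch; function and variable names are mine, not from the lecture):

```python
import numpy as np

def randomized_lowrank(A, r, rng=None):
    """Basic randomized low-rank approximation: A ~= Q @ B, where Q has r
    orthonormal columns (no oversampling yet)."""
    rng = np.random.default_rng(rng)
    n = A.shape[1]
    Omega = rng.standard_normal((n, r))   # Gaussian sketch matrix
    Y = A @ Omega                         # sample the range of A
    Q, _ = np.linalg.qr(Y)                # ONB of range(Y)
    B = Q.T @ A
    return Q, B

# Exact recovery: if rank(A) = r, then A = Q @ B with probability 1.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 3)) @ rng.standard_normal((3, 40))  # rank 3
Q, B = randomized_lowrank(A, 3, rng=1)
print(np.linalg.norm(A - Q @ B) / np.linalg.norm(A))  # ~ machine precision
```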
Test matrices:

(a) The 100 × 100 Hilbert matrix A defined by A(i, j) = 1/(i + j − 1).
(b) The matrix A defined by A(i, j) = exp(−γ|i − j|/n) with γ = 0.1.
(c) The 30 × 30 diagonal matrix with diagonal entries 1, 0.99, 0.98, 1/10, 0.99/10, 0.98/10, 1/100, 0.99/100, 0.98/100, . . .

[Figure: Singular values of test matrices, ranging from 10^0 down to 10^−20.]
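The three test matrices are easy to reproduce; the helper functions below are my own reconstruction of the setup:

```python
import numpy as np

def hilbert(n=100):
    # (a) Hilbert matrix A(i, j) = 1/(i + j - 1), with indices starting at 1
    i, j = np.ogrid[1:n + 1, 1:n + 1]
    return 1.0 / (i + j - 1)

def exp_decay(n=100, gamma=0.1):
    # (b) A(i, j) = exp(-gamma * |i - j| / n)
    i, j = np.ogrid[0:n, 0:n]
    return np.exp(-gamma * np.abs(i - j) / n)

def staircase(n=30):
    # (c) diagonal entries 1, 0.99, 0.98, 1/10, 0.99/10, 0.98/10, 1/100, ...
    d = [np.array([1.0, 0.99, 0.98]) / 10**k for k in range(n // 3)]
    return np.diag(np.concatenate(d))

sv = np.linalg.svd(hilbert(), compute_uv=False)
print(sv[:6])  # rapid singular value decay of the Hilbert matrix
```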
Errors measured in spectral norm:

                                       Exact    mean     std
(a) Hilbert matrix, r = 5              0.0019   0.0092   0.0099
(b) Matrix with slower decay, r = 25   0.0034   0.012    0.002
(c) Matrix with staircase sv, r = 7    0.010    0.038    0.025

Errors measured in Frobenius norm:

                                       Exact    mean     std
(a) Hilbert matrix, r = 5              0.0019   0.0093   0.0099
(b) Matrix with slower decay, r = 25   0.011    0.024    0.001
(c) Matrix with staircase sv, r = 7    0.014    0.041    0.024
Add oversampling: choose a (usually small) integer p and draw Ω with r + p columns.

Randomized algorithm: draw an n × (r + p) Gaussian random matrix Ω, compute Y = AΩ, compute an ONB Q of range(Y), and set B = Q^T A. Then

    Â = QB (in factorized form).

Problem: Â has rank r + p > r.
Solution: Compress B ≈ T_r(B), the best rank-r approximation of B; then Q T_r(B) has rank r.
Error:

    ‖Q T_r(B) − A‖ = ‖Q T_r(B) − QB + QB − A‖ ≤ ‖T_r(B) − B‖ + ‖(I − QQ^T)A‖.
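A minimal sketch of the oversampled algorithm with truncation, assuming the compression of B is done via an SVD (names are mine):

```python
import numpy as np

def randomized_lowrank_trunc(A, r, p=5, rng=None):
    """Randomized approximation with oversampling p, followed by truncation
    of B back to its best rank-r approximation T_r(B)."""
    rng = np.random.default_rng(rng)
    Omega = rng.standard_normal((A.shape[1], r + p))
    Q, _ = np.linalg.qr(A @ Omega)            # ONB with r + p columns
    B = Q.T @ A
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    U, s, Vt = U[:, :r], s[:r], Vt[:r, :]     # compress B to rank r
    return Q @ U, s, Vt                       # A ~= (Q U) diag(s) Vt, rank r

rng = np.random.default_rng(0)
A = rng.standard_normal((60, 6)) @ rng.standard_normal((6, 50))  # rank 6
Qr, s, Vt = randomized_lowrank_trunc(A, r=6, p=4, rng=1)
err = np.linalg.norm(A - (Qr * s) @ Vt) / np.linalg.norm(A)
print(err)  # ~ machine precision, since rank(A) = 6
```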
Randomized algorithm with oversampling and truncation:

    Â = Q T_r(B) (in factorized form).

Gold standard, the best rank-r approximation error:
◮ spectral norm: σ_{r+1}
◮ Frobenius norm: (σ²_{r+1} + · · · + σ²_n)^{1/2}
Errors measured in spectral norm, with oversampling:

(a) Hilbert matrix, r = 5:
         Exact    mean     std
p = 0    0.0019   0.0092   0.0099
p = 1    0.0019   0.0026   0.0019
p = 2    0.0019   0.0019   0.0001

(b) Matrix with slower decay, r = 25:
         Exact    mean     std
p = 0    0.0034   0.012    0.002
p = 1    0.0034   0.011    0.0017
p = 2    0.0034   0.010    0.0015
p = 10   0.0034   0.0064   0.0008
p = 25   0.0034   0.0037   0.0002

(c) Matrix with staircase sv, r = 7:
         Exact    mean     std
p = 0    0.010    0.038    0.025
p = 1    0.010    0.021    0.012
p = 2    0.010    0.012    0.005
Goal: Say something sensible about ‖(I − QQ^T)A‖: expected value, tail bounds, ...

The multivariate normal distribution X ∼ N(µ, Σ) with mean µ ∈ R^n and (positive definite) covariance matrix Σ ∈ R^{n×n} has density

    f_X(x) = (2π)^{−n/2} det(Σ)^{−1/2} exp(−(x − µ)^T Σ^{−1} (x − µ)/2).

Let V ∈ R^{m×n} have orthonormal columns. If X ∼ N(0, I_m), then V^T X ∼ N(0, I_n). This invariance of Gaussian random vectors allows us to assume w.l.o.g. in the analysis that A = Σ is diagonal (and square).
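The invariance is easy to check empirically; the following sketch estimates the covariance of V^T X from samples (a small verification, not part of the lecture):

```python
import numpy as np

# For V with orthonormal columns and X ~ N(0, I_m), the vector V^T X is
# again standard Gaussian, N(0, I_n): its empirical covariance is ~ I_n.
rng = np.random.default_rng(0)
m, n, nsamples = 8, 3, 200_000
V, _ = np.linalg.qr(rng.standard_normal((m, n)))  # orthonormal columns
X = rng.standard_normal((m, nsamples))            # samples of N(0, I_m)
Z = V.T @ X                                       # samples of V^T X
cov = Z @ Z.T / nsamples                          # empirical covariance
print(np.round(cov, 2))                           # close to the identity I_3
```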
Case r = 1, p = 0. Partition

    Ω = [ω1; ω2],   A = [σ1 0; 0 Σ2],

with scalar ω1. Then

    Y = AΩ = [σ1ω1; Σ2ω2].

Problem: We need to control the norm of σ1^{−1}ω1^{−1}Σ2ω2. But each entry of ω1^{−1}ω2 is a ratio of independent standard normal random variables and hence Cauchy distributed, with undefined mean and variance. Need to consider p ≥ 2.
Lemma. Let Π_Ỹ, Π_Y be orthogonal projectors onto subspaces Ỹ ⊂ Y. Then

    ‖Π_Ỹ A‖_F ≤ ‖Π_Y A‖_F,
    ‖(I − Π_Ỹ)A‖_F ≥ ‖(I − Π_Y)A‖_F

for any matrix A.
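The lemma can be sanity-checked numerically on random nested subspaces (a small verification script, not part of the lecture):

```python
import numpy as np

def proj(Y):
    # Orthogonal projector onto range(Y)
    Q, _ = np.linalg.qr(Y)
    return Q @ Q.T

# For nested subspaces range(Yt) in range(Y), projecting onto the smaller
# space keeps less of A (in Frobenius norm) and leaves more of A behind.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 15))
Y = rng.standard_normal((20, 6))
Yt = Y @ rng.standard_normal((6, 3))      # range(Yt) is a subspace of range(Y)
P, Pt = proj(Y), proj(Yt)
I = np.eye(20)
small = np.linalg.norm(Pt @ A)            # default norm is Frobenius
large = np.linalg.norm(P @ A)
print(small, "<=", large)
print(np.linalg.norm((I - Pt) @ A), ">=", np.linalg.norm((I - P) @ A))
```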
Now take general p ≥ 0, so that Ω1 is a 1 × (1 + p) row vector. With

    Y = AΩ = [σ1Ω1; Σ2Ω2],   set   Ỹ = Y Ω1† σ1^{−1} = [1; σ1^{−1}Σ2Ω2Ω1†],

where Ω1† is the pseudoinverse of Ω1. Because Ω1 is a row vector and nonzero (with probability one), we obtain Ω1† = Ω1^T/‖Ω1‖². By construction, range(Ỹ) ⊂ range(Y). Thus, with f = σ1^{−1}Σ2Ω2Ω1†, we get

    ‖(I − QQ^T)A‖_F ≤ ‖(I − Q̃Q̃^T)A‖_F,   Q̃ = (1 + ‖f‖²_2)^{−1/2} [1; f].
Now

    ‖(I − Q̃Q̃^T)A‖²_F = ‖A‖²_F − ‖Q̃^T A‖²_F

and

    ‖Q̃^T A‖²_F = (1 + ‖f‖²_2)^{−1} (σ1² + ‖Σ2 f‖²_2)
               ≥ σ1² / (1 + ‖f‖²_2)
               ≥ σ1² (1 − ‖f‖²_2) = σ1² − ‖Σ2Ω2Ω1†‖²_2.

In summary, we have:

    ‖(I − Q̃Q̃^T)A‖²_F ≤ ‖Σ2‖²_F + ‖Σ2Ω2Ω1†‖²_F.

To analyze the second term, we use

    E‖Σ2Ω2Ω1†‖²_F = E[ E[ ‖Σ2Ω2Ω1†‖²_F | Ω1 ] ] = ‖Σ2‖²_F · E‖Ω1†‖²_F.

(See exercises for a proof that E‖AΩB‖²_F = ‖A‖²_F ‖B‖²_F for a Gaussian matrix Ω and constant matrices A, B.)
It remains to control ‖Ω1‖²_F and ‖Ω1†‖²_F = 1/‖Ω1‖²_F.

‖Ω1‖²_F is a sum of p + 1 squared independent standard normal random variables. The pdf of X ∼ N(0, 1) is given by f_X(x) = (2π)^{−1/2} e^{−x²/2}. The pdf of Y = X² is zero for nonpositive values. For y > 0, we obtain

    Pr(0 ≤ Y ≤ y) = Pr(−√y ≤ X ≤ √y) = 2(2π)^{−1/2} ∫_0^{√y} e^{−x²/2} dx = (2π)^{−1/2} ∫_0^y t^{−1/2} e^{−t/2} dt.

The distribution of Y is called the chi-squared distribution with 1 degree of freedom: Y ∼ χ²_1.

More generally, ‖Ω1‖²_F ∼ χ²_{p+1}, the chi-squared distribution with p + 1 degrees of freedom, with pdf

    f(x) = 2^{−(p+1)/2} / Γ((p + 1)/2) · x^{(p+1)/2−1} exp(−x/2),   x > 0.
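A quick Monte Carlo check of the chi-squared claim (sketch; the parameters are arbitrary):

```python
import numpy as np

# ||Omega1||_F^2 for a standard Gaussian vector of length p + 1 follows
# chi^2_{p+1}, whose mean is p + 1 and whose variance is 2(p + 1).
rng = np.random.default_rng(0)
p, nsamples = 5, 400_000
omega1 = rng.standard_normal((nsamples, p + 1))
y = np.sum(omega1**2, axis=1)             # samples of chi^2_{p+1}
print(y.mean())                           # ~ p + 1 = 6
print(y.var())                            # ~ 2(p + 1) = 12
```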
Similarly,

    ‖Ω1†‖²_F = 1/‖Ω1‖²_F = (Σ_i ω²_{1,i})^{−1} ∼ Inv-χ²(p + 1),

the inverse-chi-squared distribution with p + 1 degrees of freedom, with pdf

    2^{−(p+1)/2} / Γ((p + 1)/2) · x^{−(p+1)/2−1} exp(−1/(2x)),   x > 0.

[Figure: pdf of the inverse-chi-squared distribution for p = 1, p = 3, p = 9.]
Textbook results:
◮ E‖Ω1‖²_F = p + 1,
◮ E‖Ω1†‖²_F = (p − 1)^{−1} (for p ≥ 2).

Tail bound by [Laurent/Massart’2000]:
◮ P(‖Ω1‖²_F ≤ p + 1 − t) ≤ exp(−t²/(4(p + 1))).

For r = 1, p ≥ 2, we have

    E‖(I − QQ^T)A‖_F ≤ (1 + 1/(p − 1))^{1/2} ‖Σ2‖_F.

The probability of deviating from this upper bound decays exponentially, as indicated by the tail bound for χ²_{p+1}.
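Both the expectation of the inverse and the lower tail bound can be checked by simulation (a rough sketch; the parameter choices are mine):

```python
import numpy as np

# Monte Carlo check, r = 1 case:
#   E[1/||Omega1||_F^2] = 1/(p - 1)  and the Laurent/Massart-type lower tail.
rng = np.random.default_rng(0)
p, nsamples = 4, 400_000
y = np.sum(rng.standard_normal((nsamples, p + 1))**2, axis=1)  # chi^2_{p+1}
print(np.mean(1.0 / y))            # ~ 1/(p - 1) = 1/3

t = 3.0
emp = np.mean(y <= p + 1 - t)      # empirical lower-tail probability
bound = np.exp(-t**2 / (4 * (p + 1)))
print(emp, "<=", bound)            # tail bound holds with room to spare
```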
The same argument yields the general rank-r bound:

    ‖(I − QQ^T)A‖²_F ≤ ‖Σ2‖²_F + ‖Σ2Ω2Ω1†‖²_F.

Proof. Partition

    Y = AΩ = [Σ1Ω1; Σ2Ω2],   Ỹ = Y Ω1† Σ1^{−1} = [I_r; F],   F = Σ2Ω2Ω1†Σ1^{−1}.

An ONB of range(Ỹ) is given by Q̃ = [I_r; F](I_r + F^T F)^{−1/2}. Hence,

    ‖(I − QQ^T)A‖²_F ≤ ‖(I − Q̃Q̃^T)A‖²_F = ‖A‖²_F − ‖Q̃^T A‖²_F

and

    ‖Q̃^T A‖²_F = ‖(I_r + F^T F)^{−1/2}Σ1‖²_F + ‖(I_r + F^T F)^{−1/2}F^T Σ2‖²_F
               ≥ ‖(I_r + F^T F)^{−1/2}Σ1‖²_F = trace(Σ1(I_r + F^T F)^{−1}Σ1)
               ≥ trace(Σ1(I_r − F^T F)Σ1) = ‖Σ1‖²_F − ‖Σ2Ω2Ω1†‖²_F.

(Proof of trace(Σ1(I_r + F^T F)^{−1}Σ1) ≥ trace(Σ1(I_r − F^T F)Σ1) on next slide.)
Let G = F^T F. Then we have (I + G)(I − G) = I − G², and hence

    (I + G)^{−1} − (I − G) = (I + G)^{−1}G² = G(I + G)^{−1}G

is symmetric positive semidefinite. This implies that Σ1((I + G)^{−1} − (I − G))Σ1 is symmetric positive semidefinite as well, and hence 0 ≤ trace(Σ1((I + G)^{−1} − (I − G))Σ1), that is,

    trace(Σ1(I_r + F^T F)^{−1}Σ1) ≥ trace(Σ1(I_r − F^T F)Σ1).
Again use E‖Σ2Ω2Ω1†‖²_F = ‖Σ2‖²_F · E‖Ω1†‖²_F. By standard results in multivariate statistics, we have

    E‖Ω1†‖²_F = r / (p − 1).

Sketch of argument:
◮ Ω1Ω1^T ∼ W_r(r + p) (Wishart distribution with r + p degrees of freedom)
◮ (Ω1Ω1^T)^{−1} ∼ W_r^{−1}(r + p) (inverse Wishart distribution with r + p degrees of freedom)
◮ E(Ω1Ω1^T)^{−1} = (p − 1)^{−1} I_r; see page 96 in [Muirhead’1982]
◮ The result follows from ‖Ω1†‖²_F = ‖Ω1^T(Ω1Ω1^T)^{−1}‖²_F = trace((Ω1Ω1^T)^{−1}).

[Muirhead’1982] R. J. Muirhead, Aspects of Multivariate Statistical Theory. Wiley, 1982.
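The Wishart-based formula can likewise be verified by simulation (sketch; parameters are mine):

```python
import numpy as np

# Monte Carlo check: for an r x (r + p) Gaussian matrix Omega1,
#   E ||pinv(Omega1)||_F^2 = E trace((Omega1 Omega1^T)^{-1}) = r/(p - 1).
rng = np.random.default_rng(0)
r, p, nsamples = 3, 5, 50_000
vals = np.empty(nsamples)
for i in range(nsamples):
    Omega1 = rng.standard_normal((r, r + p))
    vals[i] = np.trace(np.linalg.inv(Omega1 @ Omega1.T))
print(vals.mean())  # ~ r/(p - 1) = 0.75
```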
Theorem [Halko/Martinsson/Tropp’2010]. For p ≥ 2, we have

    E‖(I − QQ^T)A‖_F ≤ (1 + r/(p − 1))^{1/2} ‖Σ2‖_F,
    E‖(I − QQ^T)A‖_2 ≤ (1 + (r/(p − 1))^{1/2}) σ_{r+1} + (e (r + p)^{1/2} / p) ‖Σ2‖_F.

For the proof of the spectral norm bound and for tail bounds, see [Halko/Martinsson/Tropp’2010].
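The Frobenius-norm bound can be compared against the observed error on a synthetic diagonal matrix (sketch; the matrix choice is mine):

```python
import numpy as np

# Compare the average of ||(I - QQ^T)A||_F over random trials with the bound
# sqrt(1 + r/(p-1)) * ||Sigma_2||_F, for A = diag(2^0, 2^-1, 2^-2, ...).
rng = np.random.default_rng(0)
n, r, p, trials = 50, 5, 4, 300
sv = 2.0 ** -np.arange(n)                 # exponentially decaying singular values
A = np.diag(sv)
tail = np.sqrt(np.sum(sv[r:]**2))         # ||Sigma_2||_F
errs = []
for _ in range(trials):
    Omega = rng.standard_normal((n, r + p))
    Q, _ = np.linalg.qr(A @ Omega)        # ONB with r + p columns
    errs.append(np.linalg.norm(A - Q @ (Q.T @ A)))
bound = np.sqrt(1 + r / (p - 1)) * tail
print(np.mean(errs), "<=", bound)
```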
Randomized algorithm with power scheme: compute Y = (AA^T)^q AΩ instead of Y = AΩ, take an ONB Q of range(Y), and set

    Â = Q T_r(Q^T A) (in factorized form).

The algorithm is essentially equivalent to the randomized algorithm applied to (AA^T)^q A. Using Hölder’s inequality and a result from the exercises, we have

    E‖(I − QQ^T)A‖_2 ≤ (E‖(I − QQ^T)A‖_2^{2q+1})^{1/(2q+1)} ≤ (E‖(I − QQ^T)(AA^T)^q A‖_2)^{1/(2q+1)}.

The singular values of (AA^T)^q A are σ_1^{2q+1}, . . . , σ_n^{2q+1}.
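A stabilized sketch of the power scheme, with a small experiment illustrating the sharpened singular value decay (my own implementation sketch, not from the lecture; in floating point one orthonormalizes between multiplications rather than forming (AA^T)^q AΩ directly):

```python
import numpy as np

def randomized_power(A, r, p=5, q=2, rng=None):
    """Randomized rangefinder with q steps of the power scheme, with
    re-orthonormalization after every multiplication for stability."""
    rng = np.random.default_rng(rng)
    Q, _ = np.linalg.qr(A @ rng.standard_normal((A.shape[1], r + p)))
    for _ in range(q):                    # stabilized Y = (A A^T)^q A Omega
        Q, _ = np.linalg.qr(A.T @ Q)
        Q, _ = np.linalg.qr(A @ Q)
    return Q

# Since the singular values of (A A^T)^q A are sigma_i^(2q+1), slow decay is
# sharpened. Compare q = 0 and q = 2 on a matrix with sigma_i = 1/i.
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((80, 80)))
A = U @ np.diag(1.0 / np.arange(1.0, 81.0)) @ U.T
errs = {}
for q in (0, 2):
    Q = randomized_power(A, r=10, p=5, q=q, rng=1)
    errs[q] = np.linalg.norm(A - Q @ (Q.T @ A), 2)
print(errs)  # spectral error for q = 2 approaches sigma_16 = 1/16
```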
Combination with the previous result for the randomized algorithm gives: for p ≥ 2, E‖(I − QQ^T)A‖_2 is bounded by

    [ (1 + (r/(p − 1))^{1/2}) σ_{r+1}^{2q+1} + (e (r + p)^{1/2} / p) (σ_{r+1}^{2(2q+1)} + · · · + σ_n^{2(2q+1)})^{1/2} ]^{1/(2q+1)}
    ≤ σ_{r+1} [ 1 + (r/(p − 1))^{1/2} + (e (r + p)^{1/2} / p) (n − r)^{1/2} ]^{1/(2q+1)}.
Lemma. Let ω^(i), i = 1, . . . , s, be n-dimensional Gaussian random vectors. Then, for any m × n matrix C, the inequality

    ‖C‖_2 ≤ 10 (2/π)^{1/2} max_{i=1,...,s} ‖Cω^(i)‖_2

holds with probability at least 1 − 10^{−s}. [Halko/Martinsson/Tropp’2010]; see exercises for the full proof.

Given the ONB Q returned by the randomized algorithm, use this result for C = (I − QQ^T)A to estimate the spectral norm error. This can be combined into an adaptive algorithm for choosing the rank. Assuming that one knows ‖A‖_F, the Frobenius norm error can be estimated from

    ‖A‖²_F = ‖QQ^T A‖²_F + ‖(I − QQ^T)A‖²_F.
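The norm estimator suggested by the lemma can be sketched as follows (the function name is mine); applying it to C = (I − QQ^T)A gives the a posteriori error estimate used in the adaptive algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)

def norm_estimate(C, s=5, rng=rng):
    """Probabilistic upper bound on ||C||_2 from s Gaussian probes:
    holds with probability at least 1 - 10^{-s}."""
    W = rng.standard_normal((C.shape[1], s))
    return 10 * np.sqrt(2 / np.pi) * np.max(np.linalg.norm(C @ W, axis=0))

C = rng.standard_normal((40, 30))
est = norm_estimate(C)
true = np.linalg.norm(C, 2)
print(true, "<=", est)   # fails only with probability 10^{-5}
```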