  1. Stein’s Method for Matrix Concentration
Lester Mackey†
Collaborators: Michael I. Jordan‡, Richard Y. Chen∗, Brendan Farrell∗, and Joel A. Tropp∗
†Stanford University, ‡University of California, Berkeley, ∗California Institute of Technology
December 10, 2012

  2. Motivation: Concentration Inequalities
Matrix concentration:
$$P\{\|X - \mathbb{E}X\| \ge t\} \le \delta \qquad P\{\lambda_{\max}(X - \mathbb{E}X) \ge t\} \le \delta$$
Non-asymptotic control of random matrices with complex distributions.
Applications:
- Matrix completion from sparse random measurements (Gross, 2011; Recht, 2011; Negahban and Wainwright, 2010; Mackey, Talwalkar, and Jordan, 2011)
- Randomized matrix multiplication and factorization (Drineas, Mahoney, and Muthukrishnan, 2008; Hsu, Kakade, and Zhang, 2011b)
- Convex relaxation of robust or chance-constrained optimization (Nemirovski, 2007; So, 2011; Cheung, So, and Wang, 2011)
- Random graph analysis (Christofides and Markström, 2008; Oliveira, 2009)

  3. Motivation: Matrix Completion
Goal: Recover a matrix $L_0 \in \mathbb{R}^{m \times n}$ from a subset of its entries:
$$\begin{pmatrix} ? & ? & 1 & \cdots & 4 \\ 3 & ? & ? & \cdots & ? \\ ? & 5 & ? & \cdots & 5 \end{pmatrix} \;\to\; \begin{pmatrix} 2 & 3 & 1 & \cdots & 4 \\ 3 & 4 & 5 & \cdots & 1 \\ 2 & 5 & 3 & \cdots & 5 \end{pmatrix}$$
Examples:
- Collaborative filtering: How will user i rate movie j?
- Ranking on the web: Is URL j relevant to user i?
- Link prediction: Is user i friends with user j?

  4. Motivation: Matrix Completion
Goal: Recover a matrix $L_0 \in \mathbb{R}^{m \times n}$ from a subset of its entries (as in the example above).
Bad news: It is impossible to recover a generic matrix; there are too many degrees of freedom and too few observations.
Good news: A small number of latent factors determine preferences (movie ratings cluster by genre and director), so $L_0 = AB^\top$ for tall factor matrices $A$ and $B$. These low-rank matrices are easier to complete.

  5. Motivation: Matrix Completion: How to Complete a Low-rank Matrix
Suppose Ω is the set of observed entry locations.
First attempt:
$$\operatorname*{minimize}_A \ \operatorname{rank}(A) \quad \text{subject to} \quad A_{ij} = (L_0)_{ij}, \ (i,j) \in \Omega$$
Problem: NP-hard, hence computationally intractable!
Solution: Solve the convex relaxation
$$\operatorname*{minimize}_A \ \|A\|_* \quad \text{subject to} \quad A_{ij} = (L_0)_{ij}, \ (i,j) \in \Omega$$
where $\|A\|_* = \sum_k \sigma_k(A)$ is the trace/nuclear norm of $A$.
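[Editor's note: to make the relaxation concrete, here is a minimal numerical sketch using numpy and cvxpy. These dependencies, the problem sizes, and the sampling rate are illustrative assumptions, not something the slides prescribe.]

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 15, 15, 2
L0 = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))  # rank-r target
M = (rng.random((m, n)) < 0.6).astype(float)   # indicator of observed entries

A = cp.Variable((m, n))
problem = cp.Problem(
    cp.Minimize(cp.normNuc(A)),        # nuclear norm ||A||_*
    [cp.multiply(M, A) == M * L0],     # agree with L0 on the observed set
)
problem.solve()
print("relative error:", np.linalg.norm(A.value - L0) / np.linalg.norm(L0))
```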

  6. Motivation: Matrix Completion: Can Convex Optimization Recover $L_0$?
Yes, with high probability.
Theorem (Recht, 2011). If $L_0 \in \mathbb{R}^{m \times n}$ has rank $r$ and $s \gtrsim \beta r n \log^2(n)$ entries are observed uniformly at random, then (under some technical conditions) convex optimization recovers $L_0$ exactly with probability at least $1 - n^{-\beta}$.
See also Gross (2011); Mackey, Talwalkar, and Jordan (2011).
Past results (Candès and Recht, 2009; Candès and Tao, 2009) required stronger assumptions and more intensive analysis. The streamlined approach rests on a matrix variant of a classical Bernstein inequality (1946).

  7. Motivation: Matrix Completion: Scalar Bernstein Inequality
Theorem (Bernstein, 1946). Let $(Y_k)_{k \ge 1}$ be independent random variables in $\mathbb{R}$ satisfying $\mathbb{E} Y_k = 0$ and $|Y_k| \le R$ for each index $k$. Define the variance parameter $\sigma^2 := \sum_k \mathbb{E} Y_k^2$. Then, for all $t \ge 0$,
$$P\left\{\left|\sum\nolimits_k Y_k\right| \ge t\right\} \le 2 \cdot \exp\left(\frac{-t^2}{2\sigma^2 + 2Rt/3}\right).$$
- Gaussian decay controlled by the variance when t is small
- Exponential decay controlled by the uniform bound for large t
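[Editor's note: the bound is easy to sanity-check by simulation. The sketch below, with uniform summands and arbitrarily chosen parameters (all assumptions of this note, not of the talk), compares the empirical tail with the Bernstein bound.]

```python
import numpy as np

rng = np.random.default_rng(1)
n, R, trials = 200, 1.0, 20_000
Y = rng.uniform(-R, R, size=(trials, n))   # E Y_k = 0 and |Y_k| <= R
sums = Y.sum(axis=1)
sigma2 = n * R**2 / 3                      # sum_k E Y_k^2 for Uniform[-R, R]

for t in [10.0, 20.0, 30.0]:
    empirical = np.mean(np.abs(sums) >= t)
    bound = 2 * np.exp(-t**2 / (2 * sigma2 + 2 * R * t / 3))
    print(f"t = {t}: empirical {empirical:.4f} <= bound {bound:.4f}")
```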

  8. Motivation: Matrix Completion: Matrix Bernstein Inequality
Theorem (Mackey, Jordan, Chen, Farrell, and Tropp, 2012). Let $(Y_k)_{k \ge 1}$ be independent matrices in $\mathbb{R}^{m \times n}$ satisfying $\mathbb{E} Y_k = 0$ and $\|Y_k\| \le R$ for each index $k$. Define the variance parameter
$$\sigma^2 := \max\left\{\left\|\sum\nolimits_k \mathbb{E}[Y_k Y_k^\top]\right\|,\ \left\|\sum\nolimits_k \mathbb{E}[Y_k^\top Y_k]\right\|\right\}.$$
Then, for all $t \ge 0$,
$$P\left\{\left\|\sum\nolimits_k Y_k\right\| \ge t\right\} \le (m+n) \cdot \exp\left(\frac{-t^2}{3\sigma^2 + 2Rt}\right).$$
See also Tropp (2011); Oliveira (2009); Recht (2011).
Gaussian tail when t is small; exponential tail for large t.
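[Editor's note: a quick simulation illustrates the statement. The random-sign ensemble below is a toy setup assumed for this note, not an example from the talk; the bound is often loose at moderate t but always valid.]

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, k, trials = 5, 8, 100, 10_000
B = rng.standard_normal((k, m, n)) / np.sqrt(m * n)   # fixed rectangular matrices
# Y_k = eps_k B_k with independent random signs: E Y_k = 0, ||Y_k|| <= R
R = max(np.linalg.norm(Bk, 2) for Bk in B)
sigma2 = max(
    np.linalg.norm(sum(Bk @ Bk.T for Bk in B), 2),    # ||sum_k E[Y_k Y_k^T]||
    np.linalg.norm(sum(Bk.T @ Bk for Bk in B), 2),    # ||sum_k E[Y_k^T Y_k]||
)

eps = rng.choice([-1.0, 1.0], size=(trials, k))
norms = np.linalg.norm(np.einsum("tk,kmn->tmn", eps, B), ord=2, axis=(1, 2))
for t in [8.0, 12.0, 16.0]:
    bound = (m + n) * np.exp(-t**2 / (3 * sigma2 + 2 * R * t))
    print(f"t = {t}: empirical {np.mean(norms >= t):.4f} <= bound {bound:.4f}")
```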

  9. Motivation: Matrix Completion: Matrix Bernstein Inequality (continued)
Theorem (Mackey, Jordan, Chen, Farrell, and Tropp, 2012). For all $t \ge 0$,
$$P\left\{\left\|\sum\nolimits_k Y_k\right\| \ge t\right\} \le (m+n) \cdot \exp\left(\frac{-t^2}{3\sigma^2 + 2Rt}\right).$$
Consequences for matrix completion:
- Recht (2011) showed that uniform sampling of entries captures most of the information in incoherent low-rank matrices.
- Negahban and Wainwright (2010) showed that i.i.d. sampling of entries captures most of the information in non-spiky (near) low-rank matrices.
- Foygel and Srebro (2011) characterized the generalization error of convex MC through Rademacher complexity.

  10. Motivation: Matrix Concentration
Matrix concentration: $P\{\lambda_{\max}(X - \mathbb{E}X) \ge t\} \le \delta$
Difficulty: Matrix multiplication is not commutative, so in general $e^{X+Y} \ne e^X e^Y$.
Past approaches (Ahlswede and Winter, 2002; Oliveira, 2009; Tropp, 2011):
- Rely on deep results from matrix analysis
- Apply to sums of independent matrices and matrix martingales
This work: Stein’s method of exchangeable pairs (1972), as advanced by Chatterjee (2007) for scalar concentration, yields
- Improved exponential tail inequalities (Hoeffding, Bernstein)
- Polynomial moment inequalities (Khintchine, Rosenthal)
- Dependent sums and more general matrix functionals
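[Editor's note: the non-commutativity obstacle is easy to see numerically. A two-line check with a pair of non-commuting Hermitian matrices, using scipy's matrix exponential (an assumed dependency):]

```python
import numpy as np
from scipy.linalg import expm

X = np.array([[0.0, 1.0], [1.0, 0.0]])    # Hermitian, and XY != YX
Y = np.array([[1.0, 0.0], [0.0, -1.0]])
print(np.allclose(expm(X + Y), expm(X) @ expm(Y)))   # False
```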

  11. Roadmap
1. Motivation
2. Stein’s Method Background and Notation
3. Exponential Tail Inequalities
4. Polynomial Moment Inequalities
5. Dependent Sequences
6. Extensions

  12. Background: Notation
- Hermitian matrices: $\mathbb{H}^d = \{A \in \mathbb{C}^{d \times d} : A = A^*\}$. All matrices in this talk are Hermitian.
- Maximum eigenvalue: $\lambda_{\max}(\cdot)$
- Trace: $\operatorname{tr} B$, the sum of the diagonal entries of $B$
- Spectral norm: $\|B\|$, the maximum singular value of $B$
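[Editor's note: in numpy terms, an illustrative translation of the notation, not part of the talk:]

```python
import numpy as np

B = np.array([[2.0, 1.0 - 1.0j],
              [1.0 + 1.0j, 3.0]])        # B in H^2: B equals B*
assert np.allclose(B, B.conj().T)
lam_max = np.linalg.eigvalsh(B).max()    # lambda_max(B)
trace = np.trace(B).real                 # tr B, sum of diagonal entries
spectral = np.linalg.norm(B, 2)          # ||B||, largest singular value
print(lam_max, trace, spectral)
```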

  13. Background: Matrix Stein Pair
Definition (Exchangeable Pair). $(Z, Z')$ is an exchangeable pair if $(Z, Z') \stackrel{d}{=} (Z', Z)$.
Definition (Matrix Stein Pair). Let $(Z, Z')$ be an exchangeable pair, and let $\Psi : \mathcal{Z} \to \mathbb{H}^d$ be a measurable function. Define the random matrices
$$X := \Psi(Z) \quad \text{and} \quad X' := \Psi(Z').$$
$(X, X')$ is a matrix Stein pair with scale factor $\alpha \in (0, 1]$ if $\mathbb{E}[X' \mid Z] = (1 - \alpha)X$.
- Matrix Stein pairs are exchangeable pairs
- Matrix Stein pairs always have zero mean
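[Editor's note: a standard construction, not spelled out on this slide, builds a Stein pair for a sum of independent matrices by redrawing one uniformly chosen summand. The sketch below (toy Gaussian summands, an assumption of this note) checks $\mathbb{E}[X' \mid Z] = (1 - 1/n)X$ by Monte Carlo, i.e. scale factor $\alpha = 1/n$.]

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 4, 10

def draw_Y():
    G = rng.standard_normal((d, d))
    return G + G.T                   # mean-zero Hermitian summand

Z = [draw_Y() for _ in range(n)]     # Z = (Y_1, ..., Y_n)
X = sum(Z)                           # X = Psi(Z) = sum_k Y_k

# Z' replaces a uniformly chosen coordinate J with a fresh copy;
# estimate E[X' | Z] by averaging over J and the fresh draw.
samples, acc = 50_000, np.zeros((d, d))
for _ in range(samples):
    J = rng.integers(n)
    acc += X - Z[J] + draw_Y()
print(np.linalg.norm(acc / samples - (1 - 1 / n) * X, 2))   # near 0
```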

  14. Background: The Conditional Variance
Definition (Conditional Variance). Suppose that $(X, X')$ is a matrix Stein pair with scale factor $\alpha$, constructed from the exchangeable pair $(Z, Z')$. The conditional variance is the random matrix
$$\Delta_X := \Delta_X(Z) := \frac{1}{2\alpha}\, \mathbb{E}\left[(X - X')^2 \mid Z\right].$$
$\Delta_X$ is a stochastic estimate for the variance $\mathbb{E} X^2$.
Take-home message: Control over $\Delta_X$ yields control over $\lambda_{\max}(X)$.
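[Editor's note: continuing the coordinate-replacement sketch above (same toy summands, an assumption of this note): since $X - X' = Y_J - Y_J'$ for the redrawn index $J$, the conditional variance reduces to $\Delta_X = \tfrac{1}{2}\sum_k (Y_k^2 + \mathbb{E} Y^2)$, whose expectation over $Z$ is indeed $\mathbb{E} X^2$.]

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 4, 10

def draw_Y():
    G = rng.standard_normal((d, d))
    return G + G.T

Z = [draw_Y() for _ in range(n)]
# For Y = G + G^T, E[Y^2] = (2d + 2) I: each diagonal entry is
# Var(2 G_ii) + (d - 1) Var(G_ij + G_ji) = 4 + 2(d - 1).
EY2 = (2 * d + 2) * np.eye(d)
# Delta_X = (1/(2 alpha)) E[(X - X')^2 | Z] with alpha = 1/n reduces to
# (1/2) sum_k (Y_k^2 + E[Y^2]) because the cross terms vanish.
Delta_X = 0.5 * (sum(Yk @ Yk for Yk in Z) + n * EY2)
print(np.linalg.eigvalsh(Delta_X))   # compare with E X^2 = n(2d + 2) I = 100 I
```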

  15. Exponential Tail Inequalities: Exponential Concentration for Random Matrices
Theorem (Mackey, Jordan, Chen, Farrell, and Tropp, 2012). Let $(X, X')$ be a matrix Stein pair with $X \in \mathbb{H}^d$. Suppose that $\Delta_X \preceq cX + vI$ almost surely for $c, v \ge 0$. Then, for all $t \ge 0$,
$$P\{\lambda_{\max}(X) \ge t\} \le d \cdot \exp\left(\frac{-t^2}{2v + 2ct}\right).$$
- Control over the conditional variance $\Delta_X$ yields a Gaussian tail for $\lambda_{\max}(X)$ for small t and an exponential tail for large t
- When d = 1, improves the scalar result of Chatterjee (2007)
- The dimensional factor d cannot be removed
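[Editor's note: a derivation sketch, not shown on the slide, previews how this master theorem yields the corollary on the next slide. For a Rademacher series $X = \sum_k \varepsilon_k A_k$ (the case $Y_k^2 = A_k^2$), the coordinate-replacement Stein pair has $\alpha = 1/n$ and $X - X' = (\varepsilon_J - \varepsilon_J')A_J$, so

$$\Delta_X = \frac{n}{2}\,\mathbb{E}\left[(X - X')^2 \mid Z\right] = \frac{n}{2} \cdot \frac{1}{n} \sum_k \mathbb{E}\left[(\varepsilon_k - \varepsilon_k')^2\right] A_k^2 = \sum_k A_k^2 \preceq \left\|\sum\nolimits_k A_k^2\right\| I,$$

using $\mathbb{E}[(\varepsilon_k - \varepsilon_k')^2 \mid \varepsilon_k] = 2$. The theorem then applies with $c = 0$ and $v = \sigma^2$, giving the tail bound $d \cdot e^{-t^2/2\sigma^2}$ below.]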

  16. Exponential Tail Inequalities: Matrix Hoeffding Inequality
Corollary (Mackey, Jordan, Chen, Farrell, and Tropp, 2012). Let $X = \sum_k Y_k$ for independent matrices in $\mathbb{H}^d$ satisfying $\mathbb{E} Y_k = 0$ and $Y_k^2 \preceq A_k^2$ for deterministic matrices $(A_k)_{k \ge 1}$. Define the variance parameter $\sigma^2 := \left\|\sum_k A_k^2\right\|$. Then, for all $t \ge 0$,
$$P\left\{\lambda_{\max}\left(\sum\nolimits_k Y_k\right) \ge t\right\} \le d \cdot e^{-t^2/2\sigma^2}.$$
- Improves upon the matrix Hoeffding inequality of Tropp (2011)
- Optimal constant 1/2 in the exponent
- Can replace the variance parameter with $\sigma^2 = \frac{1}{2}\left\|\sum_k \left(A_k^2 + \mathbb{E} Y_k^2\right)\right\|$
- Tighter than the classical Hoeffding inequality (1963) when d = 1
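[Editor's note: a simulation of the Rademacher case sketched before this slide; the matrices $A_k$ and the sizes are toy assumptions of this note. The empirical tail of $\lambda_{\max}$ sits below the Hoeffding bound at every threshold.]

```python
import numpy as np

rng = np.random.default_rng(4)
d, k, trials = 6, 50, 20_000
G = rng.standard_normal((k, d, d))
A = (G + G.transpose(0, 2, 1)) / np.sqrt(2 * d)       # fixed Hermitian A_k
sigma2 = np.linalg.norm(sum(Ak @ Ak for Ak in A), 2)  # ||sum_k A_k^2||

eps = rng.choice([-1.0, 1.0], size=(trials, k))
X = np.einsum("tk,kij->tij", eps, A)                  # X = sum_k eps_k A_k
lam_max = np.linalg.eigvalsh(X)[:, -1]                # lambda_max per trial
for t in [10.0, 15.0, 20.0]:
    bound = d * np.exp(-t**2 / (2 * sigma2))
    print(f"t = {t}: empirical {np.mean(lam_max >= t):.4f} <= bound {bound:.4f}")
```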
