Large Scale Matrix Analysis and Inference (Wouter M. Koolen and Manfred Warmuth, NIPS 2013)

  1. Large Scale Matrix Analysis and Inference. Wouter M. Koolen, Manfred Warmuth; Reza Bosagh Zadeh, Gunnar Carlsson, Michael Mahoney. Dec 9, NIPS 2013.

  2. Introductory musing: What is a matrix? (entries a_{i,j}) 1. A vector of n^2 parameters. 2. A covariance. 3. A generalized probability distribution. 4. ...

  3. 1. A vector of n^2 parameters. When you regularize with the squared Frobenius norm: min_W ||W||_F^2 + Σ_n loss(tr(W X_n)).

  4. 1. A vector of n^2 parameters. When you regularize with the squared Frobenius norm: min_W ||W||_F^2 + Σ_n loss(tr(W X_n)). Equivalent to min_{vec(W)} ||vec(W)||_2^2 + Σ_n loss(vec(W) · vec(X_n)). No structure: n^2 independent variables.
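
A minimal NumPy sketch (with made-up random matrices) of the vectorization above: for a symmetric instance matrix X, the trace inner product tr(W X) equals the dot product of the flattened matrices, so the objective really is a function of n^2 independent parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))
W = A + A.T                      # symmetric parameter matrix
x = rng.standard_normal(n)
X = np.outer(x, x)               # symmetric instance matrix, e.g. a dyad x x^T

lhs = np.trace(W @ X)
rhs = W.flatten() @ X.flatten()  # vec(W) · vec(X)
print(np.isclose(lhs, rhs))      # True: tr(W X) = vec(W) · vec(X) for symmetric X
```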

  5. 2. A covariance. View the symmetric positive definite matrix C as the covariance matrix of some random feature vector c ∈ R^n, i.e. C = E[ (c − E(c)) (c − E(c))^⊤ ]. n features plus their pairwise interactions.
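
A small sketch with synthetic data estimating such a covariance matrix empirically; the random mixing matrix is made up purely to induce pairwise interactions between the features.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 3, 10_000
c = rng.standard_normal((m, n)) @ rng.standard_normal((n, n))  # correlated features

# Empirical estimate of C = E[(c - E(c))(c - E(c))^T]
mean = c.mean(axis=0)
C = (c - mean).T @ (c - mean) / m

print(np.allclose(C, np.cov(c, rowvar=False, bias=True)))  # agrees with NumPy's estimator
print(C)  # n variances on the diagonal, pairwise interactions off the diagonal
```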

  6. Symmetric matrices as ellipses. Ellipse = { Cu : ||u||_2 = 1 }. Dotted lines connect each point u on the unit ball with the point Cu on the ellipse.

  7. Symmetric matrices as ellipses. Eigenvectors form the axes; eigenvalues are their lengths.
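
The ellipse picture is easy to reproduce numerically; a minimal sketch with an illustrative 2x2 symmetric matrix (the entries are made up):

```python
import numpy as np

C = np.array([[2.0, 0.8],
              [0.8, 1.0]])                       # symmetric positive definite

theta = np.linspace(0, 2 * np.pi, 200)
U = np.stack([np.cos(theta), np.sin(theta)])     # unit vectors u on the unit ball
ellipse = C @ U                                  # image points Cu tracing the ellipse

# Eigenvectors give the axis directions, eigenvalues the axis lengths:
# C v = lambda v, so each eigenvector is stretched by its eigenvalue.
evals, evecs = np.linalg.eigh(C)
print(evals)   # semi-axis lengths
print(evecs)   # axis directions (columns)
```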

  8. Dyads uu^⊤, where u is a unit vector. One eigenvalue equals one, all others are zero. A rank-one projection matrix.
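
A short check of the stated dyad properties, with an arbitrary made-up vector:

```python
import numpy as np

u = np.array([3.0, 4.0, 0.0])
u /= np.linalg.norm(u)                       # unit vector
P = np.outer(u, u)                           # dyad uu^T

print(np.round(np.linalg.eigvalsh(P), 10))   # one eigenvalue 1, all others 0
print(np.allclose(P @ P, P))                 # idempotent: a rank-one projection matrix
```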

  9. Directional variance along direction u: V(c^⊤ u) = u^⊤ C u = tr(C uu^⊤) ≥ 0. The outer figure eight is direction u scaled by the variance u^⊤ C u. PCA: find the direction of largest variance.
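
A sketch with synthetic data verifying the identity V(c^⊤ u) = u^⊤ C u = tr(C uu^⊤) and reading off the direction of largest variance from the eigendecomposition (the mixing matrix and the test direction u are made up):

```python
import numpy as np

rng = np.random.default_rng(2)
c = rng.standard_normal((5000, 2)) @ np.array([[2.0, 0.0],
                                               [0.9, 0.5]])
C = np.cov(c, rowvar=False, bias=True)

u = np.array([1.0, 1.0]) / np.sqrt(2.0)
var_along_u = np.var(c @ u)                                  # sample variance of c^T u
print(np.isclose(var_along_u, u @ C @ u))                    # = u^T C u
print(np.isclose(u @ C @ u, np.trace(C @ np.outer(u, u))))   # = tr(C uu^T)

# PCA: the direction of largest variance is the top eigenvector of C
evals, evecs = np.linalg.eigh(C)
print(evals[-1], evecs[:, -1])
```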

  10. 3-dimensional variance plots. tr(C uu^⊤) is a generalized probability when tr(C) = 1.

  11. 3. Generalized probability distributions. Probability vector: ω = (.2, .1, .6, .1)^⊤ = Σ_i ω_i e_i (mixture coefficients times pure events). Density matrix: W = Σ_i ω_i w_i w_i^⊤ (mixture coefficients times pure density matrices).

  12. 3. Generalized probability distributions. Probability vector: ω = (.2, .1, .6, .1)^⊤ = Σ_i ω_i e_i (mixture coefficients times pure events). Density matrix: W = Σ_i ω_i w_i w_i^⊤ (mixture coefficients times pure density matrices). Matrices as generalized distributions.

  13. 3. Generalized probability distributions. Probability vector: ω = (.2, .1, .6, .1)^⊤ = Σ_i ω_i e_i (mixture coefficients times pure events). Density matrix: W = Σ_i ω_i w_i w_i^⊤ (mixture coefficients times pure density matrices). Matrices as generalized distributions. Many mixtures lead to the same density matrix. There always exists a decomposition into n eigendyads. Density matrix: a symmetric positive matrix of trace one.
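
A minimal sketch constructing a density matrix as a mixture of dyads (the weights and unit vectors are made up) and checking the properties claimed above: trace one, positive semidefinite, and decomposable into n eigendyads.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4
omega = np.array([0.2, 0.1, 0.6, 0.1])          # mixture coefficients, summing to 1
V = rng.standard_normal((n, n))
V /= np.linalg.norm(V, axis=0)                  # columns are unit vectors w_i

# Density matrix W = sum_i omega_i w_i w_i^T
W = sum(o * np.outer(w, w) for o, w in zip(omega, V.T))

print(np.isclose(np.trace(W), 1.0))             # trace one
print(np.all(np.linalg.eigvalsh(W) >= -1e-12))  # positive semidefinite

# A second decomposition of the same W: n eigendyads weighted by the eigenvalues
evals, evecs = np.linalg.eigh(W)
W_eig = sum(l * np.outer(v, v) for l, v in zip(evals, evecs.T))
print(np.allclose(W, W_eig))
```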

  14. It’s like a probability! The total variance along an orthogonal set of directions is 1: u_1^⊤ W u_1 + u_2^⊤ W u_2 = 1; a + b + c = 1.

  15. Uniform density (1/n) I: all dyads have generalized probability 1/n, since tr((1/n) I uu^⊤) = (1/n) tr(uu^⊤) = 1/n. The generalized probabilities of n orthogonal dyads sum to 1.
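
A quick numerical check of the uniform density matrix (1/n) I, with a made-up vector and a random orthonormal basis:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 4
U = np.eye(n) / n                                # uniform density matrix (1/n) I

u = rng.standard_normal(n)
u /= np.linalg.norm(u)
print(np.isclose(np.trace(U @ np.outer(u, u)), 1 / n))    # every dyad gets probability 1/n

# The probabilities of n orthogonal dyads (columns of an orthonormal basis) sum to 1
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
print(np.isclose(sum(np.trace(U @ np.outer(q, q)) for q in Q.T), 1.0))
```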

  16. Conventional Bayes Rule: P(M_i | y) = P(M_i) P(y | M_i) / P(y). 4 updates with the same data likelihood. The update maintains uncertainty information about the maximum likelihood: a soft max.

  17. Conventional Bayes Rule: P(M_i | y) = P(M_i) P(y | M_i) / P(y). 4 updates with the same data likelihood. The update maintains uncertainty information about the maximum likelihood: a soft max.

  18. Conventional Bayes Rule: P(M_i | y) = P(M_i) P(y | M_i) / P(y). 4 updates with the same data likelihood. The update maintains uncertainty information about the maximum likelihood: a soft max.

  19. Conventional Bayes Rule: P(M_i | y) = P(M_i) P(y | M_i) / P(y). 4 updates with the same data likelihood. The update maintains uncertainty information about the maximum likelihood: a soft max.
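
A minimal sketch of the conventional Bayes update, repeated with the same (made-up) data likelihood over four models: the posterior concentrates on the maximum-likelihood model while the intermediate posteriors still quantify uncertainty, i.e. a soft max.

```python
import numpy as np

prior = np.full(4, 0.25)                     # uniform prior over models M_i
likelihood = np.array([0.9, 0.5, 0.2, 0.1])  # hypothetical P(y | M_i), reused each round

posterior = prior.copy()
for t in range(4):                           # 4 updates with the same data likelihood
    posterior = posterior * likelihood       # P(M_i) * P(y | M_i)
    posterior /= posterior.sum()             # divide by P(y)
    print(t + 1, np.round(posterior, 3))     # mass drifts toward the max-likelihood model
```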

  20. Bayes Rule for density matrices: D(M | y) = exp( log D(M) + log D(y | M) ) / tr( exp( log D(M) + log D(y | M) ) ). 1 update with the data likelihood matrix D(y | M). The update maintains uncertainty information about the maximum eigenvalue: a soft max eigenvalue calculation.

  21. Bayes Rule for density matrices: D(M | y) = exp( log D(M) + log D(y | M) ) / tr( exp( log D(M) + log D(y | M) ) ). 2 updates with the same data likelihood matrix D(y | M). The update maintains uncertainty information about the maximum eigenvalue: a soft max eigenvalue calculation.

  22. Bayes Rule for density matrices: D(M | y) = exp( log D(M) + log D(y | M) ) / tr( exp( log D(M) + log D(y | M) ) ). 3 updates with the same data likelihood matrix D(y | M). The update maintains uncertainty information about the maximum eigenvalue: a soft max eigenvalue calculation.

  23. Bayes Rule for density matrices: D(M | y) = exp( log D(M) + log D(y | M) ) / tr( exp( log D(M) + log D(y | M) ) ). 4 updates with the same data likelihood matrix D(y | M). The update maintains uncertainty information about the maximum eigenvalue: a soft max eigenvalue calculation.

  24. Bayes Rule for density matrices: D(M | y) = exp( log D(M) + log D(y | M) ) / tr( exp( log D(M) + log D(y | M) ) ). 10 updates with the same data likelihood matrix D(y | M). The update maintains uncertainty information about the maximum eigenvalue: a soft max eigenvalue calculation.

  25. Bayes Rule for density matrices: D(M | y) = exp( log D(M) + log D(y | M) ) / tr( exp( log D(M) + log D(y | M) ) ). 20 updates with the same data likelihood matrix D(y | M). The update maintains uncertainty information about the maximum eigenvalue: a soft max eigenvalue calculation.
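
A sketch of the density-matrix Bayes update above, assuming a symmetric positive definite data-likelihood matrix (made up here); the matrix log and exp are computed via eigendecomposition. Repeating the update with the same likelihood matrix concentrates the posterior's eigenvalues on the likelihood's top eigenvector, a soft maximum-eigenvalue calculation.

```python
import numpy as np

def sym_logm(A):
    """Matrix logarithm of a symmetric positive definite matrix via eigendecomposition."""
    w, V = np.linalg.eigh(A)
    return V @ np.diag(np.log(w)) @ V.T

def sym_expm(A):
    """Matrix exponential of a symmetric matrix via eigendecomposition."""
    w, V = np.linalg.eigh(A)
    return V @ np.diag(np.exp(w)) @ V.T

def matrix_bayes(D_prior, D_lik):
    """One update: exp(log D(M) + log D(y|M)), renormalized by its trace."""
    A = sym_expm(sym_logm(D_prior) + sym_logm(D_lik))
    return A / np.trace(A)

rng = np.random.default_rng(5)
n = 3
D = np.eye(n) / n                              # start from the uniform density matrix
B = rng.standard_normal((n, n))
D_lik = B @ B.T + 0.1 * np.eye(n)              # hypothetical SPD likelihood matrix D(y|M)

for _ in range(10):                            # 10 updates with the same likelihood matrix
    D = matrix_bayes(D, D_lik)

print(np.round(np.linalg.eigvalsh(D), 4))      # eigenvalues concentrate on the top one
```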

  26. Bayes' rules, vector vs. matrix. Vector: P(M_i | y) = P(M_i) · P(y | M_i) / Σ_j P(M_j) · P(y | M_j). Matrix: D(M | y) = D(M) ⊙ D(y | M) / tr( D(M) ⊙ D(y | M) ), where A ⊙ B := exp( log A + log B ).

  27. Bayes' rules, vector vs. matrix. Vector: P(M_i | y) = P(M_i) · P(y | M_i) / Σ_j P(M_j) · P(y | M_j). Matrix: D(M | y) = D(M) ⊙ D(y | M) / tr( D(M) ⊙ D(y | M) ), where A ⊙ B := exp( log A + log B ). Regularizer: entropy (vector) vs. quantum entropy (matrix).

  28. The vector case as a special case of the matrix case. Vectors become diagonal matrices. All matrices share the same eigensystem. The fancy ⊙ becomes the ordinary ·. Often the vector case is the hardest part: its bounds "lift" to the matrix case.

  29. The vector case as a special case of the matrix case. Vectors become diagonal matrices. All matrices share the same eigensystem. The fancy ⊙ becomes the ordinary ·. Often the vector case is the hardest part: its bounds "lift" to the matrix case. This phenomenon has been dubbed the "free matrix lunch". Size of matrix = size of vector = n.
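
A sketch of the "vectors as diagonal matrices" point, with made-up values: for diagonal (hence commuting) matrices the ⊙ product collapses to the ordinary componentwise product, while for generic non-commuting matrices it differs from the matrix product.

```python
import numpy as np

def odot(A, B):
    """A ⊙ B := exp(log A + log B) for symmetric positive definite A and B."""
    def logm(M):
        w, V = np.linalg.eigh(M)
        return V @ np.diag(np.log(w)) @ V.T
    w, V = np.linalg.eigh(logm(A) + logm(B))
    return V @ np.diag(np.exp(w)) @ V.T

a = np.array([0.2, 0.1, 0.6, 0.1])
b = np.array([0.9, 0.5, 0.2, 0.1])
print(np.allclose(odot(np.diag(a), np.diag(b)), np.diag(a * b)))  # fancy ⊙ becomes plain ·

rng = np.random.default_rng(6)
X, Y = rng.standard_normal((2, 4, 4))
A2, B2 = X @ X.T + 0.1 * np.eye(4), Y @ Y.T + 0.1 * np.eye(4)
print(np.allclose(odot(A2, B2), A2 @ B2))                 # False for non-commuting matrices
```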

  30. PCA setup. Data vectors give C = Σ_n x_n x_n^⊤. max_{unit u} u^⊤ C u (not convex in u) = max_{dyads uu^⊤} tr(C uu^⊤) (linear in uu^⊤). The corresponding vector problem: max_{e_i} c^⊤ e_i (linear in e_i). The vector problem is the matrix problem when everything happens in the same eigensystem. Uncertainty over units: a probability vector. Uncertainty over dyads: a density matrix. Uncertainty over k-sets of units: a capped probability vector. Uncertainty over rank-k projection matrices: a capped density matrix.
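
A minimal sketch of the PCA setup with synthetic data: the matrix problem max_u u^⊤ C u is solved by the top eigenvector, and restricted to C's own eigensystem it becomes the vector problem of picking the largest eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.standard_normal((200, 5))
C = X.T @ X                                    # C = sum_n x_n x_n^T

# Matrix problem: max over unit u of u^T C u = max over dyads uu^T of tr(C uu^T),
# attained by the top eigenvector of C.
evals, evecs = np.linalg.eigh(C)
u_star = evecs[:, -1]
print(np.isclose(u_star @ C @ u_star, evals[-1]))

# Restricted to C's eigensystem, the objective tr(C v_i v_i^T) is just the i-th
# eigenvalue, so the matrix problem is the vector problem max_i c^T e_i with c = evals.
objectives = np.array([np.trace(C @ np.outer(v, v)) for v in evecs.T])
print(np.allclose(objectives, evals), objectives.max())
```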

  31. For PCA: solve the vector problem first; do all the bounds; lift to the matrix case, essentially replacing · by ⊙; the regret bounds stay the same. The Free Matrix Lunch.

  32. Questions. When can you "lift" the vector case to the matrix case? When is there a free matrix lunch? Lifting matrices to tensors? Efficient algorithms for large matrices? Approximations of ⊙? Avoiding the eigenvalue decomposition by sampling? ...
