  1. Learning Mixtures of Spherical Gaussians: Moment Methods and Spectral Decompositions
     Daniel Hsu and Sham M. Kakade, Microsoft Research, New England.
     Also based on work with Anima Anandkumar (UCI), Rong Ge (Princeton), and Matus Telgarsky (UCSD).

  2. Unsupervised machine learning
     ◮ Many applications in machine learning and statistics: lots of high-dimensional data, but mostly unlabeled.
     ◮ Unsupervised learning: discover interesting structure of the population from unlabeled data.
     ◮ This talk: learn about sub-populations in a data source.

  3. Learning mixtures of Gaussians
     Mixture of Gaussians: ∑_{i=1}^k w_i N(μ_i, Σ_i), i.e., k sub-populations, each modeled as a multivariate Gaussian N(μ_i, Σ_i) together with a mixing weight w_i.
     Goal: an efficient algorithm that approximately recovers the parameters from samples.
     (Alternative goal: density estimation. Not in this talk.)

  4. Learning setup
     ◮ Input: an i.i.d. sample S ⊂ R^d from an unknown mixture of Gaussians with parameters θ⋆ := {(μ_i⋆, Σ_i⋆, w_i⋆) : i ∈ [k]}.
     ◮ Each data point is drawn from one of the k Gaussians (N(μ_i⋆, Σ_i⋆) is chosen with probability w_i⋆).
     ◮ But the "labels" are not observed.
     ◮ Goal: estimate parameters θ = {(μ_i, Σ_i, w_i) : i ∈ [k]} such that θ ≈ θ⋆.
     ◮ In practice: local search for the maximum-likelihood parameters (the E-M algorithm).
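As a concrete picture of this setup, here is a minimal sketch of the generative process, assuming spherical covariances Σ_i = σ_i² I (the case the talk focuses on); the function and variable names are illustrative, not from the talk.

```python
import numpy as np

def sample_spherical_mixture(n, weights, means, sigmas, seed=0):
    """Draw n points from sum_i w_i N(mu_i, sigma_i^2 I); the labels are discarded."""
    rng = np.random.default_rng(seed)
    k, d = means.shape
    labels = rng.choice(k, size=n, p=weights)            # hidden component of each point
    noise = rng.standard_normal((n, d))                  # isotropic Gaussian noise
    X = means[labels] + sigmas[labels, None] * noise     # x = mu_z + sigma_z * eps
    return X                                             # the learner sees only X, never labels

# Toy instance: k = 3 well-separated spherical Gaussians in R^5.
means = np.array([[5., 0, 0, 0, 0], [0, 5., 0, 0, 0], [0, 0, 5., 0, 0]])
X = sample_spherical_mixture(10000, np.array([0.5, 0.3, 0.2]), means, np.array([1.0, 2.0, 0.5]))
```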

  5. When are there efficient algorithms?
     Well-separated mixtures: estimation is easier if there is a large minimum separation between the component means (Dasgupta, '99):
         sep := min_{i≠j} ‖μ_i − μ_j‖ / max{σ_i, σ_j}.
     ◮ sep = Ω(d^c) or sep = Ω(k^c): simple clustering methods, perhaps after dimension reduction (Dasgupta, '99; Vempala-Wang, '02; and many more).
     Recent developments:
     ◮ No minimum separation requirement, but current methods require exp(Ω(k)) running time / sample size (Kalai-Moitra-Valiant, '10; Belkin-Sinha, '10; Moitra-Valiant, '10).
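To make the separation quantity concrete, a short sketch that computes sep from given component means and spherical standard deviations (assuming numpy arrays; names are illustrative):

```python
import numpy as np
from itertools import combinations

def separation(means, sigmas):
    """sep = min over pairs i != j of ||mu_i - mu_j|| / max(sigma_i, sigma_j)."""
    return min(
        np.linalg.norm(means[i] - means[j]) / max(sigmas[i], sigmas[j])
        for i, j in combinations(range(len(means)), 2)
    )
```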

  6. Overcoming barriers to efficient estimation
     Information-theoretic barrier: Gaussian mixtures in R^1 can require exp(Ω(k)) samples to estimate the parameters, even when the components are well-separated (Moitra-Valiant, '10).
     These hard instances are degenerate in high dimensions!
     Our result: efficient algorithms for non-degenerate models in high dimensions (d ≥ k) with spherical covariances.

  7. Main result
     Theorem (H-Kakade, '13). Assume {μ_1⋆, μ_2⋆, ..., μ_k⋆} are linearly independent, w_i⋆ > 0 for all i ∈ [k], and Σ_i⋆ = σ_i⋆² I for all i ∈ [k]. There is an algorithm that, given independent draws from a mixture of k spherical Gaussians, returns ε-accurate parameters (up to permutation, under the ℓ_2 metric) w.h.p. The running time and sample complexity are poly(d, k, 1/ε, 1/w_min, 1/λ_min), where λ_min is the k-th largest singular value of [μ_1⋆ | μ_2⋆ | ⋯ | μ_k⋆].
     (Also using new techniques from Anandkumar-Ge-H-Kakade-Telgarsky, '12.)
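The two mixture-dependent quantities in the bound, w_min and λ_min, can be computed directly from the true parameters; a quick sketch assuming numpy (names illustrative). Note that λ_min > 0 is exactly the linear-independence assumption on the means.

```python
import numpy as np

def conditioning_quantities(means, weights):
    """w_min = smallest mixing weight; lambda_min = k-th largest singular value
    of the d x k matrix whose columns are the component means."""
    M = np.asarray(means, dtype=float).T                    # columns are mu_1, ..., mu_k
    k = M.shape[1]
    singular_values = np.linalg.svd(M, compute_uv=False)    # returned in decreasing order
    return float(np.min(weights)), float(singular_values[k - 1])
```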

  8. Learning algorithm (outline)
     Introduction
     Learning algorithm
       Method-of-moments
       Choice of moments
       Solving the moment equations
     Concluding remarks

  9. Method-of-moments
     Let S ⊂ R^d be an i.i.d. sample from an unknown mixture of spherical Gaussians:
         x ∼ ∑_{i=1}^k w_i⋆ N(μ_i⋆, σ_i⋆² I).
     Estimation via the method-of-moments (Pearson, 1894): find parameters θ such that
         E_θ[p(x)] ≈ Ê_{x∈S}[p(x)]
     for some functions p : R^d → R (typically multivariate polynomials).
     Q1: Which moments to use?
     Q2: How to (approximately) solve the moment equations?
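For the lowest-order choices of p, the model side of the moment equation has a simple closed form: E[x] = ∑_i w_i μ_i and E[x xᵀ] = ∑_i w_i (μ_i μ_iᵀ + σ_i² I) for a spherical mixture. A small sketch comparing model moments with plug-in estimates (assuming numpy; names illustrative):

```python
import numpy as np

def model_moments(weights, means, sigmas):
    """First two moments of sum_i w_i N(mu_i, sigma_i^2 I):
       E[x] = sum_i w_i mu_i,   E[x x^T] = sum_i w_i (mu_i mu_i^T + sigma_i^2 I)."""
    d = means.shape[1]
    m1 = weights @ means
    m2 = sum(w * (np.outer(mu, mu) + s**2 * np.eye(d))
             for w, mu, s in zip(weights, means, sigmas))
    return m1, m2

def empirical_moments(X):
    """Plug-in estimates E_hat[x] and E_hat[x x^T] from the sample S (rows of X)."""
    return X.mean(axis=0), (X.T @ X) / len(X)
```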

 10. Which moments to use?

     moment order    reliable estimates?    unique solution?
     1st, 2nd                ✓                      ✗
     Ω(k)-th                                        ✓

     ◮ 1st- and 2nd-order moments (e.g., mean, covariance) [Achlioptas-McSherry, '05; Vempala-Wang, '02; Chaudhuri-Rao, '08]:
       ◮ Fairly easy to get reliable estimates: Ê_{x∈S}[x ⊗ x] ≈ E_{θ⋆}[x ⊗ x].
       ◮ But there can be multiple solutions to the moment equations: E_{θ_1}[x ⊗ x] ≈ E_{θ_2}[x ⊗ x] ≈ Ê_{x∈S}[x ⊗ x] with θ_1 ≠ θ_2.
     ◮ Ω(k)-th-order moments (e.g., E_θ[degree-k poly(x)]) [Prony, 1795; Lindsay, '89; Belkin-Sinha, '10; Moitra-Valiant, '10]:
       ◮ Uniquely pin down the solution.
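To see the non-uniqueness of low-order moments concretely, here is a standard one-dimensional illustration (not from the slides): a single standard Gaussian and a symmetric two-component mixture with μ² + σ² = 1 share the same first and second moments, so order-2 moment equations cannot tell them apart.

```python
import numpy as np

# theta_1: a single standard Gaussian N(0, 1).
# theta_2: the mixture 0.5 N(-mu, s^2) + 0.5 N(+mu, s^2) with mu^2 + s^2 = 1.
mu = 0.6
s = np.sqrt(1 - mu**2)

mean_1, second_1 = 0.0, 1.0                                # E[x], E[x^2] under theta_1
mean_2 = 0.5 * (-mu) + 0.5 * mu                            # = 0
second_2 = 0.5 * (mu**2 + s**2) + 0.5 * (mu**2 + s**2)     # = mu^2 + s^2 = 1

print(np.isclose(mean_1, mean_2), np.isclose(second_1, second_2))   # True True
```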
