Efficient algorithms for estimating multi-view mixture models


  1. Efficient algorithms for estimating multi-view mixture models. Daniel Hsu, Microsoft Research, New England.

  2. Outline: Multi-view mixture models; Multi-view method-of-moments; Some applications and open questions; Concluding remarks.

  3. Part 1. Multi-view mixture models. In this part: Unsupervised learning and mixture models; Multi-view mixture models; Complexity barriers. (Remaining parts: Multi-view method-of-moments; Some applications and open questions; Concluding remarks.)

  4. Unsupervised learning. Many modern applications of machine learning: high-dimensional data from many diverse sources, but mostly unlabeled.

  5. Unsupervised learning.
     ◮ Many modern applications of machine learning: high-dimensional data from many diverse sources, but mostly unlabeled.
     ◮ Unsupervised learning: extract useful information from this data.
       ◮ Disentangle sub-populations in the data source.
       ◮ Discover useful representations for downstream stages of the learning pipeline (e.g., supervised learning).

  6. Mixture models. Simple latent variable model: the mixture model. h ∈ [k] := {1, 2, …, k} (hidden); x ∈ ℝ^d (observed). Pr[h = j] = w_j and x | h ∼ P_h, so x has the mixture distribution P(x) = w_1 P_1(x) + w_2 P_2(x) + ⋯ + w_k P_k(x).

  7. Mixture models. Simple latent variable model: the mixture model. h ∈ [k] := {1, 2, …, k} (hidden); x ∈ ℝ^d (observed). Pr[h = j] = w_j and x | h ∼ P_h, so x has the mixture distribution P(x) = w_1 P_1(x) + w_2 P_2(x) + ⋯ + w_k P_k(x). Typical use: learn about the constituent sub-populations (e.g., clusters) in the data source.
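
For concreteness, here is a minimal sampling sketch of this generative process (my own illustration, not from the slides). The component distributions P_j are taken to be spherical Gaussians purely as an assumption, and all names are hypothetical:

```python
import numpy as np

def sample_mixture(n, w, means, sigma=1.0, seed=None):
    """Draw n samples: h ~ w, then x | h ~ N(mu_h, sigma^2 I).

    Spherical Gaussian components are an illustrative assumption; the slide
    allows arbitrary component distributions P_1, ..., P_k.
    """
    rng = np.random.default_rng(seed)
    k, d = means.shape
    h = rng.choice(k, size=n, p=w)                       # hidden labels, Pr[h = j] = w_j
    x = means[h] + sigma * rng.standard_normal((n, d))   # observed points
    return h, x

# Example: k = 3 components in R^2 with weights (0.5, 0.3, 0.2).
w = np.array([0.5, 0.3, 0.2])
means = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
h, x = sample_mixture(1000, w, means, seed=0)
```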

  8. Multi-view mixture models. Can we take advantage of diverse sources of information?

  9. Multi-view mixture models. Can we take advantage of diverse sources of information? h ∈ [k]; x_1 ∈ ℝ^{d_1}, x_2 ∈ ℝ^{d_2}, …, x_ℓ ∈ ℝ^{d_ℓ}. k = # components, ℓ = # views (e.g., audio, video, text). [Figure: graphical model with h pointing to x_1, x_2, …, x_ℓ; example panels View 1: x_1 ∈ ℝ^{d_1}, View 2: x_2 ∈ ℝ^{d_2}, View 3: x_3 ∈ ℝ^{d_3}.]

  10. Multi-view mixture models. Can we take advantage of diverse sources of information? h ∈ [k]; x_1 ∈ ℝ^{d_1}, x_2 ∈ ℝ^{d_2}, …, x_ℓ ∈ ℝ^{d_ℓ}. k = # components, ℓ = # views (e.g., audio, video, text). [Figure: graphical model with h pointing to x_1, x_2, …, x_ℓ; example panels View 1: x_1 ∈ ℝ^{d_1}, View 2: x_2 ∈ ℝ^{d_2}, View 3: x_3 ∈ ℝ^{d_3}.]

  11. Multi-view mixture models. Multi-view assumption: the views are conditionally independent given the component. View 1: x_1 ∈ ℝ^{d_1}; View 2: x_2 ∈ ℝ^{d_2}; View 3: x_3 ∈ ℝ^{d_3}. Larger k (# components): more sub-populations to disentangle. Larger ℓ (# views): more non-redundant sources of information.
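
A small sketch of the multi-view assumption (my own illustration; Gaussian view noise is assumed only for concreteness): one hidden label h is drawn per sample, and each view is then drawn independently given h.

```python
import numpy as np

def sample_multiview(n, w, view_means, sigma=1.0, seed=None):
    """Draw n samples of (x_1, ..., x_ell) from a multi-view mixture.

    One hidden h per sample; each view is then drawn independently given h,
    which is exactly the conditional-independence (multi-view) assumption.
    view_means[v] is a (k x d_v) array whose rows are mu_{v,1}, ..., mu_{v,k};
    Gaussian view noise is an illustrative choice only.
    """
    rng = np.random.default_rng(seed)
    h = rng.choice(len(w), size=n, p=w)
    views = [M[h] + sigma * rng.standard_normal((n, M.shape[1]))
             for M in view_means]
    return h, views
```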

  12. Semi-parametric estimation task. "Parameters" of the component distributions: mixing weights w_j := Pr[h = j], j ∈ [k]; conditional means µ_{v,j} := E[x_v | h = j] ∈ ℝ^{d_v}, j ∈ [k], v ∈ [ℓ]. Goal: estimate the mixing weights and conditional means from independent copies of (x_1, x_2, …, x_ℓ).

  13. Semi-parametric estimation task. "Parameters" of the component distributions: mixing weights w_j := Pr[h = j], j ∈ [k]; conditional means µ_{v,j} := E[x_v | h = j] ∈ ℝ^{d_v}, j ∈ [k], v ∈ [ℓ]. Goal: estimate the mixing weights and conditional means from independent copies of (x_1, x_2, …, x_ℓ).
      Questions: 1. How do we estimate {w_j} and {µ_{v,j}} without observing h? 2. How many views ℓ are sufficient to learn with poly(k) computational / sample complexity?
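
One reason estimation without observing h is even conceivable (the structure-of-moments step outlined in Part 2): observable cross-view moments factor in terms of the parameters. For example, conditional independence of the views given h gives E[x_1 x_2^T] = Σ_j w_j µ_{1,j} µ_{2,j}^T. Below is my own quick numerical check of this identity; all names, sizes, and the Gaussian noise are illustrative assumptions.

```python
import numpy as np

# Numerical check (not from the slides) of the cross-view moment identity
#   E[x_1 x_2^T] = sum_j w_j mu_{1,j} mu_{2,j}^T,
# which follows from conditional independence of the views given h.
rng = np.random.default_rng(0)
k, d1, d2, n = 3, 4, 5, 200_000
w = np.array([0.5, 0.3, 0.2])
M1 = rng.standard_normal((k, d1))      # rows are mu_{1,j}
M2 = rng.standard_normal((k, d2))      # rows are mu_{2,j}

h = rng.choice(k, size=n, p=w)
x1 = M1[h] + 0.1 * rng.standard_normal((n, d1))   # Gaussian noise: illustrative only
x2 = M2[h] + 0.1 * rng.standard_normal((n, d2))

empirical = x1.T @ x2 / n          # empirical cross-view second moment
exact = (M1.T * w) @ M2            # sum_j w_j mu_{1,j} mu_{2,j}^T
print(np.abs(empirical - exact).max())   # small, and shrinking as n grows
```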

  14. Some barriers to efficient estimation. Challenge: many difficult parametric estimation tasks reduce to this estimation problem.

  15. Some barriers to efficient estimation. Challenge: many difficult parametric estimation tasks reduce to this estimation problem.
      Cryptographic barrier: discrete HMM parameter estimation is as hard as learning parity functions with noise (Mossel-Roch, '06).

  16. Some barriers to efficient estimation. Challenge: many difficult parametric estimation tasks reduce to this estimation problem.
      Cryptographic barrier: discrete HMM parameter estimation is as hard as learning parity functions with noise (Mossel-Roch, '06).
      Statistical barrier: Gaussian mixtures in ℝ^1 can require exp(Ω(k)) samples to estimate the parameters, even if the components are well-separated (Moitra-Valiant, '10).

  17. Some barriers to efficient estimation. Challenge: many difficult parametric estimation tasks reduce to this estimation problem.
      Cryptographic barrier: discrete HMM parameter estimation is as hard as learning parity functions with noise (Mossel-Roch, '06).
      Statistical barrier: Gaussian mixtures in ℝ^1 can require exp(Ω(k)) samples to estimate the parameters, even if the components are well-separated (Moitra-Valiant, '10).
      In practice: resort to local search (e.g., EM), often subject to slow convergence and inaccurate local optima.

  18. Making progress: Gaussian mixture model. The problem becomes easier if one assumes a large minimum separation between the component means (Dasgupta, '99): sep := min_{i ≠ j} ‖µ_i − µ_j‖ / max{σ_i, σ_j}.

  19. Making progress: Gaussian mixture model. The problem becomes easier if one assumes a large minimum separation between the component means (Dasgupta, '99): sep := min_{i ≠ j} ‖µ_i − µ_j‖ / max{σ_i, σ_j}.
      ◮ sep = Ω(d^c): interpoint distance-based methods / EM (Dasgupta, '99; Dasgupta-Schulman, '00; Arora-Kannan, '00).
      ◮ sep = Ω(k^c): first use PCA to project down to k dimensions (Vempala-Wang, '02; Kannan-Salmasian-Vempala, '05; Achlioptas-McSherry, '05); also works for mixtures of log-concave distributions.

  20. Making progress: Gaussian mixture model. The problem becomes easier if one assumes a large minimum separation between the component means (Dasgupta, '99): sep := min_{i ≠ j} ‖µ_i − µ_j‖ / max{σ_i, σ_j}.
      ◮ sep = Ω(d^c): interpoint distance-based methods / EM (Dasgupta, '99; Dasgupta-Schulman, '00; Arora-Kannan, '00).
      ◮ sep = Ω(k^c): first use PCA to project down to k dimensions (Vempala-Wang, '02; Kannan-Salmasian-Vempala, '05; Achlioptas-McSherry, '05); also works for mixtures of log-concave distributions (sketched below).
      ◮ No minimum separation requirement: method-of-moments, but exp(Ω(k)) running time / sample size (Kalai-Moitra-Valiant, '10; Belkin-Sinha, '10; Moitra-Valiant, '10).
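
As an illustration of the quantities appearing in this line of work (my own sketch; the function names are hypothetical), the separation of a Gaussian mixture and the project-to-k-dimensions PCA preprocessing step can be written as:

```python
import numpy as np

def separation(means, sigmas):
    """sep := min over i != j of ||mu_i - mu_j|| / max(sigma_i, sigma_j)."""
    k = len(means)
    return min(np.linalg.norm(means[i] - means[j]) / max(sigmas[i], sigmas[j])
               for i in range(k) for j in range(k) if i != j)

def pca_project(X, k):
    """Project the data onto its top-k principal components, the preprocessing
    step used by the sep = Omega(k^c) algorithms."""
    Xc = X - X.mean(axis=0)                              # center the data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)    # principal directions
    return Xc @ Vt[:k].T
```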

  21. Making progress: discrete hidden Markov models. Hardness reductions create HMMs with degenerate output and next-state distributions. [Figure: a degenerate output distribution, Pr[x_t = · | h_t = 1] ≈ 0.6 · Pr[x_t = · | h_t = 2] + 0.4 · Pr[x_t = · | h_t = 3].]

  22. Making progress: discrete hidden Markov models. Hardness reductions create HMMs with degenerate output and next-state distributions. [Figure: a degenerate output distribution, Pr[x_t = · | h_t = 1] ≈ 0.6 · Pr[x_t = · | h_t = 2] + 0.4 · Pr[x_t = · | h_t = 3].] These instances are avoided by assuming the parameter matrices are full-rank (Mossel-Roch, '06; Hsu-Kakade-Zhang, '09).
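
A minimal sketch of the full-rank condition (mine, with a simplified 3-outcome alphabet rather than the slide's 8): the observation matrix O, whose columns are the output distributions Pr[x_t = · | h_t = j], should have full column rank, which rules out degenerate instances like the one pictured above.

```python
import numpy as np

def has_full_column_rank(M, tol=1e-10):
    """True if M has full column rank (up to the given tolerance)."""
    return np.linalg.matrix_rank(M, tol=tol) == M.shape[1]

# Columns are output distributions Pr[x_t = . | h_t = j] over 3 outcomes.
# The first column equals 0.6 * (second column) + 0.4 * (third column),
# mirroring the slide's degenerate instance, so O is rank-deficient.
O = np.array([[0.20, 0.10, 0.35],
              [0.30, 0.40, 0.15],
              [0.50, 0.50, 0.50]])
print(has_full_column_rank(O))   # -> False
```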

  23. What we do. This work: given ≥ 3 views, mild non-degeneracy conditions imply efficient algorithms for estimation.

  24. What we do. This work: given ≥ 3 views, mild non-degeneracy conditions imply efficient algorithms for estimation.
      ◮ Non-degeneracy condition for the multi-view mixture model: the conditional means {µ_{v,1}, µ_{v,2}, …, µ_{v,k}} are linearly independent for each view v ∈ [ℓ], and w > 0 (every mixing weight is strictly positive). Requires high-dimensional observations (d_v ≥ k)!

  25. What we do. This work: given ≥ 3 views, mild non-degeneracy conditions imply efficient algorithms for estimation.
      ◮ Non-degeneracy condition for the multi-view mixture model: the conditional means {µ_{v,1}, µ_{v,2}, …, µ_{v,k}} are linearly independent for each view v ∈ [ℓ], and w > 0 (every mixing weight is strictly positive). Requires high-dimensional observations (d_v ≥ k)!
      ◮ New efficient learning guarantees for parametric models (e.g., mixtures of Gaussians, general HMMs).

  26. What we do. This work: given ≥ 3 views, mild non-degeneracy conditions imply efficient algorithms for estimation.
      ◮ Non-degeneracy condition for the multi-view mixture model: the conditional means {µ_{v,1}, µ_{v,2}, …, µ_{v,k}} are linearly independent for each view v ∈ [ℓ], and w > 0 (every mixing weight is strictly positive). Requires high-dimensional observations (d_v ≥ k)! (A numerical rank check is sketched below.)
      ◮ New efficient learning guarantees for parametric models (e.g., mixtures of Gaussians, general HMMs).
      ◮ A general tensor decomposition framework applicable to a wide variety of estimation problems.
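
A small sketch (mine; names are hypothetical) of checking the non-degeneracy condition numerically: stack each view's conditional means as the columns of a d_v x k matrix and require full column rank (which forces d_v ≥ k), and require every mixing weight to be strictly positive.

```python
import numpy as np

def non_degenerate(view_means, w, tol=1e-10):
    """Check the slide's non-degeneracy condition.

    view_means[v] is a (d_v x k) matrix whose columns are mu_{v,1}, ..., mu_{v,k}.
    Returns True iff every such matrix has full column rank (hence d_v >= k)
    and every mixing weight w_j is strictly positive.
    """
    if np.any(np.asarray(w) <= 0):
        return False
    return all(np.linalg.matrix_rank(M, tol=tol) == M.shape[1]
               for M in view_means)
```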

  27. Part 2. Multi-view method-of-moments. In this part: Overview; Structure of moments; Uniqueness of decomposition; Computing the decomposition; Asymmetric views. (Other parts: Multi-view mixture models; Some applications and open questions; Concluding remarks.)

  28. The plan. First, assume the views are (conditionally) exchangeable, and derive the basic algorithm.

  29. The plan.
      ◮ First, assume the views are (conditionally) exchangeable, and derive the basic algorithm.
      ◮ Then, provide a reduction from the general multi-view setting to the exchangeable case.
