Efficient algorithms for estimating multi-view mixture models
Daniel Hsu
Microsoft Research, New England
Outline

◮ Multi-view mixture models
◮ Multi-view method-of-moments
◮ Some applications and open questions
◮ Concluding remarks

Part 1. Multi-view mixture models

◮ Unsupervised learning and mixture models
◮ Multi-view mixture models
◮ Complexity barriers
◮ Many modern applications of machine learning:
  ◮ high-dimensional data from many diverse sources,
  ◮ but mostly unlabeled.

◮ Unsupervised learning: extract useful info from this data.
  ◮ Disentangle sub-populations in the data source.
  ◮ Discover useful representations for downstream stages of the learning pipeline (e.g., supervised learning).
Simple latent variable model: mixture model

h ∈ [k] := {1, 2, . . . , k} (hidden);
Pr[ h = j ] = wj;
x | h = j ∼ Pj (observed);

so x has a mixture distribution

P( x) = w1 P1( x) + w2 P2( x) + · · · + wk Pk( x).

Typical use: learn about constituent sub-populations (e.g., clusters) in the data source.
Can we take advantage of diverse sources of information?

Multi-view mixture model:
h ∈ [k];  x1 ∈ Rd1, x2 ∈ Rd2, . . . , xℓ ∈ Rdℓ.
k = # components, ℓ = # views (e.g., audio, video, text).

Multi-view assumption: the views x1, x2, . . . , xℓ are conditionally independent given the component h.

◮ Larger k (# components): more sub-populations to disentangle.
◮ Larger ℓ (# views): more non-redundant sources of information.
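To make the conditional-independence structure concrete, here is a minimal NumPy sketch that simulates a 3-view mixture: draw h from the mixing weights, then draw each view independently given h. The Gaussian view distributions, dimensions, and parameter values are hypothetical placeholders chosen only for illustration, not part of the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
k, dims = 3, [5, 6, 7]                      # k components; view dimensions d1, d2, d3 (hypothetical)
w = np.array([0.5, 0.3, 0.2])               # mixing weights w_j = Pr[h = j]
means = [rng.normal(size=(d, k)) for d in dims]   # conditional means mu_{v,j}: one (d_v x k) matrix per view

def sample(n):
    """Draw n independent copies of (h, x1, x2, x3) from the multi-view mixture."""
    h = rng.choice(k, size=n, p=w)          # hidden component
    views = []
    for M in means:                         # each view drawn independently given h
        noise = 0.1 * rng.normal(size=(n, M.shape[0]))
        views.append(M[:, h].T + noise)     # x_v = mu_{v,h} + per-view noise
    return h, views

h, (x1, x2, x3) = sample(10_000)
```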
“Parameters” of component distributions:

Mixing weights wj := Pr[ h = j ],  j ∈ [k];
Conditional means µv,j := E[ xv | h = j ] ∈ Rdv,  j ∈ [k], v ∈ [ℓ].

Goal: Estimate the mixing weights and conditional means from independent copies of ( x1, x2, . . . , xℓ).

Questions:
◮ Can we estimate the {wj} and { µv,j} without observing h?
◮ What is the computational / sample complexity?
Challenge: many difficult parametric estimation tasks reduce to this estimation problem.

◮ Cryptographic barrier: discrete HMM parameter estimation is as hard as learning parity functions with noise (Mossel-Roch, '06).
◮ Statistical barrier: Gaussian mixtures in R1 can require exp(Ω(k)) samples to estimate parameters, even if the components are well-separated (Moitra-Valiant, '10).
◮ In practice: resort to local search (e.g., EM), often subject to slow convergence and inaccurate local optima.
Gaussian mixture model: the problem becomes easier if we assume some large minimum separation between component means (Dasgupta, '99):

    sep := min_{i ≠ j} ‖ µi − µj ‖ / max{σi, σj}.

◮ sep = Ω(d^c): interpoint distance-based methods / EM
  (Dasgupta, '99; Dasgupta-Schulman, '00; Arora-Kannan, '00)
◮ sep = Ω(k^c): first use PCA to project to k dimensions
  (Vempala-Wang, '02; Kannan-Salmasian-Vempala, '05; Achlioptas-McSherry, '05)
  ◮ Also works for mixtures of log-concave distributions.
◮ No minimum separation requirement: method-of-moments,
  but with exp(Ω(k)) running time / sample size
  (Kalai-Moitra-Valiant, '10; Belkin-Sinha, '10; Moitra-Valiant, '10)
Hardness reductions create HMMs with degenerate output and next-state distributions.

[Figure: output distributions over symbols 1–8, illustrating Pr[ xt = · | ht = 1 ] ≈ 0.6 Pr[ xt = · | ht = 2 ] + 0.4 Pr[ xt = · | ht = 3 ] — one hidden state's output distribution is (nearly) a mixture of two others'.]

These instances are avoided by assuming the parameter matrices are full-rank (Mossel-Roch, '06; Hsu-Kakade-Zhang, '09).
This work: given ≥ 3 views, mild non-degeneracy conditions imply efficient algorithms for estimation.

◮ Non-degeneracy condition for the multi-view mixture model:
  the conditional means { µv,1, µv,2, . . . , µv,k} are linearly independent for each view v ∈ [ℓ], and wj > 0 for all j ∈ [k].
  Requires high-dimensional observations (dv ≥ k)!

◮ New efficient learning guarantees for parametric models (e.g., mixtures of Gaussians, general HMMs).

◮ General tensor decomposition framework applicable to a wide variety of estimation problems.
Part 2. Multi-view method-of-moments

◮ Overview
◮ Structure of moments
◮ Uniqueness of decomposition
◮ Computing the decomposition
◮ Asymmetric views
◮ First, assume the views are (conditionally) exchangeable, and derive the basic algorithm.
◮ Then, provide a reduction from the general multi-view setting to the exchangeable case.
(Conditionally) exchangeable views: assume the views have the same conditional means, i.e.,

    E[ xv | h = j ] ≡ µj,   j ∈ [k], v ∈ [ℓ].

Motivating setting: bag-of-words model, with
    x1, x2, . . . , xℓ ≡ ℓ exchangeable words in a document.

One-hot encoding:
    xv = ei ⇔ v-th word in the document is the i-th word in the vocab
    (where ei ∈ {0, 1}d has a 1 in the i-th position, 0 elsewhere).

Then ( µj)i = E[ ( xv)i | h = j ] = Pr[ xv = ei | h = j ],   i ∈ [d], j ∈ [k],
i.e., µj is the word distribution of the j-th topic.
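As a quick sanity check of this identity, here is a small NumPy sketch (my own illustration, with a hypothetical vocabulary size and topic matrix) that simulates the bag-of-words model and confirms that the empirical average of the one-hot vector xv over documents with topic j approximates that topic's word distribution µj.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, n_docs = 20, 3, 50_000                   # vocab size, # topics, # documents (hypothetical)
w = np.array([0.5, 0.3, 0.2])                  # topic proportions Pr[h = j]
mu = rng.dirichlet(np.ones(d), size=k).T       # d x k: column j is the word distribution of topic j

h = rng.choice(k, size=n_docs, p=w)            # topic of each document
# Draw the first word of each document and one-hot encode it: (x1)_i = 1{word = i}.
word = np.array([rng.choice(d, p=mu[:, t]) for t in h])
x1 = np.eye(d)[word]                           # n_docs x d matrix of one-hot rows

for j in range(k):
    est = x1[h == j].mean(axis=0)              # empirical E[x1 | h = j]
    print(j, np.abs(est - mu[:, j]).max())     # small, so (mu_j)_i ≈ Pr[x_v = e_i | h = j]
```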
Key ideas:
◮ Estimate appropriate low-rank decompositions of moment matrices and tensors.
◮ The decomposition is determined by directions of (locally) maximum skew.
◮ Everything can be performed in poly time.
Recall: E[ xv | h = j ] = µj. By conditional independence and exchangeability of x1, x2, . . . , xℓ given h,

    Pairs := E[ x1 ⊗ x2 ] = E[ E[ x1 | h ] ⊗ E[ x2 | h ] ] = E[ µh ⊗ µh ] = Σ_{i=1}^k wi µi ⊗ µi ∈ Rd×d,

    Triples := E[ x1 ⊗ x2 ⊗ x3 ] = Σ_{i=1}^k wi µi ⊗ µi ⊗ µi ∈ Rd×d×d,   etc.

(If only we could extract these “low-rank” decompositions . . . )
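Here is a hedged NumPy sketch, continuing the kind of simulation above, that forms the empirical moments from i.i.d. copies of (x1, x2, x3) and checks them against the low-rank expressions Σ_i wi µi ⊗ µi and Σ_i wi µi ⊗ µi ⊗ µi; the synthetic data generator and variable names are my assumptions, used only to illustrate the moment structure.

```python
import numpy as np

rng = np.random.default_rng(2)
d, k, n = 8, 3, 100_000
w = np.array([0.5, 0.3, 0.2])
mu = rng.normal(size=(d, k))                               # conditional means (same for every view)

h = rng.choice(k, size=n, p=w)
x1, x2, x3 = (mu[:, h].T + 0.05 * rng.normal(size=(n, d)) for _ in range(3))

# Empirical moments.
pairs_hat = x1.T @ x2 / n                                  # ≈ E[x1 ⊗ x2]
triples_hat = np.einsum('na,nb,nc->abc', x1, x2, x3) / n   # ≈ E[x1 ⊗ x2 ⊗ x3]

# Low-rank population moments.
pairs = np.einsum('i,ai,bi->ab', w, mu, mu)                # Σ_i w_i µ_i ⊗ µ_i
triples = np.einsum('i,ai,bi,ci->abc', w, mu, mu, mu)      # Σ_i w_i µ_i ⊗ µ_i ⊗ µ_i

print(np.abs(pairs_hat - pairs).max(), np.abs(triples_hat - triples).max())   # both small
```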
Non-degeneracy assumption ({ µi} linearly independent)

=⇒ Pairs = Σ_{i=1}^k wi µi ⊗ µi is symmetric, psd, and rank k

=⇒ Pairs equips the k-dim subspace span{ µ1, µ2, . . . , µk} with an inner product

    Pairs( x, y) := x⊤ Pairs y.

However, the { µi} are not generally determined by Pairs alone
(e.g., the { µi} are not necessarily orthogonal).

Must look at higher-order moments?
Claim: moments up to third order (i.e., 3 views) suffice. View Triples: Rd × Rd × Rd → R as a trilinear form.

Theorem
Each isolated local maximizer η∗ of

    max_η Triples( η, η, η)   s.t.   Pairs( η, η) ≤ 1

satisfies, for some i ∈ [k],

    Pairs η∗ = √wi µi,   Triples( η∗, η∗, η∗) = 1/√wi.

Also: these maximizers can be found efficiently and robustly.
    max_η Triples( η, η, η)   s.t.   Pairs( η, η) ≤ 1

(Substitute Pairs = Σ_{i=1}^k wi µi ⊗ µi and Triples = Σ_{i=1}^k wi µi ⊗ µi ⊗ µi.)

    max_η Σ_{i=1}^k wi ( η⊤ µi)³   s.t.   Σ_{i=1}^k wi ( η⊤ µi)² ≤ 1

which we can rewrite as

    max_η Σ_{i=1}^k (1/√wi) (√wi η⊤ µi)³   s.t.   Σ_{i=1}^k (√wi η⊤ µi)² ≤ 1.

(Let θi := √wi ( η⊤ µi) for i ∈ [k].)

    max_θ Σ_{i=1}^k (1/√wi) θi³   s.t.   Σ_{i=1}^k θi² ≤ 1

Isolated local maximizers θ∗ (found via gradient ascent) are (1, 0, . . . , 0), (0, 1, 0, . . . , 0), etc., which means that each η∗ satisfies, for some i ∈ [k],

    η∗⊤ µj = 1/√wj if j = i,   and   η∗⊤ µj = 0 if j ≠ i.

Therefore

    Pairs η∗ = Σ_{j=1}^k wj µj ( µj⊤ η∗) = √wi µi.
Basic algorithm:

1. Set T := Triples.
2. Maximize T( η, η, η) s.t. Pairs( η, η) ≤ 1 via gradient ascent from a random η ∈ range(Pairs). Say the maximum is λ∗ and the maximizer is η∗.
3. Deflate: T := T − λ∗ η∗ ⊗ η∗ ⊗ η∗. Goto step 2.

A variant of this runs in polynomial time (w.h.p.), and is robust to perturbations to Pairs and Triples.
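To make the decomposition step concrete, below is a minimal NumPy sketch that recovers {(wi, µi)} from (population or well-estimated) Pairs and Triples. It uses the standard whitening + tensor power iteration route, which is one common way to carry out the constrained maximization and deflation above; the function name, restart/iteration counts, and the whitening-based variant itself are my assumptions rather than the exact procedure analyzed in the talk.

```python
import numpy as np

def recover_components(pairs, triples, k, n_restarts=10, n_iters=200):
    """Recover (w_i, mu_i) from Pairs = Σ_i w_i µ_i µ_i^T and
    Triples = Σ_i w_i µ_i ⊗ µ_i ⊗ µ_i, assuming the µ_i are linearly independent."""
    rng = np.random.default_rng(0)
    # Whitening: W maps range(Pairs) to R^k with W^T Pairs W = I_k.
    vals, vecs = np.linalg.eigh(pairs)
    top = np.argsort(vals)[::-1][:k]
    W = vecs[:, top] / np.sqrt(vals[top])                  # d x k
    # Whitened tensor: T = Σ_i (1/√w_i) v_i⊗3 with the v_i orthonormal.
    T = np.einsum('abc,ai,bj,ck->ijk', triples, W, W, W)
    weights, means = [], []
    for _ in range(k):
        best_lam, best_theta = -np.inf, None
        for _ in range(n_restarts):                        # random restarts
            theta = rng.normal(size=k)
            theta /= np.linalg.norm(theta)
            for _ in range(n_iters):                       # tensor power iteration
                theta = np.einsum('ijk,j,k->i', T, theta, theta)
                theta /= np.linalg.norm(theta)
            lam = np.einsum('ijk,i,j,k->', T, theta, theta, theta)
            if lam > best_lam:
                best_lam, best_theta = lam, theta
        weights.append(1.0 / best_lam ** 2)                # λ = 1/√w_i  =>  w_i = 1/λ²
        means.append(best_lam * (pairs @ W @ best_theta))  # µ_i = λ · Pairs η∗, with η∗ = W θ∗
        # Deflate: remove the recovered rank-one component from the whitened tensor.
        T -= best_lam * np.einsum('i,j,k->ijk', best_theta, best_theta, best_theta)
    return np.array(weights), np.column_stack(means)
```

Whitening reduces the constrained problem max T(η, η, η) s.t. Pairs(η, η) ≤ 1 to maximizing an orthogonally decomposable tensor over the unit sphere, which is what makes plain power iteration with deflation work; applied to the population moments from the earlier sketch, this should return the wi and µi up to the ordering of the components.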
Each view v has a different set of conditional means { µv,1, µv,2, . . . , µv,k} ⊂ Rdv.

Reduction: transform x1 and x2 to "look like" x3 via linear transformations.

Define the asymmetric cross moment Pairsu,v := E[ xu ⊗ xv]. Transforming view v ∈ {1, 2} to view 3:

    Cv→3 := E[ x3 ⊗ xu] E[ xv ⊗ xu]† ∈ Rd3×dv   (u the remaining view in {1, 2}),

where † denotes the Moore-Penrose pseudoinverse.

Simple exercise to show E[ Cv→3 xv | h = j ] = µ3,j, so Cv→3 xv behaves like x3 (as far as our algorithm can tell).
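A small NumPy sketch of this reduction (with hypothetical dimensions and a synthetic generator of my own, not from the talk): estimate the two cross moments from samples, form C1→3 with a rank-k pseudoinverse, and check that C1→3 maps the view-1 conditional means to the view-3 ones.

```python
import numpy as np

rng = np.random.default_rng(3)
k, d1, d2, d3, n = 3, 5, 6, 7, 500_000
w = np.array([0.5, 0.3, 0.2])
M1, M2, M3 = (rng.normal(size=(d, k)) for d in (d1, d2, d3))   # per-view conditional means µ_{v,j}

h = rng.choice(k, size=n, p=w)
x1 = M1[:, h].T + 0.05 * rng.normal(size=(n, d1))
x2 = M2[:, h].T + 0.05 * rng.normal(size=(n, d2))
x3 = M3[:, h].T + 0.05 * rng.normal(size=(n, d3))

# C_{1→3} := E[x3 ⊗ x2] E[x1 ⊗ x2]†, with u = 2 as the remaining view.
P32 = x3.T @ x2 / n                     # ≈ E[x3 ⊗ x2]
P12 = x1.T @ x2 / n                     # ≈ E[x1 ⊗ x2]
U, s, Vt = np.linalg.svd(P12)
P12_pinv = Vt[:k].T @ np.diag(1.0 / s[:k]) @ U[:, :k].T   # rank-k pseudoinverse (population moment has rank k)
C1to3 = P32 @ P12_pinv

# Check: E[C1to3 x1 | h = j] ≈ µ_{3,j}, i.e. C1to3 @ M1 ≈ M3 (up to sampling error).
print(np.abs(C1to3 @ M1 - M3).max())
```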
Part 3. Some applications and open questions

◮ Mixtures of Gaussians
◮ Hidden Markov models and other models
◮ Topic models
◮ Open questions
Mixture of axis-aligned Gaussians in Rn, with component means µ1, µ2, . . . , µk ∈ Rn; no minimum separation requirement.

Assumptions:
◮ Non-degeneracy: the component means span a k-dim subspace.
◮ Weak incoherence condition: the component means are not perfectly aligned with the coordinate axes — similar to the spreading condition of (Chaudhuri-Rao, '08).

Then, randomly partitioning the coordinates into ℓ ≥ 3 views guarantees (w.h.p.) that non-degeneracy holds in all ℓ views.
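A tiny NumPy sketch of the random-partition trick (dimensions and data are placeholder assumptions): the n coordinates are split at random into ℓ = 3 disjoint groups, and each group is treated as one view. Because the coordinates are conditionally independent given h in an axis-aligned mixture, the resulting views are too.

```python
import numpy as np

rng = np.random.default_rng(4)
n_dims, ell = 30, 3                          # ambient dimension and number of views (hypothetical)
blocks = np.array_split(rng.permutation(n_dims), ell)   # random partition of the coordinates

def split_into_views(X):
    """X: (num_samples, n_dims) samples from an axis-aligned Gaussian mixture
    -> list of ell per-view matrices (one column block per view)."""
    return [X[:, idx] for idx in blocks]

X = rng.normal(size=(1000, n_dims))          # placeholder data; in practice, the mixture samples
x1, x2, x3 = split_into_views(X)             # now run the 3-view method-of-moments on these
```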
Hidden Markov models:
[Diagram: an HMM with hidden states h1, h2, h3 and observations x1, x2, x3, viewed as a multi-view mixture model with a single hidden variable h.]
Other models:
◮ (Anandkumar-Foster-Hsu-Kakade-Liu, NIPS'12)
◮ (Arora-Ge-Moitra-Sachdeva, NIPS'12; Hsu-Kakade, ITCS'13)
( µj)i = Pr[ see word i in document | document topic is j ].

◮ Corpus: New York Times (from UCI), 300000 articles.
◮ Vocabulary size: d = 102660 words.
◮ Chose k = 50.
◮ For each topic j, show the top 10 words i.

[Table: top-10 words for a selection of the 50 estimated topics — recognizable clusters include baseball (run, inning, hit, season, game), business/markets (sales, consumer, company, percent, stock, market), health (drug, patient, doctor, cost), education (school, student, teacher, program), U.S. elections (al_gore, george_bush, campaign, clinton, presidential), the Middle East (palestinian, israel, yasser_arafat, peace), golf (tiger_wood, tour, round, shot), and film/television (film, movie, music, show, network, character), among others.]
Open questions:

◮ What if k > dv? (relevant to overcomplete dictionary learning)
  ◮ Apply some non-linear transformations xv → fv( xv)?
  ◮ Combine views, e.g., via tensor products:
    x̃1,2 := x1 ⊗ x2,  x̃3,4 := x3 ⊗ x4,  x̃5,6 := x5 ⊗ x6,  etc.

◮ Can we relax the multi-view assumption?
  ◮ Allow for a richer hidden state? (e.g., independent component analysis)
  ◮ "Gaussianization" via random projection?
Part 4. Concluding remarks
Take-home messages:

◮ Power of multiple views: can take advantage of diverse / non-redundant sources of information in unsupervised learning.

◮ Overcoming complexity barriers: some provably hard estimation problems become easy after ruling out "degenerate" cases.

◮ "Blessing of dimensionality" for estimators based on the method-of-moments.
(Co-authors: Anima Anandkumar, Dean Foster, Rong Ge, Sham Kakade, Yi-Kai Liu, Matus Telgarsky)
http://arxiv.org/abs/1210.7559