Sharp bounds for learning a mixture of two Gaussians
Moritz Hardt, Eric Price (IBM Almaden), 2014-05-28




  1. Sharp bounds for learning a mixture of two Gaussians
     Moritz Hardt, Eric Price (IBM Almaden)
     2014-05-28


  2. Problem
     [Figure: height distribution of American 20-year-olds, 140-200 cm]
     ◮ Male/female heights are very close to Gaussian distributions.
     Can we learn the average male and female heights from unlabeled population data?
     How many samples to learn µ_1, µ_2 to ±εσ?


  3. Gaussian Mixtures: Origins
     Contributions to the Mathematical Theory of Evolution, Karl Pearson, 1894.
     Pearson's naturalist buddy measured lots of crab body parts.
     Most lengths seemed to follow the "normal" distribution (a recently coined name).
     But the "forehead" size wasn't symmetric. Maybe there were actually two species of crabs?

  4. More previous work
     Pearson 1894: proposed a method for 2 Gaussians
     ◮ "Method of moments"
     Other empirical papers over the years:
     ◮ Royce '58, Gridgeman '70, Gupta-Huang '80
     Provable results assuming the components are well separated:
     ◮ Clustering: Dasgupta '99, DA '00
     ◮ Spectral methods: VW '04, AK '05, KSV '05, AM '05, VW '05
     Kalai-Moitra-Valiant 2010: first general polynomial bound.
     ◮ Extended to general k mixtures: Moitra-Valiant '10, Belkin-Sinha '10
     The KMV polynomial is very large.
     ◮ Our result: tight upper and lower bounds on the sample complexity,
       for k = 2 mixtures in arbitrary d dimensions.



  5. Learning the components vs. learning the sum
     [Figure: mixture density and its two components, 140-200 cm]
     It's important that we want to learn the individual components:
     ◮ Male/female average heights and standard deviations.
     Getting an ε approximation in TV norm to the overall distribution takes Θ(1/ε^2)
     samples from black-box techniques.
     ◮ Quite general: holds for any mixture of known unimodal distributions.
       [Chan, Diakonikolas, Servedio, Sun '13]

  6. We show Pearson's 1894 method can be extended to be optimal!
     Suppose we want the means and variances to ε accuracy:
     ◮ µ_i to ±εσ
     ◮ σ_i^2 to ±ε^2 σ^2
     In one dimension: Θ(1/ε^12) samples are necessary and sufficient.
     ◮ Previously: O(1/ε^300).
     ◮ Moreover: the algorithm is almost the same as Pearson's (1894).
     In d dimensions, Θ((1/ε^12) log d) samples are necessary and sufficient.
     ◮ "σ^2" is the max variance in any coordinate.
     ◮ Get each entry of the covariance matrix to ±ε^2 σ^2.
     ◮ Previously: O((d/ε)^300,000).
     Caveat: we assume p_1, p_2 are bounded away from zero.

  7. Outline
     1. Algorithm in One Dimension
     2. Algorithm in d Dimensions
     3. Lower Bound


  8. Method of Moments
     [Figure: height distribution, 140-200 cm]
     We want to learn five parameters: µ_1, µ_2, σ_1, σ_2, p_1, p_2 with p_1 + p_2 = 1.
     Moments give polynomial equations in the parameters:
       M_1 := E[x^1] = p_1 µ_1 + p_2 µ_2
       M_2 := E[x^2] = p_1 µ_1^2 + p_2 µ_2^2 + p_1 σ_1^2 + p_2 σ_2^2
       M_3, M_4, M_5 = [...]
     Use our samples to estimate the moments.
     Solve the system of equations to find the parameters.
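The estimation step is easy to sketch in code. A minimal Python/numpy example on synthetic data; the mixture parameters (165 cm and 178 cm, σ = 7, equal weights) are invented for illustration and are not from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the height data: a 50/50 mixture of two Gaussians.
n = 200_000
female = rng.normal(165.0, 7.0, n)
male = rng.normal(178.0, 7.0, n)
samples = np.where(rng.random(n) < 0.5, female, male)

def raw_moments(x, k_max=6):
    """Empirical raw moments M_k = E[x^k], k = 1..k_max."""
    x = np.asarray(x, dtype=float)
    return np.array([np.mean(x**k) for k in range(1, k_max + 1)])

M = raw_moments(samples)   # M[0] is M_1, ..., M[5] is M_6
```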

  9. Method of Moments: solving the system
     Start with five parameters. First, we can assume mean zero:
     ◮ Convert to "central moments".
     ◮ M'_2 = M_2 − M_1^2 is independent of translation.
     Analogously, we can assume min(σ_1, σ_2) = 0 by converting to "excess moments":
     ◮ X_4 = M_4 − 3 M_2^2 is independent of adding N(0, σ^2).
     ◮ "Excess kurtosis" was coined by Pearson, and appears in every Wikipedia
       probability distribution infobox.
     This leaves three free parameters.
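In code, both normalizations are a few lines. The slide only states the X_4 formula; the X_5 and X_6 formulas below are the standard cumulant identities, which have the same invariance under adding N(0, σ^2) — an assumption on my part, since the deck doesn't spell them out:

```python
import numpy as np

def excess_moments(samples):
    """Central moments, then excess moments X_k invariant to convolving
    with a Gaussian. X_4 = M_4 - 3*M_2^2 is from the slide; X_5 and X_6
    use the cumulant formulas (assumed, not stated in the deck)."""
    x = np.asarray(samples, dtype=float)
    x = x - x.mean()                         # "central moments": mean zero
    M = {k: np.mean(x**k) for k in range(2, 7)}
    X3 = M[3]
    X4 = M[4] - 3 * M[2]**2
    X5 = M[5] - 10 * M[3] * M[2]
    X6 = M[6] - 15 * M[4] * M[2] - 10 * M[3]**2 + 30 * M[2]**3
    return M, (X3, X4, X5, X6)
```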

  10. Method of Moments: system of equations
     It is convenient to reparameterize by
       α = −µ_1 µ_2,   β = µ_1 + µ_2,   γ = (σ_2^2 − σ_1^2) / (µ_2 − µ_1).
     This gives
       X_3 = α (β + 3γ)
       X_4 = α (−2α + β^2 + 6βγ + 3γ^2)
       X_5 = α (β^3 − 8αβ + 10β^2 γ + 15γ^2 β − 20αγ)
       X_6 = α (16α^2 − 12αβ^2 − 60αβγ + β^4 + 15β^3 γ + 45β^2 γ^2 + 15βγ^3)
     "All my attempts to obtain a simpler set have failed... It is possible, however,
     that some other... equations of a less complex kind may ultimately be found."
     —Karl Pearson
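The first two identities can be checked symbolically. A sympy sketch under the previous slide's normalizations (mean zero, so p_1 µ_1 + p_2 µ_2 = 0, and min variance zero, so σ_1 = 0); both prints should output 0:

```python
import sympy as sp

mu1, mu2, s = sp.symbols('mu1 mu2 s', real=True)
t = sp.symbols('t')

# Normalizations: mean zero (fixes p1, p2) and sigma_1 = 0, sigma_2 = s.
p1 = mu2 / (mu2 - mu1)
p2 = -mu1 / (mu2 - mu1)

def gauss_moment(mu, sigma, k):
    """k-th raw moment of N(mu, sigma^2), via the moment generating function."""
    mgf = sp.exp(mu * t + sigma**2 * t**2 / 2)
    return sp.diff(mgf, t, k).subs(t, 0)

def mix_moment(k):
    return sp.expand(p1 * gauss_moment(mu1, 0, k) + p2 * gauss_moment(mu2, s, k))

X3 = mix_moment(3)                          # mean is zero, so M_3 is central
X4 = sp.expand(mix_moment(4) - 3 * mix_moment(2)**2)

alpha, beta = -mu1 * mu2, mu1 + mu2
gamma = s**2 / (mu2 - mu1)                  # (sigma_2^2 - sigma_1^2)/(mu2 - mu1)

print(sp.simplify(X3 - alpha * (beta + 3 * gamma)))                                   # 0
print(sp.simplify(X4 - alpha * (-2*alpha + beta**2 + 6*beta*gamma + 3*gamma**2)))     # 0
```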

  11. Pearson's Polynomial
     Chug chug chug... Get a 9th-degree polynomial in the excess moments X_3, X_4, X_5:
       p(α) = 8α^9 + 28 X_4 α^7 − 12 X_3^2 α^6 + (24 X_3 X_5 + 30 X_4^2) α^5
              + (6 X_5^2 − 148 X_3^2 X_4) α^4 + (96 X_3^4 − 36 X_3 X_4 X_5 + 9 X_4^3) α^3
              + (24 X_3^3 X_5 + 21 X_3^2 X_4^2) α^2 − 32 X_3^4 X_4 α + 8 X_3^6 = 0
     It is easy to go from solutions α to mixtures µ_i, σ_i, p_i.
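Numerically, the candidate roots can be read off with numpy. The coefficient vector below follows the reconstruction of the garbled slide formula above, so treat it as an assumption rather than gospel:

```python
import numpy as np

def pearson_poly_coeffs(X3, X4, X5):
    """Coefficients of p(alpha), alpha^9 down to alpha^0 (no alpha^8 term),
    as reconstructed on the slide above."""
    return [
        8.0,
        0.0,
        28.0 * X4,
        -12.0 * X3**2,
        24.0 * X3 * X5 + 30.0 * X4**2,
        6.0 * X5**2 - 148.0 * X3**2 * X4,
        96.0 * X3**4 - 36.0 * X3 * X4 * X5 + 9.0 * X4**3,
        24.0 * X3**3 * X5 + 21.0 * X3**2 * X4**2,
        -32.0 * X3**4 * X4,
        8.0 * X3**6,
    ]

def positive_real_roots(coeffs, tol=1e-8):
    """All real roots alpha > 0; each is a candidate alpha = -mu_1*mu_2."""
    r = np.roots(coeffs)
    r = r[np.abs(r.imag) < tol * (1 + np.abs(r.real))].real
    return np.sort(r[r > 0])
```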

  12. Pearson's Polynomial
     [Figure: plot of p(α); the positive roots are marked]
     Get a 9th-degree polynomial in the excess moments X_3, X_4, X_5.
     ◮ Positive roots correspond to mixtures that match on five moments.
     ◮ Usually there are two such roots.
     ◮ Pearson's proposal: choose the candidate with the closer 6th moment.
       Works because six moments uniquely identify the mixture. [KMV]
     How robust is this to moment estimation error?
     ◮ Usually it works well...
     ◮ ...but not when there's a double root.

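A sketch of the selection step. Going from a root α back to mixture parameters is not spelled out in the deck; the inversion below follows the α, β, γ equations from slide 10, with the sign of γ fixed by the X_5 equation (my choice), and then applies Pearson's 6th-moment tie-break:

```python
import numpy as np

def gaussian_m6(mu, var):
    """6th raw moment of N(mu, var)."""
    return mu**6 + 15*mu**4*var + 45*mu**2*var**2 + 15*var**3

def candidate_from_alpha(alpha, X3, X4, X5, M2):
    """Map a positive root alpha to a candidate (mu1, mu2, v1, v2, p1),
    inverting the slide's reparameterization; moments are central (mean 0)."""
    T = X3 / alpha                        # beta + 3*gamma
    U = X4 / alpha + 2 * alpha            # beta^2 + 6*beta*gamma + 3*gamma^2
    g = np.sqrt(max((T * T - U) / 6.0, 0.0))
    best = None
    for gamma in (g, -g):                 # pick the sign that fits X_5 best
        beta = T - 3 * gamma
        X5_pred = alpha * (beta**3 - 8*alpha*beta + 10*beta**2*gamma
                           + 15*gamma**2*beta - 20*alpha*gamma)
        if best is None or abs(X5_pred - X5) < best[0]:
            best = (abs(X5_pred - X5), beta, gamma)
    _, beta, gamma = best
    disc = np.sqrt(beta**2 + 4 * alpha)   # mu_i are roots of t^2 - beta*t - alpha
    mu1, mu2 = (beta - disc) / 2, (beta + disc) / 2
    p1 = mu2 / (mu2 - mu1)                # from p1*mu1 + p2*mu2 = 0
    v1 = (M2 - alpha) - (1 - p1) * gamma * (mu2 - mu1)  # p1*v1 + p2*v2 = M2 - alpha
    v2 = v1 + gamma * (mu2 - mu1)                       # v2 - v1 = gamma*(mu2 - mu1)
    return mu1, mu2, v1, v2, p1

def select_by_sixth_moment(alphas, X3, X4, X5, M2, M6):
    """Pearson's tie-break: keep the candidate whose predicted 6th moment
    is closest to the empirical one."""
    def m6_err(a):
        mu1, mu2, v1, v2, p1 = candidate_from_alpha(a, X3, X4, X5, M2)
        pred = p1 * gaussian_m6(mu1, v1) + (1 - p1) * gaussian_m6(mu2, v2)
        return abs(pred - M6)
    return candidate_from_alpha(min(alphas, key=m6_err), X3, X4, X5, M2)
```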

  13. Making it robust in all cases
     We can create another ninth-degree polynomial p_6 from X_3, X_4, X_5, X_6.
     Then α is the unique positive solution of r(x) := p_5(x)^2 + p_6(x)^2 = 0.
     Therefore q(x) := r(x)/(x − α)^2 has no positive roots.
     We would like q(x) ≥ c > 0 for all x and all mixtures α, β, γ:
     ◮ Then for |p̂_5 − p_5|, |p̂_6 − p_6| ≤ ε, we get |α − argmin_x r̂(x)| ≤ √(ε/c).
     ◮ Compactness: this holds on any closed and bounded region.
     Bounded:
     ◮ For unbounded variables, the dominating terms show q → ∞.
     Closed:
     ◮ The issue is that x > 0 isn't closed.
     ◮ We can use X_3, X_4 to get an O(1) approximation α̂ to α.
     ◮ x ∈ [α̂/10, α̂] is closed.
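A sketch of the robust variant: rather than root-finding, minimize r̂ on the closed interval built from the crude approximation. The polynomial p_6 is not given in the deck, so it stays an input here:

```python
import numpy as np

def robust_alpha(p5_hat, p6_hat, alpha_approx, grid_size=100_000):
    """Minimize r_hat(x) = p5_hat(x)^2 + p6_hat(x)^2 over the slide's closed
    interval [alpha_approx/10, alpha_approx]; p5_hat, p6_hat are np.poly1d
    objects built from estimated excess moments (p6_hat is hypothetical)."""
    xs = np.linspace(alpha_approx / 10.0, alpha_approx, grid_size)
    r = p5_hat(xs)**2 + p6_hat(xs)**2
    return xs[np.argmin(r)]

# Example wiring (p6_hat left to the reader):
# p5_hat = np.poly1d(pearson_poly_coeffs(X3, X4, X5))
```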

  14. Result
     [Figure: two panels, "Large ∆" and "Small ∆"]
     Suppose the two components have means ∆σ apart. Then if we know M_i to
     ±ε(∆σ)^i, the algorithm recovers the means to ±ε∆σ.
     Therefore O(∆^−12 ε^−2) samples give an ε∆ approximation.
     ◮ If the components are Ω(1) standard deviations apart, O(1/ε^2) samples suffice.
     ◮ In general, O(1/ε^12) samples suffice to get εσ accuracy: for accuracy εσ = (ε/∆)·∆σ,
       the bound gives O(∆^−12 (∆/ε)^2) = O(∆^−10 ε^−2) samples, worst when ∆ ≈ ε.
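Wiring the earlier sketches together on the synthetic samples from the first snippet (all names defined above):

```python
import numpy as np

# Uses samples, excess_moments, pearson_poly_coeffs, positive_real_roots,
# and select_by_sixth_moment from the earlier sketches.
mean = samples.mean()
M, (X3, X4, X5, X6) = excess_moments(samples)
alphas = positive_real_roots(pearson_poly_coeffs(X3, X4, X5))
mu1, mu2, v1, v2, p1 = select_by_sixth_moment(alphas, X3, X4, X5, M[2], M[6])
print(mu1 + mean, mu2 + mean, np.sqrt(max(v1, 0)), np.sqrt(max(v2, 0)), p1)
# On this synthetic data the means should land near 165 and 178, sigma near 7.
```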
