Tight Bounds for Learning a Mixture of Two Gaussians Moritz Hardt - PowerPoint PPT Presentation

More previous work Pearson 1894: proposed method for 2 Gaussians ◮ “Method of moments” Other empirical papers over the years: ◮ Royce ’58, Gridgeman ’70, Gupta-Huang ’80 Provable results assuming the components are well-separated: ◮ Clustering: Dasgupta ’99, DA ’00 ◮ Spectral methods: VW ’04, AK ’05, KSV ’05, AM ’05, VW ’05 Kalai-Moitra-Valiant 2010: first general polynomial bound. ◮ Extended to general k mixtures: Moitra-Valiant ’10, Belkin-Sinha ’10 The KMV polynomial is very large. ◮ Our result : tight upper and lower bounds for the sample complexity. ◮ For k = 2 mixtures, arbitrary d dimensions. ◮ Lower bound extends to larger k . Moritz Hardt, Eric Price (Google/UT) Tight Bounds for Learning a Mixture of Two Gaussians 2015-06-17 5 / 27

Learning the components vs. learning the sum 140 140 140 160 160 160 180 180 180 200 200 200 Height (cm) Height (cm) Height (cm) It’s important that we want to learn the individual components: Moritz Hardt, Eric Price (Google/UT) Tight Bounds for Learning a Mixture of Two Gaussians 2015-06-17 6 / 27

Learning the components vs. learning the sum 140 140 140 160 160 160 180 180 180 200 200 200 Height (cm) Height (cm) Height (cm) It’s important that we want to learn the individual components: ◮ Male/female average heights, std. deviations. Moritz Hardt, Eric Price (Google/UT) Tight Bounds for Learning a Mixture of Two Gaussians 2015-06-17 6 / 27

Learning the components vs. learning the sum 140 140 140 160 160 160 180 180 180 200 200 200 Height (cm) Height (cm) Height (cm) It’s important that we want to learn the individual components: ◮ Male/female average heights, std. deviations. Getting ǫ approximation in TV norm to overall distribution takes � Θ( 1 /ǫ 2 ) samples from black box techniques. Moritz Hardt, Eric Price (Google/UT) Tight Bounds for Learning a Mixture of Two Gaussians 2015-06-17 6 / 27

Learning the components vs. learning the sum 140 140 140 160 160 160 180 180 180 200 200 200 Height (cm) Height (cm) Height (cm) It’s important that we want to learn the individual components: ◮ Male/female average heights, std. deviations. Getting ǫ approximation in TV norm to overall distribution takes � Θ( 1 /ǫ 2 ) samples from black box techniques. ◮ Quite general: non-properly for any mixture of known unimodal distributions. [Chan, Diakonikolas, Servedio, Sun ’13] Moritz Hardt, Eric Price (Google/UT) Tight Bounds for Learning a Mixture of Two Gaussians 2015-06-17 6 / 27

Learning the components vs. learning the sum 140 140 140 160 160 160 180 180 180 200 200 200 Height (cm) Height (cm) Height (cm) It’s important that we want to learn the individual components: ◮ Male/female average heights, std. deviations. Getting ǫ approximation in TV norm to overall distribution takes � Θ( 1 /ǫ 2 ) samples from black box techniques. ◮ Quite general: non-properly for any mixture of known unimodal distributions. [Chan, Diakonikolas, Servedio, Sun ’13] ◮ Proper learning: [Daskalakis-Kamath ’14] Moritz Hardt, Eric Price (Google/UT) Tight Bounds for Learning a Mixture of Two Gaussians 2015-06-17 6 / 27

Learning the components vs. learning the sum 140 140 140 160 160 160 180 180 180 200 200 200 Height (cm) Height (cm) Height (cm) It’s important that we want to learn the individual components: ◮ Male/female average heights, std. deviations. Getting ǫ approximation in TV norm to overall distribution takes � Θ( 1 /ǫ 2 ) samples from black box techniques. ◮ Quite general: non-properly for any mixture of known unimodal distributions. [Chan, Diakonikolas, Servedio, Sun ’13] ◮ Proper learning: [Daskalakis-Kamath ’14] ◮ But only in low dimensions. Moritz Hardt, Eric Price (Google/UT) Tight Bounds for Learning a Mixture of Two Gaussians 2015-06-17 6 / 27

Learning the components vs. learning the sum 140 140 140 160 160 160 180 180 180 200 200 200 Height (cm) Height (cm) Height (cm) It’s important that we want to learn the individual components: ◮ Male/female average heights, std. deviations. Getting ǫ approximation in TV norm to overall distribution takes � Θ( 1 /ǫ 2 ) samples from black box techniques. ◮ Quite general: non-properly for any mixture of known unimodal distributions. [Chan, Diakonikolas, Servedio, Sun ’13] ◮ Proper learning: [Daskalakis-Kamath ’14] ◮ But only in low dimensions. ◮ Generic high- d TV estimation algs use 1d parameter estimation. Moritz Hardt, Eric Price (Google/UT) Tight Bounds for Learning a Mixture of Two Gaussians 2015-06-17 6 / 27

α Our result A variant of Pearson’s 1894 method is optimal! Moritz Hardt, Eric Price (Google/UT) Tight Bounds for Learning a Mixture of Two Gaussians 2015-06-17 7 / 27

α Our result A variant of Pearson’s 1894 method is optimal! Suppose we want means and variances to ǫ accuracy: Moritz Hardt, Eric Price (Google/UT) Tight Bounds for Learning a Mixture of Two Gaussians 2015-06-17 7 / 27

α Our result A variant of Pearson’s 1894 method is optimal! Suppose we want means and variances to ǫ accuracy: ◮ µ i to ± ǫσ Moritz Hardt, Eric Price (Google/UT) Tight Bounds for Learning a Mixture of Two Gaussians 2015-06-17 7 / 27

α Our result A variant of Pearson’s 1894 method is optimal! Suppose we want means and variances to ǫ accuracy: ◮ µ i to ± ǫσ ◮ σ 2 i to ± ǫ 2 σ 2 Moritz Hardt, Eric Price (Google/UT) Tight Bounds for Learning a Mixture of Two Gaussians 2015-06-17 7 / 27

Our result A variant of Pearson’s 1894 method is optimal! Suppose we want means and variances to ǫ accuracy: ◮ µ i to ± ǫσ ◮ σ 2 i to ± ǫ 2 σ 2 In one dimension: Θ( 1 /ǫ 12 ) samples necessary and sufficient . α Moritz Hardt, Eric Price (Google/UT) Tight Bounds for Learning a Mixture of Two Gaussians 2015-06-17 7 / 27

Our result A variant of Pearson’s 1894 method is optimal! Suppose we want means and variances to ǫ accuracy: ◮ µ i to ± ǫσ ◮ σ 2 i to ± ǫ 2 σ 2 In one dimension: Θ( 1 /ǫ 12 ) samples necessary and sufficient . ◮ Previously: 1 /ǫ ≈ 300 , no lower bound. α Moritz Hardt, Eric Price (Google/UT) Tight Bounds for Learning a Mixture of Two Gaussians 2015-06-17 7 / 27

Our result A variant of Pearson’s 1894 method is optimal! Suppose we want means and variances to ǫ accuracy: ◮ µ i to ± ǫσ ◮ σ 2 i to ± ǫ 2 σ 2 In one dimension: Θ( 1 /ǫ 12 ) samples necessary and sufficient . ◮ Previously: 1 /ǫ ≈ 300 , no lower bound. ◮ Moreover: algorithm is almost the same as Pearson (1894). α Moritz Hardt, Eric Price (Google/UT) Tight Bounds for Learning a Mixture of Two Gaussians 2015-06-17 7 / 27

Our result A variant of Pearson’s 1894 method is optimal! Suppose we want means and variances to ǫ accuracy: ◮ µ i to ± ǫσ ◮ σ 2 i to ± ǫ 2 σ 2 In one dimension: Θ( 1 /ǫ 12 ) samples necessary and sufficient . ◮ Previously: 1 /ǫ ≈ 300 , no lower bound. ◮ Moreover: algorithm is almost the same as Pearson (1894). α More precisely: if two gaussians are α standard deviations apart, 1 getting ǫα precision takes Θ( α 12 ǫ 2 ) samples. Moritz Hardt, Eric Price (Google/UT) Tight Bounds for Learning a Mixture of Two Gaussians 2015-06-17 7 / 27

Our result: higher dimensions In d dimensions, Θ( 1 /ǫ 12 log d ) samples for parameter distance . Moritz Hardt, Eric Price (Google/UT) Tight Bounds for Learning a Mixture of Two Gaussians 2015-06-17 8 / 27

Our result: higher dimensions In d dimensions, Θ( 1 /ǫ 12 log d ) samples for parameter distance . ◮ “ σ 2 ” is max variance in any coordinate. Moritz Hardt, Eric Price (Google/UT) Tight Bounds for Learning a Mixture of Two Gaussians 2015-06-17 8 / 27

Our result: higher dimensions In d dimensions, Θ( 1 /ǫ 12 log d ) samples for parameter distance . ◮ “ σ 2 ” is max variance in any coordinate. ◮ Get each entry of covariance matrix to ± ǫ 2 σ 2 . Moritz Hardt, Eric Price (Google/UT) Tight Bounds for Learning a Mixture of Two Gaussians 2015-06-17 8 / 27

Our result: higher dimensions In d dimensions, Θ( 1 /ǫ 12 log d ) samples for parameter distance . ◮ “ σ 2 ” is max variance in any coordinate. ◮ Get each entry of covariance matrix to ± ǫ 2 σ 2 . ◮ Useful when covariance matrix is sparse. Moritz Hardt, Eric Price (Google/UT) Tight Bounds for Learning a Mixture of Two Gaussians 2015-06-17 8 / 27

Our result: higher dimensions In d dimensions, Θ( 1 /ǫ 12 log d ) samples for parameter distance . ◮ “ σ 2 ” is max variance in any coordinate. ◮ Get each entry of covariance matrix to ± ǫ 2 σ 2 . ◮ Useful when covariance matrix is sparse. Also gives an improved bound in TV error of each component: Moritz Hardt, Eric Price (Google/UT) Tight Bounds for Learning a Mixture of Two Gaussians 2015-06-17 8 / 27

Our result: higher dimensions In d dimensions, Θ( 1 /ǫ 12 log d ) samples for parameter distance . ◮ “ σ 2 ” is max variance in any coordinate. ◮ Get each entry of covariance matrix to ± ǫ 2 σ 2 . ◮ Useful when covariance matrix is sparse. Also gives an improved bound in TV error of each component: ◮ If components overlap, then parameter distance ≈ TV. Moritz Hardt, Eric Price (Google/UT) Tight Bounds for Learning a Mixture of Two Gaussians 2015-06-17 8 / 27

Our result: higher dimensions In d dimensions, Θ( 1 /ǫ 12 log d ) samples for parameter distance . ◮ “ σ 2 ” is max variance in any coordinate. ◮ Get each entry of covariance matrix to ± ǫ 2 σ 2 . ◮ Useful when covariance matrix is sparse. Also gives an improved bound in TV error of each component: ◮ If components overlap, then parameter distance ≈ TV. ◮ If components don’t overlap, then clustering is trivial. Moritz Hardt, Eric Price (Google/UT) Tight Bounds for Learning a Mixture of Two Gaussians 2015-06-17 8 / 27

Our result: higher dimensions In d dimensions, Θ( 1 /ǫ 12 log d ) samples for parameter distance . ◮ “ σ 2 ” is max variance in any coordinate. ◮ Get each entry of covariance matrix to ± ǫ 2 σ 2 . ◮ Useful when covariance matrix is sparse. Also gives an improved bound in TV error of each component: ◮ If components overlap, then parameter distance ≈ TV. ◮ If components don’t overlap, then clustering is trivial. ◮ Straightforwardly gives � O ( d 30 /ǫ 36 ) samples. Moritz Hardt, Eric Price (Google/UT) Tight Bounds for Learning a Mixture of Two Gaussians 2015-06-17 8 / 27

Our result: higher dimensions In d dimensions, Θ( 1 /ǫ 12 log d ) samples for parameter distance . ◮ “ σ 2 ” is max variance in any coordinate. ◮ Get each entry of covariance matrix to ± ǫ 2 σ 2 . ◮ Useful when covariance matrix is sparse. Also gives an improved bound in TV error of each component: ◮ If components overlap, then parameter distance ≈ TV. ◮ If components don’t overlap, then clustering is trivial. ◮ Straightforwardly gives � O ( d 30 /ǫ 36 ) samples. ◮ Best known, but not the � O ( d /ǫ c ) we want. Moritz Hardt, Eric Price (Google/UT) Tight Bounds for Learning a Mixture of Two Gaussians 2015-06-17 8 / 27

Our result: higher dimensions In d dimensions, Θ( 1 /ǫ 12 log d ) samples for parameter distance . ◮ “ σ 2 ” is max variance in any coordinate. ◮ Get each entry of covariance matrix to ± ǫ 2 σ 2 . ◮ Useful when covariance matrix is sparse. Also gives an improved bound in TV error of each component: ◮ If components overlap, then parameter distance ≈ TV. ◮ If components don’t overlap, then clustering is trivial. ◮ Straightforwardly gives � O ( d 30 /ǫ 36 ) samples. ◮ Best known, but not the � O ( d /ǫ c ) we want. Caveat: assume p 1 , p 2 are bounded away from zero throughout. Moritz Hardt, Eric Price (Google/UT) Tight Bounds for Learning a Mixture of Two Gaussians 2015-06-17 8 / 27

Outline Algorithm in One Dimension 1 Moritz Hardt, Eric Price (Google/UT) Tight Bounds for Learning a Mixture of Two Gaussians 2015-06-17 9 / 27

Outline Algorithm in One Dimension 1 Lower Bound 2 Moritz Hardt, Eric Price (Google/UT) Tight Bounds for Learning a Mixture of Two Gaussians 2015-06-17 9 / 27

Outline Algorithm in One Dimension 1 Lower Bound 2 Algorithm in d Dimensions 3 Moritz Hardt, Eric Price (Google/UT) Tight Bounds for Learning a Mixture of Two Gaussians 2015-06-17 9 / 27

Method of Moments 140 160 180 200 Height (cm) We want to learn five parameters: µ 1 , µ 2 , σ 1 , σ 2 , p 1 , p 2 with p 1 + p 2 = 1. Moritz Hardt, Eric Price (Google/UT) Tight Bounds for Learning a Mixture of Two Gaussians 2015-06-17 11 / 27

Method of Moments 140 160 180 200 Height (cm) We want to learn five parameters: µ 1 , µ 2 , σ 1 , σ 2 , p 1 , p 2 with p 1 + p 2 = 1. Moments give polynomial equations in parameters: M 1 := E [ x 1 ] = p 1 µ 1 + p 2 µ 2 M 2 := E [ x 2 ] = p 1 µ 2 1 + p 2 µ 2 2 + p 1 σ 2 1 + p 2 σ 2 2 M 3 , M 4 , M 5 , M 6 = [ ... ] Moritz Hardt, Eric Price (Google/UT) Tight Bounds for Learning a Mixture of Two Gaussians 2015-06-17 11 / 27

Method of Moments 140 160 180 200 Height (cm) We want to learn five parameters: µ 1 , µ 2 , σ 1 , σ 2 , p 1 , p 2 with p 1 + p 2 = 1. Moments give polynomial equations in parameters: M 1 := E [ x 1 ] = p 1 µ 1 + p 2 µ 2 M 2 := E [ x 2 ] = p 1 µ 2 1 + p 2 µ 2 2 + p 1 σ 2 1 + p 2 σ 2 2 M 3 , M 4 , M 5 , M 6 = [ ... ] Use our samples to estimate the moments. Moritz Hardt, Eric Price (Google/UT) Tight Bounds for Learning a Mixture of Two Gaussians 2015-06-17 11 / 27

Method of Moments 140 160 180 200 Height (cm) We want to learn five parameters: µ 1 , µ 2 , σ 1 , σ 2 , p 1 , p 2 with p 1 + p 2 = 1. Moments give polynomial equations in parameters: M 1 := E [ x 1 ] = p 1 µ 1 + p 2 µ 2 M 2 := E [ x 2 ] = p 1 µ 2 1 + p 2 µ 2 2 + p 1 σ 2 1 + p 2 σ 2 2 M 3 , M 4 , M 5 , M 6 = [ ... ] Use our samples to estimate the moments. Solve the system of equations to find the parameters. Moritz Hardt, Eric Price (Google/UT) Tight Bounds for Learning a Mixture of Two Gaussians 2015-06-17 11 / 27

Method of Moments Solving the system Start with five parameters. Moritz Hardt, Eric Price (Google/UT) Tight Bounds for Learning a Mixture of Two Gaussians 2015-06-17 12 / 27

Method of Moments Solving the system Start with five parameters. First, can assume mean zero: ◮ Convert to “central moments” Moritz Hardt, Eric Price (Google/UT) Tight Bounds for Learning a Mixture of Two Gaussians 2015-06-17 12 / 27

Method of Moments Solving the system Start with five parameters. First, can assume mean zero: ◮ Convert to “central moments” ◮ M ′ 2 = M 2 − M 2 1 is independent of translation. Moritz Hardt, Eric Price (Google/UT) Tight Bounds for Learning a Mixture of Two Gaussians 2015-06-17 12 / 27

Method of Moments Solving the system Start with five parameters. First, can assume mean zero: ◮ Convert to “central moments” ◮ M ′ 2 = M 2 − M 2 1 is independent of translation. Analogously, can assume min ( σ 1 , σ 2 ) = 0 by converting to “excess moments” Moritz Hardt, Eric Price (Google/UT) Tight Bounds for Learning a Mixture of Two Gaussians 2015-06-17 12 / 27

Method of Moments Solving the system Start with five parameters. First, can assume mean zero: ◮ Convert to “central moments” ◮ M ′ 2 = M 2 − M 2 1 is independent of translation. Analogously, can assume min ( σ 1 , σ 2 ) = 0 by converting to “excess moments” ◮ X 4 = M 4 − 3 M 2 2 is independent of adding N ( 0 , σ 2 ) . Moritz Hardt, Eric Price (Google/UT) Tight Bounds for Learning a Mixture of Two Gaussians 2015-06-17 12 / 27

Method of Moments Solving the system Start with five parameters. First, can assume mean zero: ◮ Convert to “central moments” ◮ M ′ 2 = M 2 − M 2 1 is independent of translation. Analogously, can assume min ( σ 1 , σ 2 ) = 0 by converting to “excess moments” ◮ X 4 = M 4 − 3 M 2 2 is independent of adding N ( 0 , σ 2 ) . ◮ “Excess kurtosis” coined by Pearson, appearing in every Wikipedia probability distribution infobox. Moritz Hardt, Eric Price (Google/UT) Tight Bounds for Learning a Mixture of Two Gaussians 2015-06-17 12 / 27

Method of Moments Solving the system Start with five parameters. First, can assume mean zero: ◮ Convert to “central moments” ◮ M ′ 2 = M 2 − M 2 1 is independent of translation. Analogously, can assume min ( σ 1 , σ 2 ) = 0 by converting to “excess moments” ◮ X 4 = M 4 − 3 M 2 2 is independent of adding N ( 0 , σ 2 ) . ◮ “Excess kurtosis” coined by Pearson, appearing in every Wikipedia probability distribution infobox. Leaves three free parameters. Moritz Hardt, Eric Price (Google/UT) Tight Bounds for Learning a Mixture of Two Gaussians 2015-06-17 12 / 27

Method of Moments: system of equations Convenient to reparameterize by α = − µ 1 µ 2 , β = µ 1 + µ 2 , γ = σ 2 2 − σ 2 1 µ 2 − µ 1 Moritz Hardt, Eric Price (Google/UT) Tight Bounds for Learning a Mixture of Two Gaussians 2015-06-17 13 / 27

Method of Moments: system of equations Convenient to reparameterize by α = − µ 1 µ 2 , β = µ 1 + µ 2 , γ = σ 2 2 − σ 2 1 µ 2 − µ 1 Gives that X 3 = α ( β + 3 γ ) X 4 = α ( − 2 α + β 2 + 6 βγ + 3 γ 2 ) X 5 = α ( β 3 − 8 αβ + 10 β 2 γ + 15 γ 2 β − 20 αγ ) X 6 = α ( 16 α 2 − 12 αβ 2 − 60 αβγ + β 4 + 15 β 3 γ + 45 β 2 γ 2 + 15 βγ 3 ) Moritz Hardt, Eric Price (Google/UT) Tight Bounds for Learning a Mixture of Two Gaussians 2015-06-17 13 / 27

Method of Moments: system of equations Convenient to reparameterize by α = − µ 1 µ 2 , β = µ 1 + µ 2 , γ = σ 2 2 − σ 2 1 µ 2 − µ 1 Gives that X 3 = α ( β + 3 γ ) X 4 = α ( − 2 α + β 2 + 6 βγ + 3 γ 2 ) X 5 = α ( β 3 − 8 αβ + 10 β 2 γ + 15 γ 2 β − 20 αγ ) X 6 = α ( 16 α 2 − 12 αβ 2 − 60 αβγ + β 4 + 15 β 3 γ + 45 β 2 γ 2 + 15 βγ 3 ) All my attempts to obtain a simpler set have failed... It is possible, however, that some other ... equations of a less complex kind may ultimately be found. —Karl Pearson Moritz Hardt, Eric Price (Google/UT) Tight Bounds for Learning a Mixture of Two Gaussians 2015-06-17 13 / 27

Pearson’s Polynomial Chug chug chug... Moritz Hardt, Eric Price (Google/UT) Tight Bounds for Learning a Mixture of Two Gaussians 2015-06-17 14 / 27

Pearson’s Polynomial Chug chug chug... Get a 9th degree polynomial in the excess moments X 3 , X 4 , X 5 : p ( α ) = 8 α 9 + 28 X 4 α 7 − 12 X 2 3 α 6 + ( 24 X 3 X 5 + 30 X 2 4 ) α 5 3 X 4 ) α 4 + ( 96 X 4 + ( 6 X 2 5 − 148 X 2 3 − 36 X 3 X 4 X 5 + 9 X 3 4 ) α 3 4 ) α 2 − 32 X 4 + ( 24 X 3 3 X 5 + 21 X 2 3 X 2 3 X 4 α + 8 X 6 3 = 0 Moritz Hardt, Eric Price (Google/UT) Tight Bounds for Learning a Mixture of Two Gaussians 2015-06-17 14 / 27

Pearson’s Polynomial Chug chug chug... Get a 9th degree polynomial in the excess moments X 3 , X 4 , X 5 : p ( α ) = 8 α 9 + 28 X 4 α 7 − 12 X 2 3 α 6 + ( 24 X 3 X 5 + 30 X 2 4 ) α 5 3 X 4 ) α 4 + ( 96 X 4 + ( 6 X 2 5 − 148 X 2 3 − 36 X 3 X 4 X 5 + 9 X 3 4 ) α 3 4 ) α 2 − 32 X 4 + ( 24 X 3 3 X 5 + 21 X 2 3 X 2 3 X 4 α + 8 X 6 3 = 0 Easy to go from solutions α = − µ 1 µ 2 to mixtures µ i , σ i , p i . Moritz Hardt, Eric Price (Google/UT) Tight Bounds for Learning a Mixture of Two Gaussians 2015-06-17 14 / 27

Pearson’s Polynomial 6 6 4 4 2 0 2 0 2 2 4 4 6 6 8 8 1 1 0 0 1 1 2 2 3 3 Get a 9th degree polynomial in the excess moments X 3 , X 4 , X 5 . Moritz Hardt, Eric Price (Google/UT) Tight Bounds for Learning a Mixture of Two Gaussians 2015-06-17 15 / 27

Pearson’s Polynomial 6 6 4 4 2 0 2 0 2 2 4 4 6 6 8 8 1 1 0 0 1 1 2 2 3 3 Get a 9th degree polynomial in the excess moments X 3 , X 4 , X 5 . ◮ Positive roots correspond to mixtures that match on five moments. Moritz Hardt, Eric Price (Google/UT) Tight Bounds for Learning a Mixture of Two Gaussians 2015-06-17 15 / 27

Pearson’s Polynomial 6 6 4 4 2 0 2 0 2 2 4 4 6 6 8 8 1 1 0 0 1 1 2 2 3 3 Get a 9th degree polynomial in the excess moments X 3 , X 4 , X 5 . ◮ Positive roots correspond to mixtures that match on five moments. ◮ Pearson’s proposal: choose root with closer 6th moment. Moritz Hardt, Eric Price (Google/UT) Tight Bounds for Learning a Mixture of Two Gaussians 2015-06-17 15 / 27

Pearson’s Polynomial 6 6 4 4 2 0 2 0 2 2 4 4 6 6 8 8 1 1 0 0 1 1 2 2 3 3 Get a 9th degree polynomial in the excess moments X 3 , X 4 , X 5 . ◮ Positive roots correspond to mixtures that match on five moments. ◮ Pearson’s proposal: choose root with closer 6th moment. Works because six moments uniquely identify mixture [KMV] Moritz Hardt, Eric Price (Google/UT) Tight Bounds for Learning a Mixture of Two Gaussians 2015-06-17 15 / 27

Pearson’s Polynomial 6 6 4 4 2 0 2 0 2 2 4 4 6 6 8 8 1 1 0 0 1 1 2 2 3 3 Get a 9th degree polynomial in the excess moments X 3 , X 4 , X 5 . ◮ Positive roots correspond to mixtures that match on five moments. ◮ Pearson’s proposal: choose root with closer 6th moment. Works because six moments uniquely identify mixture [KMV] How robust to moment estimation error? Moritz Hardt, Eric Price (Google/UT) Tight Bounds for Learning a Mixture of Two Gaussians 2015-06-17 15 / 27

Pearson’s Polynomial 6 6 4 4 2 0 2 0 2 2 4 4 6 6 8 8 1 1 0 0 1 1 2 2 3 3 Get a 9th degree polynomial in the excess moments X 3 , X 4 , X 5 . ◮ Positive roots correspond to mixtures that match on five moments. ◮ Pearson’s proposal: choose root with closer 6th moment. Works because six moments uniquely identify mixture [KMV] How robust to moment estimation error? ◮ Usually works well Moritz Hardt, Eric Price (Google/UT) Tight Bounds for Learning a Mixture of Two Gaussians 2015-06-17 15 / 27

Pearson’s Polynomial 6 6 4 4 2 0 2 0 2 2 4 4 6 6 8 8 1 1 0 0 1 1 2 2 3 3 Get a 9th degree polynomial in the excess moments X 3 , X 4 , X 5 . ◮ Positive roots correspond to mixtures that match on five moments. ◮ Pearson’s proposal: choose root with closer 6th moment. Works because six moments uniquely identify mixture [KMV] How robust to moment estimation error? ◮ Usually works well ◮ Not when there’s a double root. Moritz Hardt, Eric Price (Google/UT) Tight Bounds for Learning a Mixture of Two Gaussians 2015-06-17 15 / 27

Making it robust in all cases Can create another ninth degree polynomial p 6 from X 3 , X 4 , X 5 , X 6 . Moritz Hardt, Eric Price (Google/UT) Tight Bounds for Learning a Mixture of Two Gaussians 2015-06-17 16 / 27

Making it robust in all cases Can create another ninth degree polynomial p 6 from X 3 , X 4 , X 5 , X 6 . Then α is the unique positive root of r ( α ) := p 5 ( α ) 2 + p 6 ( α ) 2 = 0 . Moritz Hardt, Eric Price (Google/UT) Tight Bounds for Learning a Mixture of Two Gaussians 2015-06-17 16 / 27

Making it robust in all cases Can create another ninth degree polynomial p 6 from X 3 , X 4 , X 5 , X 6 . Then α is the unique positive root of r ( α ) := p 5 ( α ) 2 + p 6 ( α ) 2 = 0 . How robust is the solution to perturbations of X 3 , . . . , X 6 ? Moritz Hardt, Eric Price (Google/UT) Tight Bounds for Learning a Mixture of Two Gaussians 2015-06-17 16 / 27

Making it robust in all cases Can create another ninth degree polynomial p 6 from X 3 , X 4 , X 5 , X 6 . Then α is the unique positive root of r ( α ) := p 5 ( α ) 2 + p 6 ( α ) 2 = 0 . How robust is the solution to perturbations of X 3 , . . . , X 6 ? We know q ( x ) := r / ( x − α ) 2 has no positive roots. Moritz Hardt, Eric Price (Google/UT) Tight Bounds for Learning a Mixture of Two Gaussians 2015-06-17 16 / 27

Making it robust in all cases Can create another ninth degree polynomial p 6 from X 3 , X 4 , X 5 , X 6 . Then α is the unique positive root of r ( α ) := p 5 ( α ) 2 + p 6 ( α ) 2 = 0 . How robust is the solution to perturbations of X 3 , . . . , X 6 ? We know q ( x ) := r / ( x − α ) 2 has no positive roots. By compactness: q ( x ) ≥ c > 0 for some constant c . Moritz Hardt, Eric Price (Google/UT) Tight Bounds for Learning a Mixture of Two Gaussians 2015-06-17 16 / 27

Making it robust in all cases Can create another ninth degree polynomial p 6 from X 3 , X 4 , X 5 , X 6 . Then α is the unique positive root of r ( α ) := p 5 ( α ) 2 + p 6 ( α ) 2 = 0 . How robust is the solution to perturbations of X 3 , . . . , X 6 ? We know q ( x ) := r / ( x − α ) 2 has no positive roots. By compactness: q ( x ) ≥ c > 0 for some constant c . Therefore plugging in empirical moments � X i to estimate polynomials p 5 , p 6 is robust: Moritz Hardt, Eric Price (Google/UT) Tight Bounds for Learning a Mixture of Two Gaussians 2015-06-17 16 / 27

Making it robust in all cases Can create another ninth degree polynomial p 6 from X 3 , X 4 , X 5 , X 6 . Then α is the unique positive root of r ( α ) := p 5 ( α ) 2 + p 6 ( α ) 2 = 0 . How robust is the solution to perturbations of X 3 , . . . , X 6 ? We know q ( x ) := r / ( x − α ) 2 has no positive roots. By compactness: q ( x ) ≥ c > 0 for some constant c . Therefore plugging in empirical moments � X i to estimate polynomials p 5 , p 6 is robust: ◮ Given approximations | � p 5 − p 5 | , | � p 6 − p 6 | ≤ ǫ , | α − arg min � r ( x ) | � ǫ. Moritz Hardt, Eric Price (Google/UT) Tight Bounds for Learning a Mixture of Two Gaussians 2015-06-17 16 / 27

Making it robust in all cases Can create another ninth degree polynomial p 6 from X 3 , X 4 , X 5 , X 6 . Then α is the unique positive root of r ( α ) := p 5 ( α ) 2 + p 6 ( α ) 2 = 0 . How robust is the solution to perturbations of X 3 , . . . , X 6 ? We know q ( x ) := r / ( x − α ) 2 has no positive roots. By compactness: q ( x ) ≥ c > 0 for some constant c . Therefore plugging in empirical moments � X i to estimate polynomials p 5 , p 6 is robust: ◮ Given approximations | � p 5 − p 5 | , | � p 6 − p 6 | ≤ ǫ , | α − arg min � r ( x ) | � ǫ. ◮ Getting α lets us estimate means, variances. Moritz Hardt, Eric Price (Google/UT) Tight Bounds for Learning a Mixture of Two Gaussians 2015-06-17 16 / 27

Result 1 σ Scale so the excess moments are O ( 1 ) : µ i are ± O ( 1 ) . Moritz Hardt, Eric Price (Google/UT) Tight Bounds for Learning a Mixture of Two Gaussians 2015-06-17 17 / 27

Result 1 σ Scale so the excess moments are O ( 1 ) : µ i are ± O ( 1 ) . Getting the � p i to O ( ǫ ) requires getting the first six moments to ± O ( ǫ ) . Moritz Hardt, Eric Price (Google/UT) Tight Bounds for Learning a Mixture of Two Gaussians 2015-06-17 17 / 27

Result 1 σ Scale so the excess moments are O ( 1 ) : µ i are ± O ( 1 ) . Getting the � p i to O ( ǫ ) requires getting the first six moments to ± O ( ǫ ) . If the variance is σ 2 , then M i has variance O ( σ 2 i ) . Moritz Hardt, Eric Price (Google/UT) Tight Bounds for Learning a Mixture of Two Gaussians 2015-06-17 17 / 27

Result 1 σ Scale so the excess moments are O ( 1 ) : µ i are ± O ( 1 ) . Getting the � p i to O ( ǫ ) requires getting the first six moments to ± O ( ǫ ) . If the variance is σ 2 , then M i has variance O ( σ 2 i ) . Thus O ( σ 12 /ǫ 2 ) samples to learn the µ i to ± ǫ . Moritz Hardt, Eric Price (Google/UT) Tight Bounds for Learning a Mixture of Two Gaussians 2015-06-17 17 / 27

Result 1 σ Scale so the excess moments are O ( 1 ) : µ i are ± O ( 1 ) . Getting the � p i to O ( ǫ ) requires getting the first six moments to ± O ( ǫ ) . If the variance is σ 2 , then M i has variance O ( σ 2 i ) . Thus O ( σ 12 /ǫ 2 ) samples to learn the µ i to ± ǫ . ◮ If components are Ω( 1 ) standard deviations apart, O ( 1 /ǫ 2 ) samples suffice. Moritz Hardt, Eric Price (Google/UT) Tight Bounds for Learning a Mixture of Two Gaussians 2015-06-17 17 / 27

Result 1 σ Scale so the excess moments are O ( 1 ) : µ i are ± O ( 1 ) . Getting the � p i to O ( ǫ ) requires getting the first six moments to ± O ( ǫ ) . If the variance is σ 2 , then M i has variance O ( σ 2 i ) . Thus O ( σ 12 /ǫ 2 ) samples to learn the µ i to ± ǫ . ◮ If components are Ω( 1 ) standard deviations apart, O ( 1 /ǫ 2 ) samples suffice. ◮ In general, O ( 1 /ǫ 12 ) samples suffice to get ǫσ accuracy. Moritz Hardt, Eric Price (Google/UT) Tight Bounds for Learning a Mixture of Two Gaussians 2015-06-17 17 / 27

Lower bound in one dimension The algorithm takes O ( ǫ − 12 ) samples because it uses six moments Moritz Hardt, Eric Price (Google/UT) Tight Bounds for Learning a Mixture of Two Gaussians 2015-06-17 19 / 27

Lower bound in one dimension The algorithm takes O ( ǫ − 12 ) samples because it uses six moments ◮ Necessary to get sixth moment to ± ( ǫσ ) 6 . Moritz Hardt, Eric Price (Google/UT) Tight Bounds for Learning a Mixture of Two Gaussians 2015-06-17 19 / 27

Lower bound in one dimension The algorithm takes O ( ǫ − 12 ) samples because it uses six moments ◮ Necessary to get sixth moment to ± ( ǫσ ) 6 . Let F , F ′ be any two mixtures with five matching moments: ◮ Constant means and variances. Moritz Hardt, Eric Price (Google/UT) Tight Bounds for Learning a Mixture of Two Gaussians 2015-06-17 19 / 27

Lower bound in one dimension The algorithm takes O ( ǫ − 12 ) samples because it uses six moments ◮ Necessary to get sixth moment to ± ( ǫσ ) 6 . Let F , F ′ be any two mixtures with five matching moments: ◮ Constant means and variances. ◮ Add N ( 0 , σ 2 ) to each mixture for growing σ . Moritz Hardt, Eric Price (Google/UT) Tight Bounds for Learning a Mixture of Two Gaussians 2015-06-17 19 / 27

Tight Bounds for Learning a Mixture of Two Gaussians Moritz Hardt - PowerPoint PPT Presentation

Tight Bounds for Learning a Mixture of Two Gaussians Moritz Hardt Eric Price Google Research UT Austin 2015-06-17 Moritz Hardt, Eric Price (Google/UT) Tight Bounds for Learning a Mixture of Two Gaussians 2015-06-17 1 / 27 Problem 140 140

Sharp bounds for learning a mixture of two Gaussians Moritz Hardt Eric Price IBM Almaden

Lecture 20 Lecture 20 Nov 12 th 2008 Clustering with Mixture of Gaussians Clustering with Mixture

Chapter 2 Tight-frames An Introduction 1 Outline 1. Tight-frame 1. Tight-frame 2. Matrix

15-388/688 - Practical Data Science: Anomaly detection and mixture of Gaussians J. Zico Kolter

Bernoulli Mixture Models Victor Medina Researcher at SBIF DataCamp Mixture Models in R The

Structure of mixture models Victor Medina Researcher at SBIF DataCamp Mixture Models in R

Applied Machine Learning Expectation Maximization for Mixture of Gaussians Siamak Ravanbakhsh

Lecture 1: Review of DTFT, Gaussians, and Linear Algebra Mark Hasegawa-Johnson ECE 417:

Circuit Lower-bounds Lecture 24 Weak circuits are indeed weak 1 Circuit Lower-bounds 2

Computing Tight Bounds for Insurance Payments with Nonlinear Risk Man Hong WONG 1 Shuzhong ZHANG 2

Nearly Tight Bounds for Robust Proper Learning of Halfspaces with a Margin Ilias Diakonikolas

Contents Clustering K-means Mixture of Gaussians Expectation Maximization

Tutorial on Estimation and Multivariate Gaussians STAT 27725/CMSC 25400: Machine Learning

6.1 Gaussians 6 Bayesian Kernel Methods Alexander Smola Introduction to Machine Learning 10-701

CS480/680 Lecture 7: May 29, 2019 Classification with Mixture of Gaussians [B] Sections 4.2,

MLE part 2 18 / 70 Gaussian Mixture Model Suppose data is drawn from k Gaussians, meaning Y =

PRESIDENTS COUNCIL October 24, 2019 2019 United Way Campaign YOU CAN CHANGE THE STORY. YOU

Can we Beat the Square Root Bound for ECDLP over F p 2 via Representation? NutMiC 2019 , Paris

The Elliptic Curves Discrete Logarithm Problem and an implementation of parallelized Pollards

RSA Question 2 Bob thinks that p and q are primes but p isnt. Then, Bob thinks Bob

Rational Minimax Filtering Arthur J. Krener Wei Kang ajkrener@nps.edu wkang@nps.edu Research

Mechanisms for Noise Attenuation in Molecular Biology Signaling Pathways Liming Wang Department

A Divergent Synthesis of Polyurethane Dendrimers Richard T.

Behaviour at infinite time of Nematic Phase Problem Mouhamadou Samsidy Goudiaby + Supervisors