Tight Bounds for Learning a Mixture of Two Gaussians

Moritz Hardt (Google Research), Eric Price (UT Austin)

2015-06-17



Problem

[Figure: two histograms over Height (cm), 140–200]

Height distribution of American 20 year olds.

◮ Male/female heights are very close to Gaussian distribution.

Can we learn the average male and female heights from unlabeled population data?
How many samples to learn µ1, µ2 to ±εσ?
d-dimensional setting: also learn weight, shoe size, ...

Gaussian Mixtures: Origins

Contributions to the Mathematical Theory of Evolution, Karl Pearson, 1894

Pearson’s naturalist buddy measured lots of crab body parts.
Most lengths seemed to follow the “normal” distribution (a recently coined name).
But the “forehead” size wasn’t symmetric.
Maybe there were actually two species of crabs?

More previous work

Pearson 1894: proposed method for 2 Gaussians
◮ “Method of moments”
Other empirical papers over the years:
◮ Royce ’58, Gridgeman ’70, Gupta-Huang ’80
Provable results assuming the components are well-separated:
◮ Clustering: Dasgupta ’99, DA ’00
◮ Spectral methods: VW ’04, AK ’05, KSV ’05, AM ’05, VW ’05
Kalai-Moitra-Valiant 2010: first general polynomial bound.
◮ Extended to general k mixtures: Moitra-Valiant ’10, Belkin-Sinha ’10
The KMV polynomial is very large.
◮ Our result: tight upper and lower bounds for the sample complexity.
◮ For k = 2 mixtures, arbitrary d dimensions.
◮ Lower bound extends to larger k.

Learning the components vs. learning the sum

[Figure: three histograms over Height (cm), 140–200]

It’s important that we want to learn the individual components:
◮ Male/female average heights, std. deviations.
Getting ε approximation in TV norm to the overall distribution takes Θ(1/ε^2) samples from black box techniques.
◮ Quite general: non-properly for any mixture of known unimodal distributions. [Chan, Diakonikolas, Servedio, Sun ’13]
◮ Proper learning: [Daskalakis-Kamath ’14]
◮ But only in low dimensions.
◮ Generic high-d TV estimation algs use 1d parameter estimation.

Our result

A variant of Pearson’s 1894 method is optimal!
Suppose we want means and variances to ε accuracy:
◮ µi to ±εσ
◮ σi^2 to ±ε^2 σ^2
In one dimension: Θ(1/ε^12) samples necessary and sufficient.
◮ Previously: 1/ε^(≈300), no lower bound.
◮ Moreover: algorithm is almost the same as Pearson (1894).
More precisely: if the two Gaussians are α standard deviations apart, getting εα precision takes Θ(1/(α^12 ε^2)) samples.

Our result: higher dimensions

In d dimensions, Θ((1/ε^12) log d) samples for parameter distance.
◮ “σ^2” is the max variance in any coordinate.
◮ Get each entry of the covariance matrix to ±ε^2 σ^2.
◮ Useful when the covariance matrix is sparse.
Also gives an improved bound in TV error of each component:
◮ If components overlap, then parameter distance ≈ TV.
◮ If components don’t overlap, then clustering is trivial.
◮ Straightforwardly gives O(d^30/ε^36) samples.
◮ Best known, but not the O(d/ε^c) we want.
Caveat: assume p1, p2 are bounded away from zero throughout.

Outline

1 Algorithm in One Dimension
2 Lower Bound
3 Algorithm in d Dimensions

Method of Moments

[Figure: histogram over Height (cm), 140–200]

We want to learn five parameters: µ1, µ2, σ1, σ2, p1, p2 with p1 + p2 = 1.
Moments give polynomial equations in parameters:
M1 := E[x] = p1 µ1 + p2 µ2
M2 := E[x^2] = p1 µ1^2 + p2 µ2^2 + p1 σ1^2 + p2 σ2^2
M3, M4, M5, M6 = [...]
Use our samples to estimate the moments.
Solve the system of equations to find the parameters.
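The estimation step is just sample averaging. A minimal sketch in Python (assuming numpy; the mixture parameters below are made up for illustration):

    import numpy as np

    def raw_moments(x, k=6):
        # empirical raw moments M_i = E[x^i], i = 1..k, by sample averages
        return [np.mean(x**i) for i in range(1, k + 1)]

    # example: draw from a hypothetical height mixture and estimate M1..M6
    rng = np.random.default_rng(0)
    n = 100_000
    female = rng.random(n) < 0.5
    x = np.where(female, rng.normal(165, 7, n), rng.normal(178, 8, n))
    M1, M2, M3, M4, M5, M6 = raw_moments(x)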

Method of Moments

Solving the system

Start with five parameters.
First, can assume mean zero:
◮ Convert to “central moments”
◮ M2′ = M2 − M1^2 is independent of translation.
Analogously, can assume min(σ1, σ2) = 0 by converting to “excess moments”:
◮ X4 = M4 − 3 M2^2 is independent of adding N(0, σ^2).
◮ “Excess kurtosis” coined by Pearson, appearing in every Wikipedia probability distribution infobox.
Leaves three free parameters.
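A sketch of this normalization in Python (assuming numpy). X3, X4, X5 follow the slides; for X6 we use the sixth cumulant, one standard choice that is invariant to adding N(0, τ) noise — treat its exact form as an assumption, since the slides don’t spell it out:

    import numpy as np

    def excess_moments(x):
        # center the data: central moments are translation-invariant
        c = x - x.mean()
        M = {i: np.mean(c**i) for i in range(2, 7)}
        # combinations unchanged by adding independent N(0, tau) noise
        X3 = M[3]
        X4 = M[4] - 3 * M[2]**2
        X5 = M[5] - 10 * M[3] * M[2]
        # sixth cumulant (assumed form; shift-invariant, like X3..X5 above)
        X6 = M[6] - 15 * M[4] * M[2] - 10 * M[3]**2 + 30 * M[2]**3
        return X3, X4, X5, X6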

Method of Moments: system of equations

Convenient to reparameterize by α = −µ1µ2, β = µ1 + µ2, γ = (σ2^2 − σ1^2)/(µ2 − µ1).
Gives that

X3 = α(β + 3γ)
X4 = α(−2α + β^2 + 6βγ + 3γ^2)
X5 = α(β^3 − 8αβ + 10β^2 γ + 15γ^2 β − 20αγ)
X6 = α(16α^2 − 12αβ^2 − 60αβγ + β^4 + 15β^3 γ + 45β^2 γ^2 + 15βγ^3)

“All my attempts to obtain a simpler set have failed... It is possible, however, that some other ... equations of a less complex kind may ultimately be found.” —Karl Pearson
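The forward map is easy to transcribe. A direct Python transcription of the four equations above, useful for sanity-checking a candidate (α, β, γ) against empirical excess moments:

    def excess_from_params(alpha, beta, gamma):
        # the system of equations above, transcribed verbatim
        X3 = alpha * (beta + 3*gamma)
        X4 = alpha * (-2*alpha + beta**2 + 6*beta*gamma + 3*gamma**2)
        X5 = alpha * (beta**3 - 8*alpha*beta + 10*beta**2*gamma
                      + 15*gamma**2*beta - 20*alpha*gamma)
        X6 = alpha * (16*alpha**2 - 12*alpha*beta**2 - 60*alpha*beta*gamma
                      + beta**4 + 15*beta**3*gamma + 45*beta**2*gamma**2
                      + 15*beta*gamma**3)
        return X3, X4, X5, X6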

Pearson’s Polynomial

Chug chug chug...
Get a 9th degree polynomial in the excess moments X3, X4, X5:

p(α) = 8α^9 + 28 X4 α^7 − 12 X3^2 α^6 + (24 X3 X5 + 30 X4^2) α^5
       + (6 X5^2 − 148 X3^2 X4) α^4 + (96 X3^4 − 36 X3 X4 X5 + 9 X4^3) α^3
       + (24 X3^3 X5 + 21 X3^2 X4^2) α^2 − 32 X3^4 X4 α + 8 X3^6 = 0

Easy to go from solutions α = −µ1µ2 to mixtures µi, σi, pi.
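Numerically this is one call to a polynomial root finder. A minimal sketch, assuming numpy, returning the positive real roots (the candidates for α = −µ1µ2):

    import numpy as np

    def pearson_coeffs(X3, X4, X5):
        # coefficients of p(alpha), highest degree (alpha^9) first
        return [
            8, 0, 28*X4, -12*X3**2,
            24*X3*X5 + 30*X4**2,
            6*X5**2 - 148*X3**2*X4,
            96*X3**4 - 36*X3*X4*X5 + 9*X4**3,
            24*X3**3*X5 + 21*X3**2*X4**2,
            -32*X3**4*X4,
            8*X3**6,
        ]

    def positive_roots(X3, X4, X5, tol=1e-9):
        roots = np.roots(pearson_coeffs(X3, X4, X5))
        return sorted(r.real for r in roots if abs(r.imag) < tol and r.real > tol)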

Pearson’s Polynomial

[Figure: plots of p(α), illustrating its positive real roots]

Get a 9th degree polynomial in the excess moments X3, X4, X5.
◮ Positive roots correspond to mixtures that match on five moments.
◮ Pearson’s proposal: choose the root with closer 6th moment.
Works because six moments uniquely identify the mixture [KMV].
How robust to moment estimation error?
◮ Usually works well
◮ Not when there’s a double root.

Making it robust in all cases

Can create another ninth degree polynomial p6 from X3, X4, X5, X6.
Then α is the unique positive root of r(α) := p5(α)^2 + p6(α)^2 = 0.
How robust is the solution to perturbations of X3, . . . , X6?
We know q(x) := r(x)/(x − α)^2 has no positive roots.
By compactness: q(x) ≥ c > 0 for some constant c.
Therefore plugging in empirical moments X̂i to estimate the polynomials p5, p6 is robust:
◮ Given approximations |p̂5 − p5|, |p̂6 − p6| ≤ ε, we get |α − argmin_x r̂(x)| ≲ ε.
◮ Getting α lets us estimate means, variances.
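In code the robust step is a one-dimensional minimization of r̂ over α > 0. A sketch reusing pearson_coeffs from the earlier block; the slides don’t give p6’s formula, so this simplified version minimizes p̂5(α)^2 alone (the paper adds p̂6(α)^2 to rule out double roots):

    import numpy as np

    def robust_alpha(X3, X4, X5, lo=1e-3, hi=10.0, n=200_001):
        # grid-minimize r_hat(alpha); with p6 included, the minimizer
        # stays stable even where p5 has a near-double root
        grid = np.linspace(lo, hi, n)
        r = np.polyval(pearson_coeffs(X3, X4, X5), grid) ** 2
        return grid[np.argmin(r)]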

Result

Scale so the excess moments are O(1): µi are ±O(1).
Getting the pi to O(ε) requires getting the first six moments to ±O(ε).
If the variance is σ^2, then Mi has variance O(σ^(2i)).
Thus O(σ^12/ε^2) samples to learn the µi to ±ε.
◮ If components are Ω(1) standard deviations apart, O(1/ε^2) samples suffice.
◮ In general, O(1/ε^12) samples suffice to get εσ accuracy.
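A quick numerical illustration of the “Mi has variance O(σ^(2i))” step, assuming numpy: across repeated trials, the spread of the empirical sixth moment grows like σ^6.

    import numpy as np

    rng = np.random.default_rng(1)
    n_trials, n_samples = 200, 5_000
    for sigma in (1.0, 2.0, 4.0):
        x = rng.normal(0.0, sigma, size=(n_trials, n_samples))
        M6 = (x**6).mean(axis=1)           # one estimate of M6 per trial
        print(sigma, M6.std() / sigma**6)  # roughly constant across sigma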


Lower bound in one dimension

The algorithm takes O(ε^−12) samples because it uses six moments.
◮ Necessary to get the sixth moment to ±(εσ)^6.
Let F, F′ be any two mixtures with five matching moments:
◮ Constant means and variances.
◮ Add N(0, σ^2) to each mixture for growing σ.
Claim: Ω(σ^12) samples necessary to distinguish the distributions.

Lower bound in one dimension

Two mixtures F, F′ with F ≈ F′.
Have TV(F, F′) ≈ 1/σ^6.
Shows Ω(σ^6) samples necessary, against the O(σ^12) upper bound.
Improve using the squared Hellinger distance.
◮ H^2(P, Q) := (1/2) ∫ (√p(x) − √q(x))^2 dx
◮ H^2 is subadditive on product measures:
⋆ H^2((x1, . . . , xm), (x′1, . . . , x′m)) ≤ m H^2(x, x′).
◮ Sample complexity is Ω(1/H^2(F, F′)).
◮ H^2 ≲ TV ≲ H, but often H ≈ TV.

Bounding the Hellinger distance: general idea

Definition
H^2(P, Q) = (1/2) ∫ (√p(x) − √q(x))^2 dx = 1 − ∫ √(p(x) q(x)) dx

If q(x) = (1 + ∆(x)) p(x) for some small ∆, then [Pollard ’00]

H^2(p, q) = 1 − ∫ √(1 + ∆(x)) p(x) dx
          = 1 − E_{x∼p}[√(1 + ∆(x))]
          = 1 − E_{x∼p}[1 + ∆(x)/2 − O(∆^2(x))]
          ≲ E_{x∼p}[∆^2(x)],

since E_{x∼p}[∆(x)] = ∫ (q(x) − p(x)) dx = 0.

Compare to TV(p, q) = (1/2) E_{x∼p}[|∆(x)|].
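A quick numerical sanity check of the bound H^2(p, q) ≲ E_{x∼p}[∆^2(x)], assuming numpy, on two nearby Gaussians chosen arbitrarily for illustration:

    import numpy as np

    x = np.linspace(-12, 12, 200_001)
    p = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)              # N(0, 1)
    q = np.exp(-(x - 0.1)**2 / 2.2) / np.sqrt(2.2 * np.pi)  # N(0.1, 1.1)

    h2 = 0.5 * np.trapz((np.sqrt(p) - np.sqrt(q))**2, x)
    e_delta2 = np.trapz((q / p - 1)**2 * p, x)              # E_p[Delta^2]
    print(h2, e_delta2)  # h2 is bounded by a constant times E_p[Delta^2]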

Bounding the Hellinger distance: our setting

Lemma
Let F, F′ be two subgaussian distributions with k matching moments and constant parameters. Then for G = F + N(0, σ^2) and G′ = F′ + N(0, σ^2),
H^2(G, G′) ≲ 1/σ^(2k+2).

Power series expansion of E[∆^2] = E[((G′(x) − G(x))/G(x))^2].
Matching moments make the first k terms zero.
Leaves (1/σ^(k+1))^2 as the largest remaining term.

Lower bound in one dimension

Add N(0, σ^2) to two mixtures with five matching moments. For

G = (1/2) N(−1, 1 + σ^2) + (1/2) N(1, 2 + σ^2)
G′ ≈ 0.297 N(−1.226, 0.610 + σ^2) + 0.703 N(0.517, 2.396 + σ^2)

we have H^2(G, G′) ≲ 1/σ^12.
Therefore distinguishing G from G′ takes Ω(σ^12) samples.
Cannot learn either the means to ±εσ or the variances to ±ε^2 σ^2 with o(1/ε^12) samples.
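The 1/σ^12 decay can be checked numerically with the h2_and_tv helper from earlier (again assuming numpy); H^2(G, G′)·σ^12 should stay roughly constant as σ grows:

    import numpy as np

    for s2 in (4.0, 16.0, 64.0):        # s2 = sigma^2
        G = lambda x: mix_pdf(x, [0.5, 0.5], [-1.0, 1.0],
                              np.sqrt([1 + s2, 2 + s2]))
        Gp = lambda x: mix_pdf(x, [0.297, 0.703], [-1.226, 0.517],
                               np.sqrt([0.610 + s2, 2.396 + s2]))
        h2, _ = h2_and_tv(G, Gp, lo=-200, hi=200, n=800_001)
        print(s2, h2, h2 * s2**6)       # h2 * sigma^12 roughly constant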

Lower bound in d dimensions

Trivial based on the Hellinger distance bound.
Place the “hard” instance independently in all d coordinates.
The solution must solve all d instances.
Each instance has Hellinger distance O(ε^12).
Therefore Ω(ε^−12 log(d/δ)) samples are necessary to succeed with probability 1 − δ:
◮ Each set of ε^−12 samples has a constant chance of giving no information about each coordinate.
◮ With o(ε^−12 log d) samples, some coordinate will be independent of all the samples.


slide-155
SLIDE 155

Algorithm in d dimensions

Want to learn average male/female height, weight, shoe size, ...

◮ (And covariance matrix)

Look at individual attributes to get all these.

Just need to know: is the taller group also heavier or lighter?

Suffices to consider d = 2:

◮ Does µ_i go with µ_j or µ′_j?

◮ Project onto a random direction e_i sin θ + e_j cos θ.

◮ (µ_i, µ_j) usually has a significantly different projection from (µ_i, µ′_j).

Thus we can piece them together by solving the O(d^2) one-dimensional problems (see the sketch after this slide).

For covariances: reduce to d = 4, so O(d^4) one-dimensional problems.

Only loss is log(1/δ) → log(d/δ): Θ(ǫ^{-12} log(d/δ)) samples.

Moritz Hardt, Eric Price (Google/UT) Tight Bounds for Learning a Mixture of Two Gaussians 2015-06-17 26 / 27
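The slides give no code, so here is a minimal sketch of the pairing step in Python. The names solve_1d_toy and pair_coordinates are mine, and the 2-means subroutine is a hypothetical stand-in for the paper's moment-based 1-D estimator (which needs no separation); only the random-projection test e_i sin θ + e_j cos θ is from the slide.

```python
import numpy as np

def solve_1d_toy(x, iters=50):
    # Toy stand-in for the talk's 1-D estimator: plain 2-means on the
    # line. Fine for well-separated synthetic data; the real algorithm
    # (based on the first six moments) needs no separation at all.
    c = np.array([x.min(), x.max()], dtype=float)
    for _ in range(iters):
        assign = np.abs(x[:, None] - c[None, :]).argmin(axis=1)
        for k in (0, 1):
            if np.any(assign == k):
                c[k] = x[assign == k].mean()
    return np.sort(c)

def pair_coordinates(X, mu_est, solve_1d, rng):
    # For each coordinate pair (i, j): does candidate a_i go with a_j
    # ("straight") or with b_j ("crossed")? Project the data onto a
    # random direction e_i sin(theta) + e_j cos(theta), solve the 1-D
    # problem there, and keep whichever pairing matches that estimate.
    d = X.shape[1]
    straight = np.ones((d, d), dtype=bool)
    for i in range(d):
        for j in range(i + 1, d):
            theta = rng.uniform(0.0, 2.0 * np.pi)
            s, c = np.sin(theta), np.cos(theta)
            target = solve_1d(X[:, i] * s + X[:, j] * c)
            (ai, bi), (aj, bj) = mu_est[i], mu_est[j]
            err_straight = np.abs(np.sort([ai * s + aj * c, bi * s + bj * c]) - target).sum()
            err_crossed = np.abs(np.sort([ai * s + bj * c, bi * s + aj * c]) - target).sum()
            straight[i, j] = straight[j, i] = err_straight <= err_crossed
    return straight

# Usage on synthetic data: an even mixture of two spherical Gaussians.
rng = np.random.default_rng(0)
d, n = 4, 20000
mu1, mu2 = rng.normal(size=d), rng.normal(size=d) + 4.0
labels = rng.integers(0, 2, size=n)
X = np.where(labels[:, None] == 0, mu1, mu2) + rng.normal(size=(n, d))
mu_est = [solve_1d_toy(X[:, i]) for i in range(d)]     # two candidates per coordinate
link = pair_coordinates(X, mu_est, solve_1d_toy, rng)  # stitch the coordinates together
```

On well-separated data like this, each random-projection test recovers the correct pairing with high probability. The real algorithm swaps in the moment-based 1-D estimator and union-bounds over the O(d^2) pairs, which is where the log(1/δ) → log(d/δ) loss on the slide comes from.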

slide-164
SLIDE 164

Recap and open questions

Our result:

◮ Θ(ǫ^{-12} log d) samples are necessary and sufficient to estimate µ_i to ±ǫσ and σ_i^2 to ±ǫ^2 σ^2.

◮ If the means have ασ separation, just O(ǫ^{-2} α^{-12}) samples for ǫασ accuracy (a consequence is worked out below).

Extend to k > 2?

◮ The lower bound extends, at least to Ω(ǫ^{-(6k-2)}).

◮ Do we really care about finding an O(ǫ^{-22}) algorithm?

◮ Solving the system of equations gets nasty.

◮ [Next talk: Ge-Huang-Kakade avoid this for smoothed instances]

Automated way of figuring out whether the solution to a system of polynomial equations is robust?

TV estimation in d dimensions with d/ǫ^c rather than d^{30}/ǫ^c samples?

Moritz Hardt, Eric Price (Google/UT) Tight Bounds for Learning a Mixture of Two Gaussians 2015-06-17 27 / 27
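One consequence of the separated-case bullet, worked out here since it is easy to miss (my arithmetic, not from the slides; it just substitutes into the O(ǫ^{-2} α^{-12}) bound):

```latex
% The separated case gives accuracy \epsilon\alpha\sigma from
% O(\epsilon^{-2}\alpha^{-12}) samples. For a fixed target accuracy
% \pm\epsilon'\sigma, substitute \epsilon = \epsilon'/\alpha:
\[
  n \;=\; O\!\left(\left(\frac{\epsilon'}{\alpha}\right)^{-2} \alpha^{-12}\right)
  \;=\; O\!\left((\epsilon')^{-2}\,\alpha^{-10}\right),
\]
% so for constant separation \alpha = \Omega(1) the cost falls from
% \Theta((\epsilon')^{-12}) in the unseparated worst case to the
% parametric rate O((\epsilon')^{-2}).
```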

slide-165
SLIDE 165

Moritz Hardt, Eric Price (Google/UT) Tight Bounds for Learning a Mixture of Two Gaussians 2015-06-17 28 / 27