SLIDE 1

Maximum Likelihood Estimation for Learning Populations of Parameters

Ramya Korlakai Vinayak
Postdoctoral Researcher, Paul G. Allen School of CSE
joint work with Weihao Kong, Gregory Valiant, and Sham Kakade

Poster #189
ramya@cs.washington.edu

SLIDE 2

Motivation: Large yet Sparse Data

  • Population size is large, often hundreds of thousands or millions
  • The number of observations per individual is limited (sparse), prohibiting accurate estimation of the parameters of interest
  • Application domains: epidemiology, social sciences, psychology, medicine, biology

Example: Flu data
Suppose that for a large random subset of the population in California, we observe whether each person caught the flu in each of the last 5 years. Model person i as a coin whose bias p_i is the (unknown) probability of catching the flu; the record x_i ∈ {0, 1}^t is t = 5 tosses of that coin. With x_i = 2 heads observed, the per-individual estimate p̂_i = x_i/t = 0.4 ± 0.45 is far too noisy to be useful.

Goal: Can we learn the distribution of the biases over the population?

Why? Testing and estimating properties of the distribution is useful for downstream analysis.

SLIDE 3

Model: Non-parametric Mixture of Binomials

  • N independent coins, i = 1, 2, ..., N. Each coin has its own bias drawn from the true distribution P* (unknown): p_i ∼ P*
  • We get to observe t tosses for every coin. Observations: X_i ∼ Bin(t, p_i) ∈ {0, 1, ..., t} (e.g., x_i = 2 heads out of t = 5 tosses)
  • Given {X_i}_{i=1}^N, return an estimate P̂ of P*
  • Error metric: Wasserstein-1 (Earth Mover's) distance, W_1(P*, P̂)

[Figure: an example density of the true distribution P* on [0, 1]]

Lord 1965, 1969
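To make the generative model concrete, here is a minimal simulation sketch in Python (numpy only). The particular Beta-mixture choice of P* is an illustrative assumption, not from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
N, t = 100_000, 5              # N coins, t tosses each (sparse regime: t << N)

# Illustrative true distribution P*: a two-component Beta mixture on [0, 1].
comp = rng.random(N) < 0.5
p = np.where(comp, rng.beta(2, 10, N), rng.beta(8, 4, N))   # p_i ~ P* (hidden)

X = rng.binomial(t, p)         # observed: X_i ~ Bin(t, p_i) in {0, ..., t}
```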

SLIDE 4

Learning with Sparse Observations is Non-trivial

(N = number of coins; t = number of tosses per coin)

  • The empirical plug-in estimator is bad: P̂_plug-in = histogram{X_1/t, ..., X_i/t, ..., X_N/t}. When t ≪ N, it incurs an error of Θ(1/√t)
  • The setting in this work differs from the many recent works on estimating symmetric properties of a discrete distribution from sparse observations (Paninski 2003; Valiant and Valiant 2011; Jiao et al. 2015; Orlitsky et al. 2016; Acharya et al. 2017; ...)
  • Tian et al. 2017 proposed a moment-matching estimator that achieves the optimal error of O(1/t) when t < c log N. Its weakness: it fails to obtain the optimal error when t > c log N, due to the higher variance of the larger moments

What about the Maximum Likelihood Estimator?
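To see the plug-in estimator's Θ(1/√t) error numerically: for distributions on [0, 1], W_1 equals the integral of the absolute difference of the CDFs, which is easy to approximate on a grid. A self-contained sketch, where the choice of P* and the grid size are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
N, t = 100_000, 5
p = rng.beta(2, 5, N)                  # p_i ~ P* (illustrative choice)
X = rng.binomial(t, p)                 # X_i ~ Bin(t, p_i)

def w1_on_unit_interval(a, b, m=2000):
    """Approximate W_1 between two samples on [0, 1] as the mean |CDF gap|."""
    grid = np.linspace(0, 1, m)
    cdf_a = np.searchsorted(np.sort(a), grid, side="right") / len(a)
    cdf_b = np.searchsorted(np.sort(b), grid, side="right") / len(b)
    return np.mean(np.abs(cdf_a - cdf_b))

p_hat = X / t                          # plug-in: only t + 1 distinct values
# The error does not vanish as N grows (the slide's Theta(1/sqrt(t)) behavior).
print(w1_on_unit_interval(p, p_hat))
```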

SLIDE 5

Maximum Likelihood Estimator

Sufficient statistic: the fingerprint vector h = [h_0, h_1, ..., h_s, ..., h_t], where h_s = (# coins that show s heads) / N for s = 0, 1, ..., t

[Figure: an example fingerprint, h_s plotted against s = 1, ..., 5]

P̂_mle ∈ argmin_{Q ∈ dist[0,1]} KL(observed fingerprint h, expected fingerprint under the distribution Q)

  • NOT the empirical estimator
  • Convex optimization: efficient (polynomial time)
  • Proposed in the late 1960s by Frederic Lord in the context of psychological testing. Several works study the geometry, identifiability, and uniqueness of the MLE solution (Lord 1965, 1969; Turnbull 1976; Laird 1978; Lindsay 1983; Wood 1999)

How well does the MLE recover the distribution?
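A minimal sketch of this computation: build the fingerprint, then solve a discretized version of the convex program by restricting Q to a fine grid on [0, 1] and running EM-style multiplicative updates. The grid resolution, iteration count, and EM as the solver are my choices for illustration; the slide only asserts that the problem is convex, not this particular method:

```python
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(0)
N, t = 100_000, 5
X = rng.binomial(t, rng.beta(2, 5, N))        # observations from the model

# Fingerprint: h[s] = fraction of coins showing s heads, s = 0, ..., t.
h = np.bincount(X, minlength=t + 1) / N

# Discretize Q as weights w over a grid of candidate biases.
grid = np.linspace(0, 1, 201)
A = binom.pmf(np.arange(t + 1)[:, None], t, grid[None, :])  # A[s, k] = P(s heads | bias grid[k])
w = np.full(grid.size, 1.0 / grid.size)       # start from the uniform distribution

for _ in range(2000):
    m = A @ w                                  # expected fingerprint under current Q
    w *= A.T @ (h / np.maximum(m, 1e-300))     # EM update; KL(h, m) is non-increasing

# (grid, w) now approximates the MLE as a discrete distribution on [0, 1].
```

The multiplicative update keeps w on the probability simplex automatically, since the update's total mass sums back to one.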

SLIDE 6

Main Results: MLE is Minimax Optimal in the Sparse Regime

(N = number of coins; t = number of tosses per coin; sparse regime: t ≪ N)

Theorem 1 (non-asymptotic guarantees). With probability ≥ 1 − δ, the MLE achieves the following error bounds:

  • Small sample regime: W_1(P*, P̂_mle) = O(1/t) when t < c log N
  • Medium sample regime: W_1(P*, P̂_mle) = O(1/√(t log N)) when c log N ≤ t ≤ N^(2/9 − ε)

Theorem 2 (matching minimax lower bounds).

inf_f sup_P E[W_1(P, f(X))] ≥ Ω(1/t) ∨ Ω(1/√(t log N))
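For a rough feel of the two regimes (the two rates cross at t ≈ log N), a quick numeric comparison with all constants set to 1, purely illustrative:

```python
import math

N = 1_000_000                      # log N ~ 13.8, so the regimes cross near t = 14
for t in (5, 10, 14, 50, 200):
    small = 1 / t                              # small-sample rate, t < c log N
    medium = 1 / math.sqrt(t * math.log(N))    # medium-sample rate
    print(f"t = {t:3d}: 1/t = {small:.3f}, 1/sqrt(t log N) = {medium:.3f}")
```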

SLIDE 7

Novel Proof: Polynomial Approximations

Bernstein polynomials: f̂_t(x) = Σ_{j=0}^{t} b_j C(t, j) x^j (1 − x)^{t−j}

Key ingredient: new bounds on the coefficients of Bernstein polynomials approximating Lipschitz-1 functions on [0, 1]

Performance on Real Data

[Plots: estimated CDFs on two real datasets, political leanings and flight delays, comparing MLE, TVK17 (Tian et al. 2017), and the empirical estimator]
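To make the tool concrete, here is a small sketch that evaluates a Bernstein approximation. The classical coefficient choice b_j = f(j/t) and the particular Lipschitz-1 target are illustrative assumptions; the paper's contribution is refined bounds on such coefficients, not this evaluation routine:

```python
import numpy as np
from scipy.stats import binom

def bernstein(f, t, x):
    """Evaluate f_hat_t(x) = sum_j b_j * C(t, j) * x^j * (1 - x)^(t - j)."""
    j = np.arange(t + 1)
    b = f(j / t)                      # classical choice of coefficients
    # binom.pmf(j, t, x) is exactly C(t, j) x^j (1 - x)^(t - j)
    return binom.pmf(j[None, :], t, np.asarray(x, float)[:, None]) @ b

f = lambda u: np.abs(u - 0.5)         # a Lipschitz-1 function on [0, 1]
x = np.linspace(0, 1, 6)
print(np.c_[x, f(x), bernstein(f, 5, x)])   # approximation improves as t grows
```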

SLIDE 8

Summary

  • Learning the distribution of parameters over a population with sparse observations per individual
  • The MLE is minimax optimal even with sparse observations!
  • Novel proof: new bounds on the coefficients of Bernstein polynomials approximating Lipschitz-1 functions
  • Performance on real data: estimated CDFs on the political-leanings and flight-delay datasets (MLE, TVK17, empirical)

ramya@cs.washington.edu
Poster #189