SLIDE 1

Recent work in Truncated Statistics

Andrew Ilyas

SLIDES 2–7

Motivation: Poincaré and the Baker

Claimed weight: 1 kg/loaf
Average weight: first 950 g/loaf, then 1.05 kg/loaf

[Figure: histogram of loaf weights (frequency vs. weight) with the claimed 1 kg marked]

SLIDE 8

Outline

  • Gaussian parameter estimation [Daskalakis et al, 2018]
  • Regression & classification [Daskalakis et al, 2019; Ilyas et al, 2020 (forthcoming)]
  • Extensions and Limitations [many works]
  • Future work/open problems
SLIDES 9–16

Gaussian Estimation

x ∼ 𝒩(μ, Σ)

Sample x. If x ∈ S, observe x; if x ∉ S, throw x away and restart (see the sampling sketch below).

Goal: Obtain estimates ( μ̂, Σ̂ ) ≈ (μ, Σ) from the (truncated) samples.

  • Fig. 1 (Daskalakis et al, 2018): 1000 samples from 𝒩([0,1], I) and from 𝒩([0,1], 4I) truncated to [−0.5, 0.5] × [1.5, 2.5]. Which is which?
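To make the sampling process concrete, here is a minimal rejection-sampling sketch in Python; the membership oracle `in_S` and the function name are illustrative assumptions, not from the talk.

```python
import numpy as np

def sample_truncated_gaussian(n, mu, sigma, in_S, rng=None):
    """Draw x ~ N(mu, Sigma) repeatedly; observe x only when it lands in S.
    `in_S` is an assumed membership oracle for the truncation set S."""
    rng = np.random.default_rng(rng)
    observed = []
    while len(observed) < n:
        x = rng.multivariate_normal(mu, sigma)
        if in_S(x):            # x in S: observe it
            observed.append(x)
        # x not in S: throw it away and restart
    return np.array(observed)
```

For instance, the truncation box from Fig. 1 could be encoded as `in_S = lambda x: -0.5 <= x[0] <= 0.5 and 1.5 <= x[1] <= 2.5`.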

SLIDES 17–23

Theme: Maximum Likelihood Estimation

Projected Gradient Descent on the Negative Log-Likelihood (NLL)

  • Standard approach to estimating Gaussian parameters: maximize the log-likelihood,

    ( μ̂, Σ̂ ) = arg max over (μ, Σ) of ∑ᵢ log fN(xᵢ; μ, Σ) = arg min over (μ, Σ) of ∑ᵢ (xᵢ − μ)⊤Σ⁻¹(xᵢ − μ) + n log det Σ

  • Take the derivative, set it to 0 (sketch below):

    μ̂ = (1/n) ∑ᵢ xᵢ,    Σ̂ = (1/n) ∑ᵢ (xᵢ − μ̂)(xᵢ − μ̂)⊤
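For contrast with the truncated case below, a minimal sketch of this closed-form (untruncated) MLE; the function name and shapes are illustrative assumptions.

```python
import numpy as np

def gaussian_mle(X):
    """Closed-form MLE for (mu, Sigma) from untruncated samples X of shape (n, d)."""
    mu_hat = X.mean(axis=0)
    centered = X - mu_hat
    sigma_hat = centered.T @ centered / X.shape[0]
    return mu_hat, sigma_hat
```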

SLIDES 24–28

Theme: Maximum Likelihood Estimation

Projected Gradient Descent on the Negative Log-Likelihood (NLL)

  • In the truncated setting, the log-likelihood changes:

    f(x; μ, Σ, S) = fN(x; μ, Σ) / ∫S fN(z; μ, Σ) dz  if x ∈ S, else 0

    log f(x; μ, Σ, S) = log fN(x; μ, Σ) − log ( ∫S fN(z; μ, Σ) dz )

  • No longer has a closed-form solution for the maximizer (a Monte Carlo evaluation sketch follows)
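A hedged sketch of evaluating this truncated log-likelihood, estimating the mass of S under the current parameters by Monte Carlo; the vectorized membership oracle `in_S` and the sample count are assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal

def truncated_log_likelihood(x, mu, sigma, in_S, n_mc=100_000, rng=None):
    """Estimate log f(x; mu, Sigma, S) = log f_N(x; mu, Sigma) - log P_{z ~ N(mu, Sigma)}[z in S].
    `in_S(Z)` is assumed to accept an (n, d) array of samples and return booleans."""
    rng = np.random.default_rng(rng)
    dist = multivariate_normal(mean=mu, cov=sigma)
    z = dist.rvs(size=n_mc, random_state=rng)
    alpha = in_S(z).mean()          # Monte Carlo estimate of the mass of S
    return dist.logpdf(x) - np.log(alpha)
```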

SLIDES 29–36

Theme: Maximum Likelihood Estimation

Projected Gradient Descent on the Negative Log-Likelihood (NLL)

  • Step 1: Re-parameterize: T = Σ⁻¹, v = Σ⁻¹μ
  • Step 2: We get an unbiased estimate of the gradient from just truncated samples (see the sketch after this list):

    ∇μ log f(x; v, T, S) = 𝔼 over z ∼ 𝒩(μ, Σ) of [z | z ∈ S] − x

    ∇Σ log f(x; v, T, S) = ½ x x⊤ − ½ 𝔼 over z ∼ 𝒩(μ, Σ) of [z z⊤ | z ∈ S]

    (x and x x⊤ are the empirical (batch) mean/covariance; the conditional expectations are the expected truncated mean/covariance under the current parameters)

  • Thus: we can execute SGD on the truncated log-likelihood with only oracle access to S
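As referenced above, a sketch of the stochastic gradient oracle implied by these formulas, read as gradients of the truncated NLL with respect to the reparameterized (v, T); rejection sampling from the current 𝒩(μ, Σ) into S stands in for the conditional expectations, and `in_S` is an assumed membership oracle.

```python
import numpy as np

def nll_stochastic_gradient(x, mu, sigma, in_S, rng=None):
    """Unbiased stochastic gradient of the truncated NLL at one observed sample x,
    following the slide's formulas (interpreted w.r.t. the reparameterized (v, T))."""
    rng = np.random.default_rng(rng)
    # Rejection-sample z ~ N(mu, Sigma) conditioned on z in S (oracle access to S).
    while True:
        z = rng.multivariate_normal(mu, sigma)
        if in_S(z):
            break
    grad_v = z - x                                          # estimates E[z | z in S] - x
    grad_T = 0.5 * np.outer(x, x) - 0.5 * np.outer(z, z)    # estimates (1/2) x x^T - (1/2) E[z z^T | z in S]
    return grad_v, grad_T
```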

SLIDES 37–45

Theme: Maximum Likelihood Estimation

Projected Gradient Descent on the Negative Log-Likelihood (NLL)

  • Step 3: SGD recovers the true parameters!
  • Ingredients:
  • Convexity always holds (though not necessarily strong convexity)
  • Guaranteed constant probability α of a sample falling into S
  • Efficient projection algorithm onto the set of valid parameters (defined by α)
  • Strong convexity within the projection set: H ⪰ C · α⁴ · λ_min(T⁻¹) · I
  • Good initialization point (i.e., one that assigns constant mass to S)
  • Result: Efficient algorithm for recovering the parameters from truncated data! (A projected-SGD sketch follows.)
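A minimal projected-SGD driver in the spirit of the algorithm above, reusing the `nll_stochastic_gradient` sketch from earlier; the projection step onto the valid parameter set (defined by α) is left as a placeholder comment, and the hyperparameters are assumptions.

```python
import numpy as np

def truncated_gaussian_psgd(samples, in_S, mu0, sigma0, steps=5000, lr=1e-3, rng=None):
    """SGD on the truncated NLL in the (v, T) = (Sigma^{-1} mu, Sigma^{-1}) parameterization,
    given observed truncated samples and a membership oracle for S."""
    rng = np.random.default_rng(rng)
    T = np.linalg.inv(sigma0)
    v = T @ mu0
    for _ in range(steps):
        x = samples[rng.integers(len(samples))]    # one observed (truncated) sample
        sigma = np.linalg.inv(T)
        mu = sigma @ v
        grad_v, grad_T = nll_stochastic_gradient(x, mu, sigma, in_S, rng)
        v = v - lr * grad_v                        # descend the NLL
        T = T - lr * grad_T
        # A projection of (v, T) back onto the valid parameter set would go here.
    sigma_hat = np.linalg.inv(T)
    return sigma_hat @ v, sigma_hat                # recover (mu_hat, Sigma_hat)
```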
SLIDES 46–60

Truncation bias in regression

  • Goal: infer the effect of height xᵢ on basketball ability yᵢ
  • Strategy: linear regression
  • Truncation: we only observe data based on the value of yᵢ (a small simulation follows)

[Figures: "What we expect" vs. "What we get": scatter plots of ability (z) vs. height with noise ε; a player is observed only if good enough for the NBA (NBA? Yes: observe yᵢ; No: player unobserved), so the line fitted to the observed data is biased]
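A small simulation, with made-up numbers, showing the bias the figure illustrates: fitting ordinary least squares only on observations whose outcome clears a threshold attenuates the estimated slope.

```python
import numpy as np

# Illustrative simulation of truncation on the outcome ("good enough for the NBA").
rng = np.random.default_rng(0)
n = 10_000
height = rng.normal(0.0, 1.0, size=n)
ability = 2.0 * height + rng.normal(0.0, 1.0, size=n)   # true slope is 2.0

def ols_slope(x, y):
    return np.cov(x, y, bias=True)[0, 1] / np.var(x)

observed = ability > 2.0                                  # truncation on the outcome y
print("slope on all data:      ", ols_slope(height, ability))
print("slope on truncated data:", ols_slope(height[observed], ability[observed]))  # noticeably smaller
```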

SLIDES 61–66

Truncation in practice

Not a hypothetical problem (or a new one!)

  • Fig 1 [Hausman and Wise 1977]: corrected previous findings about education (x) vs. income (y) that were affected by truncation on income (y)
  • Table 1 [Lin et al 1999]: found bias in income (x) vs. child support (y) because the response rate differs based on y

Truncation has inspired lots of prior work in statistics/econometrics. Our goal: a unified, efficient (polynomial in the dimension) algorithm.

[Galton 1897; Pearson 1902; Lee 1914; Fisher 1931; Hotelling 1948; Tukey 1949; Tobin 1958; Amemiya 1973; Breen 1996; Balakrishnan, Cramer 2014]

SLIDES 67–77

Truncated regression and classification

  • Sample a covariate x ∼ D
  • Sample noise ε ∼ D_N and compute the latent z = h_θ*(x) + ε
  • With probability 1 − ϕ(z): throw away (x, z) and restart
  • With probability ϕ(z): project z to a label y := π(z) and add (x, y) to the training set T

(A simulation sketch of this process follows.)
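A sketch of this generative process for a linear h_θ(x) = θ⊤x with Gaussian noise; the survival probability `phi`, the projection `pi`, and the covariate distribution are assumptions made for illustration.

```python
import numpy as np

def sample_truncated_dataset(n, theta_star, phi, pi, rng=None):
    """Generate n truncated observations (x, y) from the process above.
    `phi(z)` is the survival probability of a latent z; `pi(z)` maps z to a label
    (e.g. the identity for regression, or a threshold for classification)."""
    rng = np.random.default_rng(rng)
    d = len(theta_star)
    X, Y = [], []
    while len(X) < n:
        x = rng.uniform(0.0, 1.0, size=d)       # covariate x ~ D
        z = theta_star @ x + rng.normal()       # latent z = h_theta*(x) + eps
        if rng.random() < phi(z):               # keep with probability phi(z) ...
            X.append(x)
            Y.append(pi(z))                     # ... and record the label y = pi(z)
        # otherwise: throw (x, z) away and restart
    return np.array(X), np.array(Y)
```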

SLIDES 78–88

Parameter estimation

  • We have a model yᵢ ∼ π(h_θ*(xᵢ) + ε) where ε ∼ D_N, and we want an estimate θ̂ for θ*
  • Standard (non-truncated) approach: maximize the likelihood

    p(θ; x, y) = ∫ over z ∈ π⁻¹(y) of D_N(z − h_θ(x)) dz

    (the integral runs over all possible latent variables z corresponding to the label y; the integrand is the likelihood of the latent under the model)

  • Example: if h_θ is a linear function, then:
  • If π(z) = z and ε ∼ 𝒩(0,1), the MLE is ordinary least squares regression
  • If π(z) = 1{z ≥ 0} and ε ∼ 𝒩(0,1), the MLE is probit regression (a sketch follows this list)
  • If π(z) = 1{z ≥ 0} and ε ∼ Logistic(0,1), the MLE is logistic regression
  • What about the truncated case?
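As a concrete instance of the probit case above: for h_θ(x) = θ⊤x, π(z) = 1{z ≥ 0} and ε ∼ 𝒩(0,1), the integral over π⁻¹(y) reduces to a Gaussian CDF. A minimal sketch (names are illustrative):

```python
import numpy as np
from scipy.stats import norm

def probit_log_likelihood(theta, x, y):
    """Latent-variable likelihood specialized to pi(z) = 1{z >= 0}, eps ~ N(0,1):
    integrating D_N(z - theta^T x) over pi^{-1}(y) gives Phi(theta^T x) for y = 1
    and 1 - Phi(theta^T x) for y = 0, i.e. probit regression."""
    p1 = norm.cdf(theta @ x)        # Pr[z >= 0] = Phi(theta^T x)
    return np.log(p1 if y == 1 else 1.0 - p1)
```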

SLIDES 89–98

Parameter estimation from truncated data

Main idea: maximization of the truncated log-likelihood

  • Untruncated likelihood:

    p(θ; x, y) = ∫ over z ∈ π⁻¹(y) of D_N(z − h_θ(x)) dz

  • Truncated likelihood:

    p(θ; x, y) = [ ∫ over z ∈ π⁻¹(y) of D_N(z − h_θ(x)) ϕ(z) dz ] / [ ∫ over all z of D_N(z − h_θ(x)) ϕ(z) dz ]

  • Again, we can compute a stochastic gradient of the log-likelihood with only oracle access to ϕ ⟹ leads to another SGD-based algorithm (a Monte Carlo sketch of the likelihood follows this list)
  • However: this time the loss can actually be non-convex
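A hedged Monte Carlo sketch of this truncated likelihood, specialized to the regression case π(z) = z with ε ∼ 𝒩(0,1); the survival-probability oracle `phi` is assumed to be vectorized and positive at the observed y.

```python
import numpy as np
from scipy.stats import norm

def truncated_reg_log_likelihood(theta, x, y, phi, n_mc=100_000, rng=None):
    """log p(theta; x, y) for pi(z) = z: numerator D_N(y - theta^T x) * phi(y),
    denominator E_{z ~ N(theta^T x, 1)}[phi(z)] estimated by sampling."""
    rng = np.random.default_rng(rng)
    m = theta @ x
    z = rng.normal(m, 1.0, size=n_mc)
    denom = np.mean(phi(z))                  # Monte Carlo estimate of the normalizer
    return norm.logpdf(y, loc=m) + np.log(phi(y)) - np.log(denom)
```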

SLIDES 99–106

Parameter estimation from truncated data

  • However: this time the loss ℓ(θ) can actually be non-convex
  • Example: 1D logistic regression, S = [−1, 3]

    [Figure: the resulting non-convex loss ℓ(θ) plotted against θ]

  • Instead, we will use quasi-convexity:

    Definition (Quasi-convexity): for all x, y with f(y) ≤ f(x), we have ⟨∇f(x), y − x⟩ ≤ 0

  • [Hazan et al, 2015] define the strict local quasi-convexity (SLQC) property: both stronger (the inner product is bounded away from zero) and weaker (y is constrained to a ball around x*) than plain quasi-convexity
  • Their result: normalized SGD with a minimum batch size converges to the global optimum for SLQC functions (a sketch of the normalized step follows)
SLIDES 107–117

Analysis

  • Goal: show that NSGD on the NLL converges to the maximizer of the (population) log-likelihood
  • As with Gaussian estimation, we define a projection set in which linear, probit, and logistic regression are all SLQC ⟹ NSGD converges
  • In fact, linear regression was shown to be strongly convex by [Daskalakis et al, 2019]

[Figure: the analyzed pipeline: sample a covariate x ∼ D; pass it to a linear model and sample normal/logistic noise, z = w*⊤x + ε; truncate z to an interval [a, b] via ϕ(z); project to get a label y = π(z) ∈ {0, 1}]

Theorem (informal): if for every x ∈ ℝᵈ there is a non-zero (α > 0) probability of observing a label y ∈ {0, 1}, then NSGD finds an ε-minimizer of the NLL in poly(1/α, 1/ε, d) steps.

SLIDES 118–127

Experiments

Synthetic data

Setup (a data-generation sketch follows the figure):

  • θ* ∼ 𝒰([−1,1]¹⁰)
  • X ∼ 𝒰([0,1]^(10×n))
  • ε ∼ D_N (normal/logistic)
  • Z := θ*⊤X + ε
  • Truncation to [C, ∞)
  • Y = 1{Z ≥ 0}

[Figures (two panels): cosine similarity with θ* (0.2 to 1) vs. truncation parameter C (−2 to −1), comparing standard regression and truncated regression]
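A sketch of this synthetic setup (d = 10, shown here with the logistic noise option); the loop structure and names are illustrative assumptions.

```python
import numpy as np

def synthetic_truncated_classification(n, C, rng=None):
    """Generate n samples from the setup above: theta* ~ U([-1,1]^10), x ~ U([0,1]^10),
    latent Z = theta*^T x + eps with logistic noise, keep only samples with Z in [C, inf),
    label Y = 1{Z >= 0}."""
    rng = np.random.default_rng(rng)
    d = 10
    theta_star = rng.uniform(-1.0, 1.0, size=d)
    X, Y = [], []
    while len(X) < n:
        x = rng.uniform(0.0, 1.0, size=d)
        z = theta_star @ x + rng.logistic()
        if z >= C:                       # truncation to [C, inf)
            X.append(x)
            Y.append(int(z >= 0))        # label Y = 1{Z >= 0}
    return np.array(X), np.array(Y), theta_star
```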

SLIDES 128–134

Experiments

UCI MSD dataset

Setup:

  • X: song attributes
  • Z: year recorded
  • Truncation to [C, ∞)
  • Y: recorded before ’96?

[Figure: test-set accuracy (45 to 75) vs. truncation parameter C (1985 to 2000), comparing standard regression and truncated regression]

SLIDES 135–140

Extensions and Limitations

Mixture of two Gaussians [Nagarajan & Panageas, 2019]

  • We saw how to estimate the parameters of a truncated Gaussian
  • Nagarajan & Panageas consider a truncated mixture of two Gaussians:

    (1/2) 𝒩(μ, Σ) + (1/2) 𝒩(−μ, Σ)

  • The likelihood can be optimized using the standard expectation-maximization (EM) method, which gives a local improvement guarantee
  • Global convergence of EM for truncated mixtures is shown

SLIDES 141–144

Extensions and Limitations

Unknown truncation set [Kontonis et al, 2019]

  • For general truncation sets S, estimating the parameters is impossible
  • However, [Kontonis et al, 2019] show that learning is possible if the space of possible sets S has bounded VC dimension or bounded Gaussian surface area (both measures of complexity)

SLIDES 145–148

Extensions and Limitations

High-dimensional (sparse) setting [Daskalakis et al, 2020]

  • For linear regression, we can also consider the setting where the covariates xᵢ are very high-dimensional but k-sparse
  • In this setting, [Daskalakis et al, 2020] propose a modified LASSO algorithm for dealing with truncation
  • It recovers the parameters under truncation with error O(√(k log(d)/n))

SLIDES 149–154

Future Work

  • Robustness to model mis-specification
  • Connections to causal inference:
  • Selection bias
  • Truncated outcomes (e.g. death in medical trials, dropping out in school studies, non-response in surveys)
  • Improving algorithms for censored statistics (where the learner observes the truncation)