Recent work in Truncated Statistics
Andrew Ilyas
Motivation: Poincaré and the Baker
Claimed weight: 1 kg/loaf
Average weight: 950 g/loaf
After Poincaré complained, average weight: 1.05 kg/loaf
[Figure: histogram of loaf weights (frequency vs. weight), reference line at 1 kg]
The truncation process:
- Sample x ∼ 𝒩(μ, Σ)
- If x ∈ S: observe x
- If x ∉ S: throw away x and restart

Goal: obtain estimates (μ̂, Σ̂) ≈ (μ, Σ) from the observed samples.
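The process above is easy to simulate. A minimal sketch (the 1-D set S = (0.5, ∞) and the standard normal are illustrative choices, not from the talk) shows why naive empirical estimates are biased:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.0, 1.0
in_S = lambda x: x > 0.5        # illustrative truncation set S = (0.5, inf)

# Simulate truncation: draw from N(mu, sigma^2), observe x only if x is in S
draws = rng.normal(mu, sigma, size=500_000)
samples = draws[in_S(draws)]    # x not in S is thrown away

# Naive empirical estimates are biased for the *untruncated* (mu, sigma)
print(samples.mean())   # ~1.14, far above mu = 0
print(samples.std())    # ~0.52, far below sigma = 1
```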
Samples from 𝒩([0,1], I) and from 𝒩([0,1], 4I), truncated to [−0.5,0.5] × [1.5,2.5]. Which is which?
Projected Gradient Descent on the Negative Log-Likelihood (NLL)

Without truncation, maximum likelihood has a closed form:

(μ̂, Σ̂) = arg max_{(μ,Σ)} Σᵢ log f𝒩(xᵢ; μ, Σ) = arg min_{(μ,Σ)} Σᵢ [(xᵢ − μ)ᵀΣ⁻¹(xᵢ − μ) + log det Σ]

μ̂ = (1/n) Σᵢ xᵢ,   Σ̂ = (1/n) Σᵢ (xᵢ − μ̂)(xᵢ − μ̂)ᵀ
With truncation there is no closed form; instead, run projected gradient descent on the NLL. In the natural parameters v = Σ⁻¹μ, T = Σ⁻¹:

∇v [−log f(x; v, T, S)] = 𝔼_{z∼𝒩(μ,Σ)}[z | z ∈ S] − x
∇T [−log f(x; v, T, S)] = ½ xxᵀ − ½ 𝔼_{z∼𝒩(μ,Σ)}[zzᵀ | z ∈ S]

i.e., the expected truncated mean/covariance under the current parameters minus the empirical (batch) mean/covariance.
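The truncated expectation in the gradient need not be computed in closed form: it can be estimated by sampling from the current model and rejecting points outside S. A minimal 1-D sketch (known unit variance; the set S = (0.5, ∞), step size, and batch size are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Data: N(mu_true, 1) truncated to the (illustrative) set S = (0.5, inf)
mu_true = 0.0
in_S = lambda x: x > 0.5
draws = rng.normal(mu_true, 1.0, size=200_000)
data = draws[in_S(draws)]

def truncated_mean(mu, n=20_000):
    """Monte-Carlo estimate of E_{z ~ N(mu,1)}[z | z in S] by rejection."""
    z = rng.normal(mu, 1.0, size=n)
    return z[in_S(z)].mean()

# SGD on the truncated NLL: grad ≈ E_model[z | z in S] - mean(batch)
mu_hat, lr = 1.0, 0.1
for _ in range(300):
    batch = rng.choice(data, size=512)
    mu_hat -= lr * (truncated_mean(mu_hat) - batch.mean())

print(mu_hat)   # close to mu_true = 0, even though every observed x > 0.5
```

Note the fixed point: the update stops moving exactly when the model's truncated mean matches the truncated mean of the data, which happens at μ̂ = μ.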
Key assumption: the truncation set S has non-trivial Gaussian measure, i.e., 𝒩(μ, Σ; S) ≥ α for some constant α > 0.
Truncated regression, by example: model a player's ability zᵢ as a function of height xᵢ plus noise ε. What we expect: a fit through the whole population. What we get: a biased fit, because only players who are "good enough for the NBA!" appear in the data — NBA? Yes: observe yᵢ; No: player unobserved. Truncation is based on the value of yᵢ.
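This bias is easy to reproduce. A sketch of the NBA story (the coefficient 0.5 and the cutoff 1.5 are arbitrary stand-ins, in standardized units):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative model: ability = 0.5 * height + noise (standardized units)
n = 200_000
height = rng.normal(size=n)
ability = 0.5 * height + rng.normal(size=n)
slope_full = np.polyfit(height, ability, 1)[0]   # what we expect: ~0.5

# What we get: only players above an (arbitrary) NBA cutoff are observed
nba = ability > 1.5
slope_nba = np.polyfit(height[nba], ability[nba], 1)[0]

print(slope_full, slope_nba)   # the truncated fit is strongly attenuated
```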
Not a hypothetical problem (or a new one!)
- Fig 1 [Hausman and Wise 1977]: corrected previous findings about education (x) vs. income (y) that were affected by truncation on income (y)
- Table 1 [Lin et al 1999]: found bias in income (x) vs. child support (y), because response rates differ based on y

Has inspired lots of prior work in statistics/econometrics. Our goal: a unified, efficient (polynomial in dimension) algorithm.
[Galton 1897; Pearson 1902; Lee 1914; Fisher 1931; Hotelling 1948; Tukey 1949; Tobin 1958; Amemiya 1973; Breen 1996; Balakrishnan, Cramer 2014]
Truncated regression: the generative process
- Sample a covariate x
- Sample noise ε ∼ DN and compute the latent z = hθ*(x) + ε
- W.p. 1 − φ(z): throw away (x, z) and restart
- W.p. φ(z): project z to a label y := π(z), and add (x, y) to the training set, T ← T ∪ {(x, y)}
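The process can be sketched directly in code; hθ, φ, and π below are illustrative choices (a linear model, a logistic survival probability, and the identity projection), not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

theta_star = np.array([1.0, -2.0])          # hypothetical true parameters
h = lambda theta, x: theta @ x              # linear model h_theta(x)
phi = lambda z: 1.0 / (1.0 + np.exp(-z))    # illustrative survival probability
pi_proj = lambda z: z                       # identity projection: y := pi(z)

def sample_training_set(n):
    T = []
    while len(T) < n:
        x = rng.normal(size=2)              # sample a covariate x
        z = h(theta_star, x) + rng.normal() # latent z = h_theta*(x) + eps
        if rng.random() < phi(z):           # survive w.p. phi(z): observe (x, y)
            T.append((x, pi_proj(z)))
        # otherwise: throw away (x, z) and restart
    return T

T = sample_training_set(1000)
print(len(T))                               # 1000 surviving pairs
```

Because φ here favors large z, the surviving labels are systematically shifted upward relative to the population.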
Given samples (xᵢ, yᵢ) where yᵢ ∼ π(hθ*(xᵢ) + ε), ε ∼ DN, we want an estimate θ̂ for θ*.

The (untruncated) likelihood is
p(θ; x, y) = ∫_{z∈π⁻¹(y)} DN(z − hθ(x)) dz,
where the domain π⁻¹(y) is all possible latent variables corresponding to the label, and the integrand is the likelihood of the latent under the model.

If hθ is a linear function, then:
- π(z) = z and ε ∼ 𝒩(0,1): the MLE is ordinary least squares regression
- π(z) = 1_{z≥0} and ε ∼ 𝒩(0,1): the MLE is probit regression
- π(z) = 1_{z≥0} and ε ∼ Logistic(0,1): the MLE is logistic regression
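As a sanity check on the probit case: integrating the Gaussian noise density over π⁻¹(1) = [0, ∞) should give Φ(hθ(x)). A small numerical sketch (the value h = 0.7 is arbitrary):

```python
import numpy as np
from math import erf, sqrt

h = 0.7                                        # arbitrary value of h_theta(x)
Phi = lambda t: 0.5 * (1 + erf(t / sqrt(2)))   # standard normal CDF

# pi(z) = 1_{z>=0}, eps ~ N(0,1): p(theta; x, y=1) integrates the Gaussian
# noise density over pi^{-1}(1) = [0, inf) -- which is the probit likelihood
zs = np.linspace(0.0, 12.0, 200_001)           # [0, 12] captures all the mass
vals = np.exp(-(zs - h) ** 2 / 2) / np.sqrt(2 * np.pi)
dz = zs[1] - zs[0]
p_y1 = float(np.sum(vals[1:] + vals[:-1]) * dz / 2)   # trapezoid rule

print(p_y1, Phi(h))   # both ~0.758
```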
Main idea: maximization of the truncated log-likelihood

Without truncation: p(θ; x, y) = ∫_{z∈π⁻¹(y)} DN(z − hθ(x)) dz
With truncation: p(θ; x, y) = ∫_{z∈π⁻¹(y)} DN(z − hθ(x)) φ(z) dz / ∫_z DN(z − hθ(x)) φ(z) dz

Known φ ⟹ leads to another SGD-based algorithm.
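For a concrete instance, take the binary label π(z) = 1_{z≥0}, Gaussian noise, and an indicator survival function φ(z) = 1_{a≤z≤b} with a < 0 < b; then both integrals reduce to differences of normal CDFs. A sketch (the values of h, a, b are arbitrary):

```python
from math import erf, sqrt

Phi = lambda t: 0.5 * (1 + erf(t / sqrt(2)))   # standard normal CDF

def p_y1_truncated(h, a, b):
    """p(theta; x, y=1) for pi(z)=1_{z>=0}, eps ~ N(0,1), phi(z)=1_{a<=z<=b}.
    For a < 0 < b: the numerator integrates over pi^{-1}(1) ∩ [a,b] = [0,b],
    the denominator over [a,b]; both are differences of normal CDFs."""
    return (Phi(b - h) - Phi(0 - h)) / (Phi(b - h) - Phi(a - h))

h, a, b = 0.7, -1.0, 2.0                       # arbitrary illustrative values
p = p_y1_truncated(h, a, b)
print(Phi(h), p)   # ~0.758 untruncated vs. ~0.770 truncated
```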
Main idea: maximization of the truncated log-likelihood
θ ℓ(θ)
θ ℓ(θ)
θ ℓ(θ)
θ ℓ(θ)
Definition (Quasi-convexity): For all , we have
f(y) ≤ f(x) ⟨∇f(x), y − x⟩ ≤ 0
θ ℓ(θ)
Definition (Quasi-convexity): For all , we have
f(y) ≤ f(x) ⟨∇f(x), y − x⟩ ≤ 0
[Hazan et al, 2015] define strict local quasi-convexity (SLQC) property: both stronger (inner product bounded away from zero) and weaker ( is constrained to a ball around ) than just QC
y x*
θ ℓ(θ)
Definition (Quasi-convexity): For all , we have
f(y) ≤ f(x) ⟨∇f(x), y − x⟩ ≤ 0
[Hazan et al, 2015] define strict local quasi-convexity (SLQC) property: both stronger (inner product bounded away from zero) and weaker ( is constrained to a ball around ) than just QC
y x*
Their result: normalized SGD with minimum batch size converges to global optimum for SLQC functions
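Normalized gradient descent simply steps along the gradient's direction, discarding its magnitude. A toy sketch on a quasi-convex but non-convex objective (my own example, not the truncated NLL):

```python
import numpy as np

# Normalized gradient descent: step along the gradient *direction* only.
# Toy quasi-convex (not convex) objective: f(w) = log(1 + ||w - w*||^2);
# its gradient vanishes far from w*, but its direction always points away.
w_star = np.array([3.0, -1.0])
grad = lambda w: 2 * (w - w_star) / (1 + np.sum((w - w_star) ** 2))

w, lr = np.zeros(2), 0.05
for _ in range(2000):
    g = grad(w)
    w = w - lr * g / np.linalg.norm(g)   # normalized step
print(w)   # ends up oscillating within ~lr of w_star
```

With a constant step size the iterate hovers within a step length of the optimum, which is why the convergence guarantees are stated up to an ε ball.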
There exists a projection set on which linear, probit, and logistic regression are all SLQC ⟹ NSGD converges to the maximizer of the (population) log-likelihood. (The linear case was shown strongly convex by [Daskalakis et al, 2019].)

Running example, with binary labels:
- Sample a covariate x
- Pass to a linear model and sample normal/logistic noise: z = hθ*(x) + ε = w*ᵀx + ε
- Truncate to an interval [a, b]: φ(z) = 1_{z∈[a,b]}
- Project to get a label: y = π(z) = 1 if z ≥ 0, and y = 0 otherwise

Theorem (informal): if for every x ∈ ℝᵈ, each label y ∈ {0,1} occurs with non-zero probability (at least α > 0), then NSGD finds an ε-minimizer of the NLL in poly(1/α, 1/ε, d) steps.
Synthetic data: Y = θ*ᵀX + ε with ε ∼ DN, truncated with parameter C.

[Figure: cosine similarity with θ* (0.2–1) vs. truncation parameter C (−2 to 1), for standard regression vs. truncated regression]
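An end-to-end sketch of such an experiment (1-D covariates, known unit noise variance, lower truncation y > C; all constants are illustrative): gradient ascent on the truncated log-likelihood recovers θ* where ordinary least squares does not.

```python
import numpy as np
from math import erf, exp, sqrt, pi

rng = np.random.default_rng(0)
Phi = lambda t: 0.5 * (1 + erf(t / sqrt(2)))   # standard normal CDF
pdf = lambda t: exp(-t * t / 2) / sqrt(2 * pi) # standard normal density

# Synthetic truncated regression: y = theta* . x + eps, observe only y > C
theta_star, C, n = 2.0, 1.0, 20_000
x_all = rng.normal(size=n)
y_all = theta_star * x_all + rng.normal(size=n)
x, y = x_all[y_all > C], y_all[y_all > C]

theta_ols = float(np.polyfit(x, y, 1)[0])      # standard regression: biased

# Gradient ascent on the truncated log-likelihood (unit noise variance):
#   d/d(mu) log p = (y - mu) - pdf(C - mu) / (1 - Phi(C - mu)),  mu = theta * x
theta_hat = theta_ols
for _ in range(200):
    mu = theta_hat * x
    a = np.clip(C - mu, None, 6.0)             # guard the numerical tail
    lam = np.array([pdf(t) / (1 - Phi(t)) for t in a])
    theta_hat += 0.5 * float(np.mean(((y - mu) - lam) * x))

print(theta_ols, theta_hat)   # OLS underestimates theta*; truncated MLE ~ 2.0
```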
UCI MSD dataset: covariates X, latent Z, labels Y.

[Figure: test set accuracy (45–75%) vs. truncation parameter C (1985–2000), for standard regression vs. truncated regression]
Mixture of two Gaussians [Nagarajan & Panageas, 2019]: an expectation-maximization method; gives a local improvement guarantee.
Unknown truncation set [Kontonis et al, 2019]: works when the space of possible sets has bounded VC dimension or bounded Gaussian surface area (both measures of complexity).
High-dimensional (sparse) setting [Daskalakis et al, 2020]: covariates are very high dimensional but sparse; gives an algorithm for dealing with truncation (applications: school studies, non-response in surveys).