slide-1
SLIDE 1

Controlling for confounders through approximate sufficiency

Rina Foygel Barber (joint with Lucas Janson)

http://www.stat.uchicago.edu/~rina/

slide-2
SLIDE 2

Collaborator

Lucas Janson (Harvard U.)

2/27

slide-4
SLIDE 4

Intro: testing conditional independence

[Diagram: confounders Z, features X, response Y; the link between X and Y is marked "?"]

Classical (parametric) approach:

  • Assume a parametric model such as Y | X, Z ∼ f (· ; α⊤X + β⊤Z)
  • Parametric inference to test H0 : α = 0

Model-X approach, a.k.a. Conditional Randomization Test (Candès et al. 2018)

  • Known distribution of X | Z (distrib. of Y unknown)
  • Choose function T(X; Y, Z) that measures association
  • Resample copies $\tilde X^{(1)}, \dots, \tilde X^{(M)}$ iid ∼ (distrib. of X | Z)
  • Compute the p-value

$$\mathrm{pval} = \frac{1 + \sum_{m=1}^{M} \mathbb{1}\{T(\tilde X^{(m)}; Y, Z) \ge T(X; Y, Z)\}}{1 + M}$$
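A minimal Python sketch of the conditional randomization test described on this slide. Everything specific here is an illustrative assumption, not taken from the talk: the known conditional law is X_i | Z_i ∼ N(Z_i, 1), the statistic is |Σ (X_i − Z_i) Y_i|, and the names `crt_pvalue`, `sample_x_given_z`, and `stat` are hypothetical.

```python
import random

def crt_pvalue(x, y, z, sample_x_given_z, stat, M=199, rng=None):
    """Return (1 + #{m : T(copy_m; y, z) >= T(x; y, z)}) / (1 + M)."""
    rng = rng or random.Random(0)
    t_obs = stat(x, y, z)
    exceed = 0
    for _ in range(M):
        x_copy = sample_x_given_z(z, rng)   # draw from the known law of X | Z
        if stat(x_copy, y, z) >= t_obs:
            exceed += 1
    return (1 + exceed) / (1 + M)

# Assumed model for illustration: X_i | Z_i ~ N(Z_i, 1), independent across i.
def sample_x_given_z(z, rng):
    return [zi + rng.gauss(0.0, 1.0) for zi in z]

# Assumed statistic for illustration: |sum_i (X_i - Z_i) * Y_i|.
def stat(x, y, z):
    return abs(sum((xi - zi) * yi for xi, yi, zi in zip(x, y, z)))

rng = random.Random(1)
n = 200
z = [rng.gauss(0.0, 1.0) for _ in range(n)]
x = sample_x_given_z(z, rng)
y_null = [zi + rng.gauss(0.0, 1.0) for zi in z]                  # Y depends on Z only
y_alt = [zi + xi + rng.gauss(0.0, 1.0) for xi, zi in zip(x, z)]  # Y also depends on X

p_null = crt_pvalue(x, y_null, z, sample_x_given_z, stat)
p_alt = crt_pvalue(x, y_alt, z, sample_x_given_z, stat)
```

Note that the distribution of Y is never modeled: only the law of X | Z is used to generate the copies.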

3/27

slide-7
SLIDE 7

Intro: testing conditional independence

[Diagram: confounders Z, features X, response Y; the link between X and Y is marked "?"]

Model-X approach via sufficient statistics (Huang & Janson 2019)

  • Distribution of X | Z is only partially known
  • By conditioning on sufficient statistic S(X, Z), can resample copies $\tilde X^{(1)}, \dots, \tilde X^{(M)}$ iid ∼ (distrib. of X | S(X, Z)) & compute p-value for test statistic T as before
  • Example: canonical GLMs
    - $X_i \sim \exp\{X_i \cdot Z_i^\top \theta - a(Z_i^\top \theta)\}$, i = 1, . . . , n, with θ unknown
    - $S(X, Z) = \sum_i X_i Z_i$ is suff. stat. for X = (X1, . . . , Xn)
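A short worked step showing why $\sum_i X_i Z_i$ is sufficient in the canonical GLM example, assuming the $X_i$ are independent given Z and that the densities are taken with respect to a base measure that does not depend on θ:

```latex
\prod_{i=1}^{n} \exp\{X_i \cdot Z_i^\top\theta - a(Z_i^\top\theta)\}
  = \exp\Big\{\theta^\top \sum_{i=1}^{n} X_i Z_i \;-\; \sum_{i=1}^{n} a(Z_i^\top\theta)\Big\}
  = \exp\Big\{\theta^\top S(X, Z) \;-\; \sum_{i=1}^{n} a(Z_i^\top\theta)\Big\}
```

The joint density depends on X and θ together only through $\theta^\top S(X, Z)$, so by the factorization theorem the distribution of X given S(X, Z) is the same for every θ.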

4/27

slide-9
SLIDE 9

Intro: testing goodness-of-fit (GoF)

More generally...

Goodness-of-fit test

Testing H0: X ∼ Pθ for some θ ∈ Θ, where {Pθ : θ ∈ Θ} is a parametric family

Conditional independence testing can be a special case:

  • Assume X | Z ∼ Pθ(·|Z) for some θ ∈ Θ
  • Null hypothesis H0 : X ⊥⊥ Y | Z
  • Equivalently... H0: X | Y, Z ∼ Pθ(·|Z) for some θ ∈ Θ
  • Note: we condition on Y and Z (i.e., treat as fixed)

5/27

slide-10
SLIDE 10

Intro: testing goodness-of-fit (GoF)

A general framework:

  • Choose any test statistic $T : \mathcal{X} \to \mathbb{R}$
  • Draw copies $\tilde X^{(1)}, \dots, \tilde X^{(M)}$
  • Compute rank-based p-value

$$\mathrm{pval} = \frac{1 + \sum_{m=1}^{M} \mathbb{1}\{T(\tilde X^{(m)}) \ge T(X)\}}{1 + M}$$

  • If $X, \tilde X^{(1)}, \dots, \tilde X^{(M)}$ are exchangeable under H0 ⇒ p-value is valid
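The validity claim can be checked exhaustively in a toy case. The sketch below (function names `rank_pvalue` and `type1_error` are hypothetical) enumerates every ordering of a fixed set of distinct statistic values, which is what exchangeability amounts to when there are no ties, and computes P{pval ≤ α} exactly.

```python
from fractions import Fraction
from itertools import permutations

def rank_pvalue(t_obs, t_copies):
    """Rank-based p-value (1 + #{m : T(copy_m) >= T(X)}) / (1 + M), as an exact fraction."""
    M = len(t_copies)
    return Fraction(1 + sum(t >= t_obs for t in t_copies), 1 + M)

def type1_error(values, alpha):
    """P{pval <= alpha} when (T(X), T(copy_1), ..., T(copy_M)) is a uniformly
    random ordering of `values` (exchangeability with no ties)."""
    orderings = list(permutations(values))
    hits = sum(rank_pvalue(o[0], o[1:]) <= alpha for o in orderings)
    return Fraction(hits, len(orderings))

values = [0.3, 1.7, 2.2, 4.1, 5.0]   # observed statistic plus M = 4 copies
alphas = [Fraction(1, 5), Fraction(1, 2), Fraction(9, 10)]
errors = {a: type1_error(values, a) for a in alphas}
```

With M = 4 the p-value is uniform on {1/5, 2/5, ..., 1}, so P{pval ≤ α} never exceeds α, with equality when α is a multiple of 1/(M + 1).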

6/27

slide-11
SLIDE 11

Co-sufficient sampling (CSS)

Co-sufficient sampling

Sample copies $\tilde X^{(m)}$ ∼ (distrib. of X | S(X)), where S(X) is a sufficient statistic for the family {Pθ : θ ∈ Θ}

Can be applied to:

  1. Test goodness-of-fit (GoF) (Engen & Lillegård 1997, Lockhart et al. 2007, Stephens 2012, Hazra 2013, ...)
  2. Test conditional independence (special case of GoF) (Rosenbaum 1984, Kolassa 2003, Huang & Janson 2019)
  3. Construct conf. intervals for a parameter of interest (by inverting GoF tests)

7/27

slide-13
SLIDE 13

Co-sufficient sampling (CSS)

Co-sufficient sampling

Sample copies $\tilde X^{(m)}$ ∼ (distrib. of X | S(X)), where S(X) is a sufficient statistic for the family {Pθ : θ ∈ Θ}

Permutation tests are an example of CSS

  • H0: X1, . . . , Xn iid ∼ D for D ∈ (some set)
  • The order statistics X(1) ≤ · · · ≤ X(n) are sufficient under the null
  • Permutation test ⇔ resampling X conditional on order statistics
  • Application: testing X ⊥⊥ Y. H0: conditional on Y1, . . . , Yn, it holds that X1, . . . , Xn are i.i.d.
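A minimal sketch of the permutation test viewed as co-sufficient sampling: each copy is a uniformly random rearrangement of X, i.e. a draw from the distribution of X given its order statistics. The statistic (absolute sample covariance) and the names `permutation_pvalue` and `abs_cov` are illustrative assumptions.

```python
import random

def permutation_pvalue(x, y, stat, M=499, seed=0):
    """Rank-based p-value where each copy of x is a uniform random permutation of x."""
    rng = random.Random(seed)
    t_obs = stat(x, y)
    x_copy = list(x)
    exceed = 0
    for _ in range(M):
        rng.shuffle(x_copy)             # same order statistics as x, random order
        if stat(x_copy, y) >= t_obs:
            exceed += 1
    return (1 + exceed) / (1 + M)

def abs_cov(x, y):
    """Absolute (unnormalized) sample covariance, an assumed choice of T."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return abs(sum((a - mx) * (b - my) for a, b in zip(x, y)))

rng = random.Random(1)
x = [rng.gauss(0.0, 1.0) for _ in range(30)]
y_indep = [rng.gauss(0.0, 1.0) for _ in range(30)]   # independent of x
y_dep = [2.0 * a for a in x]                         # perfectly dependent on x

p_indep = permutation_pvalue(x, y_indep, abs_cov)
p_dep = permutation_pvalue(x, y_dep, abs_cov)
```

In the dependent case no rearrangement of x matches the observed covariance, so the p-value takes its smallest possible value 1/(M + 1).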

8/27

slide-14
SLIDE 14

Co-sufficient sampling (CSS)

Limitation of co-sufficient sampling... no power in many settings!

Example (logistic model):

  • $X = (X_1, \dots, X_n) \in \{0, 1\}^n$, $Z = (Z_1, \dots, Z_n) \in (\mathbb{R}^k)^n$
  • If the Zi's are in general position, then $\sum_i X_i Z_i \in \mathbb{R}^k$ uniquely determines X (so if we resample, will have $\tilde X^{(1)} = \dots = \tilde X^{(M)} = X$ ⇒ zero power)
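The degeneracy can be seen by brute force in a small case. The sketch below (hypothetical helper names, Gaussian Z drawn with a fixed seed as a stand-in for "general position") groups all binary vectors x by the value of Σ x_i Z_i and reports the size of each group; CSS resamples within a group, so groups of size 1 leave nothing to resample.

```python
import random
from itertools import product

def sufficient_stat(x, Z):
    """S(x, Z) = sum_i x_i Z_i, rounded so it can be used as a dictionary key."""
    k = len(Z[0])
    return tuple(round(sum(xi * zi[j] for xi, zi in zip(x, Z)), 9) for j in range(k))

def fiber_sizes(Z):
    """Sizes of the sets {x in {0,1}^n : S(x, Z) = s}, one entry per attained value s."""
    fibers = {}
    for x in product((0, 1), repeat=len(Z)):
        key = sufficient_stat(x, Z)
        fibers[key] = fibers.get(key, 0) + 1
    return list(fibers.values())

rng = random.Random(0)
n, k = 10, 2
Z_generic = [[rng.gauss(0.0, 1.0) for _ in range(k)] for _ in range(n)]
Z_repeated = [[1.0, 1.0] for _ in range(n)]   # all Z_i equal: not in general position

sizes_generic = fiber_sizes(Z_generic)
sizes_repeated = fiber_sizes(Z_repeated)
```

With generic Z every group contains a single x, whereas with identical Z_i the statistic only records the number of ones and the groups are large.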

9/27

slide-16
SLIDE 16

Co-sufficient sampling (CSS)

Limitation of co-sufficient sampling... no power in many settings!

For many other models, the minimal sufficient statistic S(X) is essentially the data itself, e.g.,

  • Mixture of Gaussians or mixture of GLMs
  • Non-canonical GLMs
  • Heavy tailed distributions (e.g., multivariate t)
  • Models with missing or corrupted data

10/27

slide-17
SLIDE 17

Approximate sufficiency

For a family {Pθ : θ ∈ Θ}, a function S(X) is a sufficient statistic if

(distrib. of X | S(X), X ∼ Pθ) = (distrib. of X | S(X), X ∼ Pθ′) ∀θ, θ′.

Asymptotic sufficiency (Le Cam, Wald, ...): informally...

(distrib. of X | S(X), X ∼ Pθ) ≈ (distrib. of X | S(X), X ∼ Pθ′) ∀θ, θ′.

  • Under regularity conditions, $S(X) = \hat\theta_{\mathrm{MLE}}(X)$ is asymp. suff.

11/27

slide-18
SLIDE 18

Approximate co-sufficient sampling (aCSS)

Main idea:

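  • Let $\hat\theta \in \Theta$ be an approximate MLE given the data X
  • Let $p_\theta(\cdot \mid \hat\theta)$ = distrib. of $X \mid \hat\theta$, if marginally X ∼ Pθ
    ⇒ under the null, $X \mid \hat\theta \sim p_{\theta_0}(\cdot \mid \hat\theta)$ for the unknown true θ0
  • Sample copies $\tilde X^{(1)}, \dots, \tilde X^{(M)}$ from $p_{\hat\theta}(\cdot \mid \hat\theta) \approx p_{\theta_0}(\cdot \mid \hat\theta)$ (by approx. sufficiency)
  • ⇒ $X, \tilde X^{(1)}, \dots, \tilde X^{(M)}$ ≈ exchangeable under H0 ⇒ p-value is ≈ valid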

12/27

slide-19
SLIDE 19

Approximate co-sufficient sampling (aCSS)

Distance to exchangeability

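$$d_{\mathrm{exch}}(X, \tilde X^{(1)}, \dots, \tilde X^{(M)}) = \inf_{\text{exch. distrib. } D \text{ on } \mathcal{X}^{M+1}} d_{\mathrm{TV}}\big((X, \tilde X^{(1)}, \dots, \tilde X^{(M)}),\, D\big)$$

  • For any test statistic T(X), the p-value

$$\mathrm{pval} = \frac{1 + \sum_{m=1}^{M} \mathbb{1}\{T(\tilde X^{(m)}) \ge T(X)\}}{1 + M}$$

satisfies

$$P\{\mathrm{pval} \le \alpha\} \le \alpha + d_{\mathrm{exch}}(X, \tilde X^{(1)}, \dots, \tilde X^{(M)}).$$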
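A brief sketch of why the bound holds, using only the definition of total variation distance and the validity of the rank-based p-value under exchangeability. Here $A$ denotes the set of $(M+1)$-tuples for which the p-value is at most α, a notation introduced only for this sketch.

```latex
% For any exchangeable distribution D on \mathcal{X}^{M+1}:
P\{\mathrm{pval} \le \alpha\}
  = P\{(X, \tilde X^{(1)}, \dots, \tilde X^{(M)}) \in A\}
  \le D(A) + d_{\mathrm{TV}}\big((X, \tilde X^{(1)}, \dots, \tilde X^{(M)}),\, D\big)
  \le \alpha + d_{\mathrm{TV}}\big((X, \tilde X^{(1)}, \dots, \tilde X^{(M)}),\, D\big)
```

The first inequality is the definition of total variation distance, and the second uses $D(A) \le \alpha$, which is the validity of the rank-based p-value for exchangeable draws. Taking the infimum over exchangeable D gives the stated bound.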

13/27

slide-20
SLIDE 20

aCSS algorithm

  • Step 1: choose a test statistic $T : \mathcal{X} \to \mathbb{R}$
  • Step 2: observe data X, and compute an approximate MLE $\hat\theta$
  • Step 3: sample copies $\tilde X^{(1)}, \dots, \tilde X^{(M)}$ from ≈ distribution of $X \mid \hat\theta$
  • Step 4: compute a rank-based p-value to test H0:

$$\mathrm{pval} = \frac{1 + \sum_{m=1}^{M} \mathbb{1}\{T(\tilde X^{(m)}) \ge T(X)\}}{1 + M}$$

14/27

slide-23
SLIDE 23

aCSS algorithm

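  • Step 2: observe data X, and compute an approximate MLE $\hat\theta$

Ideally would like to minimize

$$\mathcal{L}(\theta; X, W) = \underbrace{\mathcal{L}(\theta; X)}_{\text{penalized neg. log-likelihood } -\log f(X;\theta) + R(\theta)} + \underbrace{\sigma \cdot W^\top \theta}_{\text{perturb with } W \sim N(0, \frac{1}{d} I_d)}$$

(choose $\sigma \ll n^{1/2}$)

(see also Tian & Taylor 2018: random perturbation for selective inference)

But... what if nonconvex? what if no global minimum?

    - Function $\hat\theta : \mathcal{X} \times \mathbb{R}^d \to \Theta$, returns $\hat\theta(X, W)$.
    - If $\hat\theta(X, W)$ is a strict SOSP (second-order stationary point) of $\mathcal{L}(\theta; X, W)$, proceed to next step.
    - Otherwise return $\tilde X^{(1)} = \dots = \tilde X^{(M)} = X$ ⇒ pval = 1.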
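A minimal sketch of Step 2 in the simplest possible setting, chosen here only for illustration and not taken from the talk: X_1, ..., X_n iid N(θ, 1) with d = 1, no penalty (R = 0), and Newton's method as the solver. The function name `perturbed_mle` is hypothetical. In this toy model the perturbed objective is ½ Σ (x_i − θ)² + σWθ up to constants, with minimizer x̄ − σW/n.

```python
import random

def perturbed_mle(x, w, sigma, steps=50, tol=1e-8):
    """Minimize L(theta; x, w) = 0.5 * sum_i (x_i - theta)^2 + sigma * w * theta
    and return theta only if it is a strict second-order stationary point."""
    n = len(x)
    sum_x = sum(x)

    def grad(theta):            # dL/dtheta
        return n * theta - sum_x + sigma * w

    def hess(theta):            # d^2 L / dtheta^2 (constant for this toy model)
        return float(n)

    theta = 0.0
    for _ in range(steps):
        theta -= grad(theta) / hess(theta)   # Newton step

    # Strict SOSP check: zero gradient (up to tol) and positive curvature.
    is_strict_sosp = abs(grad(theta)) <= tol and hess(theta) > 0
    return theta if is_strict_sosp else None   # None: caller sets copies = X, pval = 1

rng = random.Random(0)
x = [rng.gauss(1.0, 1.0) for _ in range(100)]
w = rng.gauss(0.0, 1.0)         # W ~ N(0, 1/d) with d = 1
sigma = 2.0
theta_hat = perturbed_mle(x, w, sigma)
closed_form = sum(x) / len(x) - sigma * w / len(x)
```

Returning `None` when the check fails mirrors the fallback on the slide; for a nonconvex likelihood the same check would be applied to whatever point the optimizer returns.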

15/27

slide-27
SLIDE 27

aCSS algorithm

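  • Step 3: sample copies $\tilde X^{(1)}, \dots, \tilde X^{(M)}$ from ≈ distribution of $X \mid \hat\theta$

Density of $X \mid \hat\theta$, conditional on the event that $\hat\theta(X, W)$ is strict SOSP:

$$\propto f(x; \theta_0) \cdot \exp\Big\{-\frac{\|\nabla_\theta \mathcal{L}(\hat\theta; x)\|^2}{2\sigma^2/d}\Big\} \cdot \det\big(\nabla^2_\theta \mathcal{L}(\hat\theta; x)\big) \cdot \mathbb{1}_{x \in \mathcal{X}_{\hat\theta}}$$

where $\mathcal{X}_{\hat\theta}$ is the support of $X \mid \hat\theta$.

θ0 unknown ⇒ use $\hat\theta$ as plug-in estimate:

$$\propto f(x; \hat\theta) \cdot \exp\Big\{-\frac{\|\nabla_\theta \mathcal{L}(\hat\theta; x)\|^2}{2\sigma^2/d}\Big\} \cdot \det\big(\nabla^2_\theta \mathcal{L}(\hat\theta; x)\big) \cdot \mathbb{1}_{x \in \mathcal{X}_{\hat\theta}}$$

If sampling directly is impossible, can use an exchangeable form of MCMC (Besag & Clifford 1989)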

16/27

slide-28
SLIDE 28

Type I error guarantee

Assumption 1: regularity conditions

  • Θ ⊆ Rd convex & open
  • Pθ has positive density f (·; θ) w.r.t. base measure νX for all θ ∈ Θ
  • Log-likelihood log f (x; θ) & penalty R(θ) are continuously twice diff.

17/27

slide-30
SLIDE 30

Type I error guarantee

Assumption 2: approximate MLE

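For $X \sim P_{\theta_0}$ and $W \sim N(0, \frac{1}{d} I_d)$, with prob. at least 1 − δ,

  • $\|\hat\theta(X, W) - \theta_0\| \le r$ and $\hat\theta(X, W)$ is a strict SOSP of $\mathcal{L}(\theta; X, W)$.

Assumption 3: Hessian of the log-likelihood

$$E\Big[\exp\Big\{\sup_{\theta \in B(\theta_0, r) \cap \Theta} r^2 \big\|\nabla^2 \log f(X; \theta) - E\big[\nabla^2 \log f(X; \theta)\big]\big\|\Big\}\Big] \le e^{\varepsilon}$$

In standard settings with n independent observations... $r, \varepsilon, \delta = O(n^{-1/2})$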

18/27

slide-32
SLIDE 32

Type I error guarantee

Theorem

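Under Assumptions 1, 2, & 3, the copies produced by aCSS satisfy

$$d_{\mathrm{exch}}(X, \tilde X^{(1)}, \dots, \tilde X^{(M)}) \le 3\sigma r + \delta + \varepsilon \quad \text{under } H_0.$$

Therefore, for any test statistic T, Type I error for testing H0 satisfies

$$P\{\mathrm{pval} \le \alpha\} \le \alpha + 3\sigma r + \delta + \varepsilon$$

Excess Type I error ($3\sigma r + \delta + \varepsilon$) should be o(1)...

  • $r, \delta, \varepsilon \asymp n^{-1/2}$ from the assumptions
  • σ = noise level, chosen by analyst → choose $\sigma \asymp n^{c}$ for some $c \in [0, \frac{1}{2})$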
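A short worked calculation of how the choice of σ enters the excess Type I error, taking the rates on this slide at face value (constants are ignored):

```latex
r, \delta, \varepsilon \asymp n^{-1/2}, \quad \sigma \asymp n^{c}
\;\Longrightarrow\;
3\sigma r + \delta + \varepsilon \;\asymp\; n^{c - 1/2} + n^{-1/2}
\;\longrightarrow\; 0
\quad\text{if and only if}\quad c < \tfrac{1}{2}.
```

For instance, $c = 1/4$ gives an excess error of order $n^{-1/4}$, while $c = 0$ (constant σ) gives order $n^{-1/2}$.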

19/27

slide-33
SLIDE 33

Examples

Examples where CSS has no power, but aCSS assumptions hold:

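  • Canonical GLMs such as logistic regression (low-dim.):

$$X_i \overset{\perp\!\!\!\perp}{\sim} \mathrm{Bernoulli}\Big(\frac{e^{Z_i^\top \beta}}{1 + e^{Z_i^\top \beta}}\Big) \quad \text{for unknown } \beta$$

  • Two-sample difference-of-means (the Behrens–Fisher problem):

$$X_i \overset{iid}{\sim} N(\mu_X, \sigma_X^2), \qquad Y_i \overset{iid}{\sim} N(\mu_Y, \sigma_Y^2), \qquad \text{test } H_0 : \mu_X = \mu_Y$$

(An aCSS-like approach for this problem was considered by Lillegård 2001)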

20/27

slide-34
SLIDE 34

Examples

Examples where CSS has no power, but aCSS assumptions hold:

  • Spatial process on integer lattice: for unknown ρ, $X \sim N(0, \Sigma)$ where $\Sigma_{ij} = \rho^{D_{ij}}$ for known pairwise distances $D_{ij}$
  • Multivariate t distribution (low-dim.): $X_i \overset{iid}{\sim} t_\gamma(0, \Sigma)$ for known γ & unknown Σ
  • And maybe missing data, latent variables, and more ...

21/27

slide-36
SLIDE 36

Simulations

Compare to oracle method that knows θ0:

  • Sample copies $\tilde X^{(m)}$ iid ∼ $P_{\theta_0}$
  • Compute p-value with same statistic T(x)

[Figure: power curves for aCSS and oracle. Panel "Logistic Regression": power vs. coefficient on X. Panel "Behrens−Fisher": power vs. µ(1) − µ(0).]

22/27

slide-37
SLIDE 37

Simulations

Compare to oracle method that knows θ0:

  • Sample copies $\tilde X^{(m)}$ iid ∼ $P_{\theta_0}$
  • Compute p-value with same statistic T(x)

[Figure: power curves for aCSS and oracle. Panel "Gaussian Spatial": power vs. anisotropy parameter. Panel "Multivariate t": power vs. true d.f. − null d.f.]

22/27

slide-39
SLIDE 39

Sampling

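Recall: need to sample copies $\tilde X^{(m)}$ from

$$\propto f(x; \hat\theta) \cdot \exp\Big\{-\frac{\|\nabla_\theta \mathcal{L}(\hat\theta; x)\|^2}{2\sigma^2/d}\Big\} \cdot \det\big(\nabla^2_\theta \mathcal{L}(\hat\theta; x)\big) \cdot \mathbb{1}_{x \in \mathcal{X}_{\hat\theta}}$$

Two exchangeable MCMC strategies (Besag & Clifford 1989)

[Diagram 1: X is linked to a latent hub $\tilde X^{*}$, which is linked to each copy $\tilde X^{(1)}, \dots, \tilde X^{(M)}$]

[Diagram 2: X and the copies $\tilde X^{(1)}, \dots, \tilde X^{(M)}$ placed in a random permutation of M + 1 positions]

  • Run Metropolis–Hastings, where $f(x; \hat\theta)$ stationary for proposal distrib.
  • e.g., if X consists of n indep. observations (i.e., $f(x; \hat\theta) = \prod_{i=1}^{n} f_i(x_i; \hat\theta)$), can choose proposal distrib. = resample s of n observations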
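A minimal sketch of the hub strategy, under simplifying assumptions that are not from the talk: the data is a single real number, the target is a standard normal log-density standing in for the aCSS density above, and the kernel is random-walk Metropolis–Hastings. The names `mh_step` and `hub_sampler` are hypothetical. Because this kernel is reversible with respect to the target, running it from X to the hub and then from the hub to each copy gives copies that are exchangeable with X whenever X is drawn from the target.

```python
import math
import random

def mh_step(x, log_density, rng, scale=0.5):
    """One random-walk Metropolis-Hastings step (reversible w.r.t. the target)."""
    proposal = x + rng.gauss(0.0, scale)
    log_ratio = log_density(proposal) - log_density(x)
    if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
        return proposal
    return x

def hub_sampler(x, log_density, M, L, rng):
    """Run L steps from x to a latent hub, then L steps from the hub for each copy."""
    hub = x
    for _ in range(L):
        hub = mh_step(hub, log_density, rng)
    copies = []
    for _ in range(M):
        state = hub
        for _ in range(L):
            state = mh_step(state, log_density, rng)
        copies.append(state)
    return copies

def log_target(v):
    """Stand-in target: standard normal log-density up to a constant."""
    return -0.5 * v * v

copies = hub_sampler(0.3, log_target, M=20, L=10, rng=random.Random(0))
```

The chain length L only affects power (how far the copies move from X), not the exchangeability argument.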

23/27

slide-40
SLIDE 40

Proof sketch for Theorem

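Need to bound $d_{\mathrm{exch}}(X, \tilde X^{(1)}, \dots, \tilde X^{(M)})$

(1) Calculate joint distribution:

$$\begin{cases} \hat\theta \sim (\text{marginal distrib. of } \hat\theta) \\ X \mid \hat\theta \sim p_{\theta_0}(\cdot \mid \hat\theta) \\ \tilde X^{(m)} \mid X, \hat\theta \sim p_{\hat\theta}(\cdot \mid \hat\theta) \end{cases}$$

$$\Longrightarrow \quad d_{\mathrm{exch}}(X, \tilde X^{(1)}, \dots, \tilde X^{(M)}) \le E_{\hat\theta}\Big[d_{\mathrm{TV}}\big(p_{\theta_0}(\cdot \mid \hat\theta),\, p_{\hat\theta}(\cdot \mid \hat\theta)\big)\Big]$$

24/27
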
slide-42
SLIDE 42

Proof sketch for Theorem

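(2) To bound dTV:

$$\frac{p_{\hat\theta}(X \mid \hat\theta)}{p_{\theta_0}(X \mid \hat\theta)} \propto \frac{f(X; \hat\theta)}{f(X; \theta_0)} \quad \Longrightarrow \quad \frac{p_{\hat\theta}(X \mid \hat\theta)}{p_{\theta_0}(X \mid \hat\theta)} = \frac{f(X; \hat\theta) / f(X; \theta_0)}{E_{p_{\theta_0}(\cdot \mid \hat\theta)}\big[f(X; \hat\theta) / f(X; \theta_0)\big]}$$

$$d_{\mathrm{TV}}\big(p_{\theta_0}(\cdot \mid \hat\theta),\, p_{\hat\theta}(\cdot \mid \hat\theta)\big) = E_{p_{\theta_0}(\cdot \mid \hat\theta)}\Bigg[\Bigg(1 - \frac{f(X; \hat\theta) / f(X; \theta_0)}{E_{p_{\theta_0}(\cdot \mid \hat\theta)}\big[f(X; \hat\theta) / f(X; \theta_0)\big]}\Bigg)_{+}\Bigg]$$

So, we need to show that $f(X; \hat\theta) / f(X; \theta_0)$ is ≈ constant over distrib. $X \mid \hat\theta$.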

25/27

slide-45
SLIDE 45

Proof sketch for Theorem

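Taylor expansion, for some $\tilde\theta$ between $\hat\theta$ and $\theta_0$:

$$\log\Big(\frac{f(X; \hat\theta)}{f(X; \theta_0)}\Big) = -(\theta_0 - \hat\theta)^\top \nabla_\theta \log f(X; \hat\theta) - \frac{1}{2}(\theta_0 - \hat\theta)^\top \nabla^2_\theta \log f(X; \tilde\theta)\,(\theta_0 - \hat\theta)$$

$$\Longrightarrow \quad \Big|\log\Big(\frac{f(X; \hat\theta)}{f(X; \theta_0)}\Big) + \frac{1}{2}(\theta_0 - \hat\theta)^\top E_{\theta_0}\big[\nabla^2_\theta \log f(X; \tilde\theta)\big](\theta_0 - \hat\theta)\Big| \le r \cdot \underbrace{\big\|\nabla_\theta \log f(X; \hat\theta)\big\|}_{=\, \|\sigma W\| \,\asymp\, \sigma} + \underbrace{\frac{1}{2} \cdot r^2 \big\|\nabla^2_\theta \log f(X; \tilde\theta) - E_{\theta_0}\big[\nabla^2_\theta \log f(X; \tilde\theta)\big]\big\|}_{\asymp\, \varepsilon \text{ by Asm. 3}}$$

using $\|\theta_0 - \hat\theta\| \le r$ with prob. ≥ 1 − δ by Asm. 2.

Rearrange ⇒

$$d_{\mathrm{exch}}(X, \tilde X^{(1)}, \dots, \tilde X^{(M)}) \le E_{\hat\theta}\Big[d_{\mathrm{TV}}\big(p_{\theta_0}(\cdot \mid \hat\theta),\, p_{\hat\theta}(\cdot \mid \hat\theta)\big)\Big] \le 3\sigma r + \delta + \varepsilon$$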

26/27

slide-47
SLIDE 47

Summary & open questions

  • Summary: aCSS can test goodness-of-fit by sampling nearly-exchangeable copies of the data, in a much broader range of settings than CSS

  • How to choose σ to balance Type I error & power?
  • Connections to Bayesian methods?
  • Apply to high dimensional regression / covariance estimation?
  • Apply to missing data / latent variables / models with singularities?
  • Extend to model-X knockoffs?

Thank you!

27/27