On Testing Marginal versus Conditional Independence Richard Guo - - PowerPoint PPT Presentation

on testing marginal versus conditional independence
SMART_READER_LITE
LIVE PREVIEW

On Testing Marginal versus Conditional Independence Richard Guo - - PowerPoint PPT Presentation

On Testing Marginal versus Conditional Independence Richard Guo ricguo@uw.edu Nov, 2019 Department of Statistics, University of Washington, Seattle 1 Introduction Motivation Inferring causal structures usually involves model selection among


slide-1
SLIDE 1

On Testing Marginal versus Conditional Independence

Richard Guo ricguo@uw.edu Nov, 2019

Department of Statistics, University of Washington, Seattle 1

slide-2
SLIDE 2

Introduction

slide-3
SLIDE 3

Motivation

Inferring causal structures usually involves model selection among directed acyclic graphs (DAGs). While learning undirected graphical models has been relatively well-developed (e.g., graphical lasso, neighborhood selection), model selection for DAGs is less well-understood. This poses a challenge to maintaining error guarantee in causal inference, even in large samples. In this talk, I will analyze the simplest example where such a challenge arises.

2

slide-4
SLIDE 4

Marginal vs. conditional independence

Consider (X1, X2, X3)⊺ ∼ N(0, Σ) on R3. Covariance Σ ∈ S3, the set of 3 × 3 real positive definite matrices. We want to test between M0 : X1 ⊥

⊥ X2,

(X1 → X3 ← X2), M1 : X1 ⊥

⊥ X2 | X3,

(X1 − X3 − X2), assuming that at least one of them is true. X1 − X3 − X2 includes the following Markov-equivalent DAGs X1 ← X3 ← X2, X1 → X3 → X2, X1 ← X3 → X2.

3

slide-5
SLIDE 5

Marginal vs. conditional independence

Testing between M0 : X1 ⊥

⊥ X2

vs. M1 : X1 ⊥

⊥ X2 | X3

is a non-nested model selection problem. They correspond to equality/algebraic constraints on Σ = {σij}: M0 : σ12 = 0, M1 : σ12·3 = σ12 − σ13σ−1

33 σ23 = 0 ⇔ σ12σ33 = σ13σ23.

M0 and M1 intersect at the two axes M0 ∩ M1 = {σ12 = σ13 = 0} ∪ {σ12 = σ23 = 0}.

4

slide-6
SLIDE 6

Geometry

We visualize the parameter space in the correlation space. M0 : ρ12 = 0, M1 : ρ12 = ρ13ρ23

5

slide-7
SLIDE 7

Singularity

The two axes further intersect at the origin Msing : {σ12 = σ13 = σ23 = 0}, which is a singularity. Msing corresponds to diagonal Σ.

  • M0 ∩ M1 vs. S3: Likelihood-ratio test (LRT) was studied by

Drton (2006, 2009) and Drton and Sullivant (2007).

  • LRT has a non-standard asymptotic distribution at Msing.
  • M0 vs. M1: At Msing, the tangent cones of the two models

coincide.

  • They are called “1-equivalent” by Evans (2018), meaning that

linear approximations to the parameter space are the same.

  • In the Euclidean m−1/2-ball of Msing, m2 samples are required

to distinguish M0 and M1.

6

slide-8
SLIDE 8

Difficulty

Model selection for DAGs is usually conducted by the following approaches (Drton and Maathuis, 2017).

  • Score-based: Picking the model with the highest penalized

likelihood score (e.g., AIC, BIC). Since dim(M0) = dim(M1), both AIC and BIC will pick the model with the higher likelihood.

  • Constraint-based: Testing

M0 : X1 ⊥

⊥ X2

vs. M1 : X1 ⊥

⊥ X2 | X3.

This is adopted by the PC algorithm. For Gaussian data, Fisher’s z-transformation of partial correlation is used as the test statistic.

7

slide-9
SLIDE 9

Difficulty

Simulated with n = 1, 000, ρ = 0.3 and unit variances under level α = 0.05. X1 X2 X1 X2 X3 X3 M0 \ M1 M1 \ M0

γn−1/2 ρ γn−1/2 ρ

M0\M1 M1\M0 5 10 5 10 0.00 0.25 0.50 0.75 1.00

|γ| size method

BIC/AIC PC

8

slide-10
SLIDE 10

Method

slide-11
SLIDE 11

Likelihood ratio test for nested models

Consider a parametric family {Pθ : θ ∈ Θ}, where Θ is an open subset of Rd. For Θ0 ⊆ Θ, suppose we want to test H0 : θ ∈ Θ0 vs. H1 : θ ∈ Θ. Under regularity, the likelihood ratio test (LRT) statistic λn = 2

  • sup

θ

ℓn(θ) − sup

θ0

ℓn(θ)

  • d

⇒ χ2

c,

where c = d − dim(Θ0). ℓn(·) is the log-likelihood under sample size n. For example, in linear regression y ∼ β0 + β1X1 + β2X2 + β3X3. We use χ2

2 for testing

H0 : β0 = β1 = 0 vs. H1 : β ∈ R4.

9

slide-12
SLIDE 12

Likelihood ratio test

Similarly, we define the log-likelihood ratio of M0 versus M1 as λ(0:1)

n

:=2

  • sup

Σ∈M0

ℓn(Σ) − sup

Σ∈M1

ℓn(Σ)

  • =2
  • ℓn(ˆ

Σ(0)

n ) − ℓn(ˆ

Σ(1)

n )

  • ,

where ˆ Σ(0)

n , ˆ

Σ(1)

n

are MLEs within M0 and M1 respectively. ℓn(·) is the Gaussian log-likelihood function ℓn(Σ) = n 2(− log |Σ| − Tr(SnΣ−1)).

10

slide-13
SLIDE 13

Likelihood ratio test

The Gaussian MLEs for DAGs take a closed form (Drton and Richardson, 2008), which yields the following expression for the LRT. λ(0:1)

n

= n log

  • s2

13 − s11s33

s2

23 − s22s33

  • s33

n log

  • s11s22

s22s2

13 − 2s12s23s13 + s11s2 23

s2

12 − s11s22

+ s33

  • ,

where S is the sample covariance taken with respect to mean zero.

11

slide-14
SLIDE 14

Our plan

  • 1. An information-theoretic analysis on how well the two models

can be distinguished (by any means).

  • 2. Look at the regimes of “effect size” ∼ n, such that the
  • ptimal error is between 0 and 1.
  • a stable, non-degenerate asymptotic distribution of LRT.
  • We will be doing large-n-small-effect asymptotics!
  • 3. Derive the asymptotic distributions.
  • Are they uniform?
  • 4. Develop a model selection procedure with error guarantees.

12

slide-15
SLIDE 15

Optimal error

We study the minimax rate of distinguishing two sequences of distributions, one within M0 and the other within M1, as they approach M0 ∩ M1. Lemma: testing two simple hypotheses For testing H0 : X ∼ P versus H1 : X ∼ Q, the minimum sum of type-I and type-II errors is 1 − dTV(P, Q). Total variation distance dTV(P, Q) = sup

A

|P(A) − Q(A)| = 1 2

  • |p − q| dµ.

13

slide-16
SLIDE 16

Optimal error

Consider a sequence Pn = PΣ(0)

n ,

Σ(0)

n

∈ M0 \ M1, Σ(0)

n

→ Σ∗ ∈ M0 ∩ M1. Correspondingly, let Qn = PΣ(1)

n

from M1 \ M0 such that Σ(1)

n

= arg min

Σ∈M1\M0

DKL(PΣ(0)

n PΣ),

which is the most difficult to distinguish from. With Pn = PΣ(0)

n

and Qn = PΣ(1)

n , let us compute the total

variation between the product measures (n iid samples). The limiting optimal error can be sandwiched by the Hellinger distance H(P, Q) := 1

2

  • (√p − √q)2 dµ

1/2. H2(Pn

n, Qn n) ≤ dTV(Pn n, Qn n) ≤ H(Pn n, Qn n)

  • 2 − H2(Pn

n, Qn n). 14

slide-17
SLIDE 17

Optimal error

With some algebra, we have 1 − dTV(Pn

n, Qn n) →

   0, H(Pn, Qn) = ω(n−1/2) 1, H(Pn, Qn) = o(n−1/2) , and when H(Pn, Qn) ≍ n−1/2, 0 < lim inf

n

{1 − dTV(Pn

n, Qn n)} ≤ lim sup n

{1 − dTV(Pn

n, Qn n)} < 1.

Effect size H(Pn, Qn) ≍ ρ13,nρ23,n, where ρij = σij/√σiiσjj is the correlation coefficient.

15

slide-18
SLIDE 18

Optimal error

Comparing H(Pn, Qn) to n−1/2, to stabilize the asymptotic error, there are two regimes. Two regimes {1 − dTV(Pn

n, Qn n)} → c ∈ (0, 1)

iff        ρ13,n ≍ γn−1/2, ρ23,n → ρ23 = 0 ρ23,n ≍ γn−1/2, ρ13,n → ρ13 = 0 ρ13,nρ23,n ≍ δn−1/2, ρ13,n, ρ23,n → 0.

  • “weak-strong”

“weak-weak”

16

slide-19
SLIDE 19

Asymptotics: weak-strong regime

We study the (local) asymptotic distribution of λ(0:1)

n

. For r = γ√σ11σ33, we set Σ(0)

n

=    σ11 r/√n σ22 σ23 r/√n σ23 σ33    , Σ(1)

n

=    σ11 (r/√n)σ23/σ33 r/√n (r/√n)σ23/σ33 σ22 σ23 r/√n σ23 σ33    , Σ(0)

n , Σ(1) n

→ Σ∗ =    σ11 σ22 σ23 σ23 σ33   

17

slide-20
SLIDE 20

Asymptotics: weak-strong regime ρ12

<latexit sha1_base64="C5oYjXs5XFpIvmf4zkzWcJwqK0=">AB8HicbVDLSgNBEOyNrxhfUY9eBoPgKexGQY9BLx4jmIckS5idzCZD5rHMzAphyVd48aCIVz/Hm3/jJNmDJhY0FXdHdFCWfG+v63V1hb39jcKm6Xdnb39g/Kh0cto1JNaJMornQnwoZyJmnTMstpJ9EUi4jTdjS+nfntJ6oNU/LBThIaCjyULGYEWyc9vRI9bOgNu2XK37VnwOtkiAnFcjR6Je/egNFUkGlJRwb0w38xIYZ1pYRTqelXmpogskYD2nXUYkFNWE2P3iKzpwyQLHSrqRFc/X3RIaFMRMRuU6B7cgsezPxP6+b2vg6zJhMUkslWSyKU46sQrPv0YBpSiyfOIKJZu5WREZY2JdRiUXQrD8ip1arBRdW/v6zUb/I4inACp3AOAVxBHe6gAU0gIOAZXuHN096L9+59LFoLXj5zDH/gf4AgsuQNQ=</latexit>

ρ13

<latexit sha1_base64="N2t1x1B6o7YXeWQNQ7GLKgYtqW0=">AB8HicbVDLSgNBEOyNrxhfUY9eBoPgKewaQY9BLx4jmIckS5idzCZD5rHMzAphyVd48aCIVz/Hm3/jJNmDJhY0FXdHdFCWfG+v63V1hb39jcKm6Xdnb39g/Kh0cto1JNaJMornQnwoZyJmnTMstpJ9EUi4jTdjS+nfntJ6oNU/LBThIaCjyULGYEWyc9vRI9bOgNu2XK37VnwOtkiAnFcjR6Je/egNFUkGlJRwb0w38xIYZ1pYRTqelXmpogskYD2nXUYkFNWE2P3iKzpwyQLHSrqRFc/X3RIaFMRMRuU6B7cgsezPxP6+b2vg6zJhMUkslWSyKU46sQrPv0YBpSiyfOIKJZu5WREZY2JdRiUXQrD8ipXVSDWtW/v6zUb/I4inACp3AOAVxBHe6gAU0gIOAZXuHN096L9+59LFoLXj5zDH/gf4AhFCQNg=</latexit>

ρ23

<latexit sha1_base64="+q3hAqsk252mIPphuj3whPN1q8w=">AB8HicbVDLSgNBEOyNrxhfUY9eBoPgKewmgh6DXjxGMImSLGF2MpsMmcyMyuEJV/hxYMiXv0cb/6Nk2QPmljQUFR1090VJZwZ6/vfXmFtfWNzq7hd2tnd2z8oHx61jUo1oS2iuNIPETaUM0lblOHxJNsYg47UTjm5nfeaLaMCXv7ShocBDyWJGsHXSY0+PVD+r1af9csWv+nOgVRLkpAI5mv3yV2+gSCqotIRjY7qBn9gw9oywum01EsNTAZ4yHtOiqxoCbM5gdP0ZlTBihW2pW0aK7+nsiwMGYiItcpsB2ZW8m/ud1UxtfhRmTSWqpJItFcqRVWj2PRowTYnlE0cw0czdisgIa0ysy6jkQgiWX14l7Vo1qFf9u4tK4zqPowgncArnEMAlNOAWmtACAgKe4RXePO29eO/ex6K14OUzx/AH3ucPhdaQNw=</latexit>

X1 X2 X1 X2 X1 X2 X3 X3 X3 Σ(0)

n

∈ M0 \ M1 Σ(1)

n

∈ M1 \ M0 Σ∗ ∈ M0 ∩ M1

γn−1/2 ρ γn−1/2 ρ ρ

18

slide-21
SLIDE 21

Asymptotics: weak-strong regime

Let Z1, Z2 be two independent standard normals. LRT in the weak-strong regime Under Σ(0)

n ,

λ(0:1)

n d

⇒ ρ  

  • Z1 +

γ

  • 2(1 − ρ)

2 −

  • Z2 +

γ

  • 2(1 + ρ)

2  ; Under Σ(1)

n ,

λ(0:1)

n d

⇒ ρ  

  • Z1 + γ
  • 1 − ρ

2 2 −

  • Z2 + γ
  • 1 + ρ

2 2  .

19

slide-22
SLIDE 22

Asymptotics: weak-strong regime

ρ = 0.1 ρ = 0.3 ρ = 0.5 ρ = 0.9 −3 3 6 −10 10 20 −20 20 40 100 200 1 1.64 1.96 2.57 3 5 1 1.64 1.96 2.57 3 5 1 1.64 1.96 2.57 3 5 1 1.64 1.96 2.57 3 5

λ(0:1) |γ| truth

M0\M1 M1\M0

The asymptotic distribution is a scaled difference between two independent non-central χ2

1 variables.

  • No simple analytic form for PDF/CDF.
  • Adding an n−1/2 shift to other elements in Σn does not

change the distribution (regularity).

  • Can be derived from local asymptotic normality (LAN) or Le

Cam’s 3rd Lemma (change of measure under contiguity).

20

slide-23
SLIDE 23

Asymptotics: weak-weak regime

ρ12

<latexit sha1_base64="C5oYjXs5XFpIvmf4zkzWcJwqK0=">AB8HicbVDLSgNBEOyNrxhfUY9eBoPgKexGQY9BLx4jmIckS5idzCZD5rHMzAphyVd48aCIVz/Hm3/jJNmDJhY0FXdHdFCWfG+v63V1hb39jcKm6Xdnb39g/Kh0cto1JNaJMornQnwoZyJmnTMstpJ9EUi4jTdjS+nfntJ6oNU/LBThIaCjyULGYEWyc9vRI9bOgNu2XK37VnwOtkiAnFcjR6Je/egNFUkGlJRwb0w38xIYZ1pYRTqelXmpogskYD2nXUYkFNWE2P3iKzpwyQLHSrqRFc/X3RIaFMRMRuU6B7cgsezPxP6+b2vg6zJhMUkslWSyKU46sQrPv0YBpSiyfOIKJZu5WREZY2JdRiUXQrD8ip1arBRdW/v6zUb/I4inACp3AOAVxBHe6gAU0gIOAZXuHN096L9+59LFoLXj5zDH/gf4AgsuQNQ=</latexit>

ρ13

<latexit sha1_base64="N2t1x1B6o7YXeWQNQ7GLKgYtqW0=">AB8HicbVDLSgNBEOyNrxhfUY9eBoPgKewaQY9BLx4jmIckS5idzCZD5rHMzAphyVd48aCIVz/Hm3/jJNmDJhY0FXdHdFCWfG+v63V1hb39jcKm6Xdnb39g/Kh0cto1JNaJMornQnwoZyJmnTMstpJ9EUi4jTdjS+nfntJ6oNU/LBThIaCjyULGYEWyc9vRI9bOgNu2XK37VnwOtkiAnFcjR6Je/egNFUkGlJRwb0w38xIYZ1pYRTqelXmpogskYD2nXUYkFNWE2P3iKzpwyQLHSrqRFc/X3RIaFMRMRuU6B7cgsezPxP6+b2vg6zJhMUkslWSyKU46sQrPv0YBpSiyfOIKJZu5WREZY2JdRiUXQrD8ipXVSDWtW/v6zUb/I4inACp3AOAVxBHe6gAU0gIOAZXuHN096L9+59LFoLXj5zDH/gf4AhFCQNg=</latexit>

ρ23

<latexit sha1_base64="+q3hAqsk252mIPphuj3whPN1q8w=">AB8HicbVDLSgNBEOyNrxhfUY9eBoPgKewmgh6DXjxGMImSLGF2MpsMmcyMyuEJV/hxYMiXv0cb/6Nk2QPmljQUFR1090VJZwZ6/vfXmFtfWNzq7hd2tnd2z8oHx61jUo1oS2iuNIPETaUM0lblOHxJNsYg47UTjm5nfeaLaMCXv7ShocBDyWJGsHXSY0+PVD+r1af9csWv+nOgVRLkpAI5mv3yV2+gSCqotIRjY7qBn9gw9oywum01EsNTAZ4yHtOiqxoCbM5gdP0ZlTBihW2pW0aK7+nsiwMGYiItcpsB2ZW8m/ud1UxtfhRmTSWqpJItFcqRVWj2PRowTYnlE0cw0czdisgIa0ysy6jkQgiWX14l7Vo1qFf9u4tK4zqPowgncArnEMAlNOAWmtACAgKe4RXePO29eO/ex6K14OUzx/AH3ucPhdaQNw=</latexit>

X1 X2 X1 X2 X1 X2 X3 X3 X3 Σ(0)

n

∈ M0 \ M1 Σ(1)

n

∈ M1 \ M0 Σ∗ ∈ M0 ∩ M1 21

slide-24
SLIDE 24

Asymptotics: weak-weak regime

Under the weak-weak regime ρ13,nρ23,n = δn−1/2, e.g., ρ13,n = √ δn−1/3 and ρ23,n = √ δn−1/6, the usual tactics fail due to irregularity: (i) M0 and M1 cannot be embedded into the same LAN family; (ii) contiguity to an iid static law no longer holds. Pn

n, Qn n contiguous to each other, but neither contiguous to Pn Σ∗.

M1 M0 O √ δ δ

Figure 1: M0 and M1 are √ δ away from origin; but they are δ away from each other (Evans, 2018).

22

slide-25
SLIDE 25

Asymptotics: weak-weak regime

Thanks to the closed form of λ(0:1)

n

, by a manual “change of measure” (relating the distribution of sample covariance under Σ(i)

n

to that under Σ = I), we obtain a Gaussian limit. LRT in the weak-weak regime For ρ13,nρ23,n = δn−1/2 + o(n−1/2), λ(0:1)

n d

⇒    δ(2Z + δ) =d N(δ2, (2δ)2), under Σ(0)

n

δ(2Z − δ) =d N(−δ2, (2δ)2), under Σ(1)

n

. The limit only depends on δ. It does not depend on how ρ13,n and ρ23,n approach zero individually.

23

slide-26
SLIDE 26

Limit experiments

Asymptotically, testing between M0 and M1 is equivalent to testing the location of a normal between two lines, from a single Gaussian observation. It is characterized by an angle and an intercept. M0 M1 θ = arcsin ρ γ/

  • 1 − ρ2

M1 M0 θ = arcsin ρ γ M0 M1 δ M0 M1 δ

24

slide-27
SLIDE 27

Decision, error and power

Due to non-nestedness, we refrain from choosing either as the “null”. Instead, we consider a three-way decision rule φn : Σn → {M0, M1, M0 ∪ M1}. Size For all Σn → Σ∗ on Mi \ M1−i for i = 0, 1, control lim sup

n→∞ PΣn(φn = M1−i) ≤ α.

The limit Σ∗ could be in M0 ∩ M1 or Mi \ M1−i.

Power Under Σn → Σ∗ from Mi \ M1−i, power is defined as lim inf

n→∞ PΣn(φn = Mi). 25

slide-28
SLIDE 28

Decision boundaries from asymptotics

Given the (1) regime, (2) ρ and (3) the local parameter (γ or δ), a three-way decision can be constructed from asymptotic quantiles.

0.00 0.05 0.10 0.15 0.20 − 10 − 5 5 10

density colour

M0 M1

ρ=0. 5, γ=2. 5, α=0. 05

φn =        M0, λ(0:1)

n

> F −1

1 (1 − α)

M1, λ(0:1)

n

< F −1

0 (α)

M0 ∪ M1,

  • therwise

.

26

slide-29
SLIDE 29

Non-uniform asymptotics :(

But this is impossible.

  • Depends on the regime (“where”): weak-strong or

weak-weak.

  • Discontinuity across regimes: the law under weak-strong does

not converge to that of weak-weak when ρ → 0.

  • Depends on the local parameter γ or δ (“how”).
  • Local parameter has scale n−1/2, not point-identified.
  • Impossible to judge if an edge is weak based on whether its

confidence interval contains zero without further assumptions.

  • Further, a procedure that tries to first estimate “where” and

“how” before applying the decision rule is susceptible to irregularity issues.

27

slide-30
SLIDE 30

Envelope!

28

slide-31
SLIDE 31

Extremal quantile

Let us look at the weak-weak Gaussian asymptotic as an example. F −1

0 (α) = (δ + Φ−1(α))2 − Φ−1(α)2.

  • 5

5

x

δ=0. 50

29

slide-32
SLIDE 32

Extremal quantile

Let us look at the weak-weak Gaussian asymptotic as an example. F −1

0 (α) = (δ + Φ−1(α))2 − Φ−1(α)2.

  • 5

5

x

δ=1. 00

29

slide-33
SLIDE 33

Extremal quantile

Let us look at the weak-weak Gaussian asymptotic as an example. F −1

0 (α) = (δ + Φ−1(α))2 − Φ−1(α)2.

  • 5

5

x

δ=1. 64

29

slide-34
SLIDE 34

Extremal quantile

Let us look at the weak-weak Gaussian asymptotic as an example. F −1

0 (α) = (δ + Φ−1(α))2 − Φ−1(α)2.

  • 5

5

x

δ=2. 00

29

slide-35
SLIDE 35

Extremal quantile

Let us look at the weak-weak Gaussian asymptotic as an example. F −1

0 (α) = (δ + Φ−1(α))2 − Φ−1(α)2.

  • 5

5

x

δ=2. 50

29

slide-36
SLIDE 36

Envelope distribution

Taking extremal quantiles for every α is equivalent to taking pointwise supremum of CDF over the local parameter γ or δ. Envelope distribution Given a family of distribution functions {Fh : h ∈ H} on R, define ¯ F ∗(x) := sup

h∈H

Fh(x), and ¯ F(x) :=    ¯ F ∗(x), ¯ F ∗ is continuous at x limy→x+ ¯ F ∗(y), ¯ F ∗ is discontinuous at x . We call ¯ F the envelope distribution of {Fh : h ∈ H} if ¯ F is a valid distribution function.

30

slide-37
SLIDE 37

Envelope distribution

Envelope distribution function Lemma: If ¯ F ∗(x) → 0 as x → −∞, then ¯ F(x) is a valid distribution function. For the weak-weak regime, it can be shown ¯ F = 1

2(−χ2 1) + 1 2δ0.

0.00 0.25 0.50 0.75 1.00 −5.0 −2.5 0.0 2.5

x cumulative probability δ

0.1 0.25 0.5 0.75 1 1.5 2 4 Envelope

Envelope of N(δ2,(2δ)2)

31

slide-38
SLIDE 38

Envelope distribution

The same phenomenon occurs for the weak-strong regime! We can verify that ¯ F ∗

ρ (x) → 0 as x → −∞ for every |ρ| ∈ (0, 1].

Therefore, ¯ Fρ, the envelope of {Fρ,γ : γ ∈ R}, is a valid distribution function.

0.00 0.25 0.50 0.75 1.00 − 5

x cumulative probability γ

0.1 1 2 3 4 6 8 envelope

Envelope of Fρ,γ under ρ = 0.3

32

slide-39
SLIDE 39

Envelope distribution

Continuity of envelope! Proposition: ¯ Fρ

d

⇒ ¯ F as ρ → 0, where ¯ F is the envelope distribution for the weak-weak regime. Further, we show the following properties for {Fρ : −1 ≤ ρ ≤ 1}.

  • ¯

Fρ = ¯ F|ρ|.

  • ¯

Fρ under M0 \ M1 and M1 \ M0 have the same form.

  • The positive part of ¯

Fρ for |ρ| ∈ (0, 1] is distributed as the positive part of ρ(Z 2

1 − Z 2 2 ) for two independent standard

normals.

  • Only the negative part of ¯

Fρ is relevant for decision making.

  • We do not have an analytic form for the negative part of ¯

Fρ, except for ρ ∈ {−1, 0, 1}.

33

slide-40
SLIDE 40

Envelope quantiles

Quantiles of ¯ Fρ can be evaluated by Monte Carlo on a grid of values for ρ and interpolating. It is interesting to notice that ¯ F −1

ρ (α) is not monotonic in |ρ|.

3 4 5 6 0.00 0.25 0.50 0.75 1.00

ρ negated envelope quantiles α

0.01 0.05 Bessel Chi square

34

slide-41
SLIDE 41

Model selection procedure: adaptive rule

Note that ¯ Fρ is continuous in ρ. Recall that ρ = ρstrong in the weak-strong regime, and ρ = 0 in the weak-weak regime. |ρ| can be consistently estimated by ˆ ρn = |ˆ ρ13,n| ∨ |ˆ ρ23,n|. Adaptive rule φada

n

:=        M0, λ(0:1)

n

> − ¯ F −1

ˆ ρn (α)

M1, λ(0:1)

n

< ¯ F −1

ˆ ρn (α)

M0 ∪ M1,

  • therwise

.

35

slide-42
SLIDE 42

Envelope of envelopes

The negative parts of { ¯ Fρ : ρ ∈ [−1, 1]} are dominated by that of ¯ Fρ=1.

0.00 0.25 0.50 0.75 1.00 −5.0 −2.5 0.0 2.5

x envelope CDF ρ

0.1 0.3 0.5 0.7 0.9 1

36

slide-43
SLIDE 43

Envelope of envelopes

Bessel envelope ¯ Fρ=1 is distributed as the difference between two independent χ2

1

variables. It has density involving modified Bessel function of the 2nd kind pB(u) = 1 2πK0(|u|/2).

0.00 0.25 0.50 0.75 −4 −2 2 4

x PDF of Bessel K−form

37

slide-44
SLIDE 44

Model selection procedure: uniform rule

Uniform rule φunif

n

:=        M0, λ(0:1)

n

> − ¯ F −1

ρ=1(α)

M1, λ(0:1)

n

< ¯ F −1

ρ=1(α)

M0 ∪ M1,

  • therwise

. The quantile is 3.19 for α = 0.05 and 5.97 for α = 0.01.

38

slide-45
SLIDE 45

Error guarantee

Error guarantee (rate-free) Theorem: The adaptive rule φada

n

controls asymptotic error uniformly below α for 0 < α < 1/2.

  • This holds for the local model sequences ρ13,nρ23,n ≍ n−1/2 such

that the asymptotic error is between 0 and 1.

  • This also holds for ρ13,nρ23,n = o(n−1/2) since λ(0:1)

n

→p 0 and Pr(φn = M0 ∪ M1) → 1.

  • And also holds for ρ13,nρ23,n = ω(n−1/2) where λ(0:1)

n

goes to ±∞.

Hence, our guarantee holds under Pn

Σn for any converging

sequence Σn. An assumption on the rate of signal strength is not required. Corollary: φunif

n

has the same guarantee.

39

slide-46
SLIDE 46

p-value

When it is desired to report a p-value, the rules can be restated as φn =        M0, λ(0:1)

n

> 0 and p-value < α M1, λ(0:1)

n

< 0 and p-value < α M0 ∪ M1,

  • therwise

, where a potentially conservative p-value is defined as p-value := ¯ Fρ(−|λ(0:1)

n

|) for ρ = 1 (uniform) or ρ = ˆ ρn (adaptive).

40

slide-47
SLIDE 47

Numerical results

slide-48
SLIDE 48

Methods for comparison

Naive Simply choosing the model with highest likelihood/AIC/BIC φnaive

n

:=    M0, λ(0:1)

n

> 0 M1, λ(0:1)

n

< 0 . Interval selection This is based on Drton and Perlman (2004). Construct (marginally) (1 − α)-level confidence intervals for ρ12 and ρ12·3, and let φinterval

n

:=        M0,

  • nly C.I. for ρ12 contains 0

M1,

  • nly C.I. for ρ12·3 contains 0

M0 ∪ M1, both C.I.’s contain 0 . φinterval

n

guarantees asymptotic size below α (suppose M0 is true, then one only makes an error when the C.I. for ρ12 does not contain zero).

41

slide-49
SLIDE 49

Weak-strong regime: size under M0 and M1

Models are simulated as in the weak-strong regime.

γ = 0.1 γ = 1 γ = 2 γ = 3 γ = 4 M0\M1 M1\M0 0.0 0.5 1.0 0.0 0.5 1.0 0.0 0.5 1.0 0.0 0.5 1.0 0.0 0.5 1.0 0.00 0.02 0.04 0.06 0.00 0.02 0.04 0.06

|ρstrong| size method

adaptive interval uniform

n = 1000, 4000 replicates, α = 0. 05

size of procedure under different values of γ

42

slide-50
SLIDE 50

Weak-strong regime: power to select M0 or M1

Grey curves are bounds on the theoretically optimal power.

γ = 0.1 γ = 1 γ = 2 γ = 3 γ = 4 M0\M1 M1\M0 0.0 0.5 1.0 0.0 0.5 1.0 0.0 0.5 1.0 0.0 0.5 1.0 0.0 0.5 1.0 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00

|ρstrong| power linetype

lower bound upper bound

method

adaptive interval uniform

power of procedure under different values of γ

43

slide-51
SLIDE 51

Weak-strong regime: varying sample sizes

Fix γ = 1 and vary n.

n = 100 n = 200 n = 500 n = 1000 n = 2500 n = 5000 M0\M1 M1\M0 0.0 0.5 1.00.0 0.5 1.00.0 0.5 1.00.0 0.5 1.00.0 0.5 1.00.0 0.5 1.0 0.00 0.02 0.04 0.06 0.00 0.02 0.04 0.06

|ρstrong| size method

adaptive interval uniform

4000 replicates, α = 0. 05, γ = 1

size of procedure under different n

44

slide-52
SLIDE 52

Weak-strong regime: varying sample sizes

Grey curves are bounds on the theoretically optimal power.

n = 100 n = 200 n = 500 n = 1000 n = 2500 n = 5000 M0\M1 M1\M0 0.0 0.5 1.00.0 0.5 1.00.0 0.5 1.00.0 0.5 1.00.0 0.5 1.00.0 0.5 1.0 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00

|ρstrong| power linetype

lower bound upper bound

method

adaptive interval uniform

power of procedure under different n

45

slide-53
SLIDE 53

Weak-weak regime: size under M0 and M1

The weak-weak regime.

n = 20 n = 50 n = 100 n = 500 n = 1000 M0\M1 M1\M0 0.5 1 0 0.5 1 0 0.5 1 0 0.5 1 0 0.5 1 0.01 0.02 0.03 0.04 0.05 0.06 0.01 0.02 0.03 0.04 0.05 0.06

a size method

adaptive interval uniform

4000 replicates, α = 0. 05

size of procedure under ρ13=n−a/4, ρ23=n−1/2+a/4

46

slide-54
SLIDE 54

Weak-weak regime: power to select M0 or M1

Grey curves are bounds on the theoretically optimal power.

n = 20 n = 50 n = 100 n = 500 n = 1000 M0\M1 M1\M0 0.5 1 0 0.5 1 0 0.5 1 0 0.5 1 0 0.5 1 0.00 0.25 0.50 0.75 0.00 0.25 0.50 0.75

a power linetype

lower bound upper bound

method

adaptive interval uniform

power of procedure under under ρ13=n−a/4, ρ23=n−1/2+a/4

47

slide-55
SLIDE 55

Projected Wishart

Draw Σ ∼ Wishart

  • ν, (σij)3×3 = (− 1

2)|i−j|

and then projected Σ to M0 and M1 by MLE.

n = 20 n = 50 n = 100 n = 300 M0\M1 M1\M0 10 20 30 10 20 30 10 20 30 10 20 30 0.00 0.01 0.02 0.03 0.04 0.05 0.00 0.01 0.02 0.03 0.04 0.05

degrees of freedom size method

adaptive interval uniform

4000 replicates, α = 0. 05

size of procedure on the projected Wishart

48

slide-56
SLIDE 56

Projected Wishart

Draw Σ ∼ Wishart

  • ν, (σij)3×3 = (− 1

2)|i−j|

and then projected Σ to M0 and M1 by MLE.

n = 20 n = 50 n = 100 n = 300 M0\M1 M1\M0 10 20 30 0 10 20 30 0 10 20 30 0 10 20 30 0.25 0.50 0.75 0.25 0.50 0.75

degrees of freedom power method

adaptive interval uniform

4000 replicates, α = 0. 05

power of procedure on the projected Wishart

48

slide-57
SLIDE 57

Linear regression

(Y1, Y2, Y3) = X ⊺(β1, β2, β3) + ε with ε ∼ N(0, Σ(i)). Σ(i) is drawn from the projected Wishart. Y1 ⊥

⊥ Y2 | X1, · · · Xp

X1 X2 · · · Xp Y1 Y3 Y2 Y1 ⊥

⊥ Y2 | Y3, X1, · · · Xp

X1 X2 · · · Xp Y1 Y3 Y2

49

slide-58
SLIDE 58

Linear regression

(Y1, Y2, Y3) = X ⊺(β1, β2, β3) + ε with ε ∼ N(0, Σ(i)). Σ(i) is drawn from the projected Wishart.

M0\M1 M1\M0 power size 100 200 300 400 500 0 100 200 300 400 500 0.6 0.7 0.8 0.9 0.00 0.05 0.10 0.15

p value method

adaptive interval uniform naive

n = 1000, 1000 replicates, α = 0. 05

size and power conditional on p covariates

50

slide-59
SLIDE 59

Real-data example: American occupational structure

Blau and Duncan (1967) measured the following covariates on n = 20, 700 subjects: V : father’s educational attainment, X: father’s occupational status, U: educational attainment, W : status of the first job, Y : status of occupation in 1962. Blau and Duncan summarized the data as a correlation matrix.

51

slide-60
SLIDE 60

Real-data example: structure learning

We run PC algorithm at level α = 0.01. It first identifies the skeleton by d-separation, which only removes the edge between V and Y based on Y ⊥

⊥ V | U, X.

V U X W Y The blue edges are oriented based on a common-sense temporal

  • rdering {V , X} < U < {W , Y }.

52

slide-61
SLIDE 61

Real-data example: structure learning

V U X W Y

Next, the PC algorithm orients edges based on V -structures. The

  • rientation of V − X is statistically unidentifiable (no V -structure).

However, the orientation of W − Y raises the question of testing

M0 (Y → W ) : V ⊥

⊥ Y | U, X,

M1 (Y ← W ) : V ⊥

⊥ Y | W , U, X.

We have λ(0:1)

n

= 3.72 and p-value = 0.026 under the envelope distribution ¯ Fˆ

ρn. Hence, under α = 0.01 we would leave the edge

unoriented (even though n = 20, 700!).

53

slide-62
SLIDE 62

Future work

Can we generalize the method as an off-the-shelf tool for non-nested model selection with error guarantees?

  • Mi as a manifold defined on some ambient Θ. Models can

have different dimensions.

  • The simplest case is to select between two models. Dealing

with more than two models involves multiplicity correction.

  • Need a characterization of all possible stable laws of λ(0:1).
  • Take any θ ∈ M0 ∩ M1 and consider θ(0)

n , θ(1) n

→ θ in respective neighborhoods. θ(0)

n

and θ(1)

n

are “closest” to each

  • ther in the KL sense.
  • Recall that ρ13ρ23 is effectively the parameter that determines

the distribution of λ(0:1).

  • Can we always introduce a reparametrization such that the

asymptotic at every neighborhood is equivalent to something simple, even under high-order equivalence (Evans, 2018)?

  • Take an envelope over all these laws.

54

slide-63
SLIDE 63

Thanks!

For details: https://arxiv.org/abs/1906.01850

55

slide-64
SLIDE 64

Additional slides

slide-65
SLIDE 65

Blau and Duncan dataset

Data collected during the March, 1962 Current Population Survey,

  • n a nationwide sample of about 20,000 American men aged 20-64.
  • Occupational statuses are measured by some index.
  • Educational attainment is measured by some coding for the

number of years of schooling completed. Sn =         1.000 0.516 0.453 0.332 0.322 0.516 1.000 0.438 0.417 0.405 0.453 0.438 1.000 0.538 0.596 0.332 0.417 0.538 1.000 0.541 0.322 0.405 0.596 0.541 1.000         .

slide-66
SLIDE 66

Limit experiment

Consider an “experiment” E = (X, A, Ph : h ∈ H) in the sense of van der Vaart. h is typically a local parameter. Fix a “base” h0 ∈ H. The likelihood ratio process is dPh dPh0 (X)

  • h∈H

, X ∼ Ph0. A sequence of experiments En = (Xn, An, Ph,n : h ∈ H) converges a limit experiment E = (X, A, Ph : h ∈ H) if the likelihood ratio process weakly converges (marginally). That is, for any finite subset I ⊂ H and any h0 ∈ H, dPh,n dPh0,n (Xn)

  • h∈I

h0

  • dPh

dPh0 (X)

  • h∈I

.

slide-67
SLIDE 67

Limit experiment

If (Pn,θ : θ ∈ Θ) is locally asymptotic normal (LAN) with norming sequence n−1/2 and non-singular Iθ, then the sequence of experiments (Pθ+n−1/2,n : h ∈ Rd) converges to the limit experiment (N(h, I −1

θ

) : h ∈ Rd).

slide-68
SLIDE 68

References i

References

Blau, Peter M and Otis Dudley Duncan (1967). The American Occupational Structure. Wiley New York. Drton, Mathias (2006). “Algebraic techniques for Gaussian models”. In: Prague Stochastics. Ed. by M. Huˇ skov´ a and

  • M. Janˇ
  • zura. Matfyzpress, Charles Univ.

– (2009). “Likelihood ratio tests and singularities”. In: The Annals

  • f Statistics 37.2, pp. 979–1012.

Drton, Mathias and Marloes H Maathuis (2017). “Structure learning in graphical modeling”. In: Annual Review of Statistics and Its Application 4, pp. 365–393.

slide-69
SLIDE 69

References ii

Drton, Mathias and Michael D Perlman (2004). “Model selection for Gaussian concentration graphs”. In: Biometrika 91.3,

  • pp. 591–602.

Drton, Mathias and Thomas S Richardson (2008). “Graphical methods for efficient likelihood inference in Gaussian covariance models”. In: Journal of Machine Learning Research 9.May,

  • pp. 893–914.

Drton, Mathias and Seth Sullivant (2007). “Algebraic statistical models”. In: Statistica Sinica, pp. 1273–1297. Evans, Robin J (2018). “Model selection and local geometry”. In: arXiv preprint arXiv:1801.08364v3.