SLIDE 1

High-dimensional consistency in score-based and hybrid structure learning

Marloes Maathuis, joint work with Preetam Nandy and Alain Hauser

SLIDE 2

Structure learning

◮ We consider random variables (X1, . . . , Xp) with distribution F0,
  where F0 is multivariate Gaussian (or nonparanormal)

◮ We assume that F0 has a perfect map G0

◮ Based on n i.i.d. observations from F0, we want to learn the
  CPDAG of G0

SLIDE 9

Terminology...

◮ We consider directed acyclic graphs (DAGs), where each node
  represents a random variable

◮ A DAG encodes d-separations (Pearl). Example:
  X1 → X2 → X3 encodes that X1 and X3 are d-separated by X2.

◮ A DAG G is a perfect map of a distribution F if
  {d-separations in G} = {conditional independencies in F}

◮ Examples:
  ◮ (X1, X2, X3) with X1 ⊥⊥ X3 | X2: 3 perfect maps:
    X1 → X2 → X3, X1 ← X2 ← X3, X1 ← X2 → X3
  ◮ (X1, X2, X3) with X1 ⊥⊥ X3: 1 perfect map:
    X1 → X2 ← X3 (v-structure)
  ◮ (X1, X2, X3, X4) with X1 ⊥⊥ X3 and X2 ⊥⊥ X4: no perfect map

◮ We consider distributions that have a perfect map
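As a quick numerical illustration of the chain example above (a sketch, not part of the original slides; the coefficients and sample size are arbitrary): in the Gaussian case, the d-separation of X1 and X3 by X2 shows up as a vanishing partial correlation, while the marginal correlation stays clearly nonzero.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Chain X1 -> X2 -> X3: X1 and X3 are d-separated by X2,
# so their partial correlation given X2 should vanish.
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(size=n)
x3 = 0.8 * x2 + rng.normal(size=n)

def partial_corr(a, b, c):
    """Correlation of a and b after linearly regressing out c."""
    ra = a - np.polyval(np.polyfit(c, a, 1), c)   # residual of a on c
    rb = b - np.polyval(np.polyfit(c, b, 1), c)   # residual of b on c
    return np.corrcoef(ra, rb)[0, 1]

marg = np.corrcoef(x1, x3)[0, 1]   # clearly nonzero (about 0.45 here)
part = partial_corr(x1, x3, x2)    # approximately zero
```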

SLIDE 13

Markov equivalence classes and CPDAGs

◮ DAGs that encode the same set of d-separations form a Markov
  equivalence class. Example:
  X1 → X2 → X3, X1 ← X2 ← X3, X1 ← X2 → X3

◮ All DAGs in a Markov equivalence class share the same skeleton
  and the same v-structures

◮ A Markov equivalence class can be described uniquely by a CPDAG.
  We want to learn the CPDAG.
  [Figure: a CPDAG on X1, . . . , X4 and the three DAGs (DAG 1, DAG 2,
  DAG 3) in its Markov equivalence class]

SLIDE 14

Possible applications of DAGs/CPDAGs

◮ Efficient estimation/computation using the factorization

  f(x1, . . . , xp) = ∏_{j=1}^p f(xj | pa(xj, G))

◮ Probabilistic reasoning in expert systems
◮ Causal inference
◮ ...
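As a numeric sanity check of the factorization (a sketch using a hypothetical two-node DAG X1 → X2 with X2 = 0.5 X1 + ε, not an example from the slides), the joint Gaussian density indeed equals the product of the per-node conditionals:

```python
import numpy as np

def gauss_pdf(x, mean=0.0, var=1.0):
    """Univariate Gaussian density."""
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

b = 0.5
# Implied joint covariance of (X1, X2) for X1 ~ N(0,1), X2 = b*X1 + N(0,1)
cov = np.array([[1.0, b], [b, b * b + 1.0]])
x = np.array([0.3, -1.2])

# Bivariate Gaussian density, evaluated directly
inv = np.linalg.inv(cov)
joint = np.exp(-0.5 * x @ inv @ x) / (2 * np.pi * np.sqrt(np.linalg.det(cov)))

# Markov factorization over the DAG: f(x1, x2) = f(x1) * f(x2 | pa(x2))
factored = gauss_pdf(x[0]) * gauss_pdf(x[1], mean=b * x[0], var=1.0)
```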

SLIDE 21

CPDAG versus conditional independence graph

◮ A conditional independence graph (CIG) is an undirected graph, where
  Xi and Xj are adjacent ⇔ Xi ⊥⊥ Xj | S does not hold for
  S = {all remaining variables}

◮ A CPDAG is a partially directed graph, where Xi and Xj are adjacent
  ⇔ Xi ⊥⊥ Xj | S does not hold for all S ⊆ {all remaining variables}

◮ The skeleton of the CPDAG is a subgraph of the CIG

◮ The CIG can be obtained from the CPDAG by “moralization”:
  marry unmarried parents and then make all edges undirected

◮ Example: [Figure: a CPDAG on X1, . . . , X4, the same graph after
  marrying unmarried parents, and the resulting CIG]
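The moralization step above can be sketched in a few lines (a minimal illustration, not from the slides; the graph representation as a parent dictionary is an assumption of this sketch). The example DAG is X3 ← X1, X3 ← X2, X4 ← X2, X4 ← X3, so moralization adds the edge X1 – X2 for the unmarried parents of X3:

```python
from itertools import combinations

def moralize(parents):
    """Moral graph of a DAG: marry unmarried parents, drop directions.
    `parents` maps each node to the set of its parents."""
    edges = set()
    for child, pa in parents.items():
        for p in pa:                                  # undirect original edges
            edges.add(frozenset((p, child)))
        for p, q in combinations(sorted(pa), 2):      # marry co-parents
            edges.add(frozenset((p, q)))
    return edges

# Example DAG: X3 <- X1, X3 <- X2, X4 <- X2, X4 <- X3
dag = {1: set(), 2: set(), 3: {1, 2}, 4: {2, 3}}
cig = moralize(dag)   # contains the extra married edge {1, 2}
```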

SLIDE 22

Summary of problem definition

◮ We consider random variables (X1, . . . , Xp) with distribution F0,
  where F0 is multivariate Gaussian (or nonparanormal)

◮ We assume that F0 has a perfect map G0

◮ Based on n i.i.d. observations from F0, we want to learn the
  CPDAG of G0

SLIDE 23

Three main approaches for structure learning

◮ Constraint-based:
  ◮ Conditional independencies in the data impose constraints on the
    CPDAG
  ◮ Example: PC algorithm (Spirtes et al. ’93)

◮ Score-based:
  ◮ A score function is optimized over the space of DAGs/CPDAGs
  ◮ Example: greedy equivalence search (GES) (Chickering ’02)

◮ Hybrid:
  ◮ A score function is optimized over a restricted space of
    DAGs/CPDAGs, where the restricted space is determined using
    conditional independence constraints
  ◮ Examples: Max-Min Hill Climbing (MMHC) (Tsamardinos et al. ’06),
    restricted GES (RGES: GES restricted to an estimated CIG)

SLIDE 27

Constraint-based: PC-algorithm

◮ Algorithm:
  ◮ Start with a complete undirected graph
  ◮ Determine the skeleton by conducting many conditional independence
    tests in a clever order, removing an edge if conditional
    independence cannot be rejected at level α (tuning parameter)
  ◮ Determine v-structures
  ◮ Determine the orientation of the remaining edges

◮ High-dimensional consistency results
  (Kalisch & Bühlmann ’08, Harris & Drton ’13, Colombo & M ’14)

◮ Scales well to large graphs: polynomial time complexity in the
  number of nodes for sparse graphs

◮ Frequently applied in high-dimensional settings
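The skeleton phase can be sketched as follows (a toy version, not the actual PC implementation: the test ordering is naive and the level is hard-coded via a normal quantile; function names are illustrative). Each edge is removed as soon as some conditional independence cannot be rejected by a Fisher z-test:

```python
import numpy as np
from itertools import combinations

def partial_corr(C, i, j, S):
    """Partial correlation of Xi, Xj given {Xs : s in S}, read off the
    inverted submatrix of the correlation matrix C."""
    idx = [i, j] + list(S)
    P = np.linalg.inv(C[np.ix_(idx, idx)])
    return -P[0, 1] / np.sqrt(P[0, 0] * P[1, 1])

def pc_skeleton(C, n, crit=3.29):
    """Toy PC skeleton phase: start from the complete graph and remove an
    edge when some CI test is not rejected. crit = 3.29 is the two-sided
    normal quantile for level alpha ~ 0.001 (Fisher z-test)."""
    p = C.shape[0]
    adj = {(i, j) for i, j in combinations(range(p), 2)}
    for size in range(p - 1):                 # growing conditioning sets
        for i, j in sorted(adj):
            others = [k for k in range(p) if k not in (i, j)]
            for S in combinations(others, size):
                r = partial_corr(C, i, j, S)
                z = 0.5 * np.log((1 + r) / (1 - r)) * np.sqrt(n - size - 3)
                if abs(z) < crit:             # CI not rejected: drop edge
                    adj.discard((i, j))
                    break
    return adj

# Chain X1 -> X2 -> X3: the estimated skeleton should be {(0,1), (1,2)}
rng = np.random.default_rng(1)
n = 20_000
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(size=n)
x3 = 0.8 * x2 + rng.normal(size=n)
skel = pc_skeleton(np.corrcoef([x1, x2, x3]), n)
```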


Score-based: GES

◮ Aims to optimize the ℓ0-penalized likelihood score:

  −2 loglikelihood + λ (nr of edges)

◮ Conducts a greedy search over the space of CPDAGs.
  It can move from CPDAG C to CPDAG C′ if some DAG in C and some DAG
  in C′ differ by a single edge addition/deletion.
  [Figure: DAGs G1, . . . , G4 in C and G′1, G′2, G′3 in C′, linked by
  a single edge addition/deletion]

  Hence, the skeletons of C and C′ differ by one edge, but the
  orientations may differ for many edges.

SLIDE 33

Score-based: GES

◮ Aims to optimize the ℓ0-penalized likelihood score:

  −2 loglikelihood + λ (nr of edges)

◮ Conducts a greedy search over the space of CPDAGs:
  ◮ Forward phase: choose the best possible moves with single edge
    additions until the score can no longer be improved
  ◮ Backward phase: choose the best possible moves with single edge
    deletions until the score can no longer be improved

◮ Not guaranteed to find the global optimum for finite samples.
  But, despite the greedy search, it is consistent (for fixed p)
  (Chickering ’02)

◮ No high-dimensional consistency result:
  the global optimum of the ℓ0-penalized likelihood score is consistent
  (Van de Geer & Bühlmann ’13), but there is no algorithm to find the
  global optimum

SLIDE 36

Hybrid: RGES

◮ Idea:
  ◮ The skeleton of the CPDAG is a subgraph of the CIG
  ◮ Estimation of the CIG is well-behaved and well-understood
  ◮ We can use an estimated CIG as a restricted search space

◮ Algorithm:
  ◮ Estimate the CIG, e.g., using neighborhood selection with tuning
    parameter γ (Meinshausen & Bühlmann ’06)
  ◮ Apply GES with tuning parameter λ, only allowing edges from the
    estimated CIG in the forward phase. The backward phase remains
    unchanged.

◮ Consistency unknown even for fixed p
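A low-dimensional sanity check of the restricted search space idea (a sketch, not from the slides): for a Gaussian distribution the CIG can be read off the inverse covariance matrix, since Xi and Xj are CIG-adjacent exactly when the (i, j) precision entry is nonzero; neighborhood selection is the regression-based counterpart of this that also works in high dimensions. The four-node DAG below has moral edge X1 – X2, so the CIG has five edges while the CPDAG skeleton has four:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
e = rng.normal(size=(4, n))

# Example DAG: X3 <- X1, X3 <- X2, X4 <- X2, X4 <- X3
x1, x2 = e[0], e[1]
x3 = 1.4 * x1 + 1.3 * x2 + e[2]
x4 = 1.2 * x2 + 0.9 * x3 + e[3]

# CIG edges = nonzero off-diagonal entries of the precision matrix.
# Marrying the parents X1, X2 of X3 creates the extra edge X1 - X2,
# while X1 and X4 stay non-adjacent (precision entry ~ 0).
P = np.linalg.inv(np.cov([x1, x2, x3, x4]))
cig = {(i, j) for i in range(4) for j in range(i + 1, 4)
       if abs(P[i, j]) > 0.05}   # threshold swallows sampling noise
```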

SLIDE 37

Simulation results: average ROC curves

  n    100   200   300   400
  pn   300   600   1200  2400
  en   2     2.8   3.5   4

[Figure: average ROC curves (true positive rate vs false positive rate)
for GES, RGES and PC, for the skeleton and the directed part, at
p = 300, 600, 1200 and 2400]

ROC curves were obtained by varying the tuning parameters α and λ;
γ was fixed at a sufficiently small value.

SLIDE 38

Simulation results: computation time

  n    100   200   300   400
  pn   300   600   1200  2400
  en   2     2.8   3.5   4

[Figure: average runtime in seconds (log scale, roughly 1–1000 s)
versus the number of variables (300–2400) for GES, PC and RGES]

Tuning parameters α and λ were chosen to get roughly the right
sparsity; γ was fixed at a sufficiently small value.

SLIDE 39

Summary

         speed   estimation    fixed p       high-dimensional
                 performance   consistency   consistency
  PC     ✓       ✗             ✓             ✓
  GES    ✗       ✓             ✓             ?
  RGES   ✓       ✓             ?             ?

SLIDE 40

Example

Consider the following structural equation model:

  X1 ← ε1
  X2 ← ε2
  X3 ← 1.4 X1 + 1.3 X2 + ε3
  X4 ← 1.2 X2 + 0.9 X3 + ε4

where the error variables ε1, . . . , ε4 are i.i.d. N(0, 1).
[Figure: the corresponding DAG with edge weights 1.4, 1.3, 1.2, 0.9]
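A quick simulation of this model (a sketch; the seed and sample size are arbitrary) confirms the implied covariances, e.g. cov(X1, X4) = 0.9 · 1.4 = 1.26 via the directed path X1 → X3 → X4:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
eps = rng.normal(size=(4, n))   # i.i.d. N(0, 1) errors

# The structural equations from the slide
x1 = eps[0]
x2 = eps[1]
x3 = 1.4 * x1 + 1.3 * x2 + eps[2]
x4 = 1.2 * x2 + 0.9 * x3 + eps[3]

# Path X1 -> X3 -> X4 implies cov(X1, X4) = 0.9 * 1.4 = 1.26
c14 = np.cov(x1, x4)[0, 1]
```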

SLIDE 44

Example: oracle output

RGES is inconsistent! The same example shows the inconsistency of MMHC.

SLIDE 49

Visualization

[Figure: the full search space, the restricted search space, and the
true CPDAG]

The true CPDAG lies within the restricted space defined by the CIG, but
the search path of GES may need to leave this space to get there.

SLIDE 50

Summary

         speed   estimation    fixed p       high-dimensional
                 performance   consistency   consistency
  PC     ✓       ✗             ✓             ✓
  GES    ✗       ✓             ✓             ?
  RGES   ✓       ✓             ✗             ✗

SLIDE 54

Solution: adaptively restricted GES (ARGES)

Use an adaptive restriction on the search space:
edges in the estimated CIG + shields of v-structures of the current
state

Intuition: RGES can get stuck at v-structures

SLIDE 56

Adaptively restricted GES (ARGES)

◮ Use an adaptive restriction on the search space:
  edges in the estimated CIG + shields of v-structures of the current
  state

◮ ARGES is consistent for fixed p (Nandy et al. ’16).
  The proof uses a new characterization of independence maps.

◮ We only need a “small” modification to achieve consistency.
  In other words, RGES is “almost consistent”.

◮ The speed and estimation performance of RGES and ARGES
  are very similar

SLIDE 57

Summary

          speed   estimation    fixed p       high-dimensional
                  performance   consistency   consistency
  PC      ✓       ✗             ✓             ✓
  GES     ✗       ✓             ✓             ?
  RGES    ✓       ✓             ✗             ✗
  ARGES   ✓       ✓             ✓             ?

SLIDE 60

Single step score difference and partial correlations

◮ Assume multivariate Gaussianity and the normalized score

  S(G) = −(1/n) loglikelihood + λn (nr of edges)

◮ If G′ = G ∪ {Xi → Xk}, then

  S(G′) − S(G) = (1/2) log(1 − ρ̂²_{ik|pa(k,G)}) + λn

◮ Hence, a move in the forward phase of GES is made only if

  ρ̂²_{ik|pa(k,G)} > 1 − exp(−2λn)
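This score difference and the add-edge threshold fit in a few lines (a sketch; the partial correlation is read off an inverted submatrix of the correlation matrix, and the function name is illustrative):

```python
import numpy as np

def score_gain(C, i, k, pa_k, lam):
    """S(G') - S(G) for adding the edge Xi -> Xk, expressed through the
    sample partial correlation rho_{ik | pa(k, G)} (Gaussian score)."""
    idx = [i, k] + list(pa_k)
    P = np.linalg.inv(C[np.ix_(idx, idx)])
    rho = -P[0, 1] / np.sqrt(P[0, 0] * P[1, 1])
    return 0.5 * np.log(1 - rho ** 2) + lam, rho

# Two standardized variables with correlation 0.5, empty parent set:
C = np.array([[1.0, 0.5], [0.5, 1.0]])
lam = 0.05
gain, rho = score_gain(C, 0, 1, [], lam)

# The edge improves the score (gain < 0) exactly when
# rho^2 > 1 - exp(-2 * lam), matching the threshold above
adds_edge = rho ** 2 > 1 - np.exp(-2 * lam)
```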

SLIDE 62

Connection between (AR)GES and PC

◮ Close connection between (AR)GES and PC:
  ◮ PC starts with the full graph and deletes edges if
    |partial correlation| is small
  ◮ The forward phase of (AR)GES starts with the empty graph and adds
    edges if |partial correlation| is large. The backward phase of
    (AR)GES removes edges if |partial correlation| is small.

◮ Differences between (AR)GES and PC:
  ◮ (AR)GES can recover from mistakes in the forward phase
  ◮ (AR)GES makes optimal single edge moves
  ◮ (AR)GES considers skeleton and orientations simultaneously
  ◮ (AR)GES is not limited to triples of nodes when orienting edges

SLIDE 68

High-dimensional consistency of (AR)GES

◮ In the sample version, we cannot guarantee that (AR)GES makes
  the optimal oracle moves with high probability

◮ But optimal moves are not needed for soundness of the oracle
  version: any move that improves the score is OK

◮ We can guarantee that the sample version of (AR)GES makes a
  “δ-optimal oracle move” with high probability

◮ Consider a high-dimensional scenario like the one considered for PC,
  with the additional assumptions that:
  ◮ P(Î_n ⊇ I_0) → 1 as n → ∞ (for ARGES only)
  ◮ The maximum neighborhood size of the forward phase of any
    δn-optimal oracle (AR)GES version is O(n^{1−b}), 0 < b ≤ 1,
    where δn → 0

◮ Then (AR)GES considers a sub-exponential number of partial
  correlations, and consistency follows as for PC

◮ We do not yet know distributional assumptions that guarantee the
  second assumption

SLIDE 69

Simulation results: average ROC curves

  n    100   200   300   400
  pn   300   600   1200  2400
  en   2     2.8   3.5   4

[Figure: average ROC curves (true positive rate vs false positive rate)
for ARGES*, GES, RGES, ARGES, MMHC and PC, for the skeleton and the
directed part, at p = 300, 600, 1200 and 2400]

ARGES* uses the true CIG. Its good performance shows potential for
improvement.

SLIDE 70

Simulation results: computation time

  n    100   200   300   400
  pn   300   600   1200  2400
  en   2     2.8   3.5   4

[Figure: average runtime in seconds (log scale, roughly 1–5000 s)
versus the number of variables (300–2400) for GES, MMHC, PC, ARGES,
RGES and ARGES*]

SLIDE 71

Summary

          speed   estimation    fixed p       high-dimensional
                  performance   consistency   consistency
  PC      ✓       ✗             ✓             ✓
  GES     ✗       ✓             ✓             ✓
  RGES    ✓       ✓             ✗             ✗
  ARGES   ✓       ✓             ✓             ✓
  MMHC    ✗       ∼             ✗             ✗

SLIDE 72

Summary

◮ Hybrid methods are often a good middle ground between scaling
  properties and estimation performance

◮ Little was known about the (in)consistency of hybrid methods

◮ We showed that MMHC and RGES are inconsistent

◮ We showed that RGES (and variations of it) can be made consistent
  with small adaptations

SLIDE 73

Summary

◮ Score-based and constraint-based methods are usually seen as two
  distinct approaches. We showed a close connection between the two.

◮ This connection allows:
  ◮ insight into the reasons for differences in estimation performance
  ◮ the first high-dimensional consistency proof for score-based and
    hybrid algorithms
  ◮ possibilities for new families of methods, such as GES using rank
    correlations. In fact, we also proved high-dimensional consistency
    of (AR)GES for nonparanormal distributions (cf. Harris & Drton ’13
    for the PC algorithm).

SLIDE 74

Summary

◮ One can also use the skeleton of the CPDAG or the marginal
  independence graph as the restricted search space. Then one has to
  allow shields of all unshielded triples of the current CPDAG in the
  forward phase.

◮ Our work can be combined with other recent work on GES:
  ◮ Selective GES (SGES) (Chickering & Meek, 2015) has a modified
    backward phase that is of polynomial time complexity for sparse
    graphs. One can combine the forward phase of ARGES with the
    backward phase of SGES.
  ◮ There are fast parallel implementations of GES (Ramsey, 2015).
    One can make similar implementations for ARGES.

SLIDE 75

The non-paranormal trick

◮ (g1(X1), . . . , gp(Xp)) has a nonparanormal distribution if
  (X1, . . . , Xp) has a multivariate Gaussian distribution and
  g1, . . . , gp are strictly increasing (or strictly decreasing)
  functions

◮ The sample (Spearman’s or Kendall’s) rank correlation matrix R̂ is
  an estimator of the underlying Gaussian rank correlation matrix R

◮ sin(π R̂ / 2) is a consistent estimator of the underlying Gaussian
  correlation matrix, even in high-dimensional settings
  (Liu et al., 2012)

◮ We showed that the high-dimensional consistency results continue to
  hold in this setting, with sin(π R̂ / 2) as the input for (AR)GES

SLIDE 76

Application to causality: joint-IDA

SLIDE 77

Simulation results: Joint-IDA with ARGES

  # variables      1000
  average degree   4
  sample size      1000

Target set: all triples (i, j, k) for which θ^{(i,j)}_{ik} ≠ 0

SLIDE 79

References

Chickering (2002). Optimal structure identification with greedy search. JMLR.
Chickering & Meek (2015). Selective greedy equivalence search: Finding optimal Bayesian networks using a polynomial number of score evaluations. UAI.
Harris & Drton (2013). PC algorithm for nonparanormal graphical models. JMLR.
Kalisch & Bühlmann (2007). Estimating high-dimensional directed acyclic graphs with the PC-algorithm. JMLR.
Nandy, Hauser & Maathuis (2016). High-dimensional consistency in score-based and hybrid structure learning. arXiv:1507.02608.
Nandy, Maathuis & Richardson (2016). Estimating the effect of joint interventions from observational data in sparse high-dimensional settings. Ann. Statist.
Ramsey (2015). Scaling up greedy equivalence search for continuous variables. arXiv:1507.07749.
Van de Geer & Bühlmann (2013). ℓ0-penalized maximum likelihood for sparse directed acyclic graphs. Ann. Statist.

SLIDE 80

Characterization of independence maps

Theorem
A DAG H is not an independence map of G0 if and only if
1. skeleton(G0) ⊈ skeleton(H), or
2. there exists a triple of nodes (Xi, Xj, Xk) such that Xi and Xk are
   non-adjacent in H, πH(Xi, Xj, Xk) is a non-collider path, and
   πG0(Xi, Xj, Xk) is a v-structure, or
3. there exists a triple of nodes (Xi, Xj, Xk) such that πH(Xi, Xj, Xk)
   is a v-structure and Xi and Xk are not d-separated in G0 given
   PaH(Xk), where without loss of generality we assume Xi ∈ NdH(Xk).

Corollary
We need to allow the following edges in the forward phase to achieve
consistency: (i) edges that exist in G0, (ii) shields of v-structures
in G0, and (iii) shields of v-structures in the current CPDAG.

SLIDE 81

Conditions for high-dimensional consistency

(A1) (Gaussianity) The distribution of Xn is multivariate Gaussian for
     all n.
(A2) (High-dimensional setting) pn = O(n^a) for some 0 ≤ a < ∞.
(A3) (Sparsity condition) Let qn = max_{1≤i≤pn} |Adj_{Cn0}(Xni)| be
     the maximum degree in Cn0. Then qn = O(n^{1−b1}) for some
     0 < b1 ≤ 1.
(A4) (High-dimensionally consistent estimators of the CIG or the
     CPDAG-skeleton) There exists a sequence of estimators Î_n (for
     ARGES-CIG) or a sequence of estimators Û_n (for ARGES-skeleton)
     such that P(Î_n ⊇ I_{n0}) → 1 or P(Û_n ⊇ skeleton(Cn0)) → 1,
     as n → ∞.

SLIDE 82

Conditions for high-dimensional consistency

(A5) (Bounds on the growth of oracle versions) The maximum degree in
     the output of the forward phase of every δn-optimal oracle
     version of (AR)GES based on Î_n (for ARGES-CIG), based on Û_n
     (for ARGES-skeleton), or based on the complete undirected graph
     (for GES) is bounded by Kn qn, for some sequences
     δn^{−1} = O(n^{d1}) and Kn = O(n^{b1−b2}) such that
     0 ≤ d1 < b2/2 and 0 < b2 ≤ b1, where qn and b1 are given by (A3).
(A6) (Bounds on partial correlations) The partial correlations
     ρn,ij|S between Xni and Xnj given {Xnr : r ∈ S} satisfy the
     following upper and lower bounds for all n, uniformly over
     i, j ∈ {1, . . . , pn} and S ⊆ {1, . . . , pn} \ {i, j} with
     |S| ≤ Kn qn (where Kn and qn are from (A3) and (A5)):

       sup_{i≠j, S} |ρn,ij|S| ≤ M < 1,  and
       inf_{i,j,S} {|ρn,ij|S| : ρn,ij|S ≠ 0} ≥ cn,

     with cn^{−1} = O(n^{d2}) for some 0 ≤ d2 < b2/2, where
     0 < b2 ≤ b1 ≤ 1 are as in (A5) and (A3).