High-Dimensional Graphical Model Selection. Anima Anandkumar, U.C. Irvine. PowerPoint PPT presentation.



SLIDE 1

High-Dimensional Graphical Model Selection

Anima Anandkumar

U.C. Irvine

Joint work with Vincent Tan (U. Wisc.) and Alan Willsky (MIT).


SLIDE 5

Graphical Models: Definition

Conditional Independence: X_A ⊥⊥ X_B | X_S

Factorization:

P(x) ∝ exp( Σ_{(i,j)∈G} Ψ_{i,j}(x_i, x_j) ).

[Figure: word cloud of soccer teams (Everton, Arsenal, Chelsea, Manchester United) and baseball teams (Mets, Red Sox, Yankees, Dodgers, Phillies), illustrating node groups A, S, B.]

Tree-Structured Graphical Models

For the star tree on nodes 1, 2, 3, 4 with center node 1:

P(x) = Π_{i∈V} P_i(x_i) · Π_{(i,j)∈E} [ P_{i,j}(x_i, x_j) / (P_i(x_i) P_j(x_j)) ] = P_1(x_1) P_{2|1}(x_2|x_1) P_{3|1}(x_3|x_1) P_{4|1}(x_4|x_1).
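The star-tree factorization can be verified numerically on a toy example. The sketch below (the binary potentials are hypothetical, chosen only for illustration) builds the joint from the conditional form on the right-hand side and checks that the general tree factorization reproduces it on all 16 states:

```python
import itertools

# Hypothetical star tree on nodes 0..3 (center 0), binary variables.
# Joint defined via the conditional form P0 * P(leaf|x0) for each leaf.
p0 = {0: 0.6, 1: 0.4}                               # P0(x0)
cond = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}   # P(x_leaf | x0), same per leaf

def joint(x):
    x0, x1, x2, x3 = x
    return p0[x0] * cond[x0][x1] * cond[x0][x2] * cond[x0][x3]

states = list(itertools.product([0, 1], repeat=4))

def marginal(i):
    out = {0: 0.0, 1: 0.0}
    for x in states:
        out[x[i]] += joint(x)
    return out

def pair_marginal(i, j):
    out = {}
    for x in states:
        out[(x[i], x[j])] = out.get((x[i], x[j]), 0.0) + joint(x)
    return out

P = [marginal(i) for i in range(4)]
edges = [(0, 1), (0, 2), (0, 3)]
Pe = {e: pair_marginal(*e) for e in edges}

# Tree factorization: prod_i P_i(x_i) * prod_{(i,j) in E} P_ij / (P_i P_j)
for x in states:
    val = 1.0
    for i in range(4):
        val *= P[i][x[i]]
    for (i, j) in edges:
        val *= Pe[(i, j)][(x[i], x[j])] / (P[i][x[i]] * P[j][x[j]])
    assert abs(val - joint(x)) < 1e-12
print("tree factorization verified on all 16 states")
```

The identity holds for any distribution that is Markov on a tree, which the construction guarantees here.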

SLIDE 7

Structure Learning of Graphical Models

Given a graphical model on p nodes and n i.i.d. samples from the multivariate distribution, output the estimated structure Ĝⁿ.

Structural Consistency: lim_{n→∞} P( Ĝⁿ ≠ G ) = 0.

Challenge: High Dimensionality ("Data-Poor" Regime)
Large p, small n regime (p ≫ n). Sample complexity: the number of samples required to achieve consistency.

Challenge: Computational Complexity

Goal: Address the above challenges and provide provable guarantees.


SLIDE 12

Tree Graphical Models: Tractable Learning

Maximum likelihood learning of tree structure, proposed by Chow and Liu (68): a maximum-weight spanning tree problem.

T̂_ML = arg max_T Σ_{k=1}^n log P(x_V^{(k)})    ⟺    T̂_ML = arg max_T Σ_{(i,j)∈T} Î_n(X_i; X_j),

where Î_n is the empirical mutual information.

Pairwise statistics suffice for ML. With n samples and p nodes, the sample complexity is logarithmic: n = O(log p).

What other classes of graphical models are tractable for learning?
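The Chow-Liu procedure can be sketched in a few lines: compute empirical pairwise mutual informations and take a maximum-weight spanning tree. This is a minimal illustration, not the authors' code; the sampling model and function names are invented for the example, and Kruskal's algorithm stands in for any maximum-weight spanning tree routine:

```python
import math
import random
from collections import Counter
from itertools import combinations

def empirical_mi(samples, i, j):
    """Empirical mutual information I_n(X_i; X_j) from discrete samples."""
    n = len(samples)
    pij = Counter((x[i], x[j]) for x in samples)
    pi = Counter(x[i] for x in samples)
    pj = Counter(x[j] for x in samples)
    mi = 0.0
    for (a, b), c in pij.items():
        mi += (c / n) * math.log(c * n / (pi[a] * pj[b]))
    return mi

def chow_liu(samples, p):
    """Max-weight spanning tree over pairwise empirical MI (Kruskal)."""
    weights = sorted(((empirical_mi(samples, i, j), i, j)
                      for i, j in combinations(range(p), 2)), reverse=True)
    parent = list(range(p))
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u
    tree = []
    for w, i, j in weights:
        ri, rj = find(i), find(j)
        if ri != rj:                 # adding (i, j) creates no cycle
            parent[ri] = rj
            tree.append((i, j))
    return tree

# Toy usage: a chain 0-1-2 where each node copies its parent with prob. 0.9.
random.seed(0)
samples = []
for _ in range(2000):
    x0 = random.choice([-1, 1])
    x1 = x0 if random.random() < 0.9 else -x0
    x2 = x1 if random.random() < 0.9 else -x1
    samples.append((x0, x1, x2))
tree = chow_liu(samples, 3)
# recovers the chain edges (0, 1) and (1, 2) with high probability
print(sorted(tuple(sorted(e)) for e in tree))
```

On the toy chain the two adjacent pairs carry the largest mutual information, so the spanning tree recovers the chain; only pairwise statistics are ever computed, matching the slide's point.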


SLIDE 16

Learning Graphical Models Beyond Trees

Challenges

Presence of cycles:
◮ Pairwise statistics no longer suffice
◮ Likelihood function not tractable:

P(x) = (1/Z) exp( Σ_{(i,j)∈G} Ψ_{i,j}(x_i, x_j) ).

Presence of high-degree nodes:
◮ Brute-force search not tractable

Can we provide learning guarantees under the above conditions?

Our Perspective: Tractable Graph Families
Characterize the class of tractable families, incorporating all of the above challenges. Relevant for real datasets, e.g., social-network data.

SLIDE 17

Related Work in Structure Learning

Algorithms for Structure Learning: Chow and Liu (68); Meinshausen and Buehlmann (06); Bresler, Mossel and Sly (09); Ravikumar, Wainwright and Lafferty (10); . . .

Approaches Employed: EM/search approaches; combinatorial/greedy approaches; convex relaxation, . . .

SLIDE 18

Outline

1. Introduction
2. Tractable Graph Families
3. Structure Estimation in Graphical Models
4. Method and Guarantees
5. Conclusion


SLIDE 21

Intuitions: Conditional Mutual Information Test

Separators in Graphical Models: S separates nodes i and j.

X_i ⊥⊥ X_j | X_S  ⟺  I(X_i; X_j | X_S) = 0

Observations: a ∆-separator exists for graphs with maximum degree ∆.
◮ Brute-force search for the separator: arg min_{|S|≤∆} I(X_i; X_j | X_S)
◮ Computational complexity scales as O(p^∆)

Approximate separators in general graphs?


SLIDE 23

Intuitions: Conditional Mutual Information Test

With an approximate separator S between i and j,

X_i ⊥⊥ X_j | X_S (approximately)  ⟹  I(X_i; X_j | X_S) ≈ 0

Observations: a ∆-separator exists for graphs with maximum degree ∆.
◮ Brute-force search for the separator: arg min_{|S|≤∆} I(X_i; X_j | X_S)
◮ Computational complexity scales as O(p^∆)

Approximate separators in general graphs?
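The separator equivalence can be seen exactly in a small Gaussian model, where I(X_i; X_j | X_S) = −(1/2) log(1 − ρ²) for the conditional correlation ρ. In the sketch below (the 3-node chain and its edge weight 0.3 are hypothetical), the precision matrix is tridiagonal, so the middle node separates the endpoints and the conditional mutual information vanishes up to floating-point error:

```python
import math
import numpy as np

# Hypothetical Gaussian graphical model on a chain 0 - 1 - 2:
# tridiagonal precision matrix, so X0 and X2 are conditionally
# independent given the separator S = {1}.
J = np.array([[1.0, 0.3, 0.0],
              [0.3, 1.0, 0.3],
              [0.0, 0.3, 1.0]])
Sigma = np.linalg.inv(J)  # covariance matrix

def cond_mi_gauss(Sigma, i, j, S):
    """I(X_i; X_j | X_S) for a Gaussian, via the conditional covariance."""
    idx = [i, j]
    if S:
        A = Sigma[np.ix_(idx, idx)]
        B = Sigma[np.ix_(idx, S)]
        C = Sigma[np.ix_(S, S)]
        cond = A - B @ np.linalg.inv(C) @ B.T   # Schur complement
    else:
        cond = Sigma[np.ix_(idx, idx)]
    rho = cond[0, 1] / math.sqrt(cond[0, 0] * cond[1, 1])
    return -0.5 * math.log(1.0 - rho ** 2)

print(cond_mi_gauss(Sigma, 0, 2, []))    # unconditional: strictly positive
print(cond_mi_gauss(Sigma, 0, 2, [1]))   # conditioned on separator: ~ 0
```

Unconditionally the endpoints are correlated through the path 0-1-2, so the first quantity is positive; conditioning on the separator drives it to zero, which is exactly what the test exploits.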

SLIDE 24

Tractable Graph Families: Local Separation

γ-Local Separator S_γ(i, j): minimal vertex separator with respect to paths of length less than γ.

(η, γ)-Local Separation Property for Graph G: |S_γ(i, j)| ≤ η for all (i, j) ∉ G.

Locally tree-like: Erdős-Rényi graphs, power-law/scale-free graphs.

Small-world graphs: Watts-Strogatz model, hybrid/augmented graphs.
slide-25
SLIDE 25

Outline

1

Introduction

2

Tractable Graph Families

3

Structure Estimation in Graphical Models

4

Method and Guarantees

5

Conclusion


SLIDE 30

Setup: Ising and Gaussian Graphical Models

n i.i.d. samples available for structure estimation.

Ising model:    P(x) ∝ exp( (1/2) xᵀ J_G x + hᵀ x ),  x ∈ {−1, 1}^p.
Gaussian model: f(x) ∝ exp( −(1/2) xᵀ J_G x + hᵀ x ),  x ∈ ℝ^p.

For (i, j) ∈ G: J_min ≤ |J_{i,j}| ≤ J_max.

Graph G satisfies the (η, γ)-local separation property.

Tradeoff between η, γ, J_min, J_max for tractable learning.

SLIDE 33

Regime of Tractable Learning: Efficient Learning Under Approximate Separation

Maximum edge potential J_max of the Ising model satisfies J_max < J∗, where J∗ is the threshold for the phase transition in conditional uniqueness.

The Gaussian model is α-walk-summable: ‖R_G‖ ≤ α < 1, where R_G is the absolute partial correlation matrix and J_G = I − R_G.

Tractable Parameter Regime for Structure Learning
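The α-walk-summability condition can be checked numerically for a given precision matrix: normalize J to unit diagonal, take entrywise absolute values of the off-diagonal part to get R_G, and test whether its spectral norm is below 1. A sketch under these assumptions (the 4-cycle example and its edge weight are illustrative):

```python
import numpy as np

def walk_summability_alpha(J):
    """Spectral norm of the absolute partial correlation matrix R_G,
    where the unit-diagonal normalization of J equals I - R_G up to signs."""
    D = np.sqrt(np.diag(J))
    Jn = J / np.outer(D, D)            # normalize to unit diagonal
    R = np.abs(Jn - np.eye(len(J)))    # absolute partial correlations
    return np.linalg.norm(R, 2)        # largest singular value

# Hypothetical 4-cycle Gaussian model with edge weight 0.2.
J = np.eye(4)
for i, j in [(0, 1), (1, 2), (2, 3), (3, 0)]:
    J[i, j] = J[j, i] = 0.2
alpha = walk_summability_alpha(J)
print(alpha, alpha < 1)
```

For the 4-cycle with edge weight 0.2 the spectral norm of R_G is 0.4 (the cycle's adjacency matrix has spectral norm 2), so this model is α-walk-summable with α = 0.4.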

SLIDE 37

Tractable Graph Families and Regimes

Graph G satisfies the (η, γ)-local separation property with η = O(1).

Maximum edge potential J_max satisfies α := tanh(J_max)/tanh(J∗) < 1 (Ising), or ‖R_G‖ ≤ α < 1 (Gaussian).

Minimum edge potential J_min is sufficiently strong: J_min · α^{−γ} = ω(1).

Edge potentials are generic.

SLIDE 38

Example: girth g, maximum degree ∆

Structural criterion: the (η, γ)-local separation property is satisfied with η = 1, γ = g.

Parameter criterion: the maximum edge potential satisfies J_max < J∗ = atanh(∆⁻¹), with α := tanh(J_max)/tanh(J∗).

Tradeoff: the minimum edge potential must satisfy J_min · α^{−g} = ω(1). For example, when J_min = Θ(∆⁻¹), this requires ∆ · α^g = o(1).

The learnability regime involves a tradeoff between degree and girth.
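To get a feel for the degree-girth tradeoff, one can plug in numbers: compute J∗ = atanh(1/∆), pick some J_max below it, and evaluate both J_min · α^{−g} (which must diverge) and ∆ · α^g (which must vanish). The specific values below are illustrative only:

```python
import math

def tradeoff(delta, girth, j_max, j_min):
    """Evaluate alpha = tanh(J_max)/tanh(J*) and the two quantities that
    control learnability: J_min / alpha^g (should be large) and
    delta * alpha^g (should be small)."""
    j_star = math.atanh(1.0 / delta)   # phase-transition threshold
    assert j_max < j_star, "need J_max < J*"
    alpha = math.tanh(j_max) / math.tanh(j_star)
    return alpha, j_min / alpha ** girth, delta * alpha ** girth

# Illustrative numbers: degree 4, girth 8, J_max at half the threshold.
delta = 4
j_star = math.atanh(1.0 / delta)
alpha, grow, shrink = tradeoff(delta, 8, 0.5 * j_star, 1.0 / delta)
print(f"alpha={alpha:.3f}  J_min/alpha^g={grow:.1f}  delta*alpha^g={shrink:.4f}")
```

Increasing the girth at fixed degree shrinks ∆ · α^g geometrically, which is the sense in which a larger girth compensates for a larger degree.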

SLIDE 39

Outline

1. Introduction
2. Tractable Graph Families
3. Structure Estimation in Graphical Models
4. Method and Guarantees
5. Conclusion

SLIDE 42

Algorithm for Structure Learning

Conditional Mutual Information Thresholding (CMIT)

Estimate conditional mutual information empirically from the samples and attempt to search for an approximate separator of size at most η:

(i, j) ∈ Ĝⁿ  if  min_{S⊂V\{i,j}, |S|≤η} Î_n(X_i; X_j | X_S) > ξ_{n,p}

Threshold ξ_{n,p}: depends only on the number of samples n and the number of nodes p, with

ξ_{n,p} = O(J_min²) ∩ ω(α^{2γ}) ∩ Ω(log p / n).

[Figure: conditional mutual information vs. number of nodes; edges lie above the threshold ξ_{n,p}, non-edges below.]

Local Test Using Low-order Statistics
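A minimal sketch of the CMIT rule for the Gaussian case, reusing the identity I(X_i; X_j | X_S) = −(1/2) log(1 − ρ²) with empirical covariances. This is an illustration under assumptions, not the paper's implementation: the chain model, threshold value, and helper names are all invented for the example:

```python
import math
from itertools import combinations
import numpy as np

def cond_mi_gauss(S_hat, i, j, cond):
    """Empirical Gaussian conditional MI via the conditional covariance."""
    A = S_hat[np.ix_([i, j], [i, j])]
    if cond:
        B = S_hat[np.ix_([i, j], list(cond))]
        C = S_hat[np.ix_(list(cond), list(cond))]
        A = A - B @ np.linalg.inv(C) @ B.T
    rho = A[0, 1] / math.sqrt(A[0, 0] * A[1, 1])
    return -0.5 * math.log(max(1.0 - rho ** 2, 1e-12))

def cmit(samples, eta, xi):
    """Declare (i,j) an edge iff min over |S| <= eta of I_n(Xi;Xj|XS) > xi."""
    n, p = samples.shape
    S_hat = np.cov(samples, rowvar=False)
    edges = set()
    for i, j in combinations(range(p), 2):
        others = [k for k in range(p) if k not in (i, j)]
        sets = [()] + [s for r in range(1, eta + 1)
                       for s in combinations(others, r)]
        if min(cond_mi_gauss(S_hat, i, j, s) for s in sets) > xi:
            edges.add((i, j))
    return edges

# Illustrative run: Gaussian chain 0-1-2-3 (tridiagonal precision matrix).
rng = np.random.default_rng(0)
J = np.eye(4) + np.diag([0.4] * 3, 1) + np.diag([0.4] * 3, -1)
samples = rng.multivariate_normal(np.zeros(4), np.linalg.inv(J), size=4000)
print(sorted(cmit(samples, eta=1, xi=0.02)))  # chain edges survive thresholding
```

With η = 1 the test conditions on single nodes, which is enough to separate any non-adjacent pair of a chain, so only the true chain edges should exceed the threshold; the test remains local and uses only low-order (here pairwise and triple-wise) statistics.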

SLIDE 43

Guarantees on Conditional Mutual Information Test

(i, j) ∈ Ĝⁿ  if  min_{S⊂V\{i,j}, |S|≤η} Î_n(X_i; X_j | X_S) > ξ_{n,p}

Ising/Gaussian graphical model on p nodes, with the number of samples n such that n = Ω(J_min⁻⁴ log p).

Theorem: CMIT is structurally consistent:

lim_{p,n→∞, n = Ω(J_min⁻⁴ log p)} P( Ĝⁿ_p ≠ G_p ) = 0.

(The probability measure is over both the graph and the samples.)


SLIDE 45

Lower Bound on Sample Complexity

Erdős-Rényi random graph G ∼ G(p, c/p).

Theorem: For any estimator Ĝⁿ_p with lim_{n→∞} P( Ĝⁿ_p ≠ G_p ) = 0, it is necessary that:

Discrete distribution over alphabet X:  n ≥ c log₂ p / (2 log₂ |X|).
Gaussian with α-walk summability:  n ≥ c log₂ p / log₂( 2πe ( 1/(1−α) + 1 ) ).

Proof Techniques: Fano's inequality over typical graphs; characterize the typical graphs for the Erdős-Rényi ensemble.

Ω(c log p) samples needed for random-graph structure estimation.
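The discrete bound is easy to evaluate: with c = 3, p = 1024 and a binary alphabet (|X| = 2), it gives n ≥ (3 · 10)/(2 · 1) = 15. The same arithmetic in code (the parameter values are illustrative):

```python
import math

def discrete_lower_bound(c, p, alphabet_size):
    """n >= c * log2(p) / (2 * log2(|X|)), from the theorem above."""
    return c * math.log2(p) / (2 * math.log2(alphabet_size))

def gaussian_lower_bound(c, p, alpha):
    """n >= c * log2(p) / log2(2*pi*e*(1/(1-alpha) + 1))."""
    return c * math.log2(p) / math.log2(2 * math.pi * math.e * (1 / (1 - alpha) + 1))

print(discrete_lower_bound(3, 1024, 2))          # 15.0
print(round(gaussian_lower_bound(3, 1024, 0.5), 2))
```

Both bounds grow only logarithmically in p, matching the Ω(c log p) statement on the slide.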

SLIDE 49

Proof Ideas

(i, j) ∈ Ĝⁿ  if  min_{S⊂V\{i,j}, |S|≤η} Î_n(X_i; X_j | X_S) > ξ_{n,p}

Correctness of the algorithm under exact statistics; consistency under the prescribed sample complexity.
◮ Concentration bounds for empirical quantities

Analysis for non-neighbors: bound the conditional mutual information upon conditioning on the local separator, and derive its rate of decay.
◮ Self-avoiding-walk tree analysis for Ising models
◮ Walk-sum analysis for Gaussian models

Analysis for neighbors: lower bound under generic edge potentials.

Consistent Graph Estimation Under Local Separation

SLIDE 50

Outline

1. Introduction
2. Tractable Graph Families
3. Structure Estimation in Graphical Models
4. Method and Guarantees
5. Conclusion

SLIDE 51

Summary and Outlook

Summary: a local algorithm based on low-order statistics; transparent assumptions; logarithmic sample complexity.

Outlook: Is structure learning beyond this regime hard? Connections with incoherence conditions. Structure learning with latent variables.

References:
A. Anandkumar, V. Tan and A. Willsky, "High-Dimensional Structure Learning of Ising Models: Tractable Graph Families," arXiv:1107.1736.
A. Anandkumar, V. Tan and A. Willsky, "High-Dimensional Gaussian Graphical Model Selection: Tractable Graph Families," arXiv:1107.1270.