
Learning Gaussian Tree Models: Analysis of Error Exponents and Extremal Structures

Vincent Tan, Animashree Anandkumar, Alan Willsky

Stochastic Systems Group, Laboratory for Information and Decision Systems, Massachusetts Institute of Technology

Allerton Conference (Sep 30, 2009)



Motivation

Given a set of i.i.d. samples drawn from p, a Gaussian tree model, we want to learn the structure of the tree. Example: inferring the structure of phylogenetic trees from observed data (Carlson et al. 2008, PLoS Comp. Biol.).


More motivation

What is the exact rate of decay of the probability of error?

How do the structure and parameters of the model influence the error exponent (rate of decay)?

What are the extremal tree distributions for learning?

Consistency is well established (Chow and Wagner 1973); the error exponent is a quantitative measure of the "goodness" of learning.


Main Contributions

1. Provide the exact rate of decay of the error probability for a given p.

2. Show that the rate of decay is approximately an SNR for learning.

3. Characterize the extremal tree structures for learning: stars have the slowest rate, Markov chains the fastest.


Notation and Background

p = N(0, Σ): d-dimensional Gaussian tree model. Samples x^n = {x_1, x_2, . . . , x_n} drawn i.i.d. from p. p is Markov on T_p = (V, E_p), a tree, and factorizes according to T_p. For example, for a star with center node 1 and leaves 2, 3, 4:

$$p(\mathbf{x}) = p_1(x_1)\,\frac{p_{1,2}(x_1, x_2)}{p_1(x_1)}\,\frac{p_{1,3}(x_1, x_3)}{p_1(x_1)}\,\frac{p_{1,4}(x_1, x_4)}{p_1(x_1)},$$

and Σ^{−1} has the sparsity pattern of the tree: nonzero off-diagonal entries only in positions (i, j) with (i, j) ∈ E_p.
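To make the factorization concrete, here is a minimal sketch (not from the talk) that builds Σ for a Gaussian tree from its edge correlations, assuming unit variances; tree_covariance is an illustrative name. Non-edge correlations are products of edge correlations along paths, and Σ^{−1} comes out tree-sparse.

```python
import numpy as np

def tree_covariance(d, edges, rho):
    """Covariance of a unit-variance Gaussian tree model: rho_{i,j} for a
    non-edge (i, j) is the product of edge correlations along the path."""
    adj = {v: [] for v in range(d)}
    for (u, v), r in zip(edges, rho):
        adj[u].append((v, r))
        adj[v].append((u, r))
    Sigma = np.eye(d)
    for root in range(d):
        stack, seen = [(root, 1.0)], {root}
        while stack:                      # DFS, multiplying correlations
            u, r_u = stack.pop()
            for v, r_uv in adj[u]:
                if v not in seen:
                    seen.add(v)
                    Sigma[root, v] = Sigma[v, root] = r_u * r_uv
                    stack.append((v, r_u * r_uv))
    return Sigma

# Star with center 0: the inverse covariance is nonzero only on the
# diagonal and in row/column 0, matching the sparsity pattern above.
Sigma = tree_covariance(4, [(0, 1), (0, 2), (0, 3)], [0.8, 0.7, 0.6])
print(np.round(np.linalg.inv(Sigma), 4))
```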


Max-Likelihood Learning of Tree Distributions (Chow-Liu)

Denote p̂ = p̂_{x^n} as the empirical distribution of x^n, i.e., p̂(x) := N(x; 0, Σ̂), where Σ̂ is the empirical covariance matrix of x^n, and p̂_e is the empirical distribution on edge e.

ML learning reduces to a max-weight spanning tree problem (Chow-Liu 1968):

$$E_{CL}(x^n) = \operatorname*{argmax}_{E_q :\, q \in \text{Trees}} \; \sum_{e \in E_q} I(\hat{p}_e),$$

where I(p̂_e) := I(X_i; X_j) is the mutual information of the empirical distribution on edge e = (i, j).
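A minimal sketch of this procedure for Gaussian data, assuming zero-mean samples in an n × d array; the function name and the small constant guarding the logarithm are illustrative choices, not from the talk. For Gaussians, I(p̂_e) = −(1/2) log(1 − ρ̂_e²), so only the empirical correlations are needed.

```python
import numpy as np

def chow_liu_gaussian(X):
    """Max-weight spanning tree with empirical Gaussian MIs as edge weights."""
    n, d = X.shape
    S = X.T @ X / n                                    # empirical covariance
    R = S / np.sqrt(np.outer(np.diag(S), np.diag(S)))  # empirical correlations
    mi = -0.5 * np.log(1.0 - R ** 2 + 1e-12)           # I = -0.5 log(1 - rho^2)
    edges = sorted(((mi[i, j], i, j) for i in range(d) for j in range(i + 1, d)),
                   reverse=True)                       # heaviest first (Kruskal)
    parent = list(range(d))
    def find(u):                                       # union-find, path halving
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u
    tree = []
    for _, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                                   # keep edge if no cycle
            parent[ri] = rj
            tree.append((i, j))
    return tree
```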


Max-Likelihood Learning of Tree Distributions

True MIs {I(p_e)} → max-weight spanning tree → E_p.

Empirical MIs {I(p̂_e)} from x^n → max-weight spanning tree → E_CL(x^n), which we hope equals E_p.


Problem Statement

The estimated edge set is E_CL(x^n) and the error event is {E_CL(x^n) ≠ E_p}.

Find and analyze the error exponent K_p:

$$K_p := \lim_{n \to \infty} -\frac{1}{n} \log P\big(E_{CL}(x^n) \neq E_p\big).$$

Alternatively, P(E_CL(x^n) ≠ E_p) ≐ exp(−n K_p).
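One way to see the exponent empirically is a small simulation, sketched below under illustrative settings; it reuses the tree_covariance and chow_liu_gaussian helpers from the earlier sketches. The slope of −log P_err against n approximates K_p.

```python
import numpy as np

def error_probability(edges, rho, n, trials=2000, seed=0):
    """Monte Carlo estimate of P(E_CL(x^n) != E_p) for a Gaussian tree."""
    rng = np.random.default_rng(seed)
    d = len(edges) + 1                     # a tree on d nodes has d - 1 edges
    Sigma = tree_covariance(d, edges, rho)
    true_set = {frozenset(e) for e in edges}
    errors = 0
    for _ in range(trials):
        X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
        errors += ({frozenset(e) for e in chow_liu_gaussian(X)} != true_set)
    return errors / trials

# Example: the error probability of a 4-node chain decays exponentially in n.
for n in (50, 100, 200):
    print(n, error_probability([(0, 1), (1, 2), (2, 3)], [0.6, 0.6, 0.6], n))
```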


The Crossover Rate I

Consider two node pairs $e, e' \in \binom{V}{2}$ with joint distribution p_{e,e′}, such that I(p_e) > I(p_{e′}), and the crossover event {I(p̂_e) ≤ I(p̂_{e′})}.

Definition: Crossover Rate

$$J_{e,e'} := \lim_{n \to \infty} -\frac{1}{n} \log P\big(\{I(\hat{p}_e) \le I(\hat{p}_{e'})\}\big).$$

This event may potentially lead to an error in structure learning. Why? If e is a true edge and e′ a non-edge whose empirical mutual information overtakes it, the max-weight spanning tree can include e′ in place of e.


The Crossover Rate II

Theorem

The crossover rate is

$$J_{e,e'} = \inf_{q \in \text{Gaussians}} \big\{ D(q \,\|\, p_{e,e'}) : I(q_{e'}) = I(q_e) \big\}.$$

By assumption, I(p_e) > I(p_{e′}).

[Figure: p_{e,e′} is projected onto the constraint set {I(q_e) = I(q_{e′})}; the minimizer q*_{e,e′} attains the divergence D(q*_{e,e′} ‖ p_{e,e′}).]


Error Exponent for Structure Learning II

$$P\big(E_{CL}(x^n) \neq E_p\big) \doteq \exp(-n K_p).$$

Theorem (First Result)

$$K_p = \min_{e' \notin E_p} \; \min_{e \in \mathrm{Path}(e'; E_p)} J_{e,e'}.$$

Only crossovers between a non-edge e′ and the true edges on its path Path(e′; E_p) matter, and the smallest such crossover rate dominates.


Approximating the Crossover Rate I

Definition: p_{e,e′} satisfies the very noisy learning condition if ||ρ_e| − |ρ_{e′}|| < ε, which implies I(p_e) ≈ I(p_{e′}) since the Gaussian mutual information depends only on the correlation coefficient. In this regime we can apply Euclidean Information Theory (Borade and Zheng 2007).


Approximating the Crossover Rate II

Theorem (Second Result)

The approximate crossover rate is

$$\tilde{J}_{e,e'} = \frac{\big(I(p_{e'}) - I(p_e)\big)^2}{2 \,\mathrm{Var}(s_{e'} - s_e)},$$

where s_e is the information density on edge e = (i, j):

$$s_e(x_i, x_j) = \log \frac{p_{i,j}(x_i, x_j)}{p_i(x_i)\, p_j(x_j)}.$$

The approximate error exponent is

$$\tilde{K}_p = \min_{e' \notin E_p} \; \min_{e \in \mathrm{Path}(e'; E_p)} \tilde{J}_{e,e'}.$$
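A minimal numerical sketch of these two formulas, under stated assumptions: unit variances (so each pairwise correlation is the corresponding entry of Σ), the Gaussian information density in closed form, and Var(s_{e′} − s_e) estimated by Monte Carlo. It reuses tree_covariance from the earlier sketch; all function names are illustrative.

```python
import numpy as np
from itertools import combinations

def gaussian_mi(rho):
    """I(p_e) for a unit-variance Gaussian pair with correlation rho."""
    return -0.5 * np.log(1.0 - rho ** 2)

def info_density(x, y, rho):
    """s_e(x_i, x_j) = log p_ij / (p_i p_j) for unit-variance Gaussians."""
    return gaussian_mi(rho) + rho * (2*x*y - rho*(x**2 + y**2)) / (2*(1 - rho**2))

def approx_crossover_rate(X, Sigma, e, ep):
    """J~_{e,e'} = (I(p_e') - I(p_e))^2 / (2 Var(s_e' - s_e))."""
    (i, j), (k, l) = e, ep
    diff = (info_density(X[:, k], X[:, l], Sigma[k, l])
            - info_density(X[:, i], X[:, j], Sigma[i, j]))
    return (gaussian_mi(Sigma[k, l]) - gaussian_mi(Sigma[i, j])) ** 2 / (2 * diff.var())

def tree_path(edges, a, b):
    """Edges on the unique path from a to b in the tree."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    def dfs(u, prev, path):
        if u == b:
            return path
        for w in adj[u]:
            if w != prev:
                found = dfs(w, u, path + [(u, w)])
                if found is not None:
                    return found
        return None
    return dfs(a, None, [])

def approx_error_exponent(edges, Sigma, n_mc=100_000, seed=0):
    """K~_p: min over non-edges e' of min over e in Path(e'; E_p) of J~."""
    d = len(Sigma)
    X = np.random.default_rng(seed).multivariate_normal(np.zeros(d), Sigma, n_mc)
    edge_set = {frozenset(e) for e in edges}
    K = np.inf
    for a, b in combinations(range(d), 2):
        if frozenset((a, b)) not in edge_set:          # e' is a non-edge
            for e in tree_path(edges, a, b):
                K = min(K, approx_crossover_rate(X, Sigma, e, (a, b)))
    return K
```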


Correlation Decay

[Figure: a 4-node chain x_1 − x_2 − x_3 − x_4 with edge correlations ρ_{1,2}, ρ_{2,3}, ρ_{3,4} and dotted non-edge correlations ρ_{1,3}, ρ_{1,4}, where ρ_{i,j} = E[x_i x_j].]

Markov property ⇒ ρ_{1,3} = ρ_{1,2} × ρ_{2,3}. Correlation decay ⇒ |ρ_{1,4}| ≤ |ρ_{1,3}|.

So (1, 4) is not likely to be mistaken for a true edge; we only need to consider triangles in the true tree.
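A quick numeric instance of these two facts, with illustrative correlation values:

$$\rho_{1,3} = \rho_{1,2}\,\rho_{2,3} = 0.8 \times 0.7 = 0.56, \qquad |\rho_{1,4}| = |\rho_{1,3}\,\rho_{3,4}| = 0.56 \times 0.6 = 0.336 \le |\rho_{1,3}|.$$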


Extremal Structures I

Fix ρ, a vector of correlation coefficients on the tree, e.g. ρ := [ρ_1, ρ_2, ρ_3, ρ_4, ρ_5] on a tree with 6 nodes. Which structures give the highest and lowest exponents?


Extremal Structures II

Theorem (Main Result)

Worst: The star minimizes K_p: K_star ≤ K_p.

Best: The Markov chain maximizes K_p: K_chain ≥ K_p.

[Figure: two 5-node chains, one with correlations ρ_1, . . . , ρ_4 and one with the permuted correlations ρ_{π(1)}, . . . , ρ_{π(4)}.]
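As a sanity check, the extremal ordering can be probed numerically with the helpers sketched earlier (tree_covariance, approx_error_exponent); the correlation values below are illustrative.

```python
# Hypothetical comparison for d = 5: the star should give the smallest
# (approximate) exponent and the chain the largest, for the same rho vector.
rho = [0.4, 0.5, 0.6, 0.7]
chain = [(0, 1), (1, 2), (2, 3), (3, 4)]
star = [(0, 1), (0, 2), (0, 3), (0, 4)]
K_chain = approx_error_exponent(chain, tree_covariance(5, chain, rho))
K_star = approx_error_exponent(star, tree_covariance(5, star, rho))
print(K_star <= K_chain)   # expected: True
```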


Extremal Structures III

Chain, Star and Hybrid Graphs for d = 10.

[Figure: simulated error probability (left) and simulated error exponent (right) versus the number of samples n, from 10^3 to 10^4, for the chain, hybrid, and star graphs.]

Plot of the error probability and error exponent for the 3 tree graphs.


Extremal Structures IV

Remarks: This is a universal result: the extremal structures with respect to tree diameter are also the extremal structures for learning. It corroborates our intuition about correlation decay.


Extensions

Significant reduction of complexity in computing the error exponent.

Finding the best distributions for a fixed ρ.

Effect of adding and deleting nodes and edges on the error exponent.


Conclusion

1. Found the rate of decay of the error probability using large deviations.

2. Used Euclidean Information Theory to obtain an SNR-like expression for the crossover rate.

3. We can now say which structures are easy and hard to learn based on the error exponent; the extremal structures are extremal in terms of the tree diameter.

Full versions can be found at http://arxiv.org/abs/0905.0940 and http://arxiv.org/abs/0909.5216.