slide-1
SLIDE 1

Provable Efficient Skeleton Learning of Encodable Discrete Bayes Nets in Poly-Time and Sample Complexity

ISIT 2020
Adarsh Barik, Jean Honorio
Purdue University


slide-2
SLIDE 2

What are Bayesian networks?

Example: Burglar Alarm [Russell'02]. "I'm at work, neighbor John calls to say my alarm is ringing, but neighbor Mary doesn't call. Sometimes it's set off by minor earthquakes. Is there a burglar?"

Systems with multiple variables and their interactions

slide-5
SLIDE 5

What are Bayesian networks?

How to model variables and their interactions?

  • John calls: J takes two values {Calls, Doesn't Call} = {1, 0}
  • Mary doesn't call: M ∈ {1, 0}
  • Alarm is ringing: A ∈ {1, 0}
  • Earthquake: E ∈ {1, 0}
  • Burglar: B ∈ {1, 0}
  • Need a joint probability distribution table
  • 2^n − 1 entries for n variables
  • Quickly becomes too large: 2^50 ∼ 10^15
  • Cannot handle even moderately big systems this way

slide-7
SLIDE 7

What are Bayesian networks?

Bayesian networks: a Directed Acyclic Graph (DAG) that specifies a joint distribution over random variables as a product of conditional probability functions, one for each variable given its set of parents

(DAG over B, E, A, J, M with edges B → A, E → A, A → J, A → M, matching the factorization below)

  • P(B, E, A, J, M) = P(B)P(E)P(A|B, E)P(J|A)P(M|A)
  • From 2^5 − 1 = 31 entries to 1 + 1 + 2^2 + 2 + 2 = 10 entries
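To make the factorization concrete, here is a minimal runnable sketch (my illustration, not from the slides; the CPT values are the standard ones from the Russell & Norvig alarm example):

```python
# Joint probability via the Bayes-net factorization
# P(B, E, A, J, M) = P(B) P(E) P(A|B,E) P(J|A) P(M|A):
# 1 + 1 + 4 + 2 + 2 = 10 free parameters instead of 2^5 - 1 = 31.

P_B1 = 0.001                           # P(B = 1)
P_E1 = 0.002                           # P(E = 1)
P_A1 = {(1, 1): 0.95, (1, 0): 0.94,    # P(A = 1 | B, E): 2^2 entries
        (0, 1): 0.29, (0, 0): 0.001}
P_J1 = {1: 0.90, 0: 0.05}              # P(J = 1 | A)
P_M1 = {1: 0.70, 0: 0.01}              # P(M = 1 | A)

def bern(p1, x):
    """P(X = x) for a binary X with P(X = 1) = p1."""
    return p1 if x == 1 else 1.0 - p1

def joint(b, e, a, j, m):
    return (bern(P_B1, b) * bern(P_E1, e) * bern(P_A1[(b, e)], a)
            * bern(P_J1[a], j) * bern(P_M1[a], m))

# "John calls, Mary doesn't": alarm rang, no burglary, no earthquake.
print(joint(b=0, e=0, a=1, j=1, m=0))  # ~2.69e-4
```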

slide-8
SLIDE 8

What's the problem?

We want to learn the structure of a Bayesian network from data.

(Figure: the unknown network over X1, X2, X3, …, Xi, …, Xn, and a data matrix with n variables and N samples)

  • Realization of each random variable: X1, X2, X3, …, Xi, …, Xn
  • Need not be ordered: X3, Xn, Xi, …, X1, …, X2
  • N i.i.d. samples

slide-13
SLIDE 13

What's the problem?

(Figure: the data matrix of n variables and N samples mapped to the network over X1, X2, X3, …, Xi, …, Xn)

Can we recover the Bayesian network structure from data? Yes, but it is NP-hard [Chickering'04]!

slide-15
SLIDE 15

Recovering structure of Bayesian network from data

Related Work

  • Score maximization methods: [Friedman'99, Margaritis'00, Moore'03, Tsamardinos'06], [Koivisto'04, Jakkola'10, Silander'12, Cussens'12]
  • Independence-test based methods: [Spirtes'00, Cheng'02, Yehezkel'05, Xie'08]
  • Special cases with guarantees:
    • Linear structural equation models with additive noise: poly time and sample complexity [Ghoshal'17a, 17b]
    • Node ordering for ordinal variables: poly time and sample complexity [Park'15, 17]
    • Binary variables: poly sample complexity [Brenner'13]

slide-16
SLIDE 16

Our Problem

We want to learn the skeleton of a Bayesian network from data.

(Figure: the data matrix of n variables and N samples mapped to the network over X1, X2, X3, …, Xi, …, Xn)

  • Correctness
  • Polynomial time complexity
  • Polynomial sample complexity

It is possible to learn the DAG from the skeleton efficiently under some technical conditions [Ordyniak'13]

slide-19
SLIDE 19

Encoding Variables

(Figure: the data matrix of n variables and N samples)

  • The type of the variables matters, as it is the only input we have for our mathematical model
  • Numerical variables, e.g. marks in an exam: 25 < 30 < 45 (natural order)
  • Categorical (nominal) variables, e.g. country name: USA, India, China (no natural order)

slide-20
SLIDE 20

Encoding Variables

  • We use an encoding for categorical variables
  • Variable: country name: USA, India, China
  • Dummy encoding: USA: (1, 0, 0), India: (0, 1, 0), China: (0, 0, 1)
  • Effects encoding: USA: (1, 0), India: (0, 1), China: (−1, −1)
  • Xr is encoded as E(Xr) ∈ R^k
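A quick sketch of the two encodings (illustrative; the function names are my own):

```python
import numpy as np

def dummy_encoding(levels):
    """One-hot: level i of k maps to the i-th standard basis vector of R^k."""
    k = len(levels)
    return {lvl: np.eye(k)[i] for i, lvl in enumerate(levels)}

def effects_encoding(levels):
    """First k-1 levels map to basis vectors of R^(k-1);
    the last level maps to the all-minus-one vector."""
    k = len(levels)
    enc = {lvl: np.eye(k - 1)[i] for i, lvl in enumerate(levels[:-1])}
    enc[levels[-1]] = -np.ones(k - 1)
    return enc

countries = ["USA", "India", "China"]
print(dummy_encoding(countries))    # USA: (1,0,0), India: (0,1,0), China: (0,0,1)
print(effects_encoding(countries))  # USA: (1,0),   India: (0,1),   China: (-1,-1)
```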

slide-21
SLIDE 21

Our Method

  • X−r = [X1 · · · Xr−1 Xr+1 · · · Xn]⊺
  • E(X−r) = [E(X1) · · · E(Xr−1) E(Xr+1) · · · E(Xn)]⊺
  • π(r) is the set of parents of variable r and c(r) is the set of children of variable r
  • Recover π(r) and c(r) for every node and then combine (see the sketch below)

(Figure: the network over X1, X2, X3, …, Xi, …, Xn)

Idea: Express E(Xr) as a linear function of E(X−r)
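The per-node recovery and the combining step might look like the following sketch (my illustration; `fit_group_lasso` is the solver sketched after Slide 24, and combining neighborhoods by union is one plausible reading of "combine"):

```python
import numpy as np

def estimate_neighborhood(E_all, r, fit_group_lasso, lam):
    """Regress E(X_r) on E(X_-r); block rows with nonzero norm
    estimate pi(r) ∪ c(r)."""
    n = len(E_all)                    # E_all[i]: N x k encoding of X_i
    k = E_all[r].shape[1]
    others = [i for i in range(n) if i != r]
    X = np.hstack([E_all[i] for i in others])   # N x (n-1)k design matrix
    W = fit_group_lasso(X, E_all[r], lam, k)    # (n-1)k x k estimate
    return {others[i] for i in range(n - 1)
            if np.linalg.norm(W[i * k:(i + 1) * k]) > 1e-8}

def skeleton(E_all, fit_group_lasso, lam):
    """Union of the estimated neighborhoods gives the skeleton edges."""
    edges = set()
    for r in range(len(E_all)):
        for j in estimate_neighborhood(E_all, r, fit_group_lasso, lam):
            edges.add(frozenset((r, j)))
    return edges
```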

slide-23
SLIDE 23

Our Method

Substitute model in the population setting

  • Let E(Xr) = W*⊺ E(X−r) + e(E(X−r))
  • e(E(X−r)) depends on X−r. Let ∥E(|e|)∥∞ ⩽ µ and ∥e∥∞ = 2σ
  • Solve for W* as

    arg min_W (1/2) E(∥E(Xr) − W⊺E(X−r)∥_2^2)  such that W_i = 0, ∀i ∉ π(r) ∪ c(r)

  • W stacks the blocks W_i (one per variable i ≠ r); each W_i ∈ R^{k×k}, so W ∈ R^{(n−1)k×k}

slide-24
SLIDE 24

Our Method

Substitute model in the sample setting

  • Solve for

    Ŵ = arg min_W (1/(2N)) ∥E(Xr) − E(X−r)W∥_F^2 + λ_N ∥W∥_{B,1,2}

  • E(Xr) ∈ R^{N×k}, E(X−r) ∈ R^{N×(n−1)k}
  • ∥W∥_{B,1,2} = Σ_{i∈B} ∥W_i∥_F
  • Idea: provide conditions on N and λ_N such that ∥Ŵ_i∥_F = 0, ∀i ∉ π(r) ∪ c(r) and ∥Ŵ_i∥_F ≠ 0, ∀i ∈ π(r) ∪ c(r) (a solver sketch follows)
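A minimal proximal-gradient (ISTA) solver for this group-lasso objective, with block soft-thresholding as the proximal step; this is my sketch, not the authors' implementation, and it supplies the `fit_group_lasso` assumed in the sketch after Slide 21:

```python
import numpy as np

def fit_group_lasso(X, Y, lam, k, n_iter=2000):
    """Minimize (1/(2N)) ||Y - XW||_F^2 + lam * sum_i ||W_i||_F,
    where the W_i are the consecutive k-row blocks of W."""
    N, p = X.shape
    W = np.zeros((p, Y.shape[1]))
    step = N / np.linalg.norm(X, 2) ** 2         # 1/L for the smooth part
    for _ in range(n_iter):
        V = W - step * (X.T @ (X @ W - Y) / N)   # gradient step
        for i in range(p // k):                  # prox: block soft-thresholding
            blk = V[i * k:(i + 1) * k]
            nrm = np.linalg.norm(blk)            # Frobenius norm of block i
            V[i * k:(i + 1) * k] = (max(0.0, 1.0 - step * lam / nrm) * blk
                                    if nrm > 0 else 0.0)
        W = V
    return W
```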

slide-26
SLIDE 26

Our Method

Let H = E(E(X−r)E(X−r)⊺) and Ĥ = (1/N) E(X−r)⊺E(X−r).

Assumptions

  1. Need to have a unique solution. Mathematically, Λmin(H_{π(r)∪c(r), π(r)∪c(r)}) = C > 0
  2. Mutual incoherence: the large number of irrelevant covariates (non-parents and non-children of node r) should not exert an overly strong effect on the subset of relevant covariates (parents and children of node r). Mathematically, for some α ∈ (0, 1],

     ∥H_{(π(r)∪c(r))^c, π(r)∪c(r)} H_{π(r)∪c(r), π(r)∪c(r)}^{−1}∥_{B,∞,1} ⩽ 1 − α

  3. Here ∥A∥_{B,∞,1} = max_{i∈B} ∥vec(A_i)∥_1
  4. Both also hold in the sample setting (for Ĥ) with high probability if N = O(k^5 d^3 log(kn)), where d = |π(r) ∪ c(r)|
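These conditions can be probed numerically on the sample matrix Ĥ. A rough sketch (my illustration; it assumes the blocks of the B,∞,1 norm are the k-row blocks of the incoherence matrix, and that the set S = π(r) ∪ c(r) is known for the check):

```python
import numpy as np

def check_assumptions(H, S, k):
    """Return (C, alpha_slack) for a (n-1)k x (n-1)k covariance H and
    block index set S; Assumptions 1 and 2 need C > 0 and alpha_slack > 0."""
    rows = lambda blocks: np.concatenate([np.arange(i * k, (i + 1) * k)
                                          for i in blocks])
    S = sorted(S)
    Sc = [i for i in range(H.shape[0] // k) if i not in S]
    H_SS = H[np.ix_(rows(S), rows(S))]
    H_ScS = H[np.ix_(rows(Sc), rows(S))]
    C = np.linalg.eigvalsh(H_SS).min()           # Assumption 1: minimum eigenvalue
    M = H_ScS @ np.linalg.inv(H_SS)
    # ||M||_{B,inf,1}: max over blocks of the entrywise l1 norm of the block.
    b = max(np.abs(M[i * k:(i + 1) * k]).sum() for i in range(len(Sc)))
    return C, 1.0 - b                            # Assumption 2: b <= 1 - alpha
```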

slide-29
SLIDE 29

Our Method

Example

(Figure: a four-node network over 1, 2, 3, 4)

  • Let |E[E(Xi)E(Xj)]| < 1, ∀i, j ∈ {1, 2, 3, 4}, i ≠ j
  • Also E[E(X1)E(X4)] = q, E[E(X1)E(X2)] = E[E(X2)E(X4)] = E[E(X3)E(X4)] = p, and E[E(X1)] = E[E(X3)] = 0
  • Then mutual incoherence implies that |p| + |q| < 1

slide-30
SLIDE 30

Main Result

Let

  λ_N > (4/α) (k/N)^{1/2} max{ (1 − α)((2σ^2 log(k^2 d))^{1/2} + µ), k^{1/2} ((2σ^2 log((n − d)k^2))^{1/2} + µ) }

and

  N = O(k^5 d^3 log(kn)),

then:

  1. We correctly recover every node which is neither a parent nor a child, for all the nodes.
  2. If min_{i∈π(r)∪c(r)} ∥W*_i∥_F > (4k/C) (α/(4(1 − α)) + k^{1/2} + 1) (kd)^{1/2} λ_N, then we correctly recover every node which is either a parent or a child, for all the nodes.
  3. Subsequently, we correctly recover the skeleton.

The proof is done through a primal-dual witness construction [Wainwright'09]
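Both thresholds are directly computable from the stated constants; a small helper transcribing the expressions above (my sketch, with arbitrary example values in the call):

```python
import numpy as np

def lambda_lower_bound(N, n, k, d, alpha, sigma, mu):
    """Lower bound on the regularization weight lambda_N (Slide 30)."""
    t1 = (1 - alpha) * (np.sqrt(2 * sigma**2 * np.log(k**2 * d)) + mu)
    t2 = np.sqrt(k) * (np.sqrt(2 * sigma**2 * np.log((n - d) * k**2)) + mu)
    return (4 / alpha) * np.sqrt(k / N) * max(t1, t2)

def min_block_norm(lam, k, d, alpha, C):
    """Minimum ||W*_i||_F needed to recover all parents/children (condition 2)."""
    return (4 * k / C) * (alpha / (4 * (1 - alpha)) + np.sqrt(k) + 1) \
           * np.sqrt(k * d) * lam

lam = 1.1 * lambda_lower_bound(N=20000, n=100, k=3, d=5,
                               alpha=0.5, sigma=1.0, mu=0.1)
print(lam, min_block_norm(lam, k=3, d=5, alpha=0.5, C=0.5))
```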

slide-32
SLIDE 32

Main Result

Method                                             | Correctness guarantee                        | Time complexity         | Sample complexity
MMPC/MMHC [Tsamardinos'03, 06]                     | Skeleton                                     | Exponential             | Infinite sample
OverDispersion [Park'15, 17]                       | DAG, for ordinal data                        | Polynomial              | Polynomial
SparsityBoost [Brenner'13]                         | DAG without false inclusion, for binary data | Exponential             | Polynomial
PC Algorithm [Spirtes'00]                          | DAG                                          | Exponential with degree | Infinite sample
[Moore'03, Koivisto'04, Jakkola'10, Cussens'12, …] | No guarantees                                | Exponential             | No guarantees
Our Method                                         | Skeleton                                     | Polynomial              | Logarithmic

Correctness guarantee, and time and sample complexity, are with respect to the number of nodes.

slide-33
SLIDE 33

Experimental Results

Figure: Plots of precision (panel a) and recall (panel b) versus the control parameter CP (k = 4) for Bayesian networks on n = 20, 200 and 500 nodes, with N = 10 · CP · log((k − 1)n) samples.
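For reference, precision and recall here compare recovered skeleton edges against the ground truth; a minimal sketch of the metric (my illustration):

```python
def precision_recall(true_edges, est_edges):
    """Edge-set precision and recall for undirected skeletons;
    edges are frozensets {i, j}."""
    tp = len(true_edges & est_edges)                      # correctly recovered edges
    precision = tp / len(est_edges) if est_edges else 1.0
    recall = tp / len(true_edges) if true_edges else 1.0
    return precision, recall

truth = {frozenset(e) for e in [(0, 1), (1, 2), (2, 3)]}
est = {frozenset(e) for e in [(0, 1), (1, 2), (1, 3)]}
print(precision_recall(truth, est))  # (0.667, 0.667)
```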

slide-34
SLIDE 34

Experimental Results

F1-scores and standard errors at the 95% confidence level on benchmark Bayesian networks:

Network (n)  | Our Method + Greedy | MMHC        | Greedy Search
barley (48)  | 0.74 ± 0.06         | 0.73 ± 0.05 | 0.61 ± 0.05
carpo (60)   | 0.89 ± 0.01         | 0.86 ± 0.03 | 0.78 ± 0.02
mildew (35)  | 0.83 ± 0.01         | 0.77 ± 0.00 | 0.70 ± 0.08
water (32)   | 0.61 ± 0.02         | 0.60 ± 0.05 | 0.51 ± 0.06

Negative log-likelihood and standard errors at the 95% confidence level on real-world datasets:

Network (n)        | Our Method + Greedy | MMHC          | Greedy Search
moviereview (1001) | 339.27 ± 8.85       | 340.60 ± 9.03 | 343.64 ± 9.28
dna (180)          | 80.37 ± 0.24        | 80.57 ± 0.24  | 80.45 ± 0.25
autos (26)         | 18.61 ± 2.98        | 24.45 ± 1.74  | 18.62 ± 3.27

slide-35
SLIDE 35

Conclusion

  • We provided a method to correctly recover the skeleton of a Bayesian network, with provable theoretical guarantees.
  • Our method works with non-ordinal categorical variables.
  • Our formulation is also independent of any specific conditional probability distribution for the nodes.
  • We obtain sufficient conditions for skeleton recovery by controlling the interaction between node pairs for arbitrary conditional probability distributions.
  • Our method's sample complexity is logarithmic in the number of nodes, and its run time is polynomial.

slide-36
SLIDE 36

Thank You!
