SLIDE 1
Provable Efficient Skeleton Learning of Encodable Discrete Bayes Nets in Poly-Time and Sample Complexity
ISIT 2020. Adarsh Barik, Jean Honorio (Purdue University)
1
SLIDE 4 What are Bayesian networks?
Example: Burglar Alarm [Russell’02] “I’m at work, neighbor John calls to say my alarm is ringing, but neighbor Mary doesn’t call. Sometimes it’s set off by minor earthquakes. Is there a burglar?”
Systems with multiple variables and their interactions
2
SLIDE 5 What are Bayesian networks?
How to model variables and their interactions?
- John calls: J takes two values
{Calls, Doesn’t Call} = {1, 0}.
- Mary doesn’t call: M ∈ {1, 0}
- Alarm is ringing: A ∈ {1, 0}
- Earthquake : E ∈ {1, 0}
- Burglar : B ∈ {1, 0}
- Need joint probability distribution table
- 2^n − 1 entries for n variables
- Quickly becomes too large: 2^50 ≈ 10^15
- Cannot handle even moderately big systems this way
3
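The blow-up above can be checked directly; a quick sketch (the helper `joint_table_entries` is illustrative, not from the talk):

```python
def joint_table_entries(n: int) -> int:
    """Free parameters in a full joint probability table over n binary variables."""
    return 2 ** n - 1

print(joint_table_entries(5))   # 31 for the five alarm-example variables
print(joint_table_entries(50))  # 1125899906842623, about 10^15
```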
SLIDE 7 What are Bayesian networks?
Bayesian networks: A Directed Acyclic Graph (DAG) that specifies a joint distribution over random variables as a product of conditional probability functions, one for each variable given its set of parents
B E A J M
P(B)P(E)P(A|B, E)P(J|A)P(M|A)
- From 2^5 − 1 = 31 entries to 1 + 1 + 2^2 + 2 + 2 = 10 entries
4
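The entry count follows from the factorization: a binary node with p parents needs 2^p conditional-probability entries. A small sketch (the `bn_entries` helper is mine, not from the paper):

```python
def bn_entries(parents: dict) -> int:
    """Free parameters of a Bayesian network over binary variables:
    one parameter per configuration of each node's parents."""
    return sum(2 ** len(pa) for pa in parents.values())

# Burglar-alarm network: B and E are roots, A has parents {B, E},
# J and M each have parent {A}.
alarm = {"B": [], "E": [], "A": ["B", "E"], "J": ["A"], "M": ["A"]}
print(bn_entries(alarm))  # 10, versus 2**5 - 1 = 31 for the full joint table
```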
SLIDE 12 What’s the problem?
We want to learn the structure of a Bayesian network from data.
X1 X2 X3 Xi Xn
- Realization of each random variable:
X1, X2, X3, · · · , Xi, · · · , Xn
X3, Xn, Xi, · · · , X1, · · · , X2
n variables N samples
Data
5
SLIDE 14
What’s the problem?
n variables N samples
Data
→ X1 X2 X3 Xi Xn
Can we recover Bayesian network structure from data? Yes, but it is NP-Hard [Chickering’04]!
6
SLIDE 15 Recovering structure of Bayesian network from data
Related Work
- Score maximization methods - [Friedman’99, Margaritis’00, Moore’03, Tsamardinos’06], [Koivisto’04, Jaakkola’10, Silander’12, Cussens’12]
- Independence-test based methods - [Spirtes’00, Cheng’02, Yehezkel’05, Xie’08]
- Special Cases with guarantees
- Linear structural equation models with additive noise: poly time and sample
complexity [Ghoshal’17a, 17b]
- Node ordering for ordinal variables: poly time and sample complexity [Park’15,17]
- Binary variables: poly sample complexity [Brenner’13]
7
SLIDE 18 Our Problem
We want to learn the skeleton of a Bayesian network from data.
n variables N samples
Data
→ X1 X2 X3 Xi Xn
- Correctness
- Polynomial time complexity
- Polynomial sample complexity
Possible to learn DAG from skeleton efficiently under some technical conditions [Ordyniak’13]
8
SLIDE 19 Encoding Variables
n variables N samples
Data
- Type of variables matters as it is the only input we have for our mathematical model
- Numerical variables - marks in an exam: 25 < 30 < 45 (natural order)
- Categorical (Nominal) variables - country name: USA, India, China (no natural order)
9
SLIDE 20 Encoding Variables
- We use encoding for categorical variables
- Variable - country name: USA, India, China
- Dummy Encoding - USA: (1, 0, 0), India: (0, 1, 0), China: (0, 0, 1)
- Effects Encoding - USA: (1, 0), India: (0, 1), China: (−1, −1)
- Xr is encoded as E(Xr) ∈ R^k
10
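Both encodings can be written out explicitly; a minimal sketch assuming the category order USA, India, China from the slide:

```python
import numpy as np

CATEGORIES = ["USA", "India", "China"]

def dummy_encode(x: str) -> np.ndarray:
    """One-hot: each of the 3 categories maps to a vector in R^3."""
    v = np.zeros(len(CATEGORIES))
    v[CATEGORIES.index(x)] = 1.0
    return v

def effects_encode(x: str) -> np.ndarray:
    """Effects coding: maps to R^2; the last category is all -1s."""
    i = CATEGORIES.index(x)
    if i == len(CATEGORIES) - 1:
        return -np.ones(len(CATEGORIES) - 1)
    v = np.zeros(len(CATEGORIES) - 1)
    v[i] = 1.0
    return v

print(dummy_encode("India"))    # [0. 1. 0.]
print(effects_encode("China"))  # [-1. -1.]
```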
SLIDE 22 Our Method
- X−r = [X1 · · · Xr−1 Xr+1 · · · Xn]⊺
- E(X−r) = [E(X1) · · · E(Xr−1) E(Xr+1) · · · E(Xn)]⊺
- π(r) is the set of parents of variable r and c(r) is the set of children of variable r
- Recover π(r) and c(r) for every node and then combine
X1 X2 X3 Xi Xn
Idea: Express E(Xr) as a linear function of E(X−r)
11
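The combine step can be sketched as a union of undirected edges; `neighborhoods` below is hypothetical output of the per-node recovery, using the alarm network for illustration:

```python
def combine_skeleton(neighborhoods: dict) -> set:
    """Union of the recovered neighborhoods pi(r) ∪ c(r) as undirected edges."""
    edges = set()
    for r, nbrs in neighborhoods.items():
        for s in nbrs:
            edges.add(frozenset((r, s)))  # {r, s} is direction-free
    return edges

nbh = {"B": {"A"}, "E": {"A"}, "A": {"B", "E", "J", "M"}, "J": {"A"}, "M": {"A"}}
skeleton = combine_skeleton(nbh)
print(sorted(tuple(sorted(e)) for e in skeleton))
# [('A', 'B'), ('A', 'E'), ('A', 'J'), ('A', 'M')]
```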
SLIDE 23 Our Method
Substitute model in population setting
- Let E(Xr) = W∗⊺E(X−r) + e(E(X−r))
- e(E(X−r)) depends on X−r. Let ∥E(|e|)∥∞ ⩽ µ and ∥e∥∞ = 2σ
- Solve for W∗ as

  W∗ = arg min_W (1/2) E[∥E(Xr) − W⊺E(X−r)∥²₂]  such that  Wi = 0, ∀i ∉ π(r) ∪ c(r)

- W stacks one block Wi per variable in X−r, each Wi ∈ R^(k×k), so W ∈ R^((n−1)k×k)
12
SLIDE 24 Our Method
Substitute model in sample setting
Ŵ = arg min_W (1/2N) ∥E(Xr) − E(X−r)W∥²_F + λN ∥W∥_B,1,2

- E(Xr) ∈ R^(N×k), E(X−r) ∈ R^(N×(n−1)k)
- ∥W∥_B,1,2 = Σ_{i∈B} ∥Wi∥_F
- Idea: Provide conditions on N and λN such that ∥Ŵi∥_F = 0, ∀i ∉ π(r) ∪ c(r) and ∥Ŵi∥_F ≠ 0, ∀i ∈ π(r) ∪ c(r)
13
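The estimator above is a block-regularized (group-lasso) least squares; a minimal proximal-gradient sketch, not the authors' implementation, on synthetic data where only the first k-row block of W is relevant (the identity target for that block is my choice, for determinism):

```python
import numpy as np

def group_lasso(X, Y, k, lam, n_iter=1000):
    """Minimize (1/2N)||Y - XW||_F^2 + lam * sum_i ||W_i||_F over k-row
    blocks W_i, by proximal gradient with block soft-thresholding."""
    N, p = X.shape
    W = np.zeros((p, Y.shape[1]))
    step = N / np.linalg.norm(X, 2) ** 2  # 1/L for the smooth part
    for _ in range(n_iter):
        V = W - step * (X.T @ (X @ W - Y) / N)
        for i in range(0, p, k):  # prox of lam*||.||_F: block soft-threshold
            nrm = np.linalg.norm(V[i:i + k])
            if nrm <= step * lam:
                V[i:i + k] = 0.0
            else:
                V[i:i + k] *= 1 - step * lam / nrm
        W = V
    return W

rng = np.random.default_rng(0)
k = 2
X = rng.normal(size=(500, 3 * k))       # 3 encoded variables, k = 2 each
W_true = np.zeros((3 * k, k))
W_true[:k] = np.eye(k)                  # only variable 1 is a neighbor
Y = X @ W_true + 0.1 * rng.normal(size=(500, k))
W_hat = group_lasso(X, Y, k, lam=0.2)
print(np.linalg.norm(W_hat[k:]))        # irrelevant blocks are driven to zero
```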
SLIDE 26 Our Method
Let H = E[E(X−r)E(X−r)⊺] and Ĥ = (1/N) E(X−r)⊺E(X−r).
Assumptions
- 1. Need to have a unique solution. Mathematically, Λmin(H_{π(r)∪c(r), π(r)∪c(r)}) = C > 0
- 2. Mutual Incoherence: the large number of irrelevant covariates (non-parents and non-children of node r) should not exert an overly strong effect on the subset of relevant covariates (parents and children of node r). Mathematically, for some α ∈ (0, 1], ∥H_{(π(r)∪c(r))ᶜ, π(r)∪c(r)} H⁻¹_{π(r)∪c(r), π(r)∪c(r)}∥_B,∞,1 ⩽ 1 − α
- 3. ∥A∥_B,∞,1 = max_{i∈B} ∥vec(Ai)∥₁
- 4. Both also hold in the sample setting with high probability if N = O(k⁵d³ log(kn)) where d = |π(r) ∪ c(r)|
14
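Assumptions 1-2 can be checked numerically for a given H; a sketch (helper names are mine) with S indexing the variables in π(r) ∪ c(r) and k the encoding dimension:

```python
import numpy as np

def check_assumptions(H, S, k, alpha):
    """Assumption 1: Lambda_min(H_SS) > 0.  Assumption 2:
    ||H_{S^c,S} H_SS^{-1}||_{B,inf,1} <= 1 - alpha, where the B,inf,1 norm
    takes the max over blocks of the l1 norm of the vectorized block."""
    S_idx = [j for i in S for j in range(i * k, (i + 1) * k)]
    Sc_idx = [j for j in range(H.shape[0]) if j not in S_idx]
    H_SS = H[np.ix_(S_idx, S_idx)]
    lam_min = np.linalg.eigvalsh(H_SS)[0]
    M = H[np.ix_(Sc_idx, S_idx)] @ np.linalg.inv(H_SS)
    b_inf_1 = max(np.abs(M[i:i + k]).sum() for i in range(0, M.shape[0], k))
    return bool(lam_min > 0), bool(b_inf_1 <= 1 - alpha)

# Toy check: with uncorrelated encodings (H = I) both assumptions hold trivially.
H = np.eye(8)                                          # 4 variables, k = 2
print(check_assumptions(H, S=[0, 2], k=2, alpha=0.5))  # (True, True)
```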
SLIDE 29 Our Method
Example
1 2 3 4
- Let |E[E(Xi)E(Xj)]| < 1, ∀i, j ∈ {1, 2, 3, 4}, i ̸= j.
- Also E[E(X1)E(X4)] = q,
E[E(X1)E(X2)] = E[E(X2)E(X4)] = E[E(X3)E(X4)] = p
and E[E(X1)] = E[E(X3)] = 0.
- Then Mutual Incoherence implies that |p| + |q| < 1
15
SLIDE 31 Main Result
Let

  λN > (4/α)(k/N)^(1/2) max( (1 − α)((2σ² log(k²d))^(1/2) + µ), k^(1/2)((2σ² log((n − d)k²))^(1/2) + µ) )

and N = O(k⁵d³ log(kn)), then:
- 1. We correctly recover every node which is neither a parent nor a child, for all the nodes.
- 2. If ∥W∗_i∥_F > (4k/C)(α/(4(1 − α)) + k^(1/2) + 1)(kd)^(1/2) λN for every i ∈ π(r) ∪ c(r), then we correctly recover every node which is either a parent or a child, for all the nodes.
- 3. Subsequently, we correctly recover the skeleton.
Proof is done through a primal-dual witness construction [Wainwright’09]
16
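The bound N = O(k⁵d³ log(kn)) grows only logarithmically in the number of nodes; a quick sketch treating the hidden constant as 1 (my choice, not from the paper):

```python
import math

def sample_bound(k: int, d: int, n: int, c: float = 1.0) -> float:
    """Sample-complexity bound c * k^5 * d^3 * log(k*n)."""
    return c * k**5 * d**3 * math.log(k * n)

# 100x more nodes costs less than 2x more samples at k = 2, d = 3.
print(sample_bound(2, 3, 100), sample_bound(2, 3, 10_000))
```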
SLIDE 32
Main Result
Method | Correctness guarantee | Time complexity | Sample complexity
MMPC/MMHC [Tsamardinos’03, 06] | Skeleton | Exponential | Infinite sample
OverDispersion [Park’15, 17] | DAG, for ordinal data | Polynomial | Polynomial
SparsityBoost [Brenner’13] | DAG without false inclusion, for binary data | Exponential | Polynomial
PC Algorithm [Spirtes’00] | DAG | Exponential with degree | Infinite sample
[Moore’03, Koivisto’04, Jaakkola’10, Cussens’12, ...] | No guarantees | Exponential | No guarantees
Our Method | Skeleton | Polynomial | Logarithmic
Correctness guarantee, and time and sample complexity, with respect to the number of nodes.
17
SLIDE 33
Experimental Results
(a) Precision vs. CP (k = 4)  (b) Recall vs. CP (k = 4)
Figure: Plots of precision and recall versus the control parameter CP for Bayesian networks on n = 20, 200 and 500 nodes with N = 10·CP·log((k − 1)n) samples.
18
SLIDE 34
Experimental Results
Network (n) | Our Method + Greedy | MMHC | Greedy Search
barley (48) | 0.74 ± 0.06 | 0.73 ± 0.05 | 0.61 ± 0.05
carpo (60) | 0.89 ± 0.01 | 0.86 ± 0.03 | 0.78 ± 0.02
mildew (35) | 0.83 ± 0.01 | 0.77 ± 0.00 | 0.70 ± 0.08
water (32) | 0.61 ± 0.02 | 0.60 ± 0.05 | 0.51 ± 0.06
F1-scores and standard errors at 95% confidence level on benchmark Bayesian networks

Network (n) | Our Method + Greedy | MMHC | Greedy Search
moviereview (1001) | 339.27 ± 8.85 | 340.60 ± 9.03 | 343.64 ± 9.28
dna (180) | 80.37 ± 0.24 | 80.57 ± 0.24 | 80.45 ± 0.25
autos (26) | 18.61 ± 2.98 | 24.45 ± 1.74 | 18.62 ± 3.27
Negative log-likelihood and standard errors at 95% confidence level on real-world datasets
19
SLIDE 35 Conclusion
- We provided a method to correctly recover the skeleton of a Bayesian network with provable theoretical guarantees.
- Our method works with non-ordinal categorical variables.
- Our formulation is also independent of any specific conditional probability distribution for the nodes.
- We obtain sufficient conditions for skeleton recovery by controlling the interaction between node pairs for arbitrary conditional probability distributions.
- The sample complexity of our method is logarithmic in the number of nodes, and the run time is polynomial.
20
SLIDE 36
Thank You!
21