 
Provable Efficient Skeleton Learning of Encodable Discrete Bayes Nets in Poly-Time and Sample Complexity
ISIT 2020
Adarsh Barik, Jean Honorio
Purdue University

What are Bayesian networks?
Example: Burglar Alarm [Russel’02]
“I’m at work, neighbor John calls to say my alarm is ringing, but neighbor Mary doesn’t call. Sometimes it’s set off by minor earthquakes. Is there a burglar?”
Systems with multiple variables and their interactions

What are Bayesian networks?
How to model variables and their interactions?
• John calls: $J$ takes two values $\{\text{Calls}, \text{Doesn't Call}\} = \{1, 0\}$
• Mary doesn’t call: $M \in \{1, 0\}$
• Alarm is ringing: $A \in \{1, 0\}$
• Earthquake: $E \in \{1, 0\}$
• Burglar: $B \in \{1, 0\}$
• Need joint probability distribution table
• $2^n - 1$ entries for $n$ variables
• Quickly becomes too large: $2^{50} \sim 10^{15}$
• Cannot handle even moderately big systems this way

What are Bayesian networks?
Bayesian networks: A Directed Acyclic Graph (DAG) that specifies a joint distribution over random variables as a product of conditional probability functions, one for each variable given its set of parents
[Figure: DAG with edges $B \to A$, $E \to A$, $A \to J$, $A \to M$]
• $P(B, E, A, J, M) = P(B) \, P(E) \, P(A \mid B, E) \, P(J \mid A) \, P(M \mid A)$
• From $2^5 - 1 = 31$ entries to $1 + 1 + 2^2 + 2 + 2 = 10$ entries

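The entry counts on this slide can be checked with a short Python sketch. The `parents` dictionary is read off the factorization above; the function names are hypothetical helpers, and every variable is assumed to be binary.

```python
# Parents of each node in the burglar-alarm DAG, read off the factorization.
parents = {"B": [], "E": [], "A": ["B", "E"], "J": ["A"], "M": ["A"]}


def full_joint_entries(n: int) -> int:
    """Free entries in a joint probability table over n binary variables."""
    return 2 ** n - 1


def bayes_net_entries(parents: dict) -> int:
    """One free probability per parent configuration of each binary node."""
    return sum(2 ** len(pa) for pa in parents.values())


print(full_joint_entries(len(parents)))  # 31
print(bayes_net_entries(parents))        # 10
print(full_joint_entries(50))            # 1125899906842623, roughly 10^15
```
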
What’s the problem?
We want to learn the structure of a Bayesian network from data.
• Realization of each random variable: $X_1, X_2, X_3, \cdots, X_i, \cdots, X_n$
• Need not be ordered: $X_3, X_n, X_i, \cdots, X_1, \cdots, X_2$
• $N$ i.i.d. samples
[Figure: data matrix with $n$ variables and $N$ samples; graph over nodes $X_1, X_2, X_3, X_i, X_n$]

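To make the data setting concrete, the sketch below draws $N$ i.i.d. samples from the burglar-alarm network by sampling each node after its parents, then shuffles the columns so the variables are not in any particular order. All probability values are hypothetical, chosen only for illustration, and `sample_burglar_net` is an assumed helper name.

```python
import numpy as np


def sample_burglar_net(N, seed=0):
    """Draw N i.i.d. samples; returns an N x 5 binary matrix and column names."""
    rng = np.random.default_rng(seed)
    # Hypothetical conditional probabilities (not taken from the slides).
    B = rng.random(N) < 0.01
    E = rng.random(N) < 0.02
    p_alarm = np.where(B & E, 0.95, np.where(B, 0.94, np.where(E, 0.29, 0.001)))
    A = rng.random(N) < p_alarm
    J = rng.random(N) < np.where(A, 0.90, 0.05)
    M = rng.random(N) < np.where(A, 0.70, 0.01)

    data = np.column_stack([B, E, A, J, M]).astype(int)
    names = np.array(["B", "E", "A", "J", "M"])
    # The learner receives the variables in arbitrary order.
    perm = rng.permutation(len(names))
    return data[:, perm], names[perm]


data, names = sample_burglar_net(N=1000)
print(data.shape)  # (1000, 5)
```
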
What’s the problem?
Can we recover Bayesian network structure from data?
Yes, but it is NP-Hard [Chickering’04]!
[Figure: data matrix ($n$ variables, $N$ samples) → graph over nodes $X_1, X_2, X_3, X_i, X_n$]

Recovering structure of Bayesian network from data
Related Work
• Score maximization method - [Friedman’99, Margaritis’00, Moore’03, Tsamardinos’06], [Koivisto’04, Jakkola’10, Silander’12, Cussens’12]
• Independence test based method - [Spirtes’00, Cheng’02, Yehezkel’05, Xie’08]
• Special Cases with guarantees
• Linear structural equation models with additive noise: poly time and sample complexity [Ghoshal’17a, 17b]
• Node ordering for ordinal variables: poly time and sample complexity [Park’15,17]
• Binary variables: poly sample complexity [Brenner’13]

Our Problem
We want to learn the skeleton (rather than the full structure) of a Bayesian network from data.
[Ordyniak’13] Possible to learn DAG from skeleton efficiently under some technical conditions
[Figure: data matrix ($n$ variables, $N$ samples) → graph over nodes $X_1, X_2, X_3, X_i, X_n$]
• Correctness
• Polynomial time complexity
• Polynomial sample complexity

Encoding Variables
• Type of variables matters as it is the only input we have for our mathematical model
• Numerical variables - marks in exam: 25 < 30 < 45 (natural order)
• Categorical (Nominal) variables - country name: USA, India, China (no natural order)
[Figure: data matrix with $n$ variables and $N$ samples]

Encoding Variables
• We use encoding for categorical variables
• Variable - country name: USA, India, China
• Dummy Encoding - USA: $(1, 0, 0)$, India: $(0, 1, 0)$, China: $(0, 0, 1)$
• Effects Encoding - USA: $(1, 0)$, India: $(0, 1)$, China: $(-1, -1)$
• $X_r$ is encoded as $E(X_r) \in \mathbb{R}^k$

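The two encodings can be sketched in a few lines of Python. The function names `dummy_encode` and `effects_encode` are hypothetical; the sketch assumes the last category in the list is the reference level that effects encoding maps to all $-1$, which matches the China example above.

```python
import numpy as np


def dummy_encode(values, categories):
    """Map each value to a one-hot vector of length k = len(categories)."""
    index = {c: i for i, c in enumerate(categories)}
    out = np.zeros((len(values), len(categories)))
    for row, v in enumerate(values):
        out[row, index[v]] = 1.0
    return out


def effects_encode(values, categories):
    """Map each value to a vector of length k - 1; the last category is all -1."""
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    out = np.zeros((len(values), k - 1))
    for row, v in enumerate(values):
        i = index[v]
        if i == k - 1:
            out[row, :] = -1.0
        else:
            out[row, i] = 1.0
    return out


countries = ["USA", "India", "China"]
print(dummy_encode(countries, countries))
print(effects_encode(countries, countries))
```
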
Our Method
• $X_{-r} = [X_1 \cdots X_{r-1} \; X_{r+1} \cdots X_n]^\intercal$
• $E(X_{-r}) = [E(X_1) \cdots E(X_{r-1}) \; E(X_{r+1}) \cdots E(X_n)]^\intercal$
• $\pi(r)$ is set of parents for variable $r$ and $c(r)$ is set of children for variable $r$
• Recover $\pi(r)$ and $c(r)$ for every node and then combine
Idea: Express $E(X_r)$ as a linear function of $E(X_{-r})$
[Figure: graph over nodes $X_1, X_2, X_3, X_i, X_n$]

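The "combine" step can be sketched as follows. The slides do not say how the per-node estimates are merged, so this sketch assumes an OR rule: an undirected edge is kept if either endpoint lists the other as a neighbor. `combine_neighborhoods` is a hypothetical helper name.

```python
def combine_neighborhoods(neighbors):
    """neighbors[r] is the estimated set pi(r) ∪ c(r) for node r.

    Returns the skeleton as a set of undirected edges (frozensets).
    Assumption: OR rule, an edge is kept if either endpoint reports it.
    """
    edges = set()
    for r, nbrs in neighbors.items():
        for i in nbrs:
            edges.add(frozenset((r, i)))
    return edges


# Burglar-alarm example; node M's own estimate misses A, but A's recovers it.
estimated = {
    "B": {"A"},
    "E": {"A"},
    "A": {"B", "E", "J", "M"},
    "J": {"A"},
    "M": set(),
}
skeleton = combine_neighborhoods(estimated)
print(sorted(tuple(sorted(e)) for e in skeleton))
```

An AND rule (keep an edge only when both endpoints agree) is the stricter alternative; which one is appropriate depends on the guarantees of the per-node estimator.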
Our Method
Substitute model in population setting
• Let $E(X_r) = W^{*\intercal} E(X_{-r}) + e(E(X_{-r}))$
• $e(E(X_{-r}))$ depends on $X_{-r}$. Let $\| \mathbb{E}(|e|) \|_\infty \leq \mu$ and $\| e \|_\infty = 2\sigma$
• Solve for $W^*$ as $\arg\min_W \frac{1}{2} \mathbb{E}\left( \| E(X_r) - W^\intercal E(X_{-r}) \|_2^2 \right)$ such that $W_i = 0, \; \forall i \notin \pi(r) \cup c(r)$
• $W = \begin{bmatrix} W_1 \\ W_2 \\ \vdots \\ W_n \end{bmatrix}$, each $W_i \in \mathbb{R}^{k \times k}$, $W \in \mathbb{R}^{(n-1)k \times k}$

Our Method
Substitute model in sample setting
• Solve for $\widehat{W} = \arg\min_W \frac{1}{2N} \| E(X_r) - E(X_{-r}) W \|_F^2 + \lambda_N \| W \|_{B,1,2}$
• $E(X_r) \in \mathbb{R}^{N \times k}$, $E(X_{-r}) \in \mathbb{R}^{N \times (n-1)k}$
• $\| W \|_{B,1,2} = \sum_{i \in B} \| W_i \|_F$
• Idea: Provide condition on $N$ and $\lambda_N$ such that $\| W_i \|_F = 0, \; \forall i \notin \pi(r) \cup c(r)$ and $\| W_i \|_F \neq 0, \; \forall i \in \pi(r) \cup c(r)$

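One way to solve the block-regularized least-squares problem above numerically is proximal gradient descent with block soft-thresholding. The slides do not name a solver, so this is only a sketch under assumptions: every encoded variable has the same dimension $k$, and the names `group_lasso` and `support` are hypothetical.

```python
import numpy as np


def group_lasso(Y, X, k, lam, n_iter=500):
    """Minimize 1/(2N) ||Y - X W||_F^2 + lam * sum_i ||W_i||_F.

    Y is N x k, X is N x (m*k); W_i is the i-th block of k rows of W.
    Solver choice (proximal gradient) is an assumption, not from the slides.
    """
    N, p = X.shape
    W = np.zeros((p, Y.shape[1]))
    # Step size 1/L, where L is the largest eigenvalue of X^T X / N.
    step = N / np.linalg.norm(X, 2) ** 2
    for _ in range(n_iter):
        V = W - step * (X.T @ (X @ W - Y) / N)  # gradient step
        for start in range(0, p, k):            # block soft-thresholding
            block = V[start:start + k]
            norm = np.linalg.norm(block)        # Frobenius norm of the block
            scale = max(0.0, 1.0 - step * lam / norm) if norm > 0 else 0.0
            W[start:start + k] = scale * block
    return W


def support(W, k, tol=1e-8):
    """Indices i of blocks with ||W_i||_F above tol."""
    return [i for i in range(W.shape[0] // k)
            if np.linalg.norm(W[i * k:(i + 1) * k]) > tol]
```

Calling `support(group_lasso(Y, X, k, lam), k)` gives the block indices with nonzero norm, which play the role of the estimate of $\pi(r) \cup c(r)$ for node $r$.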
Our Method
Assumptions
Let $H = \mathbb{E}\left( E(X_{-r}) E(X_{-r})^\intercal \right)$ and $\widehat{H} = \frac{1}{N} E(X_{-r})^\intercal E(X_{-r})$.
1. Need to have a unique solution. Mathematically, $\Lambda_{\min}\left( H_{\pi(r) \cup c(r), \, \pi(r) \cup c(r)} \right) = C > 0$
2. Mutual Incoherence: Large number of irrelevant covariates (non-parents and non-children of node $r$) should not exert an overly strong effect on the subset of relevant covariates (parents and children of node $r$). Mathematically, for some $\alpha \in (0, 1]$, $\left\| H_{(\pi(r) \cup c(r))^c, \, \pi(r) \cup c(r)} \, H^{-1}_{\pi(r) \cup c(r), \, \pi(r) \cup c(r)} \right\|_{B,\infty,1} \leq 1 - \alpha$
3. $\| A \|_{B,\infty,1} = \max_{i \in B} \| \mathrm{vec}(A_i) \|_1$
4. Also holds in sample with high probability if $N = O(k^5 d^3 \log(kn))$ where $d = | \pi(r) \cup c(r) |$

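Both assumptions can be checked numerically for a given matrix $H$ and a candidate neighbor set. The sketch below assumes every variable is encoded in $k$ dimensions and that each block $A_i$ in the norm is a group of $k$ consecutive rows; `incoherence_check` is a hypothetical helper, and it expects at least one block outside the neighbor set.

```python
import numpy as np


def incoherence_check(H, S, k):
    """Return (minimum eigenvalue of H_SS, implied alpha).

    H is an (m*k) x (m*k) symmetric matrix, S lists the block indices in
    pi(r) ∪ c(r). Assumption: blocks are groups of k consecutive rows.
    """
    m = H.shape[0] // k
    Sc = [i for i in range(m) if i not in S]

    def rows(blocks):
        return np.concatenate([np.arange(i * k, (i + 1) * k) for i in blocks])

    s_idx, sc_idx = rows(S), rows(Sc)
    H_SS = H[np.ix_(s_idx, s_idx)]
    lam_min = np.linalg.eigvalsh(H_SS).min()
    A = H[np.ix_(sc_idx, s_idx)] @ np.linalg.inv(H_SS)
    # Block norm: maximum over row blocks of the entrywise l1 norm.
    block_norm = max(np.abs(A[j * k:(j + 1) * k]).sum() for j in range(len(Sc)))
    return lam_min, 1.0 - block_norm


lam_min, alpha = incoherence_check(np.eye(6), S=[0], k=2)
print(lam_min > 0, alpha > 0)
```

Assumption 1 holds when the first returned value is positive, and assumption 2 holds when the second is in $(0, 1]$.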