  1. Provable Efficient Skeleton Learning of Encodable Discrete Bayes Nets in Poly-Time and Sample Complexity. ISIT 2020. Adarsh Barik, Jean Honorio, Purdue University.

  2. What are Bayesian networks? Example: Burglar Alarm [Russell'02]. "I'm at work, neighbor John calls to say my alarm is ringing, but neighbor Mary doesn't call. Sometimes it's set off by minor earthquakes. Is there a burglar?" Bayesian networks model systems with multiple variables and their interactions.

  3. What are Bayesian networks? How do we model variables and their interactions? • John calls: J takes two values, {Calls, Doesn't Call} = {1, 0} • Mary doesn't call: M ∈ {1, 0} • Alarm is ringing: A ∈ {1, 0} • Earthquake: E ∈ {1, 0} • Burglar: B ∈ {1, 0} • Naively, we need a joint probability distribution table with 2^n − 1 entries for n variables • This quickly becomes too large: 2^50 ∼ 10^15 • We cannot handle even moderately big systems this way

  4. What are Bayesian networks? A Bayesian network is a Directed Acyclic Graph (DAG) that specifies a joint distribution over random variables as a product of conditional probability functions, one for each variable given its set of parents. For the alarm network (edges B → A, E → A, A → J, A → M): • P(B, E, A, J, M) = P(B) P(E) P(A | B, E) P(J | A) P(M | A) • From 2^5 − 1 = 31 entries down to 1 + 1 + 2^2 + 2 + 2 = 10 entries
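
A minimal sketch of this factorization in Python (the CPT values below are hypothetical placeholders, not numbers from the talk):

```python
# Burglar-alarm network: P(B,E,A,J,M) = P(B) P(E) P(A|B,E) P(J|A) P(M|A).
# All probabilities here are illustrative placeholders.
P_B = {1: 0.001, 0: 0.999}            # P(Burglar)
P_E = {1: 0.002, 0: 0.998}            # P(Earthquake)
P_A1 = {(1, 1): 0.95, (1, 0): 0.94,   # P(Alarm = 1 | B, E)
        (0, 1): 0.29, (0, 0): 0.001}
P_J1 = {1: 0.90, 0: 0.05}             # P(John calls = 1 | A)
P_M1 = {1: 0.70, 0: 0.01}             # P(Mary calls = 1 | A)

def joint(b, e, a, j, m):
    """Evaluate the joint as the product of the five local conditionals."""
    p_a = P_A1[(b, e)] if a == 1 else 1 - P_A1[(b, e)]
    p_j = P_J1[a] if j == 1 else 1 - P_J1[a]
    p_m = P_M1[a] if m == 1 else 1 - P_M1[a]
    return P_B[b] * P_E[e] * p_a * p_j * p_m

# Only 1 + 1 + 4 + 2 + 2 = 10 stored parameters instead of 2**5 - 1 = 31.
print(joint(b=1, e=0, a=1, j=1, m=0))
```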

  8. What’s the problem? We want to learn the structure of Bayesian network from data. 5 • Realization of each random variable: X 1 X 2 X 1 , X 2 , X 3 , · · · , X i , · · · , X n • Need not be ordered: X 3 X i X 3 , X n , X i , · · · , X 1 , · · · , X 2 • N i.i.d. samples X n

  9. What’s the problem? We want to learn the structure of Bayesian network from data. 5 • Realization of each random variable: X 1 X 2 X 1 , X 2 , X 3 , · · · , X i , · · · , X n • Need not be ordered: X 3 X i X 3 , X n , X i , · · · , X 1 , · · · , X 2 • N i.i.d. samples X n

  10. What’s the problem? We want to learn the structure of Bayesian network from data. 5 • Realization of each random variable: X 1 X 2 X 1 , X 2 , X 3 , · · · , X i , · · · , X n • Need not be ordered: X 3 X i X 3 , X n , X i , · · · , X 1 , · · · , X 2 • N i.i.d. samples X n

  11. What’s the problem? We want to learn the structure of Bayesian network from data. 5 • Realization of each random variable: X 1 X 2 X 1 , X 2 , X 3 , · · · , X i , · · · , X n • Need not be ordered: X 3 X i X 3 , X n , X i , · · · , X 1 , · · · , X 2 • N i.i.d. samples X n

  12. What’s the problem? We want to learn the structure of Bayesian network from data. Data 5 • Realization of each random variable: X 1 , X 2 , X 3 , · · · , X i , · · · , X n • Need not be ordered: X 1 X 2 X 3 , X n , X i , · · · , X 1 , · · · , X 2 • N i.i.d. samples X 3 X i n variables X n N samples

  13. What’s the problem? Data Can we recover Bayesian network structure from data? Yes, but it is hard! 6 X 1 X 2 n variables X 3 X i → X n N samples

  14. What’s the problem? Data Can we recover Bayesian network structure from data? Yes, but it is NP-Hard [Chickering’04]! 6 X 1 X 2 n variables X 3 X i → X n N samples

  7. Recovering the structure of a Bayesian network from data: related work. • Score maximization methods: [Friedman'99, Margaritis'00, Moore'03, Tsamardinos'06], [Koivisto'04, Jaakkola'10, Silander'12, Cussens'12] • Independence-test-based methods: [Spirtes'00, Cheng'02, Yehezkel'05, Xie'08] • Special cases with guarantees: linear structural equation models with additive noise, poly time and sample complexity [Ghoshal'17a, '17b]; node ordering for ordinal variables, poly time and sample complexity [Park'15, '17]; binary variables, poly sample complexity [Brenner'13]

  8. Our Problem We want to learn the skeleton of the Bayesian network (its undirected edge structure) from data, with: • Correctness • Polynomial time complexity • Polynomial sample complexity. [Ordyniak'13]: it is possible to learn the DAG from the skeleton efficiently under some technical conditions.

  9. Encoding Variables The type of the variables matters, as the data is the only input we have for our mathematical model. • Numerical variables, e.g. marks in an exam: 25 < 30 < 45 (natural order) • Categorical (nominal) variables, e.g. country name: USA, India, China (no natural order)

  10. Encoding Variables We use an encoding for categorical variables. Example variable, country name: USA, India, China. • Dummy encoding: USA → (1, 0, 0), India → (0, 1, 0), China → (0, 0, 1) • Effects encoding: USA → (1, 0), India → (0, 1), China → (−1, −1) • In general, X_r is encoded as E(X_r) ∈ R^k
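
A small sketch of the two encodings (the category order and function names are my own, for illustration):

```python
import numpy as np

CATS = ["USA", "India", "China"]

def dummy_encoding(value, categories=CATS):
    # One-hot: the j-th category maps to the j-th standard basis vector of R^|categories|.
    vec = np.zeros(len(categories))
    vec[categories.index(value)] = 1.0
    return vec

def effects_encoding(value, categories=CATS):
    # The first |categories| - 1 categories map to basis vectors of R^(|categories|-1);
    # the last category maps to the all-(-1) vector.
    k = len(categories) - 1
    if value == categories[-1]:
        return -np.ones(k)
    vec = np.zeros(k)
    vec[categories.index(value)] = 1.0
    return vec

print(dummy_encoding("India"))    # [0. 1. 0.]
print(effects_encoding("China"))  # [-1. -1.]
```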

  11. Our Method Recover the neighborhood of every node and then combine. • π(r) is the set of parents of variable r and c(r) is the set of children of variable r; recover π(r) and c(r) for every node • X_{−r} = [X_1 ⋯ X_{r−1} X_{r+1} ⋯ X_n]^⊺ • E(X_{−r}) = [E(X_1) ⋯ E(X_{r−1}) E(X_{r+1}) ⋯ E(X_n)]^⊺ • Idea: express E(X_r) as a linear function of E(X_{−r}); a sketch of this construction follows below
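
A sketch of how the response E(X_r) and the stacked design matrix E(X_{−r}) might be assembled from per-variable encodings (the encode argument and the data layout are assumptions for illustration):

```python
import numpy as np

def build_design(X, r, encode):
    """X: (N, n) array of categorical samples; encode: maps a length-N column
    to its (N, k) encoding, e.g. an effects encoding applied row-wise.
    Returns E(X_r) of shape (N, k) and E(X_{-r}) of shape (N, (n-1)k)."""
    n = X.shape[1]
    E_r = encode(X[:, r])
    E_rest = np.hstack([encode(X[:, i]) for i in range(n) if i != r])
    return E_r, E_rest
```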

  12. Our Method Substitute the model in the population setting. • Let E(X_r) = W*^⊺ E(X_{−r}) + e(E(X_{−r})), where the error term e(E(X_{−r})) depends on X_{−r}; assume ‖E(|e|)‖_∞ ⩽ μ and ‖e‖_∞ = 2σ • W stacks one block W_i ∈ R^{k×k} per variable i ≠ r, so W ∈ R^{(n−1)k×k} • Solve for W* as arg min_W (1/2) E[‖E(X_r) − W^⊺ E(X_{−r})‖_2^2] such that W_i = 0 ∀ i ∉ π(r) ∪ c(r)

  13. Our Method Substitute the model in the sample setting. • Solve for Ŵ = arg min_W (1/(2N)) ‖E(X_r) − E(X_{−r}) W‖_F^2 + λ_N ‖W‖_{B,1,2}, where E(X_r) ∈ R^{N×k}, E(X_{−r}) ∈ R^{N×(n−1)k}, and ‖W‖_{B,1,2} = Σ_{i∈B} ‖W_i‖_F • Idea: provide conditions on N and λ_N such that ‖W_i‖_F = 0 ∀ i ∉ π(r) ∪ c(r) and ‖W_i‖_F ≠ 0 ∀ i ∈ π(r) ∪ c(r); see the sketch below
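
This estimator is a multi-task group lasso; here is a minimal proximal-gradient sketch (the step size, iteration count, and function name are illustrative choices, not the paper's algorithm):

```python
import numpy as np

def group_lasso(E_r, E_rest, k, lam, n_iter=500):
    """Minimize (1/2N) ||E(X_r) - E(X_{-r}) W||_F^2 + lam * sum_i ||W_i||_F,
    where W_i is the i-th (k x k) block of W, by proximal gradient descent."""
    N, p = E_rest.shape
    W = np.zeros((p, E_r.shape[1]))
    step = N / np.linalg.norm(E_rest, 2) ** 2  # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        V = W - step * (E_rest.T @ (E_rest @ W - E_r)) / N  # gradient step
        for i in range(p // k):                             # block soft-thresholding
            block = V[i * k:(i + 1) * k]
            nrm = np.linalg.norm(block)
            if nrm > 0:
                V[i * k:(i + 1) * k] = max(0.0, 1 - step * lam / nrm) * block
        W = V
    return W  # blocks with ||W_i||_F = 0 are declared non-neighbors of node r
```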

  14. Our Method Assumptions. Let H = E[E(X_{−r}) E(X_{−r})^⊺] and Ĥ = (1/N) E(X_{−r})^⊺ E(X_{−r}). 1. Uniqueness: we need the solution to be unique; mathematically, Λ_min(H_{π(r)∪c(r), π(r)∪c(r)}) = C > 0. 2. Mutual incoherence: the large number of irrelevant covariates (non-parents and non-children of node r) should not exert an overly strong effect on the relevant covariates (parents and children of node r); mathematically, for some α ∈ (0, 1], ‖H_{(π(r)∪c(r))^c, π(r)∪c(r)} H_{π(r)∪c(r), π(r)∪c(r)}^{−1}‖_{B,∞,1} ⩽ 1 − α, where ‖A‖_{B,∞,1} = max_{i∈B} ‖vec(A_i)‖_1. 3. Both conditions also hold in the sample setting with high probability if N = O(k^5 d^3 log(kn)), where d = |π(r) ∪ c(r)|. A numerical check is sketched below.
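
A sketch of how one might check these two conditions empirically on Ĥ (the helper name and the block-index set S for π(r) ∪ c(r) are assumptions for illustration):

```python
import numpy as np

def check_assumptions(E_rest, S, k):
    """E_rest: (N, (n-1)k) design matrix; S: block indices of pi(r) u c(r).
    Returns (C, incoherence): need C > 0 and incoherence <= 1 - alpha."""
    N, p = E_rest.shape
    H = E_rest.T @ E_rest / N                      # empirical H
    idx_S = np.concatenate([np.arange(i * k, (i + 1) * k) for i in S])
    idx_Sc = np.setdiff1d(np.arange(p), idx_S)
    H_SS = H[np.ix_(idx_S, idx_S)]
    C = np.linalg.eigvalsh(H_SS).min()             # minimum eigenvalue
    M = H[np.ix_(idx_Sc, idx_S)] @ np.linalg.inv(H_SS)
    # ||M||_{B,inf,1}: max over row blocks of the entrywise l1 norm of the block
    incoherence = max(np.abs(M[i * k:(i + 1) * k]).sum()
                      for i in range(len(idx_Sc) // k))
    return C, incoherence
```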
