
slide-1
SLIDE 1

Bayesian Networks, Big Data and Greedy Search

Efficient Implementation with Classic Statistics Marco Scutari scutari@idsia.ch April 3, 2019

slide-2
SLIDE 2

Marco Scutari, IDSIA

Overview

  • Learning the structure of Bayesian networks from data is known to be a computationally challenging, NP-hard problem [2, 4, 6].
  • Greedy search is the most common score-based heuristic for structure learning; how challenging is it in terms of computational complexity?
      • For discrete data;
      • for continuous data;
      • for hybrid (discrete + continuous) data;
      • for big data (n ≫ N and/or n ≫ |Θ|).
  • How are scores computed, and can we do better by revisiting learning
      • from classic statistics?
      • from a machine learning perspective?
slide-3
SLIDE 3

Bayesian Networks and Structure Learning

slide-4
SLIDE 4

Bayesian Networks and Structure Learning Marco Scutari, IDSIA

Bayesian Networks: A Graph and a Probability Distribution

A Bayesian network [15, BN] is defined by:

  • a network structure, a directed acyclic graph G in which each node vi ∈ V corresponds to a random variable Xi;
  • a global probability distribution P(X) with parameters Θ, which can be factorised into smaller local probability distributions according to the arcs present in the graph.

The main role of the network structure is to express the conditional independence relationships among the variables in the model through graphical separation, thus specifying the factorisation of the global distribution:

    P(X) = ∏_{i=1}^{p} P(Xi | ΠXi; ΘXi),   where ΠXi = {parents of Xi}.
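As a toy illustration of this factorisation (variable names and CPT values below are invented for the sketch, not taken from the slides), the joint probability of a small discrete BN is just the product of its local distributions:

```python
# Toy discrete BN with graph A -> C <- B; all names and CPT values are
# made up for illustration.
p_a = {0: 0.6, 1: 0.4}                      # P(A)
p_b = {0: 0.7, 1: 0.3}                      # P(B)
p_c1 = {(0, 0): 0.1, (0, 1): 0.5,           # P(C = 1 | A, B)
        (1, 0): 0.6, (1, 1): 0.9}

def joint(a, b, c):
    """P(A=a, B=b, C=c) via the factorisation P(X) = prod_i P(Xi | Pi_Xi)."""
    pc = p_c1[(a, b)] if c == 1 else 1 - p_c1[(a, b)]
    return p_a[a] * p_b[b] * pc

# The factorised joint is a proper distribution: it sums to 1.
total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
```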

slide-5
SLIDE 5

Bayesian Networks and Structure Learning Marco Scutari, IDSIA

Common Distributional Assumptions

The three most common choices for P(X) in the literature (by far) are:

  • Discrete BNs [13], in which X and the Xi | ΠXi are multinomial:

        Xi | ΠXi ∼ Mul(πik|j),   πik|j = P(Xi = k | ΠXi = j).

  • Gaussian BNs [11, GBNs], in which X is multivariate normal and the Xi | ΠXi are univariate normals linked by linear dependencies:

        Xi | ΠXi ∼ N(µXi + ΠXi βXi, σ²Xi),

    which can be equivalently written as a linear regression model

        Xi = µXi + ΠXi βXi + εXi,   εXi ∼ N(0, σ²Xi).
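A GBN local distribution is exactly a Gaussian linear model, so it can be fitted by least squares. A minimal sketch on synthetic data (the coefficients 1.0, 2.0, −0.5 and the noise scale are invented for illustration):

```python
import numpy as np

# Synthetic GBN local distribution: Xi = 1.0 + 2.0*X1 - 0.5*X2 + eps,
# eps ~ N(0, 0.3^2); all parameter values are made up for this sketch.
rng = np.random.default_rng(0)
n = 10_000
x1, x2 = rng.normal(size=n), rng.normal(size=n)
xi = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(scale=0.3, size=n)

# Least squares on the design matrix [1, Pi_Xi] recovers (mu_Xi, beta_Xi) ...
X = np.column_stack([np.ones(n), x1, x2])
coef, _, _, _ = np.linalg.lstsq(X, xi, rcond=None)
resid = xi - X @ coef
sigma2 = resid @ resid / n   # ... and the ML estimate of sigma^2_Xi.
```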

slide-6
SLIDE 6

Bayesian Networks and Structure Learning Marco Scutari, IDSIA

Common Distributional Assumptions

  • Conditional linear Gaussian BNs [17, CLGBNs], in which X is a mixture of multivariate normals. Discrete Xi | ΠXi are multinomial and are only allowed to have discrete parents (denoted ∆Xi). Continuous Xi are allowed to have both discrete and continuous parents (denoted ΓXi, with ∆Xi ∪ ΓXi = ΠXi). Their local distributions are

        Xi | ΠXi ∼ N(µXi,δXi + ΓXi βXi,δXi, σ²Xi,δXi),

    which can be written as a mixture of linear regressions

        Xi = µXi,δXi + ΓXi βXi,δXi + εXi,δXi,   εXi,δXi ∼ N(0, σ²Xi,δXi),

    against the continuous parents, with one component for each configuration δXi ∈ Val(∆Xi) of the discrete parents.

Other, less common options: copulas [9], truncated exponentials [18].
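The mixture-of-regressions form can be sketched directly: fit one linear regression per configuration of the discrete parents. All data and parameter values below are synthetic and invented for the sketch:

```python
import numpy as np

# Synthetic CLGBN local distribution: one discrete parent D in {0, 1} and one
# continuous parent G (all parameter values invented for this sketch).
rng = np.random.default_rng(1)
n = 20_000
d = rng.integers(0, 2, size=n)
g = rng.normal(size=n)
mu, beta = np.array([1.0, -1.0]), np.array([2.0, 0.5])
xi = mu[d] + beta[d] * g + rng.normal(scale=0.2, size=n)

# One linear regression per configuration delta of the discrete parents.
params = {}
for delta in (0, 1):
    mask = d == delta
    X = np.column_stack([np.ones(mask.sum()), g[mask]])
    coef, _, _, _ = np.linalg.lstsq(X, xi[mask], rcond=None)
    params[delta] = coef   # (mu_{Xi,delta}, beta_{Xi,delta})
```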

slide-7
SLIDE 7

Bayesian Networks and Structure Learning Marco Scutari, IDSIA

Bayesian Network Structure Learning

Learning a BN B = (G, Θ) from a data set D is performed in two steps:

    P(B | D) = P(G, Θ | D) = P(G | D) · P(Θ | G, D),
    [learning]               [structure]  [parameter]

that is, structure learning followed by parameter learning. In a Bayesian setting structure learning consists in finding the DAG with the best P(G | D) (BIC [20] is a common alternative) with some search algorithm. We can decompose P(G | D) into

    P(G | D) ∝ P(G) P(D | G) = P(G) ∫ P(D | G, Θ) P(Θ | G) dΘ,

where P(G) is the prior distribution over the space of the DAGs and P(D | G) is the marginal likelihood of the data given G, averaged over all possible parameter sets Θ; and then

    P(D | G) = ∏_{i=1}^{N} [ ∫ P(Xi | ΠXi, ΘXi) P(ΘXi | ΠXi) dΘXi ],

where ΠXi are the parents of Xi in G.
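BIC, mentioned above as a common alternative to P(G | D), decomposes over nodes in the same way. A minimal sketch for one Gaussian node (the helper name and data are ours, not from the slides): the network score is the sum of per-node terms, each a maximised log-likelihood minus a complexity penalty.

```python
import math
import numpy as np

def bic_local(child, parents, n):
    """BIC term for one Gaussian local distribution: log-likelihood at the ML
    estimates minus (|Theta_Xi| / 2) log n. `parents` is a list of arrays."""
    X = np.column_stack([np.ones(n)] + list(parents))
    coef, _, _, _ = np.linalg.lstsq(X, child, rcond=None)
    resid = child - X @ coef
    sigma2 = resid @ resid / n
    loglik = -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)
    return loglik - 0.5 * (X.shape[1] + 1) * math.log(n)

# The score of the whole DAG is the sum of the per-node terms, so adding a
# genuine parent should raise the node's contribution.
rng = np.random.default_rng(2)
n = 5_000
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)
with_parent = bic_local(y, [x], n)
without_parent = bic_local(y, [], n)
```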

slide-8
SLIDE 8

Bayesian Networks and Structure Learning Marco Scutari, IDSIA

Structure Learning Algorithms

Structure learning algorithms fall into one of three classes:

  • Constraint-based algorithms identify conditional independence constraints with statistical tests, and link nodes that are not found to be independent. Examples: PC [7], HITON-PC [1].
  • Score-based algorithms are applications of general optimisation techniques; each candidate network is assigned a score to maximise as the objective function. Examples: heuristics [19], MCMC [16], exact maximisation [22].
  • Hybrid algorithms have a restrict phase implementing a constraint-based strategy to reduce the space of candidate networks, and a maximise phase implementing a score-based strategy to find the optimal network in the restricted space. Examples: MMHC [23], H2PC [10].

slide-9
SLIDE 9

Bayesian Networks and Structure Learning Marco Scutari, IDSIA

Greedy Search is the Most Common Baseline

Here we concentrate on score-based algorithms, and on greedy search in particular, because

  • it is one of the most common algorithms in practical applications;
  • when used in combination with BIC, it has the appeal of being simple to reason about;
  • there is evidence it performs well compared to constraint-based and hybrid algorithms [21].

We apply greedy search to modern data, which can be

  • large in sample size, but not necessarily in the number of variables (n ≫ N) or parameters (n ≫ |Θ|); and
  • heterogeneous, with both discrete and continuous variables.
slide-10
SLIDE 10

Computational Complexity of Greedy Search

slide-11
SLIDE 11

Computational Complexity of Greedy Search Marco Scutari, IDSIA

Pseudocode for Greedy Search

Input: a data set D, an initial DAG G, a score function Score(G, D).
Output: the DAG Gmax that maximises Score(G, D).

  1. Compute the score of G, SG = Score(G, D).
  2. Set Smax = SG and Gmax = G.
  3. Hill climbing: repeat as long as Smax increases:
     3.1 for every valid arc addition, deletion or reversal in Gmax:
         3.1.1 compute the score of the modified DAG G∗, SG∗ = Score(G∗, D);
         3.1.2 if SG∗ > Smax and SG∗ > SG, set G = G∗ and SG = SG∗.
     3.2 if SG > Smax, set Smax = SG and Gmax = G.
  4. Tabu search: for up to t0 times:
     4.1 repeat step 3, but choose the DAG G with the highest SG that has not been visited in the last t1 steps, regardless of Smax;
     4.2 if SG > Smax, set S0 = Smax = SG and G0 = Gmax = G and restart the search from step 3.
  5. Random restart: for up to r times, perturb Gmax with multiple arc additions, deletions and reversals to obtain a new DAG G′ and:
     5.1 set S0 = Smax = SG′ and G0 = Gmax = G′ and restart the search from step 3;
     5.2 if the new Gmax is the same as the previous Gmax, stop and return Gmax.
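The hill-climbing core (steps 1–3, without tabu lists or random restarts) can be sketched in Python for GBNs. This is a naive first-improvement variant without the caching discussed later, and all names and the BIC helper are our own; it is a sketch, not the implementation benchmarked in the deck.

```python
import itertools
import math
import numpy as np

def bic_local(data, i, parents):
    """BIC of the Gaussian local distribution of node i given `parents`."""
    n = data.shape[0]
    X = np.column_stack([np.ones(n)] + [data[:, j] for j in sorted(parents)])
    coef, _, _, _ = np.linalg.lstsq(X, data[:, i], rcond=None)
    resid = data[:, i] - X @ coef
    sigma2 = max(resid @ resid / n, 1e-12)
    loglik = -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)
    return loglik - 0.5 * (X.shape[1] + 1) * math.log(n)

def is_acyclic(parents):
    """Depth-first cycle check on the parent-set representation of the DAG."""
    seen, stack = set(), set()
    def visit(v):
        if v in stack:
            return False
        if v in seen:
            return True
        seen.add(v)
        stack.add(v)
        ok = all(visit(p) for p in parents[v])
        stack.discard(v)
        return ok
    return all(visit(v) for v in parents)

def hill_climb(data):
    """Greedy search over arc additions, deletions and reversals (steps 1-3),
    taking the first score-improving move instead of the best one."""
    N = data.shape[1]
    parents = {i: set() for i in range(N)}
    score = sum(bic_local(data, i, parents[i]) for i in range(N))
    improved = True
    while improved:
        improved = False
        for i, j in itertools.permutations(range(N), 2):
            cand = {k: set(v) for k, v in parents.items()}
            if j in cand[i]:
                cand[i].discard(j)                    # delete j -> i
            elif i in cand[j]:
                cand[j].discard(i); cand[i].add(j)    # reverse i -> j
            else:
                cand[i].add(j)                        # add j -> i
            if not is_acyclic(cand):
                continue
            new = sum(bic_local(data, k, cand[k]) for k in range(N))
            if new > score + 1e-9:
                parents, score, improved = cand, new, True
    return parents, score
```

On two strongly dependent synthetic variables, this recovers the single arc between them (in one direction or the other, since the two DAGs are score-equivalent).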

slide-12
SLIDE 12

Computational Complexity of Greedy Search Marco Scutari, IDSIA

Computational Complexity

The following assumptions are standard in the literature:

  1. Estimating each local distribution is O(1); that is, the overall computational complexity of an algorithm is measured by the number of estimated local distributions.
  2. Model comparisons are assumed to always add, delete and reverse arcs correctly with respect to the underlying true model, since marginal likelihoods and BIC are globally and locally consistent [3].
  3. The true DAG is sparse and contains O(cN), c ∈ [1, 5] arcs.

The resulting expression for the computational complexity is:

    O(g(N)) = O(  cN³    +   t0N²   +  r0(r1N² + t0N²) )
               [steps 1–3]  [step 4]      [step 5]
            = O( cN³ + (t0 + r0(r1 + t0))N² ).

slide-13
SLIDE 13

Computational Complexity of Greedy Search Marco Scutari, IDSIA

Caching Local Distributions

Caching local distributions reduces the leading term to O(cN²) because:

  • adding or removing an arc only alters a single P(Xi | ΠXi);
  • reversing an arc Xj → Xi to Xi → Xj alters both P(Xi | ΠXi) and P(Xj | ΠXj).

Hence, we can keep a cache of the score values of the N local distributions for the current Gmax, and of the N² − N differences

    ∆ij = ScoreGmax(Xi, Π^Gmax_Xi, D) − ScoreG∗(Xi, Π^G∗_Xi, D),   i ≠ j;

so that we only have to estimate N or 2N local distributions for the nodes whose parents changed in the previous iteration (instead of N²).
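The caching idea can be sketched as a per-(node, parent set) score table, so that only the one or two local distributions whose parent sets changed are re-estimated. The class and its names are ours, invented for illustration:

```python
# Hypothetical per-node score cache for greedy search: a (node, parent set)
# pair is estimated at most once; every repeat lookup is a cache hit.
class ScoreCache:
    def __init__(self, score_fn):
        self.score_fn = score_fn      # score_fn(node, frozenset_of_parents)
        self.hits = self.misses = 0
        self._table = {}

    def local_score(self, node, parents):
        key = (node, frozenset(parents))
        if key not in self._table:
            self.misses += 1
            self._table[key] = self.score_fn(node, key[1])
        else:
            self.hits += 1
        return self._table[key]

# Scoring the same (node, parent set) twice estimates it only once.
cache = ScoreCache(lambda node, parents: -float(len(parents)))
first = cache.local_score(0, {1, 2})
second = cache.local_score(0, {2, 1})
```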

slide-14
SLIDE 14

Computational Complexity of Greedy Search Marco Scutari, IDSIA

Are They Really All the Same?

Estimating a local distribution in a discrete BN requires a single pass over the n samples for Xi and the ΠXi (taken to have l levels each):

    O(fΠXi(Xi)) = O( n(1 + |ΠXi|)  +  l^(1+|ΠXi|) ).
                       [counts]      [probabilities]

In a GBN, a local distribution is essentially a linear regression model and thus is usually estimated by applying a QR decomposition to [1 ΠXi]:

    O(fΠXi(Xi)) = O( n(1 + |ΠXi|)² )      [QR decomposition]
                + O( n(1 + |ΠXi|) )       [computing Qᵀ Xi]
                + O( (1 + |ΠXi|)² )       [backwards substitution]
                + O( n(1 + |ΠXi|) )       [computing x̂i]
                + O( 3n )                 [computing σ̂²Xi].

slide-15
SLIDE 15

Computational Complexity of Greedy Search Marco Scutari, IDSIA

Are They Really All the Same?

In a CLGBN, the local distribution of a continuous node with ΓXi continuous parents and ∆Xi discrete parents is a mixture of linear regressions, each estimated with QR:

    O(fΠXi(Xi)) = O( (n + l^|∆Xi|)(1 + |ΓXi|)² ) + O( 2n(1 + |ΓXi|) ) + O( 3n ).

The local distribution of a discrete node is computed in the same way as in a discrete BN.

It is clear that the computational complexity of estimating local distributions is very different under different distributional assumptions, so the O(1) assumption does not hold. What does that mean for greedy search?

slide-16
SLIDE 16

Computational Complexity of Greedy Search Marco Scutari, IDSIA

A Realistic Computational Complexity: Discrete BNs

Replacing O(1) in the computational complexity of greedy search with that of the local distributions in discrete BNs, we get

    O(g(N, d)) = Σ_{i=1}^{N} Σ_{j=1}^{|ΠXi|+1} Σ_{k=1}^{N−1} O( n(1 + j) + l^(1+j) )
               = O( ncN² + nN Σ_{i=1}^{N} |ΠXi|²/2 + Nl² Σ_{i=1}^{N} (l^(|ΠXi|+1) − 1)/(l − 1) ).

This implies that if G is sparse (|ΠXi| ⩽ b) the complexity is O(nN²):

    O(g(N, d)) = O( N² [ nc + nb²/2 + l² (l^(b+1) − 1)/(l − 1) ] );

and O(nN² l^N) if G is dense (|ΠXi| = O(N)):

    O(g(N, d)) = O( N² [ nc + nN³/2 + l² (l^N − 1)/(l − 1) ] ).
slide-17
SLIDE 17

Computational Complexity of Greedy Search Marco Scutari, IDSIA

A Realistic Computational Complexity: GBNs and CLGBNs

The corresponding computational complexity in the case of GBNs is

    O(g(N, d)) = O( nN Σ_{i=1}^{N} |ΠXi|³/3 ),

which is polynomial even if G is dense.

For CLGBNs with M continuous nodes and N − M discrete nodes, we have a complicated expression that combines the previous two. It tells us that:

  • O(g(N, d)) is always linear in the sample size;
  • unless the number of discrete parents is bounded for both discrete and continuous nodes, O(g(N, d)) is again more than exponential;
  • if the proportion of discrete nodes is small, we can assume that M ≈ N and O(g(N, d)) is always polynomial.

slide-18
SLIDE 18

Revisiting from Classic Statistics and Machine Learning

slide-19
SLIDE 19

Revisiting from Classic Statistics and Machine Learning Marco Scutari, IDSIA

Greedy Search and Low-Order Regressions

If we assume that G is sparse, most nodes will have a small number of parents and the vast majority of the local distributions we estimate will be low-dimensional. If we start the search from the empty DAG, we need to estimate local distributions

  • with j = 0, 1 parents for all nodes;
  • with j = 2 parents for all non-root nodes;
  • and only a vanishingly small number with j ⩾ 3 parents.

Hence optimising how we estimate local distributions with j = 0, 1, 2 parents can have an important impact on the overall computational complexity; especially in the case of GBNs and CLGBNs which do not scale linearly in N.

slide-20
SLIDE 20

Revisiting from Classic Statistics and Machine Learning Marco Scutari, IDSIA

Linear Regressions with Zero, One and Two Parents

  • j = 0 gives the trivial regression Xi = µXi + εXi.
  • j = 1 gives the simple regression, with

        β̂Xj = COV(Xi, Xj) / VAR(Xj).

  • j = 2 gives a regression with two explanatory variables, with [25]

        β̂Xj = (1/d) [ VAR(Xk) COV(Xi, Xj) − COV(Xj, Xk) COV(Xi, Xk) ],
        β̂Xk = (1/d) [ VAR(Xj) COV(Xi, Xk) − COV(Xj, Xk) COV(Xi, Xj) ],

    with d = VAR(Xj) VAR(Xk) − COV(Xj, Xk)².

In all cases we can compute closed-form estimators from variances and covariances, which are faster to compute (and to cache) than QR decompositions.
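The two-parent closed-form estimates can be checked against a full least-squares fit; with all (co)variances computed consistently they agree to numerical precision. The data below are synthetic, with made-up coefficients:

```python
import numpy as np

# Closed-form two-parent estimates from one covariance matrix, checked
# against ordinary least squares (synthetic, made-up coefficients).
rng = np.random.default_rng(5)
n = 5_000
xj, xk = rng.normal(size=n), rng.normal(size=n)
xi = 1.0 + 0.8 * xj - 1.2 * xk + rng.normal(scale=0.3, size=n)

C = np.cov(np.vstack([xi, xj, xk]))          # rows/cols ordered: xi, xj, xk
d = C[1, 1] * C[2, 2] - C[1, 2] ** 2         # VAR(Xj) VAR(Xk) - COV(Xj, Xk)^2
beta_j = (C[2, 2] * C[0, 1] - C[1, 2] * C[0, 2]) / d
beta_k = (C[1, 1] * C[0, 2] - C[1, 2] * C[0, 1]) / d

# Reference: least squares on the design matrix [1, Xj, Xk].
X = np.column_stack([np.ones(n), xj, xk])
coef, _, _, _ = np.linalg.lstsq(X, xi, rcond=None)
```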

slide-21
SLIDE 21

Revisiting from Classic Statistics and Machine Learning Marco Scutari, IDSIA

How Much Faster?

For GBNs:

    j    with QR    closed-form
    0    O(6n)      O(4.5n)
    1    O(9n)      O(7n)
    2    O(16n)     O(10.5n)

For CLGBNs:

    j    with QR               closed-form
    0    O(6n + l^|∆Xi|)       O(4.5n)
    1    O(11n + 4 l^|∆Xi|)    O(7n)
    2    O(18n + 9 l^|∆Xi|)    O(10.5n)

slide-22
SLIDE 22

Revisiting from Classic Statistics and Machine Learning Marco Scutari, IDSIA

Predictions as Scores, the Machine Learning Way

Chickering and Heckerman [5] suggested using the predictive posterior probability as the score function to select the optimal DAG,

    Score(G, D) = log P(Dtest | G, Θ, Dtrain),   D = Dtrain ∪ Dtest,

effectively maximising the negative cross-entropy between the “correct” posterior distribution of Dtest and that determined by G and Dtrain. This is called the engineering criterion.

As is the case for many machine learning models [12, e.g., deep neural networks], prediction is computationally much cheaper than estimation because it does not involve solving an optimisation problem.
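For one Gaussian node this prediction-based score can be sketched as: fit the local distribution on the training split, then evaluate its log-likelihood on the held-out split. The helper name and data layout conventions below are ours, invented for the sketch:

```python
import math
import numpy as np

def predictive_score(train, test):
    """Sketch of a prediction-based score for one Gaussian node: fit the
    local distribution on `train`, return its log-likelihood on `test`.
    Column 0 is the child, the remaining columns its parents (our own
    hypothetical convention)."""
    Xtr = np.column_stack([np.ones(len(train)), train[:, 1:]])
    coef, _, _, _ = np.linalg.lstsq(Xtr, train[:, 0], rcond=None)
    resid = train[:, 0] - Xtr @ coef
    sigma2 = resid @ resid / len(train)
    Xte = np.column_stack([np.ones(len(test)), test[:, 1:]])
    err = test[:, 0] - Xte @ coef
    return -0.5 * np.sum(np.log(2 * math.pi * sigma2) + err ** 2 / sigma2)

# A 75%/25% split, as in the simulations later in the deck: including the
# true parent should give a higher predictive score than leaving it out.
rng = np.random.default_rng(6)
n = 4_000
parent = rng.normal(size=n)
child = 2.0 * parent + rng.normal(size=n)
data = np.column_stack([child, parent])
train, test = data[:3_000], data[3_000:]
with_parent = predictive_score(train, test)
without_parent = predictive_score(train[:, :1], test[:, :1])
```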

slide-23
SLIDE 23

Revisiting from Classic Statistics and Machine Learning Marco Scutari, IDSIA

Prediction vs Estimation

The computational complexity of prediction is:

  • O(N |Dtest|) for discrete BNs, because we just have to perform an O(1) look-up to collect the relevant conditional probability for each node and observation;
  • O(cN |Dtest|) for GBNs and CLGBNs, because for each node and observation we need to compute the linear predictor ΠXi β̂Xi.

In contrast, the computational complexity of estimating local distributions is higher than O(N) for both GBNs and CLGBNs, while it is the same for discrete BNs. Hence the proportion of D used as Dtest will control the computational complexity of scoring nodes, since the per-node cost of prediction is smaller than that of estimation.

slide-24
SLIDE 24

Can We Do Better?

slide-25
SLIDE 25

Can We Do Better? Marco Scutari, IDSIA

Simulations from the MEHRA Network

[Figure: the MEHRA network graph over the 24 variables Altitude, blh, co, CVD60, Day, Hour, Latitude, Longitude, Month, no2, o3, pm10, pm2.5, Region, Season, so2, ssr, t2m, tp, Type, wd, ws, Year, Zone.]

MEHRA [24]: 24 variables, 50 million observations, used to explore the interplay between environmental factors, exposure levels to outdoor air pollutants, and health outcomes in the English regions of the United Kingdom between 1981 and 2014.

slide-26
SLIDE 26

Can We Do Better? Marco Scutari, IDSIA

Simulation Setting

  1. We consider sample sizes of 1, 2, 5, 10, 20 and 50 million.
  2. For each sample size, we generate 5 data sets from the CLGBN learned from the MEHRA data set.
  3. For each sample, we learn back the structure of the BN using greedy search in combination with various optimisations:
     • QR: estimating all Gaussian and conditional linear Gaussian local distributions using the QR decomposition, with BIC as the score function;
     • 1P: using the closed-form estimates for the local distributions that involve 0 or 1 parents, with BIC as the score function;
     • 2P: using the closed-form estimates for the local distributions that involve 0, 1 or 2 parents, with BIC as the score function;
     • PRED: using the closed-form estimates for the local distributions that involve 0, 1 or 2 parents, learning the local distributions on 75% of the data and estimating posterior predictive probabilities on the remaining 25%.

slide-27
SLIDE 27

Can We Do Better? Marco Scutari, IDSIA

Running Times and Structural Errors

[Figure: normalised running times (sample sizes of 1–50 million, log scale) for QR, 1P, 2P and PRED, with absolute times between roughly 00:03 and 03:52; accompanied by a table of SHD structural errors comparing BIC and PRED across the sample sizes.]
slide-28
SLIDE 28

Can We Do Better? Marco Scutari, IDSIA

Simulations from Reference Data Sets

We confirm the improvements in running times on 5 reference data sets from the UCI Machine Learning Repository [8] and from the repository of the Data Exposition Session of the Joint Statistical Meetings [14, JSM].

    Data       sample size    discrete nodes    continuous nodes
    AIRLINE    53.6 × 10⁶     9                 19
    GAS        4.2 × 10⁶                        37
    HEPMASS    10.5 × 10⁶     1                 28
    HIGGS      11.0 × 10⁶     1                 28
    SUSY       5.0 × 10⁶      1                 18

slide-29
SLIDE 29

Can We Do Better? Marco Scutari, IDSIA

Running Times on the Reference Data Sets

[Figure: normalised running times for QR, 1P, 2P and PRED on the AIRLINE, GAS, HEPMASS, HIGGS and SUSY data sets (absolute times between 00:12 and 03:21).]

slide-30
SLIDE 30

Conclusions

slide-31
SLIDE 31

Conclusions Marco Scutari, IDSIA

Conclusions

  • The assumption that estimating local distributions can be treated as an O(1) operation, regardless of the number of parents and distributional assumptions, is violated in practice.
  • The computational complexity of greedy search is markedly different for different distributional assumptions and graph sparsity.
  • In light of this, we can revisit how we score nodes using foundational results from classic statistics, and speed up learning of both GBNs and CLGBNs.
  • Taking a machine learning perspective on scoring, we can further speed up structure learning for all types of BNs by using predictive posterior probabilities as network scores.

slide-32
SLIDE 32

Thanks!

slide-33
SLIDE 33

References

slide-34
SLIDE 34

References Marco Scutari, IDSIA

References I

  • C. F. Aliferis, A. Statnikov, I. Tsamardinos, S. Mani, and X. D. Koutsoukos. Local Causal and Markov Blanket Induction for Causal Discovery and Feature Selection for Classification Part I: Algorithms and Empirical Evaluation. Journal of Machine Learning Research, 11:171–234, 2010.

  • D. M. Chickering.

Learning Bayesian networks is NP-Complete. In D. Fisher and H. Lenz, editors, Learning from Data: Artificial Intelligence and Statistics V, pages 121–130. Springer-Verlag, 1996.

  • D. M. Chickering.

Optimal Structure Identification With Greedy Search. Journal of Machine Learning Research, 3:507–554, 2002.

slide-35
SLIDE 35

References Marco Scutari, IDSIA

References II

  • D. M. Chickering and D. Heckerman.

Learning Bayesian networks is NP-hard. Technical Report MSR-TR-94-17, Microsoft Corporation, 1994.

  • D. M. Chickering and D. Heckerman. A Comparison of Scientific and Engineering Criteria for Bayesian Model Selection. Statistics and Computing, 10:55–62, 2000.

  • D. M. Chickering, D. Heckerman, and C. Meek.

Large-sample Learning of Bayesian Networks is NP-hard. Journal of Machine Learning Research, 5:1287–1330, 2004.

  • D. Colombo and M. H. Maathuis.

Order-Independent Constraint-Based Causal Structure Learning. Journal of Machine Learning Research, 15:3921–3962, 2014.

slide-36
SLIDE 36

References Marco Scutari, IDSIA

References III

  • D. Dheeru and E. Karra Taniskidou.

UCI Machine Learning Repository, 2017.

  • G. Elidan.

Copula Bayesian Networks. In J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 559–567, 2010.

  • M. Gasse, A. Aussem, and H. Elghazel.

A Hybrid Algorithm for Bayesian Network Structure Learning with Application to Multi-Label Learning. Expert Systems with Applications, 41(15):6755–6772, 2014.

slide-37
SLIDE 37

References Marco Scutari, IDSIA

References IV

  • D. Geiger and D. Heckerman.

Learning Gaussian Networks. In Proceedings of the 10th Conference on Uncertainty in Artificial Intelligence, pages 235–243, 1994.

  • I. Goodfellow, Y. Bengio, and A. Courville.

Deep Learning. MIT Press, 2016.

  • D. Heckerman, D. Geiger, and D. M. Chickering. Learning Bayesian Networks: The Combination of Knowledge and Statistical Data. Machine Learning, 20(3):197–243, 1995. Available as Technical Report MSR-TR-94-09.

  • JSM, the Data Exposition Session. Airline on-time performance, 2009.

slide-38
SLIDE 38

References Marco Scutari, IDSIA

References V

  • D. Koller and N. Friedman.

Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.

  • J. Kuipers and G. Moffa.

Partition MCMC for Inference on Acyclic Digraphs. Journal of the American Statistical Association, 112(517):282–299, 2017.

  • S. L. Lauritzen and N. Wermuth.

Graphical Models for Associations Between Variables, Some of which are Qualitative and Some Quantitative. The Annals of Statistics, 17(1):31–57, 1989.

slide-39
SLIDE 39

References Marco Scutari, IDSIA

References VI

  • S. Moral, R. Rumi, and A. Salmerón. Mixtures of Truncated Exponentials in Hybrid Bayesian Networks. In Symbolic and Quantitative Approaches to Reasoning with Uncertainty (ECSQARU), volume 2143 of Lecture Notes in Computer Science, pages 156–167. Springer, 2001.

  • S. J. Russell and P. Norvig.

Artificial Intelligence: A Modern Approach. Prentice Hall, 3rd edition, 2009.

  • G. Schwarz.

Estimating the Dimension of a Model. The Annals of Statistics, 6(2):461–464, 1978.

slide-40
SLIDE 40

References Marco Scutari, IDSIA

References VII

  • M. Scutari, C. E. Graafland, and J. M. Gutierrez.

Who Learns Better Bayesian Network Structures: Constraint-Based, Score-Based or Hybrid Algorithms? Proceedings of Machine Learning Research (PGM 2018), 72:416–427, 2018.

  • J. Suzuki and J. Kawahara.

Branch and Bound for Regular Bayesian Network Structure Learning. In Proceedings of the 33rd Conference on Uncertainty in Artificial Intelligence, pages 212–221, 2017.

  • I. Tsamardinos, L. E. Brown, and C. F. Aliferis.

The Max-Min Hill-Climbing Bayesian Network Structure Learning Algorithm. Machine Learning, 65(1):31–78, 2006.

slide-41
SLIDE 41

References Marco Scutari, IDSIA

References VIII

  • C. Vitolo, M. Scutari, M. Ghalaieny, A. Tucker, and A. Russell.

Modelling Air Pollution, Climate and Health Data Using Bayesian Networks: a Case Study of the English Regions. Earth and Space Science, 5, 2018. Submitted.

  • C. E. Weatherburn.

A First Course in Mathematical Statistics. Cambridge University Press, 1961.