 
Learning from Data: Detecting Conditional Independencies and Score+Search Methods

Pedro Larrañaga
Computational Intelligence Group, Artificial Intelligence Department, Universidad Politécnica de Madrid

Bayesian Networks: From Theory to Practice
International Black Sea University Autumn School on Machine Learning, 3-11 October 2019, Tbilisi, Georgia
Outline
1 Introduction
2 Learning Parameters
3 Learning Structures
4 Summary
1 Introduction
From data to Bayesian networks: learning structure and parameters

Learning moves from raw data to another representation of the information (a Bayesian network) that is:
- more condensed: it shows the essentials of the data and reduces the number of parameters needed to specify the joint probability distribution
- more abstract: a model describing the joint probability distribution that generates the data
- more useful: a model that supports different types of reasoning
Discovering associations: the task of learning Bayesian networks from data

Given a data set of cases $D = \{x^{(1)}, \ldots, x^{(N)}\}$ drawn at random from a joint probability distribution $p_0(x_1, \ldots, x_n)$ over $X_1, \ldots, X_n$, and possibly some background knowledge from a domain expert, the task consists of identifying (learning) a DAG (directed acyclic graph) structure $S$ and a set of corresponding parameters $\Theta$.
When discovering associations, all the variables are treated alike:
- there is no target variable, as there is in supervised classification
- there is no hidden variable, as there is in clustering
2 Learning Parameters
Expert vs. estimating from a database

Assuming the structure of the Bayesian network is known, its parameters can be obtained by:
- direct estimation given by an expert
- estimation from a database of cases

Example: if each variable $X_i$ has $r_i$ possible values and variable $Y$ has $r_Y$ values, the number of parameters to be learnt to specify $P(Y = y \mid X_1 = x_1, \ldots, X_n = x_n)$ is

$$(r_Y - 1) \prod_{i=1}^{n} r_i$$
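The parameter count in the example above can be checked with a few lines of Python. This is only a minimal sketch: the function name `num_conditional_parameters` and the cardinalities used in the calls are illustrative assumptions, not part of the slides.

```python
from math import prod


def num_conditional_parameters(r_y, parent_cardinalities):
    """Free parameters needed to specify P(Y | X_1, ..., X_n).

    One distribution over Y is needed per joint configuration of the
    X_i (prod of the r_i), and each has r_Y - 1 free values because
    the probabilities must sum to 1.
    """
    return (r_y - 1) * prod(parent_cardinalities)


# Hypothetical cardinalities, chosen only for illustration.
print(num_conditional_parameters(2, [2, 2, 2]))  # 8
print(num_conditional_parameters(3, [2, 4]))     # 16
print(num_conditional_parameters(4, []))         # 3 (no conditioning variables)
```

The count grows exponentially with the number of conditioning variables, which is why both expert elicitation and estimation from data become harder as $n$ increases.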
Maximum likelihood estimation: parameter space

Consider a variable $X$ with $r$ possible values: $\{1, 2, \ldots, r\}$.
We have $N$ observations (cases) of $X$: $D = \{x_1, \ldots, x_N\}$, that is, a sample of size $N$ drawn from $X$.
Example: $X$ measures the result of rolling a die, observed five times. $D = \{1, 6, 4, 3, 1\}$, $r = 6$, and $N = 5$.
We are interested in estimating $P(X = k)$.
The parameter space is

$$\Theta = \left\{ \theta = (\theta_1, \ldots, \theta_r) \;\middle|\; \theta_i \in [0, 1],\ \theta_r = 1 - \sum_{i=1}^{r-1} \theta_i \right\}$$

$$P(X = k \mid \theta_1, \ldots, \theta_r) = \theta_k$$
Maximum likelihood estimation: likelihood function

$$L(D : \theta) = P(D \mid \theta) = P(X = x_1, \ldots, X = x_N \mid \theta)$$

The likelihood function measures how probable it is to obtain the database of cases for a given value of the parameter $\theta$.

Assuming that the cases are independent:

$$P(D \mid \theta) = \prod_{i=1}^{N} P(X = x_i \mid \theta) = \prod_{k=1}^{r} \theta_k^{N_k}$$

where $N_k$ denotes the number of cases in the database for which $X = k$.
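As a small illustration of the two equivalent forms of the likelihood (a product over cases and a product over values raised to their counts), the sketch below evaluates both on the die data from the earlier example. The fair-die parameter vector is an assumption made only to have a concrete $\theta$ to plug in; the function names are hypothetical.

```python
from collections import Counter
from math import isclose, prod


def likelihood_by_cases(data, theta):
    """P(D | theta) as a product over the N independent cases."""
    return prod(theta[x] for x in data)


def likelihood_by_counts(data, theta):
    """P(D | theta) as a product over values k of theta_k ** N_k.

    Values that never occur have N_k = 0 and contribute a factor of 1,
    so iterating over the observed values only is enough.
    """
    counts = Counter(data)  # N_k for every observed value k
    return prod(theta[k] ** n_k for k, n_k in counts.items())


# Die example from the text: D = {1, 6, 4, 3, 1}, r = 6, N = 5.
data = [1, 6, 4, 3, 1]
# Assumed parameter value, for illustration only: a fair die.
fair = {k: 1 / 6 for k in range(1, 7)}

l_cases = likelihood_by_cases(data, fair)
l_counts = likelihood_by_counts(data, fair)
print(l_cases, l_counts)
print(isclose(l_cases, l_counts))  # True
```

The count-based form depends on the data only through the $N_k$, which is what makes the counts sufficient for estimating $\theta$.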
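Likelihood function: example

Database of $N = 10$ cases of a binary variable $X$:

Case   1   2   3   4   5   6   7   8   9   10
X      0   0   0   0   0   1   1   1   1   1

For $\theta = P(X = 1) = \frac{1}{4}$:

$$L(D : \tfrac{1}{4}) = P(D \mid \tfrac{1}{4}) = P(X = 0, \ldots, X = 1 \mid \tfrac{1}{4}) = \left(\tfrac{3}{4}\right)^{5} \left(\tfrac{1}{4}\right)^{5}$$

For $\theta = P(X = 1) = \frac{1}{2}$:

$$L(D : \tfrac{1}{2}) = P(D \mid \tfrac{1}{2}) = P(X = 0, \ldots, X = 1 \mid \tfrac{1}{2}) = \left(\tfrac{1}{2}\right)^{5} \left(\tfrac{1}{2}\right)^{5} = \left(\tfrac{1}{2}\right)^{10}$$

Since $\left(\tfrac{1}{2}\right)^{10} > \left(\tfrac{3}{4}\right)^{5} \left(\tfrac{1}{4}\right)^{5}$, the value $\theta = \frac{1}{2}$ makes the observed data more probable than $\theta = \frac{1}{4}$.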
Maximum likelihood estimation: multinomial distribution and relative frequencies

$$\theta^{*} = (\theta_1^{*}, \theta_2^{*}, \ldots, \theta_{r-1}^{*}) = \arg\max_{(\theta_1, \theta_2, \ldots, \theta_{r-1})} P(D \mid \theta)$$

In a multinomial distribution, the maximum likelihood estimator of $P(X = k)$ is the relative frequency:

$$\theta_k^{*} = \frac{N_k}{N}$$

In the previous example, the maximum likelihood estimator of $P(X = 1)$ is $\theta^{*} = \frac{5}{10}$.
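The relative-frequency estimator can be sketched as follows, using the ten binary cases of the previous example (five zeros and five ones). The grid search at the end is not part of the slides; it is only an illustrative sanity check that the closed-form estimate coincides with the maximiser of the likelihood on a coarse grid. Function names are hypothetical.

```python
from collections import Counter


def mle(data, values):
    """Maximum likelihood estimate theta_k = N_k / N for each value k."""
    n = len(data)
    counts = Counter(data)
    return {k: counts[k] / n for k in values}


def binary_likelihood(data, theta1):
    """P(D | theta) for binary data, with theta1 = P(X = 1)."""
    n1 = sum(data)
    n0 = len(data) - n1
    return (1 - theta1) ** n0 * theta1 ** n1


# Ten cases of a binary variable: five zeros followed by five ones.
data = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

theta = mle(data, values=[0, 1])
print(theta[1])  # 0.5

# Sanity check on a grid of candidate values 0.00, 0.01, ..., 1.00.
grid = [i / 100 for i in range(101)]
best = max(grid, key=lambda t: binary_likelihood(data, t))
print(best)  # 0.5
```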
Bayesian estimation: prior, posterior and predictive distributions

Prior knowledge is assumed, expressed by means of a prior joint distribution over the parameters: $\rho(\theta_1, \theta_2, \ldots, \theta_{r-1})$.
The posterior distribution of the parameters given $D$ is denoted by $\rho(\theta_1, \theta_2, \ldots, \theta_{r-1} \mid D)$.
The predictive distribution $P(X = k \mid D)$ is the expected value of $\theta_k$ under the posterior distribution of the parameters:

$$P(X = k \mid D) = \int_{\Theta} \theta_k \, \rho(\theta_1, \theta_2, \ldots, \theta_{r-1} \mid D) \, d\theta_1 \, d\theta_2 \cdots d\theta_{r-1}$$

Using Bayes' formula:

$$P(X = k \mid D) = \frac{1}{P(D)} \int_{\Theta} \theta_k \prod_{j=1}^{r} \theta_j^{N_j} \, \rho(\theta_1, \theta_2, \ldots, \theta_{r-1}) \, d\theta_1 \, d\theta_2 \cdots d\theta_{r-1}$$
Bayesian estimation: Dirichlet distribution

The computation of the previous integral depends on the prior distribution. For a family of prior distributions called the Dirichlet distribution,

$$\rho(\theta_1, \ldots, \theta_{r-1}) \equiv \mathrm{Dir}(\theta_1, \ldots, \theta_{r-1}; a_1, \ldots, a_r) = \frac{\Gamma\left(\sum_{i=1}^{r} a_i\right)}{\prod_{i=1}^{r} \Gamma(a_i)} \, \theta_1^{a_1 - 1} \cdots \theta_r^{a_r - 1}$$

with $a_i > 0$, $0 \le \theta_i \le 1$, $\sum_{i=1}^{r} \theta_i = 1$, and

$$\Gamma(u) = \int_{0}^{\infty} t^{u-1} e^{-t} \, dt, \qquad u \in \mathbb{N} \Rightarrow \Gamma(u) = (u - 1)!$$

the integral has an analytic solution.

The solution is obtained using a property of the Dirichlet distribution: if the prior is $\mathrm{Dir}(\theta_1, \ldots, \theta_{r-1}; a_1, \ldots, a_r)$, then the posterior is $\mathrm{Dir}(\theta_1, \ldots, \theta_{r-1}; a_1 + N_1, \ldots, a_r + N_r)$.
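The pieces of this slide (the Gamma function, the Dirichlet density and the prior-to-posterior update of the hyperparameters) can be sketched with the Python standard library alone. The function names, the uniform prior and the counts below are illustrative assumptions.

```python
from math import factorial, gamma, isclose, prod


def dirichlet_pdf(theta, a):
    """Dirichlet density at theta = (theta_1, ..., theta_r) for hyperparameters a."""
    if not isclose(sum(theta), 1.0):
        raise ValueError("theta must sum to 1")
    norm = gamma(sum(a)) / prod(gamma(a_i) for a_i in a)
    return norm * prod(t ** (a_i - 1) for t, a_i in zip(theta, a))


def posterior_hyperparameters(a, counts):
    """Conjugate update: Dir(a_1, ..., a_r) becomes Dir(a_1 + N_1, ..., a_r + N_r)."""
    return [a_i + n_i for a_i, n_i in zip(a, counts)]


# Gamma(u) = (u - 1)! for positive integers u.
print(all(isclose(gamma(u), factorial(u - 1)) for u in range(1, 10)))  # True

# Assumed prior for illustration: uniform Dirichlet over r = 3 values.
prior = [1, 1, 1]
# With all a_i = 1 the density is the constant Gamma(3) = 2 on the simplex.
print(dirichlet_pdf([0.2, 0.3, 0.5], prior))

# Hypothetical counts N_1, N_2, N_3 observed in a database.
counts = [4, 0, 6]
print(posterior_hyperparameters(prior, counts))  # [5, 1, 7]
```

Because the posterior stays in the Dirichlet family, updating on data reduces to adding the counts to the hyperparameters; no numerical integration is needed.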
The Bayesian estimate is

$$P(X = k \mid D) = \frac{N_k + a_k}{N + \sum_{i=1}^{r} a_i}$$

The value $\sum_{i=1}^{r} a_i$ is called the equivalent sample size.

Interpretation of the Dirichlet as a prior distribution: before obtaining the database $D = \{x_1, \ldots, x_N\}$ we had virtually observed a sample of size $\sum_{i=1}^{r} a_i$ in which $X$ takes the value $k$ exactly $a_k$ times.
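A minimal sketch of the Bayesian estimate, applied to the die data $D = \{1, 6, 4, 3, 1\}$ from the maximum likelihood example. The choice of prior ($a_k = 1$ for every face) is an assumption made for illustration, and the sketch uses exact fractions, so it assumes integer hyperparameters.

```python
from fractions import Fraction


def bayesian_estimate(counts, a):
    """P(X = k | D) = (N_k + a_k) / (N + sum_i a_i) for every value k.

    Uses exact fractions, so counts and hyperparameters must be integers.
    """
    n = sum(counts)                 # N, the real sample size
    equivalent_sample_size = sum(a)  # size of the "virtual" prior sample
    return [Fraction(n_k + a_k, n + equivalent_sample_size)
            for n_k, a_k in zip(counts, a)]


# Counts N_1..N_6 for D = {1, 6, 4, 3, 1}.
counts = [2, 0, 1, 1, 0, 1]
# Assumed prior: one virtual observation of each face.
prior = [1, 1, 1, 1, 1, 1]

estimates = bayesian_estimate(counts, prior)
print(estimates[0])    # 3/11
print(estimates[1])    # 1/11
print(sum(estimates))  # 1
```

Faces that never appear in the data (here 2 and 5) still receive non-zero probability, in contrast with the relative-frequency estimate, which would assign them 0.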
Bayesian estimation: Lidstone rule for estimation

Assuming a specific Dirichlet prior in which $a_i = \lambda$ for all $i = 1, \ldots, r$, that is, $\mathrm{Dir}(\theta_1, \ldots, \theta_{r-1}; \lambda, \ldots, \lambda)$, we obtain the Lidstone rule for estimation:

$$P(X = k \mid D) = \frac{N_k + \lambda}{N + r\lambda}$$

Special cases:
- Laplace rule ($\lambda = 1$): $P(X = k \mid D) = \dfrac{N_k + 1}{N + r}$
- Jeffreys-Perks rule ($\lambda = 0.5$): $P(X = k \mid D) = \dfrac{N_k + 0.5}{N + \frac{r}{2}}$
- Schurmann-Grassberger rule ($\lambda = \frac{1}{r}$): $P(X = k \mid D) = \dfrac{N_k + \frac{1}{r}}{N + 1}$
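The three named rules differ only in the value of $\lambda$, so one function covers them all. The sketch below compares them on the die counts used earlier ($D = \{1, 6, 4, 3, 1\}$); the function name and the choice of which face to print are illustrative assumptions.

```python
def lidstone(counts, lam):
    """Lidstone estimate (N_k + lam) / (N + r * lam) for every value k."""
    n = sum(counts)
    r = len(counts)
    return [(n_k + lam) / (n + r * lam) for n_k in counts]


# Counts N_1..N_6 for D = {1, 6, 4, 3, 1}; face 2 is never observed.
counts = [2, 0, 1, 1, 0, 1]
r = len(counts)

rules = {
    "Laplace": 1.0,
    "Jeffreys-Perks": 0.5,
    "Schurmann-Grassberger": 1.0 / r,
}

for name, lam in rules.items():
    estimate = lidstone(counts, lam)
    # Probability assigned to the unobserved face 2 (index 1).
    print(f"{name:22s} lambda={lam:.3f}  P(X=2|D)={estimate[1]:.4f}")
```

Smaller values of $\lambda$ correspond to a smaller equivalent sample size $r\lambda$, so the estimate stays closer to the relative frequencies and unobserved values receive less probability mass.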
Estimation of parameters: parameters $\theta_{ijk}$

- Bayesian network structure $S = (X, A)$, with $X = (X_1, \ldots, X_n)$ and $A$ denoting the set of arcs.
- Variable $X_i$ has $r_i$ possible values: $x_i^{1}, \ldots, x_i^{r_i}$.
- Local probability distribution $P(x_i \mid pa_i^{j,S}, \theta_i)$:

$$P(x_i^{k} \mid pa_i^{j,S}, \theta_i) = \theta_{x_i^{k} \mid pa_i^{j}} \equiv \theta_{ijk}$$

- The parameter $\theta_{ijk}$ represents the conditional probability of variable $X_i$ being in its $k$-th value, given that the set of its parent variables is in its $j$-th value.
- $pa_i^{1,S}, \ldots, pa_i^{q_i,S}$ denote the values of $Pa_i^{S}$, the set of parents of the variable $X_i$ in the structure $S$.
- The term $q_i$ denotes the number of possible different instances of the parent variables of $X_i$. Thus, $q_i = \prod_{X_g \in Pa_i} r_g$.
- The local parameters for variable $X_i$ are given by $\theta_i = \left( (\theta_{ijk})_{k=1}^{r_i} \right)_{j=1}^{q_i}$.
- Global parameters: $\theta = (\theta_1, \ldots, \theta_n)$.
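To make the indexing of $\theta_{ijk}$ concrete, the sketch below enumerates the parent configurations of each variable in a small network and reports $q_i$, $r_i$ and the resulting shape of $\theta_i$. The three-variable structure and its cardinalities are hypothetical, chosen only for illustration, and the ordering of parent configurations (the meaning of the index $j$) is an arbitrary convention of this sketch.

```python
from itertools import product
from math import prod

# Hypothetical structure S: each variable maps to the list of its parents.
parents = {"A": [], "B": ["A"], "C": ["A", "B"]}
# Hypothetical cardinalities r_i.
cardinality = {"A": 2, "B": 3, "C": 2}


def parent_configurations(var):
    """All joint values of Pa_i; the j-th tuple plays the role of pa_i^j.

    A variable without parents has exactly one (empty) configuration.
    """
    return list(product(*(range(cardinality[p]) for p in parents[var])))


def q(var):
    """q_i: product of the cardinalities of the parents of X_i."""
    return prod(cardinality[p] for p in parents[var])


for var in parents:
    configs = parent_configurations(var)
    r_i = cardinality[var]
    # theta_i is a q_i x r_i table; each row has r_i - 1 free entries.
    print(f"{var}: q_i={q(var)}, r_i={r_i}, "
          f"theta_i shape=({len(configs)}, {r_i}), "
          f"free parameters={q(var) * (r_i - 1)}")
```

Each row of the $q_i \times r_i$ table is one local distribution of $X_i$ given a parent configuration, so it can be estimated separately with any of the single-variable estimators discussed earlier.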