Learning from Data: Detecting Conditional Independencies and Score+Search Methods — PowerPoint PPT Presentation

  1. Learning from Data: Detecting Conditional Independencies and Score+Search Methods. Pedro Larrañaga, Computational Intelligence Group, Artificial Intelligence Department, Universidad Politécnica de Madrid. Bayesian Networks: From Theory to Practice, International Black Sea University Autumn School on Machine Learning, 3-11 October 2019, Tbilisi, Georgia.

  2. Outline: 1 Introduction; 2 Learning Parameters; 3 Learning Structures; 4 Summary.

  3. Outline: 1 Introduction; 2 Learning Parameters; 3 Learning Structures; 4 Summary.

  4. From data to Bayesian networks: learning structure and parameters.

  5. From data to Bayesian networks: learning structure and parameters. From raw data to a representation of the information (a Bayesian network) that is: more condensed, showing the essence of the data and reducing the number of parameters needed to specify the joint probability distribution; more abstract, a model describing the joint probability distribution that generates the data; more useful, a model able to support different types of reasoning.

  6. Discovering associations: the task of learning Bayesian networks from data. Given a data set of cases $D = \{x^{(1)}, ..., x^{(N)}\}$ drawn at random from a joint probability distribution $p_0(x_1, ..., x_n)$ over $X_1, ..., X_n$, and possibly some domain expert background knowledge, the task consists of identifying (learning) a DAG (directed acyclic graph) structure $S$ and a set of corresponding parameters $\Theta$.

  7. Discovering associations: the task of learning Bayesian networks from data. When discovering associations, all the variables receive the same treatment: there is no target variable, as in supervised classification, and there is no hidden variable, as in clustering.

  8. Outline: 1 Introduction; 2 Learning Parameters; 3 Learning Structures; 4 Summary.

  9. Expert elicitation vs. estimation from a data base. Assuming we have the structure of the Bayesian network, the parameters can be obtained by direct elicitation from an expert or by estimation from a data base of cases. Example: if each variable $X_i$ has $r_i$ possible values and variable $Y$ has $r_Y$ values, the number of parameters to be learnt to specify $P(Y = y \mid X_1 = x_1, ..., X_n = x_n)$ is $(r_Y - 1)\prod_{i=1}^{n} r_i$.
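A minimal sketch of this parameter count; the cardinalities below are made-up for illustration and are not from the slides.

```python
from math import prod

# Hypothetical example: Y has r_Y = 3 values and three parent variables
# X_1, X_2, X_3 with cardinalities r_i = 2, 4, 3 respectively.
r_Y = 3
r = [2, 4, 3]

# Number of free parameters needed to specify P(Y | X_1, ..., X_n):
# (r_Y - 1) free probabilities for each of the prod(r_i) parent configurations.
n_params = (r_Y - 1) * prod(r)
print(n_params)  # (3 - 1) * 2 * 4 * 3 = 48
```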

  10. Maximum likelihood estimation: parameter space. Consider a variable $X$ with $r$ possible values $\{1, 2, ..., r\}$. We have $N$ observations (cases) of $X$: $D = \{x^1, ..., x^N\}$, that is, a sample of size $N$ drawn from $X$. Example: $X$ measures the result obtained after rolling a die five times, $D = \{1, 6, 4, 3, 1\}$, $r = 6$, and $N = 5$. We are interested in estimating $P(X = k)$. The parameter space is $\Theta = \{\theta = (\theta_1, ..., \theta_r) \mid \theta_i \in [0, 1],\ \theta_r = 1 - \sum_{i=1}^{r-1} \theta_i\}$, and $P(X = k \mid \theta_1, ..., \theta_r) = \theta_k$.

  11. Maximum likelihood estimation: likelihood function. $L(D : \theta) = P(D \mid \theta) = P(X = x^1, ..., X = x^N \mid \theta)$. The likelihood function measures how probable it is to obtain the data base of cases for a concrete value of the parameter $\theta$. Assuming that the cases are independent: $P(D \mid \theta) = \prod_{i=1}^{N} P(X = x^i \mid \theta) = \prod_{k=1}^{r} \theta_k^{N_k}$, where $N_k$ denotes the number of cases in the data base for which $X = k$.

  12. Likelihood function: example. Data base of $N = 10$ cases of a binary variable $X$: cases 1-5 have $X = 0$ and cases 6-10 have $X = 1$ (so $N_0 = 5$ and $N_1 = 5$). For $\theta = P(X = 1) = \frac{1}{4}$: $L(D : \frac{1}{4}) = P(D \mid \frac{1}{4}) = P(X = 0, ..., X = 1 \mid \frac{1}{4}) = \left(\frac{3}{4}\right)^5 \left(\frac{1}{4}\right)^5 = \frac{3^5}{4^{10}}$. For $\theta = P(X = 1) = \frac{1}{2}$: $L(D : \frac{1}{2}) = P(D \mid \frac{1}{2}) = P(X = 0, ..., X = 1 \mid \frac{1}{2}) = \left(\frac{1}{2}\right)^5 \left(\frac{1}{2}\right)^5 = \frac{1}{2^{10}} > \frac{3^5}{4^{10}}$.
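A minimal numeric check of this example, using only the counts on the slide ($N_0 = 5$, $N_1 = 5$).

```python
# Likelihood of a sample with N0 = 5 zeros and N1 = 5 ones,
# evaluated at two candidate values of theta = P(X = 1).
def likelihood(theta, n_ones=5, n_zeros=5):
    return (1 - theta) ** n_zeros * theta ** n_ones

print(likelihood(0.25))  # 3^5 / 4^10 ≈ 0.000232
print(likelihood(0.5))   # 1 / 2^10  ≈ 0.000977, larger than at theta = 0.25
```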

  13. Maximum likelihood estimation: multinomial distribution and relative frequencies. $\theta^* = (\theta_1^*, \theta_2^*, ..., \theta_{r-1}^*) = \arg\max_{(\theta_1, \theta_2, ..., \theta_{r-1})} P(D \mid \theta)$. In a multinomial distribution, the maximum likelihood estimator for $P(X = k)$ is the relative frequency $\theta_k^* = \frac{N_k}{N}$. In the previous example, the maximum likelihood estimator of $P(X = 1)$ is $\theta^* = \frac{5}{10}$.
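A minimal sketch of the relative-frequency estimator, applied to the die example of slide 10; the data and r = 6 come from that slide, while the function name is ours.

```python
from collections import Counter

# Maximum likelihood estimate of P(X = k) for a multinomial variable:
# the relative frequency N_k / N for each of the r possible values.
def mle(data, r):
    counts = Counter(data)
    n = len(data)
    return {k: counts.get(k, 0) / n for k in range(1, r + 1)}

# Die example from slide 10: D = {1, 6, 4, 3, 1}, r = 6, N = 5.
print(mle([1, 6, 4, 3, 1], r=6))
# {1: 0.4, 2: 0.0, 3: 0.2, 4: 0.2, 5: 0.0, 6: 0.2}
```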

  14. Bayesian estimation: prior, posterior and predictive distributions. Prior knowledge is assumed, expressed by means of a prior joint distribution over the parameters, $\rho(\theta_1, \theta_2, ..., \theta_{r-1})$. The posterior distribution of the parameters given $D$ is denoted by $\rho(\theta_1, \theta_2, ..., \theta_{r-1} \mid D)$. The predictive distribution $P(X = k \mid D)$ is the average of $\theta_k$ over the posterior distribution of the parameters: $P(X = k \mid D) = \int_{\Theta} \theta_k \, \rho(\theta_1, \theta_2, ..., \theta_{r-1} \mid D) \, d\theta_1 d\theta_2 \cdots d\theta_{r-1}$. Using the Bayes formula: $P(X = k \mid D) = \frac{1}{P(D)} \int_{\Theta} \theta_k \prod_{j=1}^{r} \theta_j^{N_j} \, \rho(\theta_1, \theta_2, ..., \theta_{r-1}) \, d\theta_1 d\theta_2 \cdots d\theta_{r-1}$.

  15. Bayesian estimation: Dirichlet distribution. The computation of the previous integral depends on the prior distribution. For the family of priors known as the Dirichlet distribution, $\rho(\theta_1, ..., \theta_{r-1}) \equiv Dir(\theta_1, ..., \theta_{r-1}; a_1, ..., a_r) = \frac{\Gamma(\sum_{i=1}^{r} a_i)}{\prod_{i=1}^{r} \Gamma(a_i)} \, \theta_1^{a_1 - 1} \cdots \theta_r^{a_r - 1}$, with $a_i > 0$, $0 \le \theta_i \le 1$, $\sum_{i=1}^{r} \theta_i = 1$, and $\Gamma(u) = \int_0^{\infty} t^{u-1} e^{-t} \, dt$ (if $u \in \mathbb{N}$, then $\Gamma(u) = (u-1)!$), the integral has an analytic solution. The solution is obtained using a conjugacy property of the Dirichlet distribution: if the prior is $Dir(\theta_1, ..., \theta_{r-1}; a_1, ..., a_r)$, then the posterior is $Dir(\theta_1, ..., \theta_{r-1}; a_1 + N_1, ..., a_r + N_r)$.

  16. Bayesian estimation: Dirichlet distribution. The Bayesian estimate is $P(X = k \mid D) = \frac{N_k + a_k}{N + \sum_{i=1}^{r} a_i}$. The value $\sum_{i=1}^{r} a_i$ is called the equivalent sample size. Interpretation of the Dirichlet prior: before obtaining the data base $D = \{x^1, ..., x^N\}$, we had virtually observed a sample of size $\sum_{i=1}^{r} a_i$ in which $X$ took the value $k$ exactly $a_k$ times.
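A minimal sketch of the closed-form estimate, with a Monte Carlo check of the predictive integral from slide 14. The Dirichlet parameters and counts below are illustrative assumptions, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative (assumed) setup: r = 3 values, Dirichlet prior a = (1, 1, 1),
# and observed counts N_k from a hypothetical data base of N = 20 cases.
a = np.array([1.0, 1.0, 1.0])
counts = np.array([12.0, 5.0, 3.0])

# Closed-form Bayesian estimate: P(X = k | D) = (N_k + a_k) / (N + sum(a_i)).
closed_form = (counts + a) / (counts.sum() + a.sum())

# Monte Carlo check: the predictive distribution is the posterior mean of theta,
# where the posterior is Dir(a_1 + N_1, ..., a_r + N_r) by conjugacy.
samples = rng.dirichlet(a + counts, size=100_000)
monte_carlo = samples.mean(axis=0)

print(closed_form)   # approximately [0.565 0.261 0.174]
print(monte_carlo)   # close to the closed-form values
```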

  17. Bayesian estimation: Lidstone rule. Assuming a symmetric Dirichlet prior with $a_i = \lambda$ for all $i = 1, ..., r$, that is, $Dir(\theta_1, ..., \theta_{r-1}; \lambda, ..., \lambda)$, we obtain the Lidstone rule for estimation: $P(X = k \mid D) = \frac{N_k + \lambda}{N + r\lambda}$. Particular cases: Laplace rule ($\lambda = 1$): $P(X = k \mid D) = \frac{N_k + 1}{N + r}$; Jeffreys-Perks rule ($\lambda = 0.5$): $P(X = k \mid D) = \frac{N_k + 0.5}{N + r/2}$; Schurmann-Grassberger rule ($\lambda = 1/r$): $P(X = k \mid D) = \frac{N_k + 1/r}{N + 1}$.
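A minimal sketch of these smoothing rules as special cases of a single function; the die data are reused from slide 10, and the function name is ours.

```python
from collections import Counter

# Lidstone rule: P(X = k | D) = (N_k + lambda) / (N + r * lambda).
# lambda = 1 gives Laplace, 0.5 gives Jeffreys-Perks, 1/r gives Schurmann-Grassberger.
def lidstone(data, r, lam):
    counts = Counter(data)
    n = len(data)
    return {k: (counts.get(k, 0) + lam) / (n + r * lam) for k in range(1, r + 1)}

data, r = [1, 6, 4, 3, 1], 6
print(lidstone(data, r, lam=1.0))    # Laplace: e.g. P(X=1|D) = (2+1)/(5+6) = 3/11
print(lidstone(data, r, lam=0.5))    # Jeffreys-Perks
print(lidstone(data, r, lam=1/r))    # Schurmann-Grassberger: P(X=1|D) = (2+1/6)/(5+1)
```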

  18. Estimation of parameters: the parameters $\theta_{ijk}$. Bayesian network structure $S = (\mathbf{X}, A)$ with $\mathbf{X} = (X_1, ..., X_n)$ and $A$ denoting the set of arcs. Variable $X_i$ has $r_i$ possible values $x_i^1, ..., x_i^{r_i}$. Local probability distribution: $P(x_i^k \mid pa_i^{j,S}, \theta_i) = \theta_{x_i^k \mid pa_i^j} \equiv \theta_{ijk}$. The parameter $\theta_{ijk}$ represents the conditional probability of variable $X_i$ taking its $k$-th value, given that the set of its parent variables is in its $j$-th configuration. $pa_i^{1,S}, ..., pa_i^{q_i,S}$ denote the values of $Pa_i^S$, the set of parents of the variable $X_i$ in the structure $S$. The term $q_i$ denotes the number of possible different instantiations of the parent variables of $X_i$; thus $q_i = \prod_{X_g \in Pa_i} r_g$. The local parameters for variable $X_i$ are $\theta_i = ((\theta_{ijk})_{k=1}^{r_i})_{j=1}^{q_i}$, and the global parameters are $\theta = (\theta_1, ..., \theta_n)$.
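A minimal sketch of maximum likelihood estimation of the theta_ijk from data, assuming a hypothetical two-node structure X1 -> X2 and a tiny hand-made data set; none of the names or numbers below come from the slides.

```python
from itertools import product

# Hypothetical data set over two binary variables, with structure X1 -> X2,
# i.e. Pa(X2) = {X1}. Each row is a case (x1, x2).
data = [(0, 0), (0, 0), (0, 1), (1, 1), (1, 1), (1, 0), (1, 1), (0, 0)]

r = {1: 2, 2: 2}            # cardinalities r_i of X1 and X2
parents = {1: (), 2: (1,)}  # Pa(X1) = {}, Pa(X2) = {X1}

def estimate_theta(i):
    """theta_ijk = N_ijk / N_ij: relative frequency of X_i = k within
    each configuration j of the parents of X_i (maximum likelihood)."""
    pa = parents[i]
    theta = {}
    for j in product(*(range(r[g]) for g in pa)):      # each parent configuration
        cases = [row for row in data if all(row[g - 1] == v for g, v in zip(pa, j))]
        n_ij = len(cases)
        for k in range(r[i]):
            n_ijk = sum(1 for row in cases if row[i - 1] == k)
            theta[(j, k)] = n_ijk / n_ij if n_ij else 1.0 / r[i]
    return theta

print(estimate_theta(2))
# {((0,), 0): 0.75, ((0,), 1): 0.25, ((1,), 0): 0.25, ((1,), 1): 0.75}
```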
