Learning Bayesian networks: Given structure and completely observed data
Probabilistic Graphical Models, Sharif University of Technology, Spring 2017, Soleymani


  1. Learning Bayesian networks: Given structure and completely observed data
     Probabilistic Graphical Models, Sharif University of Technology, Spring 2017, Soleymani

  2. Learning problem
     - Target: true distribution $P^*$ that may correspond to $\mathcal{M}^* = (\mathcal{K}^*, \boldsymbol{\theta}^*)$
     - Hypothesis space: specified probabilistic graphical models
     - Data: set of instances sampled from $P^*$
     - Learning goal: selecting a model $\mathcal{M}$ to construct the best approximation to $\mathcal{M}^*$ according to a performance metric

  3. Learning tasks on graphical models
     - Parameter learning / structure learning
     - Completely observable / partially observable data
     - Directed model / undirected model

  4. Parameter learning in directed models: complete data
     - We assume that the structure of the model is known: consider learning the parameters of a BN with a given structure
     - Goal: estimate the CPDs from a dataset $\mathcal{D} = \{\boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(N)}\}$ of $N$ independent, identically distributed (i.i.d.) training samples
     - Each training sample $\boldsymbol{x}^{(n)} = (x_1^{(n)}, \dots, x_L^{(n)})$ is a vector in which every element $x_i^{(n)}$ is known (no missing values, no hidden variables)

  5. Density estimation review
     - We use density estimation to solve this learning problem
     - Density estimation: estimating the probability density function $P(\boldsymbol{x})$, given a set of data points $\{\boldsymbol{x}^{(n)}\}_{n=1}^{N}$ drawn from it
     - Parametric methods: assume that $P(\boldsymbol{x})$ has a specific functional form with a number of adjustable parameters; two approaches: MLE and Bayesian estimation
       - MLE: need to determine $\boldsymbol{\theta}^*$ given $\{\boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(N)}\}$; suffers from the overfitting problem
       - Bayesian estimation: a probability distribution $P(\boldsymbol{\theta})$ over the spectrum of hypotheses; needs a prior distribution on the parameters

  6. Density estimation: Graphical model
     - i.i.d. assumption: [plate model in which a single parameter node $\boldsymbol{\theta}$ is the parent of every sample $X^{(1)}, X^{(2)}, \dots, X^{(N)}$, $n = 1, \dots, N$]
     - Bayesian variant: [plate model in which hyperparameters $\boldsymbol{\alpha}$ are the parent of $\boldsymbol{\theta}$, which in turn is the parent of the samples $X^{(1)}, \dots, X^{(N)}$]

  7. Maximum Likelihood Estimation (MLE)
     - Likelihood is the conditional probability of the observations $\mathcal{D} = \{\boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \dots, \boldsymbol{x}^{(N)}\}$ given the value of the parameters $\boldsymbol{\theta}$
     - Assuming i.i.d. (independent, identically distributed) samples, the likelihood of $\boldsymbol{\theta}$ w.r.t. the samples is
       $P(\mathcal{D}|\boldsymbol{\theta}) = P(\boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(N)}|\boldsymbol{\theta}) = \prod_{n=1}^{N} P(\boldsymbol{x}^{(n)}|\boldsymbol{\theta})$
     - Maximum likelihood estimation:
       $\boldsymbol{\theta}_{ML} = \arg\max_{\boldsymbol{\theta}} P(\mathcal{D}|\boldsymbol{\theta}) = \arg\max_{\boldsymbol{\theta}} \prod_{n=1}^{N} P(\boldsymbol{x}^{(n)}|\boldsymbol{\theta}) = \arg\max_{\boldsymbol{\theta}} \sum_{n=1}^{N} \ln p(\boldsymbol{x}^{(n)}|\boldsymbol{\theta})$
     - MLE has a closed-form solution for many parametric distributions (a numeric sanity check follows below)
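
     As a minimal sketch of the definition above (not from the slides): the following assumes a Bernoulli model and toy data, and finds the maximizer of the i.i.d. log-likelihood, a sum over samples, by grid search. The closed form derived on the next slide gives the same answer.

        import numpy as np

        # Toy i.i.d. sample: 7 heads (1) and 3 tails (0).
        data = np.array([1, 1, 0, 1, 1, 1, 0, 1, 1, 0])

        def log_likelihood(theta, data):
            # i.i.d. assumption: the log-likelihood is a sum over the samples.
            return np.sum(data * np.log(theta) + (1 - data) * np.log(1 - theta))

        # Numeric argmax over a grid of candidate parameter values.
        grid = np.linspace(0.01, 0.99, 99)
        theta_ml = grid[np.argmax([log_likelihood(t, data) for t in grid])]
        print(theta_ml)  # 0.7, matching the closed form m/N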

  8. MLE: Bernoulli distribution
     - Given $\mathcal{D} = \{x^{(1)}, x^{(2)}, \dots, x^{(N)}\}$ with $m$ heads (1) and $N - m$ tails (0):
       $p(x|\theta) = \theta^x (1-\theta)^{1-x}$, so $p(x=1|\theta) = \theta$
       $p(\mathcal{D}|\theta) = \prod_{n=1}^{N} p(x^{(n)}|\theta) = \prod_{n=1}^{N} \theta^{x^{(n)}} (1-\theta)^{1-x^{(n)}}$
       $\ln p(\mathcal{D}|\theta) = \sum_{n=1}^{N} \ln p(x^{(n)}|\theta) = \sum_{n=1}^{N} \left\{ x^{(n)} \ln\theta + (1 - x^{(n)}) \ln(1-\theta) \right\}$
       $\frac{\partial}{\partial\theta} \ln p(\mathcal{D}|\theta) = 0 \;\Rightarrow\; \theta_{ML} = \frac{m}{N} = \frac{\sum_{n=1}^{N} x^{(n)}}{N}$
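
     The closed form is a one-liner; a quick check on the same assumed toy data as above:

        import numpy as np

        data = np.array([1, 1, 0, 1, 1, 1, 0, 1, 1, 0])  # m = 7 heads, N = 10
        theta_ml = data.mean()  # closed form: theta_ML = m / N
        print(theta_ml)         # 0.7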

  9. MLE: Multinomial distribution
     - Multinomial distribution (on a variable with $K$ states):
       $P(\boldsymbol{x}|\boldsymbol{\theta}) = \prod_{k=1}^{K} \theta_k^{x_k}$, with $P(x_k = 1) = \theta_k$
     - Variable: 1-of-K coding, $\boldsymbol{x} = (x_1, \dots, x_K)$ with $x_k \in \{0,1\}$ and $\sum_{k=1}^{K} x_k = 1$
     - Parameter space: $\boldsymbol{\theta} = (\theta_1, \dots, \theta_K)$ where $\theta_k \in [0,1]$ and $\sum_{k=1}^{K} \theta_k = 1$; e.g., for $K = 3$, the constraint $\theta_1 + \theta_2 + \theta_3 = 1$ defines a simplex showing the set of valid parameters

  10. MLE: Multinomial distribution
     - Given $\mathcal{D} = \{\boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \dots, \boldsymbol{x}^{(N)}\}$:
       $P(\mathcal{D}|\boldsymbol{\theta}) = \prod_{n=1}^{N} P(\boldsymbol{x}^{(n)}|\boldsymbol{\theta}) = \prod_{n=1}^{N} \prod_{k=1}^{K} \theta_k^{x_k^{(n)}} = \prod_{k=1}^{K} \theta_k^{m_k}$
       where $m_k = \sum_{n=1}^{N} x_k^{(n)}$ are the state counts, so $\sum_{k=1}^{K} m_k = N$
     - Maximizing the log-likelihood subject to the simplex constraint via a Lagrange multiplier,
       $\mathcal{L}(\boldsymbol{\theta}, \lambda) = \ln p(\mathcal{D}|\boldsymbol{\theta}) + \lambda\left(1 - \sum_{k=1}^{K} \theta_k\right)$,
       gives $\theta_k = \frac{m_k}{N} = \frac{\sum_{n=1}^{N} x_k^{(n)}}{N}$
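
     A minimal sketch of the count-based solution, on assumed toy 1-of-K data:

        import numpy as np

        # Toy 1-of-K data: N = 5 samples over K = 3 states.
        D = np.array([[1, 0, 0],
                      [0, 1, 0],
                      [1, 0, 0],
                      [0, 0, 1],
                      [1, 0, 0]])
        m = D.sum(axis=0)      # state counts m_k
        theta_ml = m / len(D)  # closed form: theta_k = m_k / N
        print(theta_ml)        # [0.6 0.2 0.2]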

  11. MLE: Gaussian with unknown $\mu$
      $\ln P(x^{(n)}|\mu) = -\frac{1}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\left(x^{(n)} - \mu\right)^2$
      $\frac{\partial}{\partial\mu} \ln P(\mathcal{D}|\mu) = 0 \;\Rightarrow\; \sum_{n=1}^{N} \frac{\partial}{\partial\mu} \ln p(x^{(n)}|\mu) = 0$
      $\Rightarrow\; \frac{1}{\sigma^2} \sum_{n=1}^{N} \left(x^{(n)} - \mu\right) = 0 \;\Rightarrow\; \mu_{ML} = \frac{1}{N} \sum_{n=1}^{N} x^{(n)}$
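
     The same result checked in code, assuming synthetic data with a known generating mean ($\sigma$ treated as known):

        import numpy as np

        rng = np.random.default_rng(0)
        x = rng.normal(loc=2.0, scale=1.0, size=1000)  # synthetic data, true mean 2.0
        mu_ml = x.mean()  # closed form: the sample mean
        print(mu_ml)      # close to 2.0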

  12. Bayesian approach
      - Treats the parameters $\boldsymbol{\theta}$ as random variables with an a priori distribution
        - Utilizes the available prior information about the unknown parameters
      - As opposed to ML estimation, it does not seek a specific point estimate of the unknown parameter vector $\boldsymbol{\theta}$
      - The samples $\mathcal{D}$ convert the prior density $P(\boldsymbol{\theta})$ into a posterior density $P(\boldsymbol{\theta}|\mathcal{D})$
      - It keeps track of the beliefs about $\boldsymbol{\theta}$'s values and uses these beliefs for reaching conclusions

  13. Maximum A Posteriori (MAP) estimation
      - MAP estimation: $\boldsymbol{\theta}_{MAP} = \arg\max_{\boldsymbol{\theta}} p(\boldsymbol{\theta}|\mathcal{D})$
      - Since $p(\boldsymbol{\theta}|\mathcal{D}) \propto p(\mathcal{D}|\boldsymbol{\theta}) \, p(\boldsymbol{\theta})$:
        $\boldsymbol{\theta}_{MAP} = \arg\max_{\boldsymbol{\theta}} p(\mathcal{D}|\boldsymbol{\theta}) \, p(\boldsymbol{\theta})$
      - Example of a prior distribution: $p(\theta) = \mathcal{N}(\theta_0, \sigma^2)$
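
     A minimal numeric sketch (the Bernoulli likelihood, the data, and the prior values $\theta_0 = 0.5$, $\sigma = 0.1$ are all assumptions for illustration): with the slide's Gaussian prior, the grid argmax of $\ln p(\mathcal{D}|\theta) + \ln p(\theta)$ lands between the MLE and the prior mean.

        import numpy as np

        data = np.array([1, 1, 1, 0, 1, 1, 1, 1, 1, 1])  # m = 9 heads out of N = 10
        theta0, sigma = 0.5, 0.1  # assumed Gaussian prior hyperparameters

        def log_posterior(theta):
            log_lik = np.sum(data * np.log(theta) + (1 - data) * np.log(1 - theta))
            log_prior = -0.5 * ((theta - theta0) / sigma) ** 2  # Gaussian prior, up to a constant
            return log_lik + log_prior

        grid = np.linspace(0.01, 0.99, 981)
        theta_map = grid[np.argmax([log_posterior(t) for t in grid])]
        print(theta_map)  # ~0.62: pulled from the MLE 0.9 toward the prior mean 0.5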

  14. Bayesian approach: Predictive distribution
      - Given a set of samples $\mathcal{D} = \{\boldsymbol{x}^{(n)}\}_{n=1}^{N}$, a prior distribution $P(\boldsymbol{\theta})$ on the parameters, and the form of the distribution $P(\boldsymbol{x}|\boldsymbol{\theta})$
      - We find $P(\boldsymbol{\theta}|\mathcal{D})$ and use it to specify $P(\boldsymbol{x}) = P(\boldsymbol{x}|\mathcal{D})$ on new data as an estimate of $P(\boldsymbol{x})$, the predictive distribution:
        $P(\boldsymbol{x}|\mathcal{D}) = \int P(\boldsymbol{x}, \boldsymbol{\theta}|\mathcal{D}) \, d\boldsymbol{\theta} = \int P(\boldsymbol{x}|\mathcal{D}, \boldsymbol{\theta}) \, P(\boldsymbol{\theta}|\mathcal{D}) \, d\boldsymbol{\theta} = \int P(\boldsymbol{x}|\boldsymbol{\theta}) \, P(\boldsymbol{\theta}|\mathcal{D}) \, d\boldsymbol{\theta}$
      - Analytical solutions exist only for very special forms of the involved functions
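
     When the integral has no closed form, it can be approximated by Monte Carlo averaging over posterior samples. A minimal sketch for the Beta-Bernoulli case that is worked out analytically on the following slides (the hyperparameter and count values are assumptions):

        import numpy as np

        rng = np.random.default_rng(0)
        a1, a0, m, N = 2, 2, 3, 3  # assumed Beta prior and data counts (3 heads in 3 tosses)

        # Draw samples from the posterior Beta(a1 + m, a0 + N - m) ...
        thetas = rng.beta(a1 + m, a0 + N - m, size=100_000)
        # ... and average P(x=1|theta) = theta over them.
        p_heads = np.mean(thetas)
        print(p_heads)  # ~ (a1 + m) / (a0 + a1 + N) = 5/7 ~= 0.714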

  15. Conjugate priors
      - We consider a form of prior distribution that has a simple interpretation as well as some useful analytical properties
      - Conjugacy means choosing a prior such that the posterior, which is proportional to $p(\mathcal{D}|\boldsymbol{\theta}) \, p(\boldsymbol{\theta})$, has the same functional form as the prior:
        $\forall \boldsymbol{\alpha}, \mathcal{D} \;\, \exists \boldsymbol{\alpha}': \; P(\boldsymbol{\theta}|\boldsymbol{\alpha}') \propto P(\mathcal{D}|\boldsymbol{\theta}) \, P(\boldsymbol{\theta}|\boldsymbol{\alpha})$, with both sides having the same functional form

  16. Prior for the Bernoulli likelihood
      - Beta distribution over $\theta \in [0,1]$:
        $\mathrm{Beta}(\theta|\alpha_1, \alpha_0) = \frac{\Gamma(\alpha_0 + \alpha_1)}{\Gamma(\alpha_0)\,\Gamma(\alpha_1)} \, \theta^{\alpha_1 - 1} (1-\theta)^{\alpha_0 - 1} \propto \theta^{\alpha_1 - 1} (1-\theta)^{\alpha_0 - 1}$
      - Mean: $E[\theta] = \frac{\alpha_1}{\alpha_0 + \alpha_1}$; most probable $\theta$ (mode): $\frac{\alpha_1 - 1}{(\alpha_0 - 1) + (\alpha_1 - 1)}$
      - The Beta distribution is the conjugate prior of the Bernoulli: $P(x|\theta) = \theta^x (1-\theta)^{1-x}$
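
     A quick numeric check of the mean and mode formulas using scipy (the hyperparameter values are assumptions; scipy's beta takes $(a, b) = (\alpha_1, \alpha_0)$ in this parameterization):

        import numpy as np
        from scipy.stats import beta

        a1, a0 = 5, 2  # assumed hyperparameters
        dist = beta(a1, a0)  # pdf proportional to theta^(a1-1) * (1-theta)^(a0-1)

        print(dist.mean())                      # a1 / (a0 + a1) = 5/7
        grid = np.linspace(0.001, 0.999, 999)
        print(grid[np.argmax(dist.pdf(grid))])  # mode: (a1-1)/((a0-1)+(a1-1)) = 0.8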

  17. Beta distribution [plots of $\mathrm{Beta}(\theta|\alpha_1, \alpha_0)$ for several hyperparameter settings]

  18. Bernoulli likelihood: posterior
      - Given $\mathcal{D} = \{x^{(1)}, x^{(2)}, \dots, x^{(N)}\}$ with $m$ heads (1) and $N - m$ tails (0), where $m = \sum_{i=1}^{N} x^{(i)}$:
        $p(\theta|\mathcal{D}) \propto p(\mathcal{D}|\theta) \, p(\theta) = \left[\prod_{i=1}^{N} \theta^{x^{(i)}} (1-\theta)^{1-x^{(i)}}\right] \mathrm{Beta}(\theta|\alpha_1, \alpha_0) \propto \theta^{m + \alpha_1 - 1} (1-\theta)^{N - m + \alpha_0 - 1}$
      - $\Rightarrow\; p(\theta|\mathcal{D}) \propto \mathrm{Beta}(\theta|\alpha_1', \alpha_0')$ with $\alpha_1' = \alpha_1 + m$ and $\alpha_0' = \alpha_0 + N - m$
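
     In code, the conjugate update is just addition of counts; a minimal sketch (prior and data assumed):

        import numpy as np

        def beta_posterior(a1, a0, data):
            # Conjugate update: add the heads to a1 and the tails to a0.
            data = np.asarray(data)
            m = data.sum()                     # number of heads
            return a1 + m, a0 + len(data) - m  # (a1', a0')

        print(beta_posterior(2, 2, [1, 1, 1]))  # (5, 2), as in the example on the next slide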

  19. Example
      - Bernoulli likelihood: $p(x|\theta) = \theta^x (1-\theta)^{1-x}$, with $p(x=1|\theta) = \theta$
      - Prior: Beta with $\alpha_0 = \alpha_1 = 2$
      - Given $\mathcal{D} = \{x^{(1)}, x^{(2)}, \dots, x^{(N)}\}$ with $m$ heads (1) and $N - m$ tails (0); here $\mathcal{D} = \{1, 1, 1\} \Rightarrow N = 3, m = 3$
      - Posterior: Beta with $\alpha_1' = 5, \alpha_0' = 2$, so
        $\theta_{MAP} = \arg\max_{\theta} P(\theta|\mathcal{D}) = \frac{\alpha_1' - 1}{(\alpha_1' - 1) + (\alpha_0' - 1)} = \frac{4}{5}$

  20. Bernoulli: Predictive distribution
      - Training samples: $\mathcal{D} = \{x^{(1)}, \dots, x^{(N)}\}$ with $m$ heads
        $P(\theta) = \mathrm{Beta}(\theta|\alpha_1, \alpha_0) \propto \theta^{\alpha_1 - 1} (1-\theta)^{\alpha_0 - 1}$
        $P(\theta|\mathcal{D}) = \mathrm{Beta}(\theta|\alpha_1 + m, \alpha_0 + N - m) \propto \theta^{\alpha_1 + m - 1} (1-\theta)^{\alpha_0 + N - m - 1}$
        $P(x|\mathcal{D}) = \int P(x|\theta) \, P(\theta|\mathcal{D}) \, d\theta = E_{P(\theta|\mathcal{D})}[P(x|\theta)]$
        $\Rightarrow\; P(x=1|\mathcal{D}) = E_{P(\theta|\mathcal{D})}[\theta] = \frac{\alpha_1 + m}{\alpha_0 + \alpha_1 + N}$
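
     The closed-form predictive as a one-liner (the same assumed numbers as in the Monte Carlo sketch after slide 14):

        def predictive_heads(a1, a0, m, N):
            # Closed form: P(x=1|D) = (a1 + m) / (a0 + a1 + N)
            return (a1 + m) / (a0 + a1 + N)

        print(predictive_heads(2, 2, 3, 3))  # 5/7 ~= 0.714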

  21. Dirichlet distribution
      - Input space: $\boldsymbol{\theta} = (\theta_1, \dots, \theta_K)^T$ with $\theta_k \in [0,1]$ and $\sum_{k=1}^{K} \theta_k = 1$
        $P(\boldsymbol{\theta}|\boldsymbol{\alpha}) \propto \prod_{k=1}^{K} \theta_k^{\alpha_k - 1} = \frac{\Gamma(\hat{\alpha})}{\Gamma(\alpha_1) \cdots \Gamma(\alpha_K)} \prod_{k=1}^{K} \theta_k^{\alpha_k - 1}$, where $\hat{\alpha} = \sum_{k=1}^{K} \alpha_k$
      - Mean: $E[\theta_k] = \frac{\alpha_k}{\hat{\alpha}}$; mode: $\theta_k = \frac{\alpha_k - 1}{\hat{\alpha} - K}$
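
     A numeric check of the mean formula via numpy's Dirichlet sampler (the $\boldsymbol{\alpha}$ values are assumptions):

        import numpy as np

        rng = np.random.default_rng(0)
        alpha = np.array([2.0, 2.0, 20.0])  # assumed hyperparameters, K = 3

        samples = rng.dirichlet(alpha, size=100_000)  # each row lies on the simplex
        print(samples.mean(axis=0))   # ~= alpha / alpha.sum() = [1/12, 1/12, 10/12]
        print(alpha / alpha.sum())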

  22. Dirichlet distribution: Examples
      - [density plots over the simplex for $\boldsymbol{\alpha} = [10,10,10]$, $\boldsymbol{\alpha} = [0.1,0.1,0.1]$, and $\boldsymbol{\alpha} = [1,1,1]$]
      - The Dirichlet parameters determine both the prior beliefs and their strength: larger values of $\alpha$ correspond to more confidence in the prior belief (i.e., more imaginary samples)

  23. Dirichlet distribution: Example [density plots over the simplex for $\boldsymbol{\alpha} = [2,2,2]$ and $\boldsymbol{\alpha} = [20,2,2]$]

  24. Multinomial distribution: Prior
      - The Dirichlet distribution is the conjugate prior of the Multinomial:
        $P(\boldsymbol{\theta}|\mathcal{D}, \boldsymbol{\alpha}) \propto P(\mathcal{D}|\boldsymbol{\theta}) \, P(\boldsymbol{\theta}|\boldsymbol{\alpha}) \propto \prod_{k=1}^{K} \theta_k^{m_k + \alpha_k - 1}$
        where $\boldsymbol{m} = (m_1, \dots, m_K)^T$ are the sufficient statistics of the data
      - $P(\boldsymbol{\theta}|\mathcal{D}, \boldsymbol{\alpha}) = \mathrm{Dir}(\boldsymbol{\theta}|\boldsymbol{\alpha} + \boldsymbol{m})$: the prior $\boldsymbol{\theta} \sim \mathrm{Dir}(\alpha_1, \dots, \alpha_K)$ becomes the posterior $\boldsymbol{\theta} \sim \mathrm{Dir}(\alpha_1 + m_1, \dots, \alpha_K + m_K)$
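
     As with the Beta-Bernoulli case, the update adds the counts to the hyperparameters; a minimal sketch reusing the assumed toy 1-of-K data from slide 10:

        import numpy as np

        alpha = np.array([1.0, 1.0, 1.0])  # assumed Dirichlet prior
        D = np.array([[1, 0, 0],
                      [0, 1, 0],
                      [1, 0, 0],
                      [0, 0, 1],
                      [1, 0, 0]])
        m = D.sum(axis=0)       # sufficient statistics m_k
        alpha_post = alpha + m  # posterior: Dir(theta | alpha + m)
        print(alpha_post)                     # [4. 2. 2.]
        print(alpha_post / alpha_post.sum())  # posterior-mean estimate of theta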
