Learning Bayesian networks: Given structure and completely observed data
Probabilistic Graphical Models, Sharif University of Technology, Spring 2017, Soleymani


  1. Learning Bayesian networks: Given structure and completely observed data
     Probabilistic Graphical Models, Sharif University of Technology, Spring 2017, Soleymani

  2. Learning problem
     - Target: true distribution $P^*$ that may correspond to $\mathcal{M}^* = (\mathcal{K}^*, \boldsymbol{\theta}^*)$
     - Hypothesis space: specified probabilistic graphical models
     - Data: set of instances sampled from $P^*$
     - Learning goal: selecting a model $\mathcal{M}$ to construct the best approximation to $\mathcal{M}^*$ according to a performance metric

  3. Learning tasks on graphical models
     - Parameter learning / structure learning
     - Completely observable / partially observable data
     - Directed model / undirected model

  4. Parameter learning in directed models: complete data
     - We assume that the structure of the model is known: consider learning the parameters of a BN with a given structure
     - Goal: estimate the CPDs from a dataset $\mathcal{D} = \{\boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(N)}\}$ of $N$ independent, identically distributed (i.i.d.) training samples
     - Each training sample $\boldsymbol{x}^{(n)} = (x_1^{(n)}, \dots, x_L^{(n)})$ is a vector in which every element $x_i^{(n)}$ is known (no missing values, no hidden variables)

  5. Density estimation review
     - We use density estimation to solve this learning problem
     - Density estimation: estimating the probability density function $P(\boldsymbol{x})$, given a set of data points $\{\boldsymbol{x}^{(n)}\}_{n=1}^{N}$ drawn from it
     - Parametric methods: assume that $P(\boldsymbol{x})$ has a specific functional form with a number of adjustable parameters; two approaches: MLE and Bayesian estimation
       - MLE: need to determine $\boldsymbol{\theta}^*$ given $\{\boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(N)}\}$; suffers from the overfitting problem
       - Bayesian estimation: a probability distribution $P(\boldsymbol{\theta})$ over the spectrum of hypotheses; needs a prior distribution on the parameters

  6. Density estimation: Graphical model
     - i.i.d. assumption: [plate model in which a single parameter node $\boldsymbol{\theta}$ is the parent of every sample $X^{(1)}, X^{(2)}, \dots, X^{(N)}$, $n = 1, \dots, N$]
     - Bayesian variant: [plate model in which hyperparameters $\boldsymbol{\alpha}$ are the parent of $\boldsymbol{\theta}$, which in turn is the parent of the samples $X^{(1)}, \dots, X^{(N)}$]

  7. Maximum Likelihood Estimation (MLE)
     - Likelihood is the conditional probability of the observations $\mathcal{D} = \{\boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \dots, \boldsymbol{x}^{(N)}\}$ given the value of the parameters $\boldsymbol{\theta}$
     - Assuming i.i.d. (independent, identically distributed) samples, the likelihood of $\boldsymbol{\theta}$ w.r.t. the samples is
       $P(\mathcal{D}|\boldsymbol{\theta}) = P(\boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(N)}|\boldsymbol{\theta}) = \prod_{n=1}^{N} P(\boldsymbol{x}^{(n)}|\boldsymbol{\theta})$
     - Maximum likelihood estimation:
       $\boldsymbol{\theta}_{ML} = \arg\max_{\boldsymbol{\theta}} P(\mathcal{D}|\boldsymbol{\theta}) = \arg\max_{\boldsymbol{\theta}} \prod_{n=1}^{N} P(\boldsymbol{x}^{(n)}|\boldsymbol{\theta}) = \arg\max_{\boldsymbol{\theta}} \sum_{n=1}^{N} \ln p(\boldsymbol{x}^{(n)}|\boldsymbol{\theta})$
     - MLE has a closed-form solution for many parametric distributions (a numeric sanity check follows below)
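
     As a minimal sketch of the definition above (not from the slides): the following assumes a Bernoulli model and toy data, and finds the maximizer of the i.i.d. log-likelihood, a sum over samples, by grid search. The closed form derived on the next slide gives the same answer.

        import numpy as np

        # Toy i.i.d. sample: 7 heads (1) and 3 tails (0).
        data = np.array([1, 1, 0, 1, 1, 1, 0, 1, 1, 0])

        def log_likelihood(theta, data):
            # i.i.d. assumption: the log-likelihood is a sum over the samples.
            return np.sum(data * np.log(theta) + (1 - data) * np.log(1 - theta))

        # Numeric argmax over a grid of candidate parameter values.
        grid = np.linspace(0.01, 0.99, 99)
        theta_ml = grid[np.argmax([log_likelihood(t, data) for t in grid])]
        print(theta_ml)  # 0.7, matching the closed form m/N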

  8. MLE: Bernoulli distribution
     - Given $\mathcal{D} = \{x^{(1)}, x^{(2)}, \dots, x^{(N)}\}$ with $m$ heads (1) and $N - m$ tails (0):
       $p(x|\theta) = \theta^x (1-\theta)^{1-x}$, so $p(x=1|\theta) = \theta$
       $p(\mathcal{D}|\theta) = \prod_{n=1}^{N} p(x^{(n)}|\theta) = \prod_{n=1}^{N} \theta^{x^{(n)}} (1-\theta)^{1-x^{(n)}}$
       $\ln p(\mathcal{D}|\theta) = \sum_{n=1}^{N} \ln p(x^{(n)}|\theta) = \sum_{n=1}^{N} \left\{ x^{(n)} \ln\theta + (1 - x^{(n)}) \ln(1-\theta) \right\}$
       $\frac{\partial}{\partial\theta} \ln p(\mathcal{D}|\theta) = 0 \;\Rightarrow\; \theta_{ML} = \frac{m}{N} = \frac{\sum_{n=1}^{N} x^{(n)}}{N}$
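
     The closed form is a one-liner; a quick check on the same assumed toy data as above:

        import numpy as np

        data = np.array([1, 1, 0, 1, 1, 1, 0, 1, 1, 0])  # m = 7 heads, N = 10
        theta_ml = data.mean()  # closed form: theta_ML = m / N
        print(theta_ml)         # 0.7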

  9. MLE: Multinomial distribution
     - Multinomial distribution (on a variable with $K$ states):
       $P(\boldsymbol{x}|\boldsymbol{\theta}) = \prod_{k=1}^{K} \theta_k^{x_k}$, with $P(x_k = 1) = \theta_k$
     - Variable: 1-of-K coding, $\boldsymbol{x} = (x_1, \dots, x_K)$ with $x_k \in \{0,1\}$ and $\sum_{k=1}^{K} x_k = 1$
     - Parameter space: $\boldsymbol{\theta} = (\theta_1, \dots, \theta_K)$ where $\theta_k \in [0,1]$ and $\sum_{k=1}^{K} \theta_k = 1$; e.g., for $K = 3$, the constraint $\theta_1 + \theta_2 + \theta_3 = 1$ defines a simplex showing the set of valid parameters

  10. MLE: Multinomial distribution
     - Given $\mathcal{D} = \{\boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \dots, \boldsymbol{x}^{(N)}\}$:
       $P(\mathcal{D}|\boldsymbol{\theta}) = \prod_{n=1}^{N} P(\boldsymbol{x}^{(n)}|\boldsymbol{\theta}) = \prod_{n=1}^{N} \prod_{k=1}^{K} \theta_k^{x_k^{(n)}} = \prod_{k=1}^{K} \theta_k^{m_k}$
       where $m_k = \sum_{n=1}^{N} x_k^{(n)}$ are the state counts, so $\sum_{k=1}^{K} m_k = N$
     - Maximizing the log-likelihood subject to the simplex constraint via a Lagrange multiplier,
       $\mathcal{L}(\boldsymbol{\theta}, \lambda) = \ln p(\mathcal{D}|\boldsymbol{\theta}) + \lambda\left(1 - \sum_{k=1}^{K} \theta_k\right)$,
       gives $\theta_k = \frac{m_k}{N} = \frac{\sum_{n=1}^{N} x_k^{(n)}}{N}$
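
     A minimal sketch of the count-based solution, on assumed toy 1-of-K data:

        import numpy as np

        # Toy 1-of-K data: N = 5 samples over K = 3 states.
        D = np.array([[1, 0, 0],
                      [0, 1, 0],
                      [1, 0, 0],
                      [0, 0, 1],
                      [1, 0, 0]])
        m = D.sum(axis=0)      # state counts m_k
        theta_ml = m / len(D)  # closed form: theta_k = m_k / N
        print(theta_ml)        # [0.6 0.2 0.2]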

  11. MLE: Gaussian with unknown $\mu$
      $\ln P(x^{(n)}|\mu) = -\frac{1}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\left(x^{(n)} - \mu\right)^2$
      $\frac{\partial}{\partial\mu} \ln P(\mathcal{D}|\mu) = 0 \;\Rightarrow\; \sum_{n=1}^{N} \frac{\partial}{\partial\mu} \ln p(x^{(n)}|\mu) = 0$
      $\Rightarrow\; \frac{1}{\sigma^2} \sum_{n=1}^{N} \left(x^{(n)} - \mu\right) = 0 \;\Rightarrow\; \mu_{ML} = \frac{1}{N} \sum_{n=1}^{N} x^{(n)}$
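
     The same result checked in code, assuming synthetic data with a known generating mean ($\sigma$ treated as known):

        import numpy as np

        rng = np.random.default_rng(0)
        x = rng.normal(loc=2.0, scale=1.0, size=1000)  # synthetic data, true mean 2.0
        mu_ml = x.mean()  # closed form: the sample mean
        print(mu_ml)      # close to 2.0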

  12. Bayesian approach
      - Treats the parameters $\boldsymbol{\theta}$ as random variables with an a priori distribution
        - Utilizes the available prior information about the unknown parameters
      - As opposed to ML estimation, it does not seek a specific point estimate of the unknown parameter vector $\boldsymbol{\theta}$
      - The samples $\mathcal{D}$ convert the prior density $P(\boldsymbol{\theta})$ into a posterior density $P(\boldsymbol{\theta}|\mathcal{D})$
      - It keeps track of the beliefs about $\boldsymbol{\theta}$'s values and uses these beliefs for reaching conclusions

  13. Maximum A Posteriori (MAP) estimation
      - MAP estimation: $\boldsymbol{\theta}_{MAP} = \arg\max_{\boldsymbol{\theta}} p(\boldsymbol{\theta}|\mathcal{D})$
      - Since $p(\boldsymbol{\theta}|\mathcal{D}) \propto p(\mathcal{D}|\boldsymbol{\theta}) \, p(\boldsymbol{\theta})$:
        $\boldsymbol{\theta}_{MAP} = \arg\max_{\boldsymbol{\theta}} p(\mathcal{D}|\boldsymbol{\theta}) \, p(\boldsymbol{\theta})$
      - Example of a prior distribution: $p(\theta) = \mathcal{N}(\theta_0, \sigma^2)$
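
     A minimal numeric sketch (the Bernoulli likelihood, the data, and the prior values $\theta_0 = 0.5$, $\sigma = 0.1$ are all assumptions for illustration): with the slide's Gaussian prior, the grid argmax of $\ln p(\mathcal{D}|\theta) + \ln p(\theta)$ lands between the MLE and the prior mean.

        import numpy as np

        data = np.array([1, 1, 1, 0, 1, 1, 1, 1, 1, 1])  # m = 9 heads out of N = 10
        theta0, sigma = 0.5, 0.1  # assumed Gaussian prior hyperparameters

        def log_posterior(theta):
            log_lik = np.sum(data * np.log(theta) + (1 - data) * np.log(1 - theta))
            log_prior = -0.5 * ((theta - theta0) / sigma) ** 2  # Gaussian prior, up to a constant
            return log_lik + log_prior

        grid = np.linspace(0.01, 0.99, 981)
        theta_map = grid[np.argmax([log_posterior(t) for t in grid])]
        print(theta_map)  # ~0.62: pulled from the MLE 0.9 toward the prior mean 0.5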

  14. Bayesian approach: Predictive distribution
      - Given a set of samples $\mathcal{D} = \{\boldsymbol{x}^{(n)}\}_{n=1}^{N}$, a prior distribution $P(\boldsymbol{\theta})$ on the parameters, and the form of the distribution $P(\boldsymbol{x}|\boldsymbol{\theta})$
      - We find $P(\boldsymbol{\theta}|\mathcal{D})$ and use it to specify $P(\boldsymbol{x}) = P(\boldsymbol{x}|\mathcal{D})$ on new data as an estimate of $P(\boldsymbol{x})$, the predictive distribution:
        $P(\boldsymbol{x}|\mathcal{D}) = \int P(\boldsymbol{x}, \boldsymbol{\theta}|\mathcal{D}) \, d\boldsymbol{\theta} = \int P(\boldsymbol{x}|\mathcal{D}, \boldsymbol{\theta}) \, P(\boldsymbol{\theta}|\mathcal{D}) \, d\boldsymbol{\theta} = \int P(\boldsymbol{x}|\boldsymbol{\theta}) \, P(\boldsymbol{\theta}|\mathcal{D}) \, d\boldsymbol{\theta}$
      - Analytical solutions exist only for very special forms of the involved functions
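
     When the integral has no closed form, it can be approximated by Monte Carlo averaging over posterior samples. A minimal sketch for the Beta-Bernoulli case that is worked out analytically on the following slides (the hyperparameter and count values are assumptions):

        import numpy as np

        rng = np.random.default_rng(0)
        a1, a0, m, N = 2, 2, 3, 3  # assumed Beta prior and data counts (3 heads in 3 tosses)

        # Draw samples from the posterior Beta(a1 + m, a0 + N - m) ...
        thetas = rng.beta(a1 + m, a0 + N - m, size=100_000)
        # ... and average P(x=1|theta) = theta over them.
        p_heads = np.mean(thetas)
        print(p_heads)  # ~ (a1 + m) / (a0 + a1 + N) = 5/7 ~= 0.714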

  15. Conjugate priors
      - We consider a form of prior distribution that has a simple interpretation as well as some useful analytical properties
      - Conjugacy means choosing a prior such that the posterior, which is proportional to $p(\mathcal{D}|\boldsymbol{\theta}) \, p(\boldsymbol{\theta})$, has the same functional form as the prior:
        $\forall \boldsymbol{\alpha}, \mathcal{D} \;\, \exists \boldsymbol{\alpha}': \; P(\boldsymbol{\theta}|\boldsymbol{\alpha}') \propto P(\mathcal{D}|\boldsymbol{\theta}) \, P(\boldsymbol{\theta}|\boldsymbol{\alpha})$, with both sides having the same functional form

  16. Prior for the Bernoulli likelihood
      - Beta distribution over $\theta \in [0,1]$:
        $\mathrm{Beta}(\theta|\alpha_1, \alpha_0) = \frac{\Gamma(\alpha_0 + \alpha_1)}{\Gamma(\alpha_0)\,\Gamma(\alpha_1)} \, \theta^{\alpha_1 - 1} (1-\theta)^{\alpha_0 - 1} \propto \theta^{\alpha_1 - 1} (1-\theta)^{\alpha_0 - 1}$
      - Mean: $E[\theta] = \frac{\alpha_1}{\alpha_0 + \alpha_1}$; most probable $\theta$ (mode): $\frac{\alpha_1 - 1}{(\alpha_0 - 1) + (\alpha_1 - 1)}$
      - The Beta distribution is the conjugate prior of the Bernoulli: $P(x|\theta) = \theta^x (1-\theta)^{1-x}$
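
     A quick numeric check of the mean and mode formulas using scipy (the hyperparameter values are assumptions; scipy's beta takes $(a, b) = (\alpha_1, \alpha_0)$ in this parameterization):

        import numpy as np
        from scipy.stats import beta

        a1, a0 = 5, 2  # assumed hyperparameters
        dist = beta(a1, a0)  # pdf proportional to theta^(a1-1) * (1-theta)^(a0-1)

        print(dist.mean())                      # a1 / (a0 + a1) = 5/7
        grid = np.linspace(0.001, 0.999, 999)
        print(grid[np.argmax(dist.pdf(grid))])  # mode: (a1-1)/((a0-1)+(a1-1)) = 0.8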

  17. Beta distribution [plots of $\mathrm{Beta}(\theta|\alpha_1, \alpha_0)$ for several hyperparameter settings]

  18. Bernoulli likelihood: posterior
      - Given $\mathcal{D} = \{x^{(1)}, x^{(2)}, \dots, x^{(N)}\}$ with $m$ heads (1) and $N - m$ tails (0), where $m = \sum_{i=1}^{N} x^{(i)}$:
        $p(\theta|\mathcal{D}) \propto p(\mathcal{D}|\theta) \, p(\theta) = \left[\prod_{i=1}^{N} \theta^{x^{(i)}} (1-\theta)^{1-x^{(i)}}\right] \mathrm{Beta}(\theta|\alpha_1, \alpha_0) \propto \theta^{m + \alpha_1 - 1} (1-\theta)^{N - m + \alpha_0 - 1}$
      - $\Rightarrow\; p(\theta|\mathcal{D}) \propto \mathrm{Beta}(\theta|\alpha_1', \alpha_0')$ with $\alpha_1' = \alpha_1 + m$ and $\alpha_0' = \alpha_0 + N - m$
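
     In code, the conjugate update is just addition of counts; a minimal sketch (prior and data assumed):

        import numpy as np

        def beta_posterior(a1, a0, data):
            # Conjugate update: add the heads to a1 and the tails to a0.
            data = np.asarray(data)
            m = data.sum()                     # number of heads
            return a1 + m, a0 + len(data) - m  # (a1', a0')

        print(beta_posterior(2, 2, [1, 1, 1]))  # (5, 2), as in the example on the next slide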

  19. Example
      - Bernoulli likelihood: $p(x|\theta) = \theta^x (1-\theta)^{1-x}$, with $p(x=1|\theta) = \theta$
      - Prior: Beta with $\alpha_0 = \alpha_1 = 2$
      - Given $\mathcal{D} = \{x^{(1)}, x^{(2)}, \dots, x^{(N)}\}$ with $m$ heads (1) and $N - m$ tails (0); here $\mathcal{D} = \{1, 1, 1\} \Rightarrow N = 3, m = 3$
      - Posterior: Beta with $\alpha_1' = 5, \alpha_0' = 2$, so
        $\theta_{MAP} = \arg\max_{\theta} P(\theta|\mathcal{D}) = \frac{\alpha_1' - 1}{(\alpha_1' - 1) + (\alpha_0' - 1)} = \frac{4}{5}$

  20. Bernoulli: Predictive distribution
      - Training samples: $\mathcal{D} = \{x^{(1)}, \dots, x^{(N)}\}$ with $m$ heads
        $P(\theta) = \mathrm{Beta}(\theta|\alpha_1, \alpha_0) \propto \theta^{\alpha_1 - 1} (1-\theta)^{\alpha_0 - 1}$
        $P(\theta|\mathcal{D}) = \mathrm{Beta}(\theta|\alpha_1 + m, \alpha_0 + N - m) \propto \theta^{\alpha_1 + m - 1} (1-\theta)^{\alpha_0 + N - m - 1}$
        $P(x|\mathcal{D}) = \int P(x|\theta) \, P(\theta|\mathcal{D}) \, d\theta = E_{P(\theta|\mathcal{D})}[P(x|\theta)]$
        $\Rightarrow\; P(x=1|\mathcal{D}) = E_{P(\theta|\mathcal{D})}[\theta] = \frac{\alpha_1 + m}{\alpha_0 + \alpha_1 + N}$
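
     The closed-form predictive as a one-liner (the same assumed numbers as in the Monte Carlo sketch after slide 14):

        def predictive_heads(a1, a0, m, N):
            # Closed form: P(x=1|D) = (a1 + m) / (a0 + a1 + N)
            return (a1 + m) / (a0 + a1 + N)

        print(predictive_heads(2, 2, 3, 3))  # 5/7 ~= 0.714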

  21. Dirichlet distribution
      - Input space: $\boldsymbol{\theta} = (\theta_1, \dots, \theta_K)^T$ with $\theta_k \in [0,1]$ and $\sum_{k=1}^{K} \theta_k = 1$
        $P(\boldsymbol{\theta}|\boldsymbol{\alpha}) \propto \prod_{k=1}^{K} \theta_k^{\alpha_k - 1} = \frac{\Gamma(\hat{\alpha})}{\Gamma(\alpha_1) \cdots \Gamma(\alpha_K)} \prod_{k=1}^{K} \theta_k^{\alpha_k - 1}$, where $\hat{\alpha} = \sum_{k=1}^{K} \alpha_k$
      - Mean: $E[\theta_k] = \frac{\alpha_k}{\hat{\alpha}}$; mode: $\theta_k = \frac{\alpha_k - 1}{\hat{\alpha} - K}$
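
     A numeric check of the mean formula via numpy's Dirichlet sampler (the $\boldsymbol{\alpha}$ values are assumptions):

        import numpy as np

        rng = np.random.default_rng(0)
        alpha = np.array([2.0, 2.0, 20.0])  # assumed hyperparameters, K = 3

        samples = rng.dirichlet(alpha, size=100_000)  # each row lies on the simplex
        print(samples.mean(axis=0))   # ~= alpha / alpha.sum() = [1/12, 1/12, 10/12]
        print(alpha / alpha.sum())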

  22. Dirichlet distribution: Examples
      - [density plots over the simplex for $\boldsymbol{\alpha} = [10,10,10]$, $\boldsymbol{\alpha} = [0.1,0.1,0.1]$, and $\boldsymbol{\alpha} = [1,1,1]$]
      - The Dirichlet parameters determine both the prior beliefs and their strength: larger values of $\alpha$ correspond to more confidence in the prior belief (i.e., more imaginary samples)

  23. Dirichlet distribution: Example [density plots over the simplex for $\boldsymbol{\alpha} = [2,2,2]$ and $\boldsymbol{\alpha} = [20,2,2]$]

  24. Multinomial distribution: Prior
      - The Dirichlet distribution is the conjugate prior of the Multinomial:
        $P(\boldsymbol{\theta}|\mathcal{D}, \boldsymbol{\alpha}) \propto P(\mathcal{D}|\boldsymbol{\theta}) \, P(\boldsymbol{\theta}|\boldsymbol{\alpha}) \propto \prod_{k=1}^{K} \theta_k^{m_k + \alpha_k - 1}$
        where $\boldsymbol{m} = (m_1, \dots, m_K)^T$ are the sufficient statistics of the data
      - $P(\boldsymbol{\theta}|\mathcal{D}, \boldsymbol{\alpha}) = \mathrm{Dir}(\boldsymbol{\theta}|\boldsymbol{\alpha} + \boldsymbol{m})$: the prior $\boldsymbol{\theta} \sim \mathrm{Dir}(\alpha_1, \dots, \alpha_K)$ becomes the posterior $\boldsymbol{\theta} \sim \mathrm{Dir}(\alpha_1 + m_1, \dots, \alpha_K + m_K)$
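
     As with the Beta-Bernoulli case, the update adds the counts to the hyperparameters; a minimal sketch reusing the assumed toy 1-of-K data from slide 10:

        import numpy as np

        alpha = np.array([1.0, 1.0, 1.0])  # assumed Dirichlet prior
        D = np.array([[1, 0, 0],
                      [0, 1, 0],
                      [1, 0, 0],
                      [0, 0, 1],
                      [1, 0, 0]])
        m = D.sum(axis=0)       # sufficient statistics m_k
        alpha_post = alpha + m  # posterior: Dir(theta | alpha + m)
        print(alpha_post)                     # [4. 2. 2.]
        print(alpha_post / alpha_post.sum())  # posterior-mean estimate of theta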
