Bayesian networks: basic parameter learning Machine Intelligence - PowerPoint PPT Presentation

Bayesian networks: basic parameter learning Machine Intelligence Thomas D. Nielsen September 2008 Basic parameter learning September 2008 1 / 24

Estimation Example: Physical Measurements Mass of an atomic particle is measured in repeated experiments. Measurement results = true mass + random error. Basic parameter learning September 2008 2 / 24

Estimation Example: Physical Measurements Mass of an atomic particle is measured in repeated experiments. Measurement results = true mass + random error. true mass random error Basic parameter learning September 2008 2 / 24

Estimation Example: Physical Measurements Mass of an atomic particle is measured in repeated experiments. Measurement results = true mass + random error. Basic parameter learning September 2008 2 / 24

Estimation Example: Physical Measurements Mass of an atomic particle is measured in repeated experiments. Measurement results = true mass + random error. Estimate of true mass: mean value of normal distribution that best “fits” the data. Basic parameter learning September 2008 2 / 24

Estimation Example: Coin Tossing Is the Euro fair? Basic parameter learning September 2008 3 / 24

Estimation Example: Coin Tossing Is the Euro fair? Toss Euro 1000 times and count number of heads and tails: . . . Basic parameter learning September 2008 3 / 24

Estimation Example: Coin Tossing Is the Euro fair? Result: heads: 521, tails: 479. Probability of Euro falling heads (estimate): value that best “fits” the data: 521/1000. Basic parameter learning September 2008 3 / 24

Estimation: Classical Structure of Estimation Problem Given: Data produced by some random process that is characterized by one or several numerical parameters. Wanted: Infer value of (some) parameters. (Classical) Method: Obtain estimate for parameter by a function that maps possible data sets into the parameter space. Basic parameter learning September 2008 4 / 24

Estimation: Classical Parametric Family Let W be a set, Θ ⊆ R k for some k ≥ 1. For every θ ∈ Θ let P θ be a probability distribution on W . Then { P θ | θ ∈ Θ } is called a parametric family (of distributions). Example 1 : W = { h , t } , Θ = [ 0 , 1 ] . P θ : distribution with P ( h ) = θ (and P ( t ) = 1 − θ ). Basic parameter learning September 2008 5 / 24

Estimation: Classical Parametric Family Let W be a set, Θ ⊆ R k for some k ≥ 1. For every θ ∈ Θ let P θ be a probability distribution on W . Then { P θ | θ ∈ Θ } is called a parametric family (of distributions). Example 1 : W = { h , t } , Θ = [ 0 , 1 ] . P θ : distribution with P ( h ) = θ (and P ( t ) = 1 − θ ). Example 2 : W = { w 1 , . . . , w k } , Θ = { θ = ( p 1 , . . . , p k ) ∈ [ 0 , 1 ] k | X p i = 1 } . P θ : distribution with P ( w i ) = p i . Basic parameter learning September 2008 5 / 24

Estimation: Classical Parametric Family Let W be a set, Θ ⊆ R k for some k ≥ 1. For every θ ∈ Θ let P θ be a probability distribution on W . Then { P θ | θ ∈ Θ } is called a parametric family (of distributions). Example 1 : W = { h , t } , Θ = [ 0 , 1 ] . P θ : distribution with P ( h ) = θ (and P ( t ) = 1 − θ ). Example 2 : W = { w 1 , . . . , w k } , Θ = { θ = ( p 1 , . . . , p k ) ∈ [ 0 , 1 ] k | X p i = 1 } . P θ : distribution with P ( w i ) = p i . Example 3 : W = R , Θ = R × R + . For θ = ( µ, σ ) ∈ Θ : P θ normal distribution with mean µ and standard deviation σ . Basic parameter learning September 2008 5 / 24

Estimation: Classical Sample A family X 1 , . . . , X N of random variables is called independent identically distributed (iid) if the family is independent, and P ( X i ) = P ( X j ) for all i , j . Basic parameter learning September 2008 6 / 24

Estimation: Classical Sample A family X 1 , . . . , X N of random variables is called independent identically distributed (iid) if the family is independent, and P ( X i ) = P ( X j ) for all i , j . A sample s 1 , . . . , s N ∈ W of observations (or data items) is interpreted as the observed values of an iid family of random variables with distribution P ( X i ) = P θ . Basic parameter learning September 2008 6 / 24

Estimation: Classical Sample A family X 1 , . . . , X N of random variables is called independent identically distributed (iid) if the family is independent, and P ( X i ) = P ( X j ) for all i , j . A sample s 1 , . . . , s N ∈ W of observations (or data items) is interpreted as the observed values of an iid family of random variables with distribution P ( X i ) = P θ . Likelihood Function Given a parametric family { P θ | θ ∈ Θ } of distributions on W , and a sample s = ( s 1 , . . . , s N ) ∈ W N . The function N N θ �→ P θ ( s ) := P θ ( s i ) , resp. θ �→ log P θ ( s ) = log P θ ( s i ) , Y X i = 1 i = 1 is called the likelihood function (resp. log-likelihood function) for θ given s . Basic parameter learning September 2008 6 / 24

Estimation: Classical Maximum Likelihood Estimator Given: parametric family and sample s . Every θ ∗ ∈ Θ with θ ∗ = arg max θ ∈ Θ P θ ( s ) is called a maximum likelihood estimate for θ (given s ). Basic parameter learning September 2008 7 / 24

Estimation: Classical Maximum Likelihood Estimator Given: parametric family and sample s . Every θ ∗ ∈ Θ with θ ∗ = arg max θ ∈ Θ P θ ( s ) is called a maximum likelihood estimate for θ (given s ). Since the logarithm is a strictly monotone function, maximum likelihood estimatates are also obtained by maximizing the log-likelihood: θ ∗ = arg max θ ∈ Θ log P θ ( s ) Basic parameter learning September 2008 7 / 24

Estimation: classical Thumbtack example We have tossed a thumb tack 100 times. It has landed pin up 80 times, and we now look for the model that best fits the observations/data: T T T Structure Probability, P ( pinup ) = 0 . 1 0 . 2 0 . 3 M 0 . 1 M 0 . 2 M 0 . 3 Model Basic parameter learning September 2008 8 / 24

Estimation: classical Thumbtack example We have tossed a thumb tack 100 times. It has landed pin up 80 times, and we now look for the model that best fits the observations/data: T T T Structure Probability, P ( pinup ) = 0 . 1 0 . 2 0 . 3 M 0 . 2 M 0 . 3 M 0 . 1 Model We can measure how well a model fits the data using: P ( D| M θ ) = P ( pinup , pinup , pindown , . . . , pinup | M θ ) = P ( pinup | M θ ) P ( pinup | M θ ) P ( pindown | M θ ) · . . . · P ( pinup | M θ ) This is also called the likelihood of M θ given D . Basic parameter learning September 2008 8 / 24

Estimation: classical Thumbtack example We have tossed a thumb tack 100 times. It has landed pin up 80 times, and we now look for the model that best fits the observations/data: T T T Structure Probability, P ( pinup ) = 0 . 1 0 . 2 0 . 3 M 0 . 1 M 0 . 2 M 0 . 3 Model We select the parameter ˆ θ that maximizes: ˆ P ( D| M θ ) θ = arg max θ 100 Y P ( d i | M θ ) = arg max θ i = 1 µ · θ 80 ( 1 − θ ) 20 . = arg max θ Basic parameter learning September 2008 8 / 24

Estimation: classical Thumbtack example We have tossed a thumb tack 100 times. It has landed pin up 80 times, and we now look for the model that best fits the observations/data: T T T Structure Probability, P ( pinup ) = 0 . 1 0 . 2 0 . 3 M 0 . 1 M 0 . 2 M 0 . 3 Model By setting: d d θ µ · θ 80 ( 1 − θ ) 20 = 0 we get the maximum likelihood estimate: ˆ θ = 0 . 8 . Basic parameter learning September 2008 8 / 24

Estimation: Classical Maximum Likelihood Estimates for Multinomial Distribution Consider the family of multinomial distribution defined as W = { w 1 , . . . , w k } , Θ = { θ = ( p 1 , . . . , p k ) ∈ [ 0 , 1 ] k | p i = 1 } . X P θ : distribution with P ( w i ) = p i . For { P θ | θ ∈ Θ } and s ∈ W N : there exists exactly one maximum likelihood estimate θ ∗ = ( p ∗ 1 , . . . , p ∗ k ) given by i = 1 p ∗ N |{ j ∈ { 1 , . . . , N } | s j = w i }| [i.e. θ ∗ is just the empirical distribution defined by the data on W ] Basic parameter learning September 2008 9 / 24

Estimation: Classical Proof: (for W = { w 1 , w 2 } ): p ∗ 1 / N |{ j ∈ { 1 , . . . , N } | s j = w 1 }| = 1 p ∗ 1 / N |{ j ∈ { 1 , . . . , N } | s j = w 2 }| (= 1 − p ∗ = 1 ) 2 N log P θ ( s ) = X log P θ ( s j ) = N · ( p ∗ 1 log ( p 1 ) + p ∗ 2 log ( p 2 )) j = 1 = N · ( p ∗ 1 log ( p 1 ) + ( 1 − p ∗ 1 ) log ( 1 − p 1 )) Differentiated w.r.t. p 1 : N · ( p ∗ 1 / p 1 − ( 1 − p ∗ 1 ) / ( 1 − p 1 )) Only root: p 1 = p ∗ 1 . Basic parameter learning September 2008 10 / 24

Estimation: Classical Consistency Let W = { w 1 , . . . , w k } , and the data s 1 , s 2 , . . . , s N be generated by the distribution P θ with parameters θ = ( p 1 , . . . , p k ) . Then for all ǫ > 0 and i = 1 , . . . , k : N →∞ P θ ( | p ∗ i − p i | ≥ ǫ ) = 0 lim Note: p ∗ is a function of s . The probability P θ ( | p ∗ i − p i | ≥ ǫ ) is the probability that by sampling from P θ a sample s will be obtained, so that for the p ∗ computed from s the inequality | p ∗ i − p i | ≥ ǫ holds. Similar consistency properties hold for many other types of maximum likelihood estimates. Basic parameter learning September 2008 11 / 24

Estimation: Classical Chebyshev’s Inequality A quantitative bound: 1 P θ ( | p ∗ i − p i | ≥ ǫ ) ≤ ǫ 2 N p i ( 1 − p i ) Basic parameter learning September 2008 12 / 24

Bayesian networks: basic parameter learning Machine Intelligence - PowerPoint PPT Presentation

Bayesian networks: basic parameter learning Machine Intelligence Thomas D. Nielsen September 2008 Basic parameter learning September 2008 1 / 24 Estimation Example: Physical Measurements Mass of an atomic particle is measured in repeated

I 4 - Bayesian parameter estimation in a normal model STAT 587 (Engineering) Iowa State

CS 331: Bayesian Networks 2 1 Bayesian Networks Youve heard about how Bayesian networks

Outline Intro to RL and Bayesian Learning History of Bayesian RL Model-based Bayesian

Bayesian Networks Youve heard about how Bayesian networks have revolutionized AI

Lecture 6. Bayesian estimation Lecture 6. Bayesian estimation 1 (172) 6. Bayesian estimation

Bayesian Learning 1 Outline MLE, MAP vs. Bayesian Learning Bayesian Linear Regression

6. Parameter Passing Parameter Passing CS 381 Spring 2016 Example (Formal) Parameter void

10/16/19 Parameter Control Genetic Algorithms Motivation Parameter setting Tuning

AND MACHINE LEARNING CHAPTER 8: GRAPHICAL MODELS Bayesian Networks Directed Acyclic Graph (DAG)

Maximum-likelihood and Bayesian parameter estimation Andrea Passerini passerini@disi.unitn.it

Being Bayesian About Being Bayesian About Net work St ruct ure Net work St ruct ure A Bayesian

Bayesian networks (2) Lirong Xia Last class Bayesian networks compact, graphical

CS440/ECE448 Lecture 15: Bayesian Inference and Bayesian Learning Slides by Svetlana Lazebnik,

Bayesian Methods for Parameter Estimation Bayesian vs Frequentist Inference Frequentist Chris

Bayesian Methods for Neural Networks Readings: Bishop, Neural Networks for Pattern Recognition .

Chapter14 Probabilistic Reasoning (Bayesian Networks) Sec. 1 - 2 20070607 Chap14 1

Antimatter from Supernova Remnants Michael Kachelrie NTNU, Trondheim with S. Ostapchenko, R.

Simple Permutations R.L.F. Brignall joint work with Sophie Huczynska, Nik Rukuc and Vincent

Barriers to Preventing False Discovery in Interactive Data Analysis Jonathan Ullman (Northeastern

Forbidden Conjectures David Sumner, Professor Emeritus University of South Carolina Graph

Polynomial Julia sets with positive measure Why bother? Quasiconformal NILF Measure 0? Measure

The variety of nuclear implicative semilattices is locally finite Guram Bezhanishvili, Nick

Side conditions and revisionism David Asper o University of East Anglia 4th Arctic Set Theory

Cop and Robber Game and Hyperbolicity J. Chalopin 1 V. Chepoi 1 . Papasoglu 2 T. Pecatte 3 P 1 LIF