Bayesian networks: basic parameter learning Machine Intelligence - PowerPoint PPT Presentation

  1. Bayesian networks: basic parameter learning Machine Intelligence Thomas D. Nielsen September 2008 Basic parameter learning September 2008 1 / 24

  2. Estimation Example: Physical Measurements Mass of an atomic particle is measured in repeated experiments. Measurement results = true mass + random error. Basic parameter learning September 2008 2 / 24

  3. Estimation Example: Physical Measurements Mass of an atomic particle is measured in repeated experiments. Measurement results = true mass + random error. [Figure: true mass and random error] Basic parameter learning September 2008 2 / 24

  4. Estimation Example: Physical Measurements Mass of an atomic particle is measured in repeated experiments. Measurement results = true mass + random error. Basic parameter learning September 2008 2 / 24

  5. Estimation Example: Physical Measurements Mass of an atomic particle is measured in repeated experiments. Measurement results = true mass + random error. Estimate of true mass: mean value of normal distribution that best “fits” the data. Basic parameter learning September 2008 2 / 24

  6. Estimation Example: Coin Tossing Is the Euro fair? Basic parameter learning September 2008 3 / 24

  7. Estimation Example: Coin Tossing Is the Euro fair? Toss Euro 1000 times and count number of heads and tails: . . . Basic parameter learning September 2008 3 / 24

  8. Estimation Example: Coin Tossing Is the Euro fair? Result: heads: 521, tails: 479. Probability of Euro falling heads (estimate): value that best “fits” the data: 521/1000. Basic parameter learning September 2008 3 / 24

  9. Estimation: Classical Structure of Estimation Problem Given: Data produced by some random process that is characterized by one or several numerical parameters. Wanted: Infer value of (some) parameters. (Classical) Method: Obtain estimate for parameter by a function that maps possible data sets into the parameter space. Basic parameter learning September 2008 4 / 24

  10. Estimation: Classical Parametric Family Let W be a set, Θ ⊆ R^k for some k ≥ 1. For every θ ∈ Θ let P_θ be a probability distribution on W. Then { P_θ | θ ∈ Θ } is called a parametric family (of distributions). Example 1: W = { h, t }, Θ = [0, 1]. P_θ: distribution with P(h) = θ (and P(t) = 1 − θ). Basic parameter learning September 2008 5 / 24

  11. Estimation: Classical Parametric Family Let W be a set, Θ ⊆ R^k for some k ≥ 1. For every θ ∈ Θ let P_θ be a probability distribution on W. Then { P_θ | θ ∈ Θ } is called a parametric family (of distributions). Example 1: W = { h, t }, Θ = [0, 1]. P_θ: distribution with P(h) = θ (and P(t) = 1 − θ). Example 2: W = { w_1, . . . , w_k }, Θ = { θ = (p_1, . . . , p_k) ∈ [0, 1]^k | ∑_i p_i = 1 }. P_θ: distribution with P(w_i) = p_i. Basic parameter learning September 2008 5 / 24

  12. Estimation: Classical Parametric Family Let W be a set, Θ ⊆ R^k for some k ≥ 1. For every θ ∈ Θ let P_θ be a probability distribution on W. Then { P_θ | θ ∈ Θ } is called a parametric family (of distributions). Example 1: W = { h, t }, Θ = [0, 1]. P_θ: distribution with P(h) = θ (and P(t) = 1 − θ). Example 2: W = { w_1, . . . , w_k }, Θ = { θ = (p_1, . . . , p_k) ∈ [0, 1]^k | ∑_i p_i = 1 }. P_θ: distribution with P(w_i) = p_i. Example 3: W = R, Θ = R × R_+. For θ = (µ, σ) ∈ Θ: P_θ is the normal distribution with mean µ and standard deviation σ. Basic parameter learning September 2008 5 / 24

  13. Estimation: Classical Sample A family X 1 , . . . , X N of random variables is called independent identically distributed (iid) if the family is independent, and P ( X i ) = P ( X j ) for all i , j . Basic parameter learning September 2008 6 / 24

  14. Estimation: Classical Sample A family X 1 , . . . , X N of random variables is called independent identically distributed (iid) if the family is independent, and P ( X i ) = P ( X j ) for all i , j . A sample s 1 , . . . , s N ∈ W of observations (or data items) is interpreted as the observed values of an iid family of random variables with distribution P ( X i ) = P θ . Basic parameter learning September 2008 6 / 24

  15. Estimation: Classical Sample A family X_1, . . . , X_N of random variables is called independent identically distributed (iid) if the family is independent, and P(X_i) = P(X_j) for all i, j. A sample s_1, . . . , s_N ∈ W of observations (or data items) is interpreted as the observed values of an iid family of random variables with distribution P(X_i) = P_θ. Likelihood Function Given a parametric family { P_θ | θ ∈ Θ } of distributions on W, and a sample s = (s_1, . . . , s_N) ∈ W^N. The function θ ↦ P_θ(s) := ∏_{i=1}^{N} P_θ(s_i), resp. θ ↦ log P_θ(s) = ∑_{i=1}^{N} log P_θ(s_i), is called the likelihood function (resp. log-likelihood function) for θ given s. Basic parameter learning September 2008 6 / 24
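
To make the definition concrete, the following is a minimal Python sketch of the log-likelihood for the Bernoulli family of Example 1 on slide 10, evaluated on the Euro data from slide 8; the function name log_likelihood and the encoding of outcomes as the strings "h" and "t" are illustrative choices, not part of the slides.

```python
import math

def log_likelihood(theta, sample):
    """log P_theta(s) = sum_i log P_theta(s_i) for an iid Bernoulli sample,
    where P_theta(h) = theta and P_theta(t) = 1 - theta (Example 1)."""
    if not 0.0 < theta < 1.0:
        # For a sample containing both outcomes, boundary parameters
        # have likelihood zero, i.e. log-likelihood minus infinity.
        return float("-inf")
    n_heads = sum(1 for s in sample if s == "h")
    n_tails = len(sample) - n_heads
    return n_heads * math.log(theta) + n_tails * math.log(1.0 - theta)

# Euro data from slide 8: 521 heads, 479 tails.
sample = ["h"] * 521 + ["t"] * 479
print(log_likelihood(0.5, sample))    # fair coin
print(log_likelihood(0.521, sample))  # empirical frequency: larger value
```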

  16. Estimation: Classical Maximum Likelihood Estimator Given: parametric family and sample s . Every θ ∗ ∈ Θ with θ ∗ = arg max θ ∈ Θ P θ ( s ) is called a maximum likelihood estimate for θ (given s ). Basic parameter learning September 2008 7 / 24

  17. Estimation: Classical Maximum Likelihood Estimator Given: parametric family and sample s. Every θ* ∈ Θ with θ* = arg max_{θ ∈ Θ} P_θ(s) is called a maximum likelihood estimate for θ (given s). Since the logarithm is a strictly monotone function, maximum likelihood estimates are also obtained by maximizing the log-likelihood: θ* = arg max_{θ ∈ Θ} log P_θ(s) Basic parameter learning September 2008 7 / 24
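
Because the logarithm is strictly increasing, a brute-force search over candidate parameters returns the same maximizer whether it scores by likelihood or log-likelihood. The sketch below (assuming the Euro data again, with an arbitrary grid step of 0.001) also hints at why the log form is preferred in practice: the raw likelihood underflows to 0.0 for parameters far from the data.

```python
import math

heads, tails = 521, 479  # Euro data from slide 8
grid = [i / 1000 for i in range(1, 1000)]  # candidate theta values

def likelihood(theta):
    # P_theta(s) = theta^heads * (1 - theta)^tails; underflows to 0.0
    # for theta far from the empirical frequency.
    return theta ** heads * (1.0 - theta) ** tails

def log_likelihood(theta):
    return heads * math.log(theta) + tails * math.log(1.0 - theta)

print(max(grid, key=likelihood))      # 0.521
print(max(grid, key=log_likelihood))  # same argmax: 0.521
```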

  18. Estimation: classical Thumbtack example We have tossed a thumb tack 100 times. It has landed pin up 80 times, and we now look for the model that best fits the observations/data: [Figure: candidate models M_0.1, M_0.2, M_0.3, all with the same structure but with P(pinup) = 0.1, 0.2, 0.3, respectively] Basic parameter learning September 2008 8 / 24

  19. Estimation: classical Thumbtack example We have tossed a thumb tack 100 times. It has landed pin up 80 times, and we now look for the model that best fits the observations/data: [Figure: candidate models M_0.1, M_0.2, M_0.3, all with the same structure but with P(pinup) = 0.1, 0.2, 0.3, respectively] We can measure how well a model fits the data using: P(D | M_θ) = P(pinup, pinup, pindown, . . . , pinup | M_θ) = P(pinup | M_θ) · P(pinup | M_θ) · P(pindown | M_θ) · . . . · P(pinup | M_θ) This is also called the likelihood of M_θ given D. Basic parameter learning September 2008 8 / 24
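
As a quick illustration of this model comparison, here is a hedged Python sketch that evaluates P(D | M_θ) for the three candidate models shown in the figure; including θ = 0.8 as a fourth candidate is my own addition for contrast.

```python
# Thumbtack data: 80 pin up, 20 pin down out of 100 tosses.
n_pinup, n_pindown = 80, 20

def likelihood(theta):
    # P(D | M_theta) = theta^80 * (1 - theta)^20, since the tosses are iid.
    return theta ** n_pinup * (1.0 - theta) ** n_pindown

for theta in (0.1, 0.2, 0.3, 0.8):
    print(f"M_{theta}: P(D | M_theta) = {likelihood(theta):.3e}")
# Among the slide's candidates, M_0.3 fits the data best;
# theta = 0.8 (the maximum likelihood estimate) fits better still.
```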

  20. Estimation: classical Thumbtack example We have tossed a thumb tack 100 times. It has landed pin up 80 times, and we now look for the model that best fits the observations/data: [Figure: candidate models M_0.1, M_0.2, M_0.3, all with the same structure but with P(pinup) = 0.1, 0.2, 0.3, respectively] We select the parameter θ̂ that maximizes: θ̂ = arg max_θ P(D | M_θ) = arg max_θ ∏_{i=1}^{100} P(d_i | M_θ) = arg max_θ µ · θ^80 (1 − θ)^20. Basic parameter learning September 2008 8 / 24

  21. Estimation: classical Thumbtack example We have tossed a thumb tack 100 times. It has landed pin up 80 times, and we now look for the model that best fits the observations/data: [Figure: candidate models M_0.1, M_0.2, M_0.3, all with the same structure but with P(pinup) = 0.1, 0.2, 0.3, respectively] By setting d/dθ [ µ · θ^80 (1 − θ)^20 ] = 0 we get the maximum likelihood estimate: θ̂ = 0.8. Basic parameter learning September 2008 8 / 24
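
Spelling out the differentiation step (a short derivation consistent with the slide, where µ is the constant factor that does not depend on θ):

```latex
\frac{d}{d\theta}\Bigl[\mu\,\theta^{80}(1-\theta)^{20}\Bigr]
  = \mu\,\theta^{79}(1-\theta)^{19}\bigl(80(1-\theta) - 20\,\theta\bigr) = 0
\quad\Longrightarrow\quad 80(1-\theta) = 20\,\theta
\quad\Longrightarrow\quad \hat{\theta} = \tfrac{80}{100} = 0.8 .
```

The boundary roots θ = 0 and θ = 1 give likelihood zero, so the interior root θ̂ = 0.8 is the maximizer.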

  22. Estimation: Classical Maximum Likelihood Estimates for Multinomial Distribution Consider the family of multinomial distributions defined as W = { w_1, . . . , w_k }, Θ = { θ = (p_1, . . . , p_k) ∈ [0, 1]^k | ∑_i p_i = 1 }. P_θ: distribution with P(w_i) = p_i. For { P_θ | θ ∈ Θ } and s ∈ W^N there exists exactly one maximum likelihood estimate θ* = (p*_1, . . . , p*_k), given by p*_i = (1/N) · |{ j ∈ { 1, . . . , N } | s_j = w_i }| [i.e. θ* is just the empirical distribution defined by the data on W] Basic parameter learning September 2008 9 / 24
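
A minimal Python sketch of this result (the function name mle_multinomial is illustrative): the maximum likelihood estimate is simply the vector of empirical frequencies.

```python
from collections import Counter

def mle_multinomial(sample, outcomes):
    """Maximum likelihood estimate for the multinomial family:
    p*_i = |{ j : s_j = w_i }| / N, i.e. the empirical distribution."""
    counts = Counter(sample)
    n = len(sample)
    return {w: counts[w] / n for w in outcomes}

# Thumbtack data from slide 18: 80 pin up, 20 pin down.
sample = ["pinup"] * 80 + ["pindown"] * 20
print(mle_multinomial(sample, ["pinup", "pindown"]))
# {'pinup': 0.8, 'pindown': 0.2}
```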

  23. Estimation: Classical Proof (for W = { w_1, w_2 }): p*_1 = (1/N) · |{ j ∈ { 1, . . . , N } | s_j = w_1 }|, p*_2 = (1/N) · |{ j ∈ { 1, . . . , N } | s_j = w_2 }| (= 1 − p*_1). log P_θ(s) = ∑_{j=1}^{N} log P_θ(s_j) = N · ( p*_1 log(p_1) + p*_2 log(p_2) ) = N · ( p*_1 log(p_1) + (1 − p*_1) log(1 − p_1) ). Differentiated w.r.t. p_1: N · ( p*_1/p_1 − (1 − p*_1)/(1 − p_1) ). Only root: p_1 = p*_1. Basic parameter learning September 2008 10 / 24

  24. Estimation: Classical Consistency Let W = { w_1, . . . , w_k }, and let the data s_1, s_2, . . . , s_N be generated by the distribution P_θ with parameters θ = (p_1, . . . , p_k). Then for all ε > 0 and i = 1, . . . , k: lim_{N→∞} P_θ( |p*_i − p_i| ≥ ε ) = 0 Note: p* is a function of s. The probability P_θ( |p*_i − p_i| ≥ ε ) is the probability that by sampling from P_θ a sample s will be obtained so that, for the p* computed from s, the inequality |p*_i − p_i| ≥ ε holds. Similar consistency properties hold for many other types of maximum likelihood estimates. Basic parameter learning September 2008 11 / 24
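
The statement can be seen empirically with a small simulation sketch; the true distribution p = (0.2, 0.3, 0.5) and the sample sizes are arbitrary illustrative choices.

```python
import random

random.seed(0)
outcomes = ["w1", "w2", "w3"]
true_p = [0.2, 0.3, 0.5]  # illustrative true parameters

def empirical_estimate(n):
    # Draw an iid sample of size n from P_theta and return the empirical
    # frequencies p* (the maximum likelihood estimate from slide 22).
    sample = random.choices(outcomes, weights=true_p, k=n)
    return [sample.count(w) / n for w in outcomes]

for n in (10, 100, 1000, 100000):
    est = empirical_estimate(n)
    max_err = max(abs(e - p) for e, p in zip(est, true_p))
    print(f"N = {n:6d}: max |p*_i - p_i| = {max_err:.4f}")
# The maximum deviation shrinks towards 0 as N grows, so for any fixed
# epsilon > 0 the event |p*_i - p_i| >= epsilon becomes increasingly unlikely.
```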

  25. Estimation: Classical Chebyshev’s Inequality A quantitative bound: P_θ( |p*_i − p_i| ≥ ε ) ≤ p_i (1 − p_i) / (ε^2 N) Basic parameter learning September 2008 12 / 24
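
As a worked instance of the bound, consider (as an assumed scenario) estimating the heads probability of a fair Euro from N = 1000 tosses with tolerance ε = 0.05:

```latex
P_\theta\bigl(|p^*_i - p_i| \ge \epsilon\bigr)
  \;\le\; \frac{p_i(1 - p_i)}{\epsilon^2 N}
  = \frac{0.5 \cdot 0.5}{0.05^2 \cdot 1000}
  = \frac{0.25}{2.5}
  = 0.1 ,
```

so a deviation of 0.05 or more from the true probability occurs in at most 10% of such experiments.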
