  1. Bayesian Methods 1. Chris Williams, School of Informatics, University of Edinburgh. October 2015

  2. Overview
  ◮ Introduction to Bayesian Statistics: learning a Bernoulli probability
  ◮ Learning a discrete distribution
  ◮ Learning the mean of a Gaussian
  ◮ Exponential family
  ◮ Readings: Murphy § 3.3 (Beta), § 3.4 (Dirichlet), § 4.6.1 (Gaussian); Barber § 9.1.1, 9.1.3 (Beta), § 9.4.3 (no parents, Dirichlet), § 8.8.2 (Gaussian)

  3. Bayesian vs Frequentist Inference
  Frequentist:
  ◮ Assumes that there is an unknown but fixed parameter θ
  ◮ Estimates θ with some confidence
  ◮ Prediction by using the estimated parameter value
  Bayesian:
  ◮ Represents uncertainty about the unknown parameter
  ◮ Uses probability to quantify this uncertainty; unknown parameters are treated as random variables
  ◮ Prediction follows the rules of probability

  4. Frequentist method
  ◮ Model $p(x \mid \theta, M)$, data $D = \{x_1, \ldots, x_N\}$
    $\hat{\theta} = \arg\max_\theta p(D \mid \theta, M)$
  ◮ Prediction for $x_{N+1}$ is based on $p(x_{N+1} \mid \hat{\theta}, M)$

  5. Bayesian method
  ◮ Prior distribution $p(\theta \mid M)$
  ◮ Posterior distribution $p(\theta \mid D, M)$:
    $p(\theta \mid D, M) = \frac{p(D \mid \theta, M)\, p(\theta \mid M)}{p(D \mid M)}$
  ◮ Making predictions:
    $p(x_{N+1} \mid D, M) = \int p(x_{N+1}, \theta \mid D, M)\, d\theta = \int p(x_{N+1} \mid \theta, D, M)\, p(\theta \mid D, M)\, d\theta = \int p(x_{N+1} \mid \theta, M)\, p(\theta \mid D, M)\, d\theta$
    Interpretation: an average of the predictions $p(x_{N+1} \mid \theta, M)$ weighted by $p(\theta \mid D, M)$
  ◮ Marginal likelihood (important for model comparison):
    $p(D \mid M) = \int p(D \mid \theta, M)\, p(\theta \mid M)\, d\theta$
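
  A minimal numerical sketch of this recipe for a Bernoulli model, using a grid over θ; the data sequence and the flat prior are illustrative assumptions, not from the slides (the Beta-prior case is treated analytically later).

  ```python
  # Minimal numerical sketch of the Bayesian recipe on a grid (illustrative only).
  import numpy as np

  theta = np.linspace(1e-6, 1 - 1e-6, 2001)       # grid over the parameter
  prior = np.ones_like(theta)                     # flat prior p(theta | M)
  prior /= np.trapz(prior, theta)

  data = np.array([1, 1, 0, 0, 1])                # example sequence of heads(1)/tails(0)
  N1 = int(data.sum())
  N0 = len(data) - N1
  likelihood = theta**N1 * (1 - theta)**N0        # p(D | theta, M)

  marginal = np.trapz(likelihood * prior, theta)  # p(D | M), used for model comparison
  posterior = likelihood * prior / marginal       # p(theta | D, M)

  # Prediction: average of p(x=heads | theta) weighted by the posterior
  p_heads = np.trapz(theta * posterior, theta)
  print(f"marginal likelihood p(D|M) = {marginal:.4g}")
  print(f"predictive p(heads | D, M) = {p_heads:.4f}")
  ```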

  6. Bayes, MAP and Maximum Likelihood
    $p(x_{N+1} \mid D, M) = \int p(x_{N+1} \mid \theta, M)\, p(\theta \mid D, M)\, d\theta$
  ◮ Maximum a posteriori value of θ:
    $\theta_{MAP} = \arg\max_\theta p(\theta \mid D, M)$
    Note: not invariant to reparameterization (cf. the ML estimator); e.g. variance versus precision ($\tau = 1/\sigma^2$) for a Gaussian
  ◮ If the posterior is sharply peaked about the most probable value $\theta_{MAP}$, then $p(x_{N+1} \mid D, M) \simeq p(x_{N+1} \mid \theta_{MAP}, M)$
  ◮ In the limit $N \to \infty$, $\theta_{MAP}$ converges to $\hat{\theta}$ (as long as $p(\hat{\theta}) \neq 0$)
  ◮ The Bayesian approach is most effective when data is limited, i.e. N is small
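
  A small numerical check of the non-invariance of MAP under reparameterization. This sketch assumes a Beta(3, 2) posterior and a log-odds reparameterization for simplicity, rather than the slide's variance/precision example.

  ```python
  # Numerical check that the MAP estimate is not invariant to reparameterization.
  # (Illustrative only: the slide's example is variance vs. precision of a Gaussian;
  #  here an assumed Beta posterior over theta is reparameterized to phi = logit(theta).)
  import numpy as np
  from scipy.stats import beta

  a, b = 3.0, 2.0                                # assumed posterior Beta(3, 2) for theta
  theta = np.linspace(1e-4, 1 - 1e-4, 20001)

  map_theta = theta[np.argmax(beta.pdf(theta, a, b))]

  # Density over phi = log(theta/(1-theta)) picks up the Jacobian d(theta)/d(phi) = theta*(1-theta)
  pdf_phi = beta.pdf(theta, a, b) * theta * (1 - theta)
  map_phi_as_theta = theta[np.argmax(pdf_phi)]   # argmax over phi, mapped back to theta

  print(map_theta)          # (a-1)/(a+b-2) = 2/3
  print(map_phi_as_theta)   # a/(a+b)       = 3/5  -> a different point estimate
  ```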

  7. Learning probabilities: thumbtack example
  Frequentist Approach
  [Figure: a thumbtack landing as heads or tails]
  ◮ The probability of heads θ is unknown
  ◮ Given iid data, estimate θ using an estimator with good properties (e.g. the ML estimator)

  8. Likelihood
  ◮ Likelihood for a sequence of heads (1) and tails (0):
    $p(1100\ldots001 \mid \theta) = \theta^{N_1} (1 - \theta)^{N_0}$
  ◮ MLE:
    $\hat{\theta} = \frac{N_1}{N_1 + N_0}$
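
  A tiny sketch of the ML estimate from a heads/tails sequence (the example sequence is made up):

  ```python
  # Maximum-likelihood estimate of theta from a heads(1)/tails(0) sequence.
  import numpy as np

  x = np.array([1, 1, 0, 0, 0, 0, 1])   # hypothetical thumbtack outcomes
  N1 = int(x.sum())
  N0 = len(x) - N1
  theta_ml = N1 / (N1 + N0)             # counts of heads over total flips
  print(f"N1={N1}, N0={N0}, ML estimate = {theta_ml:.3f}")
  ```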

  9. Learning probabilities: thumbtack example
  Bayesian Approach: (a) the prior
  ◮ Prior density $p(\theta)$; use a Beta distribution:
    $p(\theta) = \mathrm{Beta}(\alpha, \beta) \propto \theta^{\alpha-1} (1-\theta)^{\beta-1}$ for $\alpha, \beta > 0$
  ◮ Properties of the Beta distribution:
    $E[\theta] = \int \theta\, p(\theta)\, d\theta = \frac{\alpha}{\alpha + \beta}$
    $\mathrm{var}(\theta) = \frac{\alpha\beta}{(\alpha+\beta)^2 (\alpha+\beta+1)}$
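
  A quick check of these moment formulas against scipy.stats.beta, with assumed hyperparameters α = 3, β = 2:

  ```python
  # Check the stated Beta mean and variance against scipy (assumed alpha=3, beta=2).
  from scipy.stats import beta

  a, b_ = 3.0, 2.0
  mean_formula = a / (a + b_)
  var_formula = a * b_ / ((a + b_)**2 * (a + b_ + 1))
  mean_scipy, var_scipy = beta.stats(a, b_, moments='mv')
  print(mean_formula, float(mean_scipy))   # 0.6, 0.6
  print(var_formula, float(var_scipy))     # 0.04, 0.04
  ```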

  10. Examples of the Beta distribution
  [Figure: densities of Beta(0.5, 0.5), Beta(1, 1), Beta(3, 2) and Beta(15, 10) on [0, 1]]

  11. Bayesian Approach: (b) the posterior
    $p(\theta \mid D) \propto p(\theta)\, p(D \mid \theta) \propto \theta^{\alpha-1}(1-\theta)^{\beta-1}\, \theta^{N_1}(1-\theta)^{N_0} \propto \theta^{\alpha+N_1-1}(1-\theta)^{\beta+N_0-1}$
  ◮ The posterior is also a Beta distribution, $\sim \mathrm{Beta}(\alpha + N_1, \beta + N_0)$
  ◮ The Beta prior is conjugate to the binomial likelihood (i.e. prior and posterior have the same parametric form)
  ◮ α and β can be thought of as imaginary counts, with α + β as the equivalent sample size [cointoss demo]
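
  A sketch of the conjugate update in code, with assumed prior hyperparameters and counts:

  ```python
  # Conjugate Beta-Bernoulli update: posterior hyperparameters are the prior
  # "imaginary counts" plus the observed counts (assumed numbers).
  from scipy.stats import beta

  alpha, beta_prior = 2.0, 2.0     # prior Beta(2, 2)
  N1, N0 = 7, 3                    # observed heads / tails

  alpha_post = alpha + N1          # posterior is Beta(alpha + N1, beta + N0)
  beta_post = beta_prior + N0
  posterior = beta(alpha_post, beta_post)
  print(f"posterior = Beta({alpha_post}, {beta_post}), mean = {posterior.mean():.3f}")
  ```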

  12. Bayesian Approach: (c) making predictions
    $p(X_{N+1} = \mathrm{heads} \mid D, M) = \int p(X_{N+1} = \mathrm{heads} \mid \theta)\, p(\theta \mid D, M)\, d\theta = \int \theta\, \mathrm{Beta}(\theta;\, \alpha + N_1, \beta + N_0)\, d\theta = \frac{\alpha + N_1}{\alpha + \beta + N}$
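
  A sketch verifying the closed-form predictive probability against numerical integration, using assumed numbers:

  ```python
  # The predictive probability of heads is the posterior mean (alpha+N1)/(alpha+beta+N);
  # check the closed form against a numerical integral (assumed numbers).
  import numpy as np
  from scipy.stats import beta

  alpha, beta_p, N1, N0 = 2.0, 2.0, 7, 3
  N = N1 + N0

  closed_form = (alpha + N1) / (alpha + beta_p + N)

  theta = np.linspace(0, 1, 100001)
  numeric = np.trapz(theta * beta.pdf(theta, alpha + N1, beta_p + N0), theta)
  print(closed_form, numeric)   # both ~0.643
  ```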

  13. Beyond Conjugate Priors
  ◮ The thumbtack came from a magic shop → a mixture prior
    $p(\theta) = 0.4\, \mathrm{Beta}(20, 0.5) + 0.2\, \mathrm{Beta}(2, 2) + 0.4\, \mathrm{Beta}(0.5, 20)$
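
  Although this prior is not a single Beta, the posterior is again a mixture of Betas: each component is updated as before, and its weight is rescaled by that component's marginal likelihood. A sketch with made-up counts (the weight formula is a standard result, not stated on the slide):

  ```python
  # Posterior under a mixture-of-Betas prior: update each component's (alpha, beta)
  # and reweight by the per-component marginal likelihood B(a+N1, b+N0)/B(a, b).
  import numpy as np
  from scipy.special import betaln

  weights = np.array([0.4, 0.2, 0.4])
  alphas  = np.array([20.0, 2.0, 0.5])
  betas   = np.array([0.5, 2.0, 20.0])
  N1, N0  = 3, 1                                 # hypothetical observed counts

  log_ml = betaln(alphas + N1, betas + N0) - betaln(alphas, betas)
  post_w = weights * np.exp(log_ml)
  post_w /= post_w.sum()

  print("posterior weights:", np.round(post_w, 3))
  print("posterior components:", list(zip(alphas + N1, betas + N0)))
  ```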

  14. Generalization to multinomial variables
  ◮ Dirichlet prior:
    $p(\theta_1, \ldots, \theta_r) = \mathrm{Dir}(\alpha_1, \ldots, \alpha_r) \propto \prod_{i=1}^r \theta_i^{\alpha_i - 1}$ with $\sum_i \theta_i = 1$, $\alpha_i > 0$
  ◮ The $\alpha_i$'s are imaginary counts; $\alpha = \sum_i \alpha_i$ is the equivalent sample size
  ◮ Properties: $E(\theta_i) = \frac{\alpha_i}{\alpha}$
  ◮ The Dirichlet distribution is conjugate to the multinomial likelihood
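
  A quick Monte Carlo check that the Dirichlet mean matches $\alpha_i / \alpha$, with assumed concentration parameters:

  ```python
  # Sample-based check of E(theta_i) = alpha_i / alpha for the Dirichlet.
  import numpy as np

  alpha = np.array([3.0, 1.0, 2.0])
  samples = np.random.default_rng(0).dirichlet(alpha, size=200000)
  print(samples.mean(axis=0))   # empirical means
  print(alpha / alpha.sum())    # [0.5, 0.1667, 0.3333]
  ```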

  15. Examples of Dirichlet Distributions
  [Figure; source: https://projects.csail.mit.edu/church/wiki/Models_with_Unbounded_Complexity]

  16. ◮ Likelihood:
    $p(D \mid \theta) \propto \prod_{i=1}^r \theta_i^{N_i}$
  ◮ Show that the MLE is $\hat{\theta}_i = N_i / N$
  ◮ Posterior distribution:
    $p(\theta \mid N_1, \ldots, N_r) \propto \prod_{i=1}^r \theta_i^{\alpha_i + N_i - 1}$
  ◮ Marginal likelihood:
    $p(D \mid M) = \frac{\Gamma(\alpha)}{\Gamma(\alpha + N)} \prod_{i=1}^r \frac{\Gamma(\alpha_i + N_i)}{\Gamma(\alpha_i)}$
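
  A sketch computing these quantities for made-up counts, using log-Gamma functions for the marginal likelihood:

  ```python
  # Dirichlet-multinomial quantities: MLE, posterior hyperparameters, and
  # log marginal likelihood (assumed prior counts and data counts).
  import numpy as np
  from scipy.special import gammaln

  alpha = np.array([1.0, 1.0, 1.0])   # Dirichlet prior counts alpha_i
  counts = np.array([5, 2, 3])        # observed N_i
  N, a0 = counts.sum(), alpha.sum()

  theta_ml = counts / N               # MLE theta_i = N_i / N
  alpha_post = alpha + counts         # posterior Dir(alpha_i + N_i)

  log_ml = (gammaln(a0) - gammaln(a0 + N)
            + np.sum(gammaln(alpha + counts) - gammaln(alpha)))
  print("MLE:", theta_ml)
  print("posterior counts:", alpha_post)
  print("log p(D|M) =", log_ml)
  ```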

  17. Inferring the mean of a Gaussian
  ◮ Likelihood: $p(x \mid \mu) \sim N(\mu, \sigma^2)$
  ◮ Prior: $p(\mu) \sim N(\mu_0, \sigma_0^2)$
  ◮ Given data $D = \{x_1, \ldots, x_N\}$, what is $p(\mu \mid D)$?

  18. $p(\mu \mid D) \sim N(\mu_N, \sigma_N^2)$ with
    $\bar{x} = \frac{1}{N} \sum_{n=1}^N x_n$
    $\mu_N = \frac{N\sigma_0^2}{N\sigma_0^2 + \sigma^2}\, \bar{x} + \frac{\sigma^2}{N\sigma_0^2 + \sigma^2}\, \mu_0$
    $\frac{1}{\sigma_N^2} = \frac{N}{\sigma^2} + \frac{1}{\sigma_0^2}$
  ◮ See Murphy § 4.6.1 or Barber § 8.8.2 for details
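
  A sketch of this update with made-up data and hyperparameters:

  ```python
  # Posterior over the mean of a Gaussian with known variance.
  import numpy as np

  x = np.array([1.2, 0.7, 1.9, 1.1, 0.8])   # hypothetical data
  N = len(x)
  sigma2 = 1.0                               # known noise variance sigma^2
  mu0, sigma2_0 = 0.0, 10.0                  # prior N(mu_0, sigma_0^2)

  xbar = x.mean()
  sigma2_N = 1.0 / (N / sigma2 + 1.0 / sigma2_0)            # 1/sigma_N^2 = N/sigma^2 + 1/sigma_0^2
  mu_N = sigma2_N * (N * xbar / sigma2 + mu0 / sigma2_0)
  # equivalently: mu_N = (N*sigma2_0*xbar + sigma2*mu0) / (N*sigma2_0 + sigma2)

  print(f"posterior: N({mu_N:.3f}, {sigma2_N:.3f})")
  ```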

  19. The exponential family
  ◮ Any distribution over some x that can be written as
    $P(x \mid \eta) = h(x)\, g(\eta) \exp\left(\eta^T u(x)\right)$
    with h and g known, is in the exponential family of distributions.
  ◮ Many common distributions are in the exponential family. A notable exception is the t-distribution.
  ◮ The η are called the natural parameters of the distribution.
  ◮ For most distributions, the common representation (and parameterization) does not take the exponential family form.
  ◮ So it is sometimes useful to convert to the exponential family representation and find the natural parameters.
  ◮ Exercise: why not try this for some of the distributions we have seen already!
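
  As a small example of such a conversion (a sketch, not from the slides): the Bernoulli distribution with parameter θ can be written in this form with $u(x) = x$, $h(x) = 1$ and natural parameter $\eta = \log\frac{\theta}{1-\theta}$.

  ```python
  # Bernoulli in exponential-family form:
  #   p(x|theta) = theta^x (1-theta)^(1-x) = (1-theta) * exp(eta * x),
  # with eta = log(theta/(1-theta)), u(x) = x, h(x) = 1, g(eta) = 1/(1+exp(eta)).
  import numpy as np

  theta = 0.3
  eta = np.log(theta / (1 - theta))        # natural parameter (log-odds)
  g = 1.0 / (1.0 + np.exp(eta))            # g(eta) = 1 - theta

  for x in (0, 1):
      standard = theta**x * (1 - theta)**(1 - x)
      exp_family = g * np.exp(eta * x)     # h(x) * g(eta) * exp(eta^T u(x)) with h(x) = 1
      print(x, standard, exp_family)       # the two forms agree
  ```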

  20. Conjugate exponential models
  ◮ If the prior takes the same functional form as the posterior for a given likelihood, the prior is said to be conjugate for that likelihood
  ◮ There is a conjugate prior for any exponential family distribution
  ◮ If the prior and likelihood are conjugate and exponential, then the model is said to be conjugate exponential
  ◮ In conjugate exponential models, the Bayesian integrals can be done analytically

  21. Reflecting on Conjugacy
  ◮ All of the priors that we have seen so far are conjugate
  ◮ Good thing: easy to do the sums
  ◮ Bad thing: the prior distribution should match your beliefs. Does a Beta distribution match your beliefs? Is it good enough?
  ◮ Certainly not always
  ◮ Use approximate inference methods for non-conjugate models (see later in MLPR)

  22. Comparing Bayesian and Frequentist approaches
  ◮ Frequentist: fix θ, consider all possible data sets generated with θ fixed
  ◮ Bayesian: fix D, consider all possible values of θ
  ◮ One view is that the Bayesian and Frequentist approaches have different definitions of what it means to be a good estimator

  23. Summary of Bayesian Methods
  ◮ Maximum likelihood fails to capture prior information or uncertainty
  ◮ Need to use a prior distribution (maximum likelihood equals MAP with a uniform prior)
  ◮ The prior distribution might have its own parameters (usually called hyper-parameters)
  ◮ MAP fails to capture uncertainty; we need the full posterior distribution
  ◮ Prediction using MAP parameters does not capture uncertainty
  ◮ Do inference by marginalization. Inference and learning are just applications of the rules of probability
