Bayesian Methods for Parameter Estimation

Chris Williams

School of Informatics, University of Edinburgh

October 2007


Overview

Introduction to Bayesian statistics:
- Learning a probability
- Learning the mean of a Gaussian

Readings: Bishop §2.1 (Beta), §2.2 (Dirichlet), §2.3.6 (Gaussian); Heckerman tutorial, section 2


Bayesian vs Frequentist Inference

Frequentist:
- Assumes that there is an unknown but fixed parameter θ
- Estimates θ with some confidence
- Predicts using the estimated parameter value

Bayesian:
- Represents uncertainty about the unknown parameter
- Uses probability to quantify this uncertainty; unknown parameters are treated as random variables
- Prediction follows the rules of probability


Frequentist method

Model $p(x \mid \theta, M)$, data $D = \{x_1, \ldots, x_n\}$. The maximum likelihood estimate is

$$\hat{\theta} = \operatorname{argmax}_{\theta}\, p(D \mid \theta, M)$$

Prediction for $x_{n+1}$ is based on $p(x_{n+1} \mid \hat{\theta}, M)$.


Bayesian method

Prior distribution $p(\theta \mid M)$, posterior distribution $p(\theta \mid D, M)$:

$$p(\theta \mid D, M) = \frac{p(D \mid \theta, M)\, p(\theta \mid M)}{p(D \mid M)}$$

Making predictions:

$$p(x_{n+1} \mid D, M) = \int p(x_{n+1}, \theta \mid D, M)\, d\theta$$
$$= \int p(x_{n+1} \mid \theta, D, M)\, p(\theta \mid D, M)\, d\theta$$
$$= \int p(x_{n+1} \mid \theta, M)\, p(\theta \mid D, M)\, d\theta$$

where the last step uses the fact that, given θ, $x_{n+1}$ is independent of the earlier data D.

Interpretation: an average of the predictions $p(x_{n+1} \mid \theta, M)$, weighted by the posterior $p(\theta \mid D, M)$. The normalizer $p(D \mid M)$ is the marginal likelihood (important for model comparison).
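To make the recipe concrete, here is a brute-force numerical sketch: discretize θ, multiply likelihood by prior, normalize, and average the per-θ predictions. The Bernoulli model, uniform prior, and counts are assumptions for illustration, not from the slides.

```python
# Minimal sketch of the Bayesian recipe on a grid: posterior is
# likelihood x prior (normalized), prediction averages over theta.
import numpy as np

theta = np.linspace(0.001, 0.999, 999)   # grid over the parameter
dtheta = theta[1] - theta[0]
prior = np.ones_like(theta)              # p(theta | M): uniform (assumed)
n_h, n_t = 6, 4                          # illustrative data summary
lik = theta**n_h * (1 - theta)**n_t      # p(D | theta, M)

post = lik * prior
post /= post.sum() * dtheta              # normalize: divide by p(D | M)

# p(x_{n+1} = heads | D, M) = integral of theta * p(theta | D, M) dtheta
p_heads = (theta * post).sum() * dtheta
print(f"p(next = heads | D) = {p_heads:.3f}")   # approx (6+1)/(10+2) = 0.583
```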

Bayes, MAP and Maximum Likelihood

$$p(x_{n+1} \mid D, M) = \int p(x_{n+1} \mid \theta, M)\, p(\theta \mid D, M)\, d\theta$$

Maximum a posteriori value of θ:

$$\theta_{\mathrm{MAP}} = \operatorname{argmax}_{\theta}\, p(\theta \mid D, M)$$

Note: θ_MAP is not invariant to reparameterization (cf. the ML estimator, which is). If the posterior is sharply peaked about the most probable value θ_MAP, then

$$p(x_{n+1} \mid D, M) \simeq p(x_{n+1} \mid \theta_{\mathrm{MAP}}, M)$$

In the limit n → ∞, θ_MAP converges to $\hat{\theta}$ (as long as the prior $p(\hat{\theta}) \neq 0$). The Bayesian approach is most effective when data is limited, i.e. when n is small.
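The reparameterization caveat is easy to verify numerically: a monotone change of variables multiplies the density by a Jacobian factor, which moves the mode. A sketch, assuming a Beta(3, 2) posterior and a logit reparameterization (both chosen purely for illustration):

```python
# Sketch: the MAP estimate depends on the parameterization.
import numpy as np
from scipy.stats import beta

a, b = 3.0, 2.0                              # assumed Beta posterior
theta = np.linspace(1e-4, 1 - 1e-4, 100001)

# Mode in the theta parameterization: (a-1)/(a+b-2) = 2/3
map_theta = theta[np.argmax(beta.pdf(theta, a, b))]

# Reparameterize phi = logit(theta); the density in phi picks up the
# Jacobian |dtheta/dphi| = theta * (1 - theta), so the mode moves.
p_phi = beta.pdf(theta, a, b) * theta * (1 - theta)
map_theta_from_phi = theta[np.argmax(p_phi)]   # a/(a+b) = 0.6, not 2/3

print(map_theta, map_theta_from_phi)
```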


Learning probabilities: thumbtack example

Frequentist approach:
- The probability of heads θ is unknown
- Given i.i.d. data, estimate θ using an estimator with good properties (e.g. the ML estimator)

[Figure: the two thumbtack outcomes, "heads" and "tails"]


Likelihood

Likelihood for a sequence of heads and tails:

$$p(hhth \ldots tth \mid \theta) = \theta^{n_h} (1 - \theta)^{n_t}$$

MLE:

$$\hat{\theta} = \frac{n_h}{n_h + n_t}$$
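In code, the MLE is just a ratio of counts; the outcome string below is made up for illustration:

```python
# Sketch: ML estimate of the heads probability from an (illustrative)
# outcome sequence.
seq = "hhthtthhth"
n_h, n_t = seq.count("h"), seq.count("t")
theta_mle = n_h / (n_h + n_t)
print(f"n_h={n_h}, n_t={n_t}, MLE = {theta_mle:.2f}")   # 6, 4, 0.60
```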


Learning probabilities: thumbtack example

Bayesian approach: (a) the prior

For the prior density p(θ), use a Beta distribution:

$$p(\theta) = \mathrm{Beta}(\alpha_h, \alpha_t) \propto \theta^{\alpha_h - 1} (1 - \theta)^{\alpha_t - 1} \quad \text{for } \alpha_h, \alpha_t > 0$$

Properties of the Beta distribution:

$$E[\theta] = \int \theta\, p(\theta)\, d\theta = \frac{\alpha_h}{\alpha_h + \alpha_t}$$


Examples of the Beta distribution

[Figure: four Beta densities on (0, 1): Beta(0.5, 0.5), Beta(1, 1), Beta(3, 2), Beta(15, 10)]
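A sketch that reproduces the four panels with scipy/matplotlib (layout choices are mine):

```python
# Sketch: plot the four Beta densities from the slide.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta

theta = np.linspace(0.001, 0.999, 500)
fig, axes = plt.subplots(2, 2, figsize=(8, 6))
for ax, (a, b) in zip(axes.ravel(), [(0.5, 0.5), (1, 1), (3, 2), (15, 10)]):
    ax.plot(theta, beta.pdf(theta, a, b))
    ax.set_title(f"Beta({a},{b})")
plt.tight_layout()
plt.show()
```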


Bayesian approach: (b) the posterior

$$p(\theta \mid D) \propto p(\theta)\, p(D \mid \theta) \propto \theta^{\alpha_h - 1} (1 - \theta)^{\alpha_t - 1}\, \theta^{n_h} (1 - \theta)^{n_t} \propto \theta^{\alpha_h + n_h - 1} (1 - \theta)^{\alpha_t + n_t - 1}$$

- The posterior is also a Beta distribution, Beta(α_h + n_h, α_t + n_t)
- The Beta prior is conjugate to the binomial likelihood (i.e. prior and posterior have the same parametric form)
- α_h and α_t can be thought of as imaginary counts, with α = α_h + α_t the equivalent sample size
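The conjugate update is pure bookkeeping, as this sketch with assumed prior counts and data shows:

```python
# Sketch: conjugate Beta-binomial update (prior counts and data are
# illustrative).
alpha_h, alpha_t = 2.0, 2.0     # imaginary counts; equivalent sample size 4
n_h, n_t = 7, 3                 # observed heads and tails

post_h, post_t = alpha_h + n_h, alpha_t + n_t   # posterior Beta parameters
post_mean = post_h / (post_h + post_t)          # E[theta | D]
print(f"posterior Beta({post_h:g}, {post_t:g}), mean {post_mean:.3f}")
```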


Bayesian Approach: (c) making predictions

[Figure: graphical model with θ as the parent of x_1, x_2, …, x_n and of x_{n+1}]

$$p(X_{n+1} = \mathrm{heads} \mid D, M) = \int p(X_{n+1} = \mathrm{heads} \mid \theta)\, p(\theta \mid D, M)\, d\theta$$
$$= \int \theta\, \mathrm{Beta}(\alpha_h + n_h, \alpha_t + n_t)\, d\theta$$
$$= \frac{\alpha_h + n_h}{\alpha + n}$$
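Comparing this predictive with the ML plug-in shows the smoothing effect of the imaginary counts. The Beta(1, 1) prior and all-heads data below are assumptions for illustration; with this prior the formula reduces to Laplace's rule of succession:

```python
# Sketch: Bayesian predictive vs. ML plug-in for the thumbtack.
alpha_h, alpha_t = 1.0, 1.0     # uniform prior Beta(1, 1) (assumed)
n_h, n_t = 3, 0                 # three heads in a row, no tails
alpha, n = alpha_h + alpha_t, n_h + n_t

p_pred = (alpha_h + n_h) / (alpha + n)   # 4/5: hedged away from certainty
p_mle = n_h / n                          # 1.0: overconfident when n is small
print(p_pred, p_mle)
```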


Beyond Conjugate Priors

The thumbtack came from a magic shop → a mixture prior

$$p(\theta) = 0.4\,\mathrm{Beta}(20, 0.5) + 0.2\,\mathrm{Beta}(2, 2) + 0.4\,\mathrm{Beta}(0.5, 20)$$

A mixture of Betas is still tractable: each component updates conjugately, and the mixture weights are reweighted by how well each component predicts the data (see the sketch below).
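A sketch of that update, using the slide's mixture and hypothetical counts; each component's marginal likelihood is the ratio of Beta functions B(α_h + n_h, α_t + n_t) / B(α_h, α_t):

```python
# Sketch: updating the slide's mixture-of-Betas prior on illustrative
# counts. Posterior = reweighted mixture of updated Beta components.
import numpy as np
from scipy.special import betaln

w = np.array([0.4, 0.2, 0.4])      # mixture weights from the slide
a = np.array([20.0, 2.0, 0.5])     # per-component alpha_h
b = np.array([0.5, 2.0, 20.0])     # per-component alpha_t
n_h, n_t = 5, 1                    # hypothetical data

# log marginal likelihood of the data under each component:
# log B(a + n_h, b + n_t) - log B(a, b)
log_ml = betaln(a + n_h, b + n_t) - betaln(a, b)
w_post = w * np.exp(log_ml - log_ml.max())
w_post /= w_post.sum()

# posterior: sum_k w_post[k] * Beta(a[k] + n_h, b[k] + n_t)
print("posterior weights:", np.round(w_post, 3))
```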


Generalization to multinomial variables

Dirichlet prior:

$$p(\theta_1, \ldots, \theta_r) = \mathrm{Dir}(\alpha_1, \ldots, \alpha_r) \propto \prod_{i=1}^{r} \theta_i^{\alpha_i - 1}$$

with $\sum_i \theta_i = 1$ and $\alpha_i > 0$.

- The α_i's are imaginary counts, with $\alpha = \sum_i \alpha_i$ the equivalent sample size
- Properties: $E[\theta_i] = \alpha_i / \alpha$
- The Dirichlet distribution is conjugate to the multinomial likelihood


Posterior distribution:

$$p(\theta \mid n_1, \ldots, n_r) \propto \prod_{i=1}^{r} \theta_i^{\alpha_i + n_i - 1}$$

Marginal likelihood:

$$p(D \mid M) = \frac{\Gamma(\alpha)}{\Gamma(\alpha + n)} \prod_{i=1}^{r} \frac{\Gamma(\alpha_i + n_i)}{\Gamma(\alpha_i)}$$
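A sketch of both formulas with made-up counts; the marginal likelihood is computed in log space with gammaln for numerical stability:

```python
# Sketch: Dirichlet-multinomial posterior update and log marginal
# likelihood, using the formulas on this slide (counts are illustrative).
import numpy as np
from scipy.special import gammaln

alpha = np.array([1.0, 1.0, 1.0])   # Dirichlet prior over r = 3 outcomes
counts = np.array([4, 1, 2])        # observed counts n_1, ..., n_r

post = alpha + counts               # posterior is Dir(alpha + n)

a, n = alpha.sum(), counts.sum()
log_ml = (gammaln(a) - gammaln(a + n)
          + np.sum(gammaln(alpha + counts) - gammaln(alpha)))
print("posterior parameters:", post, " log p(D|M) =", log_ml)
```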


Inferring the mean of a Gaussian

Likelihood: $p(x \mid \mu) \sim N(\mu, \sigma^2)$, with the variance σ² assumed known

Prior: $p(\mu) \sim N(\mu_0, \sigma_0^2)$

Given data $D = \{x_1, \ldots, x_n\}$, what is $p(\mu \mid D)$?


$$p(\mu \mid D) \sim N(\mu_n, \sigma_n^2)$$

with

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad \mu_n = \frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2}\,\bar{x} + \frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\,\mu_0, \qquad \frac{1}{\sigma_n^2} = \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}$$

See Bishop §2.3.6 for details.
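A sketch of this update on synthetic data (all numbers are illustrative); note how the posterior mean interpolates between the prior mean µ0 and the sample mean x̄, and how the posterior variance shrinks as n grows:

```python
# Sketch: closed-form posterior for the mean of a Gaussian with known
# variance, following the update on this slide.
import numpy as np

rng = np.random.default_rng(0)
mu_true, sigma = 2.0, 1.0
x = rng.normal(mu_true, sigma, size=10)   # synthetic data

mu0, sigma0 = 0.0, 2.0                    # prior N(mu0, sigma0^2) (assumed)
n, xbar = len(x), x.mean()

mu_n = (n * sigma0**2 * xbar + sigma**2 * mu0) / (n * sigma0**2 + sigma**2)
var_n = 1.0 / (n / sigma**2 + 1.0 / sigma0**2)
print(f"posterior: N({mu_n:.3f}, {var_n:.3f})")   # pulled toward xbar as n grows
```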


Comparing Bayesian and Frequentist approaches

- Frequentist: fix θ, consider all possible data sets generated with that θ
- Bayesian: fix D, consider all possible values of θ
- One view is that the Bayesian and frequentist approaches simply have different definitions of what it means to be a good estimator
