SLIDE 1

Bayesian Learning

SLIDE 2

Outline

  • MLE, MAP vs. Bayesian Learning
  • Bayesian Linear Regression
  • Bayesian Gaussian Mixture Models
      – Non-parametric Bayes

SLIDE 3

Take Away ...

  • 1. Maximum Likelihood Estimate (MLE)
      – θ* = arg max_θ p(D|θ)
      – Use θ* in the future to predict y_{n+1} given x_{n+1}
  • 2. Maximum a posteriori (MAP) estimation
      – θ* = arg max_θ p(θ|D, α) = arg max_θ p(D|θ) p(θ|α)
      – α is called a hyperparameter
      – Use θ* in the future to predict y_{n+1} given x_{n+1}
  • 3. Bayesian treatment
      – Model p(θ|D, α)
      – p(y_{n+1}|x_{n+1}, D, α) = ∫ p(y_{n+1}|θ, x_{n+1}) p(θ|D, α) dθ
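As a concrete illustration of the three estimates (not taken from the slides), here is a minimal Python sketch for a Bernoulli model with a Beta prior, where the MLE, the MAP estimate, and the Bayesian predictive all have closed forms; the coin-flip data and the Beta(2, 2) hyperparameters are made up.

```python
import numpy as np

# Toy coin-flip data (1 = heads); values are illustrative only.
D = np.array([1, 1, 0, 1, 0, 1, 1, 1])
n_heads, n = D.sum(), len(D)

# Beta(a, b) prior on theta; (a, b) plays the role of the hyperparameter alpha.
a, b = 2.0, 2.0

# 1. MLE: theta* = arg max_theta p(D|theta)
theta_mle = n_heads / n

# 2. MAP: mode of the Beta(a + n_heads, b + n - n_heads) posterior
theta_map = (a + n_heads - 1) / (a + b + n - 2)

# 3. Bayesian treatment: predictive p(y_{n+1} = 1 | D) is the posterior mean
p_next_heads = (a + n_heads) / (a + b + n)

print(theta_mle, theta_map, p_next_heads)   # 0.75, 0.70, 0.666...
```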

SLIDE 4

MLE Estimate

θ* = arg max_θ p(D|θ)

SLIDE 5

MAP Estimate

θ* = arg max_θ p(D|θ) p(θ)

SLIDE 6

Bayesian Learning

p(y_{n+1}|x_{n+1}, D) = ∫ p(y_{n+1}|θ, x_{n+1}) p(θ|D) dθ

SLIDE 8

Linear Regression

  • D = {(x_i, y_i)}, i = 1 · · · N
  • Assume that y = f(x, w) + ε
      – ε ∼ N(0, β⁻¹)
  • Linear models assume that
      – f(x, w) = ⟨w, x⟩ = w^T x
  • The aim is to find the appropriate weight vector w
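To make the later slides concrete, here is a small synthetic-data sketch of the assumed model y = w^T x + ε; the sizes N and d, the noise precision β, and the true weights are all made up for illustration, and the same X, y, and beta are reused in the sketches that follow.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes and noise precision (not from the slides).
N, d = 50, 3
beta = 25.0                      # noise precision: epsilon ~ N(0, 1/beta)
w_true = rng.normal(size=d)      # the weight vector we hope to recover

X = rng.normal(size=(N, d))                            # inputs x_i
y = X @ w_true + rng.normal(0, 1/np.sqrt(beta), N)     # y_i = w^T x_i + eps_i
```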

SLIDE 9

Maximum Likelihood Estimate (MLE)

  • 1. Write the likelihood
      – p(y|x, w, β) = N(y|f(x, w), β⁻¹) = N(y|w^T x, β⁻¹)

      L(w) = p(y_1..y_N | x_1..x_N, w, β) = ∏_{i=1}^{N} N(y_i|w^T x_i, β⁻¹)
           = ∏_i √(β/2π) exp( −(β/2)(y_i − w^T x_i)² )

      w* = arg min_w Σ_i (y_i − w^T x_i)²     (1)

  • 2. Solve for w* and use it for future predictions (see the sketch below).
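A minimal sketch of step 2, assuming the synthetic X and y from the slide 8 sketch above: under Gaussian noise, maximizing the likelihood in equation (1) is ordinary least squares.

```python
# MLE for linear regression = ordinary least squares (uses X, y from above).
w_mle, *_ = np.linalg.lstsq(X, y, rcond=None)

# Use w* to predict y_{n+1} for a new input x_{n+1} (a made-up point here).
x_new = np.ones(X.shape[1])
y_pred = w_mle @ x_new
```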

SLIDE 10

MAP Estimate

  • 1. Introduce priors on the parameters
  • What are the parameters in this model?
  • Conjugate priors
      – Prior and posterior have the same form.
      – Beta is conjugate to the Bernoulli distribution.
      – Normal with known variance is conjugate to the Normal distribution (see the sketch below).
  • Hyperparameter
      – The parameters of the prior distribution
  • 2. Model the posterior distribution p(θ|D, α)

θ* = arg max_θ p(θ|D, α) = arg max_θ p(D|θ) p(θ|α)
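A small sketch of the "Normal with known variance is conjugate to Normal" bullet; the observations, the known variance, and the prior parameters below are made up. The posterior over the mean stays in the Normal family, which is what conjugacy means here.

```python
import numpy as np

# Normal likelihood with known variance + Normal prior on the mean.
x = np.array([1.2, 0.8, 1.5, 1.1])    # observations (illustrative)
sigma2 = 0.25                          # known likelihood variance
mu0, tau0_2 = 0.0, 1.0                 # prior: mu ~ N(mu0, tau0_2)

n = len(x)
tau_n_2 = 1.0 / (1.0 / tau0_2 + n / sigma2)           # posterior variance
mu_n = tau_n_2 * (mu0 / tau0_2 + x.sum() / sigma2)    # posterior mean
# Posterior: mu | x ~ N(mu_n, tau_n_2) -- same form as the prior.
```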

SLIDE 11

MAP Estimate

For linear regression, p(y|x, w, β) = N(y|w^T x, β⁻¹)

  • 1. Introduce a prior distribution
  • Identify the parameters
  • We put a Gaussian prior on w

      p(w|α) = N(w|0, α⁻¹I)

  • 2. Model the posterior distribution
  • p(w|y, X, α) ∝ p(y|w, X) p(w|α)
      – Likelihood L(w) = p(y|w, X) = ∏_i p(y_i|x_i, w, β) = ∏_i N(y_i|w^T x_i, β⁻¹)

SLIDE 12

MAP Estimate

With the above choice of prior, p(w|y, X, α, β) = N(w|µ_N, Σ_N)

  • Σ_N⁻¹ = αI + βX^T X
  • µ_N = βΣ_N X^T y

Since this is Gaussian, the mode is the same as the mean, so

w*_MAP = µ_N = βΣ_N X^T y
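Continuing the synthetic sketch (X, y, and beta from slide 8; the prior precision alpha is a made-up value), the posterior over w and the MAP weights follow directly from the formulas above.

```python
# Posterior over w for Bayesian linear regression (continues the earlier sketch).
alpha = 2.0                                         # illustrative prior precision
d = X.shape[1]

Sigma_N_inv = alpha * np.eye(d) + beta * X.T @ X    # posterior precision
Sigma_N = np.linalg.inv(Sigma_N_inv)                # posterior covariance
mu_N = beta * Sigma_N @ X.T @ y                     # posterior mean

w_map = mu_N   # Gaussian posterior: mode = mean, so the MAP estimate is mu_N
```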

SLIDE 13

Bayesian Treatment

  • 1. Introduce a prior on the parameters
      – For linear regression, p(w|α) = N(w|0, α⁻¹I)
  • 2. Model the posterior distribution of the parameters
      – p(w|y, X, α) ∝ p(y|w, X) p(w|α)
      – For linear regression, p(w|y, X, α, β) = N(w|µ_N, Σ_N)
  • 3. Predictive distribution
      – p(y_{n+1}|x_{n+1}, y, X, α, β)

The first two steps are common to the MAP estimate process.

SLIDE 14

Predictive Distribution

Model the posterior distribution p(w|y, X, α, β) = N(w|µ_N, Σ_N). Unlike the MAP estimate, we integrate over all possible parameter values:

p(y_{n+1}|x_{n+1}, y, X, α, β) = ∫ p(y_{n+1}|w, x_{n+1}, β) p(w|y, X, α, β) dw

SLIDE 15

Predictive Distribution

p(y_{n+1}|x_{n+1}, y, X, α, β) = ∫ p(y_{n+1}|w, x_{n+1}, β) p(w|y, X, α, β) dw
                               = N(y_{n+1} | µ_N^T x_{n+1}, σ_N²(x_{n+1}))

  • The variance σ_N²(x_{n+1}) decreases as N grows
  • In the limit, y_{n+1} = µ_N^T x_{n+1} = w_MAP^T x_{n+1}
  • Hyperparameter estimation
      – Put a prior on the hyperparameters?
      – Empirical Bayes or EM
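A sketch of the closed-form predictive distribution above, continuing the same example (mu_N, Sigma_N, and beta come from the earlier blocks; x_new is the same made-up test input). The predictive variance is 1/β plus a term that shrinks as N grows.

```python
# Predictive distribution for a new input (continues the earlier sketch).
x_new = np.ones(X.shape[1])                         # illustrative x_{n+1}

pred_mean = mu_N @ x_new                            # mu_N^T x_{n+1}
pred_var = 1.0 / beta + x_new @ Sigma_N @ x_new     # sigma_N^2(x_{n+1})
# As N grows, x^T Sigma_N x shrinks and pred_mean -> w_MAP^T x_{n+1}.
```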

SLIDE 16

Hyperparameter Estimation – Empirical Bayes

p(y_{n+1}|y, X) = ∫∫∫ p(y_{n+1}|w, X, β) p(w|y, X, α, β) p(α, β|y) dα dβ dw

  • Relatively insensitive to the hyperparameters
  • If the posterior p(α, β|y, X) is sharply peaked, then

      p(y_{n+1}|y, X) ≈ p(y_{n+1}|y, X, α*, β*) = ∫ p(y_{n+1}|w, X, β*) p(w|y, X, α*, β*) dw

  • If the prior is relatively flat, then
      – α* and β* are obtained by maximizing the marginal likelihood p(y|X, α, β) (see the sketch below).
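A hedged sketch of empirical Bayes for this model, assuming the same X and y as before: choose α*, β* by maximizing the log marginal likelihood ln p(y|X, α, β) of Bayesian linear regression, here over a small illustrative grid rather than by EM.

```python
# Empirical Bayes sketch: maximize the log evidence ln p(y|X, alpha, beta).
def log_evidence(X, y, alpha, beta):
    N, d = X.shape
    A = alpha * np.eye(d) + beta * X.T @ X           # posterior precision
    m_N = beta * np.linalg.solve(A, X.T @ y)         # posterior mean
    E = 0.5 * beta * np.sum((y - X @ m_N) ** 2) + 0.5 * alpha * m_N @ m_N
    return (0.5 * d * np.log(alpha) + 0.5 * N * np.log(beta) - E
            - 0.5 * np.linalg.slogdet(A)[1] - 0.5 * N * np.log(2 * np.pi))

grid = [0.01, 0.1, 1.0, 10.0, 100.0]                 # illustrative grid
alpha_star, beta_star = max(((a, b) for a in grid for b in grid),
                            key=lambda ab: log_evidence(X, y, *ab))
```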

SLIDE 17

Bayesian Treatment

  • 1. Introduce a prior distribution
      – Conjugacy
  • 2. Model the posterior distribution
      – Hyperparameters can be estimated using Empirical Bayes
      – Avoids the cross-validation step
      – Hence, we can use all the training data
  • 3. Predictive distribution
      – Integrate over the parameters
      – Or draw a few samples from the posterior and average over them (see the sketch below)
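A sketch of the "draw a few samples and average" option, continuing the linear-regression example (rng, mu_N, Sigma_N, beta, and x_new come from the earlier blocks): sample weights from the Gaussian posterior and approximate the predictive integral by Monte Carlo.

```python
# Monte Carlo approximation of the predictive distribution.
S = 1000
w_samples = rng.multivariate_normal(mu_N, Sigma_N, size=S)    # w ~ p(w|y, X)

y_samples = w_samples @ x_new + rng.normal(0, 1/np.sqrt(beta), S)
mc_pred_mean = y_samples.mean()     # approximates mu_N^T x_new
mc_pred_var = y_samples.var()       # approximates sigma_N^2(x_new)
```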

SLIDE 18

Outline

  • MLE, MAP vs. Bayesian Learning
  • Bayesian Linear Regression
  • Bayesian Gaussian Mixture Models
      – Non-parametric Bayes

SLIDE 19

Mixture Models (Recap)

  • Finite Gaussian Mixture Model
      – z = 1 · · · K mixture components
      – parameters for each component: (µ_k, β)

      p(x, z) = p(z) p(x|z)
      p(x) = Σ_{z=1..K} p(z = k) p(x|µ_k, β) = Σ_k φ_k p(x|µ_k, β)

  • What are the parameters in a Gaussian Mixture Model? (see the sketch below)
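A minimal 1-D sketch of the mixture density p(x) = Σ_k φ_k N(x|µ_k, β⁻¹); the weights, means, and shared precision below are made up. With β held fixed (as on slide 21), the parameters are exactly the mixing weights φ_k and the means µ_k.

```python
import numpy as np

# Illustrative 1-D Gaussian mixture: weights phi_k, means mu_k, shared precision beta.
phi = np.array([0.5, 0.3, 0.2])     # mixing weights, sum to 1
mu = np.array([-2.0, 0.0, 3.0])     # component means mu_k
beta = 4.0                          # shared precision, variance = 1/beta

def normal_pdf(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

def gmm_density(x):
    # p(x) = sum_k phi_k N(x | mu_k, 1/beta)
    return np.sum(phi * normal_pdf(x, mu, 1.0 / beta))

print(gmm_density(0.5))
```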

SLIDE 20

Bayesian treatment of Mixture Models

Non-parametric Bayes

  • What should we do ?

SLIDE 21

Bayesian treatment of Mixture Models

  • 1. Introduce a prior distribution
  • 2. Model the posterior distribution
  • 3. Predictive distribution
      – p(x) = Σ_k φ_k p(x|µ_k, β)
  • For the GMM, we keep the variance fixed.
      – p(x|µ_k, β) = N(µ_k, β⁻¹)
  • Put priors on the mixing weights (φ_k) and the mean parameters (µ_k).

SLIDE 22

Dirichlet Process

G ∼ DP(α_0, G_0). Treat this as a collection of samples {θ_1, θ_2, · · ·} with weights {φ_1, φ_2, · · ·}

  • θ_i ∼ G_0 can be scalar or vector, depending on G_0
      – Countably infinite collection of i.i.d. samples
  • Σ_k φ_k = 1
      – The stick-breaking construction gives these weights (see the sketch below).
      – The φ_k values depend on α_0
  • θ ∼ G ⇒ choose a θ_i with weight φ_i
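A truncated stick-breaking sketch for DP(α_0, G_0); α_0, the truncation level K, and the choice of G_0 = N(0, 1) are all illustrative. Each weight is the Beta(1, α_0) fraction broken off the remaining stick.

```python
import numpy as np

rng = np.random.default_rng(0)

alpha0, K = 1.0, 20                             # illustrative values
v = rng.beta(1.0, alpha0, size=K)               # stick-breaking proportions v_k
remaining = np.concatenate(([1.0], np.cumprod(1 - v)[:-1]))
phi = v * remaining                             # phi_k = v_k * prod_{j<k} (1 - v_j)

theta = rng.normal(0.0, 1.0, size=K)            # atoms theta_k ~ G0 (here N(0, 1))
# A draw theta ~ G picks atom theta_k with probability phi_k (up to truncation).
sample = rng.choice(theta, p=phi / phi.sum())
```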

SLIDE 23

Dirichlet Process for GMM

  • 1. Prior on the parameters
      – Let the base distribution G_0 be N(ψ, γI)
      – µ_i ∼ G_0 ⇒ µ_i ∼ N(ψ, γI)
      – The stick-breaking process is used as the prior on the φ_i
      – Allows an arbitrary number of mixture components.

      G ∼ DP(α, N(ψ, γI))
      µ_i | G ∼ G
      x_i | µ_i, β ∼ N(µ_i, β⁻¹)

  • Chinese Restaurant Process (see the sketch below)
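A small Chinese Restaurant Process sketch (α and the number of customers are made up): each new customer joins an existing table with probability proportional to its occupancy, or opens a new table with probability proportional to α, so the number of tables (components) is not fixed in advance.

```python
import numpy as np

rng = np.random.default_rng(0)

alpha, N = 1.0, 30                # illustrative concentration and sample size
counts = []                       # n_k: number of customers at each table

for i in range(N):
    probs = np.array(counts + [alpha], dtype=float)
    probs /= probs.sum()
    k = rng.choice(len(probs), p=probs)
    if k == len(counts):
        counts.append(1)          # open a new table (new mixture component)
    else:
        counts[k] += 1

print(counts)                     # table sizes; K was never specified up front
```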

SLIDE 24

Dirichlet Process for GMM

  • 2. Modeling the posterior
      – Let c_i denote the cluster indicator of the i-th example
      – p(c, µ|X) ∝ p(c|α) p(µ|c, X)
      – Run a Gibbs sampler.
      – Estimate the hyperparameters (α and γ)
  • 3. Predictive distribution
      – Draw samples from the posterior.
      – Average over those samples.
      – No need to specify the number of components.

SLIDE 25

Non-parametric Bayes

  • The stick-breaking construction gives a prior on the mixing weights.
  • Learns the number of components from the data.
  • Hyperparameters are estimated using Empirical Bayes.
  • Hierarchical Dirichlet Process (HDP)
      – Makes it possible to design hierarchical models

SLIDE 26

Take Away ...

  • 1. Maximum Likelihood Estimate (MLE)
      – θ* = arg max_θ p(D|θ)
      – Use θ* in the future to predict y_{n+1} given x_{n+1}
  • 2. Maximum a posteriori (MAP) estimation
      – θ* = arg max_θ p(θ|D, α) = arg max_θ p(D|θ) p(θ|α)
      – α is called a hyperparameter
      – Use θ* in the future to predict y_{n+1} given x_{n+1}
  • 3. Bayesian treatment
      – Model p(θ|D, α)
      – p(y_{n+1}|x_{n+1}, D, α) = ∫ p(y_{n+1}|θ, x_{n+1}) p(θ|D, α) dθ

SLIDE 27

Questions ?
