Case Study: Bayesian Linear Regression and Sparse Bayesian Models - PowerPoint PPT Presentation

Case Study: Bayesian Linear Regression and Sparse Bayesian Models Piyush Rai Dept. of CSE, IIT Kanpur (Mini-course: lecture 2) Nov 05, 2015 Piyush Rai (IIT Kanpur) Bayesian Linear Regression and Sparse Bayesian Models 1

Recap Piyush Rai (IIT Kanpur) Bayesian Linear Regression and Sparse Bayesian Models 2

Maximum Likelihood Estimation (MLE) We wish to estimate parameters θ from observed data { x 1 , . . . , x N } MLE does this by finding θ that maximizes the (log)likelihood p ( X | θ ) N N ˆ � � log p ( X | θ ) = arg max p ( x n | θ ) = arg max log p ( x n | θ ) θ = arg max log θ θ θ n =1 n =1 MLE now reduces to solving an optimization problem w.r.t. θ Piyush Rai (IIT Kanpur) Bayesian Linear Regression and Sparse Bayesian Models 3

Maximum-a-Posteriori (MAP) Estimation Incorporating prior knowledge p ( θ ) about the parameters MAP estimation finds θ that maximizes the posterior p ( θ | X ) ∝ p ( X | θ ) p ( θ ) N N ˆ � � p ( x n | θ ) p ( θ ) = arg max log p ( x n | θ ) + log p ( θ ) θ = arg max log θ θ n =1 n =1 MAP now reduces to solving an optimization problem w.r.t. θ Objective function very similar to MLE, except for the log p ( θ ) term In some sense, MAP is just a “regularized” MLE Piyush Rai (IIT Kanpur) Bayesian Linear Regression and Sparse Bayesian Models 4

Bayesian Learning Both MLE and MAP only give a point estimate (single best answer) of θ How can we capture/quantify the uncertainty in θ ? Need to infer the full posterior distribution p ( θ | X ) = p ( X | θ ) p ( θ ) p ( X | θ ) p ( θ ) = θ p ( X | θ ) p ( θ ) d θ ∝ Likelihood × Prior � p ( X ) Requires doing a “fully Bayesian” inference Inference sometimes a somewhat easy and sometimes a (very) hard problem Conjugate priors often make life easy when doing inference Piyush Rai (IIT Kanpur) Bayesian Linear Regression and Sparse Bayesian Models 5

Warm-up: Least Squares Regression Training data: { x n , y n } N n =1 . Response is a noisy function of the input y n = f ( x n , w ) + ǫ n Piyush Rai (IIT Kanpur) Bayesian Linear Regression and Sparse Bayesian Models 6

Warm-up: Least Squares Regression Training data: { x n , y n } N n =1 . Response is a noisy function of the input y n = f ( x n , w ) + ǫ n Assume a data representation φ ( x n ) = [ φ 1 ( x n ) , . . . , φ M ( x n )] ∈ R M Piyush Rai (IIT Kanpur) Bayesian Linear Regression and Sparse Bayesian Models 6

Warm-up: Least Squares Regression Training data: { x n , y n } N n =1 . Response is a noisy function of the input y n = f ( x n , w ) + ǫ n Assume a data representation φ ( x n ) = [ φ 1 ( x n ) , . . . , φ M ( x n )] ∈ R M Denote y = [ y 1 , . . . , y N ] ⊤ ∈ R N , Φ = [ φ ( x 1 ) , . . . , φ ( x N )] ⊤ ∈ R N × M Piyush Rai (IIT Kanpur) Bayesian Linear Regression and Sparse Bayesian Models 6

Warm-up: Least Squares Regression Training data: { x n , y n } N n =1 . Response is a noisy function of the input y n = f ( x n , w ) + ǫ n Assume a data representation φ ( x n ) = [ φ 1 ( x n ) , . . . , φ M ( x n )] ∈ R M Denote y = [ y 1 , . . . , y N ] ⊤ ∈ R N , Φ = [ φ ( x 1 ) , . . . , φ ( x N )] ⊤ ∈ R N × M Assume linear (in the parameters) function: f ( x n , w ) = w ⊤ φ ( x n ) Piyush Rai (IIT Kanpur) Bayesian Linear Regression and Sparse Bayesian Models 6

Warm-up: Least Squares Regression Training data: { x n , y n } N n =1 . Response is a noisy function of the input y n = f ( x n , w ) + ǫ n Assume a data representation φ ( x n ) = [ φ 1 ( x n ) , . . . , φ M ( x n )] ∈ R M Denote y = [ y 1 , . . . , y N ] ⊤ ∈ R N , Φ = [ φ ( x 1 ) , . . . , φ ( x N )] ⊤ ∈ R N × M Assume linear (in the parameters) function: f ( x n , w ) = w ⊤ φ ( x n ) Sum of squared error function N E ( w ) = 1 � | f ( x n , w ) − y n | 2 2 n =1 Piyush Rai (IIT Kanpur) Bayesian Linear Regression and Sparse Bayesian Models 6

Warm-up: Least Squares Regression Training data: { x n , y n } N n =1 . Response is a noisy function of the input y n = f ( x n , w ) + ǫ n Assume a data representation φ ( x n ) = [ φ 1 ( x n ) , . . . , φ M ( x n )] ∈ R M Denote y = [ y 1 , . . . , y N ] ⊤ ∈ R N , Φ = [ φ ( x 1 ) , . . . , φ ( x N )] ⊤ ∈ R N × M Assume linear (in the parameters) function: f ( x n , w ) = w ⊤ φ ( x n ) Sum of squared error function N E ( w ) = 1 � | f ( x n , w ) − y n | 2 2 n =1 w = arg min w E ( w ) = ( Φ ⊤ Φ ) − 1 Φ ⊤ y Classical solution: ˆ Piyush Rai (IIT Kanpur) Bayesian Linear Regression and Sparse Bayesian Models 6

Warm-up: Least Squares Regression Training data: { x n , y n } N n =1 . Response is a noisy function of the input y n = f ( x n , w ) + ǫ n Assume a data representation φ ( x n ) = [ φ 1 ( x n ) , . . . , φ M ( x n )] ∈ R M Denote y = [ y 1 , . . . , y N ] ⊤ ∈ R N , Φ = [ φ ( x 1 ) , . . . , φ ( x N )] ⊤ ∈ R N × M Assume linear (in the parameters) function: f ( x n , w ) = w ⊤ φ ( x n ) Sum of squared error function N E ( w ) = 1 � | f ( x n , w ) − y n | 2 2 n =1 w = arg min w E ( w ) = ( Φ ⊤ Φ ) − 1 Φ ⊤ y Classical solution: ˆ Classification: replace the least squares by some other loss (e.g., logistic) Piyush Rai (IIT Kanpur) Bayesian Linear Regression and Sparse Bayesian Models 6

Regularization Want functions that are “simple” (and hence “generalize” to future data) How: penalize “complex” functions. Use a regularized loss function ˜ E ( w ) = E ( w ) + λ Ω( w ) Ω( w ): a measure of how complex w is (want it small) Piyush Rai (IIT Kanpur) Bayesian Linear Regression and Sparse Bayesian Models 7

Regularization Want functions that are “simple” (and hence “generalize” to future data) How: penalize “complex” functions. Use a regularized loss function ˜ E ( w ) = E ( w ) + λ Ω( w ) Ω( w ): a measure of how complex w is (want it small) Regularization parameter λ trades off data fit vs model simplicity Piyush Rai (IIT Kanpur) Bayesian Linear Regression and Sparse Bayesian Models 7

Regularization Want functions that are “simple” (and hence “generalize” to future data) How: penalize “complex” functions. Use a regularized loss function ˜ E ( w ) = E ( w ) + λ Ω( w ) Ω( w ): a measure of how complex w is (want it small) Regularization parameter λ trades off data fit vs model simplicity w = arg min w ˜ For Ω( w ) = || w || 2 , the solution ˆ E ( w ) = ( Φ ⊤ Φ + λ I ) − 1 Φ ⊤ y Piyush Rai (IIT Kanpur) Bayesian Linear Regression and Sparse Bayesian Models 7

A Probabilistic Framework for Regression Recall: y n = f ( x n , w ) + ǫ n Assume a zero-mean Gaussian error p ( ǫ | σ 2 ) = N ( ǫ | 0 , σ 2 ) Piyush Rai (IIT Kanpur) Bayesian Linear Regression and Sparse Bayesian Models 8

A Probabilistic Framework for Regression Recall: y n = f ( x n , w ) + ǫ n Assume a zero-mean Gaussian error p ( ǫ | σ 2 ) = N ( ǫ | 0 , σ 2 ) Leads to a Gaussian likelihood model p ( y n | x n , w ) = N ( y n | f ( x n , w ) , σ 2 ) � 1 / 2 � 1 � − 1 � 2 σ 2 ( f ( x n , w ) − y n ) 2 p ( y n | x n , w ) = exp 2 πσ 2 Piyush Rai (IIT Kanpur) Bayesian Linear Regression and Sparse Bayesian Models 8

A Probabilistic Framework for Regression Recall: y n = f ( x n , w ) + ǫ n Assume a zero-mean Gaussian error p ( ǫ | σ 2 ) = N ( ǫ | 0 , σ 2 ) Leads to a Gaussian likelihood model p ( y n | x n , w ) = N ( y n | f ( x n , w ) , σ 2 ) � 1 / 2 � 1 � − 1 � 2 σ 2 ( f ( x n , w ) − y n ) 2 p ( y n | x n , w ) = exp 2 πσ 2 Joint probability of the data (likelihood) � N / 2 N � � N � 1 − 1 � � ( f ( x n , w ) − y n ) 2 L ( w ) = p ( y n | x n , w ) = exp 2 πσ 2 2 σ 2 n =1 n =1 Piyush Rai (IIT Kanpur) Bayesian Linear Regression and Sparse Bayesian Models 8

A Probabilistic Framework for Regression Let’s look at the negative log-likelihood N − log L ( w ) = N 2 log σ 2 + N 1 � ( f ( x n , w ) − y n ) 2 2 log 2 π + 2 σ 2 n =1 Piyush Rai (IIT Kanpur) Bayesian Linear Regression and Sparse Bayesian Models 9

A Probabilistic Framework for Regression Let’s look at the negative log-likelihood N − log L ( w ) = N 2 log σ 2 + N 1 � ( f ( x n , w ) − y n ) 2 2 log 2 π + 2 σ 2 n =1 Minimizing w.r.t. w leads to the same answer as the unregularized case w = ( Φ ⊤ Φ ) − 1 Φ ⊤ y ˆ Piyush Rai (IIT Kanpur) Bayesian Linear Regression and Sparse Bayesian Models 9

A Probabilistic Framework for Regression Let’s look at the negative log-likelihood N − log L ( w ) = N 2 log σ 2 + N 1 � ( f ( x n , w ) − y n ) 2 2 log 2 π + 2 σ 2 n =1 Minimizing w.r.t. w leads to the same answer as the unregularized case w = ( Φ ⊤ Φ ) − 1 Φ ⊤ y ˆ Also get an estimate of error variance N σ 2 = 1 1 � w ) − y n ) 2 ( f ( x n , ˆ ˆ N n =1 Piyush Rai (IIT Kanpur) Bayesian Linear Regression and Sparse Bayesian Models 9

Specifying a Prior and Computing the Posterior Let’s assume a Gaussian prior on the weight vector w = [ w 1 , . . . , w M ] � α M M � 1 / 2 − α � � � � 2 w 2 p ( w | α ) = p ( w m | α ) = exp m 2 π m =1 m =1 Piyush Rai (IIT Kanpur) Bayesian Linear Regression and Sparse Bayesian Models 10

Case Study: Bayesian Linear Regression and Sparse Bayesian Models - PowerPoint PPT Presentation

Case Study: Bayesian Linear Regression and Sparse Bayesian Models Piyush Rai Dept. of CSE, IIT Kanpur (Mini-course: lecture 2) Nov 05, 2015 Piyush Rai (IIT Kanpur) Bayesian Linear Regression and Sparse Bayesian Models 1 Recap Piyush Rai

Regression 1: Linear Regression Marco Baroni Practical Statistics in R Outline Classic linear

Bayesian Linear Regression Seung-Hoon Na Chonbuk National University Bayesian Linear Regression

Bayesian linear regression Dr. Jarad Niemi STAT 544 - Iowa State University April 23, 2019

Linear Models for Regression Greg Mori - CMPT 419/726 Bishop PRML Ch. 3 Regression Linear Basis

Linear regression Linear regression is a simple approach to supervised learning. It assumes

Linear regression Linear regression is a simple approach to supervised learning. It assumes

Regression Methods 1. Linear Regression and Logistic Regression: definitions, and a common

Bayesian linear regression (cont.) Dr. Jarad Niemi STAT 544 - Iowa State University April 20,

Linear regression How to measure the accuracy of linear regression models Linear Regression

STAT 213 Simple Linear Regression I Colin Reimer Dawson Oberlin College 5 October 2016 Outline

Sparse Matrices Example Of Sparse Matrices diagonal tridiagonal sparse many elements are

Bayesian Linear Regression Sudipto Banerjee 1 Biostatistics, School of Public Health, University

Linear regression Linear regression is a simple approach to supervised learning. It assumes

Logistic regression CS 446 1. Linear classifiers Linear regression Last two lectures, we studied

LINEAR REGRESSION LINEAR REGRESSION - FROM A MACHINE LEARNING POINT OF VIEW 25 SIMPLE LINEAR

Bayesian Learning 1 Outline MLE, MAP vs. Bayesian Learning Bayesian Linear Regression

Week 5 Video 2 Relationship Mining Causal Mining Causal Data Mining These slides developed in

CSC2541: Differentiable Inference and Generative Models Lecture 2: Variational autoencoders

Bigger GPUs and Bigger Nodes Carl Pearson (pearson@illinois.edu) PhD Candidate, advised by

Reliability Assurance I nitiative (RAI ) Progress Report Jerry Hedrick, Director of Compliance

Content-Centric Networking at Internet Scale through The Integration of Name Resolution and

DeepIntent : Deep Icon-Behavior Learning for Detecting Intention-Behavior Discrepancy in Mobile

Bailouts, Time Inconsistency, and Optimal Regulation V. V. Chari University of Minnesota and

Multiobjective Parameter Optimization of Elliptic PDEs using the Reduced Basis Method S.