Case Study: Bayesian Linear Regression and Sparse Bayesian Models

  1. Case Study: Bayesian Linear Regression and Sparse Bayesian Models. Piyush Rai, Dept. of CSE, IIT Kanpur (Mini-course: lecture 2). Nov 05, 2015.

  2. Recap

  3. Maximum Likelihood Estimation (MLE). We wish to estimate parameters θ from observed data X = {x_1, ..., x_N}. MLE does this by finding the θ that maximizes the (log-)likelihood p(X | θ):
     $$\hat{\theta} = \arg\max_\theta \log p(X \mid \theta) = \arg\max_\theta \log \prod_{n=1}^{N} p(x_n \mid \theta) = \arg\max_\theta \sum_{n=1}^{N} \log p(x_n \mid \theta)$$
     MLE now reduces to solving an optimization problem w.r.t. θ.
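As a concrete illustration (not from the slides), here is a minimal NumPy sketch of Gaussian MLE, a case where the maximizer of the log-likelihood is available in closed form: the sample mean and the biased (1/N) sample variance.

    import numpy as np

    # MLE for a univariate Gaussian N(mu, sigma^2) from data x_1, ..., x_N.
    # Maximizing sum_n log p(x_n | mu, sigma^2) gives the sample mean and the
    # biased (1/N) sample variance in closed form.
    x = np.random.default_rng(0).normal(loc=2.0, scale=1.5, size=1000)
    mu_mle = x.mean()
    sigma2_mle = ((x - mu_mle) ** 2).mean()
    print(mu_mle, sigma2_mle)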

  4. Maximum-a-Posteriori (MAP) Estimation. Incorporating prior knowledge p(θ) about the parameters, MAP estimation finds the θ that maximizes the posterior p(θ | X) ∝ p(X | θ) p(θ):
     $$\hat{\theta} = \arg\max_\theta \log \left[ \prod_{n=1}^{N} p(x_n \mid \theta) \right] p(\theta) = \arg\max_\theta \sum_{n=1}^{N} \log p(x_n \mid \theta) + \log p(\theta)$$
     MAP again reduces to solving an optimization problem w.r.t. θ. The objective function is very similar to MLE, except for the log p(θ) term; in some sense, MAP is just a "regularized" MLE.
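Continuing the same toy Gaussian example (an illustrative sketch, not from the slides): placing a zero-mean Gaussian prior on the mean and adding log p(θ) to the objective yields a MAP estimate that is a shrunken version of the MLE.

    import numpy as np

    # MAP estimate of a Gaussian mean mu with known noise variance sigma2 and a
    # zero-mean Gaussian prior N(0, tau2). The log-prior term shrinks the MLE toward 0.
    rng = np.random.default_rng(0)
    x = rng.normal(loc=2.0, scale=1.5, size=20)
    sigma2, tau2 = 1.5 ** 2, 1.0          # assumed known noise and prior variances
    N = x.size
    mu_mle = x.mean()
    mu_map = (N * mu_mle / sigma2) / (N / sigma2 + 1.0 / tau2)
    print(mu_mle, mu_map)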

  5. Bayesian Learning. Both MLE and MAP only give a point estimate (a single best answer) of θ. How can we capture/quantify the uncertainty in θ? We need to infer the full posterior distribution
     $$p(\theta \mid X) = \frac{p(X \mid \theta)\, p(\theta)}{p(X)} = \frac{p(X \mid \theta)\, p(\theta)}{\int_\theta p(X \mid \theta)\, p(\theta)\, d\theta} \propto \text{Likelihood} \times \text{Prior}$$
     This requires doing "fully Bayesian" inference. Inference is sometimes an easy problem and sometimes a (very) hard one; conjugate priors often make life easy when doing inference.
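In the same toy setting the Gaussian prior is conjugate to the Gaussian likelihood, so the full posterior over the mean is available in closed form; a minimal sketch (same assumed variances as above):

    import numpy as np

    # Fully Bayesian inference for a Gaussian mean with a conjugate Gaussian prior:
    # the posterior p(mu | X) is again Gaussian, so we get uncertainty, not just a point.
    rng = np.random.default_rng(0)
    x = rng.normal(loc=2.0, scale=1.5, size=20)
    sigma2, tau2 = 1.5 ** 2, 1.0
    N = x.size
    post_var = 1.0 / (1.0 / tau2 + N / sigma2)    # shrinks toward 0 as N grows
    post_mean = post_var * x.sum() / sigma2       # equals the MAP estimate here
    print(post_mean, post_var)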

  6. Warm-up: Least Squares Regression. Training data: {x_n, y_n}, n = 1, ..., N. The response is a noisy function of the input: y_n = f(x_n, w) + ε_n. Assume a data representation φ(x_n) = [φ_1(x_n), ..., φ_M(x_n)] ∈ R^M. Denote y = [y_1, ..., y_N]^⊤ ∈ R^N and Φ = [φ(x_1), ..., φ(x_N)]^⊤ ∈ R^{N×M}. Assume a linear (in the parameters) function: f(x_n, w) = w^⊤ φ(x_n). Sum-of-squared-errors function:
     $$E(w) = \frac{1}{2} \sum_{n=1}^{N} \big( f(x_n, w) - y_n \big)^2$$
     Classical solution:
     $$\hat{w} = \arg\min_w E(w) = (\Phi^\top \Phi)^{-1} \Phi^\top y$$
     For classification, replace the least squares loss by some other loss (e.g., logistic).
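A minimal NumPy sketch of the closed-form least squares solution; the polynomial feature map is an arbitrary illustrative choice of φ:

    import numpy as np

    # Least squares regression: w_hat = (Phi^T Phi)^{-1} Phi^T y,
    # computed with a linear solve rather than an explicit inverse.
    rng = np.random.default_rng(0)
    x = rng.uniform(-1.0, 1.0, size=50)
    y = np.sin(np.pi * x) + 0.1 * rng.normal(size=50)    # noisy responses
    Phi = np.vander(x, N=4, increasing=True)             # phi(x) = [1, x, x^2, x^3]
    w_hat = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
    print(w_hat, np.mean((Phi @ w_hat - y) ** 2))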

  7. Regularization. We want functions that are "simple" (and hence "generalize" to future data). How: penalize "complex" functions, i.e., use a regularized loss function
     $$\tilde{E}(w) = E(w) + \lambda\, \Omega(w)$$
     Ω(w) is a measure of how complex w is (we want it small). The regularization parameter λ trades off data fit vs. model simplicity. For Ω(w) = ||w||², the solution is
     $$\hat{w} = \arg\min_w \tilde{E}(w) = (\Phi^\top \Phi + \lambda I)^{-1} \Phi^\top y$$
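The ||w||²-regularized (ridge) solution differs from plain least squares only by the λI term inside the inverse; a sketch with an arbitrarily chosen λ:

    import numpy as np

    # Ridge regression: w_hat = (Phi^T Phi + lambda * I)^{-1} Phi^T y.
    # Larger lam -> smaller weights (simpler function), at the cost of data fit.
    rng = np.random.default_rng(0)
    x = rng.uniform(-1.0, 1.0, size=50)
    y = np.sin(np.pi * x) + 0.1 * rng.normal(size=50)
    Phi = np.vander(x, N=10, increasing=True)            # deliberately flexible features
    lam = 0.1
    w_ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ y)
    print(np.linalg.norm(w_ridge))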

  8. A Probabilistic Framework for Regression. Recall: y_n = f(x_n, w) + ε_n. Assume a zero-mean Gaussian error, p(ε | σ²) = N(ε | 0, σ²). This leads to a Gaussian likelihood model p(y_n | x_n, w) = N(y_n | f(x_n, w), σ²):
     $$p(y_n \mid x_n, w) = \left( \frac{1}{2\pi\sigma^2} \right)^{1/2} \exp\!\left( -\frac{1}{2\sigma^2} \big( f(x_n, w) - y_n \big)^2 \right)$$
     Joint probability of the data (likelihood):
     $$L(w) = \prod_{n=1}^{N} p(y_n \mid x_n, w) = \left( \frac{1}{2\pi\sigma^2} \right)^{N/2} \exp\!\left( -\frac{1}{2\sigma^2} \sum_{n=1}^{N} \big( f(x_n, w) - y_n \big)^2 \right)$$
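For reference, a small helper (an illustrative sketch, not from the slides) that evaluates the log of this likelihood for a given w, using the sum form rather than the product for numerical stability:

    import numpy as np

    # log L(w) = -(N/2) log(2 pi sigma^2) - (1/(2 sigma^2)) sum_n (w^T phi(x_n) - y_n)^2
    def gaussian_log_likelihood(Phi, y, w, sigma2):
        resid = y - Phi @ w
        return -0.5 * y.size * np.log(2.0 * np.pi * sigma2) - 0.5 * np.sum(resid ** 2) / sigma2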

  9. A Probabilistic Framework for Regression (contd.). Let's look at the negative log-likelihood:
     $$-\log L(w) = \frac{N}{2} \log \sigma^2 + \frac{N}{2} \log 2\pi + \frac{1}{2\sigma^2} \sum_{n=1}^{N} \big( f(x_n, w) - y_n \big)^2$$
     Minimizing w.r.t. w leads to the same answer as the unregularized case:
     $$\hat{w} = (\Phi^\top \Phi)^{-1} \Phi^\top y$$
     We also get an estimate of the error variance:
     $$\hat{\sigma}^2 = \frac{1}{N} \sum_{n=1}^{N} \big( f(x_n, \hat{w}) - y_n \big)^2$$
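In code, maximum likelihood under this model is just the least squares solve plus the plug-in (1/N) residual variance; a minimal sketch:

    import numpy as np

    # Maximum likelihood for the Gaussian regression model:
    # w_hat minimizes the squared error, sigma2_hat is the mean squared residual.
    def fit_mle(Phi, y):
        w_hat = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
        sigma2_hat = np.mean((Phi @ w_hat - y) ** 2)
        return w_hat, sigma2_hat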

  10. Specifying a Prior and Computing the Posterior. Let's assume a Gaussian prior on the weight vector w = [w_1, ..., w_M]:
      $$p(w \mid \alpha) = \prod_{m=1}^{M} p(w_m \mid \alpha) = \prod_{m=1}^{M} \left( \frac{\alpha}{2\pi} \right)^{1/2} \exp\!\left( -\frac{\alpha}{2} w_m^2 \right)$$
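This prior is conjugate to the Gaussian likelihood, so the posterior over w is also Gaussian. Anticipating where the slides are headed, here is a sketch of the standard closed form, with α and σ² treated as known:

    import numpy as np

    # Posterior over w under p(w | alpha) = N(0, alpha^{-1} I) and Gaussian noise sigma2:
    #   Sigma_post = (alpha * I + Phi^T Phi / sigma2)^{-1}
    #   mu_post    = Sigma_post Phi^T y / sigma2
    def posterior(Phi, y, alpha, sigma2):
        M = Phi.shape[1]
        Sigma_post = np.linalg.inv(alpha * np.eye(M) + Phi.T @ Phi / sigma2)
        mu_post = Sigma_post @ Phi.T @ y / sigma2
        return mu_post, Sigma_post

The posterior mean coincides with the ridge solution above when λ = α σ².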
