SLIDE 1

Bayesian Interpretations of Regularization

Charlie Frogner

9.520 Class 17

April 6, 2011

SLIDE 2

The Plan

Regularized least squares maps $\{(x_i, y_i)\}_{i=1}^n$ to a function that minimizes the regularized loss:

$$f_S = \arg\min_{f \in \mathcal{H}} \; \frac{1}{2}\sum_{i=1}^n (y_i - f(x_i))^2 + \frac{\lambda}{2}\|f\|_{\mathcal{H}}^2$$

Can we interpret RLS from a probabilistic point of view?
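To fix ideas, here is a minimal numerical sketch of RLS, assuming a linear hypothesis space $f(x) = \langle x, \theta \rangle$ so that $\|f\|_{\mathcal{H}}^2 = \|\theta\|^2$; the data and the value of `lam` below are made up for illustration.

```python
import numpy as np

# Minimal RLS sketch for a linear hypothesis space f(x) = <x, theta>,
# so that ||f||_H^2 = ||theta||^2. Setting the gradient of
#   1/2 ||Y - X theta||^2 + lambda/2 ||theta||^2
# to zero gives the closed form theta = (X^T X + lambda I)^{-1} X^T Y.
def rls_fit(X, Y, lam):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

# Hypothetical example data.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
theta_true = np.array([1.0, -2.0, 0.5])
Y = X @ theta_true + 0.1 * rng.normal(size=50)

print(rls_fit(X, Y, lam=0.1))   # close to theta_true when the noise is small
```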

SLIDE 3

Some notation

Training set: $S = \{(x_1, y_1), \ldots, (x_n, y_n)\}$.
Inputs: $X = \{x_1, \ldots, x_n\}$.
Labels: $Y = \{y_1, \ldots, y_n\}$.
Parameters: $\theta \in \mathbb{R}^p$.
$p(Y|X, \theta)$ is the joint distribution over labels $Y$ given inputs $X$ and the parameters.

SLIDE 4

Where do probabilities show up?

$$\frac{1}{2}\sum_{i=1}^n V(y_i, f(x_i)) + \frac{\lambda}{2}\|f\|_{\mathcal{H}}^2$$

becomes $p(Y|f, X) \cdot p(f)$.

Likelihood, a.k.a. noise model: $p(Y|f, X)$.

  • Gaussian: $y_i \sim \mathcal{N}\left(f^*(x_i), \sigma^2\right)$
  • Poisson: $y_i \sim \mathrm{Pois}\left(f^*(x_i)\right)$

Prior: $p(f)$.
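As a concrete illustration of the two noise models, here is a small sampling sketch; the target function `f_star` and the use of $\exp(f^*(x))$ as the Poisson rate (to keep it nonnegative) are assumptions made for this example, not part of the slides.

```python
import numpy as np

# Sampling labels under the two noise models, assuming a hypothetical
# target function f_star on scalar inputs.
def f_star(x):
    return np.sin(x)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 3.0, 5)

# Gaussian noise model: y_i ~ N(f*(x_i), sigma^2)
sigma = 0.2
y_gauss = rng.normal(loc=f_star(x), scale=sigma)

# Poisson noise model: y_i ~ Pois(rate_i); here the rate is exp(f*(x_i)),
# purely to keep it nonnegative in this illustration.
y_pois = rng.poisson(lam=np.exp(f_star(x)))

print(y_gauss)
print(y_pois)
```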

SLIDE 6

Estimation

The estimation problem: given data $\{(x_i, y_i)\}_{i=1}^N$ and a model $p(Y|f, X)$, $p(f)$, find a good $f$ to explain the data.

SLIDE 7

The Plan

  • Maximum likelihood estimation for ERM
  • MAP estimation for linear RLS
  • MAP estimation for kernel RLS
  • Transductive model
  • Infinite dimensions get more complicated

SLIDE 8

Maximum likelihood estimation

Given data $\{(x_i, y_i)\}_{i=1}^N$ and a model $p(Y|f, X)$, $p(f)$: a good $f$ is one that maximizes $p(Y|f, X)$.

SLIDE 9

Maximum likelihood and least squares

For least squares, the noise model is $y_i | f, x_i \sim \mathcal{N}(f(x_i), \sigma^2)$, a.k.a. $Y | f, X \sim \mathcal{N}(f(X), \sigma^2 I)$. So

$$p(Y|f, X) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^N (y_i - f(x_i))^2\right)$$
SLIDE 11

Maximum likelihood and least squares

Maximum likelihood: maximize

$$p(Y|f, X) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^N (y_i - f(x_i))^2\right)$$

Empirical risk minimization: minimize

$$\sum_{i=1}^N (y_i - f(x_i))^2$$

SLIDE 12

…

$$\sum_{i=1}^N (y_i - f(x_i))^2$$

SLIDE 13

…

$$e^{-\frac{1}{2\sigma^2}\sum_{i=1}^N (y_i - f(x_i))^2}$$

(Minimizing the sum of squared errors and maximizing this exponential pick out the same $f$, since $\exp(-t)$ is monotone decreasing in $t$.)
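A small numeric sketch of this equivalence (the data, the grid of candidate slopes, and the noise level `sigma` are hypothetical): the Gaussian negative log-likelihood and the sum of squared errors are minimized by the same parameter.

```python
import numpy as np

# Hypothetical 1-d example: f(x) = theta * x with Gaussian noise of known sigma.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x + 0.3 * rng.normal(size=100)
sigma = 0.3

thetas = np.linspace(0.0, 4.0, 401)
sse = np.array([np.sum((y - t * x) ** 2) for t in thetas])   # empirical risk
nll = sse / (2 * sigma ** 2) \
    + 0.5 * len(x) * np.log(2 * np.pi * sigma ** 2)          # -log p(Y|f, X)

# Both criteria are minimized at the same theta.
assert np.argmin(sse) == np.argmin(nll)
print(thetas[np.argmin(sse)])
```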

SLIDE 14

What about regularization?

RLS:

$$\arg\min_{f} \; \frac{1}{2}\sum_{i=1}^n (y_i - f(x_i))^2 + \frac{\lambda}{2}\|f\|_{\mathcal{H}}^2$$

Is there a model of $Y$ and $f$ that yields RLS? Yes:

$$e^{-\frac{1}{2\sigma_\varepsilon^2}\sum_{i=1}^n (y_i - f(x_i))^2} \cdot e^{-\frac{\lambda}{2}\|f\|_{\mathcal{H}}^2}$$

$$p(Y|f, X) \cdot p(f)$$
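A minimal numeric check of this correspondence, assuming a linear $f(x) = \langle x, \theta \rangle$ with $\|f\|_{\mathcal{H}}^2 = \|\theta\|^2$ and taking $\sigma_\varepsilon = 1$ so that the two expressions match exactly; the data are made up.

```python
import numpy as np

# Check that the RLS objective equals -log( p(Y|f,X) * p(f) ) for the
# unnormalized densities above, assuming a linear f(x) = <x, theta> with
# ||f||_H^2 = ||theta||^2 and sigma_eps = 1. Hypothetical data below.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
Y = rng.normal(size=20)
theta = rng.normal(size=3)
lam = 0.5

rls_objective = 0.5 * np.sum((Y - X @ theta) ** 2) + 0.5 * lam * (theta @ theta)
neg_log_lik_prior = -np.log(np.exp(-0.5 * np.sum((Y - X @ theta) ** 2))
                            * np.exp(-0.5 * lam * (theta @ theta)))

print(np.isclose(rls_objective, neg_log_lik_prior))   # True
```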

SLIDE 18

Posterior function estimates

Given data $\{(x_i, y_i)\}_{i=1}^N$ and a model $p(Y|f, X)$, $p(f)$: find a good $f$ to explain the data. (If we can get $p(f|Y, X)$:)

Bayes least squares estimate: $\hat{f}_{BLS} = \mathbb{E}_{f|X,Y}[f]$, i.e. the mean of the posterior.

MAP estimate: $\hat{f}_{MAP}(Y|X) = \arg\max_f \, p(f|X, Y)$, i.e. a mode of the posterior.

SLIDE 21

A posterior on functions?

How to find $p(f|Y, X)$? Bayes' rule:

$$p(f|X, Y) = \frac{p(Y|X, f) \cdot p(f)}{p(Y|X)} = \frac{p(Y|X, f) \cdot p(f)}{\int p(Y|X, f)\, dp(f)}$$

When is this well-defined?

SLIDE 23

A posterior on functions?

Functions vs. parameters: $\mathcal{H} \cong \mathbb{R}^p$. Represent functions in $\mathcal{H}$ by their coordinates w.r.t. a basis: $f \in \mathcal{H} \leftrightarrow \theta \in \mathbb{R}^p$. Assume (for the moment): $p < \infty$.

SLIDE 25

A posterior on functions?

Mercer's theorem:

$$K(x_i, x_j) = \sum_k \nu_k \psi_k(x_i)\psi_k(x_j), \quad \text{where} \quad \nu_k \psi_k(\cdot) = \int K(\cdot, y)\psi_k(y)\, dy \quad \text{for all } k.$$

The functions $\{\sqrt{\nu_k}\psi_k(\cdot)\}$ form an orthonormal basis for $\mathcal{H}_K$. Let $\phi(\cdot) = [\sqrt{\nu_1}\psi_1(\cdot), \ldots, \sqrt{\nu_p}\psi_p(\cdot)]$. Then:

$$\mathcal{H}_K = \{\phi(\cdot)\theta \mid \theta \in \mathbb{R}^p\}$$
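To make the feature-map identity $K(x, z) = \langle \phi(x), \phi(z) \rangle$ concrete, here is a finite-dimensional sketch using the degree-2 homogeneous polynomial kernel on $\mathbb{R}^2$; this is an illustrative stand-in, not the Mercer eigenfunction expansion itself.

```python
import numpy as np

# Finite-dimensional illustration of K(x, z) = <phi(x), phi(z)>:
# the degree-2 homogeneous polynomial kernel on R^2 with its explicit
# feature map phi(x) = [x1^2, sqrt(2) x1 x2, x2^2].
def K(x, z):
    return (x @ z) ** 2

def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2.0) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([-0.5, 3.0])
print(K(x, z), phi(x) @ phi(z))   # both print 30.25
```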

SLIDE 26

Prior on infinite-dimensional space

Problem: there's no such thing as $\theta \sim \mathcal{N}(0, I)$ when $\theta \in \mathbb{R}^\infty$!

SLIDE 27

Posterior for linear RLS

Linear function: $f(x) = \langle x, \theta \rangle$. Noise model: $Y|X, \theta \sim \mathcal{N}(X\theta, \sigma_\varepsilon^2 I)$. Add a prior: $\theta \sim \mathcal{N}(0, I)$.

SLIDE 28

Posterior for linear RLS

Model: $Y|X, \theta \sim \mathcal{N}(X\theta, \sigma_\varepsilon^2 I)$, $\theta \sim \mathcal{N}(0, I)$. Joint over $Y$ and $\theta$:

$$\begin{bmatrix} Y \\ \theta \end{bmatrix} \sim \mathcal{N}\left( 0, \begin{bmatrix} XX^T + \sigma_\varepsilon^2 I & X \\ X^T & I \end{bmatrix} \right)$$

Condition on $Y$.
SLIDE 29

Posterior for linear RLS

Posterior: $\theta|X, Y \sim \mathcal{N}(\mu_{\theta|X,Y}, \Sigma_{\theta|X,Y})$, where

$$\mu_{\theta|X,Y} = X^T(XX^T + \sigma_\varepsilon^2 I)^{-1}Y$$

$$\Sigma_{\theta|X,Y} = I - X^T(XX^T + \sigma_\varepsilon^2 I)^{-1}X$$

This is Gaussian, so

$$\hat{\theta}_{MAP}(Y|X) = \hat{\theta}_{BLS}(Y|X) = X^T(XX^T + \sigma_\varepsilon^2 I)^{-1}Y$$
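A minimal numeric sketch of these formulas; the data, dimensions, and noise level below are hypothetical.

```python
import numpy as np

# Posterior over theta for Y = X theta + eps, eps ~ N(0, sigma_eps^2 I),
# theta ~ N(0, I), using the formulas above. Hypothetical data below.
rng = np.random.default_rng(0)
n, p = 30, 4
X = rng.normal(size=(n, p))
theta_true = rng.normal(size=p)
sigma_eps = 0.5
Y = X @ theta_true + sigma_eps * rng.normal(size=n)

A = np.linalg.inv(X @ X.T + sigma_eps ** 2 * np.eye(n))
mu_post = X.T @ A @ Y                   # posterior mean = MAP = BLS estimate
Sigma_post = np.eye(p) - X.T @ A @ X    # posterior covariance

print(mu_post)
print(np.diag(Sigma_post))              # per-coordinate posterior variances
```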

SLIDE 31

Linear RLS as a MAP estimator

Model: $Y|X, \theta \sim \mathcal{N}(X\theta, \sigma_\varepsilon^2 I)$, $\theta \sim \mathcal{N}(0, I)$.

$$\hat{\theta}_{MAP}(Y|X) = X^T(XX^T + \sigma_\varepsilon^2 I)^{-1}Y$$

Recall the linear RLS solution:

$$\hat{\theta}_{RLS}(Y|X) = \arg\min_\theta \frac{1}{2}\sum_{i=1}^N (y_i - \langle x_i, \theta \rangle)^2 + \frac{\lambda}{2}\|\theta\|^2 = X^T(XX^T + \lambda I)^{-1}Y$$

So what's $\lambda$?
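Comparing the two closed forms suggests $\lambda = \sigma_\varepsilon^2$. A quick numeric check on hypothetical data, using the equivalent primal form $(X^TX + \lambda I)^{-1}X^TY$ of the RLS solution so the comparison is not circular:

```python
import numpy as np

# MAP estimate vs. ridge/RLS solution with lambda = sigma_eps^2.
# Hypothetical data; the RLS solution is computed in its primal form
# (X^T X + lambda I)^{-1} X^T Y, which equals X^T (X X^T + lambda I)^{-1} Y.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))
Y = X @ np.array([0.5, -1.0, 2.0]) + 0.3 * rng.normal(size=40)
sigma_eps = 0.3

theta_map = X.T @ np.linalg.solve(X @ X.T + sigma_eps ** 2 * np.eye(40), Y)

lam = sigma_eps ** 2
theta_rls = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ Y)

print(np.allclose(theta_map, theta_rls))   # True
```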

SLIDE 33

Posterior for kernel RLS

Model for linear RLS: $Y|X, \theta \sim \mathcal{N}(X\theta, \sigma_\varepsilon^2 I)$, $\theta \sim \mathcal{N}(0, I)$.

Model for kernel RLS? $Y|X, \theta \sim \mathcal{N}(\phi(X)\theta, \sigma_\varepsilon^2 I)$, $\theta \sim \mathcal{N}(0, I)$.

Then:

$$\hat{\theta}_{MAP}(Y|X) = \phi(X)^T(\phi(X)\phi(X)^T + \sigma_\varepsilon^2 I)^{-1}Y$$

SLIDE 35

Posterior for kernel RLS

Model for linear RLS: $Y|X, \theta \sim \mathcal{N}(X\theta, \sigma_\varepsilon^2 I)$, $\theta \sim \mathcal{N}(0, I)$.

Model for kernel RLS? $Y|X, \theta \sim \mathcal{N}(\phi(X)\theta, \sigma_\varepsilon^2 I)$, $\theta \sim \mathcal{N}(0, I)$.

Then:

$$\hat{\theta}_{MAP}(Y|X) = \phi(X)^T(K + \sigma_\varepsilon^2 I)^{-1}Y$$
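A minimal kernel-RLS sketch: rather than forming $\phi$ explicitly, compute $c = (K + \sigma_\varepsilon^2 I)^{-1}Y$ and predict with $f(x^*) = \sum_i c_i K(x_i, x^*)$. The Gaussian (RBF) kernel and the synthetic data below are assumptions made for the example.

```python
import numpy as np

# Kernel RLS / MAP prediction without forming phi explicitly:
# c = (K + sigma_eps^2 I)^{-1} Y, and f(x*) = sum_i c_i K(x_i, x*).
def rbf_kernel(A, B, gamma=1.0):
    d2 = np.sum(A ** 2, 1)[:, None] + np.sum(B ** 2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))         # synthetic inputs
Y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)
sigma_eps = 0.1

K = rbf_kernel(X, X)
c = np.linalg.solve(K + sigma_eps ** 2 * np.eye(len(X)), Y)

X_test = np.linspace(-3, 3, 7)[:, None]
f_test = rbf_kernel(X_test, X) @ c           # predictions at the test points
print(f_test)
```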

SLIDE 36

A quick recap

Empirical risk minimization is ML:

$$p(Y|f, X) \propto e^{-\frac{1}{2}\sum_{i=1}^N (y_i - f(x_i))^2}$$

Linear RLS is MAP:

$$p(Y, f|X) \propto e^{-\frac{1}{2}\sum_{i=1}^N (y_i - \langle x_i, \theta \rangle)^2} \cdot e^{-\frac{\lambda}{2}\theta^T\theta}$$

Kernel RLS is also MAP:

$$p(Y, f|X) \propto e^{-\frac{1}{2}\sum_{i=1}^N (y_i - f(x_i))^2} \cdot e^{-\frac{\lambda}{2}\|f\|_{\mathcal{H}}^2}$$

SLIDE 39

Transductive setting

Idea: forget about estimating $\theta$ (i.e. $f$). Instead, estimate the predicted outputs $Y^* = [y^*_1, \ldots, y^*_M]^T$ at test inputs $X^* = [x^*_1, \ldots, x^*_M]^T$.

Need the joint distribution over $Y^*$ and $Y$.

SLIDE 40

Transductive setting

Say $Y^*$ and $Y$ are jointly Gaussian:

$$\begin{bmatrix} Y \\ Y^* \end{bmatrix} \sim \mathcal{N}\left( 0, \begin{bmatrix} \Lambda_Y & \Lambda_{YY^*} \\ \Lambda_{Y^*Y} & \Lambda_{Y^*} \end{bmatrix} \right)$$

Want: kernel RLS.

General form for the posterior: $Y^*|X, Y \sim \mathcal{N}(\mu_{Y^*|X,Y}, \Sigma_{Y^*|X,Y})$, where

$$\mu_{Y^*|X,Y} = \Lambda_{YY^*}^T \Lambda_Y^{-1} Y$$

$$\Sigma_{Y^*|X,Y} = \Lambda_{Y^*} - \Lambda_{YY^*}^T \Lambda_Y^{-1} \Lambda_{YY^*}$$
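A minimal sketch of this Gaussian conditioning on a small, explicitly written joint covariance; all numbers below are made up for illustration.

```python
import numpy as np

# Condition a zero-mean joint Gaussian [Y; Y*] on observed Y using the
# formulas above. The joint covariance blocks below are made up.
L_Y   = np.array([[2.0, 0.5],
                  [0.5, 1.5]])     # Cov(Y)
L_YYs = np.array([[0.8],
                  [0.3]])          # Cov(Y, Y*)
L_Ys  = np.array([[1.0]])          # Cov(Y*)

Y_obs = np.array([1.2, -0.4])

L_Y_inv = np.linalg.inv(L_Y)
mu_post = L_YYs.T @ L_Y_inv @ Y_obs                # E[Y* | Y]
Sigma_post = L_Ys - L_YYs.T @ L_Y_inv @ L_YYs      # Cov[Y* | Y]

print(mu_post, Sigma_post)
```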

SLIDE 42

Transductive setting

Set $\Lambda_Y = K(X, X) + \sigma^2 I$, $\Lambda_{YY^*} = K(X, X^*)$, $\Lambda_{Y^*} = K(X^*, X^*)$. Posterior: $Y^*|X, Y \sim \mathcal{N}(\mu_{Y^*|X,Y}, \Sigma_{Y^*|X,Y})$, where

$$\mu_{Y^*|X,Y} = K(X^*, X)(K(X, X) + \sigma^2 I)^{-1}Y$$

$$\Sigma_{Y^*|X,Y} = K(X^*, X^*) - K(X^*, X)(K(X, X) + \sigma^2 I)^{-1}K(X, X^*)$$

So: $\hat{Y}^*_{MAP} = \hat{f}_{RLS}(X^*)$.
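A minimal sketch of these formulas with an RBF kernel on synthetic data (both are assumptions made for the example): the posterior mean is the kernel RLS prediction at the test points, and the diagonal of the posterior covariance gives pointwise uncertainty, e.g. $\mu \pm 2\sigma$ intervals.

```python
import numpy as np

# Posterior over test outputs Y* given training data, using the formulas above.
# RBF kernel and synthetic data are illustrative assumptions.
def rbf_kernel(A, B, gamma=1.0):
    d2 = np.sum(A ** 2, 1)[:, None] + np.sum(B ** 2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(40, 1))
Y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)
X_star = np.linspace(-3, 3, 9)[:, None]
sigma = 0.1

Lam_Y = rbf_kernel(X, X) + sigma ** 2 * np.eye(len(X))   # K(X, X) + sigma^2 I
K_sX = rbf_kernel(X_star, X)                             # K(X*, X)
K_ss = rbf_kernel(X_star, X_star)                        # K(X*, X*)

mu_star = K_sX @ np.linalg.solve(Lam_Y, Y)               # = kernel RLS prediction
Sigma_star = K_ss - K_sX @ np.linalg.solve(Lam_Y, K_sX.T)
std_star = np.sqrt(np.maximum(np.diag(Sigma_star), 0.0)) # clip round-off negatives

print(mu_star)
print(std_star)                                          # for mu +/- 2*std intervals
```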

SLIDE 43

Transductive setting

Model:

$$\begin{bmatrix} Y \\ Y^* \end{bmatrix} \sim \mathcal{N}\left( 0, \begin{bmatrix} K(X, X) + \sigma_\varepsilon^2 I & K(X, X^*) \\ K(X^*, X) & K(X^*, X^*) \end{bmatrix} \right)$$

MAP estimate (posterior mean) = RLS function at every point $x^*$, regardless of $\dim \mathcal{H}_K$. Are the prior and posterior (on points!) consistent with a distribution on $\mathcal{H}_K$?

SLIDE 44

Transductive setting

Strictly speaking, $\theta$ and $f$ don't come into play here at all. Have: $p(Y^*|X, Y)$. Do not have: $p(\theta|X, Y)$ or $p(f|X, Y)$.

But, if $\mathcal{H}_K$ is finite dimensional, the joint over $Y$ and $Y^*$ is consistent with: $Y = f(X) + \varepsilon$, $Y^* = f(X^*)$, and $f \in \mathcal{H}_K$ a random trajectory from a Gaussian process over the domain, with mean $\mu$ and covariance $K$.

(Ergo, people call this "Gaussian process regression." Also "Kriging," because of a guy.)

SLIDE 46

Recap redux

Empirical risk minimization is the maximum likelihood estimator when: $y = x^T\theta + \varepsilon$.

Linear RLS is the MAP estimator when: $y = x^T\theta + \varepsilon$, $\theta \sim \mathcal{N}(0, I)$.

Kernel RLS is the MAP estimator when: $y = \phi(x)^T\theta + \varepsilon$, $\theta \sim \mathcal{N}(0, I)$, in finite dimensional $\mathcal{H}_K$.

Kernel RLS is the MAP estimator at points when:

$$\begin{bmatrix} Y \\ Y^* \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mu_Y \\ \mu_{Y^*} \end{bmatrix}, \begin{bmatrix} K(X, X) + \sigma_\varepsilon^2 I & K(X, X^*) \\ K(X^*, X) & K(X^*, X^*) \end{bmatrix} \right)$$

in possibly infinite dimensional $\mathcal{H}_K$.
SLIDE 47

Is this useful in practice?

Want confidence intervals + believe the posteriors are meaningful = yes. Maybe other reasons?

[Figure: "Gaussian Regression + Confidence Intervals", plotting the regression function and the observed points against X and Y.]
