
Introduction to Machine Learning - CS725
Instructor: Prof. Ganesh Ramakrishnan
Lecture 7: Linear Regression - Bayesian Inference and Regularization


Building on questions on Least Squares Linear Regression

1. Is there a probabilistic interpretation?
   Gaussian Error, Maximum Likelihood Estimate

2. How do we address overfitting?
   Bayesian and Maximum A Posteriori Estimates, Regularization

3. How do we minimize the resulting, more complex error functions?
   Level Curves and Surfaces, Gradient Vector, Directional Derivative, Gradient Descent Algorithm, Convexity, Necessary and Sufficient Conditions for Optimality



Prior Distribution over w for Linear Regression

y = w^T φ(x) + ε, where ε ∼ N(0, σ²).
We saw that maximizing the log-likelihood gives
  ŵ_MLE = (Φ^T Φ)^{-1} Φ^T y
We can use a prior distribution on w to avoid over-fitting:
  w_i ∼ N(0, 1/λ)
(that is, each component w_i is approximately bounded within ±3/√λ by the 3σ rule).
We want to find P(w | D) = N(µ_m, Σ_m). Invoking the Bayes estimation results from before:
  Σ_m^{-1} µ_m = Σ_0^{-1} µ_0 + Φ^T y / σ²
  Σ_m^{-1} = Σ_0^{-1} + Φ^T Φ / σ²
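As a concrete illustration (not from the slides), here is a minimal NumPy sketch of the two closed forms above, assuming a design matrix Phi of shape (m, n), targets y of shape (m,), noise variance sigma2, and a Gaussian prior N(mu0, Sigma0); all variable names are illustrative placeholders.

import numpy as np

def mle_weights(Phi, y):
    # w_MLE = (Phi^T Phi)^{-1} Phi^T y  (assumes Phi^T Phi is invertible)
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

def posterior_over_w(Phi, y, sigma2, mu0, Sigma0):
    # Sigma_m^{-1} = Sigma_0^{-1} + (Phi^T Phi) / sigma^2
    Sigma0_inv = np.linalg.inv(Sigma0)
    Sigma_m = np.linalg.inv(Sigma0_inv + (Phi.T @ Phi) / sigma2)
    # Sigma_m^{-1} mu_m = Sigma_0^{-1} mu_0 + Phi^T y / sigma^2
    mu_m = Sigma_m @ (Sigma0_inv @ mu0 + Phi.T @ y / sigma2)
    return mu_m, Sigma_m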


Finding µ_m and Σ_m for w

Setting Σ_0 = (1/λ) I and µ_0 = 0:
  Σ_m^{-1} µ_m = Φ^T y / σ²
  Σ_m^{-1} = λI + Φ^T Φ / σ²
  µ_m = (λI + Φ^T Φ / σ²)^{-1} Φ^T y / σ²
or, equivalently,
  µ_m = (λσ² I + Φ^T Φ)^{-1} Φ^T y
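A minimal NumPy sketch of this specialized posterior, assuming the same placeholder names as before (Phi, y, sigma2, lam):

import numpy as np

def map_posterior(Phi, y, sigma2, lam):
    # Posterior N(mu_m, Sigma_m) over w for the prior w ~ N(0, (1/lam) I)
    n = Phi.shape[1]
    # mu_m = (lam * sigma^2 * I + Phi^T Phi)^{-1} Phi^T y
    mu_m = np.linalg.solve(lam * sigma2 * np.eye(n) + Phi.T @ Phi, Phi.T @ y)
    # Sigma_m^{-1} = lam * I + (Phi^T Phi) / sigma^2
    Sigma_m = np.linalg.inv(lam * np.eye(n) + (Phi.T @ Phi) / sigma2)
    return mu_m, Sigma_m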


MAP and Bayes Estimates

Pr(w | D) = N(w | µ_m, Σ_m)
The MAP estimate is the mode of the Gaussian posterior:
  ŵ_MAP = argmax_w N(w | µ_m, Σ_m) = µ_m
Similarly, the Bayes estimate is the expected value under the Gaussian posterior, i.e., the mean:
  ŵ_Bayes = E_{Pr(w|D)}[w] = E_{N(µ_m, Σ_m)}[w] = µ_m
In summary:
  µ_MAP = µ_Bayes = µ_m = (λσ² I + Φ^T Φ)^{-1} Φ^T y
  Σ_m^{-1} = λI + Φ^T Φ / σ²


From Bayesian Estimates to (Pure) Bayesian Prediction

MLE:              θ̂_MLE = argmax_θ LL(D | θ);  predict with p(x | θ̂_MLE)
Bayes Estimator:  θ̂_B = E_{p(θ|D)}[θ];  predict with p(x | θ̂_B)
MAP:              θ̂_MAP = argmax_θ p(θ | D);  predict with p(x | θ̂_MAP)
Pure Bayesian:    p(θ | D) = p(D | θ) p(θ) / ∫ p(D | θ) p(θ) dθ;  predict with p(x | D) = ∫_θ p(x | θ) p(θ | D) dθ

Here p(D | θ) = ∏_{i=1}^m p(x_i | θ), and θ is the parameter. The first three approaches yield a point estimate of θ and predict with it; the pure Bayesian approach keeps the full posterior over θ.


Predictive Distribution for Linear Regression

ŵ_MAP helps avoid overfitting, as it takes regularization into account. But we miss the modeling of uncertainty when we consider only ŵ_MAP.
E.g.: While predicting diagnostic results for a new patient x, along with the value y we would also like to know the uncertainty of the prediction Pr(y | x, D).
Recall that y = w^T φ(x) + ε with ε ∼ N(0, σ²), so
  Pr(y | x, D) = Pr(y | x, <x_1, y_1>, ..., <x_m, y_m>)


Pure Bayesian Regression Summarized

By definition, regression is about finding Pr(y | x, <x_1, y_1>, ..., <x_m, y_m>). By Bayes rule,
  Pr(y | x, D) = Pr(y | x, <x_1, y_1>, ..., <x_m, y_m>) = ∫_w Pr(y | w; x) Pr(w | D) dw ∼ N(µ_m^T φ(x), σ² + φ^T(x) Σ_m φ(x))
where
  y = w^T φ(x) + ε and ε ∼ N(0, σ²)
  w ∼ N(0, αI) (here α = 1/λ) and w | D ∼ N(µ_m, Σ_m)
  µ_m = (λσ² I + Φ^T Φ)^{-1} Φ^T y and Σ_m^{-1} = λI + Φ^T Φ / σ²
Finally, y | x, D ∼ N(µ_m^T φ(x), σ² + φ^T(x) Σ_m φ(x)).
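A minimal NumPy sketch of this predictive distribution (not from the slides); phi_x stands for the feature vector φ(x) of a new input, and mu_m, Sigma_m are the posterior mean and covariance computed earlier — all names are placeholders.

import numpy as np

def predictive_distribution(phi_x, mu_m, Sigma_m, sigma2):
    # Mean and variance of Pr(y | x, D) for a new feature vector phi_x = φ(x)
    mean = mu_m @ phi_x                      # µ_m^T φ(x)
    var = sigma2 + phi_x @ Sigma_m @ phi_x   # σ² + φ(x)^T Σ_m φ(x)
    return mean, var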


Penalized Regularized Least Squares Regression

The Bayes and MAP estimates for linear regression coincide with regularized (ridge) regression:
  w_Ridge = argmin_w ||Φw − y||²_2 + λσ² ||w||²_2

Intuition: To discourage redundancy and/or stop coefficients of w from becoming too large in magnitude, add a penalty to the error term used to estimate the parameters of the model.

The general penalized regularized least-squares problem:
  w_Reg = argmin_w ||Φw − y||²_2 + λ Ω(w)

  Ω(w) = ||w||²_2 ⇒ Ridge Regression
  Ω(w) = ||w||_1  ⇒ Lasso
  Ω(w) = ||w||_0  ⇒ Support-based penalty

Some Ω(w) correspond to priors that can be expressed in closed form; some give good working solutions. However, for mathematical convenience, some norms are easier to handle than others.
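As an illustration (not from the slides), a short NumPy check that the ridge solution with penalty λσ² matches the MAP/Bayes mean µ_m derived earlier; the data and the names Phi, y, lam, sigma2 are toy placeholders.

import numpy as np

rng = np.random.default_rng(0)
Phi = rng.normal(size=(50, 4))     # toy design matrix (placeholder data)
y = rng.normal(size=50)
lam, sigma2 = 0.5, 0.25

n = Phi.shape[1]
# Ridge: argmin_w ||Phi w - y||^2 + lam*sigma2*||w||^2 has the closed form below
w_ridge = np.linalg.solve(Phi.T @ Phi + lam * sigma2 * np.eye(n), Phi.T @ y)
# MAP/Bayes mean: mu_m = (lam*sigma2*I + Phi^T Phi)^{-1} Phi^T y
mu_m = np.linalg.solve(lam * sigma2 * np.eye(n) + Phi.T @ Phi, Phi.T @ y)
print(np.allclose(w_ridge, mu_m))  # True: the two estimates coincide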


Constrained Regularized Least Squares Regression

Intuition: To discourage redundancy and/or stop coefficients of w from becoming too large in magnitude, constrain the error-minimizing estimate using a penalty.

The general constrained regularized least-squares problem:
  w_Reg = argmin_w ||Φw − y||²_2   such that Ω(w) ≤ θ

Claim: For any penalized formulation with a particular λ, there exists a corresponding constrained formulation with a corresponding θ.

  Ω(w) = ||w||²_2 ⇒ Ridge Regression
  Ω(w) = ||w||_1  ⇒ Lasso
  Ω(w) = ||w||_0  ⇒ Support-based penalty

Proof of equivalence: requires tools of optimization/duality (a sketch of the Lagrangian connection follows below).
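A brief sketch of the connection (standard Lagrangian duality, not worked out in these slides): introduce a multiplier η ≥ 0 for the constraint Ω(w) ≤ θ and form

  L(w, η) = ||Φw − y||²_2 + η (Ω(w) − θ)

For convex Ω (ridge, Lasso), if w* solves the constrained problem with optimal multiplier η*, then w* also minimizes the penalized objective with λ = η*; conversely, any penalized solution with penalty λ solves the constrained problem with θ = Ω(w_Reg). The nonconvex L0 penalty needs separate treatment.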


Polynomial Regression

Consider a degree-3 polynomial regression model (shown as a figure in the original slides). Each bend in the curve corresponds to an increase in ∥w∥. The eigenvalues of (Φ^T Φ + λI) are indicative of curvature; increasing λ reduces the curvature.
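To see the effect of λ numerically, here is an illustrative sketch (not from the slides) that fits a degree-3 polynomial on toy data and prints ∥w∥ together with the eigenvalues of Φ^T Φ + λI as λ grows; all data and names are placeholders.

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 30)
y = np.sin(3 * x) + 0.1 * rng.normal(size=x.shape)   # toy targets (placeholder)

# Degree-3 polynomial features: φ(x) = [1, x, x^2, x^3]
Phi = np.vander(x, N=4, increasing=True)

for lam in [0.0, 0.1, 10.0]:
    A = Phi.T @ Phi + lam * np.eye(4)
    w = np.linalg.solve(A, Phi.T @ y)
    print(f"lambda={lam:5.1f}  ||w||={np.linalg.norm(w):.3f}  "
          f"eigenvalues of (Phi^T Phi + lam I): {np.round(np.linalg.eigvalsh(A), 2)}")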


Do Closed-Form Solutions Always Exist?

Linear regression and ridge regression both have closed-form solutions:
  For linear regression, w* = (Φ^T Φ)^{-1} Φ^T y
  For ridge regression, w* = (Φ^T Φ + λI)^{-1} Φ^T y  (linear regression is the case λ = 0)

What about optimizing the (constrained/penalized) formulations of the Lasso (L1 norm) and the support-based penalty (L0 norm)? These also require tools of optimization/duality; a simple iterative sketch for the Lasso follows below.
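Since the Lasso has no closed form, it is typically solved iteratively. A minimal sketch (not from the lecture) of proximal gradient / iterative soft-thresholding (ISTA) for argmin_w (1/2)||Φw − y||²_2 + λ||w||_1; Phi, y, lam and the iteration count are illustrative placeholders.

import numpy as np

def soft_threshold(v, t):
    # Proximal operator of t * ||.||_1: shrink each coordinate toward zero by t
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(Phi, y, lam, n_iters=500):
    # Iterative soft-thresholding for (1/2)||Phi w - y||^2 + lam * ||w||_1
    L = np.linalg.norm(Phi, ord=2) ** 2   # Lipschitz constant of the smooth part's gradient
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iters):
        grad = Phi.T @ (Phi @ w - y)      # gradient of (1/2)||Phi w - y||^2
        w = soft_threshold(w - grad / L, lam / L)
    return w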


Why is Lasso Interesting?


Support Vector Regression

One more formulation to cover before we turn to the tools of optimization/duality.