
Linear regression

Course of Machine Learning, Master Degree in Computer Science
University of Rome "Tor Vergata"
Giorgio Gambosi, a.a. 2018-2019


Linear models

  • Linear combination of the input features:

    y(x, w) = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_D x_D,  with x = (x_1, \dots, x_D)

  • Linear function of the parameters w.
  • Linear function of the features x. Extension to a linear combination of basis functions:

    y(x, w) = w_0 + \sum_{j=1}^{M-1} w_j \phi_j(x)

  • Letting \phi_0(x) = 1, this becomes y(x, w) = w^T \phi(x), as in the sketch below.
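As a concrete illustration, here is a minimal NumPy sketch (function names and parameter values are mine, not from the slides) that builds the feature vector with \phi_0(x) = 1 and evaluates y(x, w) = w^T \phi(x) for a polynomial basis:

    import numpy as np

    def phi(x, M):
        # Feature vector (phi_0(x), ..., phi_{M-1}(x)) with phi_j(x) = x^j, so phi_0(x) = 1
        return np.array([x**j for j in range(M)])

    w = np.array([0.5, -1.0, 2.0])   # example parameters, M = 3
    x = 0.7
    y = w @ phi(x, len(w))           # y(x, w) = w^T phi(x)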


Basis functions

  • Many types (implemented in the sketch below):
  • Polynomial (global functions):

    \phi_j(x) = x^j

  • Gaussian (local):

    \phi_j(x) = \exp\left( -\frac{(x - \mu_j)^2}{2s^2} \right)

  • Sigmoid (local):

    \phi_j(x) = \sigma\left( \frac{x - \mu_j}{s} \right) = \frac{1}{1 + e^{-(x - \mu_j)/s}}

  • Hyperbolic tangent (local):

    \phi_j(x) = 2\sigma\left( \frac{x - \mu_j}{s} \right) - 1 = \frac{1 - e^{-(x - \mu_j)/s}}{1 + e^{-(x - \mu_j)/s}} = \tanh\left( \frac{x - \mu_j}{2s} \right)
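A minimal NumPy sketch of these basis functions (vectorized over x; the names are mine):

    import numpy as np

    def gaussian_basis(x, mu, s):
        # phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2))
        return np.exp(-(x - mu)**2 / (2 * s**2))

    def sigmoid_basis(x, mu, s):
        # phi_j(x) = sigma((x - mu_j) / s)
        return 1.0 / (1.0 + np.exp(-(x - mu) / s))

    def tanh_basis(x, mu, s):
        # 2 sigma((x - mu_j)/s) - 1, i.e. tanh((x - mu_j) / (2 s))
        return 2.0 * sigmoid_basis(x, mu, s) - 1.0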


Maximum likelihood and least squares

  • Assume additive Gaussian noise:

    t = y(x, w) + \varepsilon,  with p(\varepsilon) = N(\varepsilon | 0, \beta^{-1})

    where \beta = 1/\sigma^2 is the precision.

  • Then

    p(t | x, w, \beta) = N(t | y(x, w), \beta^{-1})

    and the expectation of the conditional distribution is

    E[t | x] = \int t \, p(t | x) \, dt = y(x, w)


Maximum likelihood and least squares

  • The likelihood of a given training set (X, t) is

    p(t | X, w, \beta) = \prod_{i=1}^{N} N(t_i | w^T \phi(x_i), \beta^{-1})

  • The corresponding log-likelihood is then

    \ln p(t | X, w, \beta) = \sum_{i=1}^{N} \ln N(t_i | w^T \phi(x_i), \beta^{-1}) = \frac{N}{2} \ln \beta - \frac{N}{2} \ln(2\pi) - \beta E_D(w)

    where (see the sketch below)

    E_D(w) = \frac{1}{2} \sum_{i=1}^{N} \left( t_i - w^T \phi(x_i) \right)^2 = \frac{1}{2} (\Phi w - t)^T (\Phi w - t)
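A sketch of the sum-of-squares error and the log-likelihood it induces, given a hypothetical design matrix Phi (N × M), targets t, parameters w, and precision beta:

    import numpy as np

    def sum_of_squares_error(Phi, t, w):
        # E_D(w) = 1/2 * ||Phi w - t||^2
        r = Phi @ w - t
        return 0.5 * (r @ r)

    def log_likelihood(Phi, t, w, beta):
        # ln p(t | X, w, beta) = N/2 ln(beta) - N/2 ln(2 pi) - beta E_D(w)
        N = len(t)
        return N / 2 * np.log(beta) - N / 2 * np.log(2 * np.pi) \
            - beta * sum_of_squares_error(Phi, t, w)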


Maximum likelihood and least squares

  • Maximizing the log-likelihood w.r.t. w is equivalent to minimizing the error function E_D(w).
  • Maximization is performed by setting the gradient to 0:

    0 = \frac{\partial}{\partial w} \ln p(t | X, w, \beta) = \sum_{i=1}^{N} \left( t_i - w^T \phi(x_i) \right) \phi(x_i)^T = \sum_{i=1}^{N} t_i \phi(x_i)^T - w^T \left( \sum_{i=1}^{N} \phi(x_i) \phi(x_i)^T \right)

  • Result (see the sketch below):

    w_{ML} = (\Phi^T \Phi)^{-1} \Phi^T t    (normal equations for least squares)

    \beta_{ML}^{-1} = \frac{1}{N} \sum_{i=1}^{N} \left( t_i - w_{ML}^T \phi(x_i) \right)^2

    where

    \Phi = \begin{pmatrix} \phi_0(x_1) & \phi_1(x_1) & \cdots & \phi_{M-1}(x_1) \\ \phi_0(x_2) & \phi_1(x_2) & \cdots & \phi_{M-1}(x_2) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_0(x_N) & \phi_1(x_N) & \cdots & \phi_{M-1}(x_N) \end{pmatrix}
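A minimal sketch of the ML solution on synthetic data (the data-generating choices are mine; a linear solve is used instead of an explicit matrix inverse for numerical stability):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 1, size=50)
    t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=50)

    M = 4                                   # polynomial basis of degree M - 1
    Phi = np.vander(x, M, increasing=True)  # row i is (1, x_i, x_i^2, x_i^3)

    w_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)  # normal equations
    beta_ml_inv = np.mean((t - Phi @ w_ml)**2)      # ML estimate of the noise variance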


Least squares geometry

  • t = (t_1, \dots, t_N)^T is a vector in ℝ^N.
  • Each basis function φ_j applied to x_1, \dots, x_N yields a vector ϕ_j = (φ_j(x_1), \dots, φ_j(x_N))^T ∈ ℝ^N.
  • If M < N, the vectors ϕ_0, \dots, ϕ_{M-1} define a subspace S of dimension (at most) M.
  • y = (y(x_1, w), \dots, y(x_N, w))^T is a vector in ℝ^N: it can be represented as the linear combination y = \sum_{j=0}^{M-1} w_j ϕ_j. Hence, it belongs to S.
  • Given t ∈ ℝ^N, y ∈ ℝ^N is the vector in the subspace S at minimal squared distance from t.
  • Given t ∈ ℝ^N and the vectors ϕ_0, \dots, ϕ_{M-1}, w_{ML} is such that y is the vector in S nearest to t.


Gradient descent

  • The minimum of E_D(w) may also be computed numerically, by means of gradient descent methods.
  • Initial assignment w^{(0)} = (w_0^{(0)}, w_1^{(0)}, \dots, w_D^{(0)}), with a corresponding error value

    E_D(w^{(0)}) = \frac{1}{2} \sum_{i=1}^{N} \left( t_i - (w^{(0)})^T \phi(x_i) \right)^2

  • Iteratively, the current value w^{(i-1)} is modified in the direction of steepest descent of E_D(w).
  • At step i, w_j^{(i-1)} is updated as follows:

    w_j^{(i)} := w_j^{(i-1)} - \eta \left. \frac{\partial E_D(w)}{\partial w_j} \right|_{w = w^{(i-1)}}


Gradient descent

  • In matrix notation:

    w^{(i)} := w^{(i-1)} - \eta \left. \frac{\partial E_D(w)}{\partial w} \right|_{w = w^{(i-1)}}

  • By the definition of E_D(w), when a single training item (x_i, t_i) is processed at step i (sequential, or stochastic, gradient descent), this becomes (see the sketch below):

    w^{(i)} := w^{(i-1)} + \eta \left( t_i - (w^{(i-1)})^T \phi(x_i) \right) \phi(x_i)
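A sketch of both update rules under these definitions (learning rates and iteration counts are arbitrary choices of mine):

    import numpy as np

    def gradient_descent(Phi, t, eta=0.01, steps=1000):
        # Batch rule: w := w + eta * Phi^T (t - Phi w), i.e. a full-gradient step
        w = np.zeros(Phi.shape[1])
        for _ in range(steps):
            w += eta * Phi.T @ (t - Phi @ w)
        return w

    def sgd(Phi, t, eta=0.05, epochs=50):
        # Sequential rule: w := w + eta * (t_i - w^T phi(x_i)) phi(x_i)
        w = np.zeros(Phi.shape[1])
        for _ in range(epochs):
            for phi_i, t_i in zip(Phi, t):
                w += eta * (t_i - w @ phi_i) * phi_i
        return w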


Regularized least squares

  • Regularization term in the cost function:

    E_D(w) + \lambda E_W(w)

    with E_D(w) dependent on the dataset (and on the parameters), and E_W(w) dependent on the parameters alone.

  • The regularization coefficient \lambda controls the relative importance of the two terms.
  • Simple form:

    E_W(w) = \frac{1}{2} w^T w = \frac{1}{2} \sum_{j=0}^{M-1} w_j^2

  • With the sum-of-squares cost function (weight decay):

    E(w) = \frac{1}{2} \sum_{i=1}^{N} \left( t_i - w^T \phi(x_i) \right)^2 + \frac{\lambda}{2} w^T w = \frac{1}{2} (\Phi w - t)^T (\Phi w - t) + \frac{\lambda}{2} w^T w

    with solution (see the sketch below)

    w = (\lambda I + \Phi^T \Phi)^{-1} \Phi^T t
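The regularized solution is a one-line change to the normal equations; a sketch under the same hypothetical setup as above:

    import numpy as np

    def ridge_fit(Phi, t, lam):
        # w = (lambda I + Phi^T Phi)^{-1} Phi^T t
        M = Phi.shape[1]
        return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)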


Regularization

  • A more general form:

    E(w) = \frac{1}{2} \sum_{i=1}^{N} \left( t_i - w^T \phi(x_i) \right)^2 + \frac{\lambda}{2} \sum_{j=0}^{M-1} |w_j|^q

  • The case q = 1 is denoted as lasso: sparse models are favored (in blue, the level curves of the cost function).


Bias vs variance: an example

  • Consider the case of the function y = sin 2πx and assume L = 100 training sets T_1, \dots, T_L are available, each of size n = 25.
  • Given M = 24 Gaussian basis functions \phi_1(x), \dots, \phi_M(x), from each training set T_i a prediction function y_i(x) is derived by minimizing the regularized cost function (see the sketch below)

    E_D(w) = \frac{1}{2} (\Phi w - t)^T (\Phi w - t) + \frac{\lambda}{2} w^T w
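A sketch of this experiment; the basis centers, widths, sampling interval, and noise level are my assumptions, since the slides do not specify them:

    import numpy as np

    rng = np.random.default_rng(1)
    L, n, M = 100, 25, 24
    lam = np.exp(2.6)            # ln(lambda) = 2.6, as in the next slide
    mu = np.linspace(0, 1, M)    # assumed basis centers
    s = 1.0 / M                  # assumed basis width

    def design(x):
        # N x M design matrix of Gaussian basis functions
        return np.exp(-(x[:, None] - mu)**2 / (2 * s**2))

    x_grid = np.linspace(0, 1, 200)
    preds = []
    for _ in range(L):
        x = rng.uniform(0, 1, n)
        t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)
        Phi = design(x)
        w = np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
        preds.append(design(x_grid) @ w)

    avg = np.mean(preds, axis=0)  # estimate of the expectation of y_i(x)
    var = np.var(preds, axis=0)   # pointwise variance across training sets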


An example

[Figure: two panels with axes x and t, for ln λ = 2.6.]

Left: a possible plot of the prediction functions y_i(x) (i = 1, \dots, 100), as derived from the training sets T_i, i = 1, \dots, 100, setting ln λ = 2.6. Right: their expectation, together with the unknown function y = sin 2πx. The prediction functions y_i(x) do not differ much from one another (small variance), but their expectation is a bad approximation of the unknown function (large bias).


An example

[Figure: two panels with axes x and t, for ln λ = −0.31.]

Plot of the prediction functions obtained with ln λ = −0.31.


An example

[Figure: two panels with axes x and t, for ln λ = −2.4.]

Plot of the prediction functions obtained with ln λ = −2.4. As λ decreases, the variance increases (the prediction functions y_i(x) differ more from each other), while the bias decreases (their expectation is a better approximation of y = sin 2πx).


An example

  • Plot of (bias)², variance, and their sum as functions of λ: as λ increases, the bias increases and the variance decreases. Their sum has a minimum in correspondence with the optimal value of λ.
  • The term E_x[σ²_{y|x}] shows an inherent limit to the approximability of y = sin 2πx.


Bayesian approach to regression

  • Applying maximum likelihood to determine the values of the model parameters is prone to overfitting: hence the need for a regularization term E_W(w).
  • In order to control model complexity, a Bayesian approach assumes a prior distribution over the parameter values.


Prior distribution

The posterior is proportional to the prior times the likelihood, and the likelihood is Gaussian (Gaussian noise):

    p(t | \Phi, w, \beta) = \prod_{i=1}^{n} N(t_i | w^T \phi(x_i), \beta^{-1})

The conjugate of a Gaussian is a Gaussian: choosing a Gaussian prior distribution for w, p(w) = N(w | m_0, S_0), results in a Gaussian posterior distribution

    p(w | t, \Phi) = N(w | m_N, S_N) \propto p(t | \Phi, w) p(w)

where

    m_N = S_N (S_0^{-1} m_0 + \beta \Phi^T t)
    S_N^{-1} = S_0^{-1} + \beta \Phi^T \Phi


Prior distribution

A common approach: a zero-mean isotropic Gaussian prior distribution for w,

    p(w | \alpha) = \prod_{j=0}^{M-1} \left( \frac{\alpha}{2\pi} \right)^{1/2} e^{-\frac{\alpha}{2} w_j^2}

  • The parameters in w are assumed independent and identically distributed, according to a Gaussian with mean 0, uniform variance σ² = α^{-1}, and null covariance.
  • The prior distribution is defined through a hyperparameter α, inversely proportional to the variance.


Posterior distribution

Given the likelihood

    p(t | \Phi, w, \beta) = \prod_{i=1}^{n} N(t_i | w^T \phi(x_i), \beta^{-1}) \propto \prod_{i=1}^{n} e^{-\frac{\beta}{2} (t_i - w^T \phi(x_i))^2}

the posterior distribution for w derives from Bayes' rule:

    p(w | t, \Phi, \alpha, \beta) = \frac{p(t | \Phi, w, \beta) \, p(w | \alpha)}{p(t | \Phi, \alpha, \beta)} \propto p(t | \Phi, w, \beta) \, p(w | \alpha)


In this case

It is possible to show that, assuming

    p(w) = N(w | 0, \alpha^{-1} I)
    p(t | w, \Phi) = N(t | \Phi w, \beta^{-1} I)

the posterior distribution is itself Gaussian,

    p(w | t, \Phi, \alpha, \beta) = N(w | m_N, S_N)

with (see the sketch below)

    S_N = (\alpha I + \beta \Phi^T \Phi)^{-1}
    m_N = \beta S_N \Phi^T t

Note that if α → 0 the prior tends to have infinite variance, and we have minimal information on w before the training set is considered. In this case,

    m_N \to (\beta \Phi^T \Phi)^{-1} (\beta \Phi^T t) = (\Phi^T \Phi)^{-1} \Phi^T t

that is, w_{ML}, the ML estimate of w.
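A sketch of the posterior computation under these assumptions (an explicit inverse is used to match the formula; in practice a solve would be preferable):

    import numpy as np

    def posterior(Phi, t, alpha, beta):
        # S_N = (alpha I + beta Phi^T Phi)^{-1},  m_N = beta S_N Phi^T t
        M = Phi.shape[1]
        S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
        m_N = beta * S_N @ Phi.T @ t
        return m_N, S_N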


Maximum a Posteriori

  • Given the posterior distribution p(w | \Phi, t, \alpha, \beta), we may derive the value w_{MAP} which maximizes it (the mode of the distribution).
  • This is equivalent to maximizing its logarithm:

    \log p(w | \Phi, t, \alpha, \beta) = \log p(t | w, \Phi, \beta) + \log p(w | \alpha) - \log p(t | \Phi, \beta)

    and, since p(t | \Phi, \beta) is constant w.r.t. w,

    w_{MAP} = \arg\max_w \log p(w | \Phi, t, \alpha, \beta) = \arg\max_w \left( \log p(t | w, \Phi, \beta) + \log p(w | \alpha) \right)

    that is,

    w_{MAP} = \arg\min_w \left( -\log p(t | \Phi, w, \beta) - \log p(w | \alpha) \right)


Derivation of MAP

By the assumptions on the prior and the likelihood,

    w_{MAP} = \arg\min_w \left( \frac{\beta}{2} \sum_{i=1}^{n} (t_i - w^T \phi(x_i))^2 + \frac{\alpha}{2} \sum_{j=0}^{M-1} w_j^2 + \text{constants} \right) = \arg\min_w \left( \sum_{i=1}^{n} (t_i - w^T \phi(x_i))^2 + \frac{\alpha}{\beta} \sum_{j=0}^{M-1} w_j^2 \right)

This is equivalent to considering the cost function

    E_{MAP}(w) = \sum_{i=1}^{n} (t_i - w^T \phi(x_i))^2 + \frac{\alpha}{\beta} w^T w

that is, a regularized least-squares cost function with λ = α/β.


Sequential learning

  • The Bayesian approach can be applied when the training set is acquired incrementally (sequential learning).
  • Fundamental property: if the prior and the likelihood are Gaussian, the posterior is Gaussian too.
  • Let T_1, T_2 be two independent training sets, acquired one after the other. Then

    p(w | T_1, T_2) \propto p(T_1, T_2 | w) p(w) = p(T_2 | w) p(T_1 | w) p(w) \propto p(T_2 | w) p(w | T_1)


Sequential learning

  • The posterior after observing T_1 can be used as a prior for the next training set acquired (see the sketch below).
  • In general, for a sequence T_1, \dots, T_n of training sets,

    p(w | T_1, \dots, T_n) \propto p(T_n | w) p(w | T_1, \dots, T_{n-1})
    p(w | T_1, \dots, T_{n-1}) \propto p(T_{n-1} | w) p(w | T_1, \dots, T_{n-2})
    \dots
    p(w | T_1) \propto p(T_1 | w) p(w)
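With Gaussian prior and likelihood, this amounts to applying the general posterior update m_N = S_N(S_0^{-1} m_0 + β Φ^T t), S_N^{-1} = S_0^{-1} + β Φ^T Φ repeatedly, feeding each posterior back in as the next prior; a sketch:

    import numpy as np

    def bayes_update(m0, S0, Phi, t, beta):
        # S_N^{-1} = S_0^{-1} + beta Phi^T Phi ; m_N = S_N (S_0^{-1} m0 + beta Phi^T t)
        S0_inv = np.linalg.inv(S0)
        S_N = np.linalg.inv(S0_inv + beta * Phi.T @ Phi)
        m_N = S_N @ (S0_inv @ m0 + beta * Phi.T @ t)
        return m_N, S_N

    # Usage: start from the prior N(0, I/alpha) and fold in batches T_1, T_2, ...
    # m, S = np.zeros(M), np.eye(M) / alpha
    # for Phi_i, t_i in batches:
    #     m, S = bayes_update(m, S, Phi_i, t_i, beta)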


Example

  • Input variable x, target variable t, linear regression y(x, w_0, w_1) = w_0 + w_1 x.
  • The dataset is generated by applying the function y = a_0 + a_1 x (with a_0 = −0.3, a_1 = 0.5) to values uniformly sampled in [−1, 1], with added Gaussian noise (µ = 0, σ = 0.2).
  • Assume the prior distribution p(w_0, w_1) is a bivariate Gaussian with µ = 0 and Σ = σ²I = 0.04 I.

Left: the prior distribution of (w_0, w_1); right: 6 lines sampled from the distribution.


Example

After observing the item (x_1, y_1) (circle in the right figure). Left: the posterior distribution p(w_0, w_1 | x_1, y_1); right: 6 lines sampled from the distribution.


Example

After observing the items (x_1, y_1), (x_2, y_2) (circles in the right figure). Left: the posterior distribution p(w_0, w_1 | x_1, y_1, x_2, y_2); right: 6 lines sampled from the distribution.


Example

After observing a set of n items (x_1, y_1), \dots, (x_n, y_n) (circles in the right figure). Left: the posterior distribution p(w_0, w_1 | x_i, y_i, i = 1, \dots, n); right: 6 lines sampled from the distribution.


Example

  • As the number of observed items increases, the distribution of the parameters (w_0, w_1) tends to concentrate (its variance decreases to 0) around a mean point (a_0, a_1).
  • As a consequence, the sampled lines concentrate around y = a_0 + a_1 x.


Approaches to prediction in linear regression

Classical

  • A value w_{LS} for w is learned through a point estimate, performed by minimizing a quadratic cost function or, equivalently, by maximizing the likelihood (ML) under the hypothesis of Gaussian noise; regularization can be applied to modify the cost function and limit overfitting.
  • Given any x, the obtained value w_{LS} is used to predict the corresponding t as y = x^T w_{LS}, with the input augmented as x = (1, x)^T, or, in general, as y = φ(x)^T w_{LS}.


Approaches to prediction in linear regression

Bayesian point estimation

  • The posterior distribution p(w | t, \Phi, \alpha, \beta) is derived, and a point estimate is performed on it by computing the mode w_{MAP} of the distribution (MAP).
  • Equivalent to the classical approach, as w_{MAP} corresponds to w_{LS} if λ = α/β.
  • The prediction, for a value x, is a Gaussian distribution p(y | φ(x)^T w_{MAP}, β) for y, with mean φ(x)^T w_{MAP} and variance β^{-1}.
  • The distribution is not derived directly from the posterior p(w | t, \Phi, \alpha, \beta): it is built, instead, as a Gaussian with mean depending on the expectation of the posterior, and variance given by the assumed noise.


Approaches to prediction in linear regression

Fully Bayesian

  • The real interest is not in estimating w or its distribution p(w | t, \Phi, \alpha, \beta), but in deriving the predictive distribution p(y | x). This can be done by taking the expectation of the probability p(y | x, w, \beta) predicted by a model instance with respect to the distribution p(w | t, \Phi, \alpha, \beta) over model instances, that is,

    p(y | x, t, \Phi, \alpha, \beta) = \int p(y | x, w, \beta) \, p(w | t, \Phi, \alpha, \beta) \, dw

  • p(y | x, w, \beta) is assumed Gaussian, and p(w | t, \Phi, \alpha, \beta) is Gaussian by the assumption that the likelihood p(t | w, \Phi, \beta) and the prior p(w | \alpha) are themselves Gaussian and by their being conjugate:

    p(y | x, w, \beta) = N(y | w^T \phi(x), \beta^{-1})
    p(w | t, \Phi, \alpha, \beta) = N(w | \beta S_N \Phi^T t, S_N),  where S_N = (\alpha I + \beta \Phi^T \Phi)^{-1}


Approaches to prediction in linear regression

Fully Bayesian

Under these hypotheses, p(y | x) is Gaussian (see the sketch below),

    p(y | x, t, \Phi, \alpha, \beta) = N(y | m(x), \sigma^2(x))

with mean m(x) = \beta \phi(x)^T S_N \Phi^T t and variance \sigma^2(x) = \frac{1}{\beta} + \phi(x)^T S_N \phi(x).

  • 1/β is a measure of the uncertainty intrinsic to the observed data (noise).
  • φ(x)^T S_N φ(x) is the uncertainty w.r.t. the values derived for the parameters w.
  • As the noise distribution and the distribution of w are independent Gaussians, their variances add.
  • φ(x)^T S_N φ(x) → 0 as n → ∞, and the only remaining uncertainty is the one intrinsic to data observation.
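A sketch of the predictive mean and variance under these formulas, reusing the posterior quantities m_N and S_N computed earlier:

    import numpy as np

    def predictive(phi_x, m_N, S_N, beta):
        # m(x) = phi(x)^T m_N, with m_N = beta S_N Phi^T t
        # sigma^2(x) = 1/beta + phi(x)^T S_N phi(x)
        mean = phi_x @ m_N
        var = 1.0 / beta + phi_x @ S_N @ phi_x
        return mean, var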


Example

  • Predictive distribution for y = sin 2πx, applying a model with 9 Gaussian basis functions and training sets of 1, 2, 4, and 25 items, respectively.
  • Left: the items in the training sets (sampled uniformly, with added Gaussian noise); the expectation of the predictive distribution (red), as a function of x; the variance of this distribution (pink shade, within 1 standard deviation from the mean), as a function of x.
  • Right: the items in the training sets, and 5 possible curves approximating y = sin 2πx, derived by sampling from the posterior distribution p(w | t, \Phi, \alpha, \beta).


Example

[Figures: predictive distribution for n = 1 (left) and n = 2 (right).]


Example

[Figures: predictive distribution for n = 4 (left) and n = 25 (right).]


Equivalent kernel

  • The expectation of the predictive distribution can also be written as

    y(x) = \beta \phi(x)^T S_N \Phi^T t = \sum_{i=1}^{n} \beta \phi(x)^T S_N \phi(x_i) t_i

  • The prediction can then be seen as a linear combination of the target values t_i of the items in the training set, with weights dependent on the item values x_i (and on x):

    y(x) = \sum_{i=1}^{n} \kappa(x, x_i) t_i

    The weight function \kappa(x, x') = \beta \phi(x)^T S_N \phi(x') is called the equivalent kernel, or linear smoother (see the sketch below).
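A sketch of the equivalent kernel and of the prediction written as a linear smoother:

    import numpy as np

    def equivalent_kernel(phi_x, Phi, S_N, beta):
        # kappa(x, x_i) = beta phi(x)^T S_N phi(x_i): one weight per training item
        return beta * phi_x @ S_N @ Phi.T

    # y(x) = sum_i kappa(x, x_i) t_i
    # k = equivalent_kernel(phi_x, Phi, S_N, beta)
    # y = k @ t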


Equivalent kernel

Right: plot, on the plane (x, x_i), of a sample equivalent kernel in the case of Gaussian basis functions. Left: plot as a function of x_i, for three different values of x. In deriving y, the equivalent kernel tends to assign greater relevance to the target values t_i corresponding to items x_i near to x.


Equivalent kernel

The same localization property also holds for different basis functions.

[Figure: left, κ(0, x') in the case of polynomial basis functions; right, κ(0, x') in the case of Gaussian basis functions.]


Equivalent kernel

Some properties:

  • It is possible to prove that \sum_{i=1}^{n} \kappa(x, x_i) = 1 for any x.
  • The covariance between y(x) and y(x') is given by

    cov(y(x), y(x')) = cov(\phi(x)^T w, w^T \phi(x')) = \phi(x)^T S_N \phi(x') = \frac{1}{\beta} \kappa(x, x')

    so predicted values at nearby points are highly correlated.

  • Instead of introducing basis functions, which result in a kernel, we may define a localized kernel directly and use it to make predictions (Gaussian processes).
  • The equivalent kernel can be expressed as an inner product \kappa(x, x') = \psi(x)^T \psi(x') for a suitable set of functions \psi(x) = \beta^{1/2} S_N^{1/2} \phi(x).


Alternative approach to linear regression

  • First approach: define a set of basis functions, which are
    • used to derive w,
    • or (by means of the resulting equivalent kernel) used to compute y(x) directly, as a linear combination of the training set items.
  • New approach: a suitable kernel is defined directly and used to compute y(x).