Linear regression



  1. Linear regression. Course of Machine Learning, Master Degree in Computer Science, University of Rome ``Tor Vergata''. Giorgio Gambosi, a.a. 2018-2019.

  2. Linear models
  • Linear combination of the input features:
    $y(x, w) = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_D x_D$, with $x = (x_1, \dots, x_D)$
  • Linear function of the parameters $w$
  • Linear function of the features $x$. Extension to a linear combination of basis functions:
    $y(x, w) = w_0 + \sum_{j=1}^{M-1} w_j \phi_j(x)$
  • Letting $\phi_0(x) = 1$, this becomes $y(x, w) = w^T \phi(x)$
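As an illustration, here is a minimal NumPy sketch of the model $y(x, w) = w^T \phi(x)$; the polynomial feature map and the parameter values are assumptions made only for this example.

```python
import numpy as np

def phi(x, M):
    """Feature vector (phi_0(x), ..., phi_{M-1}(x)) with phi_0(x) = 1.

    A polynomial map phi_j(x) = x**j is used here purely as an example."""
    return np.array([x ** j for j in range(M)])

def y(x, w):
    """Linear model y(x, w) = w^T phi(x): linear in w, nonlinear in x in general."""
    return w @ phi(x, len(w))

w = np.array([0.5, -1.0, 2.0, 0.3])   # illustrative parameter values
print(y(1.5, w))                      # prediction at x = 1.5
```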

  3. Basis functions
  Many types:
  • Polynomial (global): $\phi_j(x) = x^j$
  • Gaussian (local): $\phi_j(x) = \exp\left(-\frac{(x - \mu_j)^2}{2 s^2}\right)$
  • Sigmoid (local): $\phi_j(x) = \sigma\!\left(\frac{x - \mu_j}{s}\right) = \frac{1}{1 + e^{-(x - \mu_j)/s}}$
  • Hyperbolic tangent (local): $\phi_j(x) = \tanh\!\left(\frac{x - \mu_j}{2s}\right) = 2\,\sigma\!\left(\frac{x - \mu_j}{s}\right) - 1 = \frac{1 - e^{-(x - \mu_j)/s}}{1 + e^{-(x - \mu_j)/s}}$
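The basis function families listed above, written as plain NumPy functions (a sketch; `mu` and `s` denote the centre and scale parameters used on the slide):

```python
import numpy as np

def polynomial(x, j):
    return x ** j                                   # global: sensitive to the whole input range

def gaussian(x, mu, s):
    return np.exp(-(x - mu) ** 2 / (2 * s ** 2))    # local: a bump centred at mu

def sigmoid(x, mu, s):
    return 1.0 / (1.0 + np.exp(-(x - mu) / s))      # local transition around mu

def tanh_basis(x, mu, s):
    # equivalently 2 * sigmoid(x, mu, s) - 1; local transition with range (-1, 1)
    return np.tanh((x - mu) / (2 * s))
```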

  4. Maximum likelihood and least squares
  • Assume an additive Gaussian noise term: $t = y(x, w) + \varepsilon$, with $p(\varepsilon) = \mathcal{N}(\varepsilon \mid 0, \beta^{-1})$, where $\beta = 1/\sigma^2$ is the precision.
  • Then $p(t \mid x, w, \beta) = \mathcal{N}(t \mid y(x, w), \beta^{-1})$, and the expectation of the conditional distribution is
    $E[t \mid x] = \int t \, p(t \mid x) \, dt = y(x, w)$

  5. Maximum likelihood and least squares
  • The likelihood of a given training set $X, t$ is
    $p(t \mid X, w, \beta) = \prod_{i=1}^{N} \mathcal{N}(t_i \mid w^T \phi(x_i), \beta^{-1})$
  • The corresponding log-likelihood is then
    $\ln p(t \mid X, w, \beta) = \sum_{i=1}^{N} \ln \mathcal{N}(t_i \mid w^T \phi(x_i), \beta^{-1}) = \frac{N}{2} \ln \beta - \frac{N}{2} \ln(2\pi) - \beta E_D(w)$
    where
    $E_D(w) = \frac{1}{2} \sum_{i=1}^{N} \left(t_i - w^T \phi(x_i)\right)^2 = \frac{1}{2} (\Phi w - t)^T (\Phi w - t)$
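The error $E_D(w)$ and the log-likelihood of slide 5 as a small NumPy sketch; the design matrix `Phi` is assumed to be built as described on slide 6.

```python
import numpy as np

def E_D(w, Phi, t):
    """Sum-of-squares error E_D(w) = 0.5 * ||Phi w - t||^2."""
    r = Phi @ w - t
    return 0.5 * r @ r

def log_likelihood(w, Phi, t, beta):
    """ln p(t | X, w, beta) = (N/2) ln beta - (N/2) ln(2 pi) - beta * E_D(w)."""
    N = len(t)
    return 0.5 * N * np.log(beta) - 0.5 * N * np.log(2 * np.pi) - beta * E_D(w, Phi, t)
```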

  6. Maximum likelihood and least squares
  • Maximizing the log-likelihood w.r.t. $w$ is equivalent to minimizing the error function $E_D(w)$
  • The maximization is performed by setting the gradient to 0:
    $\frac{\partial}{\partial w} \ln p(t \mid X, w, \beta) = \sum_{i=1}^{N} \left(t_i - w^T \phi(x_i)\right) \phi(x_i)^T = 0$
    that is,
    $\sum_{i=1}^{N} t_i \phi(x_i)^T - w^T \sum_{i=1}^{N} \phi(x_i) \phi(x_i)^T = 0$
  • Result: the normal equations for least squares
    $w_{ML} = (\Phi^T \Phi)^{-1} \Phi^T t$
    where $\Phi$ is the design matrix
    $\Phi = \begin{pmatrix} \phi_0(x_1) & \phi_1(x_1) & \cdots & \phi_{M-1}(x_1) \\ \phi_0(x_2) & \phi_1(x_2) & \cdots & \phi_{M-1}(x_2) \\ \vdots & \vdots & & \vdots \\ \phi_0(x_N) & \phi_1(x_N) & \cdots & \phi_{M-1}(x_N) \end{pmatrix}$
  • The ML estimate of the noise precision is given by
    $\beta_{ML}^{-1} = \frac{1}{N} \sum_{i=1}^{N} \left(t_i - w_{ML}^T \phi(x_i)\right)^2$
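A sketch of the closed-form ML solution on toy data; the data-generating function, noise level and polynomial basis are arbitrary choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)   # toy targets

M = 4
Phi = np.vander(x, M, increasing=True)        # design matrix with phi_j(x_i) = x_i ** j

# Normal equations: w_ML = (Phi^T Phi)^{-1} Phi^T t
w_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

# ML estimate of the noise variance 1/beta
beta_ml_inv = np.mean((t - Phi @ w_ml) ** 2)
```

In practice one would solve the least-squares problem with `np.linalg.lstsq` rather than forming $\Phi^T \Phi$ explicitly, but the code above mirrors the normal equations as written on the slide.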

  7. Least squares geometry
  • $t = (t_1, \dots, t_N)^T$ is a vector in $\mathbb{R}^N$
  • Each basis function $\phi_j$ applied to $x_1, \dots, x_N$ yields a vector $\varphi_j = (\phi_j(x_1), \dots, \phi_j(x_N))^T \in \mathbb{R}^N$
  • If $M < N$, the vectors $\varphi_0, \dots, \varphi_{M-1}$ span a subspace $S$ of dimension (at most) $M$
  • $y = (y(x_1, w), \dots, y(x_N, w))^T$ is a vector in $\mathbb{R}^N$: it can be represented as the linear combination $y = \sum_{j=0}^{M-1} w_j \varphi_j$, hence it belongs to $S$
  • Given $t \in \mathbb{R}^N$ and the vectors $\varphi_0, \dots, \varphi_{M-1}$, $w_{ML}$ is such that $y \in \mathbb{R}^N$ is the vector in $S$ at minimal squared distance from $t$, i.e. the vector on $S$ nearest to $t$
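The geometric statement can be checked numerically: with $w_{ML}$ from the normal equations, the residual $t - y$ is orthogonal to every column $\varphi_j$ of $\Phi$, i.e. $y$ is the orthogonal projection of $t$ onto $S$. A sketch with random data (the dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 30, 5
Phi = rng.normal(size=(N, M))            # columns are the vectors varphi_j in R^N
t = rng.normal(size=N)

w_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)
y_vec = Phi @ w_ml                       # the vector y, lying in the subspace S

# Residual orthogonal to S: Phi^T (t - y) = 0 up to round-off
print(np.allclose(Phi.T @ (t - y_vec), 0.0, atol=1e-8))
```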

  8. Gradient descent
  • The minimum of $E_D(w)$ may also be computed numerically, by means of gradient descent methods
  • Initial assignment $w^{(0)} = (w_0^{(0)}, w_1^{(0)}, \dots, w_D^{(0)})$, with a corresponding error value
    $E_D(w^{(0)}) = \frac{1}{2} \sum_{i=1}^{N} \left(t_i - (w^{(0)})^T \phi(x_i)\right)^2$
  • Iteratively, the current value $w^{(i-1)}$ is modified in the direction of steepest descent of $E_D(w)$
  • At step $i$,
    $w_j^{(i)} := w_j^{(i-1)} - \eta \left.\frac{\partial E_D(w)}{\partial w_j}\right|_{w^{(i-1)}}$

  9. Gradient descent
  • In matrix notation:
    $w^{(i)} := w^{(i-1)} - \eta \left.\frac{\partial E_D(w)}{\partial w}\right|_{w^{(i-1)}}$
  • By the definition of $E_D(w)$, the update applied sequentially to a single training item $(x_i, t_i)$ is
    $w^{(i)} := w^{(i-1)} + \eta \left(t_i - (w^{(i-1)})^T \phi(x_i)\right) \phi(x_i)$
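A sketch of the batch gradient-descent loop for $E_D(w)$; the learning rate and number of steps are arbitrary choices, and the sequential (single-example) variant of the update is shown as a comment.

```python
import numpy as np

def gradient_descent(Phi, t, eta=0.01, n_steps=2000):
    """Minimize E_D(w) = 0.5 * ||Phi w - t||^2 by steepest descent."""
    w = np.zeros(Phi.shape[1])                   # initial assignment w^(0)
    for _ in range(n_steps):
        grad = Phi.T @ (Phi @ w - t)             # gradient of E_D at the current w
        w = w - eta * grad                       # step against the gradient
        # Sequential variant, one example i at a time:
        #   w = w + eta * (t[i] - w @ Phi[i]) * Phi[i]
    return w
```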

  10. Regularized least squares
  • Add a regularization term to the cost function:
    $E_D(w) + \lambda E_W(w)$
    with $E_D(w)$ dependent on the dataset (and the parameters) and $E_W(w)$ dependent on the parameters alone.
  • The regularization coefficient $\lambda$ controls the relative importance of the two terms.
  • A simple form:
    $E_W(w) = \frac{1}{2} w^T w = \frac{1}{2} \sum_{i=0}^{M-1} w_i^2$
  • Sum-of-squares cost function (weight decay):
    $E(w) = \frac{1}{2} \sum_{i=1}^{N} \left(t_i - w^T \phi(x_i)\right)^2 + \frac{\lambda}{2} w^T w = \frac{1}{2} (\Phi w - t)^T (\Phi w - t) + \frac{\lambda}{2} w^T w$
    with solution
    $w = (\lambda I + \Phi^T \Phi)^{-1} \Phi^T t$
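A sketch that directly instantiates the regularized (weight-decay) solution $w = (\lambda I + \Phi^T \Phi)^{-1} \Phi^T t$:

```python
import numpy as np

def ridge_fit(Phi, t, lam):
    """Regularized least squares: w = (lambda I + Phi^T Phi)^{-1} Phi^T t."""
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
```

With `lam = 0` this reduces to the unregularized normal equations of slide 6.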

  11. Regularization
  • A more general form:
    $E(w) = \frac{1}{2} \sum_{i=1}^{N} \left(t_i - w^T \phi(x_i)\right)^2 + \frac{\lambda}{2} \sum_{j=0}^{M-1} |w_j|^q$
  • The case $q = 1$ is denoted as lasso: sparse models are favored (the accompanying figure, not reproduced here, shows in blue the level curves of the cost function).

  12. Bias vs variance: an example
  • Consider the function $y = \sin 2\pi x$ and assume $L = 100$ training sets $T_1, \dots, T_L$ are available, each of size $n = 25$.
  • Given $M = 24$ Gaussian basis functions $\phi_1(x), \dots, \phi_M(x)$, from each training set $T_i$ a prediction function $y_i(x)$ is derived by minimizing the regularized cost function
    $E_D(w) = \frac{1}{2} (\Phi w - t)^T (\Phi w - t) + \frac{\lambda}{2} w^T w$
    (a numerical sketch of this experiment is given after slide 16)

  13. An example
  [Figure: two panels, $t$ vs. $x$, with $\ln \lambda = 2.6$.]
  Left, a possible plot of the prediction functions $y_i(x)$ ($i = 1, \dots, 100$), as derived, respectively, from the training sets $T_i$, $i = 1, \dots, 100$, setting $\ln \lambda = 2.6$. Right, their expectation, together with the unknown function $y = \sin 2\pi x$. The prediction functions $y_i(x)$ do not differ much from each other (small variance), but their expectation is a bad approximation of the unknown function (large bias).

  14. An example
  [Figure: two panels, $t$ vs. $x$, with $\ln \lambda = -0.31$.]
  Plot of the prediction functions obtained with $\ln \lambda = -0.31$.

  15. An example
  [Figure: two panels, $t$ vs. $x$, with $\ln \lambda = -2.4$.]
  Plot of the prediction functions obtained with $\ln \lambda = -2.4$. As $\lambda$ decreases, the variance increases (the prediction functions $y_i(x)$ differ more from each other), while the bias decreases (their expectation is a better approximation of $y = \sin 2\pi x$).

  16. An example
  • Plot of $(\text{bias})^2$, variance and their sum as functions of $\lambda$: as $\lambda$ increases, the bias increases and the variance decreases. Their sum has a minimum in correspondence to the optimal value of $\lambda$.
  • The term $E_x[\sigma^2_{y \mid x}]$ shows an inherent limit to the approximability of $y = \sin 2\pi x$.
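A self-contained sketch of the experiment of slides 12-16: it repeats the regularized fit over $L = 100$ training sets and estimates $(\text{bias})^2$ and variance for the three values of $\ln \lambda$ used above. The noise level and the centres and scale of the Gaussian basis functions are assumptions made here, so the numbers only qualitatively reproduce the figures.

```python
import numpy as np

rng = np.random.default_rng(0)
x_test = np.linspace(0, 1, 200)
true_f = np.sin(2 * np.pi * x_test)
mu, s = np.linspace(0, 1, 24), 0.1            # centres/scale of the Gaussian basis (assumed)

def design(x):
    """Design matrix with phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)), j = 1..24."""
    return np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2 * s ** 2))

def bias2_and_variance(lam, L=100, n=25, noise=0.3):
    preds = []
    for _ in range(L):
        x = rng.uniform(0, 1, n)
        t = np.sin(2 * np.pi * x) + rng.normal(scale=noise, size=n)
        Phi = design(x)
        w = np.linalg.solve(lam * np.eye(Phi.shape[1]) + Phi.T @ Phi, Phi.T @ t)
        preds.append(design(x_test) @ w)
    preds = np.array(preds)                   # L prediction functions on a test grid
    bias2 = np.mean((preds.mean(axis=0) - true_f) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias2, variance

for ln_lam in (2.6, -0.31, -2.4):
    print(ln_lam, bias2_and_variance(np.exp(ln_lam)))
```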

  17. Bayesian approach to regression
  • Applying maximum likelihood to determine the values of the model parameters is prone to overfitting: hence the need for a regularization term in $E(w)$.
  • In order to control model complexity, a Bayesian approach assumes a prior distribution over the parameter values.

  18. Prior distribution
  • The posterior is proportional to prior times likelihood; the likelihood is Gaussian (Gaussian noise):
    $p(t \mid \Phi, w, \beta) = \prod_{i=1}^{N} \mathcal{N}(t_i \mid w^T \phi(x_i), \beta^{-1})$
  • The conjugate of a Gaussian is a Gaussian: choosing a Gaussian prior distribution for $w$,
    $p(w) = \mathcal{N}(w \mid m_0, S_0)$
    results in a Gaussian posterior distribution
    $p(w \mid t, \Phi) = \mathcal{N}(w \mid m_N, S_N) \propto p(t \mid \Phi, w) \, p(w)$
    where
    $m_N = S_N (S_0^{-1} m_0 + \beta \Phi^T t)$, $\quad S_N^{-1} = S_0^{-1} + \beta \Phi^T \Phi$

  19. Prior distribution
  A common approach: a zero-mean isotropic Gaussian prior distribution for $w$,
    $p(w \mid \alpha) = \prod_{i=0}^{M-1} \left(\frac{\alpha}{2\pi}\right)^{1/2} e^{-\frac{\alpha}{2} w_i^2}$
  • The parameters in $w$ are assumed independent and identically distributed, each Gaussian with mean $0$ and the same variance $\sigma^2 = \alpha^{-1}$, with null covariance between distinct parameters.
  • The prior distribution is defined through a hyper-parameter $\alpha$, inversely proportional to the variance.

  20. Posterior distribution
  Given the likelihood
    $p(t \mid \Phi, w, \beta) \propto \prod_{i=1}^{N} e^{-\frac{\beta}{2} (t_i - w^T \phi(x_i))^2}$
  the posterior distribution of $w$ derives from Bayes' rule:
    $p(w \mid t, \Phi, \alpha, \beta) = \frac{p(t \mid \Phi, w, \beta) \, p(w \mid \alpha)}{p(t \mid \Phi, \alpha, \beta)} \propto p(t \mid \Phi, w, \beta) \, p(w \mid \alpha)$

  21. It is possible to show that, assuming
    $p(w) = \mathcal{N}(w \mid 0, \alpha^{-1} I)$
    $p(t \mid w, \Phi) = \mathcal{N}(t \mid \Phi w, \beta^{-1} I)$
  the posterior distribution is itself a Gaussian,
    $p(w \mid t, \Phi, \alpha, \beta) = \mathcal{N}(w \mid m_N, S_N)$
  with
    $S_N = (\alpha I + \beta \Phi^T \Phi)^{-1}$, $\quad m_N = \beta S_N \Phi^T t$
  Note that if $\alpha \to 0$ the prior tends to have infinite variance, and we have minimum information on $w$ before the training set is considered. In this case,
    $m_N \to (\Phi^T \beta I \Phi)^{-1} (\Phi^T \beta I t) = (\Phi^T \Phi)^{-1} \Phi^T t$
  that is, $w_{ML}$, the ML estimate of $w$.
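A sketch of the posterior computation, together with a numerical check that the posterior mean approaches $w_{ML}$ as $\alpha \to 0$; the data here are random and purely illustrative.

```python
import numpy as np

def posterior(Phi, t, alpha, beta):
    """Gaussian posterior over w: S_N = (alpha I + beta Phi^T Phi)^{-1}, m_N = beta S_N Phi^T t."""
    M = Phi.shape[1]
    S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
    m_N = beta * S_N @ Phi.T @ t
    return m_N, S_N

rng = np.random.default_rng(0)
Phi = rng.normal(size=(50, 4))
t = rng.normal(size=50)

m_N, _ = posterior(Phi, t, alpha=1e-10, beta=1.0)
w_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)
print(np.allclose(m_N, w_ml, atol=1e-6))      # posterior mean ~ w_ML when alpha ~ 0
```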

  22. Maximum a Posteriori
  • Given the posterior distribution $p(w \mid \Phi, t, \alpha, \beta)$, we may derive the value $w_{MAP}$ which maximizes it (the mode of the distribution).
  • This is equivalent to maximizing its logarithm
    $\log p(w \mid \Phi, t, \alpha, \beta) = \log p(t \mid w, \Phi, \beta) + \log p(w \mid \alpha) - \log p(t \mid \Phi, \beta)$
    and, since $p(t \mid \Phi, \beta)$ is a constant w.r.t. $w$,
    $w_{MAP} = \arg\max_w \log p(w \mid \Phi, t, \alpha, \beta) = \arg\max_w \left(\log p(t \mid w, \Phi, \beta) + \log p(w \mid \alpha)\right)$
    that is,
    $w_{MAP} = \arg\min_w \left(-\log p(t \mid \Phi, w, \beta) - \log p(w \mid \alpha)\right)$

  23. Derivation of MAP
  By considering the assumptions on the prior and the likelihood, this is equivalent to
    $w_{MAP} = \arg\min_w \left(\frac{\beta}{2} \sum_{i=1}^{N} \left(t_i - w^T \phi(x_i)\right)^2 + \frac{\alpha}{2} \sum_{i=0}^{M-1} w_i^2 + \text{constants}\right)$
    $\phantom{w_{MAP}} = \arg\min_w \left(\beta \sum_{i=1}^{N} \left(t_i - w^T \phi(x_i)\right)^2 + \alpha \sum_{i=0}^{M-1} w_i^2\right)$
  that is, to minimizing the cost function
    $E_{MAP}(w) = \sum_{i=1}^{N} \left(t_i - w^T \phi(x_i)\right)^2 + \frac{\alpha}{\beta} w^T w$
  i.e. a regularized least-squares cost function with $\lambda = \frac{\alpha}{\beta}$.
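A numerical check of this equivalence: the MAP estimate (the posterior mean $m_N$) coincides with the regularized least-squares solution with $\lambda = \alpha/\beta$. The data and the values of $\alpha$ and $\beta$ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
Phi = rng.normal(size=(50, 4))
t = rng.normal(size=50)
alpha, beta = 2.0, 25.0                       # illustrative prior and noise precisions

# Regularized least squares with lambda = alpha / beta
lam = alpha / beta
w_ridge = np.linalg.solve(lam * np.eye(4) + Phi.T @ Phi, Phi.T @ t)

# MAP estimate = posterior mean m_N = beta (alpha I + beta Phi^T Phi)^{-1} Phi^T t
w_map = beta * np.linalg.inv(alpha * np.eye(4) + beta * Phi.T @ Phi) @ Phi.T @ t

print(np.allclose(w_ridge, w_map))            # True: the two solutions coincide
```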
