Lecture 5: Linear regression (cont'd.), Regularization, ML Methodology, Learning Theory


slide-1
SLIDE 1

Lecture 5:

−Linear regression (cont’d.) −Regularization −ML Methodology −Learning theory

Aykut Erdem

October 2017 Hacettepe University

slide-2
SLIDE 2

About class projects

  • This semester the theme is machine learning and the city.
  • To be done in groups of 3 people.
  • Deliverables: Proposal, blog posts, progress report, project presentations

(classroom + video (new) presentations), final report and code

  • For more details please check the project webpage: 


http://web.cs.hacettepe.edu.tr/~aykut/classes/fall2017/bbm406/project.html.

3

slide-3
SLIDE 3

4

Recall from last time… Linear Regression

Model: $y(x) = w_0 + w_1 x$, with $\mathbf{w} = (w_0, w_1)$

Loss: $\ell(\mathbf{w}) = \sum_{n=1}^{N} \big[ t^{(n)} - (w_0 + w_1 x^{(n)}) \big]^2$

Gradient Descent Update Rule: $\mathbf{w} \leftarrow \mathbf{w} + 2\lambda \big( t^{(n)} - y(x^{(n)}) \big) x^{(n)}$

Closed Form Solution: $\mathbf{w} = (X^T X)^{-1} X^T \mathbf{t}$
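As a quick illustration, here is a minimal NumPy sketch of both approaches on a tiny hypothetical 1-D dataset (the data values and the learning rate lam are assumptions chosen just for the example):

```python
import numpy as np

# Toy 1-D data (hypothetical values, just for illustration)
x = np.array([0.0, 1.0, 2.0, 3.0])
t = np.array([1.0, 2.9, 5.1, 7.2])

# Closed-form solution: augment x with a column of ones so w0 acts as the bias
X = np.column_stack([np.ones_like(x), x])      # N x 2 design matrix
w_closed = np.linalg.inv(X.T @ X) @ X.T @ t    # w = (X^T X)^{-1} X^T t

# Stochastic gradient descent on the same objective
w = np.zeros(2)
lam = 0.01                                     # learning rate (assumed value)
for epoch in range(1000):
    for xn, tn in zip(x, t):
        y = w[0] + w[1] * xn
        w += 2 * lam * (tn - y) * np.array([1.0, xn])   # w <- w + 2*lam*(t - y)*x

print(w_closed)   # (w0, w1) from the normal equations
print(w)          # should be close to the closed-form answer
```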

slide-4
SLIDE 4

Today

  • Linear regression (cont’d.)
  • Regularization
  • Machine Learning Methodology
  • validation
  • cross-validation (k-fold, leave-one-out)
  • model selection


  • Learning theory


5

slide-5
SLIDE 5

Multi-dimensional Inputs

  • One method of extending the model is to consider other input dimensions 



 


  • In the Boston housing example, we can look at the number of rooms

6

slide by Sanja Fidler

y(x) = w0 + w1x1 + w2x2

slide-6
SLIDE 6

Linear Regression with 
 Multi-dimensional Inputs

  • Imagine now we want to predict the median house price from

these multi-dimensional observations

  • Each house is a data point n, with observations indexed by j:
  • We can incorporate the bias w0 into w, by using x0 = 1, then
  • We can then solve for w = (w0,w1,…,wd). How?
  • We can use gradient descent to solve for each coefficient, or

compute w analytically (how does the solution change?)

7

slide by Sanja Fidler

$x^{(n)} = \big( x^{(n)}_1, \ldots, x^{(n)}_j, \ldots, x^{(n)}_d \big)$

$y(\mathbf{x}) = w_0 + \sum_{j=1}^{d} w_j x_j = \mathbf{w}^T \mathbf{x}$

slide-7
SLIDE 7

More Powerful Models?

  • What if our linear model is not good? How can we create a more

complicated model?

8

slide by Sanja Fidler

slide-8
SLIDE 8

Fitting a Polynomial

  • What if our linear model is not good? How can we create a more

complicated model?

  • We can create a more complicated model by defining input variables

that are combinations of components of x

  • Example: an M-th order polynomial function of one dimensional

feature x: 
 
 
 
 
 where $x^j$ is the j-th power of x

  • We can use the same approach to optimize for the weights w
  • How do we do that?

9

slide by Sanja Fidler

$y(x, \mathbf{w}) = w_0 + \sum_{j=1}^{M} w_j x^j$
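The same least-squares machinery applies once the polynomial design matrix is built. A minimal sketch, assuming some hypothetical noisy 1-D data and order M = 3:

```python
import numpy as np

# Hypothetical 1-D training data
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + 0.1 * np.random.randn(10)

M = 3                                           # polynomial order
Phi = np.vander(x, M + 1, increasing=True)      # columns [1, x, x^2, ..., x^M]

# Fit the weights w by least squares (same objective as before)
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
print(w)          # w0 ... wM
print(Phi @ w)    # predictions y(x, w) on the training inputs
```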

slide-9
SLIDE 9

Some types of basis functions in 1-D

10

Sigmoids Gaussians Polynomials

Gaussians: $\phi_j(x) = \exp\!\left( -\frac{(x - \mu_j)^2}{2s^2} \right)$

Sigmoids: $\phi_j(x) = \sigma\!\left( \frac{x - \mu_j}{s} \right)$, where $\sigma(a) = \frac{1}{1 + \exp(-a)}$

slide by Erik Sudderth
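As a quick sketch, the three families can be written directly in NumPy (the centre mu_j and width s are assumed parameters of each basis family):

```python
import numpy as np

def poly_basis(x, j):
    """Polynomial basis: phi_j(x) = x^j."""
    return x ** j

def gaussian_basis(x, mu_j, s):
    """Gaussian basis: phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2))."""
    return np.exp(-((x - mu_j) ** 2) / (2 * s ** 2))

def sigmoid_basis(x, mu_j, s):
    """Sigmoidal basis: phi_j(x) = sigma((x - mu_j) / s), sigma(a) = 1/(1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-(x - mu_j) / s))
```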

slide-10
SLIDE 10

$y(\mathbf{x}, \mathbf{w}) = w_0 + w_1 x_1 + \ldots + w_d x_d = \mathbf{w}^T \mathbf{x}$

$y(\mathbf{x}, \mathbf{w}) = w_0 + w_1 \phi_1(\mathbf{x}) + \ldots + w_m \phi_m(\mathbf{x}) = \mathbf{w}^T \Phi(\mathbf{x})$

(here $w_0$ is the bias)

Two types of linear model that are equivalent with respect to learning

  • The first model has the same number of adaptive coefficients as the

dimensionality of the data +1.

  • The second model has the same number of adaptive coefficients as

the number of basis functions +1.

  • Once we have replaced the data by the outputs of the basis

functions, fitting the second model is exactly the same problem as fitting the first model (unless we use the kernel trick)

11

slide by Erik Sudderth

slide-11
SLIDE 11

General linear regression problem

  • Using our new notation, basis function linear regression can be written as

$y = \sum_{j=0}^{k} w_j \phi_j(\mathbf{x})$

  • where $\phi_j(\mathbf{x})$ can be either $x_j$ (for multivariate regression) or one of the nonlinear bases we defined
  • Once again we can use “least squares” to find the optimal solution.

12

slide by E. P. Xing

slide-12
SLIDE 12

LMS for the general linear regression problem

Our goal is to minimize the following loss function:

$J(\mathbf{w}) = \sum_i \Big( y_i - \sum_j w_j \phi_j(\mathbf{x}_i) \Big)^2$

Moving to vector notation we get:

$J(\mathbf{w}) = \sum_i \big( y_i - \mathbf{w}^T \phi(\mathbf{x}_i) \big)^2$

We take the derivative w.r.t. $\mathbf{w}$:

$\frac{\partial}{\partial \mathbf{w}} \sum_i \big( y_i - \mathbf{w}^T \phi(\mathbf{x}_i) \big)^2 = -2 \sum_i \big( y_i - \mathbf{w}^T \phi(\mathbf{x}_i) \big)\, \phi(\mathbf{x}_i)^T$

Equating to 0 we get:

$\sum_i y_i\, \phi(\mathbf{x}_i)^T = \mathbf{w}^T \sum_i \phi(\mathbf{x}_i)\, \phi(\mathbf{x}_i)^T$

  • $\mathbf{w}$ – vector of dimension k+1
  • $\phi(\mathbf{x}_i)$ – vector of dimension k+1
  • $y_i$ – a scalar

13

slide by E. P. Xing

slide-13
SLIDE 13

14

LMS for the general linear regression problem

We take the derivative w.r.t. $\mathbf{w}$ and equate it to 0 (as before):

$\sum_i y_i\, \phi(\mathbf{x}_i)^T = \mathbf{w}^T \sum_i \phi(\mathbf{x}_i)\, \phi(\mathbf{x}_i)^T$

Define:

$\Phi = \begin{pmatrix} \phi_0(\mathbf{x}_1) & \phi_1(\mathbf{x}_1) & \cdots & \phi_m(\mathbf{x}_1) \\ \phi_0(\mathbf{x}_2) & \phi_1(\mathbf{x}_2) & \cdots & \phi_m(\mathbf{x}_2) \\ \vdots & \vdots & & \vdots \\ \phi_0(\mathbf{x}_n) & \phi_1(\mathbf{x}_n) & \cdots & \phi_m(\mathbf{x}_n) \end{pmatrix}$

Then deriving $\mathbf{w}$ we get:

$\mathbf{w} = (\Phi^T \Phi)^{-1} \Phi^T \mathbf{y}$

slide by E. P. Xing

slide-14
SLIDE 14

LMS for the general linear regression problem

15

$J(\mathbf{w}) = \sum_i \big( y_i - \mathbf{w}^T \phi(\mathbf{x}_i) \big)^2$

Deriving $\mathbf{w}$ we get:

$\mathbf{w} = (\Phi^T \Phi)^{-1} \Phi^T \mathbf{y}$

where $\Phi$ is an n × (k+1) matrix, $\mathbf{y}$ is a vector with n entries, and $\mathbf{w}$ is a vector with k+1 entries. This solution is also known as the “pseudo-inverse”.

slide by E. P. Xing
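A compact sketch of this solution with Gaussian basis functions (the data, the basis centres mu, and the width s are assumed choices); np.linalg.pinv is used because it is numerically safer than forming the inverse of Φ^T Φ explicitly:

```python
import numpy as np

def design_matrix(x, mu, s):
    """Phi[i, j] = phi_j(x_i); column 0 is the constant basis phi_0(x) = 1."""
    gauss = np.exp(-((x[:, None] - mu[None, :]) ** 2) / (2 * s ** 2))
    return np.column_stack([np.ones_like(x), gauss])

# Hypothetical data and (assumed) basis placement
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + 0.1 * np.random.randn(20)
mu = np.linspace(0, 1, 5)            # centres of the Gaussian bases
Phi = design_matrix(x, mu, s=0.2)    # n x (k+1) design matrix

w = np.linalg.pinv(Phi) @ y          # w = (Phi^T Phi)^{-1} Phi^T y via the pseudo-inverse
print(w)
```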

slide-15
SLIDE 15

0th order polynomial

16

slide by Erik Sudderth

slide-16
SLIDE 16

1st order polynomial

17

slide by Erik Sudderth

slide-17
SLIDE 17

3rd order polynomial

18

slide by Erik Sudderth

slide-18
SLIDE 18

9th order polynomial

19

slide by Erik Sudderth

slide-19
SLIDE 19

Which Fit is Best?

20

slide by Sanja Fidler from Bishop

slide-20
SLIDE 20

Root Mean Square (RMS) Error

21

(figures: polynomial fits for M = 0, 1, 3, 9, plotting t against x)

$E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}^2$

$E_{RMS} = \sqrt{2 E(\mathbf{w}^\star)/N}$

The division by N allows us to compare different sizes of data sets on an equal footing, and 
 the square root ensures that ERMS is measured on the same scale (and in the same units) as the target variable t

slide by Erik Sudderth
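In code, assuming predictions y_pred from a fitted model and targets t, this is simply:

```python
import numpy as np

def rms_error(y_pred, t):
    """E_RMS = sqrt(2 E(w*) / N), where E(w) = 1/2 * sum_n (y(x_n, w) - t_n)^2."""
    N = len(t)
    E = 0.5 * np.sum((y_pred - t) ** 2)
    return np.sqrt(2 * E / N)   # equivalently, the square root of the mean squared error
```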

slide-21
SLIDE 21

Root Mean Square (RMS) Error

$E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \big( t_n - \phi(x_n)^T \mathbf{w} \big)^2 = \frac{1}{2} \| \mathbf{t} - \Phi \mathbf{w} \|^2$

22

(figure: training and test E_RMS as a function of the polynomial order M)

slide by Erik Sudderth

slide-22
SLIDE 22

Generalization

  • Generalization = model’s ability to predict the held out data
  • What is happening?
  • Our model with M = 9 overfits the data (it models also noise)

23

slide by Sanja Fidler

(figure: training and test E_RMS as a function of the polynomial order M)

slide-23
SLIDE 23

Generalization

  • Generalization = model’s ability to predict the held out data
  • What is happening?
  • Our model with M = 9 overfits the data (it models also noise)
  • Not a problem if we have lots of training examples

24

slide by Sanja Fidler

slide-24
SLIDE 24

Generalization

  • Generalization = model’s ability to predict the held out data
  • What is happening?
  • Our model with M = 9 overfits the data (it models also noise)
  • Let’s look at the estimated weights for various M in the case of

fewer examples

25

slide by Sanja Fidler

slide-25
SLIDE 25

Generalization

  • Generalization = model’s ability to predict the held out data
  • What is happening?
  • Our model with M = 9 overfits the data (it models also noise)
  • Let’s look at the estimated weights for various M in the case of fewer examples
  • The weights are becoming huge to compensate for the noise
  • One way of dealing with this is to encourage the weights to be

small (this way no input dimension will have too much influence on prediction). This is called regularization.

26

slide by Sanja Fidler

slide-26
SLIDE 26

1-D regression illustrates key concepts

  • Data fits – is linear model best (model selection)?

− Simplest models do not capture all the important variations (signal) in the data: underfit − More complex model may overfit the training data 
 (fit not only the signal but also the noise in the data), especially if not enough data to constrain model

  • One method of assessing fit:

− test generalization = model’s ability to predict 
 the held out data

  • Optimization is essential: stochastic and batch

iterative approaches; analytic when available

27

slide by Richard Zemel

slide-27
SLIDE 27

Regularized Least Squares

  • A technique to control the overfitting phenomenon
  • Add a penalty term to the error function in order to

discourage the coefficients from reaching large values

28

$E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}^2 + \frac{\lambda}{2} \| \mathbf{w} \|^2$

where $\| \mathbf{w} \|^2 \equiv \mathbf{w}^T \mathbf{w} = w_0^2 + w_1^2 + \ldots + w_M^2$, and the coefficient $\lambda$ governs the relative importance of the regularization term compared with the sum-of-squares error.

This is known as ridge regression, which is minimized by $\mathbf{w} = (\lambda I + \Phi^T \Phi)^{-1} \Phi^T \mathbf{t}$.

slide by Erik Sudderth
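A minimal sketch of the regularized (ridge) solution, assuming a polynomial design matrix built as before; note that in practice the bias weight w0 is often left unpenalized, which this sketch ignores for brevity:

```python
import numpy as np

def ridge_fit(Phi, t, lam):
    """Minimize 1/2 ||t - Phi w||^2 + lam/2 ||w||^2  ->  w = (lam I + Phi^T Phi)^{-1} Phi^T t."""
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

# Hypothetical example: a 9th-order polynomial fit with and without regularization
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + 0.1 * np.random.randn(10)
Phi = np.vander(x, 10, increasing=True)
w_unreg = ridge_fit(Phi, t, lam=0.0)          # tends to produce huge weights (overfitting)
w_reg = ridge_fit(Phi, t, lam=np.exp(-18))    # ln(lambda) = -18, as in the slides
```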

slide-28
SLIDE 28

The effect of regularization

29

(figures: M = 9 polynomial fits with ln λ = −18 and ln λ = 0, plotting t against x)

slide by Erik Sudderth

slide-29
SLIDE 29

(figure: training and test E_RMS as a function of ln λ)

The effect of regularization

30

            ln λ = −∞     ln λ = −18    ln λ = 0
w⋆0              0.35          0.35        0.13
w⋆1            232.37          4.74       −0.05
w⋆2          −5321.83         −0.77       −0.06
w⋆3          48568.31        −31.97       −0.05
w⋆4        −231639.30         −3.89       −0.03
w⋆5         640042.26         55.28       −0.02
w⋆6       −1061800.52         41.32       −0.01
w⋆7        1042400.18        −45.95       −0.00
w⋆8        −557682.99        −91.53        0.00
w⋆9         125201.43         72.68        0.01

The corresponding coefficients from the fitted polynomials, showing that regularization has the desired effect of reducing the magnitude of the coefficients.

slide by Erik Sudderth

slide-30
SLIDE 30

A more general regularizer

31

$\frac{1}{2} \sum_{n=1}^{N} \{ t_n - \mathbf{w}^T \phi(\mathbf{x}_n) \}^2 + \frac{\lambda}{2} \sum_{j=1}^{M} |w_j|^q$

(figure: contours of the regularization term for q = 0.5, q = 1, q = 2, q = 4)

slide by Richard Zemel

slide-31
SLIDE 31

Machine Learning 
 Methodology

32

slide-32
SLIDE 32

Recap: Regression

  • In regression, labels yi are

continuous

  • Classification/regression are

solved very similarly

  • Everything we have done so

far transfers to classification with very minor changes

  • Error: sum of distances from

examples to the fitted model

33

slide by Olga Veksler


slide-33
SLIDE 33

Training/Test Data Split

  • Talked about splitting data in training/test sets
  • training data is used to fit parameters
  • test data is used to assess how classifier generalizes

to new data

  • What if classifier has “non‐tunable” parameters?
  • a parameter is “non‐tunable” if tuning (or training) it on the training data leads to overfitting
  • Examples:
  • k in kNN classifier
  • number of hidden units in MNN
  • number of hidden layers in MNN
  • etc …

34

slide by Olga Veksler

slide-34
SLIDE 34

Example of Overfitting

  • Want to fit a polynomial machine f (x,w)
  • Instead of fixing polynomial degree, 


make it parameter d

  • learning machine f (x,w,d)
  • Consider just three choices for d
  • degree 1
  • degree 2
  • degree 3 

  • Training error is a bad measure to choose d

− degree 3 is the best according to the training error, but overfits

the data

35

slide by Olga Veksler

x y

slide-35
SLIDE 35

Training/Test Data Split

  • What about test error? Seems appropriate

− degree 2 is the best model according to the test error

  • Except what do we report as the test error now?
  • Test error should be computed on data that was not used for

training at all!

  • Here used “test” data for training, i.e. choosing model

36

slide by Olga Veksler

slide-36
SLIDE 36

Validation data

  • Same question when choosing among several classifiers
  • our polynomial degree example can be looked at as

choosing among 3 classifiers (degree 1, 2, or 3)

  • Solution: split the labeled data into three parts

37

slide by Olga Veksler

Training ≈ 60% Validation ≈ 20% Test ≈ 20%

Training set: used to train the tunable parameters w. Validation set: used to train other parameters, or to select the classifier. Test set: used only to assess final performance.

labeled data
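A minimal sketch of such a split (a random 60/20/20 partition; the proportions follow the slide and the seed is an arbitrary assumption):

```python
import numpy as np

def train_val_test_split(X, y, val_frac=0.2, test_frac=0.2, seed=0):
    """Randomly split labeled data into training / validation / test parts."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(test_frac * len(X))
    n_val = int(val_frac * len(X))
    test, val, train = idx[:n_test], idx[n_test:n_test + n_val], idx[n_test + n_val:]
    return (X[train], y[train]), (X[val], y[val]), (X[test], y[test])
```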

slide-37
SLIDE 37

Training/Validation

38

slide by Olga Veksler

Training ≈ 60% Validation ≈ 20% Test ≈ 20%

Training error: computed on training examples. Validation error: computed on validation examples. Test error: computed on test examples.

labeled data

slide-38
SLIDE 38

Training/Validation/Test Data

  • Training Data
  • Validation Data
  • d = 2 is chosen
  • Test Data
  • 1.3 test error computed for d = 2

39

slide by Olga Veksler

validation errors: 3.3 (d = 1), 1.8 (d = 2), 3.4 (d = 3)

slide-39
SLIDE 39

Choosing Parameters: Example

  • Need to choose number of hidden units for a MNN
  • The more hidden units, the better can fit training data
  • But at some point we overfit the data

40

slide by Olga Veksler

(figure: training and validation error as a function of the number of basis functions)

slide-40
SLIDE 40

Diagnosing Underfitting/Overfitting

41

slide by Olga Veksler

Underfitting

  • large training error
  • large validation error

Just Right

  • small training error
  • small validation error

Overfitting

  • small training error
  • large validation error
slide-41
SLIDE 41

Fixing Underfitting/Overfitting

  • Fixing Underfitting
  • getting more training examples will not help
  • get more features
  • try a more complex classifier
  • if using MNN, try more hidden units

  • Fixing Overfitting
  • getting more training examples might help
  • try a smaller set of features
  • try a less complex classifier
  • if using MNN, try fewer hidden units

42

slide by Olga Veksler

slide-42
SLIDE 42

Train/Test/Validation Method

  • Good news
  • Very simple

  • Bad news:
  • Wastes data
  • in general, the more data we have, the better are the estimated

parameters

  • we estimate parameters on 40% less data, since 20% removed

for test and 20% for validation data

  • If we have a small dataset our test (validation) set might just

be lucky or unlucky

  • Cross Validation is a method for performance evaluation

that wastes less data

43

slide by Olga Veksler

slide-43
SLIDE 43

Small Dataset

44

slide by Olga Veksler

Linear model: Mean Squared Error = 2.4. Quadratic model: Mean Squared Error = 0.9. Join-the-dots model: Mean Squared Error = 2.2.

(figures: the three fits plotted as y against x)

slide-44
SLIDE 44

LOOCV (Leave-one-out Cross Validation)

45

slide by Olga Veksler

For k = 1 to n

  • 1. Let (xk,yk) be the kth example
  • 2. Temporarily remove (xk,yk)

from the dataset

  • 3. Train on the remaining n-1

examples

  • 4. Note your error on (xk,yk)

When you’ve done all points, report the mean error
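A sketch of the procedure, assuming a 1-D dataset (x, y) and a polynomial model of a fixed degree as the learner:

```python
import numpy as np

def loocv_mse(x, y, degree):
    """Leave-one-out cross-validation MSE for a polynomial of the given degree."""
    errors = []
    for k in range(len(x)):
        mask = np.arange(len(x)) != k               # temporarily remove example k
        w = np.polyfit(x[mask], y[mask], degree)    # train on the remaining n-1 examples
        y_hat = np.polyval(w, x[k])                 # predict the held-out point
        errors.append((y_hat - y[k]) ** 2)          # note the error on (x_k, y_k)
    return np.mean(errors)                          # report the mean error
```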


slide-49
SLIDE 49

LOOCV (Leave-one-out Cross Validation)

50

slide by Olga Veksler

MSE_LOOCV = 2.12

(figures: the leave-one-out fits for the linear model)

slide-50
SLIDE 50

LOOCV for Quadratic Regression

51

slide by Olga Veksler

MSE_LOOCV = 0.96

(figures: the leave-one-out fits for the quadratic model)

slide-51
SLIDE 51

LOOCV for Join the Dots

52

slide by Olga Veksler

MSE_LOOCV = 3.33

(figures: the leave-one-out fits for the join-the-dots model)

slide-52
SLIDE 52

Which kind of Cross Validation?

  • Can we get the best of both worlds?

53

  • Test set. Downside: may give an unreliable estimate of future performance. Upside: cheap.
  • Leave-one-out. Downside: expensive. Upside: doesn’t waste data.

slide by Olga Veksler
slide-53
SLIDE 53

K-Fold Cross Validation

54

  • Randomly break the dataset into k partitions
  • In this example, we have k=3 partitions

colored red green and blue

slide by Olga Veksler

x y

slide-54
SLIDE 54

K-Fold Cross Validation

55

  • Randomly break the dataset into k partitions
  • In this example, we have k=3 partitions

colored red green and blue

  • For the blue partition: train on all points not

in the blue partition. Find test‐set sum of errors on blue points

slide by Olga Veksler

x y

slide-55
SLIDE 55

K-Fold Cross Validation

56

  • Randomly break the dataset into k partitions
  • In this example, we have k=3 partitions

colored red green and blue

  • For the blue partition: train on all points not

in the blue partition. Find test‐set sum of errors on blue points

  • For the green partition: train on all points not

in green partition. Find test‐set sum of errors on green points

slide by Olga Veksler

x y

slide-56
SLIDE 56

K-Fold Cross Validation

57

  • Randomly break the dataset into k partitions
  • In this example, we have k=3 partitions

colored red green and blue

  • For the blue partition: train on all points not

in the blue partition. Find test‐set sum of errors on blue points

  • For the green partition: train on all points not

in green partition. Find test‐set sum of errors on green points

  • For the red partition: train on all points not in

red partition. Find the test‐set sum of errors on red points

slide by Olga Veksler

x y

slide-57
SLIDE 57

K-Fold Cross Validation

58

  • Randomly break the dataset into k partitions
  • In this example, we have k=3 partitions

colored red green and blue

  • For the blue partition: train on all points not

in the blue partition. Find test‐set sum of errors on blue points

  • For the green partition: train on all points not

in green partition. Find test‐set sum of errors on green points

  • For the red partition: train on all points not in

red partition. Find the test‐set sum of errors on red points
  • Report the mean error

slide by Olga Veksler

x y

Linear Regression
 MSE3FOLD = 2.05
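A sketch of the k-fold procedure for the same polynomial learners (k = 3 as in the example; the random partition and seed are arbitrary assumptions):

```python
import numpy as np

def kfold_mse(x, y, degree, k=3, seed=0):
    """k-fold cross-validation MSE for a polynomial of the given degree."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)   # randomly break data into k partitions
    errors = []
    for fold in folds:
        train = np.setdiff1d(np.arange(len(x)), fold)     # train on all points not in this fold
        w = np.polyfit(x[train], y[train], degree)
        errors.append(np.mean((np.polyval(w, x[fold]) - y[fold]) ** 2))
    return np.mean(errors)                                # report the mean error
```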

slide-58
SLIDE 58

K-Fold Cross Validation

59

  • Randomly break the dataset into k partitions
  • In this example, we have k=3 partitions

colored red green and blue

  • For the blue partition: train on all points not

in the blue partition. Find test‐set sum of errors on blue points

  • For the green partition: train on all points not

in green partition. Find test‐set sum of errors on green points

  • For the red partition: train on all points not in

red partition. Find the test‐set sum of errors on red points
  • Report the mean error

slide by Olga Veksler

Quadratic Regression
 MSE3FOLD = 1.1

x y

slide-59
SLIDE 59

K-Fold Cross Validation

60

  • Randomly break the dataset into k partitions
  • In this example, we have k=3 partitions

colored red green and blue

  • For the blue partition: train on all points not

in the blue partition. Find test‐set sum of errors on blue points

  • For the green partition: train on all points not

in green partition. Find test‐set sum of errors on green points

  • For the red partition: train on all points not in

red partition. Find the test‐set sum of errors on red points
  • Report the mean error

slide by Olga Veksler

Join the dots
 MSE3FOLD = 2.93

x y

slide-60
SLIDE 60

Which kind of Cross Validation?

61

  • Test set. Downside: may give an unreliable estimate of future performance. Upside: cheap.
  • Leave-one-out. Downside: expensive. Upside: doesn’t waste data.
  • 10-fold. Downside: wastes 10% of the data and is 10 times more expensive than a test set. Upside: only wastes 10%, and is only 10 times more expensive instead of n times.
  • 3-fold. Downside: wastes more data than 10-fold and is more expensive than a test set. Upside: slightly better than a test set.
  • N-fold. Identical to leave-one-out.

slide by Olga Veksler

slide-61
SLIDE 61

Cross-validation for classification

  • Instead of computing the sum squared

errors on a test set, you should compute...

62

slide by Andrew Moore

slide-62
SLIDE 62

Cross-validation for classification

  • Instead of computing the sum squared

errors on a test set, you should compute…

The total number of misclassifications on a test set

63

slide by Andrew Moore

slide-63
SLIDE 63

Cross-validation for classification

  • Instead of computing the sum squared

errors on a test set, you should compute…

The total number of misclassifications on a test set

64

  • What’s LOOCV of 1-NN?
  • What’s LOOCV of 3-NN?
  • What’s LOOCV of 22-NN?

slide by Andrew Moore

slide-64
SLIDE 64

Cross-validation for classification

  • Choosing k for k‐nearest neighbors
  • Choosing Kernel parameters for SVM
  • Any other “free” parameter of a classifier
  • Choosing Features to use
  • Choosing which classifier to use

65

slide by Andrew Moore

slide-65
SLIDE 65

CV-based Model Selection

  • We’re trying to decide which algorithm to use.
  • We train each machine and make a table...

66

(table: candidate learners f1, f2, f3, f4, f5, f6 and a column for each one’s training error)

slide by Olga Veksler

slide-66
SLIDE 66

CV-based Model Selection

  • We’re trying to decide which algorithm to use.
  • We train each machine and make a table...

67

(table: candidate learners f1 … f6 with their training error and 10-fold CV error)

slide by Olga Veksler

slide-67
SLIDE 67

CV-based Model Selection

  • We’re trying to decide which algorithm to use.
  • We train each machine and make a table...

68

(table: candidate learners f1 … f6 with their training error, 10-fold CV error, and the final choice)

slide by Olga Veksler

slide-68
SLIDE 68

CV-based Model Selection

  • Example: Choosing “k” for a k‐nearest‐neighbor regression.
  • Step 1: Compute LOOCV error for six different model classes:

69

(table: algorithms k = 1 … k = 6 with their training error, 10-fold CV error, and the final choice)

  • Step 2: Choose model that gave the best CV score
  • Train with all the data, and that’s the final model you’ll use

slide by Olga Veksler
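A sketch of the two steps for k-NN regression on 1-D data, reusing a leave-one-out loop (the brute-force distance computation and the averaging of neighbour targets are simplifying assumptions):

```python
import numpy as np

def knn_predict(x_train, y_train, x_query, k):
    """Plain k-nearest-neighbour regression: average the targets of the k closest points."""
    idx = np.argsort(np.abs(x_train - x_query))[:k]
    return y_train[idx].mean()

def loocv_knn(x, y, k):
    """LOOCV mean squared error for a k-NN regressor."""
    errs = [(knn_predict(np.delete(x, i), np.delete(y, i), x[i], k) - y[i]) ** 2
            for i in range(len(x))]
    return np.mean(errs)

# Step 1: compute the LOOCV error for each candidate model class k = 1 .. 6
# Step 2: choose the k with the best CV score, then retrain on all the data, e.g.
# best_k = min(range(1, 7), key=lambda k: loocv_knn(x, y, k))
```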

slide-69
SLIDE 69

CV-based Model Selection

  • Why stop at k=6?
  • No good reason, except it looked like things were getting

worse as K was increasing

  • Are we guaranteed that a local optimum of K vs LOOCV

will be the global optimum?

  • No, in fact the relationship can be very bumpy
  • What should we do if we are depressed at the expense of doing LOOCV for k = 1 through 1000?
  • Try: k=1, 2, 4, 8, 16, 32, 64, ... ,1024
  • Then do hillclimbing from an initial guess at k

70

slide by Olga Veksler

slide-70
SLIDE 70

Learning Theory: 
 Why ML Works

71

slide-71
SLIDE 71

Computational Learning 
 Theory

  • Entire subfield devoted to the 


mathematical analysis of machine 
 learning algorithms

  • Has led to several practical methods:

− PAC (probably approximately correct) learning 


→ boosting

− VC (Vapnik–Chervonenkis) theory 


→ support vector machines 


72

slide by Eric Eaton

(Annual conference: Conference on Learning Theory, COLT)

slide-72
SLIDE 72

Computational Learning Theory

  • Is learning always possible?
  • How many training examples will I need to do a

good job learning?

  • Is my test performance going to be much worse

than my training performance?

73

The key idea that underlies all these answers is that simple functions generalize well.

adapted from Hal Daume III

slide-73
SLIDE 73

The Role of Theory

  • Theory can serve two roles:

− It can justify and help understand why

common practice works.

− It can also serve to suggest new algorithms

and approaches that turn out to work well in practice.

74

adapted from Hal Daume III

(figure: “theory before” vs. “theory after” practice)

Often, it turns out to be a mix!

slide-74
SLIDE 74

The Role of Theory

  • Practitioners discover something that works

surprisingly well.

  • Theorists figure out why it works and prove

something about it.

− In the process, they make it better or find new

algorithms.

  • Theory can also help you understand what’s

possible and what’s not possible.

75

adapted from Hal Daume III

slide-75
SLIDE 75

Induction is Impossible

  • From an algorithmic perspective, a natural question is

− whether there is an “ultimate” learning algorithm, Aawesome,

that solves the Binary Classification problem.

  • Have you been wasting your time learning about KNN and other methods (Perceptron and decision trees), when Aawesome is out there?

  • What would such an ultimate learning algorithm do?

− Take in a data set D and produce a function f. − No matter what D looks like, this function f should get perfect

classification on all future examples drawn from the same distribution that produced D.

76

adapted from Hal Daume III

slide-76
SLIDE 76

Induction is Impossible

  • From an algorithmic perspective, a natural question is

− whether there is an “ultimate” learning algorithm, Aawesome,

that solves the Binary Classification problem.

  • Have you been wasting your time learning about KNN and other methods (Perceptron and decision trees), when Aawesome is out there?

  • What would such an ultimate learning algorithm do?

− Take in a data set D and produce a function f. − No matter what D looks like, this function f should get perfect

classification on all future examples drawn from the same distribution that produced D.

77

adapted from Hal Daume III

Impossible

slide-77
SLIDE 77

Label Noise

  • Let X = {−1, +1} (i.e., a one-dimensional, binary distribution)

− 80% of data points in this distribution have x = y and 20% don’t.

  • No matter what function your learning algorithm produces,

there’s no way that it can do better than 20% error on this data.

− No Aawesome exists that always achieves an error rate of zero. − The best that we can hope is that the error rate is not “too

large.”

78

adapted from Hal Daume III

D(⟨+1⟩, +1) = 0.4
D(⟨+1⟩, −1) = 0.1
D(⟨−1⟩, −1) = 0.4
D(⟨−1⟩, +1) = 0.1

slide-78
SLIDE 78

Sampling

  • Another source of difficulty comes from the fact that the only access we have to the data distribution is through sampling.

− When trying to learn about a distribution, you only get to

see data points drawn from that distribution.

− You know that “eventually” you will see enough data points

that your sample is representative of the distribution, but it might not happen immediately.

  • For instance, even though a fair coin will come up heads only with probability 1/2, it’s completely plausible that in

a sequence of four coin flips you never see a tails, or perhaps only see one tails.

79

adapted from Hal Daume III

slide-79
SLIDE 79

Induction is Impossible

  • We need to understand that Aawesome will not always

work.

− In particular, if we happen to get a lousy sample of

data from D, we need to allow Aawesome to do something completely unreasonable.

  • We cannot hope that Aawesome will do perfectly, every

time.

80

adapted from Hal Daume III

The best we can reasonably hope of Aawesome is that it will do pretty well, most of the time.

slide-80
SLIDE 80

Probably Approximately Correct 
 (PAC) Learning

  • A formalism based on the realization that

the best we can hope of an algorithm is that

− It does a good job most of the time (probably

approximately correct)

81

adapted from Hal Daume III

slide-81
SLIDE 81

Probably Approximately Correct 
 (PAC) Learning

  • Consider a hypothetical learning algorithm

− We have 10 different binary classification data sets. − For each one, it comes back with functions f1, f2, . . . , f10.

✦ For some reason, whenever you run f4 on a test point, it

crashes your computer. For the other learned functions, their performance on test data is always at most 5% error.

✦ If this situation is guaranteed to happen, then this

hypothetical learning algorithm is a PAC learning algorithm.

✤ It satisfies probably because it only failed in one out of

ten cases, and it’s approximate because it achieved low, but non-zero, error on the remainder of the cases.

82

adapted from Hal Daume III

slide-82
SLIDE 82

PAC Learning

83

adapted from Hal Daume III

Definition 1. An algorithm A is an (ε, δ)-PAC learning algorithm if, for all distributions D: given samples from D, the probability that it returns a “bad function” is at most δ; where a “bad” function is one with a test error rate of more than ε on D.

slide-83
SLIDE 83

PAC Learning

84

adapted from Hal Daume III

Definition: An algorithm A is an efficient (ε, δ)-PAC learning algorithm if it is an (ε, δ)-PAC learning algorithm whose runtime is polynomial in 1/ε and 1/δ.

In other words, if you want your algorithm to achieve 4% error rather than 5%, the runtime required to do so should not go up by an exponential factor!

  • Two notions of efficiency

− Computational complexity: Prefer an algorithm that runs quickly

to one that takes forever

− Sample complexity: The number of examples required for your

algorithm to achieve its goals

slide-84
SLIDE 84

Example: PAC Learning of Conjunctions

  • Data points are binary vectors, for instance x = ⟨0, 1, 1, 0, 1⟩
  • Some Boolean conjunction defines the true labeling of this data 


(e.g. x1 ⋀ x2 ⋀ x5)

  • There is some distribution DX over binary data points (vectors) 


x = ⟨x1, x2, . . . , xD⟩.

  • There is a fixed concept conjunction c that we are trying to learn.
  • There is no noise, so for any example x, its true label is simply 


y = c(x)

  • Example:

− Clearly, the true formula cannot 


include the terms x1 , x2, ¬x3, ¬x4 
 


85

adapted from Hal Daume III

 y   x1  x2  x3  x4
+1    0   0   1   1
+1    0   1   1   1
−1    1   1   0   1

Table 10.1: Data set for learning conjunctions.

slide-85
SLIDE 85

Example: PAC Learning of Conjunctions

f0(x) = x1 ⋀ ¬x1 ⋀ x2 ⋀ ¬x2 ⋀ x3 ⋀ ¬x3 ⋀ x4 ⋀ ¬x4
f1(x) = ¬x1 ⋀ ¬x2 ⋀ x3 ⋀ x4
f2(x) = ¬x1 ⋀ x3 ⋀ x4
f3(x) = ¬x1 ⋀ x3 ⋀ x4

  • After processing an example, it is guaranteed to classify that

example correctly (provided that there is no noise)

  • Computationally very efficient

− Given a data set of N examples in D dimensions, it takes O (ND)

time to process the data. This is linear in the size of the data set.

86

Algorithm 30 BinaryConjunctionTrain(D)
 1: f ← x1 ∧ ¬x1 ∧ x2 ∧ ¬x2 ∧ · · · ∧ xD ∧ ¬xD   // initialize function
 2: for all positive examples (x, +1) in D do
 3:   for d = 1 . . . D do
 4:     if xd = 0 then
 5:       f ← f without term “xd”
 6:     else
 7:       f ← f without term “¬xd”
 8:     end if
 9:   end for
10: end for
11: return f

adapted from Hal Daume III

 y   x1  x2  x3  x4
+1    0   0   1   1
+1    0   1   1   1
−1    1   1   0   1

Table 10.1: Data set for learning conjunctions.

“Throw Out Bad Terms”
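As a sanity check, here is a direct Python transcription of Algorithm 30 (the set-of-literals representation, with 0-based indices such as ('x', 2) for x3 and ('not_x', 2) for ¬x3, is an assumption made for illustration):

```python
def binary_conjunction_train(data):
    """data: list of (x, y) pairs, x a tuple of 0/1 values, y in {+1, -1}."""
    D = len(data[0][0])
    # initialize f to the conjunction of every literal: x1 ^ ~x1 ^ ... ^ xD ^ ~xD
    f = {('x', d) for d in range(D)} | {('not_x', d) for d in range(D)}
    for x, y in data:
        if y != +1:                      # only positive examples are used
            continue
        for d in range(D):
            if x[d] == 0:
                f.discard(('x', d))      # throw out the term "xd"
            else:
                f.discard(('not_x', d))  # throw out the term "~xd"
    return f

# Running it on Table 10.1 yields {('not_x', 0), ('x', 2), ('x', 3)}, i.e. ~x1 ^ x3 ^ x4.
data = [((0, 0, 1, 1), +1), ((0, 1, 1, 1), +1), ((1, 1, 0, 1), -1)]
print(binary_conjunction_train(data))
```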

slide-86
SLIDE 86
  • Is this an efficient (ε, δ)-PAC learning algorithm?
  • What about sample complexity?

− How many examples N do you need to see in order to guarantee that it achieves an error rate of at most ε (in all but δ-many cases)?

− Perhaps N has to be gigantic (like 2^(2D)/ε) to (probably) guarantee a small error.

87

adapted from Hal Daume III

Example: PAC Learning of Conjunctions

(Algorithm 30 and Table 10.1 are repeated from the previous slide, annotated “Throw Out Bad Terms”.)

slide-87
SLIDE 87
  • Prove that the number of samples N required to (probably) achieve a small error is not too big.
  • Sketch of the proof:

− Say there is some term (say ¬x8) that should have been thrown out, but wasn’t.

− If this was the case, then you must not have seen any positive training examples with ¬x8 = 0.

− So examples with ¬x8 = 0 must have low probability (otherwise you would have seen them). So such a thing is not that common.

88

adapted from Hal Daume III

Example: PAC Learning of Conjunctions

(Algorithm 30 and Table 10.1 are repeated from the previous slides, annotated “Throw Out Bad Terms”.)

slide-88
SLIDE 88

Occam’s Razor

  • Simple solutions generalize well
  • The hypothesis class H, is the set of all boolean formulae over D-

many variables.

− The hypothesis class for Boolean conjunctions is finite; the

hypothesis class for linear classifiers is infinite.

− For Occam’s razor, we can only work with finite hypothesis classes.

89

adapted from Hal Daume III

William of Occam 
 (c. 1288 – c. 1348)

“If one can explain a phenomenon without assuming this or that hypothetical entity, then there is no ground for assuming it i.e. that one should always opt for an explanation in terms of the fewest possible number of causes, factors, or variables.”

Theorem 14 (Occam’s Bound). Suppose A is an algorithm that learns a function f from some finite hypothesis class H. Suppose the learned function always gets zero error on the training data. Then, the sample complexity of f is at most log |H|.

slide-89
SLIDE 89

Complexity of Infinite 
 Hypothesis Spaces

  • Occam’s Bound is completely useless when |H| = ∞
  • In our example, instead of representing your hypothesis as 


a Boolean conjunction, represent it as a conjunction of inequalities.

− Instead of having x1 ∧ ¬x2 ∧ x5, you have 


[x1 > 0.2] ∧ [x2 < 0.77] ∧ [x5 < π/4]

− In this representation, for each feature, you need to choose

an inequality (< or >) and a threshold.

− Since the thresholds can be arbitrary real values, there are

now infinitely many possibilities: |H| = 2D×∞ = ∞

90

adapted from Hal Daume III

slide-90
SLIDE 90

Vapnik-Chervonenkis 
 (VC) Dimension

  • A classic measure of complexity of infinite hypothesis classes

based on this intuition.

  • The VC dimension is a very classification-oriented notion of

complexity

− The idea is to look at a finite set of unlabeled examples − no matter how these points were labeled, would we be able to

find a hypothesis that correctly classifies them

  • The idea is that as you add more points, being able to

represent an arbitrary labeling becomes harder and harder.

91

adapted from Hal Daume III

Definition 2. For data drawn from some space X, the VC dimension of a hypothesis space H over X is the maximal K such that: there exists a set X ⊆ X of size |X| = K, such that for all binary labelings of X, there exists a function f ∈ H that matches this labeling.

slide-91
SLIDE 91

VC Dimension Example

  • The first 3 examples show that the class of lines in the plane can

shatter 3 points.

  • However, the last example shows that this class cannot shatter 4

points.

  • Hence the VC dimension of the class of straight lines in the plane is 3.
  • Note that a class of nonlinear curves could shatter four points, and

hence has VC dimension greater than 3.

92

adapted from Trevor Hastie, Robert Tibshirani, Jerome Friedman