

SLIDE 1

Lecture 4:

− Linear Regression (cont’d.)
− Optimization
− Generalization
− Model complexity
− Regularization

Aykut Erdem

October 2017 Hacettepe University

SLIDE 2

Administrative

  • Assignment 1 is out!
  • It is due October 20 (i.e. in two weeks).
  • It includes

− Pencil-and-paper derivations
− Implementing a kNN classifier
− numpy/Python code

SLIDE 3

Classifying Bird Species


  • Caltech-UCSD Birds 200 dataset (200 bird species)

− 5033 training and 1000 test images

  • You may want to split the training set into train and validation (more on this next week)
  • Do not use test data for training or parameter tuning
  • Features:

− Attributes
− Color histogram
− HOG features
− Deep CNN features

  • Report performance on test data

adapted from Sanja Fidler

Hooded Oriole (Icterus cucullatus)

SLIDE 4

Recall from last time… Kernel Regression

1-NN for Regression

[Figure: for each query point x, the 1-NN prediction is the target value of the closest training example.]

  • Given training pairs (y_1, z_1), …, (y_n, z_n), with inputs y_j ∈ Y and targets z_j ∈ ℜ, and a distance function L : Y × Y → ℜ, the 1-NN prediction at a query y′ is the target z_j of the training point that minimizes L(y_j, y′)

Weighted K-NN for Regression

Distance metric (Minkowski): D = ( Σ_{i=1}^{n} |x_i − y_i|^p )^{1/p}

Kernel weights: w_i = exp(−d(x_i, query)² / σ²), where σ is the kernel width

SLIDE 5

Recall from last time… Least-Squares Regression


slide by Sanja Fidler

y(x) = function(x, w)

SLIDE 6

Recall from last time… Least-Squares Regression

  • Define a model
    Linear: y(x) = function(x, w)
  • Standard loss/cost/objective function measures the squared error between y and the true value t
  • For a particular hypothesis (y(x), defined by a choice of w, drawn in red), what does the loss represent geometrically?

slide by Sanja Fidler

SLIDE 7

Recall from last time… Least-Squares Regression

  • Define a model
    Linear: y(x) = w0 + w1x
  • Standard loss/cost/objective function measures the squared error between y and the true value t:

    ℓ(w) = Σ_{n=1}^{N} [t^(n) − y(x^(n))]²

  • For a particular hypothesis (y(x), defined by a choice of w, drawn in red), what does the loss represent geometrically?

slide by Sanja Fidler

SLIDE 8

Recall from last time… Least-Squares Regression

  • Define a model
    Linear: y(x) = w0 + w1x
  • Standard loss/cost/objective function measures the squared error between y and the true value t
    Linear model: ℓ(w) = Σ_{n=1}^{N} [t^(n) − (w0 + w1 x^(n))]²
  • The loss for the red hypothesis is the sum of the squared vertical errors (squared lengths of green vertical lines)

slide by Sanja Fidler

SLIDE 9

Recall from last time… Least-Squares Regression

  • Define a model
    Linear: y(x) = w0 + w1x
  • Standard loss/cost/objective function measures the squared error between y and the true value t
    Linear model: ℓ(w) = Σ_{n=1}^{N} [t^(n) − (w0 + w1 x^(n))]²
  • How do we obtain the weights w = (w0, w1)?

slide by Sanja Fidler

SLIDE 10

Recall from last time… Least-Squares Regression

  • Define a model
    Linear: y(x) = w0 + w1x
  • Standard loss/cost/objective function measures the squared error between y and the true value t
    Linear model: ℓ(w) = Σ_{n=1}^{N} [t^(n) − (w0 + w1 x^(n))]²
  • How do we obtain the weights w = (w0, w1)? Find the w that minimizes the loss ℓ(w)

slide by Sanja Fidler

SLIDE 11

Optimizing the Objective

  • One straightforward method: gradient descent
    − initialize w (e.g., randomly)
    − repeatedly update w based on the gradient: w ← w − λ ∂ℓ/∂w
  • λ is the learning rate
  • For a single training case, this gives the LMS update rule:

    w ← w + 2λ (t^(n) − y(x^(n))) x^(n)

    where (t^(n) − y(x^(n))) is the error
  • Note: as the error approaches zero, so does the update (w stops changing)

slide by Sanja Fidler
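As a concrete illustration of these updates, here is a minimal numpy sketch of batch gradient descent for y(x) = w0 + w1x; the averaged update, the learning rate, and the iteration count are illustrative assumptions.

```python
import numpy as np

def fit_linear_gd(x, t, lam=0.1, n_iters=1000):
    """Gradient descent for y(x) = w0 + w1*x under squared-error loss (sketch)."""
    w0, w1 = 0.0, 0.0                 # initialize w (here: zeros)
    for _ in range(n_iters):
        err = t - (w0 + w1 * x)       # errors t^(n) - y(x^(n))
        # LMS updates, averaged over the batch so the step size is scale-free in N
        w0 += 2 * lam * np.mean(err)
        w1 += 2 * lam * np.mean(err * x)
    return w0, w1
```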

SLIDE 12

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

Optimizing the Objective

SLIDE 13

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

Optimizing the Objective

SLIDE 14

Effect of learning rate λ

  • Large λ => fast convergence, but larger residual error; also possible oscillations
  • Small λ => slow convergence, but small residual error

slide by Erik Sudderth

[Figure: ℓ(w) plotted against w0 for large and small learning rates.]

SLIDE 15

Optimizing Across Training Set

  • Two ways to generalize this for all examples in the training set:
  • 1. Batch updates: sum or average updates across every example n, then change the parameter values
  • 2. Stochastic/online updates: update the parameters for each training case in turn, according to its own gradients


Algorithm 1 Stochastic gradient descent
1: Randomly shuffle examples in the training set
2: for i = 1 to N do
3:   Update: w ← w + 2λ(t^(i) − y(x^(i))) x^(i)   (update for a linear model)
4: end for

slide by Sanja Fidler

w ← w + 2λ (t^(n) − y(x^(n))) x^(n)
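A matching numpy sketch of one epoch of Algorithm 1; the multi-dimensional model y(x) = w^T x, the function name, and the seeded generator are illustrative assumptions.

```python
import numpy as np

def sgd_epoch(X, t, w, lam=0.01, seed=0):
    """One stochastic gradient descent epoch for a linear model y(x) = w^T x (sketch)."""
    rng = np.random.default_rng(seed)
    for i in rng.permutation(len(t)):    # 1: randomly shuffle the examples
        err = t[i] - X[i] @ w            #    error t^(i) - y(x^(i))
        w = w + 2 * lam * err * X[i]     # 3: LMS update for this single training case
    return w
```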

SLIDE 16

Optimizing Across Training Set

  • Two ways to generalize this for all examples in the training set:
  • 1. Batch updates: sum or average updates across every example n, then change the parameter values
  • 2. Stochastic/online updates: update the parameters for each training case in turn, according to its own gradients

slide by Sanja Fidler

w ← w + 2λ (t^(n) − y(x^(n))) x^(n)

  • Underlying assumption: the samples are independent and identically distributed (i.i.d.)

SLIDE 17

Analytical Solution

  • For some objectives, we can also find the optimal solution analytically
  • This is the case for linear least-squares regression
  • How?

slide by Sanja Fidler

SLIDE 18

Vectorization

  • Consider our model: y(x) = w0 + w1x
  • Let w = [w0, w1]^T and x^T = [1, x]
  • We can write the model in vectorized form as y(x) = w^T x

SLIDE 19

Vectorization

  • Consider our model with N instances:

    t = [t^(1), t^(2), …, t^(N)]^T ∈ R^{N×1}

    X = [1, x^(1); 1, x^(2); …; 1, x^(N)] ∈ R^{N×2}

    w = [w0, w1]^T ∈ R^{2×1}

  • Then:

    ℓ(w) = Σ_{n=1}^{N} [w^T x^(n) − t^(n)]² = (Xw − t)^T (Xw − t)

slide by Sanja Fidler

SLIDE 20
Analytical Solution

  • Instead of using GD, solve for the optimal w analytically
    − Notice the solution is where ∂ℓ/∂w = 0
  • Derivation:

    ℓ(w) = (Xw − t)^T (Xw − t)
         = w^T X^T X w − t^T X w − w^T X^T t + t^T t
         = w^T X^T X w − 2 w^T X^T t + t^T t

    − Take the derivative, set it equal to 0, then solve for w:

    ∂/∂w [w^T X^T X w − 2 w^T X^T t + t^T t] = 0
    X^T X w − X^T t = 0
    X^T X w = X^T t

  • Closed-form solution: w = (X^T X)^{−1} X^T t

  • If X^T X is not invertible (i.e., singular), we may need to:
    − Use the pseudo-inverse instead of the inverse (in Python, numpy.linalg.pinv(a))
    − Remove redundant (not linearly independent) features
    − Remove extra features to ensure that d ≤ N
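A minimal numpy sketch of this closed form on synthetic data; the toy data and variable names are illustrative assumptions.

```python
import numpy as np

# Toy 1-D data (assumed): t = 1 + 2x + noise
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=50)
t = 1.0 + 2.0 * x + 0.1 * rng.standard_normal(50)

# Design matrix X with a leading column of ones for the bias w0
X = np.column_stack([np.ones_like(x), x])

# Closed form: solve X^T X w = X^T t (avoids forming the explicit inverse)
w = np.linalg.solve(X.T @ X, X.T @ t)

# Pseudo-inverse variant, usable even when X^T X is singular
w_pinv = np.linalg.pinv(X) @ t
```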

SLIDE 21


SLIDE 22

Multi-dimensional Inputs

  • One method of extending the model is to consider other input dimensions:

    y(x) = w0 + w1x1 + w2x2

  • In the Boston housing example, we can look at the number of rooms

slide by Sanja Fidler

SLIDE 23

Linear Regression with Multi-dimensional Inputs

  • Imagine now we want to predict the median house price from these multi-dimensional observations
  • Each house is a data point n, with observations indexed by j:

    x^(n) = (x^(n)_1, …, x^(n)_j, …, x^(n)_d)

  • We can incorporate the bias w0 into w by using x0 = 1; then

    y(x) = w0 + Σ_{j=1}^{d} wj xj = w^T x

  • We can then solve for w = (w0, w1, …, wd). How?
  • We can use gradient descent to solve for each coefficient, or compute w analytically (how does the solution change?)

    recall: w = (X^T X)^{−1} X^T t

slide by Sanja Fidler

SLIDE 24

More Powerful Models?

  • What if our linear model is not good? How can we create a more complicated model?

slide by Sanja Fidler

SLIDE 25

Fitting a Polynomial

  • What if our linear model is not good? How can we create a more complicated model?
  • We can create a more complicated model by defining input variables that are combinations of components of x
  • Example: an M-th order polynomial function of a one-dimensional feature x:

    y(x, w) = w0 + Σ_{j=1}^{M} wj x^j

    where x^j is the j-th power of x

  • We can use the same approach to optimize for the weights w
  • How do we do that? (One possible recipe is sketched below.)

slide by Sanja Fidler
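One possible recipe, sketched in numpy: expand x into polynomial features and reuse the same least-squares machinery; the helper names and the use of np.vander are illustrative assumptions.

```python
import numpy as np

def fit_polynomial(x, t, M):
    """Least-squares fit of y(x, w) = w0 + sum_j wj x^j (sketch)."""
    # Columns are x^0, x^1, ..., x^M, so the first weight is the bias w0
    X = np.vander(x, M + 1, increasing=True)
    # Same closed-form solution as before, on the expanded inputs
    return np.linalg.pinv(X) @ t

def predict_polynomial(x, w):
    return np.vander(x, len(w), increasing=True) @ w
```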

SLIDE 26

Some types of basis functions in 1-D

  • Polynomials: φj(x) = x^j
  • Gaussians: φj(x) = exp(−(x − µj)² / (2s²))
  • Sigmoids: φj(x) = σ((x − µj) / s), where σ(a) = 1 / (1 + exp(−a))

slide by Erik Sudderth
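A minimal numpy sketch of computing these basis functions over a 1-D input; the grid of centers µj and the width s are illustrative assumptions.

```python
import numpy as np

def gaussian_basis(x, centers, s=0.2):
    """phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)), plus a constant column for the bias."""
    phi = np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2 * s ** 2))
    return np.column_stack([np.ones_like(x), phi])

def sigmoid_basis(x, centers, s=0.2):
    """phi_j(x) = sigma((x - mu_j) / s), with sigma(a) = 1 / (1 + exp(-a))."""
    phi = 1.0 / (1.0 + np.exp(-(x[:, None] - centers[None, :]) / s))
    return np.column_stack([np.ones_like(x), phi])
```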

SLIDE 27

    y(x, w) = w0 + w1 x1 + w2 x2 + … = w^T x
    y(x, w) = w0 + w1 φ1(x) + w2 φ2(x) + … = w^T φ(x)

    (w0 is the bias term)

Two types of linear model that are equivalent with respect to learning

  • The first model has the same number of adaptive coefficients as the dimensionality of the data + 1
  • The second model has the same number of adaptive coefficients as the number of basis functions + 1
  • Once we have replaced the data by the outputs of the basis functions, fitting the second model is exactly the same problem as fitting the first model (unless we use the kernel trick)

slide by Erik Sudderth

SLIDE 28

General linear regression problem

  • Using our new notation, basis-function linear regression can be written as

    y = Σ_{j=0}^{k} wj φj(x)

    where φj(x) can be either xj (for multivariate regression) or one of the nonlinear bases we defined
  • Once again, we can use least squares to find the optimal solution

slide by E. P. Xing

SLIDE 29

LMS for the general linear regression problem

  • Our model: y = Σ_{j=0}^{k} wj φj(x)
  • Our goal is to minimize the following loss function:

    J(w) = Σ_i (y_i − w^T φ(x_i))²

  • Moving to vector notation and taking the derivative w.r.t. w:

    ∂/∂w Σ_i (y_i − w^T φ(x_i))² = −2 Σ_i (y_i − w^T φ(x_i)) φ(x_i)^T

  • Equating to 0, we get:

    Σ_i y_i φ(x_i)^T = w^T Σ_i φ(x_i) φ(x_i)^T

  • Here w is a vector of dimension k+1, φ(x_i) is a vector of dimension k+1, and y_i is a scalar

slide by E. P. Xing

SLIDE 30

LMS for the general linear regression problem

  • We take the derivative w.r.t. w:

    ∂/∂w Σ_i (y_i − w^T φ(x_i))² = −2 Σ_i (y_i − w^T φ(x_i)) φ(x_i)^T

  • Equating to 0, we get:

    Σ_i y_i φ(x_i)^T = w^T Σ_i φ(x_i) φ(x_i)^T

  • Define:

    Φ = [ φ0(x1)  φ1(x1)  …  φk(x1)
          φ0(x2)  φ1(x2)  …  φk(x2)
          ⋮
          φ0(xn)  φ1(xn)  …  φk(xn) ]

  • Then, solving for w, we get:

    w = (Φ^T Φ)^{−1} Φ^T y

slide by E. P. Xing

SLIDE 31

LMS for the general linear regression problem

    J(w) = Σ_i (y_i − w^T φ(x_i))²

  • Solving for w, we get:

    w = (Φ^T Φ)^{−1} Φ^T y

    where Φ is an n × (k+1) matrix, y is a vector with n entries, and w is a vector with k+1 entries
  • This solution is also known as the ‘pseudo-inverse’

slide by E. P. Xing
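Putting the pieces together, a minimal numpy sketch of the pseudo-inverse solution on a basis-expanded design matrix Φ; the toy data and the gaussian_basis helper (sketched earlier) are illustrative assumptions.

```python
import numpy as np

# Toy data (assumed): a noisy sine wave
rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=30)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(30)

# n x (k+1) design matrix from Gaussian basis functions (helper sketched above)
centers = np.linspace(-1.0, 1.0, 9)
Phi = gaussian_basis(x, centers, s=0.3)

# w = (Phi^T Phi)^{-1} Phi^T y, computed via the pseudo-inverse
w = np.linalg.pinv(Phi) @ y
```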

SLIDE 32

0th order polynomial

slide by Erik Sudderth

SLIDE 33

1st order polynomial

slide by Erik Sudderth

SLIDE 34

3rd order polynomial

slide by Erik Sudderth

SLIDE 35

9th order polynomial

slide by Erik Sudderth

SLIDE 36

Which Fit is Best?

slide by Sanja Fidler, from Bishop

SLIDE 37

Root Mean Square (RMS) Error

[Figure: polynomial fits for M = 0, 1, 3, 9.]

E(w) = (1/2) Σ_{n=1}^{N} {y(xn, w) − tn}²

E_RMS = √(2E(w⋆)/N)

The division by N allows us to compare different sizes of data sets on an equal footing, and the square root ensures that E_RMS is measured on the same scale (and in the same units) as the target variable t

slide by Erik Sudderth
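A short numpy sketch of E_RMS; the function name and the assumption that the predictions y and targets t are equal-length arrays are illustrative.

```python
import numpy as np

def rms_error(y, t):
    """E_RMS = sqrt(2 E(w*) / N), with E(w) = 0.5 * sum of squared errors."""
    e = 0.5 * np.sum((y - t) ** 2)
    return np.sqrt(2.0 * e / len(t))
```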

SLIDE 38

Root Mean Square (RMS) Error

E(w) = (1/2) Σ_{n=1}^{N} (tn − φ(xn)^T w)² = (1/2) ‖t − Φw‖²

[Figure: training and test E_RMS versus polynomial order M.]

slide by Erik Sudderth

SLIDE 39

Generalization

  • Generalization = model’s ability to predict the held out data
  • What is happening?
  • Our model with M = 9 overfits the data (it models also noise)

slide by Sanja Fidler

[Figure: training and test E_RMS versus polynomial order M.]

SLIDE 40

Generalization

  • Generalization = model’s ability to predict the held out data
  • What is happening?
  • Our model with M = 9 overfits the data (it models also noise)
  • Not a problem if we have lots of training examples


slide by Sanja Fidler

SLIDE 41

Generalization

  • Generalization = model’s ability to predict the held out data
  • What is happening?
  • Our model with M = 9 overfits the data (it models also noise)
  • Let’s look at the estimated weights for various M in the case of fewer examples

slide by Sanja Fidler

SLIDE 42

Generalization

  • Generalization = model’s ability to predict the held out data
  • What is happening?
  • Our model with M = 9 overfits the data (it models also noise)
  • Let’s look at the estimated weights for various M in the case of fewer examples
  • The weights are becoming huge to compensate for the noise
  • One way of dealing with this is to encourage the weights to be small (this way no input dimension will have too much influence on prediction). This is called regularization.

slide by Sanja Fidler

SLIDE 43

[Figure: training and test E_RMS versus polynomial order M.]

1-D regression illustrates key concepts

  • Data fits – is a linear model best (model selection)?
    − The simplest models do not capture all the important variations (signal) in the data: they underfit
    − A more complex model may overfit the training data (fit not only the signal but also the noise in the data), especially if there is not enough data to constrain the model
  • One method of assessing fit: test generalization = the model’s ability to predict the held-out data
  • Optimization is essential: stochastic and batch iterative approaches; analytic when available

slide by Richard Zemel

SLIDE 44

Regularized Least Squares

  • A technique to control the overfitting phenomenon
  • Add a penalty term to the error function in order to discourage the coefficients from reaching large values:

    E(w) = (1/2) Σ_{n=1}^{N} {y(xn, w) − tn}² + (λ/2) ‖w‖²

    where ‖w‖² ≡ w^T w = w0² + w1² + … + wM², and λ governs the relative importance of the regularization term compared with the sum-of-squares error
  • This is ridge regression, which is minimized by w = (λI + Φ^T Φ)^{−1} Φ^T t

slide by Erik Sudderth
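A minimal numpy sketch of this regularized closed form; the function name and the default choice of λ are illustrative assumptions.

```python
import numpy as np

def fit_ridge(Phi, t, lam=1e-3):
    """Regularized least squares: w = (lam * I + Phi^T Phi)^{-1} Phi^T t (sketch)."""
    k = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(k) + Phi.T @ Phi, Phi.T @ t)
```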

SLIDE 45

The effect of regularization

[Figure: M = 9 polynomial fits with ln λ = −18 and ln λ = 0.]

slide by Erik Sudderth

SLIDE 46

The effect of regularization

[Figure: training and test E_RMS versus ln λ, for M = 9.]

        ln λ = −∞      ln λ = −18    ln λ = 0
w⋆0     0.35           0.35          0.13
w⋆1     232.37         4.74          −0.05
w⋆2     −5321.83       −0.77         −0.06
w⋆3     48568.31       −31.97        −0.05
w⋆4     −231639.30     −3.89         −0.03
w⋆5     640042.26      55.28         −0.02
w⋆6     −1061800.52    41.32         −0.01
w⋆7     1042400.18     −45.95        −0.00
w⋆8     −557682.99     −91.53        0.00
w⋆9     125201.43      72.68         0.01

The corresponding coefficients from the fitted polynomials, showing that regularization has the desired effect of reducing the magnitude of the coefficients.

slide by Erik Sudderth

SLIDE 47

A more general regularizer

    (1/2) Σ_{n=1}^{N} {tn − w^T φ(xn)}² + (λ/2) Σ_{j=1}^{M} |wj|^q

[Figure: contours of the regularization term for q = 0.5, q = 1, q = 2, q = 4.]

slide by Richard Zemel

SLIDE 48

[Figure: training and test E_RMS versus polynomial order M.]

1-D regression illustrates key concepts

  • Data fits – is a linear model best (model selection)?
    − The simplest models do not capture all the important variations (signal) in the data: they underfit
    − A more complex model may overfit the training data (fit not only the signal but also the noise in the data), especially if there is not enough data to constrain the model
  • One method of assessing fit: test generalization = the model’s ability to predict the held-out data
  • Optimization is essential: stochastic and batch iterative approaches; analytic when available

slide by Richard Zemel

SLIDE 49

Next Lecture:

Machine Learning Methodology
