
BBM406 Fundamentals of Machine Learning Lecture 4: Linear Regression, Optimization, Generalization, Model Complexity, Regularization - PowerPoint PPT Presentation



  1. BBM406 Fundamentals of Machine Learning, Lecture 4: Linear Regression, Optimization, Generalization, Model Complexity, Regularization
     Aykut Erdem // Hacettepe University // Fall 2019
     (illustration: detail from xkcd strip #2048)

  2. Recall from last time: Kernel Regression
     • Training data (x_1, y_1), ..., (x_N, y_N), with inputs x_j ∈ X and targets y_j ∈ ℝ
     • A distance function d : X × X → ℝ between inputs
     • 1-NN for Regression: predict the target of the training point closest to the query x′, i.e. the x_j minimizing d(x_j, x′)
     • Weighted K-NN for Regression: weight each neighbour's target by a kernel of its distance to the query
     • Distance metrics: D = ( Σ_{i=1}^{n} |x_i - y_i|^p )^(1/p)
     • Kernel width: w_i = exp(-d(x_i, query)² / σ²)
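Note (not part of the original deck): a minimal NumPy sketch of the weighted K-NN regression recalled above, assuming an L^p distance and the Gaussian kernel weights w_i = exp(-d(x_i, query)²/σ²). The function name, toy data, and parameter values are illustrative only.

    import numpy as np

    def weighted_knn_predict(X_train, t_train, query, k=5, sigma=1.0, p=2):
        """Weighted K-NN regression: average the targets of the k nearest
        training points, weighted by a Gaussian kernel on their distance."""
        # L^p distance from the query to every training point
        d = np.sum(np.abs(X_train - query) ** p, axis=1) ** (1.0 / p)
        nearest = np.argsort(d)[:k]                      # indices of the k closest points
        w = np.exp(-d[nearest] ** 2 / sigma ** 2)        # weights shrink with distance
        return np.sum(w * t_train[nearest]) / np.sum(w)  # weighted average of neighbour targets

    # toy 1-D example with made-up data
    rng = np.random.default_rng(0)
    X = rng.uniform(0, 10, size=(40, 1))
    t = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)
    print(weighted_knn_predict(X, t, query=np.array([3.0]), k=5, sigma=0.5))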

  3. Linear Regression

  4. Simple 1-D Regression
     • Circles are data points (i.e., training examples) that are given to us
     • The data points are uniform in x, but may be displaced in y: t(x) = f(x) + ε, with ε some noise
     • In green is the "true" curve that we don't know
     • Goal: we want to fit a curve to these points
     (slide by Sanja Fidler)
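Note (not from the slides): a tiny sketch of the setup above, generating targets as t(x) = f(x) + ε from a known "true" curve f plus Gaussian noise; the sine curve, sample size, and noise level are arbitrary choices for illustration.

    import numpy as np

    rng = np.random.default_rng(1)

    def f(x):
        return np.sin(2 * np.pi * x)     # the "true" curve, unknown to the learner

    N = 10
    x = np.linspace(0, 1, N)             # inputs spread uniformly in x
    eps = 0.15 * rng.normal(size=N)      # noise term ε
    t = f(x) + eps                       # observed targets: t(x) = f(x) + ε
    print(np.column_stack([x, t]))       # the (x, t) "circles" we would try to fit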

  5. Simple 1-D Regression
     • Key questions:
       - How do we parametrize the model (the curve)?
       - What loss (objective) function should we use to judge the fit?
       - How do we optimize the fit to unseen test data (generalization)?
     (slide by Sanja Fidler)

  6. Example: Boston House Prices
     • Estimate the median house price in a neighborhood based on neighborhood statistics
     • Look at the first (of 13) attributes: per capita crime rate
     • Use this to predict house prices in other neighborhoods
     • Is this a good input (attribute) for predicting house prices?
     • https://archive.ics.uci.edu/ml/datasets/Housing
     (slide by Sanja Fidler)

  7. Represent the Data
     • Data are described as pairs D = {(x^(1), t^(1)), (x^(2), t^(2)), ..., (x^(N), t^(N))}
       - x is the input feature (per capita crime rate)
       - t is the target output (median house price)
       - the superscript (i) simply indexes the training examples (we have N of them)
     • Here t is continuous, so this is a regression problem
     • The model outputs y, an estimate of t: y(x) = w0 + w1 x
     • What type of model did we choose?
     • Divide the dataset into training and testing examples
       - Use the training examples to construct a hypothesis, or function approximator, that maps x to a predicted y
       - Evaluate the hypothesis on the test set
     (slide by Sanja Fidler)
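Note (not in the deck): a short sketch of this representation, with stand-in arrays in place of the UCI Housing data and a simple random train/test split; the variable names and numbers are made up for illustration.

    import numpy as np

    rng = np.random.default_rng(2)

    # stand-in for the (crime rate, median price) pairs; the real values would
    # come from the UCI Housing dataset linked on the slide
    x = rng.uniform(0, 30, size=100)            # input feature x^(i)
    t = 35.0 - 0.5 * x + rng.normal(size=100)   # continuous target t^(i)

    # split into training and testing examples
    idx = rng.permutation(len(x))
    train_idx, test_idx = idx[:70], idx[70:]

    def y(x, w0, w1):
        """Hypothesis: a linear function mapping x to a predicted target."""
        return w0 + w1 * x

    # evaluate some (arbitrary) hypothesis on the held-out test examples
    pred = y(x[test_idx], w0=35.0, w1=-0.5)
    print(np.mean((t[test_idx] - pred) ** 2))   # average squared test error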

  8. Noise
     • A simple model typically does not exactly fit the data; the lack of fit can be considered noise
     • Sources of noise:
       - Imprecision in the data attributes (input noise, e.g. noise in the per-capita crime rate)
       - Errors in the data targets (mislabeling, e.g. noise in the house prices)
       - Additional attributes not taken into account by the data attributes that affect the target values (latent variables); in the example, what else could affect house prices?
       - The model may be too simple to account for the data targets
     (slide by Sanja Fidler)

  9. Least-Squares Regression
     • y(x) = function(x, w)
     (slide by Sanja Fidler)

  10. Least-Squares Regression
     • Define a model. Linear: y(x) = function(x, w)
     • A standard loss/cost/objective function measures the squared error between y and the true value t
     • For a particular hypothesis (y(x) defined by a choice of w, drawn in red), what does the loss represent geometrically?
     (slide by Sanja Fidler)

  11. Least-Squares Regression
     • Define a model. Linear: y(x) = w0 + w1 x
     • A standard loss/cost/objective function measures the squared error between y and the true value t
     • For a particular hypothesis (y(x) defined by a choice of w, drawn in red), what does the loss represent geometrically?
     (slide by Sanja Fidler)

  12. Least-Squares Regression
     • Define a model. Linear: y(x) = w0 + w1 x
     • A standard loss/cost/objective function measures the squared error between y and the true value t:
       ℓ(w) = Σ_{n=1}^{N} [ t^(n) - y(x^(n)) ]²
     • For a particular hypothesis (y(x) defined by a choice of w, drawn in red), what does the loss represent geometrically?
     (slide by Sanja Fidler)

  13. Least-Squares Regression
     • Define a model. Linear: y(x) = w0 + w1 x
     • A standard loss/cost/objective function measures the squared error between y and the true value t
     • Linear model: ℓ(w) = Σ_{n=1}^{N} [ t^(n) - (w0 + w1 x^(n)) ]²
     • For a particular hypothesis (y(x) defined by a choice of w, drawn in red), what does the loss represent geometrically?
     (slide by Sanja Fidler)

  14. Least-Squares Regression
     • Define a model. Linear: y(x) = w0 + w1 x
     • A standard loss/cost/objective function measures the squared error between y and the true value t
     • Linear model: ℓ(w) = Σ_{n=1}^{N} [ t^(n) - (w0 + w1 x^(n)) ]²
     • For a particular hypothesis (y(x) defined by a choice of w, drawn in red), what does the loss represent geometrically?
     (slide by Sanja Fidler)

  15. Least-Squares Regression
     • Define a model. Linear: y(x) = w0 + w1 x
     • A standard loss/cost/objective function measures the squared error between y and the true value t
     • Linear model: ℓ(w) = Σ_{n=1}^{N} [ t^(n) - (w0 + w1 x^(n)) ]²
     • The loss for the red hypothesis is the sum of the squared vertical errors (the squared lengths of the green vertical lines)
     (slide by Sanja Fidler)

  16. Least-Squares Regression
     • Define a model. Linear: y(x) = w0 + w1 x
     • A standard loss/cost/objective function measures the squared error between y and the true value t
     • Linear model: ℓ(w) = Σ_{n=1}^{N} [ t^(n) - (w0 + w1 x^(n)) ]²
     • How do we obtain the weights w = (w0, w1)?
     (slide by Sanja Fidler)

  17. Least-Squares Regression
     • Define a model. Linear: y(x) = w0 + w1 x
     • A standard loss/cost/objective function measures the squared error between y and the true value t
     • Linear model: ℓ(w) = Σ_{n=1}^{N} [ t^(n) - (w0 + w1 x^(n)) ]²
     • How do we obtain the weights w = (w0, w1)? Find the w that minimizes the loss ℓ(w)
     (slide by Sanja Fidler)
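Note (not part of the deck): a minimal sketch of evaluating the loss ℓ(w) = Σ_n [t^(n) - (w0 + w1 x^(n))]² for a candidate w on made-up toy data, to make the "find the w that minimizes ℓ(w)" objective concrete.

    import numpy as np

    def squared_error_loss(w, x, t):
        """Least-squares loss for the linear model y(x) = w0 + w1 * x."""
        w0, w1 = w
        residuals = t - (w0 + w1 * x)     # t^(n) - y(x^(n)) for every training example
        return np.sum(residuals ** 2)     # sum of squared vertical errors

    # compare two candidate weight vectors on toy data
    x = np.array([1.0, 2.0, 3.0, 4.0])
    t = np.array([2.1, 3.9, 6.2, 8.1])
    print(squared_error_loss((0.0, 2.0), x, t))   # close to the data, so small loss
    print(squared_error_loss((5.0, -1.0), x, t))  # poor fit, so large loss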

  18. Optimizing the Objective
     • One straightforward method: gradient descent
       - initialize w (e.g., randomly)
       - repeatedly update w based on the gradient: w ← w - λ ∂ℓ/∂w
     • λ is the learning rate
     • For a single training case, this gives the LMS update rule: w ← w + 2λ ( t^(n) - y(x^(n)) ) x^(n), where ( t^(n) - y(x^(n)) ) is the error
     • Note: as the error approaches zero, so does the update (w stops changing)
     (slide by Sanja Fidler)
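Note (not the course's own code): a sketch of the single-case LMS update above, treating each input as the augmented vector (1, x^(n)) so that w0 receives the bias part of the update; the learning rate, data, and loop counts are arbitrary.

    import numpy as np

    def lms_step(w, x_n, t_n, lam=0.01):
        """One LMS update for a single training case:
        w <- w + 2*lam*(t^(n) - y(x^(n))) * x^(n), with x^(n) = (1, x_n)."""
        x_aug = np.array([1.0, x_n])
        error = t_n - w @ x_aug            # the update vanishes as this error goes to 0
        return w + 2 * lam * error * x_aug

    x = np.array([1.0, 2.0, 3.0, 4.0])
    t = np.array([2.1, 3.9, 6.2, 8.1])
    w = np.zeros(2)                        # initialize w (zeros here; random also works)
    for _ in range(200):                   # repeatedly update w, one case at a time
        for x_n, t_n in zip(x, t):
            w = lms_step(w, x_n, t_n)
    print(w)                               # settles near (w0, w1) ≈ (0.0, 2.0)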

  19. Optimizing the Objective
     (slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson)

  20. Optimizing the Objective
     (slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson)

  21. Effect of Learning Rate λ
     (figure: the loss ℓ(w) plotted against w0 for two settings of λ)
     • Large λ => fast convergence, but larger residual error; oscillations are also possible
     • Small λ => slow convergence, but small residual error
     (slide by Erik Sudderth)
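Note (illustrative, not from the deck): a sketch of the trade-off above, running full-batch gradient descent on a toy least-squares problem with a small and a larger learning rate λ; the values are arbitrary, and a λ that is too large makes the iterates oscillate or diverge.

    import numpy as np

    def batch_gd_losses(x, t, lam, steps=100):
        """Full-batch gradient descent on the least-squares loss; returns the loss per step."""
        X = np.column_stack([np.ones_like(x), x])   # augment inputs with a bias column
        w = np.zeros(2)
        losses = []
        for _ in range(steps):
            err = t - X @ w
            w = w + 2 * lam * X.T @ err             # batch update: summed over all examples
            losses.append(np.sum((t - X @ w) ** 2))
        return losses

    x = np.array([1.0, 2.0, 3.0, 4.0])
    t = np.array([2.1, 3.9, 6.2, 8.1])
    print(batch_gd_losses(x, t, lam=0.001)[-1])  # small λ: loss is still high after 100 steps
    print(batch_gd_losses(x, t, lam=0.02)[-1])   # larger λ: loss drops much faster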

  22. Optimizing Across the Training Set
     • Two ways to generalize this to all examples in the training set:
       1. Batch updates: sum or average the updates across every example n, then change the parameter values: w ← w + 2λ Σ_{n=1}^{N} ( t^(n) - y(x^(n)) ) x^(n)
       2. Stochastic/online updates: update the parameters for each training case in turn, according to its own gradients
     • Algorithm 1: Stochastic gradient descent
       1: Randomly shuffle the examples in the training set
       2: for i = 1 to N do
       3:   Update: w ← w + 2λ ( t^(i) - y(x^(i)) ) x^(i)   (update for a linear model)
       4: end for
     (slide by Sanja Fidler)
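Note (not the deck's own code): a NumPy sketch of the two schemes above on a toy linear model, with a batch update that sums the per-example updates before changing w, and the per-example loop of Algorithm 1; the data and learning rates are made up.

    import numpy as np

    rng = np.random.default_rng(3)

    def batch_update(w, X, t, lam):
        """Batch: sum the LMS updates over all N examples, then change w once."""
        err = t - X @ w
        return w + 2 * lam * X.T @ err

    def sgd_epoch(w, X, t, lam):
        """Stochastic/online (Algorithm 1): shuffle, then update w for each case in turn."""
        for i in rng.permutation(len(t)):
            err = t[i] - X[i] @ w
            w = w + 2 * lam * err * X[i]
        return w

    X = np.column_stack([np.ones(4), np.array([1.0, 2.0, 3.0, 4.0])])
    t = np.array([2.1, 3.9, 6.2, 8.1])

    w_batch = np.zeros(2)
    for _ in range(2000):
        w_batch = batch_update(w_batch, X, t, lam=0.01)

    w_sgd = np.zeros(2)
    for _ in range(200):
        w_sgd = sgd_epoch(w_sgd, X, t, lam=0.01)

    print(w_batch, w_sgd)   # both end up near the least-squares fit (w0, w1) ≈ (0.0, 2.0)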

  23. Optimizing Across the Training Set
     • Two ways to generalize this to all examples in the training set:
       1. Batch updates: sum or average the updates across every example n, then change the parameter values
       2. Stochastic/online updates: update the parameters for each training case in turn, according to its own gradients
     • Underlying assumption: the sample is independent and identically distributed (i.i.d.)
     (slide by Sanja Fidler)

  24. Analytical Solution
     • For some objectives we can also find the optimal solution analytically
     • This is the case for linear least-squares regression
     • How?
     (slide by Sanja Fidler)
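Note (not spelled out on this slide; a preview using the vectorized notation introduced on the next two slides): setting the gradient of the least-squares loss to zero gives a closed-form solution.

    ℓ(w) = (Xw - t)^T (Xw - t)
    ∂ℓ/∂w = 2 X^T (Xw - t) = 0
    ⇒  X^T X w = X^T t  ⇒  w = (X^T X)^(-1) X^T t   (assuming X^T X is invertible)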

  25. Vectorization
     • Consider our model: y(x) = w0 + w1 x
     • Let x^T = [1  x] and w = [w0  w1]^T
     • Then we can write the model in vectorized form as y(x) = w^T x

  26. Vectorization
     • Consider our model with N instances:
       - t = [ t^(1), t^(2), ..., t^(N) ]^T ∈ R^(N×1)
       - X = the N×2 matrix whose n-th row is [ 1, x^(n) ]
       - w = [ w0, w1 ]^T ∈ R^(2×1)
     • Then: ℓ(w) = Σ_{n=1}^{N} [ w^T x^(n) - t^(n) ]² = (Xw - t)^T (Xw - t), with (Xw - t)^T ∈ R^(1×N) and (Xw - t) ∈ R^(N×1)
     (slide by Sanja Fidler)
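Note (not from the deck): a small NumPy check of the vectorized form above, plus the closed-form least-squares solution obtained by solving the normal equations; the toy numbers are arbitrary.

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0])
    t = np.array([2.1, 3.9, 6.2, 8.1])
    X = np.column_stack([np.ones_like(x), x])   # rows are [1, x^(n)], so X has shape (N, 2)
    w = np.array([0.5, 1.8])                    # an arbitrary candidate weight vector

    # the vectorized loss (Xw - t)^T (Xw - t) equals the per-example sum of squares
    loss_vectorized = (X @ w - t) @ (X @ w - t)
    loss_summed = np.sum((X @ w - t) ** 2)
    assert np.isclose(loss_vectorized, loss_summed)

    # the minimizer in closed form: solve the normal equations (X^T X) w = X^T t
    w_star = np.linalg.solve(X.T @ X, X.T @ t)
    print(w_star)                               # ≈ (0.0, 2.03) for this toy data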
