CSC 411: Lecture 02: Linear Regression

Class based on Raquel Urtasun & Rich Zemel’s lectures

Sanja Fidler

University of Toronto

Jan 13, 2016

(Most plots in this lecture are from Bishop’s book)


Problems for Today

What should I watch this Friday?

Goal: Predict movie rating automatically!

Goal: How many followers will I get?

Goal: Predict the price of the house


Regression

What do all these problems have in common?

◮ Continuous outputs; we’ll call these t
(e.g., a rating: a real number between 0 and 10, the number of followers, a house price)

Predicting continuous outputs is called regression.

What do I need in order to predict these outputs?

◮ Features (inputs); we’ll call these x (or x if vectors)
◮ Training examples: many x(i) for which t(i) is known (e.g., many movies for which we know the rating)
◮ A model: a function that represents the relationship between x and t
◮ A loss (or cost, or objective) function, which tells us how well our model approximates the training examples
◮ Optimization: a way of finding the parameters of our model that minimize the loss function


Today: Linear Regression

Linear regression
◮ continuous outputs
◮ simple model (linear)

Introduce key concepts:
◮ loss functions
◮ generalization
◮ optimization
◮ model complexity
◮ regularization

Simple 1-D regression

Circles are data points (i.e., training examples) that are given to us
The data points are uniform in x, but may be displaced in y:

t(x) = f(x) + ε, with ε some noise

In green is the “true” curve, which we don’t know
Goal: we want to fit a curve to these points

Simple 1-D regression

Key Questions:
◮ How do we parametrize the model?
◮ What loss (objective) function should we use to judge the fit?
◮ How do we optimize fit to unseen test data (generalization)?

Example: Boston Housing data

Estimate the median house price in a neighborhood based on neighborhood statistics
Look at the first possible attribute (feature): per capita crime rate
Use this to predict house prices in other neighborhoods
Is this a good input (attribute) for predicting house prices?

Represent the Data

Data is described as pairs D = {(x(1), t(1)), · · · , (x(N), t(N))}
◮ x ∈ R is the input feature (per capita crime rate)
◮ t ∈ R is the target output (median house price)
◮ (i) simply indexes the training examples (we have N in this case)

Here t is continuous, so this is a regression problem

The model outputs y, an estimate of t:

y(x) = w0 + w1 x

What type of model did we choose?

Divide the dataset into training and testing examples
◮ Use the training examples to construct a hypothesis, or function approximator, that maps x to a predicted y
◮ Evaluate the hypothesis on the test set
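
A concrete sketch of this setup in NumPy; the crime-rate and price values below are invented for illustration and are not the real Boston data (which has N = 506 neighborhoods):

```python
import numpy as np

# Invented (x, t) pairs: per capita crime rate -> median house price
x = np.array([0.02, 0.09, 0.25, 1.30, 4.10, 8.70])
t = np.array([24.0, 21.6, 22.8, 18.2, 13.5, 10.2])

# Divide the dataset into training and testing examples
rng = np.random.default_rng(0)
idx = rng.permutation(len(x))
n_train = int(0.8 * len(x))                  # 80/20 split (a common choice)
x_train, t_train = x[idx[:n_train]], t[idx[:n_train]]
x_test, t_test = x[idx[n_train:]], t[idx[n_train:]]
```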

Noise

A simple model typically does not exactly fit the data – the lack of fit can be considered noise

Sources of noise:
◮ Imprecision in data attributes (input noise, e.g., noise in the per-capita crime rate)
◮ Errors in data targets (mislabeling, e.g., noise in house prices)
◮ Additional attributes, not taken into account by the data attributes, that affect the target values (latent variables). In the example, what else could affect house prices?
◮ The model may be too simple to account for the data targets

Least-Squares Regression

Define a model: y(x) = function(x, w). Linear:

y(x) = w0 + w1 x

The standard loss/cost/objective function measures the squared error between y and the true value t:

ℓ(w) = ∑_{n=1}^{N} [t(n) − y(x(n))]²

For the linear model:

ℓ(w) = ∑_{n=1}^{N} [t(n) − (w0 + w1 x(n))]²

For a particular hypothesis (a y(x) defined by a choice of w, drawn in red), what does the loss represent geometrically?
The loss for the red hypothesis is the sum of the squared vertical errors (the squared lengths of the green vertical lines)

How do we obtain the weights w = (w0, w1)? Find the w that minimizes the loss ℓ(w)
For the linear model, what kind of a function is ℓ(w)?
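
To make the loss concrete, here is a minimal NumPy sketch (mine, not the slides’) that evaluates ℓ(w) for a candidate (w0, w1) on a small invented dataset:

```python
import numpy as np

def squared_error_loss(w0, w1, x, t):
    """Loss of the linear model y(x) = w0 + w1*x."""
    y = w0 + w1 * x                 # predictions for all N examples
    return np.sum((t - y) ** 2)     # sum of squared vertical errors

# Invented toy data
x = np.array([0.0, 1.0, 2.0, 3.0])
t = np.array([1.1, 2.9, 5.2, 6.8])

print(squared_error_loss(0.0, 2.0, x, t))  # loss of the hypothesis y = 2x
```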

Optimizing the Objective

One straightforward method: gradient descent
◮ initialize w (e.g., randomly)
◮ repeatedly update w based on the gradient:

w ← w − λ ∂ℓ/∂w

λ is the learning rate

For a single training case, this gives the LMS update rule:

w ← w + 2λ (t(n) − y(x(n))) x(n)

where (t(n) − y(x(n))) is the error. Note: as the error approaches zero, so does the update (w stops changing)
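
A minimal sketch of these updates in code, on the same invented data and with a hand-picked learning rate; this is batch gradient descent on (w0, w1), not the course’s reference implementation:

```python
import numpy as np

def gradient_descent(x, t, lam=0.01, n_iters=1000):
    """Minimize l(w) = sum_n [t(n) - (w0 + w1*x(n))]^2 by gradient descent."""
    w0, w1 = 0.0, 0.0                        # initialize (zeros here; could be random)
    for _ in range(n_iters):
        error = t - (w0 + w1 * x)            # per-example errors t(n) - y(x(n))
        w0 += 2 * lam * np.sum(error)        # LMS update for the bias (its "input" is 1)
        w1 += 2 * lam * np.sum(error * x)    # LMS update for the slope
    return w0, w1

x = np.array([0.0, 1.0, 2.0, 3.0])
t = np.array([1.1, 2.9, 5.2, 6.8])
print(gradient_descent(x, t))  # approaches the least-squares fit, about (1.09, 1.94)
```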

Optimizing Across Training Set

Two ways to generalize this for all examples in the training set:

1. Batch updates: sum or average the updates across every example n, then change the parameter values

w ← w + 2λ ∑_{n=1}^{N} (t(n) − y(x(n))) x(n)

2. Stochastic/online updates: update the parameters for each training case in turn, according to its own gradients

Algorithm 1 Stochastic gradient descent
1: Randomly shuffle the examples in the training set
2: for i = 1 to N do
3:     Update: w ← w + 2λ(t(i) − y(x(i))) x(i)   (update for a linear model)
4: end for

◮ Underlying assumption: the samples are independent and identically distributed (i.i.d.)
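
A direct transcription of Algorithm 1 for the 1-D linear model, as a sketch on the same invented data; note that with a constant learning rate the stochastic updates hover near the optimum rather than converging exactly:

```python
import numpy as np

def sgd_epoch(w0, w1, x, t, lam=0.01):
    """One pass of Algorithm 1 over the training set."""
    for i in np.random.permutation(len(t)):   # 1: randomly shuffle the examples
        error = t[i] - (w0 + w1 * x[i])       # the error t(i) - y(x(i))
        w0 += 2 * lam * error                 # 3: update from this example alone
        w1 += 2 * lam * error * x[i]
    return w0, w1

x = np.array([0.0, 1.0, 2.0, 3.0])
t = np.array([1.1, 2.9, 5.2, 6.8])
w0, w1 = 0.0, 0.0
for _ in range(200):                          # repeat epochs
    w0, w1 = sgd_epoch(w0, w1, x, t)
print(w0, w1)                                 # near the batch solution (about 1.09, 1.94)
```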

Analytical Solution?

For some objectives we can also find the optimal solution analytically
This is the case for linear least-squares regression
How? Compute the derivatives of the objective w.r.t. w and set them equal to 0

Define t = [t(1), t(2), . . . , t(N)]ᵀ and let X be the matrix with one row per example:

X = [ 1  x(1)
      1  x(2)
      . . .
      1  x(N) ]

Then:

w = (XᵀX)⁻¹Xᵀt   (work it out!)
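
A sketch of the closed form in NumPy; solving the normal equations (XᵀX)w = Xᵀt with np.linalg.solve avoids forming the explicit inverse:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
t = np.array([1.1, 2.9, 5.2, 6.8])

X = np.column_stack([np.ones_like(x), x])   # n-th row is [1, x(n)]

# w = (X^T X)^{-1} X^T t, via the linear system (X^T X) w = X^T t
w = np.linalg.solve(X.T @ X, X.T @ t)
print(w)                                    # [w0, w1], about [1.09, 1.94]
```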

Multi-dimensional Inputs

One method of extending the model is to consider other input dimensions:

y(x) = w0 + w1 x1 + w2 x2

In the Boston housing example, we can look at the number of rooms

Linear Regression with Multi-dimensional Inputs

Imagine now we want to predict the median house price from these multi-dimensional observations
Each house is a data point n, with its observations indexed by j:

x(n) = ( x1(n), · · · , xj(n), · · · , xd(n) )

We can incorporate the bias w0 into w by using x0 = 1; then

y(x) = w0 + ∑_{j=1}^{d} wj xj = wᵀx

We can then solve for w = (w0, w1, · · · , wd). How?
We can use gradient descent to solve for each coefficient, or compute w analytically (how does the solution change?)
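
A sketch of the multi-dimensional case with invented feature values (crime rate and number of rooms): the only change to the analytical solution is that the design matrix X gains one column per input dimension.

```python
import numpy as np

# Invented observations per house: [per capita crime rate, number of rooms]
X_raw = np.array([[0.02, 6.5],
                  [0.25, 5.9],
                  [4.10, 6.1],
                  [8.70, 4.8]])
t = np.array([24.0, 20.1, 17.9, 10.4])         # invented median prices

X = np.column_stack([np.ones(len(t)), X_raw])  # absorb the bias: x0 = 1
w = np.linalg.solve(X.T @ X, X.T @ t)          # w = (w0, w1, ..., wd), same formula
y = X @ w                                      # predictions y(x) = w^T x
print(w, y)
```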

Fitting a Polynomial

What if our linear model is not good? How can we create a more complicated model?
We can create a more complicated model by defining input variables that are combinations of the components of x
Example: an M-th order polynomial function of a one-dimensional feature x:

y(x, w) = w0 + ∑_{j=1}^{M} wj x^j

where x^j is the j-th power of x
We can use the same approach to optimize for the weights w
How do we do that?
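
One possible answer, as a sketch: treat the powers x^j as the input dimensions, so the model stays linear in w and the analytical solution from before applies unchanged. The noisy sin(2πx) data below mimics Bishop’s running example.

```python
import numpy as np

def fit_polynomial(x, t, M):
    """Least-squares fit of y(x, w) = w0 + sum_{j=1..M} wj * x**j."""
    X = np.vander(x, M + 1, increasing=True)   # columns [1, x, x^2, ..., x^M]
    return np.linalg.solve(X.T @ X, X.T @ t)   # same normal equations as before

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(10)  # noisy sine
print(fit_polynomial(x, t, M=3))               # weights of a cubic fit
```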


Which Fit is Best?

(Figure from Bishop)

Generalization

Generalization = the model’s ability to predict held-out data
What is happening? Our model with M = 9 overfits the data (it models the noise as well)
This is not a problem if we have lots of training examples
Let’s look at the estimated weights for various M in the case of fewer examples
The weights become huge to compensate for the noise
One way of dealing with this is to encourage the weights to be small (this way no single input dimension will have too much influence on the prediction). This is called regularization.

Regularized Least Squares

Increasing the input features this way can complicate the model considerably
Goal: select the appropriate model complexity automatically
Standard approach: regularization

ℓ̃(w) = ∑_{n=1}^{N} [t(n) − (w0 + w1 x(n))]² + α wᵀw

Intuition: since we are minimizing the loss, the second term will encourage smaller values in w
The penalty on the squared weights is known as ridge regression in statistics
It leads to a modified update rule for gradient descent:

w ← w + 2λ [ ∑_{n=1}^{N} (t(n) − y(x(n))) x(n) − αw ]

It also has an analytical solution: w = (XᵀX + αI)⁻¹Xᵀt (verify!)
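
A sketch of the regularized closed form; alpha is the hyperparameter α, and the M = 9 polynomial features are exactly the setting where the unregularized weights blew up:

```python
import numpy as np

def ridge_fit(X, t, alpha):
    """Regularized least squares: w = (X^T X + alpha*I)^{-1} X^T t."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ t)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(10)
X = np.vander(x, 10, increasing=True)            # M = 9 polynomial features
print(np.round(ridge_fit(X, t, alpha=1e-3), 2))  # alpha > 0 tames the huge weights
```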

Regularized least squares

Better generalization
Choose α carefully

1-D regression illustrates key concepts

Data fits – is the linear model best (model selection)?
◮ Simple models may not capture all the important variations (signal) in the data: they underfit
◮ More complex models may overfit the training data (fit not only the signal but also the noise in the data), especially if there is not enough data to constrain the model

One method of assessing fit: test generalization = the model’s ability to predict held-out data
Optimization is essential: stochastic and batch iterative approaches; analytic when available


So...

Which movie will you watch?
