SLIDE 1

Machine Learning 2: Nonlinear Regression

Stefano Ermon April 13, 2016

SLIDE 2

Non-linear regression

[Figure: scatter of high temperature (F) vs. peak hourly demand (GW); observations for all days in 2008-2011]

SLIDE 3

SLIDE 4

Central idea of non-linear regression: same as linear regression, just with non-linear features, e.g. φ(x_i) = [x_i^2, x_i, 1]^T

Two ways to construct non-linear features: explicitly (construct the actual feature vector), or implicitly (using kernels)
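As a minimal MATLAB sketch (an illustration, not part of the slides), assuming column vectors x and y hold the inputs and outputs, a degree-2 polynomial fit is just least squares on explicit features:

Phi = [x.^2, x, ones(size(x))];      % explicit non-linear feature matrix, one row per example
theta = (Phi' * Phi) \ (Phi' * y);   % ordinary least squares via the normal equations
yhat = Phi * theta;                  % predictions at the training inputs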

SLIDE 5

[Figure: degree 2 polynomial fit (d = 2) to the high temperature (F) / peak hourly demand (GW) data]

SLIDE 6

[Figure: degree 3 polynomial fit (d = 3) to the high temperature (F) / peak hourly demand (GW) data]

SLIDE 7

[Figure: degree 4 polynomial fit (d = 4) to the high temperature (F) / peak hourly demand (GW) data]

SLIDE 8

Constructing explicit feature vectors

Polynomial features (max degree d)

Special case, n = 1:

φ(z) = [z^d, z^{d−1}, …, z, 1]^T ∈ R^{d+1}

General case:

φ(z) = { ∏_{i=1}^{n} z_i^{b_i} : ∑_{i=1}^{n} b_i ≤ d } ∈ R^{C(n+d, d)}
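A minimal MATLAB sketch (illustrative, not from the slides) of the n = 1 special case, assuming a column vector z of scalar inputs and a maximum degree d:

% Build the (d+1)-dimensional polynomial features [z^d, z^(d-1), ..., z, 1] for each input.
Phi = zeros(length(z), d+1);
for j = 0:d
    Phi(:, d+1-j) = z.^j;   % column d+1-j holds z^j, so the last column is all ones
end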

SLIDE 9

[Figure: polynomial bases φ_i(x) = 1, x, x^2, x^3 plotted over x ∈ [−1, 1]]

Plot of polynomial bases

SLIDE 10

Radial basis function (RBF) features

Defined by bandwidth σ and k RBF centers µj ∈ Rn, j = 1, . . . , k

φ_j(z) = exp(−‖z − µ_j‖^2 / (2σ^2))

[Figure: RBF feature value as a function of the input, for several centers]
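A minimal MATLAB sketch (illustrative, not from the slides) of computing RBF features, assuming an m x n data matrix X, a k x n matrix of centers mu, and a bandwidth sigma:

% Column j of Phi holds exp(-||x_i - mu_j||^2 / (2 sigma^2)) for every example i.
k = size(mu, 1);
Phi = zeros(size(X,1), k);
for j = 1:k
    D = X - repmat(mu(j,:), size(X,1), 1);        % differences from center j
    Phi(:, j) = exp(-sum(D.^2, 2) / (2*sigma^2)); % squared distances -> RBF values
end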

SLIDE 11

Difficulties with non-linear features

Problem #1: Computational difficulties

Polynomial features: k = C(n+d, d) = O(d^n)

RBF features: suppose we want centers on a uniform grid over the input space (with d centers along each dimension), then k = d^n

In both cases the number of features is exponential in the input dimension n; it quickly becomes intractable even to store them in memory

SLIDE 12

Problem #2: Representational difficulties

With many features, our prediction function becomes very expressive

Can lead to overfitting (low error on the input data points, but high error nearby)

Let's see an intuitive example

SLIDE 13

[Figure: four panels of least-squares polynomial fits with d = 1, 2, 4, 50 to the high temperature (F) / peak hourly demand (GW) data]

Least-squares fits for polynomial features of different degrees

SLIDE 14

[Figure: four panels of least-squares RBF fits with 2, 4, 10, and 50 RBFs (λ = 0) to the high temperature (F) / peak hourly demand (GW) data]

Least-squares fits for different numbers of RBFs

SLIDE 15

A few ways to deal with the representational problem:

Choose a less expressive function (e.g., lower degree polynomial, fewer RBF centers, larger RBF bandwidth)

Regularization: penalize large parameters θ

minimize_θ ∑_{i=1}^{m} ℓ(ŷ_i, y_i) + λ‖θ‖_2^2

λ: regularization parameter, trades off between low loss and small values of θ (often, we don't regularize the constant term)

SLIDE 16

[Figure: loss J(θ) vs. ‖θ‖_2 trade-off curve]

Pareto optimal surface for 20 RBF functions

SLIDE 17

[Figure: four panels of RBF fits with 50 RBFs and λ = 0, 2, 50, 1000]

RBF fits varying regularization parameter (not regularizing constant term)

SLIDE 18

Regularization: penalize large parameters θ

minimize_θ ∑_{i=1}^{m} ℓ(ŷ_i, y_i) + λ‖θ‖_2^2

λ: regularization parameter, trades off between low loss and small values of θ (often, we don't regularize the constant term)

Solve with the normal equations like before:

minimize_θ ‖Φθ − y‖_2^2 + λθ^T θ

minimize_θ θ^T Φ^T Φθ − 2y^T Φθ + y^T y + λθ^T θ

minimize_θ θ^T (Φ^T Φ + λI)θ − 2y^T Φθ + y^T y

Setting the gradient to zero:

θ⋆ = (Φ^T Φ + λI)^{−1} Φ^T y
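A minimal MATLAB sketch of this closed-form solution (illustrative, not from the slides), assuming a feature matrix Phi, outputs y, a regularization weight lambda, and that the last column of Phi is the constant feature we leave unregularized:

k = size(Phi, 2);
R = lambda * eye(k);
R(k, k) = 0;                             % don't penalize the constant term
theta = (Phi' * Phi + R) \ (Phi' * y);   % theta* = (Phi'Phi + lambda I)^{-1} Phi'y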

SLIDE 19

Evaluating algorithms

How do we determine when an algorithm achieves "good" performance?

How should we tune the parameters of the learning algorithm (regularization parameter, choice of features, etc.)?

How do we report the performance of learning algorithms?

SLIDE 20

One possibility: just look at the loss function

J(θ) = ∑_{i=1}^{m} ℓ(θ^T φ(x_i), y_i)

The problem: adding more features will always decrease the loss

Example: with random outputs and random features, we can get zero loss given enough features

m = 500; y = randn(m,1); Phi = randn(m,m);
theta = (Phi' * Phi) \ (Phi' * y);
norm(Phi*theta - y)^2

ans = 2.3722e-22

SLIDE 21

A better criterion: training and testing loss

Training set: x_i ∈ R^n, y_i ∈ R, i = 1, …, m

Testing set: x′_i ∈ R^n, y′_i ∈ R, i = 1, …, m′

Find parameters by minimizing the loss on the training set, but evaluate on the testing set

Training: θ⋆ = argmin_θ ∑_{i=1}^{m} ℓ(θ^T φ(x_i), y_i)

Evaluation: Average Loss = (1/m′) ∑_{i=1}^{m′} ℓ((θ⋆)^T φ(x′_i), y′_i)

Performance on the test set is called generalization performance.

SLIDE 22

Sometimes there is a natural breakdown between training and testing data (e.g., train the system on one year, test on the next)

More commonly, we simply divide the data: for example, use 70% for training and 30% for testing

% Phi, y, m are all the data
m_train = ceil(0.7*m); m_test = m - m_train;
p = randperm(m);
Phi_train = Phi(p(1:m_train),:); y_train = y(p(1:m_train));
Phi_test = Phi(p(m_train+1:end),:); y_test = y(p(m_train+1:end));
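Continuing the split above with a hedged sketch (not from the slides), one might then fit on the training portion and report the average squared loss on both portions; lambda is a hypothetical ridge weight, with lambda = 0 recovering plain least squares:

lambda = 0;
theta = (Phi_train' * Phi_train + lambda * eye(size(Phi_train,2))) \ (Phi_train' * y_train);
loss_train = mean((Phi_train * theta - y_train).^2);
loss_test = mean((Phi_test * theta - y_test).^2);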

SLIDE 23

[Figure: scatter of high temperature (F) vs. peak hourly demand (GW)]

High temperature / peak demand observations

SLIDE 24

[Figure: average squared loss vs. polynomial degree d, for the training and testing sets]

Testing loss versus degree of polynomial

SLIDE 25

[Figure: average squared loss (log scale) vs. polynomial degree d, for the training and testing sets]

Testing loss (log-scale) versus degree of polynomial

SLIDE 26

[Figure: average squared loss vs. number of RBFs, for the training and testing sets]

Testing loss versus number of RBF bases

SLIDE 27

[Figure: average squared loss (log scale) vs. number of RBFs, for the training and testing sets]

Testing loss (log-scale) versus number of RBF bases

SLIDE 28

[Figure: average squared loss (log scale) vs. regularization parameter λ (log scale), for the training and testing sets]

Testing loss (log-scale) versus regularization parameter (log-scale), for 70 RBF bases

SLIDE 29

Cross-validation

A common mistake: split the data into training/testing sets, use the testing set to find the best performing features, regularization parameter, kernel parameters, etc. (hyperparameters), then report the testing error for these best settings

SLIDE 30

This is not a valid method for evaluating error: the problem is that we effectively used the testing set to "train" the system

What we need to do instead: break the training set itself into two sets (a training set and a cross-validation set)

SLIDE 31

Cross-validation Procedure:

1. Break all data into training/testing sets (e.g., 70%/30%)

2. Break the training set into training/cross-validation sets (e.g., 70%/30% again)

3. Choose hyperparameters using the cross-validation set (see the sketch below)

4. (Optional) Once we have selected hyperparameters, retrain using all of the training set

5. Evaluate performance on the testing set
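A minimal MATLAB sketch of steps 2-3 (illustrative, not from the slides), choosing a ridge regularization weight from a hypothetical candidate list, assuming Phi_train and y_train from the earlier split:

lambdas = [0 0.1 1 10 100];              % hypothetical candidate values
m_tr = ceil(0.7 * size(Phi_train,1));    % inner 70%/30% split of the training set
q = randperm(size(Phi_train,1));
Phi_tr = Phi_train(q(1:m_tr),:); y_tr = y_train(q(1:m_tr));
Phi_cv = Phi_train(q(m_tr+1:end),:); y_cv = y_train(q(m_tr+1:end));
cv_loss = zeros(size(lambdas));
for i = 1:length(lambdas)
    th = (Phi_tr' * Phi_tr + lambdas(i) * eye(size(Phi_tr,2))) \ (Phi_tr' * y_tr);
    cv_loss(i) = mean((Phi_cv * th - y_cv).^2);   % held-out loss for this lambda
end
[~, best] = min(cv_loss);
best_lambda = lambdas(best);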

SLIDE 32

k-fold cross-validation: split the training set into k different "folds" (equally sized random subsets)

For each fold i, train on the other k − 1 folds, and evaluate on the held-out fold i

The extreme case, leave-one-out cross-validation: folds are individual examples
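A minimal MATLAB sketch of k-fold cross-validation for one fixed lambda (illustrative, not from the slides), assuming Phi_train, y_train, a fold count k, and a ridge weight lambda:

m_tr = size(Phi_train, 1);
fold_of = mod(randperm(m_tr)', k) + 1;   % random fold assignment 1..k, roughly equal sizes
fold_loss = zeros(k, 1);
for i = 1:k
    te = (fold_of == i); tr = ~te;       % fold i is held out, the other k-1 folds train
    th = (Phi_train(tr,:)' * Phi_train(tr,:) + lambda * eye(size(Phi_train,2))) ...
        \ (Phi_train(tr,:)' * y_train(tr));
    fold_loss(i) = mean((Phi_train(te,:) * th - y_train(te)).^2);
end
cv_loss = mean(fold_loss);               % average held-out loss across folds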

SLIDE 33

Non-linear regression in higher dimensions

SLIDE 34

Reporting Errors

If we want to report the performance of an algorithm, how do we do this? Reporting just the test error doesn't give a sense of our "confidence" in the prediction

If we have a testing set of size 1000, doesn't this imply more confidence in the result than a testing set of size 10?

What about variance in predictions? Are we getting some almost completely right and others very wrong?

SLIDE 35

Setting: in our test set, we have a number of actual labels y′_i, and predictions ŷ′_i of our algorithm

There are really two things we may care about:

1. What is the distribution of our errors y′_i − ŷ′_i?

2. If we want to report some average loss,

Average loss = (1/m′) ∑_{i=1}^{m′} ℓ(ŷ′_i, y′_i)

how confident are we in this value?

SLIDE 36

Some basic probability notation

We'll use Z to denote a random variable (with distribution D), and use p(z) to denote its probability density

Expected value, or mean: µ = E[Z] = ∫ z p(z) dz

Variance: σ^2 = E[(Z − µ)^2]

If you haven't seen any of this notation before, there are a number of good reviews available

SLIDE 37

Suppose we have m samples drawn from the probability distribution D, written as z_1, …, z_m ∼ D

Then we can form empirical estimates of the mean and variance of the distribution:

µ̂ = (1/m) ∑_{i=1}^{m} z_i

σ̂^2 = (1/m) ∑_{i=1}^{m} (z_i − µ)^2 ≈ (1/m) ∑_{i=1}^{m} (z_i − µ̂)^2

[You may have seen variance estimates with a 1/(m−1) term instead; this is needed to make the estimator unbiased, but we'll typically deal with large m, so there isn't much difference]
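As a small MATLAB illustration (not from the slides), assuming a vector z of samples:

mu_hat = mean(z);
sigma2_hat = mean((z - mu_hat).^2);   % the 1/m estimate; var(z, 1) computes the same quantity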

SLIDE 38

Reporting errors

As mentioned before, we might want to know about the distribution over our prediction errors ŷ′_i − y′_i

[Figure: histogram of errors ŷ′_i − y′_i (frequency vs. y′_pred − y′)]

SLIDE 39

Treat ŷ′_i − y′_i as samples from a distribution

Might want to know about the mean (also called the bias), or the variance, of this distribution

If we assume prediction errors are zero-mean (but this is not always the case), then

σ̂^2 = (1/m) ∑_{i=1}^{m} (ŷ′_i − y′_i)^2

which is the mean squared error

SLIDE 40

If we want to report some average loss, then we can treat ℓ(ŷ′_i, y′_i) (for any loss) as the random samples (the average loss is just the mean of these samples)

[Figure: histogram of losses ℓ(ŷ′_i, y′_i) for the absolute loss (frequency vs. |y′_pred − y′|)]

SLIDE 41

How confident are we in our estimate of the mean (i.e., the average loss)?

Here we'll exploit the central limit theorem: if z_1, …, z_m are (independent, identically distributed) samples from any distribution with mean µ and variance σ^2, then

(1/m) ∑_{i=1}^{m} z_i → N(µ, σ^2/m)

I.e., the mean of any set of i.i.d. random variables is approximately normally distributed for large m

For a normal distribution, 95% of the data falls within 1.96 standard deviations of the mean.

SLIDE 42

This suggests a method for computing "confidence intervals" for our estimate of the average loss:

1. Form an estimate of the mean: µ̂ = (1/m′) ∑_{i=1}^{m′} ℓ(ŷ′_i, y′_i)

2. Form an estimate of the variance: σ̂^2 = (1/m′) ∑_{i=1}^{m′} (ℓ(ŷ′_i, y′_i) − µ̂)^2

3. With 95% confidence, the "true" mean lies within µ̂ ± 1.96 σ̂/√m′

This procedure is technically wrong (we should be using a different estimate of the variance, and a Student-t distribution instead of a Gaussian), but it is close enough when m′ is reasonably large, which is usually our setting
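A minimal MATLAB sketch of this procedure (illustrative, not from the slides), assuming column vectors y_test of test labels and yhat_test of predictions, and using the absolute loss:

losses = abs(yhat_test - y_test);             % per-example losses on the test set
mp = length(losses);
mu_hat = mean(losses);                        % estimated average loss
sigma_hat = sqrt(mean((losses - mu_hat).^2)); % estimated standard deviation of the loss
ci = mu_hat + [-1, 1] * 1.96 * sigma_hat / sqrt(mp);   % approximate 95% confidence interval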

SLIDE 43

Should report errors relative to some baseline (e.g., a degree zero polynomial)

Degree   Test Error
0        0.2414 ± 0.0039
1        0.2407 ± 0.0027
2        0.1505 ± 0.0013
3        0.1255 ± 0.0009
4        0.1257 ± 0.0009
5        0.1267 ± 0.0009

A better way of determining how algorithms compare: pairwise hypothesis testing

SLIDE 44

Alternative loss functions

Nothing special about the least-squares loss function ℓ(ŷ, y) = (ŷ − y)^2. Some alternatives:

Absolute loss: ℓ(ŷ, y) = |ŷ − y|

Deadband loss: ℓ(ŷ, y) = max{0, |ŷ − y| − ε}, ε ∈ R_+

[Figure: squared, absolute, and deadband losses as a function of the error y_pred − y]

SLIDE 45

How do we find parameters that minimize the absolute loss?

minimize_θ ∑_{i=1}^{m} |θ^T φ(x_i) − y_i|

Non-differentiable, can't take the gradient

Solution: frame it as a constrained optimization problem

Introduce new variables ν ∈ R^m (with ν_i ≥ |θ^T φ(x_i) − y_i|):

minimize_{θ,ν} ∑_{i=1}^{m} ν_i   subject to   −ν_i ≤ θ^T φ(x_i) − y_i ≤ ν_i

Linear program (LP): linear objective and linear constraints

SLIDE 46

Aside: general optimization problems

In this class we'll consider general optimization problems

minimize_θ J(θ)   subject to   g_i(θ) ≤ 0, i = 1, …, N_i;   h_i(θ) = 0, i = 1, …, N_e

A constrained optimization problem; the g_i terms are the inequality constraints, the h_i terms are the equality constraints.

Many different classifications of optimization problems (linear programming, quadratic programming, semidefinite programming, integer programming), depending on the form of J, the g_i's, and the h_i's.

SLIDE 47

An important distinction in optimization is between convex problems (where J and the g_i are convex and the h_i are linear) and non-convex problems

f convex ⇔ f(aθ + (1 − a)θ′) ≤ a f(θ) + (1 − a) f(θ′) for all 0 ≤ a ≤ 1

Informally speaking, we can usually find global solutions of convex problems efficiently, while for non-convex problems we must settle for local solutions or time-consuming optimization

SLIDE 48

Solving optimization problems

Many generic optimization libraries exist

We will be using YALMIP (Yet Another Linear Matrix Inequality Parser): http://users.isy.liu.se/johanl/yalmip/

YALMIP code for least squares optimization:

theta = sdpvar(n,1);
solvesdp([], sum((Phi*theta - y).^2));
double(theta)

ans =
    0.0466
    1.4600

SLIDE 49

To solve LPs, we typically need to put them in standard form:

minimize_z c^T z   subject to   Az ≤ b

with z ∈ R^n, A ∈ R^{N_i×n}, b ∈ R^{N_i}

For the absolute loss LP:

z = [θ; ν],   c = [0; 1] (zeros over θ, ones over ν),   A = [Φ  −I; −Φ  −I],   b = [y; −y]

SLIDE 50

MATLAB code

c = [zeros(n,1); ones(m,1)];
A = [Phi -eye(m); -Phi -eye(m)];
b = [y; -y];
z = linprog(c,A,b);
theta = z(1:n)

theta =
    0.0477
    1.5978

The same solution in YALMIP:

theta = sdpvar(n,1);
solvesdp([], sum(abs(Phi*theta - y)));
double(theta)

theta =
    0.0477
    1.5978
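As a hedged extension (not from the slides), the deadband loss from earlier can be handled with the same LP trick; one possible linprog formulation, assuming a deadband width eps_db, follows. Here nu_i ends up equal to max(0, |phi(x_i)'*theta - y_i| - eps_db), so minimizing sum(nu) minimizes the total deadband loss.

eps_db = 0.1;                                        % hypothetical deadband width
c = [zeros(n,1); ones(m,1)];
A = [Phi -eye(m); -Phi -eye(m); zeros(m,n) -eye(m)]; % last block enforces nu >= 0
b = [y + eps_db; -y + eps_db; zeros(m,1)];
z = linprog(c,A,b);
theta_db = z(1:n);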

SLIDE 51

Which loss function should we use?

[Figure: fits to the high temperature (F) / peak hourly demand (GW) data using squared loss, absolute loss, and deadband loss (ε = 0.1)]
