

SLIDE 1

Machine Learning 1: Linear Regression

Stefano Ermon March 31, 2016

SLIDE 2

Plan for today

Supervised Machine Learning: linear regression

SLIDE 3

Renewable electricity generation in the U.S.

Source: Renewable energy data book, NREL

SLIDE 4

Challenges for the grid

Wind and solar are intermittent. We will need traditional power plants when the wind stops.

Many power plants (e.g., nuclear) cannot be easily turned on/off or quickly ramped up/down

With more accurate forecasts, wind and solar power become more efficient alternatives

A few years ago, Xcel Energy (Colorado) ran ads opposing a proposal that it use 10% renewable sources. Thanks to wind-forecasting (ML) algorithms developed at NCAR, it now aims for 30%. Accurate forecasting saved the utility $6-$10 million per year.

SLIDE 5

Motivation

Solar and wind are intermittent. Can we accurately forecast how much energy we will consume tomorrow?

Difficult to estimate from “a priori” models. But we have lots of data from which to build a model.

SLIDE 6

Typical electricity consumption

[Figure: typical hourly electricity demand (GW) by hour of day for three days (Feb 9, Jul 13, Oct 10). Data: PJM, http://www.pjm.com]

SLIDE 7

Predict peak demand from high temperature

What will peak demand be tomorrow? If we know something else about tomorrow (like the high temperature), we can use this to predict peak demand

[Figure: scatter plot of peak hourly demand (GW) vs. high temperature (°F).]

Data: PJM, Weather Underground (summer months, June-August)

SLIDE 8

A simple model

A linear model that predicts demand: predicted peak demand = θ1 · (high temperature) + θ2

[Figure: observed data and the linear regression prediction; peak hourly demand (GW) vs. high temperature (°F).]

Parameters of model: θ1, θ2 ∈ R (θ1 = 0.046, θ2 = −1.46)

SLIDE 9

A simple model

We can use a model like this to make predictions. What will be the peak demand tomorrow?

I know from the weather report that the high temperature will be 80°F (ignore, for the moment, that this too is a prediction)

Then the predicted peak demand is: θ1 · 80 + θ2 = 0.046 · 80 − 1.46 ≈ 2.2 GW
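As a quick MATLAB sketch of this calculation (our addition; the parameter values are the rounded ones from the previous slide):

theta = [0.046; -1.46];                 % [theta1; theta2], rounded fit from slide 8
high_temp = 80;                         % assumed forecast high temperature (F)
pred = theta(1) * high_temp + theta(2)  % predicted peak demand in GW, about 2.2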

SLIDE 10

Formal problem setting

Input: xi ∈ Rn, i = 1, . . . , m

E.g.: xi ∈ R1 = {high temperature for day i}

Output: yi ∈ R (regression task)

E.g.: yi ∈ R = {peak demand for day i}

Model parameters: θ ∈ Rk

Predicted output: ŷi ∈ R; e.g., ŷi = θ1 · xi + θ2

SLIDE 11

For convenience, we define a function that maps inputs to feature vectors, φ : Rn → Rk. For example, in our task above, if we define

φ(xi) = (xi, 1)T  (here n = 1, k = 2)

then we can write

ŷi = Σ_{j=1}^k θj · φj(xi) ≡ θT φ(xi)
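As a minimal MATLAB sketch of this feature map (the names phi and theta here are ours, not from the slides):

phi = @(x) [x; 1];           % feature map: scalar input -> vector in R^2
theta = [0.046; -1.46];      % parameters from the earlier fit
yhat = theta' * phi(80)      % the same prediction, now written as theta^T phi(x)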

SLIDE 12

Loss functions

Want a model that performs “well” on the data we have, i.e., ŷi ≈ yi, ∀i

We measure the “closeness” of ŷi and yi using a loss function ℓ : R × R → R+

Example: squared loss ℓ(ŷi, yi) = (ŷi − yi)²
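In MATLAB the squared loss is a one-liner; a sketch with made-up numbers, purely for illustration:

sq_loss = @(yhat, y) (yhat - y).^2;   % squared loss
sq_loss(2.2, 2.5)                     % a 0.3 GW miss costs 0.09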

SLIDE 13

Finding model parameters, and optimization

Want to find model parameters that minimize the sum of losses over all input/output pairs:

J(θ) = Σ_{i=1}^m ℓ(ŷi, yi) = Σ_{i=1}^m (θT φ(xi) − yi)²

Write our objective formally as

minimize_θ J(θ)

a simple example of an optimization problem; these will dominate our development of algorithms throughout the course
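With a feature matrix whose i-th row is φ(xi)T (built explicitly on slide 17), the objective is one line of MATLAB; a sketch with names of our choosing:

J = @(theta, Phi, y) sum((Phi * theta - y).^2);   % sum of squared residuals over all m examples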

SLIDE 14

How do we optimize a function?

Search algorithm: Start with an initial guess for θ. Keep changing θ (by a little bit) to reduce J(θ).

Animation: https://www.youtube.com/watch?v=vWFjqgb-ylQ

SLIDE 15

Gradient descent

Search algorithm: Start with an initial guess for θ. Keep changing θ (by a little bit) to reduce

J(θ) = Σ_{i=1}^m ℓ(ŷi, yi) = Σ_{i=1}^m (θT φ(xi) − yi)²

Gradient descent: θj := θj − α ∂J(θ)/∂θj, for all j

∂J/∂θj = ∂/∂θj Σ_{i=1}^m (θT φ(xi) − yi)²
       = Σ_{i=1}^m ∂/∂θj (θT φ(xi) − yi)²
       = Σ_{i=1}^m 2 (θT φ(xi) − yi) · ∂/∂θj (θT φ(xi) − yi)
       = Σ_{i=1}^m 2 (θT φ(xi) − yi) φj(xi)

SLIDE 16

Gradient descent

Repeat until “convergence”:

θj := θj − α Σ_{i=1}^m 2 (θT φ(xi) − yi) φj(xi), for all j

Demo: https://lukaszkujawa.github.io/gradient-descent.html

Variant: stochastic gradient descent
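A runnable MATLAB sketch of this loop (the step size alpha, iteration count, and zero initialization are our assumptions, not from the slides):

function theta = gradient_descent(Phi, y, alpha, iters)
% Batch gradient descent for J(theta) = sum_i (theta' phi(x_i) - y_i)^2
% Phi is m-by-k (rows are feature vectors), y is m-by-1.
theta = zeros(size(Phi, 2), 1);            % initial guess
for t = 1:iters
    grad = 2 * Phi' * (Phi * theta - y);   % sum_i 2 (theta' phi(x_i) - y_i) phi(x_i)
    theta = theta - alpha * grad;          % take a small step downhill
end
end

The stochastic variant would replace the full sum with the gradient of a single (e.g., randomly chosen) example per update.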

SLIDE 17

Let’s write J(θ) a little more compactly using matrix notation; define

Φ ∈ Rm×k, whose i-th row is φ(xi)T, and y ∈ Rm = (y1, y2, …, ym)T

then

J(θ) = Σ_{i=1}^m (θT φ(xi) − yi)² = ‖Φθ − y‖₂²

(‖z‖₂ is the ℓ2 norm of a vector: ‖z‖₂ ≡ √(Σ_{i=1}^m zi²) = √(zT z))

Called the least-squares objective function
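A small MATLAB sketch checking that the two forms of J(θ) agree (the data here is synthetic, generated only for illustration):

m = 50;
temps = 60 + 35 * rand(m, 1);                  % synthetic high temperatures in [60, 95] F
Phi = [temps ones(m, 1)];                      % i-th row is phi(x_i)' = [x_i, 1]
y = 0.046 * temps - 1.46 + 0.1 * randn(m, 1);  % synthetic noisy demand
theta = [0.05; -1.5];                          % arbitrary parameter guess
J_sum  = sum((Phi * theta - y).^2);            % summation form
J_norm = norm(Phi * theta - y)^2;              % ell-2 norm form
abs(J_sum - J_norm)                            % agrees up to floating-point rounding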

SLIDE 18

How do we optimize a function? 1-D case (θ ∈ R):

[Figure: plots of J(θ) and its derivative dJ/dθ for θ ∈ [−3, 3].]

J(θ) = θ² − 2θ − 1

dJ/dθ = 2θ − 2

θ⋆ a minimum ⇒ dJ/dθ |θ=θ⋆ = 0 ⇒ 2θ⋆ − 2 = 0 ⇒ θ⋆ = 1
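A one-line numerical check of this minimum in MATLAB (fminbnd is MATLAB's bounded 1-D minimizer; the search interval is our choice):

J = @(th) th.^2 - 2*th - 1;   % the objective from above
fminbnd(J, -3, 3)             % returns approximately 1, matching the calculus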

SLIDE 19

Multi-variate case: θ ∈ Rk, J : Rk → R

Generalized condition: ∇θJ(θ)|θ⋆ = 0

∇θJ(θ) ∈ Rk denotes the gradient of J with respect to θ: ∇θJ(θ) = (∂J/∂θ1, ∂J/∂θ2, …, ∂J/∂θk)T

Some important rules and common gradients:

∇θ(a f(θ) + b g(θ)) = a ∇θf(θ) + b ∇θg(θ),  (a, b ∈ R)

∇θ(θT Aθ) = (A + AT)θ,  (A ∈ Rk×k)

∇θ(bT θ) = b,  (b ∈ Rk)
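These rules can be spot-checked numerically; a MATLAB sketch using central finite differences on the quadratic rule (all names here are ours):

k = 3;
A = randn(k); theta = randn(k, 1);
f = @(th) th' * A * th;                  % f(theta) = theta' A theta
g_rule = (A + A') * theta;               % gradient predicted by the rule above
h = 1e-6; g_fd = zeros(k, 1);
for j = 1:k
    e = zeros(k, 1); e(j) = h;
    g_fd(j) = (f(theta + e) - f(theta - e)) / (2 * h);  % central difference
end
norm(g_rule - g_fd)                      % should be tiny (roughly 1e-9 or less)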

SLIDE 20

Optimizing the least-squares objective:

J(θ) = ‖Φθ − y‖₂² = (Φθ − y)T (Φθ − y) = θT ΦT Φθ − 2 yT Φθ + yT y

Using the previous gradient rules:

∇θJ(θ) = ∇θ(θT ΦT Φθ − 2 yT Φθ + yT y)
       = ∇θ(θT ΦT Φθ) − 2 ∇θ(yT Φθ) + ∇θ(yT y)
       = 2 ΦT Φθ − 2 ΦT y

Setting the gradient equal to zero:

2 ΦT Φθ⋆ − 2 ΦT y = 0  ⇐⇒  θ⋆ = (ΦT Φ)⁻¹ ΦT y

known as the normal equations

SLIDE 21

Let’s see how this looks in MATLAB code:

X = load('high_temperature.txt');
y = load('peak_demand.txt');
n = size(X, 2); m = size(X, 1);
Phi = [X ones(m, 1)];
theta = inv(Phi' * Phi) * Phi' * y
% theta = [0.0466; -1.4600]

The normal equations are so common that MATLAB has a special operation for them:

% same as inv(Phi' * Phi) * Phi' * y
theta = Phi \ y;
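As a quick follow-up sketch (our addition): the fitted parameters give the kind of prediction made on slide 9, and Phi \ y is also preferred numerically, since it solves the least-squares problem by factorization rather than forming an explicit inverse.

pred = [80 1] * theta   % predicted peak demand (GW) for a high of 80 F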

SLIDE 22

Higher-dimensional inputs

Input: x ∈ R2 = (temperature, hour of day)

Output: y ∈ R = demand

SLIDE 23

SLIDE 24

Features: φ(x) ∈ R3 = (temperature, hour of day, 1)T

Same matrices as before: Φ ∈ Rm×k, whose i-th row is φ(xi)T, and y ∈ Rm = (y1, …, ym)T

Same solution as before: θ ∈ R3 = (ΦT Φ)⁻¹ ΦT y
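A sketch of the higher-dimensional fit in MATLAB, assuming the two input columns are temperature and hour of day (the input file name is hypothetical):

X = load('temperature_and_hour.txt');   % hypothetical m-by-2 input: [temperature, hour]
y = load('peak_demand.txt');
m = size(X, 1);
Phi = [X ones(m, 1)];                   % rows are [temperature, hour of day, 1]
theta = Phi \ y;                        % theta in R^3, same normal-equations solution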
