

SLIDE 1

Machine Learning 1: Linear Regression

Stefano Ermon March 31, 2016

SLIDE 2

Plan for today

Supervised Machine Learning: linear regression

SLIDE 3

Renewable electricity generation in the U.S.

Source: Renewable energy data book, NREL

SLIDE 4

Challenges for the grid

Wind and solar are intermittent. We will need traditional power plants when the wind stops.

Many power plants (e.g., nuclear) cannot be easily turned on/off or quickly ramped up/down

With more accurate forecasts, wind and solar power become more efficient alternatives

A few years ago, Xcel Energy (Colorado) ran ads opposing a proposal that it use 10% renewable sources. Thanks to wind-forecasting (ML) algorithms developed at NCAR, it now aims for 30%. Accurate forecasting saved the utility $6-$10 million per year.

SLIDE 5

Motivation

Solar and wind are intermittent. Can we accurately forecast how much energy we will consume tomorrow?

Difficult to estimate from “a priori” models. But we have lots of data from which to build a model.

SLIDE 6

Typical electricity consumption

[Figure: typical hourly electricity demand (GW) by hour of day for three days (Feb 9, Jul 13, Oct 10). Data: PJM, http://www.pjm.com]

SLIDE 7

Predict peak demand from high temperature

What will peak demand be tomorrow? If we know something else about tomorrow (like the high temperature), we can use this to predict peak demand

[Figure: scatter plot of peak hourly demand (GW) vs. high temperature (°F).]

Data: PJM, Weather Underground (summer months, June-August)

SLIDE 8

A simple model

A linear model that predicts demand: predicted peak demand = θ1 · (high temperature) + θ2

[Figure: observed data and the linear regression prediction; peak hourly demand (GW) vs. high temperature (°F).]

Parameters of model: θ1, θ2 ∈ R (θ1 = 0.046, θ2 = −1.46)

SLIDE 9

A simple model

We can use a model like this to make predictions. What will be the peak demand tomorrow?

I know from the weather report that the high temperature will be 80°F (ignore, for the moment, that this too is a prediction)

Then the predicted peak demand is: θ1 · 80 + θ2 = 0.046 · 80 − 1.46 ≈ 2.2 GW
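As a quick MATLAB sketch of this calculation (our addition; the parameter values are the rounded ones from the previous slide):

theta = [0.046; -1.46];                 % [theta1; theta2], rounded fit from slide 8
high_temp = 80;                         % assumed forecast high temperature (F)
pred = theta(1) * high_temp + theta(2)  % predicted peak demand in GW, about 2.2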

SLIDE 10

Formal problem setting

Input: xi ∈ Rn, i = 1, . . . , m

E.g.: xi ∈ R1 = {high temperature for day i}

Output: yi ∈ R (regression task)

E.g.: yi ∈ R = {peak demand for day i}

Model parameters: θ ∈ Rk

Predicted output: ŷi ∈ R; e.g., ŷi = θ1 · xi + θ2

SLIDE 11

For convenience, we define a function that maps inputs to feature vectors, φ : Rn → Rk. For example, in our task above, if we define

φ(xi) = (xi, 1)T  (here n = 1, k = 2)

then we can write

ŷi = Σ_{j=1}^k θj · φj(xi) ≡ θT φ(xi)
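As a minimal MATLAB sketch of this feature map (the names phi and theta here are ours, not from the slides):

phi = @(x) [x; 1];           % feature map: scalar input -> vector in R^2
theta = [0.046; -1.46];      % parameters from the earlier fit
yhat = theta' * phi(80)      % the same prediction, now written as theta^T phi(x)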

SLIDE 12

Loss functions

Want a model that performs “well” on the data we have, i.e., ŷi ≈ yi, ∀i

We measure the “closeness” of ŷi and yi using a loss function ℓ : R × R → R+

Example: squared loss ℓ(ŷi, yi) = (ŷi − yi)²
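In MATLAB the squared loss is a one-liner; a sketch with made-up numbers, purely for illustration:

sq_loss = @(yhat, y) (yhat - y).^2;   % squared loss
sq_loss(2.2, 2.5)                     % a 0.3 GW miss costs 0.09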

SLIDE 13

Finding model parameters, and optimization

Want to find model parameters that minimize the sum of losses over all input/output pairs:

J(θ) = Σ_{i=1}^m ℓ(ŷi, yi) = Σ_{i=1}^m (θT φ(xi) − yi)²

Write our objective formally as

minimize_θ J(θ)

a simple example of an optimization problem; these will dominate our development of algorithms throughout the course
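With a feature matrix whose i-th row is φ(xi)T (built explicitly on slide 17), the objective is one line of MATLAB; a sketch with names of our choosing:

J = @(theta, Phi, y) sum((Phi * theta - y).^2);   % sum of squared residuals over all m examples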

SLIDE 14

How do we optimize a function?

Search algorithm: Start with an initial guess for θ. Keep changing θ (by a little bit) to reduce J(θ).

Animation: https://www.youtube.com/watch?v=vWFjqgb-ylQ

SLIDE 15

Gradient descent

Search algorithm: Start with an initial guess for θ. Keep changing θ (by a little bit) to reduce

J(θ) = Σ_{i=1}^m ℓ(ŷi, yi) = Σ_{i=1}^m (θT φ(xi) − yi)²

Gradient descent: θj := θj − α ∂J(θ)/∂θj, for all j

∂J/∂θj = ∂/∂θj Σ_{i=1}^m (θT φ(xi) − yi)²
       = Σ_{i=1}^m ∂/∂θj (θT φ(xi) − yi)²
       = Σ_{i=1}^m 2 (θT φ(xi) − yi) · ∂/∂θj (θT φ(xi) − yi)
       = Σ_{i=1}^m 2 (θT φ(xi) − yi) φj(xi)

SLIDE 16

Gradient descent

Repeat until “convergence”:

θj := θj − α Σ_{i=1}^m 2 (θT φ(xi) − yi) φj(xi), for all j

Demo: https://lukaszkujawa.github.io/gradient-descent.html

Variant: stochastic gradient descent
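A runnable MATLAB sketch of this loop (the step size alpha, iteration count, and zero initialization are our assumptions, not from the slides):

function theta = gradient_descent(Phi, y, alpha, iters)
% Batch gradient descent for J(theta) = sum_i (theta' phi(x_i) - y_i)^2
% Phi is m-by-k (rows are feature vectors), y is m-by-1.
theta = zeros(size(Phi, 2), 1);            % initial guess
for t = 1:iters
    grad = 2 * Phi' * (Phi * theta - y);   % sum_i 2 (theta' phi(x_i) - y_i) phi(x_i)
    theta = theta - alpha * grad;          % take a small step downhill
end
end

The stochastic variant would replace the full sum with the gradient of a single (e.g., randomly chosen) example per update.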

SLIDE 17

Let’s write J(θ) a little more compactly using matrix notation; define

Φ ∈ Rm×k, whose i-th row is φ(xi)T, and y ∈ Rm = (y1, y2, …, ym)T

then

J(θ) = Σ_{i=1}^m (θT φ(xi) − yi)² = ‖Φθ − y‖₂²

(‖z‖₂ is the ℓ2 norm of a vector: ‖z‖₂ ≡ √(Σ_{i=1}^m zi²) = √(zT z))

Called the least-squares objective function
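A small MATLAB sketch checking that the two forms of J(θ) agree (the data here is synthetic, generated only for illustration):

m = 50;
temps = 60 + 35 * rand(m, 1);                  % synthetic high temperatures in [60, 95] F
Phi = [temps ones(m, 1)];                      % i-th row is phi(x_i)' = [x_i, 1]
y = 0.046 * temps - 1.46 + 0.1 * randn(m, 1);  % synthetic noisy demand
theta = [0.05; -1.5];                          % arbitrary parameter guess
J_sum  = sum((Phi * theta - y).^2);            % summation form
J_norm = norm(Phi * theta - y)^2;              % ell-2 norm form
abs(J_sum - J_norm)                            % agrees up to floating-point rounding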

SLIDE 18

How do we optimize a function? 1-D case (θ ∈ R):

[Figure: plots of J(θ) and its derivative dJ/dθ for θ ∈ [−3, 3].]

J(θ) = θ² − 2θ − 1

dJ/dθ = 2θ − 2

θ⋆ a minimum ⇒ dJ/dθ |θ=θ⋆ = 0 ⇒ 2θ⋆ − 2 = 0 ⇒ θ⋆ = 1
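A one-line numerical check of this minimum in MATLAB (fminbnd is MATLAB's bounded 1-D minimizer; the search interval is our choice):

J = @(th) th.^2 - 2*th - 1;   % the objective from above
fminbnd(J, -3, 3)             % returns approximately 1, matching the calculus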

SLIDE 19

Multi-variate case: θ ∈ Rk, J : Rk → R

Generalized condition: ∇θJ(θ)|θ⋆ = 0

∇θJ(θ) ∈ Rk denotes the gradient of J with respect to θ: ∇θJ(θ) = (∂J/∂θ1, ∂J/∂θ2, …, ∂J/∂θk)T

Some important rules and common gradients:

∇θ(a f(θ) + b g(θ)) = a ∇θf(θ) + b ∇θg(θ),  (a, b ∈ R)

∇θ(θT Aθ) = (A + AT)θ,  (A ∈ Rk×k)

∇θ(bT θ) = b,  (b ∈ Rk)
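These rules can be spot-checked numerically; a MATLAB sketch using central finite differences on the quadratic rule (all names here are ours):

k = 3;
A = randn(k); theta = randn(k, 1);
f = @(th) th' * A * th;                  % f(theta) = theta' A theta
g_rule = (A + A') * theta;               % gradient predicted by the rule above
h = 1e-6; g_fd = zeros(k, 1);
for j = 1:k
    e = zeros(k, 1); e(j) = h;
    g_fd(j) = (f(theta + e) - f(theta - e)) / (2 * h);  % central difference
end
norm(g_rule - g_fd)                      % should be tiny (roughly 1e-9 or less)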

SLIDE 20

Optimizing the least-squares objective:

J(θ) = ‖Φθ − y‖₂² = (Φθ − y)T (Φθ − y) = θT ΦT Φθ − 2 yT Φθ + yT y

Using the previous gradient rules:

∇θJ(θ) = ∇θ(θT ΦT Φθ − 2 yT Φθ + yT y)
       = ∇θ(θT ΦT Φθ) − 2 ∇θ(yT Φθ) + ∇θ(yT y)
       = 2 ΦT Φθ − 2 ΦT y

Setting the gradient equal to zero:

2 ΦT Φθ⋆ − 2 ΦT y = 0  ⇐⇒  θ⋆ = (ΦT Φ)⁻¹ ΦT y

known as the normal equations

SLIDE 21

Let’s see how this looks in MATLAB code:

X = load('high_temperature.txt');
y = load('peak_demand.txt');
n = size(X, 2); m = size(X, 1);
Phi = [X ones(m, 1)];
theta = inv(Phi' * Phi) * Phi' * y
% theta = [0.0466; -1.4600]

The normal equations are so common that MATLAB has a special operation for them:

% same as inv(Phi' * Phi) * Phi' * y
theta = Phi \ y;
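As a quick follow-up sketch (our addition): the fitted parameters give the kind of prediction made on slide 9, and Phi \ y is also preferred numerically, since it solves the least-squares problem by factorization rather than forming an explicit inverse.

pred = [80 1] * theta   % predicted peak demand (GW) for a high of 80 F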

SLIDE 22

Higher-dimensional inputs

Input: x ∈ R2 = (temperature, hour of day)

Output: y ∈ R = demand

SLIDE 23

SLIDE 24

Features: φ(x) ∈ R3 = (temperature, hour of day, 1)T

Same matrices as before: Φ ∈ Rm×k, whose i-th row is φ(xi)T, and y ∈ Rm = (y1, …, ym)T

Same solution as before: θ ∈ R3 = (ΦT Φ)⁻¹ ΦT y
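A sketch of the higher-dimensional fit in MATLAB, assuming the two input columns are temperature and hour of day (the input file name is hypothetical):

X = load('temperature_and_hour.txt');   % hypothetical m-by-2 input: [temperature, hour]
y = load('peak_demand.txt');
m = size(X, 1);
Phi = [X ones(m, 1)];                   % rows are [temperature, hour of day, 1]
theta = Phi \ y;                        % theta in R^3, same normal-equations solution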
