15-388/688 - Practical Data Science: Intro to Machine Learning & Linear Regression

SLIDE 1

15-388/688 - Practical Data Science: Intro to Machine Learning & Linear Regression

  • J. Zico Kolter

Carnegie Mellon University Fall 2019

SLIDE 2

Announcements

HW 3 released, due 10/24 (not 10/17). Feedback on the tutorial has been sent to everyone who submitted by the deadline; it will be sent to the remaining people by tomorrow evening.

SLIDE 3

Outline

  • Least squares regression: a simple example
  • Machine learning notation
  • Linear regression revisited
  • Matrix/vector notation and analytic solutions
  • Implementing linear regression

SLIDE 4

Outline

  • Least squares regression: a simple example
  • Machine learning notation
  • Linear regression revisited
  • Matrix/vector notation and analytic solutions
  • Implementing linear regression

SLIDE 5

A simple example: predicting electricity use

What will peak power consumption be in Pittsburgh tomorrow? It is difficult to build an “a priori” model from first principles to answer this question. But it is relatively easy to record past days of consumption, plus additional features that affect consumption (e.g., weather).


Date         High Temperature (F)   Peak Demand (GW)
2011-06-01   84.0                   2.651
2011-06-02   73.0                   2.081
2011-06-03   75.2                   1.844
2011-06-04   84.9                   1.959
…            …                      …

SLIDE 6


Plot of consumption vs. temperature

Plot of high temperature vs. peak demand for summer months (June – August) for past six years

SLIDE 7

Hypothesis: linear model

Let’s suppose that the peak demand approximately fits a linear model:

Peak_Demand ≈ θ₁ ⋅ High_Temperature + θ₂

Here θ₁ is the “slope” of the line, and θ₂ is the intercept. How do we find a “good” fit to the data? There are many possibilities, but a natural objective is to minimize some difference between this line and the observed data, e.g. the squared loss:

E(θ) = ∑_{i∈days} (θ₁ ⋅ High_Temperature^(i) + θ₂ − Peak_Demand^(i))²
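As a concrete illustration, here is a minimal NumPy sketch of evaluating this squared loss for a candidate (θ₁, θ₂), using the four days from the table above; the variable names are our own, not the lecture's.

import numpy as np

temp = np.array([84.0, 73.0, 75.2, 84.9])        # High_Temperature (F), from the table
demand = np.array([2.651, 2.081, 1.844, 1.959])  # Peak_Demand (GW)

def E(theta1, theta2):
    # sum of squared errors of the line theta1*temp + theta2 against demand
    return np.sum((theta1 * temp + theta2 - demand) ** 2)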

SLIDE 8

How do we find parameters?

How do we find the parameters θ₁, θ₂ that minimize the function (writing x^(i) = High_Temperature^(i) and y^(i) = Peak_Demand^(i))

E(θ) = ∑_{i∈days} (θ₁ ⋅ High_Temperature^(i) + θ₂ − Peak_Demand^(i))² ≡ ∑_{i∈days} (θ₁ ⋅ x^(i) + θ₂ − y^(i))²

General idea: suppose we want to minimize some function g(θ). The derivative is the slope of the function, so the negative derivative points “downhill”.

[Figure: a function f(θ), with the tangent of slope f′(θ₀) at a point θ₀]

SLIDE 9

Computing the derivatives

What are the derivatives of the error function with respect to each parameter θ₁ and θ₂?

∂E(θ)/∂θ₁ = ∂/∂θ₁ ∑_{i∈days} (θ₁ ⋅ x^(i) + θ₂ − y^(i))²
          = ∑_{i∈days} ∂/∂θ₁ (θ₁ ⋅ x^(i) + θ₂ − y^(i))²
          = ∑_{i∈days} 2 (θ₁ ⋅ x^(i) + θ₂ − y^(i)) ⋅ ∂/∂θ₁ (θ₁ ⋅ x^(i))
          = ∑_{i∈days} 2 (θ₁ ⋅ x^(i) + θ₂ − y^(i)) ⋅ x^(i)

∂E(θ)/∂θ₂ = ∑_{i∈days} 2 (θ₁ ⋅ x^(i) + θ₂ − y^(i))

SLIDE 10

Finding the best θ

To find a good value of θ, we can repeatedly take steps in the direction of the negative derivatives for each value. Repeat:

θ₁ := θ₁ − α ∑_{i∈days} 2 (θ₁ ⋅ x^(i) + θ₂ − y^(i)) ⋅ x^(i)
θ₂ := θ₂ − α ∑_{i∈days} 2 (θ₁ ⋅ x^(i) + θ₂ − y^(i))

where α is some small positive number called the step size. This is the gradient descent algorithm, the workhorse of modern machine learning.
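A minimal sketch of this update loop in NumPy, reusing the toy temp/demand arrays from earlier; the step size and iteration count are illustrative assumptions (on unnormalized temperatures the step size must be tiny, which motivates the normalization two slides ahead).

import numpy as np

x = np.array([84.0, 73.0, 75.2, 84.9])        # high temperature (F)
y = np.array([2.651, 2.081, 1.844, 1.959])    # peak demand (GW)

theta1, theta2 = 0.0, 0.0
alpha = 1e-5                                  # step size (illustrative)
for _ in range(10000):
    resid = theta1 * x + theta2 - y           # per-day prediction error
    theta1 -= alpha * np.sum(2 * resid * x)   # derivative w.r.t. theta1
    theta2 -= alpha * np.sum(2 * resid)       # derivative w.r.t. theta2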

SLIDE 11

[Figure: peak demand (GW) vs. high temperature (F)]

Gradient descent

SLIDE 12

[Figure: peak demand (GW) vs. normalized temperature]

Gradient descent


Normalize input by subtracting the mean and dividing by the standard deviation
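As a sketch, this normalization is one line in pandas, assuming df_summer["Temp"] is the temperature column used in the implementation later in the lecture:

# assuming df_summer is the summer-months DataFrame used later in the lecture
x_norm = (df_summer["Temp"] - df_summer["Temp"].mean()) / df_summer["Temp"].std()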

SLIDE 13

Gradient descent – Iteration 1

[Figure: squared loss fit to observed days; peak demand (GW) vs. normalized temperature]
θ = (0.00, 0.00), E(θ) = 1427.53, (∂E(θ)/∂θ₁, ∂E(θ)/∂θ₂) = (−151.20, −1243.10)

SLIDE 14

Gradient descent – Iteration 2

[Figure: squared loss fit to observed days; peak demand (GW) vs. normalized temperature]
θ = (0.15, 1.24), E(θ) = 292.18, (∂E(θ)/∂θ₁, ∂E(θ)/∂θ₂) = (−67.74, −556.91)

SLIDE 15

Gradient descent – Iteration 3

[Figure: squared loss fit to observed days; peak demand (GW) vs. normalized temperature]
θ = (0.22, 1.80), E(θ) = 64.31, (∂E(θ)/∂θ₁, ∂E(θ)/∂θ₂) = (−30.35, −249.50)

SLIDE 16

Gradient descent – Iteration 4

[Figure: squared loss fit to observed days; peak demand (GW) vs. normalized temperature]
θ = (0.25, 2.05), E(θ) = 18.58, (∂E(θ)/∂θ₁, ∂E(θ)/∂θ₂) = (−13.60, −111.77)

SLIDE 17

Gradient descent – Iteration 5

[Figure: squared loss fit to observed days; peak demand (GW) vs. normalized temperature]
θ = (0.26, 2.16), E(θ) = 9.40, (∂E(θ)/∂θ₁, ∂E(θ)/∂θ₂) = (−6.09, −50.07)

SLIDE 18

Gradient descent – Iteration 10

[Figure: squared loss fit to observed days; peak demand (GW) vs. normalized temperature]
θ = (0.27, 2.25), E(θ) = 7.09, (∂E(θ)/∂θ₁, ∂E(θ)/∂θ₂) = (−0.11, −0.90)

SLIDE 19

Fitted line in “original” coordinates

[Figure: squared loss fit to observed days; peak demand (GW) vs. high temperature (F)]

SLIDE 20

Making predictions

Importantly, our model also lets us make predictions about new days: what will the peak demand be tomorrow? If we know the high temperature will be 72 degrees (ignoring for now that this is also a prediction), then we can predict the peak demand to be:

Predicted_Demand = θ₁ ⋅ 72 + θ₂ = 1.821 GW

(this requires that we rescale θ back to “normal” coordinates after solving). This is equivalent to just “finding the point on the line”.
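Equivalently, instead of rescaling θ one can normalize the new temperature with the training mean and standard deviation before applying the fitted line; a sketch, where mu, sigma, theta1, theta2 are assumed to come from the normalization and fit above:

t_norm = (72 - mu) / sigma                     # put the new input in normalized coordinates
predicted_demand = theta1 * t_norm + theta2    # ~1.821 GW with the fitted parameters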

SLIDE 21

Extensions

What if we want to add additional features, e.g. day of week, instead of just temperature? What if we want to use a different loss function instead of squared error (e.g., absolute error)? What if we want to use a non-linear prediction instead of a linear one? We can easily reason about all these things by adopting some additional notation…

SLIDE 22

Outline

  • Least squares regression: a simple example
  • Machine learning notation
  • Linear regression revisited
  • Matrix/vector notation and analytic solutions
  • Implementing linear regression

SLIDE 23

Machine learning

This has been an example of a machine learning algorithm. Basic idea: in many domains it is difficult to hand-build a predictive model, but easy to collect lots of data; machine learning provides a way to automatically infer the predictive model from data. The basic process (supervised learning):


Training data (x^(1), y^(1)), (x^(2), y^(2)), (x^(3), y^(3)), … → machine learning algorithm → hypothesis function h with y^(i) ≈ h(x^(i)) → for a new example x, prediction ŷ = h(x)

SLIDE 24

Terminology

Input features: x^(i) ∈ ℝⁿ, i = 1, …, m

  • E.g.: x^(i) = [High_Temperature^(i), Is_Weekday^(i), 1]ᵀ

Outputs: y^(i) ∈ 𝒴, i = 1, …, m

  • E.g.: y^(i) ∈ ℝ = Peak_Demand^(i)

Model parameters: θ ∈ ℝⁿ

Hypothesis function: h_θ : ℝⁿ → 𝒴, predicts output given input

  • E.g.: h_θ(x) = ∑_{j=1}^n θ_j ⋅ x_j

SLIDE 25

Terminology

Loss function: ℓ : 𝒴 × 𝒴 → ℝ₊, measures the difference between a prediction and an actual output

  • E.g.: ℓ(ŷ, y) = (ŷ − y)²

The canonical machine learning optimization problem:

minimize_θ ∑_{i=1}^m ℓ(h_θ(x^(i)), y^(i))

Virtually every machine learning algorithm has this form; just specify

  • What is the hypothesis function?
  • What is the loss function?
  • How do we solve the optimization problem?
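A generic sketch of this recipe in Python (the function names are our own illustration, with the linear hypothesis and squared loss of this lecture as the plugged-in choices):

import numpy as np

def h(theta, x):
    return theta @ x                  # hypothesis function (here: linear)

def loss(yhat, y):
    return (yhat - y) ** 2            # loss function (here: squared loss)

def objective(theta, X, Y):
    # the canonical problem: summed loss over all training pairs
    return sum(loss(h(theta, x), y) for x, y in zip(X, Y))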

SLIDE 26

Example machine learning algorithms

Note: we (machine learning researchers) have not been consistent in naming conventions; many machine learning algorithms actually only specify some of these three elements

  • Least squares: {linear hypothesis, squared loss, (usually) analytical solution}
  • Linear regression: {linear hypothesis, *, *}
  • Support vector machine: {linear or kernel hypothesis, hinge loss, *}
  • Neural network: {composed non-linear function, *, (usually) gradient descent}
  • Decision tree: {hierarchical axis-aligned halfplanes, *, greedy optimization}
  • Naïve Bayes: {linear hypothesis, joint probability under certain independence assumptions, analytical solution}

SLIDE 27

Outline

  • Least squares regression: a simple example
  • Machine learning notation
  • Linear regression revisited
  • Matrix/vector notation and analytic solutions
  • Implementing linear regression

SLIDE 28

Least squares revisited

Using our new terminology, plus matrix notation, let’s revisit how to solve linear regression with a squared error loss. Setup:

  • Linear hypothesis function: h_θ(x) = ∑_{j=1}^n θ_j ⋅ x_j
  • Squared error loss: ℓ(ŷ, y) = (ŷ − y)²
  • Resulting machine learning optimization problem:

minimize_θ ∑_{i=1}^m ( ∑_{j=1}^n θ_j ⋅ x_j^(i) − y^(i) )² ≡ minimize_θ E(θ)

SLIDE 29

Derivative of the least squares objective

Compute the partial derivative with respect to an arbitrary model parameter θ_k:

∂E(θ)/∂θ_k = ∂/∂θ_k ∑_{i=1}^m ( ∑_{j=1}^n θ_j ⋅ x_j^(i) − y^(i) )²
           = ∑_{i=1}^m ∂/∂θ_k ( ∑_{j=1}^n θ_j ⋅ x_j^(i) − y^(i) )²
           = ∑_{i=1}^m 2 ( ∑_{j=1}^n θ_j ⋅ x_j^(i) − y^(i) ) ⋅ ∂/∂θ_k ∑_{j=1}^n θ_j ⋅ x_j^(i)
           = ∑_{i=1}^m 2 ( ∑_{j=1}^n θ_j ⋅ x_j^(i) − y^(i) ) ⋅ x_k^(i)

SLIDE 30

Gradient descent algorithm

  • 1. Initialize θ_k := 0, k = 1, …, n
  • 2. Repeat:
      • For k = 1, …, n:

θ_k := θ_k − α ∑_{i=1}^m 2 ( ∑_{j=1}^n θ_j ⋅ x_j^(i) − y^(i) ) ⋅ x_k^(i)

Note: do not actually implement it like this; you’ll want to use the matrix/vector notation we will cover soon.

SLIDE 31

Outline

  • Least squares regression: a simple example
  • Machine learning notation
  • Linear regression revisited
  • Matrix/vector notation and analytic solutions
  • Implementing linear regression

SLIDE 32

The gradient

It is typically more convenient to work with a vector of all partial derivatives, called the gradient. For a function g : ℝⁿ → ℝ, the gradient is a vector

∇_θ g(θ) = [ ∂g(θ)/∂θ₁, …, ∂g(θ)/∂θ_n ]ᵀ ∈ ℝⁿ
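A quick numerical sketch of this definition: each entry of the gradient can be approximated by a finite difference along one coordinate, which is a common way to sanity-check analytic gradients (the function and variable names are our own):

import numpy as np

def approx_gradient(g, theta, eps=1e-6):
    # central-difference approximation of each partial derivative of g at theta
    grad = np.zeros_like(theta, dtype=float)
    for k in range(len(theta)):
        e = np.zeros_like(theta, dtype=float)
        e[k] = eps
        grad[k] = (g(theta + e) - g(theta - e)) / (2 * eps)
    return grad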

SLIDE 33

Gradient in vector notation

We can actually simplify the gradient computation (both notationally and computationally) substantially using matrix/vector notation:

∂E(θ)/∂θ_k = 2 ∑_{i=1}^m ( ∑_{j=1}^n θ_j ⋅ x_j^(i) − y^(i) ) ⋅ x_k^(i)   ⟺   ∇_θ E(θ) = 2 ∑_{i=1}^m x^(i) ( x^(i)ᵀ θ − y^(i) )

Putting things in this form also makes it clearer how to analytically find the optimal solution for least squares.
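Stacking the x^(i) as rows of a matrix X (made precise two slides ahead), this gradient is one line of NumPy, which also gives the vectorized gradient descent loop the earlier note asked for; X, y, the step size, and the iteration count below are toy assumptions:

import numpy as np

# toy stand-ins: rows of X are x^(i) (with a constant-1 feature), y holds y^(i)
X = np.array([[84.0, 1.0], [73.0, 1.0], [75.2, 1.0], [84.9, 1.0]])
y = np.array([2.651, 2.081, 1.844, 1.959])

alpha = 1e-5
theta = np.zeros(X.shape[1])
for _ in range(10000):
    grad = 2 * X.T @ (X @ theta - y)   # = 2 * sum_i x^(i) (x^(i)^T theta - y^(i))
    theta -= alpha * grad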

SLIDE 34

Solving least squares

The gradient also gives a condition for optimality: the gradient must equal zero. Solving ∇_θ E(θ) = 0:

2 ∑_{i=1}^m x^(i) ( x^(i)ᵀ θ − y^(i) ) = 0
⇒ ∑_{i=1}^m x^(i) x^(i)ᵀ θ − ∑_{i=1}^m x^(i) y^(i) = 0
⇒ θ⋆ = ( ∑_{i=1}^m x^(i) x^(i)ᵀ )⁻¹ ∑_{i=1}^m x^(i) y^(i)

SLIDE 35

Matrix notation, one level deeper

Let’s define the matrices

X = [ − x^(1)ᵀ − ; − x^(2)ᵀ − ; ⋮ ; − x^(m)ᵀ − ],   y = [ y^(1) ; y^(2) ; ⋮ ; y^(m) ]

Then

∇_θ E(θ) = 2 ∑_{i=1}^m x^(i) ( x^(i)ᵀ θ − y^(i) ) = 2 Xᵀ ( X θ − y )   ⟹   θ⋆ = ( Xᵀ X )⁻¹ Xᵀ y

These are known as the normal equations, an extremely convenient closed-form solution for least squares (without need for normalization).

SLIDE 36

Example: electricity demand

Returning to our electricity demand example:

x^(i) = [ High_Temperature^(i), 1 ]ᵀ,   θ⋆ = ( Xᵀ X )⁻¹ Xᵀ y = [ 0.046, −1.574 ]ᵀ

SLIDE 37

Example: electricity demand

Returning to our electricity demand example:

x^(i) = [ High_Temperature^(i), Is_Weekday^(i), 1 ]ᵀ,   θ⋆ = ( Xᵀ X )⁻¹ Xᵀ y = [ 0.047, 0.225, −1.803 ]ᵀ

SLIDE 38

Poll: linear regression models

In the previous example, we had the same slope for both weekend and weekday examples, just with a different intercept. Is it possible to have a model with both different slopes and different intercepts?

  • 1. The previous example already did have different slopes
  • 2. This is not possible with linear regression
  • 3. You need to build two models, one just on weekdays and one just on weekends
  • 4. You can do it with a single model, just with different features
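For reference, a feature construction along the lines of option 4 (a hypothetical sketch, not from the slides): adding a temperature-by-weekday product feature gives weekdays their own slope as well as their own intercept.

import numpy as np

temp = np.array([84.0, 73.0, 75.2, 84.9])     # toy temperatures
is_weekday = np.array([1.0, 1.0, 0.0, 0.0])   # toy weekday indicator
X = np.array([temp, is_weekday, temp * is_weekday, np.ones(len(temp))]).T
# prediction: theta1*temp + theta2*is_weekday + theta3*temp*is_weekday + theta4
# slope on weekends: theta1; slope on weekdays: theta1 + theta3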

SLIDE 39

Outline

  • Least squares regression: a simple example
  • Machine learning notation
  • Linear regression revisited
  • Matrix/vector notation and analytic solutions
  • Implementing linear regression

SLIDE 40

Manual implementation of linear regression

Create the data matrices, compute the solution, and make predictions:


# initialize X matrix and y vector
X = np.array([df_summer["Temp"], df_summer["IsWeekday"], np.ones(len(df_summer))]).T
y = df_summer["Load"].values

# solve least squares
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)
# [ 0.04747948  0.22462824 -1.80260016]

# predict on new data
Xnew = np.array([[77, 1, 1], [80, 0, 1]])
ypred = Xnew @ theta
print(ypred)
# [ 2.07794778  1.99575797]

SLIDE 41

Scikit-learn

By far the most popular machine learning library in Python is the scikit-learn library (http://scikit-learn.org/). It has reasonable (usually) implementations of many different learning algorithms, usually fast enough for small/medium problems. Important: you need to understand the very basics of how these algorithms work in order to use them effectively. Sadly, a lot of data science in practice seems to be driven by the default parameters for scikit-learn classifiers…

SLIDE 42

Linear regression in scikit-learn

Fit a model and predict on new data; inspect the internal model coefficients.


from sklearn.linear_model import LinearRegression

# don't include constant term in X
X = np.array([df_summer["Temp"], df_summer["IsWeekday"]]).T
model = LinearRegression(fit_intercept=True, normalize=False)
model.fit(X, y)

# predict on new data
Xnew = np.array([[77, 1], [80, 0]])
model.predict(Xnew)
# [ 2.07794778  1.99575797]

print(model.coef_, model.intercept_)
# [ 0.04747948  0.22462824] -1.80260016

SLIDE 43

Scikit-learn-like model, manually

We can easily implement a class that provides a scikit-learn-like interface


class MyLinearRegression:
    def __init__(self, fit_intercept=True):
        self.fit_intercept = fit_intercept

    def fit(self, X, y):
        # append a constant-1 column so the last coefficient acts as the intercept
        if self.fit_intercept:
            X = np.hstack([X, np.ones((X.shape[0], 1))])
        # normal equations: solve (X^T X) theta = X^T y
        self.coef_ = np.linalg.solve(X.T @ X, X.T @ y)
        if self.fit_intercept:
            self.intercept_ = self.coef_[-1]
            self.coef_ = self.coef_[:-1]

    def predict(self, X):
        pred = X @ self.coef_
        if self.fit_intercept:
            pred += self.intercept_
        return pred
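A usage sketch, mirroring the scikit-learn example above (X, y, and the new inputs are the same as on the previous slides):

model = MyLinearRegression(fit_intercept=True)
model.fit(X, y)
print(model.coef_, model.intercept_)
# should match the scikit-learn coefficients above

ypred = model.predict(np.array([[77, 1], [80, 0]]))
print(ypred)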