SLIDE 1

15-780 – Graduate Artificial Intelligence: Machine learning

  • J. Zico Kolter (this lecture) and Nihar Shah

Carnegie Mellon University, Spring 2020

SLIDE 2

Outline

What is machine learning?
Linear regression
Linear classification
Nonlinear methods
Overfitting, generalization, and regularization
Evaluating machine learning algorithms

SLIDE 3

Outline

What is machine learning?
Linear regression
Linear classification
Nonlinear methods
Overfitting, generalization, and regularization
Evaluating machine learning algorithms

SLIDE 4

Introduction: digit classification

The task: write a program that, given a 28x28 grayscale image of a digit, outputs the number in the image
Image: digits from the MNIST data set (http://yann.lecun.com/exdb/mnist/)

SLIDE 5

Approaches

Approach 1: try to write a program by hand that uses your a priori knowledge about what images look like to determine what number they are
Approach 2: (the machine learning approach) collect a large volume of images and their corresponding numbers, and let the computer “write its own program” to map from these images to their corresponding numbers
(More precisely, this is a subset of machine learning called supervised learning)

SLIDE 6

Supervised learning pipeline

Training data: example inputs paired with their outputs — here digit images labeled 2, 0, 5, 8 — with x^(i) ∈ 𝒳, y^(i) ∈ 𝒴
Machine learning algorithm: produces a hypothesis function h: 𝒳 → 𝒴 such that y^(i) ≈ h(x^(i)), ∀i
(On new data x′ ∈ 𝒳, make prediction y′ = h(x′))

SLIDE 7

Outline

What is machine learning?
Linear regression
Linear classification
Nonlinear methods
Overfitting, generalization, and regularization
Evaluating machine learning algorithms

SLIDE 8

A simple example: predicting electricity use

What will peak power consumption be in Pittsburgh tomorrow?
It is difficult to build an “a priori” model from first principles to answer this question
But it is relatively easy to record past days of consumption, plus additional features that affect consumption (e.g., weather)

Date       | High Temperature (F) | Peak Demand (GW)
2011-06-01 | 84.0                 | 2.651
2011-06-02 | 73.0                 | 2.081
2011-06-03 | 75.2                 | 1.844
2011-06-04 | 84.9                 | 1.959
…          | …                    | …

SLIDE 9

Plot of consumption vs. temperature

Plot of high temperature vs. peak demand for summer months (June – August) for the past six years

SLIDE 10

Hypothesis: linear model

Let’s suppose that the peak demand approximately fits a linear model:

Peak_Demand ≈ θ₁ ⋅ High_Temperature + θ₂

Here θ₁ is the “slope” of the line, and θ₂ is the intercept
Now, given a forecast of tomorrow’s weather (ignoring for a moment that this is also a prediction), we can predict how high the peak demand will be

SLIDE 11

Predictions

Predicting in this manner is equivalent to “drawing a line through the data”

[Figure: observed days and the linear prediction; x-axis High Temperature (F), y-axis Peak Demand (GW)]

SLIDE 12

Machine learning notation

Input features: x^(i) ∈ ℝ^n, i = 1, …, m

  • E.g.: x^(i) = [High_Temperature^(i), 1]ᵀ

Outputs: y^(i) ∈ ℝ (regression task)

  • E.g.: y^(i) ∈ ℝ = Peak_Demand^(i)

Model parameters: θ ∈ ℝ^k (for linear models, k = n)
Hypothesis function: h_θ: ℝ^n → ℝ, predicts output given input

  • E.g.: h_θ(x) = θᵀx = ∑_{j=1}^n θ_j ⋅ x_j

SLIDE 13

Loss functions

How do we measure how “good” a hypothesis function is, i.e., how close our approximation is on the training data: y^(i) ≈ h_θ(x^(i))?
Typically done by introducing a loss function ℓ: ℝ×ℝ → ℝ+, where ℓ(h_θ(x), y) denotes how far apart the prediction is from the actual output
E.g., for regression a common loss function is the squared error: ℓ(h_θ(x), y) = (h_θ(x) − y)²

SLIDE 14

The canonical machine learning problem

With this notation, we define the canonical machine learning problem: given a set of input features and outputs (x^(i), y^(i)), i = 1, …, m, find the parameters that minimize the sum of losses

minimize_θ ∑_{i=1}^m ℓ(h_θ(x^(i)), y^(i))

Virtually all machine learning algorithms have this form; we just need to specify

  • 1. What is the hypothesis function?
  • 2. What is the loss function?
  • 3. How do we solve the optimization problem?

SLIDE 15

Least squares

Let’s formulate our linear least squares problem in this notation
Hypothesis function: h_θ(x) = θᵀx
Squared loss function: ℓ(h_θ(x), y) = (h_θ(x) − y)²
This leads to the machine learning optimization problem

minimize_θ ∑_{i=1}^m ℓ(h_θ(x^(i)), y^(i)) ≡ minimize_θ ∑_{i=1}^m (θᵀx^(i) − y^(i))²

This is a convex optimization problem in θ, so we expect global solutions
But how do we solve this optimization problem?

SLIDE 16

Solution via gradient descent

Recall the gradient descent algorithm (written now to optimize θ):

Repeat: θ ← θ − α ∇_θ f(θ)

What is the gradient of our objective function?

∇_θ ∑_{i=1}^m (θᵀx^(i) − y^(i))² = ∑_{i=1}^m ∇_θ (θᵀx^(i) − y^(i))² = 2 ∑_{i=1}^m x^(i) (θᵀx^(i) − y^(i))

(using the chain rule and the fact that ∇_θ θᵀx^(i) = x^(i)), which gives the update (folding the constant 2 into the step size α):

Repeat: θ ← θ − α ∑_{i=1}^m x^(i) (θᵀx^(i) − y^(i))
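To make this concrete, here is a minimal NumPy sketch of the update (my illustration, not from the slides); X is the m×n feature matrix and y the output vector in the notation of the next slide, and the step size and iteration count are arbitrary assumed values:

    import numpy as np

    def least_squares_gd(X, y, alpha=1e-4, iters=1000):
        # Gradient descent on the sum of squared errors
        theta = np.zeros(X.shape[1])
        for _ in range(iters):
            grad = 2 * X.T @ (X @ theta - y)  # gradient of sum_i (theta^T x^(i) - y^(i))^2
            theta -= alpha * grad
        return theta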

SLIDE 17

Linear algebra notation

Summation notation gets cumbersome, so it is convenient to introduce a more compact notation:

X = [x^(1)ᵀ; x^(2)ᵀ; …; x^(m)ᵀ] ∈ ℝ^{m×n},  y = [y^(1); y^(2); …; y^(m)] ∈ ℝ^m

The least squares objective can now be written

∑_{i=1}^m (θᵀx^(i) − y^(i))² = ‖Xθ − y‖₂²

and the gradient is given by

∇_θ ‖Xθ − y‖₂² = 2Xᵀ(Xθ − y)

SLIDE 18

An alternative solution method

In order for θ⋆ to minimize some (unconstrained, differentiable) function f, it is necessary and sufficient that ∇_θ f(θ⋆) = 0
Previously we attained this point iteratively through gradient descent, but for the squared error loss we can also find it analytically:

∇_θ ‖Xθ⋆ − y‖₂² = 0
⟹ 2Xᵀ(Xθ⋆ − y) = 0 ⟹ XᵀXθ⋆ = Xᵀy ⟹ θ⋆ = (XᵀX)⁻¹Xᵀy

These are called the normal equations, a closed-form solution for minimizing the sum of squared losses
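For illustration (a sketch, not the lecture’s code), the normal equations are one line in NumPy; solving the linear system XᵀXθ = Xᵀy is numerically preferable to forming the explicit inverse:

    import numpy as np

    def least_squares_normal(X, y):
        # theta* = (X^T X)^{-1} X^T y, computed via a linear solve
        return np.linalg.solve(X.T @ X, X.T @ y)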

SLIDE 19

Least squares solution

Solving the normal equations (or running gradient descent) gives coefficients θ₁ and θ₂ corresponding to the following fit

SLIDE 20

Poll: least squares when m < n

What happens when you run a least-squares solver, built using the simple normal equations in Python, when m < n?

  • 1. Python will return an error, because the true minimum least-squares cost is infinite
  • 2. Python will return an error, even though the true minimum least-squares cost is zero
  • 3. Python will correctly compute the optimal solution, with strictly positive cost
  • 4. Python will correctly compute the optimal solution, with zero cost

SLIDE 21

Alternative loss functions

Why did we pick the squared loss ℓ(h_θ(x), y) = (h_θ(x) − y)²? Why not use an alternative like the absolute loss ℓ(h_θ(x), y) = |h_θ(x) − y|?
We could write this optimization problem as

minimize_θ ∑_{i=1}^m ℓ(h_θ(x^(i)), y^(i)) ≡ minimize_θ ‖Xθ − y‖₁

where ‖a‖₁ = ∑_i |a_i| is called the ℓ₁ norm
No closed-form solution, but a (sub)gradient is given by

∇_θ ‖Xθ − y‖₁ = Xᵀ sign(Xθ − y)
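An illustrative sketch (mine, with arbitrary step size and iteration count): subgradient descent for the absolute loss differs from the squared-error version only in the gradient line:

    import numpy as np

    def least_absolute_gd(X, y, alpha=1e-4, iters=1000):
        theta = np.zeros(X.shape[1])
        for _ in range(iters):
            # subgradient of ||X theta - y||_1
            theta -= alpha * X.T @ np.sign(X @ theta - y)
        return theta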

SLIDE 22

Poll: alternative loss solutions

Solutions for minimizing squared error and absolute error

Poll: which solution is which?

  • 1. Green is squared loss, red is absolute
  • 2. Red is squared loss, green is absolute
  • 3. Those lines look identical to me

SLIDE 23

Outline

What is machine learning?
Linear regression
Linear classification
Nonlinear methods
Overfitting, generalization, and regularization
Evaluating machine learning algorithms

SLIDE 24

Classification tasks

Regression tasks: predicting a real-valued quantity y ∈ ℝ
Classification tasks: predicting a discrete-valued quantity y
Binary classification: y ∈ {−1, +1}
Multiclass classification: y ∈ {1, 2, …, k}

SLIDE 25

Example: breast cancer classification

Well-known classification example: using machine learning to diagnose whether a breast tumor is benign or malignant [Street et al., 1992]
Setting: a doctor extracts a sample of fluid from the tumor, stains the cells, then outlines several of the cells (image processing refines the outline)
The system computes features for each cell such as area, perimeter, concavity, texture (10 total); it then computes the mean/std/max for all features

SLIDE 26

Example: breast cancer classification

Plot of two features: mean area vs. mean concave points, for two classes

SLIDE 27

Linear classification example

Linear classification ≡ “drawing a line separating the classes”

SLIDE 28

Formal setting

Input features: x^(i) ∈ ℝ^n, i = 1, …, m

  • E.g.: x^(i) = [Mean_Area^(i), Mean_Concave_Points^(i), 1]ᵀ

Outputs: y^(i) ∈ {−1, +1}, i = 1, …, m

  • E.g.: y^(i) ∈ {−1 (benign), +1 (malignant)}

Model parameters: θ ∈ ℝ^n
Hypothesis function: h_θ: ℝ^n → ℝ, aims for the same sign as the output (informally, a measure of confidence in our prediction)

  • E.g.: h_θ(x) = θᵀx, ŷ = sign(h_θ(x))

SLIDE 29

Understanding linear classification diagrams

Color shows the regions where h_θ(x) is positive
The separating boundary is given by the equation h_θ(x) = 0

SLIDE 30

Loss functions for classification

How do we define a loss function ℓ: ℝ×{−1, +1} → ℝ+? What about just using squared loss?

[Figure: 1D binary data plotted against x with y ∈ {−1, +1}; panels show the least squares fit, and the least squares fit vs. a perfect classifier]

SLIDE 31

0/1 loss (i.e. error)

The loss we would like to minimize (0/1 loss, or just “error”):

ℓ_{0/1}(h_θ(x), y) = { 0 if sign(h_θ(x)) = y; 1 otherwise } = 𝟏{y ⋅ h_θ(x) ≤ 0}

SLIDE 32

Alternative losses

Unfortunately the 0/1 loss is hard to optimize (it is NP-hard to find the classifier with minimum 0/1 loss; this relates to a property called convexity of the function)
A number of alternative losses for classification are typically used instead

ℓ_{0/1} = 𝟏{y ⋅ h_θ(x) ≤ 0}
ℓ_logistic = log(1 + exp(−y ⋅ h_θ(x)))
ℓ_hinge = max{1 − y ⋅ h_θ(x), 0}
ℓ_exp = exp(−y ⋅ h_θ(x))
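For reference, a small sketch (not from the slides) of these four losses as functions of the margin m = y ⋅ h_θ(x), vectorized over a NumPy array:

    import numpy as np

    def zero_one(m):  return (m <= 0).astype(float)   # 1{y * h(x) <= 0}
    def logistic(m):  return np.log(1 + np.exp(-m))
    def hinge(m):     return np.maximum(1 - m, 0.0)
    def exp_loss(m):  return np.exp(-m)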

SLIDE 33

Machine learning optimization

With this notation, the “canonical” machine learning problem is written in exactly the same way:

minimize_θ ∑_{i=1}^m ℓ(h_θ(x^(i)), y^(i))

Again, unlike least squares, there is typically no closed-form solution, so we rely on gradient descent:

Repeat: θ ← θ − α ∑_{i=1}^m ∇_θ ℓ(h_θ(x^(i)), y^(i))

SLIDE 34

Support vector machine

A (linear) support vector machine (SVM) just solves the canonical machine learning optimization problem using the hinge loss and a linear hypothesis:

minimize_θ ∑_{i=1}^m max{1 − y^(i) ⋅ θᵀx^(i), 0}

The standard SVM actually includes another term called a regularization term, but we’ll talk about this next lecture
Updates using gradient descent:

θ ← θ − α ∑_{i=1}^m −y^(i) x^(i) 𝟏{y^(i) ⋅ θᵀx^(i) ≤ 1}
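A minimal NumPy sketch of this update (my illustration; step size and iteration count are arbitrary, and there is no regularization term, matching the formulation above):

    import numpy as np

    def svm_gd(X, y, alpha=1e-3, iters=1000):
        theta = np.zeros(X.shape[1])
        for _ in range(iters):
            active = (y * (X @ theta) <= 1)      # examples with margin <= 1
            theta += alpha * X.T @ (y * active)  # equals theta - alpha * sum of -y x 1{...}
        return theta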

SLIDE 35

Support vector machine example

Running support vector machine on cancer dataset

θ = [1.456, 1.848, −0.189]ᵀ

SLIDE 36

SVM optimization progress

Optimization objective and error versus gradient descent iteration number

SLIDE 37

Logistic regression

Logistic regression just solves this problem using the logistic loss and a linear hypothesis function:

minimize_θ ∑_{i=1}^m log(1 + exp(−y^(i) ⋅ θᵀx^(i)))

Gradient descent updates (can you derive these?):

θ ← θ − α ∑_{i=1}^m −y^(i) x^(i) ⋅ 1/(1 + exp(y^(i) ⋅ θᵀx^(i)))
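As with the SVM, a hypothetical NumPy sketch of this update (arbitrary step size and iteration count):

    import numpy as np

    def logreg_gd(X, y, alpha=1e-3, iters=1000):
        theta = np.zeros(X.shape[1])
        for _ in range(iters):
            w = 1.0 / (1.0 + np.exp(y * (X @ theta)))  # per-example 1/(1 + exp(y theta^T x))
            theta += alpha * X.T @ (y * w)             # equals theta - alpha * sum of -y x w
        return theta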

SLIDE 38

Logistic regression example

Running logistic regression on cancer data set

SLIDE 39

Logistic regression example

Running logistic regression on cancer data set

SLIDE 40

Multiclass classification

When the output is in {1, …, k} (e.g., digit classification), we can adopt a few different approaches
Approach 1: Build k different binary classifiers h_{θ_i}, each with the goal of predicting class i vs. all others, and output predictions as

ŷ = argmax_i h_{θ_i}(x)

Approach 2: Use a hypothesis function h_θ: ℝ^n → ℝ^k and define an alternative loss function ℓ: ℝ^k × {1, …, k} → ℝ+
E.g., the softmax loss (also called cross entropy loss):

ℓ(h_θ(x), y) = log ∑_{j=1}^k exp(h_θ(x)_j) − h_θ(x)_y
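An illustrative sketch (mine) of the softmax loss for a single example, where scores stands for the vector h_θ(x) and y is the true class index; subtracting the max is a standard numerical-stability trick that does not change the value:

    import numpy as np

    def softmax_loss(scores, y):
        s = scores - scores.max()
        return np.log(np.sum(np.exp(s))) - s[y]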

SLIDE 41

Outline

What is machine learning?
Linear regression
Linear classification
Nonlinear methods
Overfitting, generalization, and regularization
Evaluating machine learning algorithms

SLIDE 42

Peak demand vs. temperature (summer months)

SLIDE 43

Peak demand vs. temperature (all months)

SLIDE 44

Linear regression fit

SLIDE 45

“Non-linear” regression

Thus far, we have illustrated linear regression as “drawing a line through the data”, but this was really a function of our input features
Though it may seem limited, linear regression algorithms are quite powerful when applied to non-linear features of the input data, e.g.

x^(i) = [(High_Temperature^(i))², High_Temperature^(i), 1]ᵀ

Same hypothesis class as before, h_θ(x) = θᵀx, but now the prediction will be a non-linear function of the base input (e.g., a quadratic function)
Same least-squares solution θ = (XᵀX)⁻¹Xᵀy
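A possible NumPy sketch of this idea (my illustration): build polynomial features of a scalar input and reuse the ordinary least squares solution unchanged:

    import numpy as np

    def poly_fit(t, y, degree):
        X = np.vander(t, degree + 1)             # columns [t^degree, ..., t, 1]
        return np.linalg.solve(X.T @ X, X.T @ y)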

SLIDE 46

Polynomial features of degree 3

SLIDE 47

Polynomial features of degree 4

SLIDE 48

Polynomial features of degree 10

SLIDE 49

Polynomial features of degree 50

SLIDE 50

Linear regression with many features

Suppose we have m examples in our data set and n = m features (plus the assumption that the features are linearly independent, though we’ll always assume this)
Then X ∈ ℝ^{m×n} is a square matrix, and the least squares solution is:

θ = (XᵀX)⁻¹Xᵀy = X⁻¹X⁻ᵀXᵀy = X⁻¹y

and we therefore have Xθ = y (i.e., we fit the data exactly)
Note that we can only perform the above operations when X is square, though if we have more features than examples, we can still get an exact fit by simply discarding features

SLIDE 51

Nonlinear classification

Just like linear regression, the nice thing about using nonlinear features for classification is that our algorithms remain exactly the same as before
I.e., for an SVM, we just solve (using gradient descent)

minimize_θ ∑_{i=1}^m max{1 − y^(i) ⋅ θᵀx^(i), 0}

The only difference is that x^(i) now contains non-linear functions of the input data

SLIDE 52

Linear SVM on cancer data set

SLIDE 53

Polynomial features d = 2

SLIDE 54

Polynomial features d = 3

SLIDE 55

Polynomial features d = 10

SLIDE 56

Outline

What is machine learning?
Linear regression
Linear classification
Nonlinear methods
Overfitting, generalization, and regularization
Evaluating machine learning algorithms

SLIDE 57

Generalization error

The problem with the canonical machine learning problem is that we don’t really care about minimizing this objective on the given data set:

minimize_θ ∑_{i=1}^m ℓ(h_θ(x^(i)), y^(i))

What we really care about is how well our function will generalize to new examples that we didn’t use to train the system (but which are drawn from the “same distribution” as the examples we used for training)
The higher degree polynomials exhibited overfitting: they actually have very low loss on the training data, but create functions we don’t expect to generalize well

SLIDE 58

Cartoon version of overfitting


As model becomes more complex, training loss always decreases; generalization loss decreases to a point, then starts to increase

SLIDE 59

Cross-validation

Although it is difficult to quantify the true generalization error (i.e., the error of these algorithms over the complete distribution of possible examples), we can approximate it by holdout cross-validation
The basic idea is to split the data set into a training set and a holdout set
Train the algorithm on the training set and evaluate it on the holdout set

[Diagram: all data split into a training set (e.g., 70%) and a holdout / validation set (e.g., 30%)]
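A minimal sketch of such a split (my illustration; the 30% fraction and the random seed are arbitrary choices):

    import numpy as np

    def holdout_split(X, y, frac=0.3, seed=0):
        idx = np.random.default_rng(seed).permutation(len(y))
        n_hold = int(frac * len(y))
        hold, train = idx[:n_hold], idx[n_hold:]
        return X[train], y[train], X[hold], y[hold]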

SLIDE 60

Parameters and hyperparameters

We refer to the θ variables as the parameters of the machine learning algorithm
But there are other quantities that also affect the classifier: the degree of the polynomial, the amount of regularization, etc.; these are collectively referred to as the hyperparameters of the algorithm
Basic idea of cross-validation: use the training set to determine the parameters, and the holdout set to determine the hyperparameters

SLIDE 61

Illustrating cross-validation

SLIDE 62

Training and cross-validation loss by degree

SLIDE 63

Training and cross-validation loss by degree

SLIDE 64

Training and cross-validation loss by degree

SLIDE 65

K-fold cross-validation

A more involved (but actually slightly more common) version of cross-validation
Split the data set into k disjoint subsets (folds); train on k − 1 folds and evaluate on the remaining fold; repeat k times, holding out each fold once
Report the average error over all held-out folds

[Diagram: all data split into folds 1, 2, …, k]
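An illustrative k-fold loop (my sketch; fit and loss are hypothetical stand-ins for whatever training and evaluation routines are being cross-validated):

    import numpy as np

    def k_fold_cv(X, y, fit, loss, k=5, seed=0):
        folds = np.array_split(np.random.default_rng(seed).permutation(len(y)), k)
        errs = []
        for i in range(k):
            hold = folds[i]
            train = np.concatenate([f for j, f in enumerate(folds) if j != i])
            theta = fit(X[train], y[train])
            errs.append(loss(theta, X[hold], y[hold]))
        return np.mean(errs)  # average error over the held-out folds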

SLIDE 66

Variants

Leave-one-out cross-validation: the limit of k-fold cross-validation, where each fold is a single example (so we train on all other examples and test on that one example) [Somewhat surprisingly, for least squares this can be computed more efficiently than k-fold cross-validation, with the same complexity as solving for the optimal θ using the matrix equation]
Stratified cross-validation: keep an approximately equal percentage of positive/negative examples (or any other feature) in each fold
Warning: k-fold cross-validation is not always better (e.g., in time series prediction, you would want the holdout set to occur entirely after the training set)

SLIDE 67

Regularization

We have seen that the degree of the polynomial acts as a natural measure of the “complexity” of the model: higher degree polynomials are more complex (taken to the limit, we can fit any finite data set exactly)
But fitting these models also requires extremely large coefficients on these polynomials
For the 50 degree polynomial, the first few coefficients are

θ = [−3.88×10⁶, 7.60×10⁶, 3.94×10⁶, −2.60×10⁷, …]

This suggests an alternative way to control model complexity: keep the weights small (regularization)

SLIDE 68

Regularized loss minimization

This leads us back to the regularized loss minimization problem we saw before, but with a bit more context now:

minimize_θ ∑_{i=1}^m ℓ(h_θ(x^(i)), y^(i)) + (λ/2) ‖θ‖₂²

This formulation trades off loss on the training set against a penalty on high values of the parameters
By varying λ from zero (no regularization) to infinity (infinite regularization, meaning the parameters will all be zero), we can sweep out different sets of model complexity

SLIDE 69

Regularized least squares

For least squares, there is a simple solution to the regularized loss minimization problem:

minimize_θ (1/2) ‖Xθ − y‖₂² + (λ/2) ‖θ‖₂²

Taking gradients by the same rules as before gives:

∇_θ [(1/2) ‖Xθ − y‖₂² + (λ/2) ‖θ‖₂²] = Xᵀ(Xθ − y) + λθ

Setting the gradient equal to zero leads to the solution:

XᵀXθ + λθ = Xᵀy ⟹ θ = (XᵀX + λI)⁻¹Xᵀy

This looks just like the normal equations but with an additional λI term
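As a code sketch (my illustration), the regularized solution is a one-line change to the normal-equations solver shown earlier:

    import numpy as np

    def ridge(X, y, lam):
        # theta = (X^T X + lambda I)^{-1} X^T y
        return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)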

SLIDE 70

50 degree polynomial fit

SLIDE 71

50 degree polynomial fit – λ = 1

SLIDE 72

Training/cross-validation loss by regularization

SLIDE 73

Training/cross-validation loss by regularization

SLIDE 74

Poll: how do you fix this ML model?

Suppose you run a logistic regression with linear features on some data set, and plot the training/testing performance versus the number of samples, which looks like the plot on the right. Which of the following may help?

  • 1. Increase regularization parameter
  • 2. Decrease regularization parameter
  • 3. Add non-linear features
  • 4. Remove features
  • 5. Run a neural network

[Figure: training and validation loss vs. number of samples, with the desired performance level marked]

SLIDE 75

Poll: how do you fix this ML model?

Suppose you run a logistic regression with linear features on some data set, and plot the training/testing performance versus the number of samples, which looks like the plot on the right. Which of the following may help?

  • 1. Add more data
  • 2. Decrease regularization parameter
  • 3. Add non-linear features
  • 4. Remove features
  • 5. Run a support vector machine

[Figure: training and validation loss vs. number of samples, with the desired performance level marked]

SLIDE 76

Outline

What is machine learning? Linear regression Linear classification Nonlinear methods Overfitting, generalization, and regularization Evaluating machine learning algorithms

SLIDE 77

A common strategy for evaluating algorithms

  • 1. Divide the data set into training and holdout sets
  • 2. Train different algorithms (or a single algorithm with different hyperparameter settings) using the training set
  • 3. Evaluate performance of all the algorithms on the holdout set, and report the best performance (e.g., lowest holdout error)

What is wrong with this?

SLIDE 78

Issues with the previous evaluation

Even though we used a training/holdout split to fit the parameters, we are still effectively fitting the hyperparameters to the holdout set
Imagine an algorithm that ignores the training set and makes random predictions; given a large enough hyperparameter search (e.g., over the random seed), we could get perfect holdout performance

SLIDE 79

What to do instead

  • 1. Divide the data into a training set, holdout set, and test set
  • 2. Train the algorithm on the training set (i.e., to learn parameters), and use the holdout set to select hyperparameters
  • 3. (Optional) retrain the system on training + holdout
  • 4. Evaluate performance on the test set

[Diagram: all data split into a training set (e.g., 50%), a holdout / validation set (e.g., 20%), and a test set (e.g., 30%)]

SLIDE 80

In practice…

“Leakage” of test set performance into algorithm design decisions is almost always a reality when dealing with any fixed data set (in theory, as soon as you look at test set performance once, you have corrupted that data as a valid test set)
This is true in research as well as in data science practice
The best solutions: evaluate your system “in the wild” (where it will see truly novel examples) as often as possible; recollect data if you suspect overfitting to the test set; look at test set performance sparingly
An interesting and very active area of research: adaptive data analysis (using differential privacy to theoretically guarantee no overfitting)
