Logistic Regression

10-601 Introduction to Machine Learning
Machine Learning Department
School of Computer Science
Carnegie Mellon University

Matt Gormley
Lecture 9
Feb. 13, 2019
Q&A

Q: In recitation, we only covered the Perceptron mistake bound for linearly separable data. Isn't that an unrealistic setting?

A: Even if the data is not linearly separable to begin with, we can often add features to make it so.
[Table: a small dataset with features x1, x2 and labels y (+ / -) that is not linearly separable.]

Exercise: Add another feature to transform this nonlinearly separable data into linearly separable data.
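As an illustration of the idea (not from the slides), here is a minimal Python sketch; the XOR-style values and labels below are my own assumed example of nonlinearly separable data:

import numpy as np

# Assumed XOR-like dataset: not linearly separable using only (x1, x2).
X = np.array([[+1.0, +1.0], [+1.0, -1.0], [-1.0, +1.0], [-1.0, -1.0]])
y = np.array([+1, -1, -1, +1])

# Add a third feature x3 = x1 * x2. In (x1, x2, x3) the data becomes linearly
# separable, e.g. by the hyperplane with weights (0, 0, 1).
X_aug = np.hstack([X, (X[:, 0] * X[:, 1]).reshape(-1, 1)])
w = np.array([0.0, 0.0, 1.0])
print(np.all(np.sign(X_aug @ w) == y))  # True for this assumed labeling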
Reminders:
– Out: Wed, Feb 6; Due: Fri, Feb 15 at 11:59pm
– Out: Fri, Feb 15; Due: Fri, Mar 1 at 11:59pm
– Thu, Feb 21, 6:30pm – 8:00pm
– http://p9.mlcourse.org
Function Approximation

Previously, we assumed that our output was generated by a deterministic target function c*(x). Our goal was to learn a hypothesis h(x) that best approximates c*(x).

Probabilistic Learning

Today, we assume that our output is sampled from a conditional probability distribution p*(y|x). Our goal is to learn a probability distribution p(y|x) that best approximates p*(y|x).
Deterministic vs. Probabilistic

Classification (binary output):
  Deterministic: Is this a picture of a wheat kernel?
  Probabilistic: Is this plant drought resistant?

Regression (continuous output):
  Deterministic: How many wheat kernels are in this picture?
  Probabilistic: What will the yield be?
– Bayes Optimal Classifier
– Reducible / irreducible error
– Ex: Bayes Optimal Classifier for 0/1 Loss
– Principle of Maximum Likelihood Estimation (MLE)
– Strawmen:
  – Bernoulli conditioned on Gaussian
  – Gaussians conditioned on Bernoulli
– Dataset: 1.2 million labeled images, 1000 classes
– Task: Given a new image, label it with the correct class
– Multiclass classification problem
CNN for Image Classification
(Krizhevsky, Sutskever & Hinton, 2012)
17.5% error on the ImageNet LSVRC-2010 contest

[Architecture diagram: input image (pixels) → convolutional layers (w/ max-pooling) → … → 1000-way softmax]

This "softmax" layer is Logistic Regression!
The rest is just some fancy feature extraction (discussed later in the course).
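For reference (an addition, not text from the slide): a K-way softmax output layer over features \mathbf{x} with per-class parameter vectors \boldsymbol{\theta}_k is multinomial logistic regression,

p_{\boldsymbol{\theta}}(y = k \mid \mathbf{x}) = \frac{\exp(\boldsymbol{\theta}_k^T \mathbf{x})}{\sum_{j=1}^{K} \exp(\boldsymbol{\theta}_j^T \mathbf{x})},

with K = 1000 classes for ImageNet.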
We are back to classification: despite its name, logistic regression is a classification method.
Data: Inputs are continuous vectors of length M. Outputs are discrete.
Key idea: Try to learn this hyperplane directly
Directly modeling the hyperplane would use a decision function:
  h(\mathbf{x}) = \mathrm{sign}(\boldsymbol{\theta}^T \mathbf{x}), \quad \text{for } y \in \{-1, +1\}
Looking ahead: commonly used Linear Classifiers
– Perceptron
– Logistic Regression
– Naïve Bayes (under certain conditions)
– Support Vector Machines
Notation Trick: fold the bias b and the weights w into a single vector θ by prepending a constant to x and increasing dimensionality by one!
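Written out (an added illustration of the trick, in my notation):

\mathbf{x}' = [1, x_1, \dots, x_M]^T, \qquad \boldsymbol{\theta} = [b, w_1, \dots, w_M]^T \qquad \Rightarrow \qquad \boldsymbol{\theta}^T \mathbf{x}' = b + \mathbf{w}^T \mathbf{x}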
1. Define a linear classifier (logistic regression)
2. Define an objective function (likelihood)
3. Optimize it with gradient descent to learn parameters
4. Predict the class with highest probability under the model
This decision function isn't differentiable:
  h(\mathbf{x}) = \mathrm{sign}(\boldsymbol{\theta}^T \mathbf{x})
[Plot: the sign function]

Use a differentiable function instead:
  \mathrm{logistic}(u) \equiv \frac{1}{1 + e^{-u}}
  p_{\boldsymbol{\theta}}(y = 1 \mid \mathbf{x}) = \frac{1}{1 + \exp(-\boldsymbol{\theta}^T \mathbf{x})}
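A minimal NumPy sketch of this model (my own illustration; the function and variable names are not from the slides, and the bias is assumed to be folded into theta):

import numpy as np

def logistic(u):
    # logistic(u) = 1 / (1 + e^{-u})
    return 1.0 / (1.0 + np.exp(-u))

def predict_proba(theta, x):
    # p_theta(y = 1 | x), with the bias already folded into theta
    return logistic(np.dot(theta, x))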
Data: Inputs are continuous vectors of length M. Outputs are discrete.

Model: Logistic function applied to dot product of parameters with input vector.
  p_{\boldsymbol{\theta}}(y = 1 \mid \mathbf{x}) = \frac{1}{1 + \exp(-\boldsymbol{\theta}^T \mathbf{x})}

Learning: finds the parameters that minimize some objective function.
  \boldsymbol{\theta}^* = \operatorname*{argmin}_{\boldsymbol{\theta}} J(\boldsymbol{\theta})

Prediction: Output is the most probable class.
  \hat{y} = \operatorname*{argmax}_{y \in \{0,1\}} p_{\boldsymbol{\theta}}(y \mid \mathbf{x})
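An added note on the decision boundary (standard algebra, not spelled out on this slide): predicting the most probable class amounts to thresholding the logistic at 1/2, which is a linear rule in \mathbf{x}:

p_{\boldsymbol{\theta}}(y = 1 \mid \mathbf{x}) \ge \tfrac{1}{2} \iff \frac{1}{1 + \exp(-\boldsymbol{\theta}^T \mathbf{x})} \ge \tfrac{1}{2} \iff \boldsymbol{\theta}^T \mathbf{x} \ge 0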
– Bernoulli interpretation
– Logistic Regression Model
– Decision boundary
– Partial derivative for Logistic Regression
– Gradient for Logistic Regression
Learning: finds the parameters that minimize some objective function.
  \boldsymbol{\theta}^* = \operatorname*{argmin}_{\boldsymbol{\theta}} J(\boldsymbol{\theta})

We minimize the negative log conditional likelihood (MCLE):
  J(\boldsymbol{\theta}) = -\log \prod_{i=1}^{N} p_{\boldsymbol{\theta}}(y^{(i)} \mid \mathbf{x}^{(i)})

Why?
1. We can't maximize likelihood (as in Naïve Bayes) because we don't have a joint model p(x, y).
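Expanded out for the Bernoulli model (a standard rewriting, added here for clarity):

J(\boldsymbol{\theta}) = -\sum_{i=1}^{N} \left[ y^{(i)} \log p_{\boldsymbol{\theta}}(y = 1 \mid \mathbf{x}^{(i)}) + (1 - y^{(i)}) \log\left(1 - p_{\boldsymbol{\theta}}(y = 1 \mid \mathbf{x}^{(i)})\right) \right]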
Learning: Four approaches to solving \boldsymbol{\theta}^* = \operatorname*{argmin}_{\boldsymbol{\theta}} J(\boldsymbol{\theta})

Approach 1: Gradient Descent (take larger – more certain – steps opposite the gradient)
Approach 2: Stochastic Gradient Descent (SGD) (take many small steps opposite the gradient)
Approach 3: Newton's Method (use second derivatives to better follow curvature)
Approach 4: Closed Form??? (set derivatives equal to zero and solve for parameters)
Logistic Regression does not have a closed form solution for MLE parameters.
Question: Which of the following is a correct description of SGD for Logistic Regression?

Answer: At each step (i.e. iteration) of SGD for Logistic Regression we…
A. (1) compute the gradient of the log-likelihood for all examples, (2) update all the parameters using the gradient
B. (1) ask Matt for a description of SGD for Logistic Regression, (2) write it down, (3) report that answer
C. (1) compute the gradient of the log-likelihood for all examples, (2) randomly pick an example, (3) update only the parameters for that example
D. (1) randomly pick a parameter, (2) compute the partial derivative of the log-likelihood with respect to that parameter, (3) update that parameter for all examples
E. (1) randomly pick an example, (2) compute the gradient of the log-likelihood for that example, (3) update all the parameters using that gradient
F. (1) randomly pick a parameter and an example, (2) compute the gradient of the log-likelihood for that example with respect to that parameter, (3) update that parameter using that gradient
procedure GD(D, θ^(0)):
  θ ← θ^(0)
  while not converged:
    θ ← θ − γ ∇_θ J(θ)
  return θ
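A runnable Python sketch of this procedure applied to logistic regression (my own illustration, not code from the course; the step size gamma, iteration budget, and stopping rule are placeholder choices):

import numpy as np

def logistic(u):
    return 1.0 / (1.0 + np.exp(-u))

def gradient(theta, X, y):
    # Gradient of the negative log conditional likelihood, summed over all N examples:
    # sum_i (p_theta(y=1 | x^(i)) - y^(i)) * x^(i)
    return X.T @ (logistic(X @ theta) - y)

def gd(X, y, theta0, gamma=0.1, num_iters=1000):
    theta = theta0.copy()
    for _ in range(num_iters):  # fixed budget in place of "while not converged"
        theta = theta - gamma * gradient(theta, X, y)
    return theta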
In order to apply GD to Logistic Regression all we need is the gradient of the objective function (i.e. vector of partial derivatives).
\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) = \begin{bmatrix} \frac{d}{d\theta_1} J(\boldsymbol{\theta}) \\ \frac{d}{d\theta_2} J(\boldsymbol{\theta}) \\ \vdots \\ \frac{d}{d\theta_M} J(\boldsymbol{\theta}) \end{bmatrix}
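The per-example partial derivative behind this gradient (a standard result, added for completeness; it follows by differentiating J^{(i)}(\boldsymbol{\theta}) = -\log p_{\boldsymbol{\theta}}(y^{(i)} \mid \mathbf{x}^{(i)})):

\frac{\partial J^{(i)}(\boldsymbol{\theta})}{\partial \theta_j} = \left( p_{\boldsymbol{\theta}}(y = 1 \mid \mathbf{x}^{(i)}) - y^{(i)} \right) x_j^{(i)}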
We can also apply SGD to solve the MCLE problem for Logistic Regression. We need a per-example objective:
  J(\boldsymbol{\theta}) = \sum_{i=1}^{N} J^{(i)}(\boldsymbol{\theta}), \quad \text{where } J^{(i)}(\boldsymbol{\theta}) = -\log p_{\boldsymbol{\theta}}(y^{(i)} \mid \mathbf{x}^{(i)})
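A minimal SGD sketch using this per-example objective (again my own illustration; the learning rate and number of epochs are placeholders, and labels are assumed to be in {0, 1}):

import numpy as np

def sgd(X, y, theta0, gamma=0.1, num_epochs=10):
    theta = theta0.copy()
    N = X.shape[0]
    rng = np.random.default_rng(0)
    for _ in range(num_epochs):
        for i in rng.permutation(N):  # randomly pick an example
            p = 1.0 / (1.0 + np.exp(-X[i] @ theta))
            theta = theta - gamma * (p - y[i]) * X[i]  # step on the gradient of J^(i)(theta)
    return theta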
True or False: Just like Perceptron, one step (i.e. iteration) of SGD for Logistic Regression will result in a change to the parameters only if the current example is incorrectly classified.
You should be able to…
– apply the principle of maximum likelihood estimation (MLE) to learn the parameters of a probabilistic model
– given a discriminative probabilistic model, derive the conditional log-likelihood, its gradient, and the corresponding Bayes Classifier
– explain the practical reasons why we work with the log of the likelihood
– implement logistic regression for binary classification
– show that the decision boundary of logistic regression is linear
– for linear regression, show that the parameters which minimize squared error are equivalent to those that maximize conditional likelihood