Binary Logistic Regression + Multinomial Logistic Regression (Matt Gormley)



slide-1
SLIDE 1

Binary Logistic Regression + Multinomial Logistic Regression

1

10-601 Introduction to Machine Learning

Matt Gormley
Lecture 10, Feb. 17, 2020

Machine Learning Department, School of Computer Science, Carnegie Mellon University

slide-2
SLIDE 2

Reminders

  • Midterm Exam 1

– Tue, Feb. 18, 7:00pm – 9:00pm

  • Homework 4: Logistic Regression

– Out: Wed, Feb. 19 – Due: Fri, Feb. 28 at 11:59pm

  • Today’s In-Class Poll

– http://p10.mlcourse.org

  • Reading on Probabilistic Learning is reused later in the course for MLE/MAP

3

slide-3
SLIDE 3

MLE

5

Suppose we have data D = {x^(i)}_{i=1}^N.

  • Principle of Maximum Likelihood Estimation: Choose the parameters that maximize the likelihood of the data.

θ_MLE = argmax_θ ∏_{i=1}^N p(x^(i) | θ)

Maximum Likelihood Estimate (MLE)
[Figure: the likelihood surface L(θ_1, θ_2), with the maximizer θ_MLE marked.]
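The principle above can be made concrete with a toy example. The following is a minimal sketch (not from the slides; the coin-flip data and variable names are my own) that evaluates the likelihood of a Bernoulli model over a grid of parameter values and picks the maximizer, which lands near the closed-form answer, the sample mean.

    import numpy as np

    # Toy illustration of MLE: data D = {x^(i)}_{i=1}^N of coin flips, x^(i) in {0, 1}.
    # The likelihood is L(theta) = prod_i theta^{x^(i)} (1 - theta)^{1 - x^(i)}.
    x = np.array([1, 0, 1, 1, 0, 1, 1, 0])          # hypothetical observed flips

    thetas = np.linspace(0.01, 0.99, 99)            # candidate parameter values
    log_lik = np.array([np.sum(x * np.log(t) + (1 - x) * np.log(1 - t)) for t in thetas])

    theta_mle = thetas[np.argmax(log_lik)]          # grid maximizer of the (log-)likelihood
    print(theta_mle, x.mean())                      # close to the sample mean, the exact MLE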

slide-4
SLIDE 4

MLE

What does maximizing likelihood accomplish?

  • There is only a finite amount of probability mass (i.e. the sum-to-one constraint)

  • MLE tries to allocate as much probability mass as possible to the things we have observed…

…at the expense of the things we have not observed

6

slide-5
SLIDE 5

MOTIVATION: LOGISTIC REGRESSION

7

slide-6
SLIDE 6

Example: Image Classification

  • ImageNet LSVRC-2010 contest:

– Dataset: 1.2 million labeled images, 1000 classes
– Task: Given a new image, label it with the correct class
– Multiclass classification problem

  • Examples from http://image-net.org/

10

slide-7
SLIDE 7

11

slide-8
SLIDE 8

12

slide-9
SLIDE 9

13

slide-10
SLIDE 10

Example: Image Classification

14

CNN for Image Classification (Krizhevsky, Sutskever & Hinton, 2011): 17.5% error on the ImageNet LSVRC-2010 contest

  • Input image (pixels)
  • Five convolutional layers (w/ max-pooling)
  • Three fully connected layers
  • 1000-way softmax

slide-11
SLIDE 11

Example: Image Classification

15

CNN for Image Classification (Krizhevsky, Sutskever & Hinton, 2011): 17.5% error on the ImageNet LSVRC-2010 contest

  • Input image (pixels)
  • Five convolutional layers (w/ max-pooling)
  • Three fully connected layers
  • 1000-way softmax

This “softmax” layer is Logistic Regression! The rest is just some fancy feature extraction (discussed later in the course).

slide-12
SLIDE 12

LOGISTIC REGRESSION

16

slide-13
SLIDE 13

Logistic Regression

17

We are back to classification, despite the name logistic regression.

Data: Inputs are continuous vectors of length M. Outputs are discrete.

slide-14
SLIDE 14

Key idea: Try to learn this hyperplane directly

Linear Models for Classification

Directly modeling the hyperplane would use a decision function

h(x) = sign(θ^T x)

for y ∈ {−1, +1} (see the sketch below).

Looking ahead:

  • We’ll see a number of commonly used Linear Classifiers
  • These include:
– Perceptron
– Logistic Regression
– Naïve Bayes (under certain conditions)
– Support Vector Machines

Recall…
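As a concrete illustration of the decision function above, here is a minimal sketch (toy weights and input of my own choosing, with the bias already folded into θ as described on the next slide):

    import numpy as np

    # Linear classifier h(x) = sign(theta^T x) with labels y in {-1, +1}.
    theta = np.array([0.5, -1.0, 2.0])   # hypothetical weight vector (bias folded in as theta[0])
    x = np.array([1.0, 3.0, 0.25])       # hypothetical input, with a leading constant 1 for the bias

    h = np.sign(theta @ x)               # which side of the hyperplane x falls on
    print(h)                             # -1.0 for these values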

slide-15
SLIDE 15

Background: Hyperplanes

Hyperplane (Definition 1): H = {x : w^T x = b}

Hyperplane (Definition 2) and half-spaces: [equations given on the slide]

Notation Trick: fold the bias b and the weights w into a single vector θ by prepending a constant to x and increasing dimensionality by one!

Recall…
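A small sketch of the notation trick (toy values of my own choosing; one common convention writes the affine score w^T x + b as θ^T x' with θ = [b, w] and x' = [1, x]):

    import numpy as np

    # Fold the bias b and the weights w into a single vector theta
    # by prepending a constant 1 to x.
    w = np.array([2.0, -1.0])
    b = 0.5
    x = np.array([1.5, 3.0])

    theta = np.concatenate(([b], w))      # theta = [b, w_1, ..., w_M]
    x_aug = np.concatenate(([1.0], x))    # x'    = [1, x_1, ..., x_M]

    print(w @ x + b, theta @ x_aug)       # identical values: w^T x + b = theta^T x'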

slide-16
SLIDE 16

Using gradient ascent for linear classifiers

Key idea behind today’s lecture:

1. Define a linear classifier (logistic regression)
2. Define an objective function (likelihood)
3. Optimize it with gradient descent to learn the parameters
4. Predict the class with the highest probability under the model

20

slide-17
SLIDE 17

Using gradient ascent for linear classifiers

21

This decision function isn’t differentiable:

h(x) = sign(θ^T x)

Use a differentiable function instead:

logistic(u) ≡ 1 / (1 + e^(−u))

pθ(y = 1 | x) = 1 / (1 + exp(−θ^T x))

slide-18
SLIDE 18

Using gradient ascent for linear classifiers

22

This decision function isn’t differentiable:

h(x) = sign(θ^T x)

Use a differentiable function instead:

logistic(u) ≡ 1 / (1 + e^(−u))

pθ(y = 1 | x) = 1 / (1 + exp(−θ^T x))
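A minimal sketch of the logistic function and the resulting class probability (the parameter and input values are hypothetical, chosen only for illustration):

    import numpy as np

    def logistic(u):
        # logistic(u) = 1 / (1 + e^{-u}), a differentiable replacement for sign(u)
        return 1.0 / (1.0 + np.exp(-u))

    theta = np.array([0.5, -1.0, 2.0])   # hypothetical parameters
    x = np.array([1.0, 0.5, 1.5])        # hypothetical input (leading 1 for the bias)

    p1 = logistic(theta @ x)             # p_theta(y = 1 | x); p_theta(y = 0 | x) = 1 - p1
    print(p1)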

slide-19
SLIDE 19

Logistic Regression

23

Learning: find the parameters that minimize some objective function:

θ* = argmin_θ J(θ)

Prediction: output the most probable class:

ŷ = argmax_{y ∈ {0,1}} pθ(y | x)

Model: the logistic function applied to the dot product of the parameters with the input vector:

pθ(y = 1 | x) = 1 / (1 + exp(−θ^T x))

Data: Inputs are continuous vectors of length M. Outputs are discrete.
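A minimal sketch of prediction under this model (hypothetical parameters and input; predicting the most probable class is equivalent to thresholding pθ(y = 1 | x) at 0.5, i.e. checking the sign of θ^T x):

    import numpy as np

    def logistic(u):
        return 1.0 / (1.0 + np.exp(-u))

    theta = np.array([-0.25, 1.5, -0.5])   # hypothetical learned parameters
    x = np.array([1.0, 0.2, 0.8])          # hypothetical input (leading 1 for the bias)

    p1 = logistic(theta @ x)               # p_theta(y = 1 | x)
    y_hat = 1 if p1 >= 0.5 else 0          # argmax over y in {0, 1}
    print(p1, y_hat)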

slide-20
SLIDE 20

Logistic Regression

Whiteboard

– Bernoulli interpretation
– Logistic Regression Model
– Decision boundary

24

slide-21
SLIDE 21

Learning for Logistic Regression

Whiteboard

– Partial derivative for Logistic Regression
– Gradient for Logistic Regression

25

slide-22
SLIDE 22

LOGISTIC REGRESSION ON GAUSSIAN DATA

26

slide-23
SLIDE 23

Logistic Regression

27

slide-24
SLIDE 24

Logistic Regression

28

slide-25
SLIDE 25

Logistic Regression

29

slide-26
SLIDE 26

LEARNING LOGISTIC REGRESSION

30

slide-27
SLIDE 27

Maximum Conditional Likelihood Estimation

31

Learning: find the parameters that minimize some objective function:

θ* = argmin_θ J(θ)

We minimize the negative log conditional likelihood:

J(θ) = −∑_{i=1}^N log pθ(y^(i) | x^(i))

Why?
1. We can’t maximize the likelihood (as in Naïve Bayes) because we don’t have a joint model p(x, y)
2. It worked well for Linear Regression (least squares is MCLE)
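A minimal sketch of evaluating this objective on a tiny, made-up dataset (the design matrix, labels, and variable names are my own; at θ = 0 every example has probability 0.5, so J(θ) = 3 log 2):

    import numpy as np

    def logistic(u):
        return 1.0 / (1.0 + np.exp(-u))

    X = np.array([[1.0, 0.5, 1.0],     # toy design matrix, one row per example,
                  [1.0, -1.0, 0.5],    # with a leading 1 for the bias
                  [1.0, 2.0, -0.5]])
    y = np.array([1.0, 0.0, 1.0])
    theta = np.zeros(3)

    p1 = logistic(X @ theta)                                 # p_theta(y = 1 | x^(i)) for each i
    J = -np.sum(y * np.log(p1) + (1 - y) * np.log(1 - p1))   # negative log conditional likelihood
    print(J)                                                 # 3 * log(2) at theta = 0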

slide-28
SLIDE 28

Maximum Conditional Likelihood Estimation

32

Learning: Four approaches to solving θ* = argmin_θ J(θ)

Approach 1: Gradient Descent (take larger, more certain, steps opposite the gradient)
Approach 2: Stochastic Gradient Descent (SGD) (take many small steps opposite the gradient)
Approach 3: Newton’s Method (use second derivatives to better follow curvature)
Approach 4: Closed Form??? (set derivatives equal to zero and solve for parameters)

slide-29
SLIDE 29

Maximum Conditional Likelihood Estimation

33

Learning: Four approaches to solving θ* = argmin_θ J(θ)

Approach 1: Gradient Descent (take larger, more certain, steps opposite the gradient)
Approach 2: Stochastic Gradient Descent (SGD) (take many small steps opposite the gradient)
Approach 3: Newton’s Method (use second derivatives to better follow curvature)
Approach 4: Closed Form??? (set derivatives equal to zero and solve for parameters)

Logistic Regression does not have a closed form solution for MLE parameters.

slide-30
SLIDE 30

SGD for Logistic Regression

34

Question: Which of the following is a correct description of SGD for Logistic Regression?

Answer: At each step (i.e. iteration) of SGD for Logistic Regression we…
A. (1) compute the gradient of the log-likelihood for all examples, (2) update all the parameters using the gradient
B. (1) ask Matt for a description of SGD for Logistic Regression, (2) write it down, (3) report that answer
C. (1) compute the gradient of the log-likelihood for all examples, (2) randomly pick an example, (3) update only the parameters for that example
D. (1) randomly pick a parameter, (2) compute the partial derivative of the log-likelihood with respect to that parameter, (3) update that parameter for all examples
E. (1) randomly pick an example, (2) compute the gradient of the log-likelihood for that example, (3) update all the parameters using that gradient
F. (1) randomly pick a parameter and an example, (2) compute the gradient of the log-likelihood for that example with respect to that parameter, (3) update that parameter using that gradient

slide-31
SLIDE 31

Algorithm 1: Gradient Descent

1: procedure GD(D, θ^(0))
2:   θ ← θ^(0)
3:   while not converged do
4:     θ ← θ − λ ∇θ J(θ)
5:   return θ

Gradient Descent

35

In order to apply GD to Logistic Regression all we need is the gradient of the objective function (i.e. vector of partial derivatives).

∇θ J(θ) = [ dJ(θ)/dθ_1, dJ(θ)/dθ_2, …, dJ(θ)/dθ_N ]^T

Recall…
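A minimal sketch of gradient descent applied to this objective on toy data (my own values; it uses the standard result that the gradient of the negative log conditional likelihood is ∑_i (pθ(y = 1 | x^(i)) − y^(i)) x^(i), i.e. X^T (p − y) in matrix form, and a fixed iteration count in place of a convergence test):

    import numpy as np

    def logistic(u):
        return 1.0 / (1.0 + np.exp(-u))

    X = np.array([[1.0, 0.5], [1.0, -1.0], [1.0, 2.0]])   # toy inputs (leading 1 for the bias)
    y = np.array([1.0, 0.0, 1.0])
    theta = np.zeros(2)
    lam = 0.1                                             # learning rate (lambda on the slides)

    for _ in range(100):                                  # "while not converged", fixed here
        grad = X.T @ (logistic(X @ theta) - y)            # gradient of J(theta)
        theta = theta - lam * grad                        # step opposite the gradient
    print(theta)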

slide-32
SLIDE 32

Stochastic Gradient Descent (SGD)

36

Recall…

We can also apply SGD to solve the MCLE problem for Logistic Regression. We need a per-example objective:

Let J(θ) = ∑_{i=1}^N J^(i)(θ), where J^(i)(θ) = −log pθ(y^(i) | x^(i)).
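A minimal sketch of a single SGD step for this per-example objective (toy data and names of my own; it follows the pick-an-example, compute-its-gradient, update-all-parameters pattern):

    import numpy as np

    def logistic(u):
        return 1.0 / (1.0 + np.exp(-u))

    rng = np.random.default_rng(0)
    X = np.array([[1.0, 0.5], [1.0, -1.0], [1.0, 2.0]])   # toy inputs (leading 1 for the bias)
    y = np.array([1.0, 0.0, 1.0])
    theta = np.zeros(2)
    lam = 0.1                                             # learning rate

    i = rng.integers(len(y))                              # (1) randomly pick an example
    grad_i = (logistic(theta @ X[i]) - y[i]) * X[i]       # (2) gradient of J^(i)(theta)
    theta = theta - lam * grad_i                          # (3) update all parameters with it
    print(theta)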

slide-33
SLIDE 33

Logistic Regression vs. Perceptron

37

Question:

True or False: Just like Perceptron, one step (i.e. iteration) of SGD for Logistic Regression will result in a change to the parameters only if the current example is incorrectly classified.

Answer:

slide-34
SLIDE 34

Matching Game

Goal: Match the Algorithm to its Update Rule

38

  • 1. SGD for Logistic Regression, where hθ(x) = p(y|x)
  • 2. Least Mean Squares, where hθ(x) = θ^T x
  • 3. Perceptron, where hθ(x) = sign(θ^T x)

4. θ_k ← θ_k + 1 / (1 + exp λ(hθ(x^(i)) − y^(i)))
5. θ_k ← θ_k + (hθ(x^(i)) − y^(i))
6. θ_k ← θ_k + λ(hθ(x^(i)) − y^(i)) x_k^(i)

  • A. 1=5, 2=4, 3=6
  • B. 1=5, 2=6, 3=4
  • C. 1=6, 2=4, 3=4
  • D. 1=5, 2=6, 3=6
  • E. 1=6, 2=6, 3=6
  • F. 1=6, 2=5, 3=5
  • G. 1=5, 2=5, 3=5
  • H. 1=4, 2=5, 3=6
slide-35
SLIDE 35

OPTIMIZATION METHOD #4: MINI-BATCH SGD

39

slide-36
SLIDE 36

Mini-Batch SGD

  • Gradient Descent:
Compute the true gradient exactly from all N examples

  • Stochastic Gradient Descent (SGD):
Approximate the true gradient by the gradient of one randomly chosen example

  • Mini-Batch SGD:
Approximate the true gradient by the average gradient of K randomly chosen examples

40
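A minimal sketch of one mini-batch step (randomly generated toy data, K = 10; the true gradient is approximated by the average gradient over the K sampled examples):

    import numpy as np

    def logistic(u):
        return 1.0 / (1.0 + np.exp(-u))

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))                  # hypothetical dataset of N = 100 examples
    y = rng.integers(0, 2, size=100).astype(float)
    theta = np.zeros(3)
    lam, K = 0.1, 10                               # learning rate and mini-batch size

    batch = rng.choice(len(y), size=K, replace=False)                # K randomly chosen examples
    grad = X[batch].T @ (logistic(X[batch] @ theta) - y[batch]) / K  # average gradient over the batch
    theta = theta - lam * grad
    print(theta)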

slide-37
SLIDE 37

Mini-Batch SGD

41

Three variants of first-order optimization:

slide-38
SLIDE 38

Summary

  • 1. Discriminative classifiers directly model the conditional, p(y|x)
  • 2. Logistic regression is a simple linear classifier that retains a probabilistic semantics
  • 3. Parameters in LR are learned by iterative optimization (e.g. SGD)

50

slide-39
SLIDE 39

Logistic Regression Objectives

You should be able to…

  • Apply the principle of maximum likelihood estimation (MLE) to learn the parameters of a probabilistic model
  • Given a discriminative probabilistic model, derive the conditional log-likelihood, its gradient, and the corresponding Bayes Classifier
  • Explain the practical reasons why we work with the log of the likelihood
  • Implement logistic regression for binary or multiclass classification
  • Prove that the decision boundary of binary logistic regression is linear
  • For linear regression, show that the parameters which minimize squared error are equivalent to those that maximize conditional likelihood

51

slide-40
SLIDE 40

MULTINOMIAL LOGISTIC REGRESSION

54

slide-41
SLIDE 41

55

slide-42
SLIDE 42

Multinomial Logistic Regression

Chalkboard

– Background: Multinomial distribution
– Definition: Multi-class classification
– Geometric intuitions
– Multinomial logistic regression model
– Generative story
– Reduction to binary logistic regression
– Partial derivatives and gradients
– Applying Gradient Descent and SGD
– Implementation w/ sparse features

56
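Ahead of the chalkboard material, here is a minimal sketch of the multinomial logistic regression model itself (a standard softmax formulation with made-up parameters, not taken from the slides): with one parameter vector θ_k per class, pθ(y = k | x) = exp(θ_k^T x) / ∑_j exp(θ_j^T x).

    import numpy as np

    rng = np.random.default_rng(0)
    K, M = 3, 4
    theta = rng.normal(size=(K, M))             # K-by-M parameter matrix, one row per class
    x = rng.normal(size=M)                      # hypothetical input vector

    scores = theta @ x                          # theta_k^T x for each class k
    scores -= scores.max()                      # subtract the max for numerical stability
    p = np.exp(scores) / np.exp(scores).sum()   # softmax probabilities (sum to 1)
    print(p, p.argmax())                        # class probabilities and predicted class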

slide-43
SLIDE 43

Debug that Program!

In-Class Exercise (Think-Pair-Share): Debug the following program, which is (incorrectly) attempting to run SGD for multinomial logistic regression.

57

Buggy Program:

    while not converged:
        for i in shuffle([1, …, N]):
            for k in [1, …, K]:
                theta[k] = theta[k] - lambda * grad(x[i], y[i], theta, k)

Assume: grad(x[i], y[i], theta, k) returns the gradient of the negative log-likelihood of the training example (x[i], y[i]) with respect to vector theta[k]. lambda is the learning rate. N = # of examples. K = # of output classes. M = # of features. theta is a K by M matrix.