Binary Logistic Regression + Multinomial Logistic Regression (Matt Gormley)



slide-1
SLIDE 1

Binary Logistic Regression + Multinomial Logistic Regression

1

10-601 Introduction to Machine Learning

Matt Gormley
Lecture 10, Feb. 17, 2020

Machine Learning Department, School of Computer Science, Carnegie Mellon University

slide-2
SLIDE 2

Reminders

  • Midterm Exam 1

– Tue, Feb. 18, 7:00pm – 9:00pm

  • Homework 4: Logistic Regression

– Out: Wed, Feb. 19 – Due: Fri, Feb. 28 at 11:59pm

  • Today’s In-Class Poll

– http://p10.mlcourse.org

  • Reading on Probabilistic Learning is reused later in the course for MLE/MAP

3

slide-3
SLIDE 3

MLE

5

Suppose we have data D = {x^(i)}_{i=1}^N.

  • Principle of Maximum Likelihood Estimation: Choose the parameters that maximize the likelihood of the data.

θ_MLE = argmax_θ ∏_{i=1}^N p(x^(i) | θ)

Maximum Likelihood Estimate (MLE)
[Figure: the likelihood surface L(θ_1, θ_2), with the maximizer θ_MLE marked.]
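The principle above can be made concrete with a toy example. The following is a minimal sketch (not from the slides; the coin-flip data and variable names are my own) that evaluates the likelihood of a Bernoulli model over a grid of parameter values and picks the maximizer, which lands near the closed-form answer, the sample mean.

    import numpy as np

    # Toy illustration of MLE: data D = {x^(i)}_{i=1}^N of coin flips, x^(i) in {0, 1}.
    # The likelihood is L(theta) = prod_i theta^{x^(i)} (1 - theta)^{1 - x^(i)}.
    x = np.array([1, 0, 1, 1, 0, 1, 1, 0])          # hypothetical observed flips

    thetas = np.linspace(0.01, 0.99, 99)            # candidate parameter values
    log_lik = np.array([np.sum(x * np.log(t) + (1 - x) * np.log(1 - t)) for t in thetas])

    theta_mle = thetas[np.argmax(log_lik)]          # grid maximizer of the (log-)likelihood
    print(theta_mle, x.mean())                      # close to the sample mean, the exact MLE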

slide-4
SLIDE 4

MLE

What does maximizing likelihood accomplish?

  • There is only a finite amount of probability mass (i.e. the sum-to-one constraint)

  • MLE tries to allocate as much probability mass as possible to the things we have observed…

…at the expense of the things we have not observed

6

slide-5
SLIDE 5

MOTIVATION: LOGISTIC REGRESSION

7

slide-6
SLIDE 6

Example: Image Classification

  • ImageNet LSVRC-2010 contest:

– Dataset: 1.2 million labeled images, 1000 classes
– Task: Given a new image, label it with the correct class
– Multiclass classification problem

  • Examples from http://image-net.org/

10

slide-7
SLIDE 7

11

slide-8
SLIDE 8

12

slide-9
SLIDE 9

13

slide-10
SLIDE 10

Example: Image Classification

14

CNN for Image Classification (Krizhevsky, Sutskever & Hinton, 2011): 17.5% error on the ImageNet LSVRC-2010 contest

  • Input image (pixels)
  • Five convolutional layers (w/ max-pooling)
  • Three fully connected layers
  • 1000-way softmax

slide-11
SLIDE 11

Example: Image Classification

15

CNN for Image Classification (Krizhevsky, Sutskever & Hinton, 2011): 17.5% error on the ImageNet LSVRC-2010 contest

  • Input image (pixels)
  • Five convolutional layers (w/ max-pooling)
  • Three fully connected layers
  • 1000-way softmax

This “softmax” layer is Logistic Regression! The rest is just some fancy feature extraction (discussed later in the course).

slide-12
SLIDE 12

LOGISTIC REGRESSION

16

slide-13
SLIDE 13

Logistic Regression

17

We are back to classification, despite the name logistic regression.

Data: Inputs are continuous vectors of length M. Outputs are discrete.

slide-14
SLIDE 14

Key idea: Try to learn this hyperplane directly

Linear Models for Classification

Directly modeling the hyperplane would use a decision function

h(x) = sign(θ^T x)

for y ∈ {−1, +1} (see the sketch below).

Looking ahead:

  • We’ll see a number of commonly used Linear Classifiers
  • These include:
– Perceptron
– Logistic Regression
– Naïve Bayes (under certain conditions)
– Support Vector Machines

Recall…
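As a concrete illustration of the decision function above, here is a minimal sketch (toy weights and input of my own choosing, with the bias already folded into θ as described on the next slide):

    import numpy as np

    # Linear classifier h(x) = sign(theta^T x) with labels y in {-1, +1}.
    theta = np.array([0.5, -1.0, 2.0])   # hypothetical weight vector (bias folded in as theta[0])
    x = np.array([1.0, 3.0, 0.25])       # hypothetical input, with a leading constant 1 for the bias

    h = np.sign(theta @ x)               # which side of the hyperplane x falls on
    print(h)                             # -1.0 for these values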

slide-15
SLIDE 15

Background: Hyperplanes

Hyperplane (Definition 1): H = {x : w^T x = b}

Hyperplane (Definition 2) and half-spaces: [equations given on the slide]

Notation Trick: fold the bias b and the weights w into a single vector θ by prepending a constant to x and increasing dimensionality by one!

Recall…
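A small sketch of the notation trick (toy values of my own choosing; one common convention writes the affine score w^T x + b as θ^T x' with θ = [b, w] and x' = [1, x]):

    import numpy as np

    # Fold the bias b and the weights w into a single vector theta
    # by prepending a constant 1 to x.
    w = np.array([2.0, -1.0])
    b = 0.5
    x = np.array([1.5, 3.0])

    theta = np.concatenate(([b], w))      # theta = [b, w_1, ..., w_M]
    x_aug = np.concatenate(([1.0], x))    # x'    = [1, x_1, ..., x_M]

    print(w @ x + b, theta @ x_aug)       # identical values: w^T x + b = theta^T x'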

slide-16
SLIDE 16

Using gradient ascent for linear classifiers

Key idea behind today’s lecture:

1. Define a linear classifier (logistic regression)
2. Define an objective function (likelihood)
3. Optimize it with gradient descent to learn the parameters
4. Predict the class with the highest probability under the model

20

slide-17
SLIDE 17

Using gradient ascent for linear classifiers

21

This decision function isn’t differentiable:

h(x) = sign(θ^T x)

Use a differentiable function instead:

logistic(u) ≡ 1 / (1 + e^(−u))

pθ(y = 1 | x) = 1 / (1 + exp(−θ^T x))

slide-18
SLIDE 18

Using gradient ascent for linear classifiers

22

This decision function isn’t differentiable:

h(x) = sign(θ^T x)

Use a differentiable function instead:

logistic(u) ≡ 1 / (1 + e^(−u))

pθ(y = 1 | x) = 1 / (1 + exp(−θ^T x))
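A minimal sketch of the logistic function and the resulting class probability (the parameter and input values are hypothetical, chosen only for illustration):

    import numpy as np

    def logistic(u):
        # logistic(u) = 1 / (1 + e^{-u}), a differentiable replacement for sign(u)
        return 1.0 / (1.0 + np.exp(-u))

    theta = np.array([0.5, -1.0, 2.0])   # hypothetical parameters
    x = np.array([1.0, 0.5, 1.5])        # hypothetical input (leading 1 for the bias)

    p1 = logistic(theta @ x)             # p_theta(y = 1 | x); p_theta(y = 0 | x) = 1 - p1
    print(p1)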

slide-19
SLIDE 19

Logistic Regression

23

Learning: find the parameters that minimize some objective function:

θ* = argmin_θ J(θ)

Prediction: output the most probable class:

ŷ = argmax_{y ∈ {0,1}} pθ(y | x)

Model: the logistic function applied to the dot product of the parameters with the input vector:

pθ(y = 1 | x) = 1 / (1 + exp(−θ^T x))

Data: Inputs are continuous vectors of length M. Outputs are discrete.
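A minimal sketch of prediction under this model (hypothetical parameters and input; predicting the most probable class is equivalent to thresholding pθ(y = 1 | x) at 0.5, i.e. checking the sign of θ^T x):

    import numpy as np

    def logistic(u):
        return 1.0 / (1.0 + np.exp(-u))

    theta = np.array([-0.25, 1.5, -0.5])   # hypothetical learned parameters
    x = np.array([1.0, 0.2, 0.8])          # hypothetical input (leading 1 for the bias)

    p1 = logistic(theta @ x)               # p_theta(y = 1 | x)
    y_hat = 1 if p1 >= 0.5 else 0          # argmax over y in {0, 1}
    print(p1, y_hat)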

slide-20
SLIDE 20

Logistic Regression

Whiteboard

– Bernoulli interpretation
– Logistic Regression Model
– Decision boundary

24

slide-21
SLIDE 21

Learning for Logistic Regression

Whiteboard

– Partial derivative for Logistic Regression
– Gradient for Logistic Regression

25

slide-22
SLIDE 22

LOGISTIC REGRESSION ON GAUSSIAN DATA

26

slide-23
SLIDE 23

Logistic Regression

27

slide-24
SLIDE 24

Logistic Regression

28

slide-25
SLIDE 25

Logistic Regression

29

slide-26
SLIDE 26

LEARNING LOGISTIC REGRESSION

30

slide-27
SLIDE 27

Maximum Conditional Likelihood Estimation

31

Learning: find the parameters that minimize some objective function:

θ* = argmin_θ J(θ)

We minimize the negative log conditional likelihood:

J(θ) = −∑_{i=1}^N log pθ(y^(i) | x^(i))

Why?
1. We can’t maximize the likelihood (as in Naïve Bayes) because we don’t have a joint model p(x, y)
2. It worked well for Linear Regression (least squares is MCLE)
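A minimal sketch of evaluating this objective on a tiny, made-up dataset (the design matrix, labels, and variable names are my own; at θ = 0 every example has probability 0.5, so J(θ) = 3 log 2):

    import numpy as np

    def logistic(u):
        return 1.0 / (1.0 + np.exp(-u))

    X = np.array([[1.0, 0.5, 1.0],     # toy design matrix, one row per example,
                  [1.0, -1.0, 0.5],    # with a leading 1 for the bias
                  [1.0, 2.0, -0.5]])
    y = np.array([1.0, 0.0, 1.0])
    theta = np.zeros(3)

    p1 = logistic(X @ theta)                                 # p_theta(y = 1 | x^(i)) for each i
    J = -np.sum(y * np.log(p1) + (1 - y) * np.log(1 - p1))   # negative log conditional likelihood
    print(J)                                                 # 3 * log(2) at theta = 0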

slide-28
SLIDE 28

Maximum Conditional Likelihood Estimation

32

Learning: Four approaches to solving θ* = argmin_θ J(θ)

Approach 1: Gradient Descent (take larger, more certain, steps opposite the gradient)
Approach 2: Stochastic Gradient Descent (SGD) (take many small steps opposite the gradient)
Approach 3: Newton’s Method (use second derivatives to better follow curvature)
Approach 4: Closed Form??? (set derivatives equal to zero and solve for parameters)

slide-29
SLIDE 29

Maximum Conditional Likelihood Estimation

33

Learning: Four approaches to solving θ* = argmin_θ J(θ)

Approach 1: Gradient Descent (take larger, more certain, steps opposite the gradient)
Approach 2: Stochastic Gradient Descent (SGD) (take many small steps opposite the gradient)
Approach 3: Newton’s Method (use second derivatives to better follow curvature)
Approach 4: Closed Form??? (set derivatives equal to zero and solve for parameters)

Logistic Regression does not have a closed form solution for MLE parameters.

slide-30
SLIDE 30

SGD for Logistic Regression

34

Question: Which of the following is a correct description of SGD for Logistic Regression?

Answer: At each step (i.e. iteration) of SGD for Logistic Regression we…
A. (1) compute the gradient of the log-likelihood for all examples, (2) update all the parameters using the gradient
B. (1) ask Matt for a description of SGD for Logistic Regression, (2) write it down, (3) report that answer
C. (1) compute the gradient of the log-likelihood for all examples, (2) randomly pick an example, (3) update only the parameters for that example
D. (1) randomly pick a parameter, (2) compute the partial derivative of the log-likelihood with respect to that parameter, (3) update that parameter for all examples
E. (1) randomly pick an example, (2) compute the gradient of the log-likelihood for that example, (3) update all the parameters using that gradient
F. (1) randomly pick a parameter and an example, (2) compute the gradient of the log-likelihood for that example with respect to that parameter, (3) update that parameter using that gradient

slide-31
SLIDE 31

Algorithm 1: Gradient Descent

1: procedure GD(D, θ^(0))
2:   θ ← θ^(0)
3:   while not converged do
4:     θ ← θ − λ ∇θ J(θ)
5:   return θ

Gradient Descent

35

In order to apply GD to Logistic Regression all we need is the gradient of the objective function (i.e. vector of partial derivatives).

∇θ J(θ) = [ dJ(θ)/dθ_1, dJ(θ)/dθ_2, …, dJ(θ)/dθ_N ]^T

Recall…
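A minimal sketch of gradient descent applied to this objective on toy data (my own values; it uses the standard result that the gradient of the negative log conditional likelihood is ∑_i (pθ(y = 1 | x^(i)) − y^(i)) x^(i), i.e. X^T (p − y) in matrix form, and a fixed iteration count in place of a convergence test):

    import numpy as np

    def logistic(u):
        return 1.0 / (1.0 + np.exp(-u))

    X = np.array([[1.0, 0.5], [1.0, -1.0], [1.0, 2.0]])   # toy inputs (leading 1 for the bias)
    y = np.array([1.0, 0.0, 1.0])
    theta = np.zeros(2)
    lam = 0.1                                             # learning rate (lambda on the slides)

    for _ in range(100):                                  # "while not converged", fixed here
        grad = X.T @ (logistic(X @ theta) - y)            # gradient of J(theta)
        theta = theta - lam * grad                        # step opposite the gradient
    print(theta)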

slide-32
SLIDE 32

Stochastic Gradient Descent (SGD)

36

Recall…

We can also apply SGD to solve the MCLE problem for Logistic Regression. We need a per-example objective:

Let J(θ) = ∑_{i=1}^N J^(i)(θ), where J^(i)(θ) = −log pθ(y^(i) | x^(i)).
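A minimal sketch of a single SGD step for this per-example objective (toy data and names of my own; it follows the pick-an-example, compute-its-gradient, update-all-parameters pattern):

    import numpy as np

    def logistic(u):
        return 1.0 / (1.0 + np.exp(-u))

    rng = np.random.default_rng(0)
    X = np.array([[1.0, 0.5], [1.0, -1.0], [1.0, 2.0]])   # toy inputs (leading 1 for the bias)
    y = np.array([1.0, 0.0, 1.0])
    theta = np.zeros(2)
    lam = 0.1                                             # learning rate

    i = rng.integers(len(y))                              # (1) randomly pick an example
    grad_i = (logistic(theta @ X[i]) - y[i]) * X[i]       # (2) gradient of J^(i)(theta)
    theta = theta - lam * grad_i                          # (3) update all parameters with it
    print(theta)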

slide-33
SLIDE 33

Logistic Regression vs. Perceptron

37

Question:

True or False: Just like Perceptron, one step (i.e. iteration) of SGD for Logistic Regression will result in a change to the parameters only if the current example is incorrectly classified.

Answer:

slide-34
SLIDE 34

Matching Game

Goal: Match the Algorithm to its Update Rule

38

  • 1. SGD for Logistic Regression, where hθ(x) = p(y|x)
  • 2. Least Mean Squares, where hθ(x) = θ^T x
  • 3. Perceptron, where hθ(x) = sign(θ^T x)

4. θ_k ← θ_k + 1 / (1 + exp λ(hθ(x^(i)) − y^(i)))
5. θ_k ← θ_k + (hθ(x^(i)) − y^(i))
6. θ_k ← θ_k + λ(hθ(x^(i)) − y^(i)) x_k^(i)

  • A. 1=5, 2=4, 3=6
  • B. 1=5, 2=6, 3=4
  • C. 1=6, 2=4, 3=4
  • D. 1=5, 2=6, 3=6
  • E. 1=6, 2=6, 3=6
  • F. 1=6, 2=5, 3=5
  • G. 1=5, 2=5, 3=5
  • H. 1=4, 2=5, 3=6
slide-35
SLIDE 35

OPTIMIZATION METHOD #4: MINI-BATCH SGD

39

slide-36
SLIDE 36

Mini-Batch SGD

  • Gradient Descent:
Compute the true gradient exactly from all N examples

  • Stochastic Gradient Descent (SGD):
Approximate the true gradient by the gradient of one randomly chosen example

  • Mini-Batch SGD:
Approximate the true gradient by the average gradient of K randomly chosen examples

40
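A minimal sketch of one mini-batch step (randomly generated toy data, K = 10; the true gradient is approximated by the average gradient over the K sampled examples):

    import numpy as np

    def logistic(u):
        return 1.0 / (1.0 + np.exp(-u))

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))                  # hypothetical dataset of N = 100 examples
    y = rng.integers(0, 2, size=100).astype(float)
    theta = np.zeros(3)
    lam, K = 0.1, 10                               # learning rate and mini-batch size

    batch = rng.choice(len(y), size=K, replace=False)                # K randomly chosen examples
    grad = X[batch].T @ (logistic(X[batch] @ theta) - y[batch]) / K  # average gradient over the batch
    theta = theta - lam * grad
    print(theta)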

slide-37
SLIDE 37

Mini-Batch SGD

41

Three variants of first-order optimization:

slide-38
SLIDE 38

Summary

  • 1. Discriminative classifiers directly model the conditional, p(y|x)
  • 2. Logistic regression is a simple linear classifier that retains a probabilistic semantics
  • 3. Parameters in LR are learned by iterative optimization (e.g. SGD)

50

slide-39
SLIDE 39

Logistic Regression Objectives

You should be able to…

  • Apply the principle of maximum likelihood estimation (MLE) to learn the parameters of a probabilistic model
  • Given a discriminative probabilistic model, derive the conditional log-likelihood, its gradient, and the corresponding Bayes Classifier
  • Explain the practical reasons why we work with the log of the likelihood
  • Implement logistic regression for binary or multiclass classification
  • Prove that the decision boundary of binary logistic regression is linear
  • For linear regression, show that the parameters which minimize squared error are equivalent to those that maximize conditional likelihood

51

slide-40
SLIDE 40

MULTINOMIAL LOGISTIC REGRESSION

54

slide-41
SLIDE 41

55

slide-42
SLIDE 42

Multinomial Logistic Regression

Chalkboard

– Background: Multinomial distribution
– Definition: Multi-class classification
– Geometric intuitions
– Multinomial logistic regression model
– Generative story
– Reduction to binary logistic regression
– Partial derivatives and gradients
– Applying Gradient Descent and SGD
– Implementation w/ sparse features

56
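Ahead of the chalkboard material, here is a minimal sketch of the multinomial logistic regression model itself (a standard softmax formulation with made-up parameters, not taken from the slides): with one parameter vector θ_k per class, pθ(y = k | x) = exp(θ_k^T x) / ∑_j exp(θ_j^T x).

    import numpy as np

    rng = np.random.default_rng(0)
    K, M = 3, 4
    theta = rng.normal(size=(K, M))             # K-by-M parameter matrix, one row per class
    x = rng.normal(size=M)                      # hypothetical input vector

    scores = theta @ x                          # theta_k^T x for each class k
    scores -= scores.max()                      # subtract the max for numerical stability
    p = np.exp(scores) / np.exp(scores).sum()   # softmax probabilities (sum to 1)
    print(p, p.argmax())                        # class probabilities and predicted class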

slide-43
SLIDE 43

Debug that Program!

In-Class Exercise (Think-Pair-Share): Debug the following program, which is (incorrectly) attempting to run SGD for multinomial logistic regression.

57

Buggy Program:

    while not converged:
        for i in shuffle([1, …, N]):
            for k in [1, …, K]:
                theta[k] = theta[k] - lambda * grad(x[i], y[i], theta, k)

Assume: grad(x[i], y[i], theta, k) returns the gradient of the negative log-likelihood of the training example (x[i], y[i]) with respect to vector theta[k]. lambda is the learning rate. N = # of examples. K = # of output classes. M = # of features. theta is a K by M matrix.