SLIDE 1

Lecture 3 Logistic Regression & Softmax Regression

Rui Xia, Text Mining Group, Nanjing University of Science & Technology (rxia@njust.edu.cn)

SLIDE 2

Supervised Learning

  • Classification
  • Regression


SLIDE 3

Logistic Regression


SLIDE 4

Introduction

  • Logistic regression is a classification model, despite being called "regression";
  • Logistic regression is a binary classification model;
  • Logistic regression is a linear classification model: its decision boundary is a linear hyperplane, but it uses a nonlinear activation function (the sigmoid) to model the posterior probability.

SLIDE 5

Model Hypothesis

  • Sigmoid Function

$$\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad \frac{d\sigma(z)}{dz} = \sigma(z)\,\bigl(1 - \sigma(z)\bigr)$$

  • Hypothesis

$$p(y = 1 \mid x; \theta) = h_\theta(x) = \sigma(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$$

$$p(y = 0 \mid x; \theta) = 1 - h_\theta(x)$$

  • Hypothesis (Compact Form)

$$p(y \mid x; \theta) = \bigl(h_\theta(x)\bigr)^{y}\bigl(1 - h_\theta(x)\bigr)^{1-y} = \Bigl(\frac{1}{1 + e^{-\theta^T x}}\Bigr)^{y}\Bigl(1 - \frac{1}{1 + e^{-\theta^T x}}\Bigr)^{1-y}$$
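To make this concrete, here is a minimal NumPy sketch of the sigmoid and the hypothesis (the function names are our own, not from the slides):

```python
import numpy as np

def sigmoid(z):
    """The sigmoid function: sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, x):
    """h_theta(x) = sigma(theta^T x), the posterior p(y = 1 | x; theta)."""
    return sigmoid(theta @ x)

# Check the derivative identity sigma'(z) = sigma(z) * (1 - sigma(z)) numerically:
z, eps = 0.5, 1e-6
print(sigmoid(z) * (1 - sigmoid(z)))                       # analytic form
print((sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps))   # finite difference
```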

SLIDE 6

Learning Algorithm

  • (Conditional) Likelihood Function

$$L(\theta) = \prod_{i=1}^{N} p\bigl(y^{(i)} \mid x^{(i)}; \theta\bigr) = \prod_{i=1}^{N} \bigl(h_\theta(x^{(i)})\bigr)^{y^{(i)}}\bigl(1 - h_\theta(x^{(i)})\bigr)^{1 - y^{(i)}}$$

$$= \prod_{i=1}^{N} \Bigl(\frac{1}{1 + e^{-\theta^T x^{(i)}}}\Bigr)^{y^{(i)}}\Bigl(1 - \frac{1}{1 + e^{-\theta^T x^{(i)}}}\Bigr)^{1 - y^{(i)}}$$

  • Maximum Likelihood Estimation

$$\max_\theta L(\theta) \;\Leftrightarrow\; \max_\theta \sum_{i=1}^{N}\Bigl[ y^{(i)} \log h_\theta(x^{(i)}) + \bigl(1 - y^{(i)}\bigr)\log\bigl(1 - h_\theta(x^{(i)})\bigr)\Bigr]$$

The negative log-likelihood is also known as the Cross-Entropy cost function.
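As a sketch, the cross-entropy cost (the negative log-likelihood above) can be computed vectorially; `X` is an assumed (N, d) design matrix and `y` a vector of 0/1 labels:

```python
import numpy as np

def cross_entropy(theta, X, y):
    """Negative log-likelihood of logistic regression.

    X: (N, d) design matrix; y: (N,) labels in {0, 1}.
    """
    h = 1.0 / (1.0 + np.exp(-X @ theta))  # h_theta(x^(i)) for all samples at once
    return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))
```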

SLIDE 7

Unconstrained Optimization

  • Unconstrained Optimization Problem

$$\max_\theta \sum_{i=1}^{N}\Bigl[ y^{(i)} \log h_\theta(x^{(i)}) + \bigl(1 - y^{(i)}\bigr)\log\bigl(1 - h_\theta(x^{(i)})\bigr)\Bigr]$$

  • Optimization Methods

– Gradient Descent
– Stochastic Gradient Descent
– Newton's Method
– Quasi-Newton Methods
– Conjugate Gradient
– …

SLIDE 8

Gradient Descent/Ascent

  • Gradient Computation

$$\frac{\partial \ell(\theta)}{\partial \theta} = \sum_{i=1}^{N}\Bigl[ y^{(i)}\frac{1}{h_\theta(x^{(i)})} - \bigl(1 - y^{(i)}\bigr)\frac{1}{1 - h_\theta(x^{(i)})}\Bigr]\frac{\partial}{\partial \theta} h_\theta(x^{(i)})$$

$$= \sum_{i=1}^{N}\Bigl[ y^{(i)}\frac{1}{h_\theta(x^{(i)})} - \bigl(1 - y^{(i)}\bigr)\frac{1}{1 - h_\theta(x^{(i)})}\Bigr] h_\theta(x^{(i)})\bigl(1 - h_\theta(x^{(i)})\bigr)\frac{\partial}{\partial \theta}\theta^T x^{(i)}$$

$$= \sum_{i=1}^{N}\Bigl[ y^{(i)}\bigl(1 - h_\theta(x^{(i)})\bigr) - \bigl(1 - y^{(i)}\bigr) h_\theta(x^{(i)})\Bigr] x^{(i)} = \sum_{i=1}^{N}\bigl(y^{(i)} - h_\theta(x^{(i)})\bigr) x^{(i)}$$

The gradient has the form Error Γ— Feature.

  • Gradient Ascent Optimization

$$\theta := \theta + \alpha \sum_{i=1}^{N}\bigl(y^{(i)} - h_\theta(x^{(i)})\bigr) x^{(i)}$$
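A minimal batch gradient-ascent sketch of the update above (the learning rate `alpha` and iteration count are illustrative choices, not values from the slides):

```python
import numpy as np

def gradient_ascent(X, y, alpha=0.1, n_iters=1000):
    """Maximize the log-likelihood using the 'error x feature' gradient.

    Each step adds alpha * sum_i (y^(i) - h_theta(x^(i))) x^(i).
    """
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        h = 1.0 / (1.0 + np.exp(-X @ theta))  # predictions for all samples
        theta += alpha * X.T @ (y - h)        # error x feature, summed over i
    return theta
```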

SLIDE 9

Stochastic Gradient Descent

  • Randomly choose a training sample $(x, y)$
  • Compute the gradient: $(y - h_\theta(x))\, x$
  • Update the weights: $\theta := \theta + \alpha\,(y - h_\theta(x))\, x$
  • Repeat…

Gradient descent performs batch updating; stochastic gradient descent performs online updating.
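A corresponding stochastic sketch, assuming one pass over a random permutation of the samples per epoch (one common way to "randomly choose a training sample"):

```python
import numpy as np

def sgd(X, y, alpha=0.1, n_epochs=10, seed=0):
    """Online (per-sample) version of the update above."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for i in rng.permutation(len(y)):           # random sample order
            h = 1.0 / (1.0 + np.exp(-X[i] @ theta))
            theta += alpha * (y[i] - h) * X[i]      # single-sample error x feature
    return theta
```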

SLIDE 10

GD vs. SGD

[Figure: Gradient Descent (GD) vs. Stochastic Gradient Descent (SGD)]

SLIDE 11

Illustration of Newton’s Method

Tangent line of $g'$ at $\theta^{(0)}$:

$$y = g'(\theta^{(0)}) + g''(\theta^{(0)})\,(\theta - \theta^{(0)})$$

Setting this tangent line to zero (the curve is $y = g'(\theta)$) gives the iterates:

$$\theta^{(1)} = \theta^{(0)} - \frac{g'(\theta^{(0)})}{g''(\theta^{(0)})}, \qquad \theta^{(2)} = \theta^{(1)} - \frac{g'(\theta^{(1)})}{g''(\theta^{(1)})}, \qquad \theta^{(3)}, \theta^{(4)}, \cdots \to \theta^*$$

[Figure: Newton's method iterates $\theta^{(0)}, \theta^{(1)}, \theta^{(2)}$ on the curve $y = g'(\theta)$]

SLIDE 12

Newton’s Method

  • Problem

$$\arg\min_\theta g(\theta) \;\Leftrightarrow\; \text{solve: } \nabla g(\theta) = 0$$

  • Second-order Taylor expansion

$$q(\theta) = g(\theta^{(t)}) + \nabla g(\theta^{(t)})^T(\theta - \theta^{(t)}) + \frac{1}{2}(\theta - \theta^{(t)})^T \nabla^2 g(\theta^{(t)})\,(\theta - \theta^{(t)}) \approx g(\theta)$$

where $\nabla^2 g$ is the Hessian matrix.

  • Newton's method (also called the Newton–Raphson method)

$$\nabla q(\theta) = 0 \;\Rightarrow\; \theta = \theta^{(t)} - \bigl[\nabla^2 g(\theta^{(t)})\bigr]^{-1}\nabla g(\theta^{(t)})$$

$$\theta^{(t+1)} = \theta^{(t)} - \bigl[\nabla^2 g(\theta^{(t)})\bigr]^{-1}\nabla g(\theta^{(t)})$$
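A one-dimensional sketch of the Newton–Raphson iteration, assuming $g'$ and $g''$ are supplied as functions (the names are ours):

```python
def newton_minimize(g_prime, g_double_prime, theta0, n_iters=20):
    """Iterate theta <- theta - g'(theta) / g''(theta) to solve g'(theta) = 0."""
    theta = theta0
    for _ in range(n_iters):
        theta -= g_prime(theta) / g_double_prime(theta)
    return theta

# Example: g(theta) = (theta - 3)^2 has g' = 2(theta - 3) and g'' = 2;
# here Newton's method jumps to the minimizer theta* = 3 in a single step.
print(newton_minimize(lambda t: 2 * (t - 3), lambda t: 2.0, theta0=0.0))
```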

SLIDE 13

Gradient vs. Newton’s Method

SLIDE 14

Newton’s Method for Logistic Regression

  • Optimization Problem

$$\arg\min_\theta J(\theta) = \frac{1}{N}\sum_{i=1}^{N}\Bigl[-y^{(i)}\log h_\theta(x^{(i)}) - \bigl(1 - y^{(i)}\bigr)\log\bigl(1 - h_\theta(x^{(i)})\bigr)\Bigr]$$

  • Gradient and Hessian Matrix

$$\nabla J(\theta) = \frac{1}{N}\sum_{i=1}^{N}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)\, x^{(i)}$$

$$H = \frac{1}{N}\sum_{i=1}^{N} h_\theta(x^{(i)})\bigl(1 - h_\theta(x^{(i)})\bigr)\, x^{(i)} (x^{(i)})^T$$

  • Weight updating using Newton's method

$$\theta^{(t+1)} = \theta^{(t)} - H^{-1}\nabla J(\theta^{(t)})$$
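A sketch of these updates in NumPy (assuming the Hessian is nonsingular; in practice one may add a small ridge term):

```python
import numpy as np

def newton_logistic(X, y, n_iters=10):
    """Newton's method for logistic regression.

    X: (N, d) design matrix; y: (N,) labels in {0, 1}.
    """
    N, d = X.shape
    theta = np.zeros(d)
    for _ in range(n_iters):
        h = 1.0 / (1.0 + np.exp(-X @ theta))
        grad = X.T @ (h - y) / N               # nabla J(theta)
        H = (X.T * (h * (1 - h))) @ X / N      # sum_i h(1-h) x^(i) x^(i)^T / N
        theta -= np.linalg.solve(H, grad)      # theta <- theta - H^{-1} grad
    return theta
```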

SLIDE 15

Practice: Logistic Regression

  • Given the following training data:
  • Implement 1) GD; 2) SGD; 3) Newton's method for logistic regression, starting from the initial parameter ΞΈ = 0.
  • Determine how many iterations to use; compute the cost at each iteration and plot your results.

http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=DeepLearning&doc=exercises/ex4/ex4.html

SLIDE 16

Softmax Regression


SLIDE 17

Softmax Regression

  • Softmax regression is a multi-class classification model, also called multi-class logistic regression;
  • It is also known as the Maximum Entropy model (in NLP);
  • It is one of the most widely used classification algorithms.

SLIDE 18

Model Description

  • Model Hypothesis

$$p(y = k \mid x; \theta) = h_k(x) = \frac{e^{\theta_k^T x}}{1 + \sum_{k'=1}^{K-1} e^{\theta_{k'}^T x}}, \quad k = 1, \ldots, K-1$$

$$p(y = K \mid x; \theta) = h_K(x) = \frac{1}{1 + \sum_{k'=1}^{K-1} e^{\theta_{k'}^T x}}$$

  • Model Hypothesis (Compact Form)

$$p(y = k \mid x; \theta) = h_k(x) = \frac{e^{\theta_k^T x}}{\sum_{k'=1}^{K} e^{\theta_{k'}^T x}}, \quad k = 1, 2, \ldots, K, \text{ where } \theta_K = 0$$

  • Parameters: the matrix $\theta_{K \times d}$ (one weight vector per class)
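A sketch of the compact-form hypothesis (the max-subtraction is a standard numerical-stability trick, not something the slides specify):

```python
import numpy as np

def softmax_hypothesis(Theta, x):
    """Class posteriors h_k(x) = exp(theta_k^T x) / sum_k' exp(theta_k'^T x).

    Theta: (K, d) parameter matrix; x: (d,) input. Returns a length-K vector.
    """
    scores = Theta @ x
    scores = scores - scores.max()  # shift for stability; probabilities unchanged
    e = np.exp(scores)
    return e / e.sum()
```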

SLIDE 19

Maximum Likelihood Estimation

  • (Conditional) Log-likelihood

Softmax Regression:

$$\ell(\theta) = \sum_{i=1}^{N}\log p\bigl(y^{(i)} \mid x^{(i)}; \theta\bigr) = \sum_{i=1}^{N}\log \prod_{k=1}^{K}\Biggl(\frac{e^{\theta_k^T x^{(i)}}}{\sum_{k'=1}^{K} e^{\theta_{k'}^T x^{(i)}}}\Biggr)^{1\{y^{(i)} = k\}}$$

$$= \sum_{i=1}^{N}\sum_{k=1}^{K} 1\{y^{(i)} = k\}\,\log\frac{e^{\theta_k^T x^{(i)}}}{\sum_{k'=1}^{K} e^{\theta_{k'}^T x^{(i)}}} = \sum_{i=1}^{N}\sum_{k=1}^{K} 1\{y^{(i)} = k\}\,\log h_k(x^{(i)})$$

Logistic Regression (for comparison):

$$\ell(\theta) = \sum_{i=1}^{N}\Bigl[ y^{(i)}\log h_\theta(x^{(i)}) + \bigl(1 - y^{(i)}\bigr)\log\bigl(1 - h_\theta(x^{(i)})\bigr)\Bigr]$$

SLIDE 20
Gradient Descent Optimization

  • Gradient

$$\frac{\partial \log h_k(x)}{\partial \theta_l} = \begin{cases}\bigl(1 - h_l(x)\bigr)\, x, & k = l \\ -h_l(x)\, x, & k \neq l\end{cases}$$

$$\frac{\partial \sum_{k=1}^{K} 1\{y = k\}\log h_k(x)}{\partial \theta_l} = \begin{cases}\bigl(1 - h_l(x)\bigr)\, x, & y = l \\ -h_l(x)\, x, & y \neq l\end{cases} = \bigl(1\{y = l\} - h_l(x)\bigr)\, x$$

$$\frac{\partial \ell(\theta)}{\partial \theta_l} = \sum_{i=1}^{N}\bigl(1\{y^{(i)} = l\} - h_l(x^{(i)})\bigr)\, x^{(i)}$$

Again the gradient has the form Error Γ— Feature.

SLIDE 21

Gradient Descent Optimization

  • Gradient Descent

$$\theta_l := \theta_l + \alpha \sum_{i=1}^{N}\bigl(1\{y^{(i)} = l\} - h_l(x^{(i)})\bigr)\, x^{(i)}, \quad \text{where } h_l(x) = \frac{e^{\theta_l^T x}}{\sum_{l'=1}^{K} e^{\theta_{l'}^T x}}, \; l = 1, 2, \ldots, K$$

  • Stochastic Gradient Descent

$$\theta_l := \theta_l + \alpha\,\bigl(1\{y = l\} - h_l(x)\bigr)\, x$$
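A batch gradient-ascent sketch of these updates, assuming labels are coded 0, ..., K-1 (the slides use 1, ..., K):

```python
import numpy as np

def softmax_gd(X, y, K, alpha=0.1, n_iters=1000):
    """Batch update theta_l += alpha * sum_i (1{y^(i)=l} - h_l(x^(i))) x^(i).

    X: (N, d) design matrix; y: (N,) integer labels in {0, ..., K-1}.
    """
    N, d = X.shape
    Theta = np.zeros((K, d))
    Y = np.eye(K)[y]                              # one-hot: Y[i, k] = 1{y^(i) = k}
    for _ in range(n_iters):
        scores = X @ Theta.T                      # (N, K): theta_k^T x^(i)
        scores -= scores.max(axis=1, keepdims=True)
        e = np.exp(scores)
        H = e / e.sum(axis=1, keepdims=True)      # H[i, k] = h_k(x^(i))
        Theta += alpha * (Y - H).T @ X            # error x feature for every class
    return Theta
```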

SLIDE 22

Other Optimization Methods

  • Newton's Method
  • Quasi-Newton Method (BFGS)
  • Limited-Memory BFGS (L-BFGS)
  • Conjugate Gradient
  • GIS (Generalized Iterative Scaling)
  • IIS (Improved Iterative Scaling)
  • …

SLIDE 23

Practice: Softmax Regression

  • Given the following training data:
  • Implement logistic regression with 1) GD; 2) SGD.
  • Implement softmax regression with 1) GD; 2) SGD.
  • Compare logistic regression and softmax regression.

http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=DeepLearning&doc=exercises/ex4/ex4.html

SLIDE 24


Questions?
