SLIDE 1

Lecture 3 Logistic Regression & Softmax Regression

Rui Xia, Text Mining Group, Nanjing University of Science & Technology (rxia@njust.edu.cn)

SLIDE 2

Supervised Learning

  • Classification
  • Regression


SLIDE 3

Logistic Regression


SLIDE 4

Introduction

  • Logistic regression is a classification model, despite being called "regression";
  • Logistic regression is a binary classification model;
  • Logistic regression is a linear classification model: its decision boundary is a linear hyperplane, but it uses a nonlinear activation function (the sigmoid) to model the posterior probability.

SLIDE 5

Model Hypothesis

  • Sigmoid Function

$$\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad \frac{d\sigma(z)}{dz} = \sigma(z)\,\bigl(1 - \sigma(z)\bigr)$$

  • Hypothesis

$$p(y = 1 \mid x; \theta) = h_\theta(x) = \sigma(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$$

$$p(y = 0 \mid x; \theta) = 1 - h_\theta(x)$$

  • Hypothesis (Compact Form)

$$p(y \mid x; \theta) = \bigl(h_\theta(x)\bigr)^{y}\bigl(1 - h_\theta(x)\bigr)^{1-y} = \Bigl(\frac{1}{1 + e^{-\theta^T x}}\Bigr)^{y}\Bigl(1 - \frac{1}{1 + e^{-\theta^T x}}\Bigr)^{1-y}$$
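To make this concrete, here is a minimal NumPy sketch of the sigmoid and the hypothesis (the function names are our own, not from the slides):

```python
import numpy as np

def sigmoid(z):
    """The sigmoid function: sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, x):
    """h_theta(x) = sigma(theta^T x), the posterior p(y = 1 | x; theta)."""
    return sigmoid(theta @ x)

# Check the derivative identity sigma'(z) = sigma(z) * (1 - sigma(z)) numerically:
z, eps = 0.5, 1e-6
print(sigmoid(z) * (1 - sigmoid(z)))                       # analytic form
print((sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps))   # finite difference
```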

SLIDE 6

Learning Algorithm

  • (Conditional) Likelihood Function

$$L(\theta) = \prod_{i=1}^{N} p\bigl(y^{(i)} \mid x^{(i)}; \theta\bigr) = \prod_{i=1}^{N} \bigl(h_\theta(x^{(i)})\bigr)^{y^{(i)}}\bigl(1 - h_\theta(x^{(i)})\bigr)^{1 - y^{(i)}}$$

$$= \prod_{i=1}^{N} \Bigl(\frac{1}{1 + e^{-\theta^T x^{(i)}}}\Bigr)^{y^{(i)}}\Bigl(1 - \frac{1}{1 + e^{-\theta^T x^{(i)}}}\Bigr)^{1 - y^{(i)}}$$

  • Maximum Likelihood Estimation

$$\max_\theta L(\theta) \;\Leftrightarrow\; \max_\theta \sum_{i=1}^{N}\Bigl[ y^{(i)} \log h_\theta(x^{(i)}) + \bigl(1 - y^{(i)}\bigr)\log\bigl(1 - h_\theta(x^{(i)})\bigr)\Bigr]$$

The negative log-likelihood is also known as the Cross-Entropy cost function.
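As a sketch, the cross-entropy cost (the negative log-likelihood above) can be computed vectorially; `X` is an assumed (N, d) design matrix and `y` a vector of 0/1 labels:

```python
import numpy as np

def cross_entropy(theta, X, y):
    """Negative log-likelihood of logistic regression.

    X: (N, d) design matrix; y: (N,) labels in {0, 1}.
    """
    h = 1.0 / (1.0 + np.exp(-X @ theta))  # h_theta(x^(i)) for all samples at once
    return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))
```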

SLIDE 7

Unconstrained Optimization

  • Unconstrained Optimization Problem

$$\max_\theta \sum_{i=1}^{N}\Bigl[ y^{(i)} \log h_\theta(x^{(i)}) + \bigl(1 - y^{(i)}\bigr)\log\bigl(1 - h_\theta(x^{(i)})\bigr)\Bigr]$$

  • Optimization Methods

– Gradient Descent
– Stochastic Gradient Descent
– Newton's Method
– Quasi-Newton Methods
– Conjugate Gradient
– …

SLIDE 8

Gradient Descent/Ascent

  • Gradient Computation

$$\frac{\partial \ell(\theta)}{\partial \theta} = \sum_{i=1}^{N}\Bigl[ y^{(i)}\frac{1}{h_\theta(x^{(i)})} - \bigl(1 - y^{(i)}\bigr)\frac{1}{1 - h_\theta(x^{(i)})}\Bigr]\frac{\partial}{\partial \theta} h_\theta(x^{(i)})$$

$$= \sum_{i=1}^{N}\Bigl[ y^{(i)}\frac{1}{h_\theta(x^{(i)})} - \bigl(1 - y^{(i)}\bigr)\frac{1}{1 - h_\theta(x^{(i)})}\Bigr] h_\theta(x^{(i)})\bigl(1 - h_\theta(x^{(i)})\bigr)\frac{\partial}{\partial \theta}\theta^T x^{(i)}$$

$$= \sum_{i=1}^{N}\Bigl[ y^{(i)}\bigl(1 - h_\theta(x^{(i)})\bigr) - \bigl(1 - y^{(i)}\bigr) h_\theta(x^{(i)})\Bigr] x^{(i)} = \sum_{i=1}^{N}\bigl(y^{(i)} - h_\theta(x^{(i)})\bigr) x^{(i)}$$

The gradient has the form Error Γ— Feature.

  • Gradient Ascent Optimization

$$\theta := \theta + \alpha \sum_{i=1}^{N}\bigl(y^{(i)} - h_\theta(x^{(i)})\bigr) x^{(i)}$$
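A minimal batch gradient-ascent sketch of the update above (the learning rate `alpha` and iteration count are illustrative choices, not values from the slides):

```python
import numpy as np

def gradient_ascent(X, y, alpha=0.1, n_iters=1000):
    """Maximize the log-likelihood using the 'error x feature' gradient.

    Each step adds alpha * sum_i (y^(i) - h_theta(x^(i))) x^(i).
    """
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        h = 1.0 / (1.0 + np.exp(-X @ theta))  # predictions for all samples
        theta += alpha * X.T @ (y - h)        # error x feature, summed over i
    return theta
```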

SLIDE 9

Stochastic Gradient Descent

  • Randomly choose a training sample $(x, y)$
  • Compute the gradient: $(y - h_\theta(x))\, x$
  • Update the weights: $\theta := \theta + \alpha\,(y - h_\theta(x))\, x$
  • Repeat…

Gradient descent performs batch updating; stochastic gradient descent performs online updating.
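A corresponding stochastic sketch, assuming one pass over a random permutation of the samples per epoch (one common way to "randomly choose a training sample"):

```python
import numpy as np

def sgd(X, y, alpha=0.1, n_epochs=10, seed=0):
    """Online (per-sample) version of the update above."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for i in rng.permutation(len(y)):           # random sample order
            h = 1.0 / (1.0 + np.exp(-X[i] @ theta))
            theta += alpha * (y[i] - h) * X[i]      # single-sample error x feature
    return theta
```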

SLIDE 10

GD vs. SGD

[Figure: Gradient Descent (GD) vs. Stochastic Gradient Descent (SGD)]

SLIDE 11

Illustration of Newton’s Method

Tangent line of $g'$ at $\theta^{(0)}$:

$$y = g'(\theta^{(0)}) + g''(\theta^{(0)})\,(\theta - \theta^{(0)})$$

Setting this tangent line to zero (the curve is $y = g'(\theta)$) gives the iterates:

$$\theta^{(1)} = \theta^{(0)} - \frac{g'(\theta^{(0)})}{g''(\theta^{(0)})}, \qquad \theta^{(2)} = \theta^{(1)} - \frac{g'(\theta^{(1)})}{g''(\theta^{(1)})}, \qquad \theta^{(3)}, \theta^{(4)}, \cdots \to \theta^*$$

[Figure: Newton's method iterates $\theta^{(0)}, \theta^{(1)}, \theta^{(2)}$ on the curve $y = g'(\theta)$]

SLIDE 12

Newton’s Method

  • Problem

$$\arg\min_\theta g(\theta) \;\Leftrightarrow\; \text{solve: } \nabla g(\theta) = 0$$

  • Second-order Taylor expansion

$$q(\theta) = g(\theta^{(t)}) + \nabla g(\theta^{(t)})^T(\theta - \theta^{(t)}) + \frac{1}{2}(\theta - \theta^{(t)})^T \nabla^2 g(\theta^{(t)})\,(\theta - \theta^{(t)}) \approx g(\theta)$$

where $\nabla^2 g$ is the Hessian matrix.

  • Newton's method (also called the Newton–Raphson method)

$$\nabla q(\theta) = 0 \;\Rightarrow\; \theta = \theta^{(t)} - \bigl[\nabla^2 g(\theta^{(t)})\bigr]^{-1}\nabla g(\theta^{(t)})$$

$$\theta^{(t+1)} = \theta^{(t)} - \bigl[\nabla^2 g(\theta^{(t)})\bigr]^{-1}\nabla g(\theta^{(t)})$$
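A one-dimensional sketch of the Newton–Raphson iteration, assuming $g'$ and $g''$ are supplied as functions (the names are ours):

```python
def newton_minimize(g_prime, g_double_prime, theta0, n_iters=20):
    """Iterate theta <- theta - g'(theta) / g''(theta) to solve g'(theta) = 0."""
    theta = theta0
    for _ in range(n_iters):
        theta -= g_prime(theta) / g_double_prime(theta)
    return theta

# Example: g(theta) = (theta - 3)^2 has g' = 2(theta - 3) and g'' = 2;
# here Newton's method jumps to the minimizer theta* = 3 in a single step.
print(newton_minimize(lambda t: 2 * (t - 3), lambda t: 2.0, theta0=0.0))
```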

SLIDE 13

Gradient vs. Newton’s Method

SLIDE 14

Newton’s Method for Logistic Regression

  • Optimization Problem

$$\arg\min_\theta J(\theta) = \frac{1}{N}\sum_{i=1}^{N}\Bigl[-y^{(i)}\log h_\theta(x^{(i)}) - \bigl(1 - y^{(i)}\bigr)\log\bigl(1 - h_\theta(x^{(i)})\bigr)\Bigr]$$

  • Gradient and Hessian Matrix

$$\nabla J(\theta) = \frac{1}{N}\sum_{i=1}^{N}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)\, x^{(i)}$$

$$H = \frac{1}{N}\sum_{i=1}^{N} h_\theta(x^{(i)})\bigl(1 - h_\theta(x^{(i)})\bigr)\, x^{(i)} (x^{(i)})^T$$

  • Weight updating using Newton's method

$$\theta^{(t+1)} = \theta^{(t)} - H^{-1}\nabla J(\theta^{(t)})$$
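A sketch of these updates in NumPy (assuming the Hessian is nonsingular; in practice one may add a small ridge term):

```python
import numpy as np

def newton_logistic(X, y, n_iters=10):
    """Newton's method for logistic regression.

    X: (N, d) design matrix; y: (N,) labels in {0, 1}.
    """
    N, d = X.shape
    theta = np.zeros(d)
    for _ in range(n_iters):
        h = 1.0 / (1.0 + np.exp(-X @ theta))
        grad = X.T @ (h - y) / N               # nabla J(theta)
        H = (X.T * (h * (1 - h))) @ X / N      # sum_i h(1-h) x^(i) x^(i)^T / N
        theta -= np.linalg.solve(H, grad)      # theta <- theta - H^{-1} grad
    return theta
```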

SLIDE 15

Practice: Logistic Regression

  • Given the following training data:
  • Implement 1) GD; 2) SGD; 3) Newton's method for logistic regression, starting from the initial parameter ΞΈ = 0.
  • Determine how many iterations to use; compute the cost at each iteration and plot your results.

http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=DeepLearning&doc=exercises/ex4/ex4.html

SLIDE 16

Softmax Regression


SLIDE 17

Softmax Regression

  • Softmax regression is a multi-class classification model, also called multi-class logistic regression;
  • It is also known as the Maximum Entropy model (in NLP);
  • It is one of the most widely used classification algorithms.

SLIDE 18

Model Description

  • Model Hypothesis

$$p(y = k \mid x; \theta) = h_k(x) = \frac{e^{\theta_k^T x}}{1 + \sum_{k'=1}^{K-1} e^{\theta_{k'}^T x}}, \quad k = 1, \ldots, K-1$$

$$p(y = K \mid x; \theta) = h_K(x) = \frac{1}{1 + \sum_{k'=1}^{K-1} e^{\theta_{k'}^T x}}$$

  • Model Hypothesis (Compact Form)

$$p(y = k \mid x; \theta) = h_k(x) = \frac{e^{\theta_k^T x}}{\sum_{k'=1}^{K} e^{\theta_{k'}^T x}}, \quad k = 1, 2, \ldots, K, \text{ where } \theta_K = 0$$

  • Parameters: the matrix $\theta_{K \times d}$ (one weight vector per class)
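A sketch of the compact-form hypothesis (the max-subtraction is a standard numerical-stability trick, not something the slides specify):

```python
import numpy as np

def softmax_hypothesis(Theta, x):
    """Class posteriors h_k(x) = exp(theta_k^T x) / sum_k' exp(theta_k'^T x).

    Theta: (K, d) parameter matrix; x: (d,) input. Returns a length-K vector.
    """
    scores = Theta @ x
    scores = scores - scores.max()  # shift for stability; probabilities unchanged
    e = np.exp(scores)
    return e / e.sum()
```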

SLIDE 19

Maximum Likelihood Estimation

  • (Conditional) Log-likelihood

Softmax Regression:

$$\ell(\theta) = \sum_{i=1}^{N}\log p\bigl(y^{(i)} \mid x^{(i)}; \theta\bigr) = \sum_{i=1}^{N}\log \prod_{k=1}^{K}\Biggl(\frac{e^{\theta_k^T x^{(i)}}}{\sum_{k'=1}^{K} e^{\theta_{k'}^T x^{(i)}}}\Biggr)^{1\{y^{(i)} = k\}}$$

$$= \sum_{i=1}^{N}\sum_{k=1}^{K} 1\{y^{(i)} = k\}\,\log\frac{e^{\theta_k^T x^{(i)}}}{\sum_{k'=1}^{K} e^{\theta_{k'}^T x^{(i)}}} = \sum_{i=1}^{N}\sum_{k=1}^{K} 1\{y^{(i)} = k\}\,\log h_k(x^{(i)})$$

Logistic Regression (for comparison):

$$\ell(\theta) = \sum_{i=1}^{N}\Bigl[ y^{(i)}\log h_\theta(x^{(i)}) + \bigl(1 - y^{(i)}\bigr)\log\bigl(1 - h_\theta(x^{(i)})\bigr)\Bigr]$$

SLIDE 20
Gradient Descent Optimization

  • Gradient

$$\frac{\partial \log h_k(x)}{\partial \theta_l} = \begin{cases}\bigl(1 - h_l(x)\bigr)\, x, & k = l \\ -h_l(x)\, x, & k \neq l\end{cases}$$

$$\frac{\partial \sum_{k=1}^{K} 1\{y = k\}\log h_k(x)}{\partial \theta_l} = \begin{cases}\bigl(1 - h_l(x)\bigr)\, x, & y = l \\ -h_l(x)\, x, & y \neq l\end{cases} = \bigl(1\{y = l\} - h_l(x)\bigr)\, x$$

$$\frac{\partial \ell(\theta)}{\partial \theta_l} = \sum_{i=1}^{N}\bigl(1\{y^{(i)} = l\} - h_l(x^{(i)})\bigr)\, x^{(i)}$$

Again the gradient has the form Error Γ— Feature.

SLIDE 21

Gradient Descent Optimization

  • Gradient Descent

$$\theta_l := \theta_l + \alpha \sum_{i=1}^{N}\bigl(1\{y^{(i)} = l\} - h_l(x^{(i)})\bigr)\, x^{(i)}, \quad \text{where } h_l(x) = \frac{e^{\theta_l^T x}}{\sum_{l'=1}^{K} e^{\theta_{l'}^T x}}, \; l = 1, 2, \ldots, K$$

  • Stochastic Gradient Descent

$$\theta_l := \theta_l + \alpha\,\bigl(1\{y = l\} - h_l(x)\bigr)\, x$$
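A batch gradient-ascent sketch of these updates, assuming labels are coded 0, ..., K-1 (the slides use 1, ..., K):

```python
import numpy as np

def softmax_gd(X, y, K, alpha=0.1, n_iters=1000):
    """Batch update theta_l += alpha * sum_i (1{y^(i)=l} - h_l(x^(i))) x^(i).

    X: (N, d) design matrix; y: (N,) integer labels in {0, ..., K-1}.
    """
    N, d = X.shape
    Theta = np.zeros((K, d))
    Y = np.eye(K)[y]                              # one-hot: Y[i, k] = 1{y^(i) = k}
    for _ in range(n_iters):
        scores = X @ Theta.T                      # (N, K): theta_k^T x^(i)
        scores -= scores.max(axis=1, keepdims=True)
        e = np.exp(scores)
        H = e / e.sum(axis=1, keepdims=True)      # H[i, k] = h_k(x^(i))
        Theta += alpha * (Y - H).T @ X            # error x feature for every class
    return Theta
```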

SLIDE 22

Other Optimization Methods

  • Newton's Method
  • Quasi-Newton Method (BFGS)
  • Limited-Memory BFGS (L-BFGS)
  • Conjugate Gradient
  • GIS (Generalized Iterative Scaling)
  • IIS (Improved Iterative Scaling)
  • …

SLIDE 23

Practice: Softmax Regression

  • Given the following training data:
  • Implement logistic regression with 1) GD; 2) SGD.
  • Implement softmax regression with 1) GD; 2) SGD.
  • Compare logistic regression and softmax regression.

http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=DeepLearning&doc=exercises/ex4/ex4.html

SLIDE 24


Questions?
