Lecture 3
Logistic Regression & Softmax Regression
Rui Xia
Text Mining Group, Nanjing University of Science & Technology
rxia@njust.edu.cn
Supervised Learning
- Classification
- Regression
Logistic Regression
Introduction
- Logistic regression is a classification model, although it is called
  "regression".
- Logistic regression is a binary classification model.
- Logistic regression is a linear classification model: it has a linear
  decision boundary (a hyperplane), but uses a nonlinear activation function
  (the sigmoid function) to model the posterior probability.
Model Hypothesis

- Sigmoid Function

  $g(z) = \frac{1}{1 + e^{-z}}, \qquad \frac{dg(z)}{dz} = g(z)\big(1 - g(z)\big)$

- Hypothesis

  $p(y = 1 \mid x; \theta) = h_\theta(x) = g(\theta^{\mathrm{T}} x) = \frac{1}{1 + e^{-\theta^{\mathrm{T}} x}}$

  $p(y = 0 \mid x; \theta) = 1 - h_\theta(x)$

- Hypothesis (Compact Form)

  $p(y \mid x; \theta) = \big(h_\theta(x)\big)^{y} \big(1 - h_\theta(x)\big)^{1-y} = \left(\frac{1}{1 + e^{-\theta^{\mathrm{T}} x}}\right)^{y} \left(1 - \frac{1}{1 + e^{-\theta^{\mathrm{T}} x}}\right)^{1-y}$
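The hypothesis is straightforward to code. Below is a minimal NumPy sketch (the function names `sigmoid` and `h` are ours, not from the slides); it also checks the derivative identity $g'(z) = g(z)(1 - g(z))$ numerically.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, x):
    """Hypothesis h_theta(x) = g(theta^T x) = p(y = 1 | x; theta)."""
    return sigmoid(theta @ x)

# Check dg/dz = g(z)(1 - g(z)) with a central finite difference.
z, eps = 0.3, 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
assert abs(numeric - sigmoid(z) * (1 - sigmoid(z))) < 1e-8
```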
Learning Algorithm

- (Conditional) Likelihood Function

  $L(\theta) = \prod_{i=1}^{m} p\big(y^{(i)} \mid x^{(i)}; \theta\big) = \prod_{i=1}^{m} \big(h_\theta(x^{(i)})\big)^{y^{(i)}} \big(1 - h_\theta(x^{(i)})\big)^{1 - y^{(i)}} = \prod_{i=1}^{m} \left(\frac{1}{1 + e^{-\theta^{\mathrm{T}} x^{(i)}}}\right)^{y^{(i)}} \left(1 - \frac{1}{1 + e^{-\theta^{\mathrm{T}} x^{(i)}}}\right)^{1 - y^{(i)}}$

- Maximum Likelihood Estimation

  $\max_\theta L(\theta) \;\Leftrightarrow\; \max_\theta \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + \big(1 - y^{(i)}\big) \log\big(1 - h_\theta(x^{(i)})\big) \right]$

  The negative log-likelihood function is also known as the cross-entropy cost function.
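A possible NumPy implementation of the negative log-likelihood (the cross-entropy cost). The name `cross_entropy` is illustrative; in practice the probabilities should also be clipped away from 0 and 1 to avoid `log(0)`.

```python
import numpy as np

def cross_entropy(theta, X, y):
    """Negative (conditional) log-likelihood of logistic regression.

    X : (m, n) matrix of inputs, one row per sample.
    y : (m,) vector of 0/1 labels.
    """
    p = 1.0 / (1.0 + np.exp(-X @ theta))        # h_theta(x^(i)) for every i
    p = np.clip(p, 1e-12, 1 - 1e-12)            # numerical safety near 0 and 1
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
```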
Unconstrained Optimization

- Unconstrained Optimization Problem

  $\max_\theta \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + \big(1 - y^{(i)}\big) \log\big(1 - h_\theta(x^{(i)})\big) \right]$

- Optimization Methods
  - Gradient Descent
  - Stochastic Gradient Descent
  - Newton's Method
  - Quasi-Newton Methods
  - Conjugate Gradient
  - ...
Gradient Descent/Ascent

- Gradient Computation

  $\frac{\partial L(\theta)}{\partial \theta} = \sum_{i=1}^{m} \left[ y^{(i)} \frac{1}{h_\theta(x^{(i)})} - \big(1 - y^{(i)}\big) \frac{1}{1 - h_\theta(x^{(i)})} \right] \frac{\partial}{\partial \theta} h_\theta(x^{(i)})$

  $\qquad = \sum_{i=1}^{m} \left[ y^{(i)} \frac{1}{h_\theta(x^{(i)})} - \big(1 - y^{(i)}\big) \frac{1}{1 - h_\theta(x^{(i)})} \right] h_\theta(x^{(i)}) \big(1 - h_\theta(x^{(i)})\big) \frac{\partial}{\partial \theta} \theta^{\mathrm{T}} x^{(i)}$

  $\qquad = \sum_{i=1}^{m} \left[ y^{(i)} \big(1 - h_\theta(x^{(i)})\big) - \big(1 - y^{(i)}\big) h_\theta(x^{(i)}) \right] x^{(i)}$

  $\qquad = \sum_{i=1}^{m} \big(y^{(i)} - h_\theta(x^{(i)})\big)\, x^{(i)}$

  The gradient takes the form "Error × Feature".

- Gradient Ascent Optimization

  $\theta \leftarrow \theta + \alpha \sum_{i=1}^{m} \big(y^{(i)} - h_\theta(x^{(i)})\big)\, x^{(i)}$
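One way to code the batch update, assuming NumPy: `alpha` is the learning rate, and the vectorized line `X.T @ (y - h)` computes the "Error × Feature" sum over all samples at once.

```python
import numpy as np

def gradient_ascent(X, y, alpha=0.01, n_iters=100):
    """Maximize the log-likelihood with batch gradient ascent.

    X : (m, n) inputs; y : (m,) 0/1 labels.
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        h = 1.0 / (1.0 + np.exp(-X @ theta))   # h_theta(x^(i)) for all samples
        theta += alpha * X.T @ (y - h)         # theta <- theta + alpha * sum_i (y_i - h_i) x_i
    return theta
```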
Stochastic Gradient Descent

- Randomly choose a training sample $(x, y)$
- Compute the gradient: $\big(y - h_\theta(x)\big)\, x$
- Update the weights: $\theta \leftarrow \theta + \alpha \big(y - h_\theta(x)\big)\, x$
- Repeat...

Gradient descent performs batch updating; stochastic gradient descent performs online updating. A sketch of the online variant follows.
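A corresponding stochastic sketch, one sample per update. The epoch structure and the random permutation are our choices for the "randomly choose a training sample" step, not prescribed by the slides.

```python
import numpy as np

def sgd(X, y, alpha=0.01, n_epochs=10, seed=0):
    """Stochastic (online) updating: one randomly chosen sample per step."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_epochs):
        for i in rng.permutation(m):                   # visit samples in random order
            h_i = 1.0 / (1.0 + np.exp(-X[i] @ theta))
            theta += alpha * (y[i] - h_i) * X[i]       # theta <- theta + alpha (y - h) x
    return theta
```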
GD vs. SGD

[Figure: Gradient Descent (GD) vs. Stochastic Gradient Descent (SGD)]
Illustration of Newton's Method

- Tangent line of $f'(\theta)$ at $\theta^{(0)}$:

  $y = f'(\theta^{(0)}) + f''(\theta^{(0)})\big(\theta - \theta^{(0)}\big)$

- Setting the tangent line $y = f'(\theta)$ to zero and iterating:

  $\theta^{(1)} = \theta^{(0)} - \frac{f'(\theta^{(0)})}{f''(\theta^{(0)})}, \qquad \theta^{(2)} = \theta^{(1)} - \frac{f'(\theta^{(1)})}{f''(\theta^{(1)})}, \qquad \theta^{(3)}, \theta^{(4)}, \cdots \to \theta^{*}$

[Figure: the curve $y = f'(\theta)$ with the iterates $\theta^{(0)}, \theta^{(1)}, \theta^{(2)}, \dots$ converging to $\theta^{*}$]
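A tiny sketch of the one-dimensional iteration. The quadratic example (function and starting point are made up for illustration) shows Newton's method landing on the minimizer in a single step, since the tangent model of $f'$ is exact for a quadratic.

```python
def newton_1d(f_prime, f_second, theta0, n_iters=10):
    """1-D Newton's method: theta <- theta - f'(theta) / f''(theta)."""
    theta = theta0
    for _ in range(n_iters):
        theta = theta - f_prime(theta) / f_second(theta)
    return theta

# Example: minimize f(theta) = theta^2 - 2*theta, so f' = 2*theta - 2, f'' = 2.
# Newton's method reaches the minimizer theta* = 1 in one step from theta0 = 5.
print(newton_1d(lambda t: 2 * t - 2, lambda t: 2.0, theta0=5.0))  # -> 1.0
```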
Newton's Method

- Problem

  $\arg\min_\theta f(\theta) \;\Leftrightarrow\; \text{solve } \nabla f(\theta) = 0$

- Second-order Taylor expansion

  $f(\theta) \approx f(\theta^{(t)}) + \nabla f(\theta^{(t)})^{\mathrm{T}} \big(\theta - \theta^{(t)}\big) + \frac{1}{2} \big(\theta - \theta^{(t)}\big)^{\mathrm{T}} \nabla^2 f(\theta^{(t)}) \big(\theta - \theta^{(t)}\big)$

  where $\nabla^2 f$ is the Hessian matrix.

- Newton's method (also called the Newton-Raphson method)

  Setting $\nabla f(\theta) = 0$ gives $\theta = \theta^{(t)} - \nabla^2 f(\theta^{(t)})^{-1} \nabla f(\theta^{(t)})$, i.e.,

  $\theta^{(t+1)} = \theta^{(t)} - \nabla^2 f(\theta^{(t)})^{-1} \nabla f(\theta^{(t)})$
Gradient vs. Newton's Method
Newton's Method for Logistic Regression

- Optimization Problem

  $\arg\min_\theta J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left[ -y^{(i)} \log h_\theta(x^{(i)}) - \big(1 - y^{(i)}\big) \log\big(1 - h_\theta(x^{(i)})\big) \right]$

- Gradient and Hessian Matrix

  $\nabla J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \big(h_\theta(x^{(i)}) - y^{(i)}\big)\, x^{(i)}$

  $H = \frac{1}{m} \sum_{i=1}^{m} h_\theta(x^{(i)}) \big(1 - h_\theta(x^{(i)})\big)\, x^{(i)} \big(x^{(i)}\big)^{\mathrm{T}}$

- Weight updating using Newton's method

  $\theta^{(t+1)} = \theta^{(t)} - H^{-1} \nabla J(\theta^{(t)})$
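Putting the pieces together, a possible NumPy implementation of the Newton update for logistic regression. Solving the linear system `H x = grad` replaces the explicit inverse $H^{-1}$, which is the standard numerical practice.

```python
import numpy as np

def newton_logistic(X, y, n_iters=10):
    """Newton's method for the averaged logistic-regression cost J(theta)."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        h = 1.0 / (1.0 + np.exp(-X @ theta))     # h_theta(x^(i)) for all samples
        grad = X.T @ (h - y) / m                 # gradient of J
        H = (X.T * (h * (1 - h))) @ X / m        # Hessian: (1/m) sum h(1-h) x x^T
        theta -= np.linalg.solve(H, grad)        # theta <- theta - H^{-1} grad
    return theta
```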
Practice: Logistic Regression

- Given the training data from the exercise linked below:
- Implement 1) GD, 2) SGD, and 3) Newton's method for logistic regression,
  starting from the initial parameter $\theta = 0$.
- Decide how many iterations to use, record the objective value at each
  iteration, and plot your results.

http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=DeepLearning&doc=exercises/ex4/ex4.html
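A possible harness for the exercise, assuming the data have already been loaded into an `(m, n)` design matrix `X` (with a leading column of ones) and a 0/1 label vector `y`; the step size and iteration count are placeholders to tune.

```python
import numpy as np
import matplotlib.pyplot as plt

def cost(theta, X, y):
    """Averaged cross-entropy cost J(theta)."""
    p = np.clip(1.0 / (1.0 + np.exp(-X @ theta)), 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def run_gd(X, y, alpha=0.001, n_iters=500):
    theta = np.zeros(X.shape[1])               # initial parameter theta = 0
    costs = []
    for _ in range(n_iters):
        h = 1.0 / (1.0 + np.exp(-X @ theta))
        theta += alpha * X.T @ (y - h)         # gradient-ascent step on the log-likelihood
        costs.append(cost(theta, X, y))
    return theta, costs

# X, y assumed loaded from the exercise data before this point.
theta, costs = run_gd(X, y)
plt.plot(costs)                                # objective value per iteration
plt.xlabel("iteration")
plt.ylabel("J(theta)")
plt.show()
```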
Softmax Regression
Softmax Regression

- Softmax regression is a multi-class classification model, also called
  multi-class logistic regression;
- It is also known as the Maximum Entropy model (in NLP);
- It is one of the most widely used classification algorithms.
Model Description

- Model Hypothesis

  $p(y = k \mid x; \theta) = h_k(x) = \frac{e^{\theta_k^{\mathrm{T}} x}}{1 + \sum_{k'=1}^{K-1} e^{\theta_{k'}^{\mathrm{T}} x}}, \qquad k = 1, \dots, K - 1$

  $p(y = K \mid x; \theta) = h_K(x) = \frac{1}{1 + \sum_{k'=1}^{K-1} e^{\theta_{k'}^{\mathrm{T}} x}}$

- Model Hypothesis (Compact Form)

  $p(y = k \mid x; \theta) = h_k(x) = \frac{e^{\theta_k^{\mathrm{T}} x}}{\sum_{k'=1}^{K} e^{\theta_{k'}^{\mathrm{T}} x}}, \qquad k = 1, 2, \dots, K, \text{ where } \theta_K = 0$

- Parameters

  $\theta \in \mathbb{R}^{K \times n}$
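A minimal NumPy sketch of the compact-form hypothesis (the name `softmax_h` is ours). Subtracting the maximum score before exponentiating is a standard numerical-stability trick not shown on the slide; it leaves the probabilities unchanged.

```python
import numpy as np

def softmax_h(Theta, x):
    """Class probabilities p(y = k | x; Theta) for k = 1..K.

    Theta : (K, n) parameter matrix (the last row may be fixed to zero).
    x     : (n,) input vector.
    """
    scores = Theta @ x
    scores -= scores.max()        # stability: shift so the largest score is 0
    e = np.exp(scores)
    return e / e.sum()            # exp(theta_k^T x) / sum_k' exp(theta_k'^T x)
```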
Maximum Likelihood Estimation

- (Conditional) Log-likelihood

  Softmax Regression:

  $\ell(\theta) = \sum_{i=1}^{m} \log p\big(y^{(i)} \mid x^{(i)}; \theta\big) = \sum_{i=1}^{m} \log \prod_{k=1}^{K} \left( \frac{e^{\theta_k^{\mathrm{T}} x^{(i)}}}{\sum_{k'=1}^{K} e^{\theta_{k'}^{\mathrm{T}} x^{(i)}}} \right)^{1\{y^{(i)} = k\}}$

  $\quad = \sum_{i=1}^{m} \sum_{k=1}^{K} 1\{y^{(i)} = k\} \log \frac{e^{\theta_k^{\mathrm{T}} x^{(i)}}}{\sum_{k'=1}^{K} e^{\theta_{k'}^{\mathrm{T}} x^{(i)}}} = \sum_{i=1}^{m} \sum_{k=1}^{K} 1\{y^{(i)} = k\} \log h_k(x^{(i)})$

  Logistic Regression:

  $\ell(\theta) = \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + \big(1 - y^{(i)}\big) \log\big(1 - h_\theta(x^{(i)})\big) \right]$
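The softmax log-likelihood can be computed in a vectorized way. This sketch assumes labels are re-indexed to $\{0, \dots, K-1\}$ (rather than the slides' $1, \dots, K$) so they can be used directly as array indices.

```python
import numpy as np

def log_likelihood(Theta, X, y):
    """Conditional log-likelihood sum_i log p(y^(i) | x^(i); Theta).

    X : (m, n) inputs; y : (m,) integer labels in {0, ..., K-1}.
    """
    scores = X @ Theta.T                              # (m, K): theta_k^T x^(i)
    scores -= scores.max(axis=1, keepdims=True)       # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return log_probs[np.arange(len(y)), y].sum()      # pick log h_{y^(i)}(x^(i))
```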
Gradient Descent Optimization

- Gradient

  $\frac{\partial \log h_j(x)}{\partial \theta_k} = \begin{cases} \big(1 - h_k(x)\big)\, x, & j = k \\ -h_k(x)\, x, & j \neq k \end{cases}$

  $\frac{\partial}{\partial \theta_k} \sum_{j=1}^{K} 1\{y = j\} \log h_j(x) = \big(1\{y = k\} - h_k(x)\big)\, x$

  $\frac{\partial \ell(\theta)}{\partial \theta_k} = \sum_{i=1}^{m} \big(1\{y^{(i)} = k\} - h_k(x^{(i)})\big)\, x^{(i)}$

  Again, the gradient takes the form "Error × Feature".
Gradient Descent Optimization

- Gradient Descent

  $\theta_k := \theta_k + \alpha \sum_{i=1}^{m} \big(1\{y^{(i)} = k\} - h_k(x^{(i)})\big)\, x^{(i)}$

  where $h_k(x) = \frac{e^{\theta_k^{\mathrm{T}} x}}{\sum_{k'=1}^{K} e^{\theta_{k'}^{\mathrm{T}} x}}, \quad k = 1, 2, \dots, K$

- Stochastic Gradient Descent

  $\theta_k := \theta_k + \alpha \big(1\{y = k\} - h_k(x)\big)\, x$
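A sketch of the batch update for softmax regression, updating all $K$ parameter vectors at once. The one-hot matrix `Y` encodes the indicators $1\{y^{(i)} = k\}$, and labels are again assumed re-indexed to $\{0, \dots, K-1\}$; `(Y - H).T @ X` is the "Error × Feature" sum for every class simultaneously.

```python
import numpy as np

def softmax_probs(Theta, X):
    """Row-wise class probabilities: an (m, K) matrix of h_k(x^(i))."""
    scores = X @ Theta.T
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)

def softmax_gd(X, y, K, alpha=0.01, n_iters=200):
    """Batch ascent: theta_k += alpha * sum_i (1{y_i = k} - h_k(x_i)) x_i."""
    m, n = X.shape
    Theta = np.zeros((K, n))
    Y = np.eye(K)[y]                              # one-hot rows encode 1{y^(i) = k}
    for _ in range(n_iters):
        H = softmax_probs(Theta, X)               # (m, K) predictions
        Theta += alpha * (Y - H).T @ X            # "Error x Feature" for all classes
    return Theta
```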
Other Optimization Methods

- Newton's Method
- Quasi-Newton Method (BFGS)
- Limited-memory BFGS (L-BFGS)
- Conjugate Gradient
- GIS (Generalized Iterative Scaling)
- IIS (Improved Iterative Scaling)
- ...
Practice: Softmax Regression

- Given the training data from the exercise linked below:
- Implement logistic regression with 1) GD and 2) SGD.
- Implement softmax regression with 1) GD and 2) SGD.
- Compare logistic regression and softmax regression.

http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=DeepLearning&doc=exercises/ex4/ex4.html
Questions?