Logistic Regression & Neural Networks
CMSC 723 / LING 723 / INST 725 Marine Carpuat
Slides credit: Graham Neubig, Jacob Eisenstein
Illustrations: Graham Neubig

Logistic Regression

Perceptron & Probabilities: what if we want a probability p(y|x)?
Goal: learn a model that predicts a label y given an input example x.
Given an introductory sentence in Wikipedia predict whether the article is about a person
Which classifier will generalize better? Classifier 2 probably will: it does not include irrelevant information. A smaller model (fewer parameters relative to the number of samples) is better.
Binary prediction with the perceptron:

y = sign(w · φ(x)) = sign( Σ_{i=1..I} w_i φ_i(x) )
For example, for a sentence like “A site , located in Maizuru , Kyoto”:
φ(“A”) = 1, φ(“site”) = 1, φ(“,”) = 2, φ(“located”) = 1, φ(“in”) = 1, φ(“Maizuru”) = 1, φ(“Kyoto”) = 1, φ(“priest”) = 0, φ(“black”) = 0
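As a sketch, this bag-of-words prediction can be written in a few lines of Python; the weight values below are made up for illustration, not learned:

```python
from collections import defaultdict

def extract_features(sentence):
    """Bag-of-words features: phi[word] = count of word in the sentence."""
    phi = defaultdict(float)
    for word in sentence.split():
        phi[word] += 1
    return phi

def predict(w, phi):
    """Perceptron prediction: y = sign(w . phi(x))."""
    score = sum(w.get(name, 0.0) * value for name, value in phi.items())
    return 1 if score >= 0 else -1

# Hypothetical weights: "priest" suggests a person, "site" a place.
w = {"site": -2.0, "located": -1.0, "priest": 3.0}
phi = extract_features("A site , located in Maizuru , Kyoto")
print(predict(w, phi))  # -1: not about a person
```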
[Figure: some arrangements of positive (O) and negative (X) examples can be separated by a linear classifier, but an XOR-like arrangement (X O / O X) cannot; it requires a non-linear classifier.]
The XOR-like problem, four input points:
φ0(x1) = {-1, 1}, φ0(x2) = {1, 1}, φ0(x3) = {-1, -1}, φ0(x4) = {1, -1}
A hidden layer of two perceptron-like units, each with its own weights and a bias (fed by the constant input 1):

φ1[0] = sign(w0,0 · φ0(x) + b0,0)
φ1[1] = sign(w0,1 · φ0(x) + b0,1)
The hidden layer maps the points into a new feature space:
φ1(x1) = {-1, -1}, φ1(x2) = {1, -1}, φ1(x3) = {-1, 1}, φ1(x4) = {-1, -1}

In this space the classes are linearly separable, so a single output unit can classify them:
φ2[0] = sign(w1,0 · φ1(x) + b1,0) = y
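The two-layer construction can be checked concretely. The weights below are one hypothetical solution (a sign-unit AND detector and a NOR detector in the hidden layer), not necessarily the values on the slides:

```python
import numpy as np

def sign(v):
    # sign activation, with sign(0) treated as +1
    return np.where(v >= 0, 1, -1)

# Hypothetical weights: hidden unit 0 fires for (1, 1), hidden unit 1 for (-1, -1).
W0 = np.array([[1, 1], [-1, -1]]); b0 = np.array([-1, -1])
w1 = np.array([-1, -1]);           b1 = -1

def two_layer(phi0):
    phi1 = sign(W0 @ phi0 + b0)   # hidden representation phi1(x)
    return sign(w1 @ phi1 + b1)   # output phi2[0] = y

for x in ([-1, 1], [1, 1], [-1, -1], [1, -1]):
    print(x, two_layer(np.array(x)))
# XOR-like: (-1,1) and (1,-1) get +1; (1,1) and (-1,-1) get -1
```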
Replacing sign with tanh makes each unit differentiable, so the network can be trained by gradient-based methods:

φ1[0] = tanh(w0,0 · φ0(x) + b0,0)
φ1[1] = tanh(w0,1 · φ0(x) + b0,1)
φ2[0] = tanh(w1,0 · φ1(x) + b1,0)
Multi-class logistic regression: compare the score of the current class against the sum over all classes (softmax):

P(y | x) = e^{w_y · φ(x)} / Σ_{y'} e^{w_{y'} · φ(x)}
Online training algorithm for probabilistic models:

w = 0
for I iterations:
    for each labeled pair (x, y) in the data:
        w += α * dP(y|x)/dw

In other words, always update in the direction that will increase the probability of the correct label y.
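A minimal runnable sketch of this update for binary logistic regression, assuming labels in {-1, +1} and a made-up learning rate α:

```python
import numpy as np

def p_y1(w, phi):
    """P(y=1|x), the logistic sigmoid of w . phi."""
    return 1.0 / (1.0 + np.exp(-(w @ phi)))

def grad_p(w, phi, y):
    """dP(y|x)/dw for y in {-1, +1}; note p(1-p) = e^{w.phi}/(1+e^{w.phi})^2."""
    p = p_y1(w, phi)
    return y * phi * p * (1.0 - p)

# Toy data; alpha (the learning rate) is a made-up value.
data = [(np.array([1.0, 1.0]), 1), (np.array([1.0, -1.0]), -1)]
w, alpha = np.zeros(2), 0.5
for _ in range(100):            # "for I iterations"
    for phi, y in data:         # "for each labeled pair x, y in the data"
        w += alpha * grad_p(w, phi, y)

print(p_y1(w, np.array([1.0, 1.0])))  # high probability for the positive example
```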
Take the derivative of the probability
dP(y=1 | x)/dw = d/dw [ e^{w·φ(x)} / (1 + e^{w·φ(x)}) ] = φ(x) e^{w·φ(x)} / (1 + e^{w·φ(x)})^2

dP(y=-1 | x)/dw = d/dw [ 1 − e^{w·φ(x)} / (1 + e^{w·φ(x)}) ] = −φ(x) e^{w·φ(x)} / (1 + e^{w·φ(x)})^2
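These derivatives are easy to sanity-check numerically with central finite differences (the values of w and φ(x) below are arbitrary toy values):

```python
import numpy as np

def p_y1(w, phi):
    """P(y=1|x) = e^{w.phi} / (1 + e^{w.phi})."""
    return np.exp(w @ phi) / (1.0 + np.exp(w @ phi))

def grad_p_y1(w, phi):
    """dP(y=1|x)/dw = phi * e^{w.phi} / (1 + e^{w.phi})^2."""
    e = np.exp(w @ phi)
    return phi * e / (1.0 + e) ** 2

w, phi, eps = np.array([0.5, -0.3]), np.array([1.0, 2.0]), 1e-6
numeric = np.array([
    (p_y1(w + eps * np.eye(2)[i], phi) - p_y1(w - eps * np.eye(2)[i], phi)) / (2 * eps)
    for i in range(2)
])
print(np.allclose(numeric, grad_p_y1(w, phi)))  # True
```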
For neural networks, we only know the correct tag for the last layer. For the output weights w4, the gradient has the same form as logistic regression, with the hidden vector h(x) in place of φ(x):

dP(y=1 | x)/dw4 = h(x) e^{w4·h(x)} / (1 + e^{w4·h(x)})^2

But what about the hidden weights?

dP(y=1 | x)/dw1 = ?   dP(y=1 | x)/dw2 = ?   dP(y=1 | x)/dw3 = ?
Calculate the derivative with the chain rule:

dP(y=1 | x)/dw1 = [dP(y=1 | x)/d(w4·h(x))] · [d(w4·h(x))/dh1(x)] · [dh1(x)/dw1]

where dP(y=1 | x)/d(w4·h(x)) = e^{w4·h(x)} / (1 + e^{w4·h(x)})^2 and d(w4·h(x))/dh1(x) = w4,1.
Each factor has an interpretation: the error of the next unit (δ4), times the connecting weight, times the gradient of this unit's own activation:

dP(y=1 | x)/dwj = δ4 · w4,j · dhj(x)/dwj

In general, calculate each unit's δi based on the δ of the units in later layers that it connects to.
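A tiny numerical sketch of this backward pass for a one-hidden-layer network with tanh hidden units and a sigmoid output (all weight values are made up), checked against a finite difference:

```python
import numpy as np

np.random.seed(0)
x = np.array([1.0, -1.0])
W_hidden = np.random.randn(2, 2)   # hypothetical hidden-layer weights (rows: w1, w2)
w_out = np.random.randn(2)         # hypothetical output weights (w4)

def forward(W):
    h = np.tanh(W @ x)                            # hidden activations h(x)
    return h, 1.0 / (1.0 + np.exp(-(w_out @ h)))  # P(y=1|x)

h, p = forward(W_hidden)

# Backward pass: delta of the output unit, pushed back through the weights
delta_out = p * (1.0 - p)                         # dP / d(w_out . h)
grad_w_out = delta_out * h                        # output-layer gradient
delta_hidden = delta_out * w_out * (1.0 - h**2)   # error * weight * tanh slope
grad_W_hidden = np.outer(delta_hidden, x)         # hidden-layer gradients

# Central finite-difference check on one hidden weight
eps = 1e-6
Wp, Wm = W_hidden.copy(), W_hidden.copy()
Wp[0, 0] += eps; Wm[0, 0] -= eps
numeric = (forward(Wp)[1] - forward(Wm)[1]) / (2 * eps)
print(abs(numeric - grad_W_hidden[0, 0]) < 1e-8)  # True
```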
Feed-forward networks: all connections point forward, from the input x toward the output y; the network is a directed acyclic graph (DAG).
For more details, see CIML Chap 7