Logistic Regression
- Dr. Besnik Fetahu
Supervised Classification
- $X = \{x^{(1)}, \ldots, x^{(n)}\}$: input instances
- $Y = \{T, F\}$: output labels (classes)
- $S = \{(x^{(i)}, y^{(i)})\}_{i=1}^{m}$: training IID examples (input-target samples)
- Learn a function $f$ that maps each $x^{(i)}$ to $y^{(i)}$
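As a small illustration of this setup (the sentiment task and the concrete instances below are assumptions, not taken from the slides), in Python:

```python
# Minimal sketch of the supervised setup: m pairs (x^(i), y^(i)).
# The sentiment task and the example texts are illustrative assumptions.
X = ["an awesome , touching film",        # input instances
     "boring and far too long",
     "loved every minute of it"]
Y = [True, False, True]                    # output labels, Y = {T, F}

S = list(zip(X, Y))                        # S = {(x^(i), y^(i))}_{i=1..m}

# Goal: learn a function f such that f(x^(i)) is close to y^(i).
```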
There are two different kinds of machine learning models that are used for classification:
- A generative model learns the joint distribution P(x, y): how is the data of each class generated? It models the likelihood $P(x \mid Y = y)$.
- A discriminative model learns only how to distinguish between the different classes: it models $P(Y = y \mid x)$ directly.
Generative: will try to model what horses look like! Discriminative: will try to map horse instances to the correct class!
Given an input x (e.g. a document), predict the class y (e.g. the topic):
$$\begin{aligned}
y_{\max} &= \arg\max_{y \in Y} P(Y = y \mid x) \\
&= \arg\max_{y \in Y} \frac{P(x \mid Y = y)\,P(Y = y)}{P(x)} \\
&= \arg\max_{y \in Y} \underbrace{P(x \mid Y = y)}_{\text{likelihood}}\,\underbrace{P(Y = y)}_{\text{prior}} \\
&= \arg\max_{y \in Y} P(x_1 \ldots x_k \mid Y = y)\,P(Y = y)
\end{aligned}$$
With the feature independence assumption:
$$\arg\max_{y \in Y} P(x_1 \ldots x_k \mid Y = y)\,P(Y = y) = \arg\max_{y \in Y} P(y) \prod_{i=1}^{k} P(x_i \mid y)$$
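A minimal sketch of this decision rule (the toy class priors and per-word likelihoods are assumptions for illustration):

```python
import math

# Toy estimates of P(y) and P(x_i | y); in practice these come from training counts.
prior = {"pos": 0.5, "neg": 0.5}
likelihood = {
    "pos": {"awesome": 0.20, "boring": 0.01},
    "neg": {"awesome": 0.02, "boring": 0.30},
}

def predict(features):
    """argmax_y P(y) * prod_i P(x_i | y), computed in log space for stability."""
    scores = {
        y: math.log(prior[y]) + sum(math.log(likelihood[y][f]) for f in features)
        for y in prior
    }
    return max(scores, key=scores.get)

print(predict(["awesome"]))           # -> "pos"
print(predict(["boring", "boring"]))  # -> "neg"
```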
A generative model spends its effort on modeling the likelihood $P(x \mid y)$ (i.e. what are the characteristics of instances belonging to some class y). It thus solves problems that are not directly related to $P(y \mid x)$, the question we actually care about: what class does x belong to? Estimating the likelihood term for x exactly is also expensive: the number of parameters grows with the number of classes and the size of the input space, on the order of $O(|X|^n |Y|)$.
A discriminative model learns directly what is needed to predict the right label! All of its capacity goes into predicting the right class. It focuses on the features that have a high ability to discriminate between the different classes, and it directly models the quantity we need, $P(y \mid x)$.
How do we build such a model in the binary case? Logistic Regression: we learn feature weights w and a bias factor b based on some training data for the classification task.
Each input instance is a vector of k features, $x^{(i)} = [x^{(i)}_1, \ldots, x^{(i)}_k]$: features for our input space (e.g. “awesome” is important in determining positive sentiment).
The classifier first computes a weighted sum of the features plus the bias:
$$z = \left( \sum_{i=1}^{k} w_i x_i \right) + b$$
z is then passed through the sigmoid function (aka logistic function).
How do we turn z into a probability with the sigmoid function?
$$P(y = 1) = \sigma(w \cdot x + b) = \frac{1}{1 + e^{-(w \cdot x + b)}}$$
$$P(y = 0) = 1 - \sigma(w \cdot x + b) = 1 - \frac{1}{1 + e^{-(w \cdot x + b)}} = \frac{e^{-(w \cdot x + b)}}{1 + e^{-(w \cdot x + b)}}$$
Decision boundary:
$$\hat{y}_i = \begin{cases} 1 & \text{if } P(y = 1 \mid x_i) > 0.5 \\ 0 & \text{otherwise} \end{cases}$$
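A small sketch of these formulas in Python (the weight and feature values are placeholders):

```python
import math

def sigmoid(z):
    # σ(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + math.exp(-z))

def prob_positive(w, x, b):
    # P(y = 1 | x) = σ(w · x + b); P(y = 0 | x) = 1 - P(y = 1 | x)
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)

def classify(w, x, b):
    # Decision boundary at probability 0.5
    return 1 if prob_positive(w, x, b) > 0.5 else 0

print(classify([1.0, -2.0], [0.5, 0.1], 0.0))  # placeholder values -> 1
```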
Example: suppose the learned weights and bias are
$$w = [2.5, -5.0, -1.2, 0.5, 2.0, 0.7], \qquad b = 0.1$$
and the feature vector of the input is $x = [3, 2, 1, 3, 0, 4.15]$. Then
$$P(Y = 1 \mid x) = \sigma(w \cdot x + b) = \sigma([2.5, -5.0, -1.2, 0.5, 2.0, 0.7] \cdot [3, 2, 1, 3, 0, 4.15] + 0.1) = \sigma(0.805) \approx 0.69$$
$$P(Y = 0 \mid x) = 1 - \sigma(w \cdot x + b) \approx 0.31$$
The features capture properties of the input that signal its class (e.g. a document with positive sentiment will contain more words that have a prior positive sentiment).
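The arithmetic above can be checked directly with a short script:

```python
import math

w = [2.5, -5.0, -1.2, 0.5, 2.0, 0.7]
x = [3, 2, 1, 3, 0, 4.15]
b = 0.1

z = sum(wi * xi for wi, xi in zip(w, x)) + b   # 0.805
p_pos = 1.0 / (1.0 + math.exp(-z))             # σ(0.805) ≈ 0.69
p_neg = 1.0 - p_pos                            # ≈ 0.31
print(round(z, 3), round(p_pos, 2), round(p_neg, 2))
```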
To learn w and b we need a loss function. Many candidate objectives are very hard to optimize for probabilistic output; instead we use the negative log likelihood loss (the negative log likelihood loss is also called the cross-entropy loss).
$L(\hat{y}, y)$ = how much does our prediction $\hat{y}$ differ from the true label $y$?
Since $y \in \{0, 1\}$, we can express $p(y \mid x)$ in terms of the Bernoulli distribution:
$$p(y \mid x) = \hat{y}^{\,y} \, (1 - \hat{y})^{1 - y}$$
Taking the log:
$$\log p(y \mid x) = y \log \hat{y} + (1 - y) \log(1 - \hat{y})$$
This is the log likelihood that should be maximized: choosing w, b to maximize it makes the predicted probabilities of our labels close to the true labels. Loss functions are conventionally something we minimize, thus we flip the sign of the log likelihood:
$$L_{CE}(\hat{y}, y) = -\log p(y \mid x) = -\left[ y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \right]$$
where $\hat{y} = \sigma(w \cdot x + b)$. Plugging in the LR model:
$$L_{CE}(w, b) = -\left[ y \log \sigma(w \cdot x + b) + (1 - y) \log\left(1 - \sigma(w \cdot x + b)\right) \right]$$
The loss is small when the model assigns a probability close to 1 to the correct class (y = 1 or y = 0). The higher that probability, the better the classifier; and vice versa, the closer it is to zero, the worse the classifier. The loss approaches zero when we get everything right, whereas it goes to infinity for the cases where we get everything wrong (log 0). Since the predicted probabilities of the two classes sum to one, by maximizing the probability of the correct label we do this at the expense of the wrong label.
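A direct sketch of this loss for a single example:

```python
import math

def cross_entropy_loss(y_hat, y):
    """L_CE(ŷ, y) = -[y log ŷ + (1 - y) log(1 - ŷ)]."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# A confident, correct prediction gives a loss near 0;
# a confident, wrong prediction makes the loss blow up.
print(cross_entropy_loss(0.99, 1))  # ≈ 0.01
print(cross_entropy_loss(0.01, 1))  # ≈ 4.61
```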
The cost over the m training examples is the average cross-entropy loss:
$$\text{Cost}(w, b) = \frac{1}{m} \sum_{i=1}^{m} L_{CE}(\hat{y}^{(i)}, y^{(i)}) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log \sigma(w \cdot x^{(i)} + b) + (1 - y^{(i)}) \log\left(1 - \sigma(w \cdot x^{(i)} + b)\right) \right]$$
Gradient descent: to minimize the cost, we determine in which direction in the parameter space θ the function's slope is rising most steeply, and move in the opposite direction.
Gradient descent computes the slope of the loss function at a given point and then moves in the opposite direction s.t. the loss function is minimized. The size of each step of gradient descent is determined by the value of the slope (or derivative), weighted by some learning rate η.
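A generic sketch of this update rule, θ ← θ − η · dL/dθ (the one-dimensional toy loss is an assumption for illustration):

```python
def gradient_descent(grad, theta, eta=0.1, steps=100):
    """Repeatedly step against the gradient, scaled by the learning rate eta."""
    for _ in range(steps):
        theta = theta - eta * grad(theta)
    return theta

# Toy example: minimize L(θ) = (θ - 3)^2, whose derivative is 2(θ - 3).
print(gradient_descent(lambda t: 2 * (t - 3), theta=0.0))  # ≈ 3.0
```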
Gradient descent with small (top) and large (bottom) learning rates. Source: Andrew Ng’s Machine Learning course on Coursera
However, as we approach a minimum the gradient becomes smaller and smaller with each time step, so there is no need to adaptively change the learning rate: the step in the negative direction shrinks because the slope is less steep too. In practice the model has many parameters whose optimal values GD needs to find, thus we operate in an N-dimensional parameter space rather than in a single dimension.
For each parameter we ask: how much does a small change in this parameter contribute to the total loss L?
$$\nabla_\theta L(f(x; \theta), y) = \begin{bmatrix} \dfrac{\partial}{\partial w_1} L(f(x; \theta), y) \\[4pt] \dfrac{\partial}{\partial w_2} L(f(x; \theta), y) \\ \vdots \\ \dfrac{\partial}{\partial w_n} L(f(x; \theta), y) \end{bmatrix}$$
For logistic regression the cost is
$$\text{Cost}(w, b) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log \sigma(w \cdot x^{(i)} + b) + (1 - y^{(i)}) \log\left(1 - \sigma(w \cdot x^{(i)} + b)\right) \right]$$
and we need its partial derivative $\dfrac{\partial\,\text{Cost}(w, b)}{\partial w_j}$ with respect to each weight $w_j$ (summing over the $i = 1, \ldots, m$ examples).
Use the derivative of the sigmoid,
$$\frac{d\sigma(z)}{dz} = \sigma(z)(1 - \sigma(z)),$$
to derive the partial derivative of the cross-entropy loss function:
$$\frac{\partial\,\text{Cost}(w, b)}{\partial w_j} = \frac{1}{m} \sum_{i=1}^{m} \left( \sigma(w \cdot x^{(i)} + b) - y^{(i)} \right) x^{(i)}_j$$
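A sketch of this gradient computation for a batch of examples (plain Python, no NumPy assumed):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradients(w, b, X, Y):
    """∂Cost/∂w_j = (1/m) Σ_i (σ(w·x^(i) + b) - y^(i)) x^(i)_j, and likewise for b."""
    m, k = len(X), len(w)
    grad_w, grad_b = [0.0] * k, 0.0
    for x, y in zip(X, Y):
        err = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) - y
        for j in range(k):
            grad_w[j] += err * x[j] / m
        grad_b += err / m
    return grad_w, grad_b
```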
Worked example: a sentiment classifier that has only two features:
- $x_1$: count of positive lexicon words
- $x_2$: count of negative lexicon words
$w_1 = w_2 = b = 0$: weights are initialized to zero.
Consider a single training example with $x_1 = 3$, $x_2 = 2$, and true label $y = 1$. The gradient is
$$\nabla_{w,b} = \begin{bmatrix} \dfrac{\partial L_{CE}(w,b)}{\partial w_1} \\[4pt] \dfrac{\partial L_{CE}(w,b)}{\partial w_2} \\[4pt] \dfrac{\partial L_{CE}(w,b)}{\partial b} \end{bmatrix} = \begin{bmatrix} (\sigma(w \cdot x + b) - y)\,x_1 \\ (\sigma(w \cdot x + b) - y)\,x_2 \\ \sigma(w \cdot x + b) - y \end{bmatrix} = \begin{bmatrix} (\sigma(0) - 1)\,x_1 \\ (\sigma(0) - 1)\,x_2 \\ \sigma(0) - 1 \end{bmatrix} = \begin{bmatrix} -0.5\,x_1 \\ -0.5\,x_2 \\ -0.5 \end{bmatrix} = \begin{bmatrix} -1.5 \\ -1.0 \\ -0.5 \end{bmatrix}$$
With learning rate $\eta = 0.1$, the updated parameters after one step are
$$\theta^{(2)} = \begin{bmatrix} w_1 \\ w_2 \\ b \end{bmatrix} - \eta \begin{bmatrix} -1.5 \\ -1.0 \\ -0.5 \end{bmatrix} = \begin{bmatrix} 0.15 \\ 0.10 \\ 0.05 \end{bmatrix}$$
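The same single update, reproduced as a short script:

```python
import math

x, y = [3.0, 2.0], 1          # counts of positive / negative lexicon words, true label
w, b = [0.0, 0.0], 0.0        # weights initialized to zero
eta = 0.1                     # learning rate

z = sum(wi * xi for wi, xi in zip(w, x)) + b
err = 1.0 / (1.0 + math.exp(-z)) - y                     # σ(0) - 1 = -0.5

grad_w1, grad_w2, grad_b = err * x[0], err * x[1], err   # -1.5, -1.0, -0.5
w = [w[0] - eta * grad_w1, w[1] - eta * grad_w2]
b = b - eta * grad_b
print(w, b)                                              # [0.15, 0.1] 0.05
```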
If a feature happens to be perfectly predictive of the outcome in the training data, the weight of that feature will become very high. This overfitting leads to models that are not robust to noise and do not generalize well.
To avoid this we add a regularization term R(w) to the objective:
$$\hat{w} = \arg\max_{w} \sum_{i=1}^{m} \log P(y^{(i)} \mid x^{(i)}) - \alpha R(w)$$
The regularization term R(w) penalizes large weights: L1 regularization uses the absolute sum of the weight values, L2 regularization uses the sum of the squares of the weight values.
$$R(w) = \|w\|_1 = \sum_{i=1}^{k} |w_i|$$
$$R(w) = \|w\|_2^2 = \sum_{i=1}^{k} w_i^2$$
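A sketch of the two penalties (α is the regularization strength):

```python
def l1_penalty(w, alpha):
    # α · R(w) with R(w) = ||w||_1 = Σ |w_i|
    return alpha * sum(abs(wi) for wi in w)

def l2_penalty(w, alpha):
    # α · R(w) with R(w) = ||w||_2^2 = Σ w_i^2
    return alpha * sum(wi * wi for wi in w)

# Either penalty is subtracted from the log likelihood being maximized
# (equivalently, added to the loss being minimized).
print(l1_penalty([0.15, -0.10], alpha=0.01), l2_penalty([0.15, -0.10], alpha=0.01))
```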
Multinomial logistic regression: LR can be generalized to more than two classes by changing its classification function to the softmax function.
$$\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{k} e^{z_j}}$$
The probability of class c given the input x is then
$$P(Y = c \mid x) = \frac{e^{w_c \cdot x + b_c}}{\sum_{j=1}^{k} e^{w_j \cdot x + b_j}}$$
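A sketch of the softmax computation (subtracting the maximum score before exponentiating is a standard numerical-stability trick, not something shown on the slide):

```python
import math

def softmax(z):
    """softmax(z_i) = e^(z_i) / Σ_j e^(z_j)."""
    m = max(z)                          # subtract the max for numerical stability
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

print(softmax([1.0, 2.0, 3.0]))  # class probabilities summing to 1
```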
What is the loss function for multinomial LR?
The cross-entropy loss generalizes to a sum over the K classes, using the indicator $\mathbb{1}\{Y = k\}$ (1 for the true class, 0 otherwise):
$$L_{CE}(\hat{y}, y) = -\sum_{k=1}^{K} \mathbb{1}\{Y = k\} \log P(Y = k \mid x) = -\sum_{k=1}^{K} \mathbb{1}\{Y = k\} \log \frac{e^{w_k \cdot x + b_k}}{\sum_{j=1}^{K} e^{w_j \cdot x + b_j}}$$
Its gradient with respect to the weight vector of class k is
$$\frac{\partial L_{CE}}{\partial w_k} = -\left( \mathbb{1}\{Y = k\} - P(Y = k \mid x) \right) x = -\left( \mathbb{1}\{Y = k\} - \frac{e^{w_k \cdot x + b_k}}{\sum_{j=1}^{K} e^{w_j \cdot x + b_j}} \right) x$$
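A sketch of this per-class gradient for a single example (W holds one weight vector per class, B one bias per class; these names are assumptions for the sketch):

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

def multinomial_gradients(W, B, x, y):
    """∂L_CE/∂w_k = -(1{y = k} - P(Y = k | x)) · x for each class k."""
    scores = [sum(wi * xi for wi, xi in zip(w_k, x)) + b_k for w_k, b_k in zip(W, B)]
    probs = softmax(scores)
    grads_w = [[-((1 if y == k else 0) - probs[k]) * xi for xi in x]
               for k in range(len(W))]
    grads_b = [-((1 if y == k else 0) - probs[k]) for k in range(len(W))]
    return grads_w, grads_b

# Example with 3 classes and 2 features (placeholder values).
gw, gb = multinomial_gradients(W=[[0.1, 0.2], [0.0, -0.1], [0.3, 0.0]],
                               B=[0.0, 0.0, 0.0], x=[1.0, 2.0], y=0)
print(gw, gb)
```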