Machine Learning
Logistic Regression
Where are we? We have seen the following ideas:
- Linear models
- Learning as loss minimization
- Bayesian learning criteria (MAP and MLE estimation)
- The Naïve Bayes classifier

This lecture: logistic regression.
The sigmoid function: σ(z) = 1 / (1 + exp(−z)). What are the domain and the range of the sigmoid function? This is a reasonable choice; we will see why later.
[Figure: the sigmoid function σ(z) plotted against z]
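To make the domain-and-range question concrete, here is a minimal NumPy sketch (not from the slides) that evaluates the sigmoid on a grid of inputs: any real number is a valid input, and every output lies strictly between 0 and 1.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid sigma(z) = 1 / (1 + exp(-z)).

    Defined for every real z; its values lie strictly in (0, 1).
    """
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-10, 10, 9)      # any real number is a valid input
print(np.round(sigmoid(z), 4))   # values approach 0 for z << 0 and 1 for z >> 0
print(sigmoid(0.0))              # exactly 0.5 at z = 0
```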
Note that we are directly modeling P(y | x) rather than P(x | y) and P(y).
The usual trick: convert products to sums by taking the log. Recall that this works only because log is an increasing function, so the maximizer will not change.
Maximum likelihood estimation asks for the weight vector w that makes the training data D = {(xᵢ, yᵢ)} most probable:

  max_w P(D | w) = max_w ∏ᵢ P(yᵢ | xᵢ, w)

Equivalent to solving (after taking the log)

  max_w ∑ᵢ log P(yᵢ | xᵢ, w)

But (by definition) we know that

  P(y | x, w) = 1 / (1 + exp(−y wᵀx))

Equivalent to solving

  max_w ∑ᵢ −log(1 + exp(−yᵢ wᵀxᵢ))

The goal: maximum likelihood training of a discriminative probabilistic classifier under the logistic model for the posterior distribution.

Equivalent to: training a linear classifier by minimizing the logistic loss.
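As a quick sanity check of this equivalence, here is a small sketch (my own illustration; the variable names `w`, `X`, `y` and the tiny random dataset are made up) showing that the log-likelihood under the logistic model is exactly the negative of the summed logistic loss.

```python
import numpy as np

def log_likelihood(w, X, y):
    """Sum over examples of log P(y_i | x_i, w) under the logistic model
    P(y | x, w) = 1 / (1 + exp(-y * w^T x)), with labels y in {-1, +1}."""
    margins = y * (X @ w)
    return np.sum(np.log(1.0 / (1.0 + np.exp(-margins))))

def total_logistic_loss(w, X, y):
    """Sum of logistic losses log(1 + exp(-y_i * w^T x_i))."""
    margins = y * (X @ w)
    return np.sum(np.log(1.0 + np.exp(-margins)))

# Tiny made-up dataset: maximizing the likelihood is the same problem as
# minimizing the logistic loss, so the two quantities are negatives of each other.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = np.array([1, -1, 1, 1, -1])
w = rng.normal(size=3)
print(log_likelihood(w, X, y), -total_logistic_loss(w, X, y))  # same value, up to rounding
```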
Maximum a posteriori (MAP) estimation: let us work through this procedure again to see what changes.

What is the goal of MAP estimation? (In maximum likelihood, we maximized the likelihood of the data.) Here, the goal is to maximize the posterior probability of the model given the data, i.e., to find the most probable model given the data:

  P(w | D) ∝ P(D | w) P(w)
Learning by solving

  max_w P(D | w) P(w)

Take the log to simplify:

  max_w [ log P(D | w) + log P(w) ]

We have already expanded out the first term:

  log P(D | w) = ∑ᵢ −log(1 + exp(−yᵢ wᵀxᵢ))

Expand the log prior. A common choice is a zero-mean Gaussian prior on the weights, P(w) ∝ exp(−wᵀw / (2σ²)), so

  log P(w) = −wᵀw / (2σ²) + constant

Putting the two terms together, learning becomes

  max_w [ ∑ᵢ −log(1 + exp(−yᵢ wᵀxᵢ)) − wᵀw / (2σ²) ]

Maximizing a negative function is the same as minimizing the function.
  min_w [ ∑ᵢ log(1 + exp(−yᵢ wᵀxᵢ)) + wᵀw / (2σ²) ]

Where have we seen this before? This is the logistic loss with an L2 regularizer. The first question in the homework: write down the stochastic gradient descent algorithm for this objective. Historically, other training algorithms exist; in particular, you might run into LBFGS.
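This is not the homework solution, just one possible sketch of SGD for the objective above, assuming labels in {−1, +1} and writing the regularization strength as a generic hyperparameter `lam` (playing the role of 1/σ²); the learning rate, epoch count, and data below are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_logistic(X, y, lam=0.1, lr=0.1, epochs=50, seed=0):
    """SGD on  sum_i log(1 + exp(-y_i w^T x_i)) + (lam/2) * w^T w.

    One update per randomly ordered example; the regularizer's gradient is
    split evenly across the m examples, so each step sees (lam / m) * w.
    """
    rng = np.random.default_rng(seed)
    m, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(m):
            margin = y[i] * (X[i] @ w)
            # d/dw of log(1 + exp(-margin)) is  -sigmoid(-margin) * y_i * x_i
            grad = -sigmoid(-margin) * y[i] * X[i] + (lam / m) * w
            w -= lr * grad
    return w

# Tiny synthetic usage example (made-up data, not from the lecture)
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
w = sgd_logistic(X, y)
print(w, np.mean(np.sign(X @ w) == y))   # learned weights and training accuracy
```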
Learning as loss minimization, revisited: we would like to minimize the expected loss over the distribution D that generates the examples. But the distribution D is unknown, so we minimize the loss on the training data instead.

Minimizing training loss alone can lead to overfitting!

Adding a regularizer (using L2 regularization, the same term the MAP derivation produced) counters this.
[Figure: comparison of loss functions — zero-one, perceptron, hinge (SVM), exponential (AdaBoost), and logistic regression (logistic loss) — shown again zoomed out, and zoomed out even more]
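The plots themselves did not survive the extraction, but here is a small sketch (my own, assuming each loss is viewed as a function of the margin m = y·wᵀx, which is the usual way these curves are drawn) that evaluates the losses listed above so they can be compared or re-plotted.

```python
import numpy as np

# Each loss is written as a function of the margin m = y * w^T x.
losses = {
    "zero-one":               lambda m: (m <= 0).astype(float),
    "perceptron":             lambda m: np.maximum(0.0, -m),
    "hinge (SVM)":            lambda m: np.maximum(0.0, 1.0 - m),
    "exponential (AdaBoost)": lambda m: np.exp(-m),
    "logistic regression":    lambda m: np.log(1.0 + np.exp(-m)),
}

margins = np.linspace(-2, 2, 5)
for name, loss in losses.items():
    print(f"{name:>24}: {np.round(loss(margins), 3)}")
# The four non-zero-one losses are convex surrogates for the zero-one loss;
# the "zoomed out" views show the same curves over a wider range of margins.
```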