IN4080 – 2020 FALL
NATURAL LANGUAGE PROCESSING
Jan Tore Lønning
1
Logistic Regression
Lecture 4, 7 Sept
2
Logistic regression
In natural language processing, logistic regression is the baseline supervised machine learning algorithm
3
Linear classifiers
Linear regression
Logistic regression
Training the logistic regression classifier
Multinomial Logistic Regression
Representing categorical features
Generative and discriminative classifiers
Logistic regression vs Naïve Bayes
5
Last week: Naive Bayes
Probabilistic classifier
Categorical features
Today
A geometrical view on classification
Numeric features
Eventually we will see that both Naive Bayes and logistic regression fit into this picture
6
Notation: $x_1, x_2, \dots, x_n$ for the features, where
each feature is a number and a fixed order is assumed
$y$ for the output value/class. In particular, J&M use
$\hat{y}$ for the predicted value and $y$ for the true value (where Marsland, IN3050, uses $y$ and $t$, resp.)
7
In NLP
thousands of features (dimensions), categorical data
These are difficult to illustrate by figures. To understand ML algorithms
it is easier to use one or two features (2-3 dimensions), to be able to draw figures,
and then to use numerical data, to get non-trivial figures
8
Two numeric features. Three classes. We may indicate the classes by colours/markers in a scatter plot
9
Many classification methods are first developed for two numeric features (two dimensions)
and then generalized to more features and classes
The goal is to find a curve that separates the classes
With more dimensions: to find a surface that separates them
10
Linear classifiers try to find a straight line (in general, a hyperplane) that separates the two classes
The two classes are linearly separable if such a line exists
If the data isn't linearly separable:
Then: the goal is to make as few misclassifications as possible
11
In one dimension, a linear separator is a point m
An observation is
class 1 iff x > m; class 0 iff x < m
12
Figure: Data set 1: linearly separable. Data set 2: not linearly separable.
In two dimensions, a line has the form $ax + by + c = 0$
The separating line should give $ax + by < -c$ for red points and $ax + by > -c$ for blue points
13
In a 3-dimensional space (3 features) the separator is a plane
In a higher-dimensional space it is a hyperplane
14
A hyperplane has the form
$\sum_{i=1}^{n} w_i x_i + w_0 = 0$
which equals
$\sum_{i=0}^{n} w_i x_i = (w_0, w_1, \dots, w_n) \cdot (x_0, x_1, \dots, x_n) = \vec{w} \cdot \vec{x} = 0$,
assuming $x_0 = 1$
An object belongs to class C iff
$\sum_{i=0}^{n} w_i x_i = \vec{w} \cdot \vec{x} > 0$
and to not-C otherwise
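As a concrete illustration, here is a minimal sketch (in Python with numpy, not from the slides) of the decision rule $\vec{w} \cdot \vec{x} > 0$; the weights below are made up for illustration.

import numpy as np

# Minimal sketch of the linear decision rule w . x > 0 (illustrative weights only).
w = np.array([-1.0, 2.0, 2.0])          # w_0 (bias), w_1, w_2

def predict(x1, x2):
    x = np.array([1.0, x1, x2])         # prepend x_0 = 1 so the bias is part of the dot product
    return 1 if np.dot(w, x) > 0 else 0 # class C iff w . x > 0

print(predict(0.2, 0.9))   # 1
print(predict(0.1, 0.1))   # 0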
15
Linear classifiers
Linear regression
Logistic regression
Training the logistic regression classifier
Multinomial Logistic Regression
Representing categorical features
Generative and discriminative classifiers
Logistic regression vs Naïve Bayes
16
Data:
100 males: height and weight
Goal:
Guess the weight of other males from their height
17
Method:
Try to fit a straight line to the observations
Predict that unseen data points lie on (or near) this line
Questions:
What is the best line? How do we find it?
18
To find the best fit, we compare each
true value $y_i$ (green point) to the corresponding predicted value $\hat{y}_i$
We define a loss function
which measures the discrepancy between the true and the predicted values
(alternatively called error function)
The goal is to minimize the loss
19
Figure: a data point $(x_i, y_i)$ and its vertical distance $d_i$ to the fitted line
Mean square error:
$\frac{1}{m} \sum_{i=1}^{m} d_i^2$
where $d_i = y_i - \hat{y}_i$ and $\hat{y}_i = a x_i + b$
Why squaring?
To not get 0 when we sum the differences. Large mistakes are punished more severely
20
For linear regression there is a formula for the optimal weights
(this is called an analytic solution)
But it is slow with many (millions of) features
Alternative:
Start with one candidate line. Try to find better weights. Use Gradient Descent. A kind of search problem (see the sketch below)
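Below is a minimal sketch (not from the slides) of gradient descent for fitting $\hat{y} = ax + b$ by minimizing the mean square error; the data, learning rate and number of epochs are made-up illustrative choices.

import numpy as np

# Minimal sketch of gradient descent for y_hat = a*x + b, minimizing the MSE.
# The data below is made up for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.8, 5.1, 7.2, 8.9, 11.1])   # roughly y = 2x + 1

a, b = 0.0, 0.0        # start with one candidate line
lr = 0.05              # learning rate (step length)

for epoch in range(2000):
    y_hat = a * x + b
    d = y - y_hat                      # residuals d_i = y_i - y_hat_i
    grad_a = -2 * np.mean(d * x)       # partial derivative of MSE w.r.t. a
    grad_b = -2 * np.mean(d)           # partial derivative of MSE w.r.t. b
    a -= lr * grad_a                   # move against the gradient
    b -= lr * grad_b

print(a, b)   # learned slope and intercept (close to 2 and 1 for this data)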
21
We use the derivative of the loss function (the gradient)
We are approaching a unique minimum, since the loss function is convex
For details:
IN3050/4050 (spring)
22
Linear regression of more than two variables
We try to fit the best (hyper-)plane
$\hat{y} = \sum_{i=0}^{n} w_i x_i = \vec{w} \cdot \vec{x}$
We can use the same mean square error:
$\frac{1}{m} \sum_{i=1}^{m} (y_i - \hat{y}_i)^2$
23
The loss function is convex: you cannot get stuck in a local minimum
The gradient
(= the partial derivatives of the loss function with respect to the weights)
tells us in which direction we should move
and how long the steps in each direction should be
24
Linear classifiers
Linear regression
Logistic regression
Training the logistic regression classifier
Multinomial Logistic Regression
Representing categorical features
Generative and discriminative classifiers
Logistic regression vs Naïve Bayes
25
Goal: predict gender from two numeric features: height and weight
26
First:
use one feature (height) only. The decision boundary should then be a single cut-off point c
An observation, n, is classified as
1 (male) if height_n > c; 0 (not male) otherwise
How do we determine c?
27
How good are the best classifiers we can get?
Given weight? Given height+weight?
28
How do we determine c? We may use linear regression:
Try to fit a straight line. The observations have $y \in \{0, 1\}$. The predicted value $\hat{y}$ can be any real number
Possible, but a bad fit: $y_i$ and $\hat{y}_i$ can be far apart
Correctly classified objects may still contribute a large loss
29
The correct decision boundary corresponds to a step function
But:
it is not a differentiable function, so we
can't apply gradient descent
30
An approximation to the ideal step function: the logistic (sigmoid) function
It is differentiable, so
gradient descent can be applied
Mistakes further from the decision boundary are punished harder
31
$y = \frac{1}{1 + e^{-z}} = \frac{e^{z}}{e^{z} + 1}$
A sigmoid curve
(But also other functions make sigmoid-shaped curves)
Maps $(-\infty, \infty)$ to $(0, 1)$. Monotone. Can be used for transforming real numbers into probabilities
32
33
Instead of a linear classifier which makes a categorical decision,
logistic regression will ascribe a probability to each class
We can turn it into a classifier by choosing the class with probability > 0.5
We could also choose other cut-off points than 0.5
34
source: Wikipedia
Logistic regression is probability-based. Given two classes C, not-C,
consider the odds
$\frac{P(C \mid \vec{x})}{P(notC \mid \vec{x})} = \frac{P(C \mid \vec{x})}{1 - P(C \mid \vec{x})}$
If this is > 1, $\vec{x}$ most likely belongs to C
It varies between 0 and infinity
Take the logarithm of this: $\log \frac{P(C \mid \vec{x})}{1 - P(C \mid \vec{x})}$
If this is > 0, $\vec{x}$ most likely belongs to C
It varies between minus infinity and plus infinity
35
Is $\log \frac{P(C \mid \vec{x})}{1 - P(C \mid \vec{x})} > 0$?
Try to find a linear expression for this: $\log \frac{P(C \mid \vec{x})}{1 - P(C \mid \vec{x})} = \vec{w} \cdot \vec{x}$
Given such a linear expression,
$\frac{P(C \mid \vec{x})}{1 - P(C \mid \vec{x})} = e^{\vec{w} \cdot \vec{x}}$
and hence
$P(C \mid \vec{x}) = \frac{e^{\vec{w} \cdot \vec{x}}}{1 + e^{\vec{w} \cdot \vec{x}}} = \frac{1}{1 + e^{-\vec{w} \cdot \vec{x}}}$
36
Two features: $x_1, x_2$. Apply weights: $w_0, w_1, w_2$. Let $z = w_0 + w_1 x_1 + w_2 x_2$. Apply the logistic function, $\sigma$, and predict class 1 iff
$\sigma(z) = \frac{1}{1 + e^{-z}} > 0.5$
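A minimal sketch of this two-feature decision rule; the weights w0, w1, w2 below are illustrative, not trained values.

import math

# Minimal sketch of the two-feature logistic regression decision rule.
w0, w1, w2 = -4.0, 1.5, 2.0

def sigma(z):
    return 1.0 / (1.0 + math.exp(-z))        # logistic function, maps R to (0, 1)

def predict_proba(x1, x2):
    z = w0 + w1 * x1 + w2 * x2               # linear expression w . x
    return sigma(z)                           # P(C | x)

def predict(x1, x2):
    return 1 if predict_proba(x1, x2) > 0.5 else 0   # cut-off at 0.5

print(predict_proba(2.0, 1.0), predict(2.0, 1.0))    # probability above 0.5 -> class 1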
37
From IDRE, UCLA
Linear classifiers
Linear regression
Logistic regression
Training the logistic regression classifier
Multinomial Logistic Regression
Representing categorical features
Generative and discriminative classifiers
Logistic regression vs Naïve Bayes
38
What are the best choices of a and b in
$\frac{1}{1 + e^{-(ax + b)}}$?
Geometrically, a and b determine the
midpoint and steepness
of the curve
39
A training instance consists of
a feature vector $\vec{x}$
a label (class), $y$, which is 1 or 0.
With a set of weights, $\vec{w}$, the model predicts
$\hat{y} = P(C = 1 \mid \vec{x}) = \frac{1}{1 + e^{-\vec{w} \cdot \vec{x}}}$
where $P(C = 0 \mid \vec{x}) = 1 - \hat{y}$
Goal: find $\vec{w}$ that maximizes the probability of the correct labels on the training data
40
In machine learning we have to make precise what counts as a good set of weights
We can do that in terms of a loss function
The goal of the training is to minimize the loss
Example: linear regression
Loss: Mean Square Error
We can choose between different loss functions
The choice is partly determined by the model and the task
For logistic regression we will use the cross-entropy loss
41
The underlying idea is that we want to maximize the joint probability of the training labels
$\prod_{i=1}^{m} P(y^{(i)} \mid \vec{x}^{(i)})$
This is the same as maximizing
$\log \prod_{i=1}^{m} P(y^{(i)} \mid \vec{x}^{(i)}) = \sum_{i=1}^{m} \log P(y^{(i)} \mid \vec{x}^{(i)})$
This is the same as minimizing
$L_{CE}(\vec{w}) = -\log \prod_{i=1}^{m} P(y^{(i)} \mid \vec{x}^{(i)}) = \sum_{i=1}^{m} -\log P(y^{(i)} \mid \vec{x}^{(i)})$
which is an instance of what is called the cross-entropy loss
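A minimal sketch of computing this loss for a binary logistic regression model on a tiny made-up training set; the weights are illustrative, not trained values.

import math

# Minimal sketch of the cross-entropy loss for binary logistic regression.
train = [((2.0, 1.0), 1), ((0.5, 0.2), 0), ((1.5, 1.5), 1)]   # made-up (x1, x2, y) items
w0, w1, w2 = -4.0, 1.5, 2.0

def p_class1(x1, x2):
    z = w0 + w1 * x1 + w2 * x2
    return 1.0 / (1.0 + math.exp(-z))           # P(y = 1 | x)

def cross_entropy(data):
    loss = 0.0
    for (x1, x2), y in data:
        p = p_class1(x1, x2)
        p_correct = p if y == 1 else 1.0 - p    # P(correct label | x)
        loss += -math.log(p_correct)            # sum of -log P(y_i | x_i)
    return loss

print(cross_entropy(train))   # lower is better; this is what training minimizes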
42
To minimize the loss function we use gradient descent
Good news:
The loss function is convex: you cannot get stuck in a local minimum
We know which way to go
We skip the details of sec. 5.4
43
Batch training:
Calculate the loss for the whole training set
Make one move in the correct direction
Repeat (an epoch)
Can be slow
Stochastic gradient descent:
Pick one item. Calculate the loss for this item. Move in the direction of the gradient
Each move does not have to be in the overall best direction
But the overall effect may be good. Can be faster
44
Mini-batch training:
Pick a subset of the training set of a fixed size (a mini-batch)
Calculate the loss for this subset. Make one move in the direction of the gradient
Repeat (an epoch)
A good compromise between batch training and stochastic gradient descent
(The other two are subcases of mini-batch training)
There are various different solvers/optimizers
Observe that you may specify which one to use (see the sketch below)
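A minimal sketch (assuming numpy, with made-up data, batch size and learning rate) of mini-batch gradient descent for binary logistic regression.

import numpy as np

# Minimal sketch of mini-batch gradient descent for binary logistic regression.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                      # 100 instances, 2 features (made up)
y = (X[:, 0] + 2 * X[:, 1] > 0).astype(float)      # made-up labels

X = np.hstack([np.ones((100, 1)), X])              # prepend x_0 = 1 for the bias weight
w = np.zeros(3)
lr, batch_size = 0.1, 10

for epoch in range(100):
    order = rng.permutation(len(y))                # shuffle once per epoch
    for start in range(0, len(y), batch_size):
        idx = order[start:start + batch_size]      # one mini-batch
        p = 1 / (1 + np.exp(-X[idx] @ w))          # predicted P(y = 1 | x)
        grad = X[idx].T @ (p - y[idx]) / len(idx)  # gradient of the cross-entropy loss
        w -= lr * grad                             # one move per mini-batch

print(w)   # learned weights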
45
LogReg is prone to overfitting to the training data
Hence apply regularization
The regularization punishes large weights
Most common is L2-regularization: $R(\vec{w}) = \sum_{i=0}^{n} w_i^2$
Alternative: L1-regularization: $R(\vec{w}) = \sum_{i=0}^{n} |w_i|$
46
The regularized objective: $\hat{w} = \operatorname{argmax}_{\vec{w}} \sum_{i=1}^{m} \log P(y^{(i)} \mid \vec{x}^{(i)}) - \alpha R(\vec{w})$
47
LogisticRegression(penalty='l2', …, C=1.0, …)
By adjusting C, you may get better results (in scikit-learn, C is the inverse regularization strength: smaller C means stronger regularization)
The optimal C varies from task to task
Uses L2-regularization as default
Whether L1 or L2 is better may depend on the learner and the task
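A minimal sketch of fitting scikit-learn's LogisticRegression with an explicit C; the training data here is randomly generated for illustration only.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up data; in practice X_train comes from the feature extraction.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 5))
y_train = (X_train[:, 0] - X_train[:, 1] > 0).astype(int)

# penalty='l2' and C=1.0 are the defaults; smaller C means stronger regularization.
clf = LogisticRegression(penalty='l2', C=0.5, solver='lbfgs')
clf.fit(X_train, y_train)

print(clf.coef_, clf.intercept_)     # the learned weights w_1..w_n and w_0
print(clf.predict(X_train[:3]))      # predicted classes for the first items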
48
Linear classifiers
Linear regression
Logistic regression
Training the logistic regression classifier
Multinomial Logistic Regression
Representing categorical features
Generative and discriminative classifiers
Logistic regression vs Naïve Bayes
49
Also called maximum entropy (maxent) classifier, or softmax regression
With one class we
considered $P(C \mid \vec{x}) = \frac{e^{\vec{w} \cdot \vec{x}}}{1 + e^{\vec{w} \cdot \vec{x}}} = \frac{1}{1 + e^{-\vec{w} \cdot \vec{x}}}$
and implicitly $P(notC \mid \vec{x}) = 1 - \frac{e^{\vec{w} \cdot \vec{x}}}{1 + e^{\vec{w} \cdot \vec{x}}} = \frac{1}{1 + e^{\vec{w} \cdot \vec{x}}}$
We now consider a linear expression $\vec{w}_j$, for each class $C_j$, $j = 1, \dots, k$. The probability for each class is then given by the softmax function
$P(C_j \mid \vec{x}) = \frac{e^{\vec{w}_j \cdot \vec{x}}}{\sum_{i=1}^{k} e^{\vec{w}_i \cdot \vec{x}}}$
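A minimal sketch of the softmax computation for three classes; the weight vectors and the observation are made up for illustration.

import math

# Minimal sketch of softmax over k classes, each with its own weight vector.
W = [[0.2, 1.0, -0.5],    # weight vector for class 1 (first entry is the bias w_0)
     [0.0, -0.3, 0.8],    # class 2
     [-0.4, 0.1, 0.1]]    # class 3
x = [1.0, 2.0, 0.5]       # observation with x_0 = 1 prepended

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]           # probabilities summing to 1

scores = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in W]   # w_j . x for each class
probs = softmax(scores)
print(probs, probs.index(max(probs)))   # class probabilities and the predicted class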
50
4 different classes, each corresponding to its own weight vector
For each of them a corresponding linear expression
This expresses the probability of the class given the observation
For classification of a new observation: choose the class with the highest probability
In 3D: a surface for each class. They cut each other along straight lines = decision boundaries
51
52
https://scikit-learn.org/stable/auto_examples/linear_model/plot_logistic_multinomial.html
This is done similarly to the binary task. We skip the details
53
Multinomial LR constructs $P(C_j \mid \vec{x}) = \frac{e^{\vec{w}_j \cdot \vec{x}}}{\sum_{i=1}^{k} e^{\vec{w}_i \cdot \vec{x}}}$ for each class.
This corresponds to one linear expression $\vec{w}_j$ for each $C_j$, $j = 1, \dots, k$. Alternatively, think of this as
different features for each class:
notation: $f_j(C, \vec{x})$ = feature j for the class C and observation $\vec{x}$
and one set of weights for the features and classes
In scikit-learn we write features as before and LogisticRegression handles the multi-class case
54
Linear classifiers
Linear regression
Logistic regression
Training the logistic regression classifier
Multinomial Logistic Regression
Representing categorical features
Generative and discriminative classifiers
Logistic regression vs Naïve Bayes
55
In the naive Bayes model we could handle categorical values directly,
e.g. what is the probability that c_n = 'z' given the class?
But many classifiers can only handle numerical data. How can we represent categorical data by numerical data? (In general, it is not a good idea to just assign a single number to each categorical value.)
56
[({'f1': 'a', 'f2': 'z', 'f3': True, 'f4': 5}, 'class_1'),
 ({'f1': 'b', 'f2': 'z', 'f3': False, 'f4': 2}, 'class_2'),
 ({'f1': 'c', 'f2': 'x', 'f3': False, 'f4': 4}, 'class_1')]
57
Assume the following example: 3 training instances, 4 different features. Representation in NLTK:
feature    | f1      | f2   | f3          | f4
type       | cat     | cat  | Bool (num)  | num
Value set  | a, b, c | x, y | True, False | 0, 1, 2, 3, …
Classes: class_1, class_2
58
feature 1: a → (1,0,0), b → (0,1,0), c → (0,0,1); feature 2: x → (1,0), y → (0,1)
Represent categorical variables by one-hot vectors
[({'f1': 'a', 'f2': 'z', 'f3': True, 'f4': 5}, 'class_1'),
 ({'f1': 'b', 'f2': 'z', 'f3': False, 'f4': 2}, 'class_2'),
 ({'f1': 'c', 'f2': 'x', 'f3': False, 'f4': 4}, 'class_1')]

X_train: array([[ 1., 0., 0., 0., 1., 1., 5.],
                [ 0., 1., 0., 0., 1., 0., 2.],
                [ 0., 0., 1., 1., 0., 0., 4.]])

train_target: ['class_1', 'class_2', 'class_1'], or train_target: [1, 2, 1]
59
One-hot encoding: a → [1, 0, 0], b → [0, 1, 0], c → [0, 0, 1]
NLTK: 4 features, 3 training instances → scikit: 7 features, 3 training instances, 3 corresponding class labels
We can construct the data for scikit directly. Scikit also has methods for converting Python dictionaries (the NLTK format) into feature arrays (see the sketch below)
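The code from the original slide is not preserved here; the following is a sketch of one way to do the conversion, using scikit-learn's DictVectorizer on the example data above.

from sklearn.feature_extraction import DictVectorizer

# Sketch: convert NLTK-style (feature-dict, label) pairs to scikit arrays.
train = [({'f1': 'a', 'f2': 'z', 'f3': True, 'f4': 5}, 'class_1'),
         ({'f1': 'b', 'f2': 'z', 'f3': False, 'f4': 2}, 'class_2'),
         ({'f1': 'c', 'f2': 'x', 'f3': False, 'f4': 4}, 'class_1')]

v = DictVectorizer(sparse=False)
X_train = v.fit_transform([feats for feats, label in train])   # one-hot encodes the categorical features
y_train = [label for feats, label in train]

print(v.get_feature_names_out())   # which column corresponds to which feature=value
print(X_train)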
60
Transform: for new (test) data, use transform (not fit_transform) with the same vectorizer v as for the training data
We can construct the data for scikit directly. Scikit has methods for converting text to bag-of-words arrays. Positions correspond to [anta, en, er, fiol, rose] (see the sketch below)
One-hot encoding uses space. 26 English characters:
Each is represented as a vector of length 26
Bernoulli NB text classifier with a large vocabulary:
Each word represented by a vector as long as the vocabulary
scikit-learn internally uses a more compact (sparse) representation
62
Linear classifiers
Linear regression
Logistic regression
Training the logistic regression classifier
Multinomial Logistic Regression
Representing categorical features
Generative and discriminative classifiers
Logistic regression vs Naïve Bayes
63
Naive Bayes is an example of a generative classifier
On its way to deciding which class is most probable:
It estimates the probability of the observation given the class
It "generates" the observation with a certain probability
For an observation:
choose the class whose model ascribes the highest probability to the observation, multiplied by the prior probability of the model (class)
Example: is this picture of a dog or a cat? To decide:
Generate a picture of a dog, i.e. make a probability distribution over all possible
pictures: how probable is it you will draw a dog like this?
Do the same for a cat
64
$P(s \mid f_1{=}v_1, f_2{=}v_2, \dots, f_n{=}v_n) = \frac{P(s)\, P(f_1{=}v_1, f_2{=}v_2, \dots, f_n{=}v_n \mid s)}{P(f_1{=}v_1, f_2{=}v_2, \dots, f_n{=}v_n)}$
First choose the length of the document
Then choose the first word
according to the probability distribution over words for the class
31 798 742
Then choose word 2, etc. up to the chosen length
Observation:
Whether we compare to
Footnote:
The multinomial text model
65
A discriminative classifier considers the probability of the class given the observation directly
E.g. a discriminative text classifier may focus on the features:
terrible and terrific for pos. vs. neg. film review; director and author for pos. film vs. pos. book review
The discriminative classifier
may be more efficient but gives less explanation and may eventually focus on wrong features
66
Linear classifiers
Linear regression
Logistic regression
Training the logistic regression classifier
Multinomial Logistic Regression
Representing categorical features
Generative and discriminative classifiers
Logistic regression vs Naïve Bayes
67
Both are probability-based. In the two-class case they consider whether $P(C \mid \vec{x}) > P(notC \mid \vec{x})$,
equivalently whether $\log \frac{P(C \mid \vec{x})}{P(notC \mid \vec{x})} > 0$
68
NB is a generative classifier:
It has a model of how the data are generated: $P(C)\, P(\vec{x} \mid C)$
LogReg is a discriminative classifier:
It only considers the conditional probability $P(C \mid \vec{x})$
69
Both consider whether $\log \frac{P(C \mid \vec{x})}{P(notC \mid \vec{x})} > 0$
For NB: $\log \frac{P(C \mid \vec{x})}{P(notC \mid \vec{x})} =$
one particular linear expression, determined by the estimated probabilities
For LR: $\log \frac{P(C \mid \vec{x})}{P(notC \mid \vec{x})} = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_n x_n$,
the linear expression that fits the training data best
70
LR: $\log \frac{P(C \mid \vec{x})}{P(notC \mid \vec{x})} = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_n x_n$
NB: $\log \frac{P(c_1 \mid \vec{f})}{P(c_2 \mid \vec{f})} = \log P(c_1) + \sum_{j=1}^{n} \log P(f_j \mid c_1) - \log P(c_2) - \sum_{j=1}^{n} \log P(f_j \mid c_2)$
which has the same linear form, where:
$w_0 = \log P(c_1) - \log P(c_2)$ and $w_j = \log P(f_j \mid c_1) - \log P(f_j \mid c_2)$
71
NB: $\log \frac{P(c_1 \mid \vec{f})}{P(c_2 \mid \vec{f})} = \log P(c_1) + \sum_{j=1}^{n} \log P(f_j \mid c_1) - \log P(c_2) - \sum_{j=1}^{n} \log P(f_j \mid c_2)$
NB is an instance of LogReg,
i.e. one possible choice of weights
LogReg will do at least as well as NB on the training data
(without any smoothing)
When the independence assumptions of NB hold, NB will do as well as LogReg
When the independence assumptions do not hold, NB may put too much weight on features that are correlated
LogReg will not do this: if we add features that depend on other features, LogReg can compensate by adjusting the weights
72
73
One way to see which features are important for LogReg:
Start with a classifier which uses many features
Remove one feature f1, retrain and see whether it has an effect
Remove another feature f2, instead of f1 or in addition to f1, and study the effect
Beware of the possibility:
Removing f1 only has little effect
Removing f2 only has little effect
Removing both f1 and f2 might have a large effect
Why is this so?