Algorithms for NLP
CS 11-711 · Fall 2020
Lecture 2: Linear text classification
Emma Strubell
Let's try this again: course staff
Emma (she/her), Yulia (she/her), Bob (he/him), Sanket (he/him), Han (he/him), Jiateng (he/him)
Problem definition
Input: a text w = (w1, w2, …, wT) ∈ V*
Output: a label y ∈ Y, for example:
  Y = { positive, negative, neutral }
  Y = { toxic, non-toxic }
  Y = { Mandarin, English, Spanish, … }

Example: w = "The drinks were strong but the fish tacos were bland"
         (w1 = The, w2 = drinks, …, w10 = bland),  y = negative
One choice of representation, R: bag-of-words.

w = "The drinks were strong but the fish tacos were bland"
x = a vector of word counts over the whole vocabulary (aardvark, …, zyther): zero for
    every word except bland 1, but 1, drinks 1, fish 1, strong 1, tacos 1, the 2, were 2
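To make the bag-of-words representation concrete, here is a minimal sketch (not from the slides; the tiny vocabulary and whitespace tokenizer are simplifications) that turns a document into a count vector:

from collections import Counter

def bag_of_words(text, vocab):
    # Lowercase, split on whitespace, and count each vocabulary word.
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocab]

vocab = ["bland", "but", "drinks", "fish", "strong", "tacos", "the", "were"]
x = bag_of_words("The drinks were strong but the fish tacos were bland", vocab)
# x == [1, 1, 1, 1, 1, 1, 2, 2]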
Then: predict the label with the highest score,

  ŷ = argmax_y ψ(x, y),

where the score is a linear function of the features:

  ψ(x, y) = θ · f(x, y) = Σ_j θ_j × f_j(x, y),

where θ is a vector of weights, and f is a feature function.
such as:

  f_j(x, y) = x_fantastic if y = positive, and 0 otherwise
  f_j(x, y) = x_bland     if y = negative, and 0 otherwise

Concretely, f(x, y) places the word-count vector x in the block of positions
corresponding to label y, with zeros everywhere else:

  f(x, y = 1) = [ x_0, x_1, …, x_|V|, 0, …, 0 ]ᵀ      (zeros in the remaining (K−1)×V positions)
  f(x, y = 2) = [ 0, …, 0, x_0, x_1, …, x_|V|, 0, …, 0 ]ᵀ
  …
  f(x, y = K) = [ 0, …, 0, x_0, x_1, …, x_|V| ]ᵀ      (zeros in the first (K−1)×V positions)
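As an illustration of this label-specific feature function (my own sketch, using the dictionary-of-counts convention that the compute_score code below expects), features can be keyed by (word, label) pairs:

def feature_function(x, y):
    # x: bag-of-words counts, e.g. {"the": 2, "bland": 1, ...}; y: a candidate label.
    # Keying each feature by (word, label) mimics placing the count vector x in the
    # block of f(x, y) that corresponds to label y, with zeros everywhere else.
    return {(word, y): count for word, count in x.items()}

The same word therefore gets a different weight for each label, which is exactly what the stacked vectors above encode.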
def compute_score(x, y, weights):
    # Score psi(x, y): sum the weight of each active feature, weighted by its count.
    total = 0
    for feature, count in feature_function(x, y).items():
        total += weights[feature] * count
    return total
[Figure: the bag-of-words count vector x from before, alongside a weight vector θ of length K × V.]
import numpy as np

def compute_score(x, y, weights):
    # Vectorized version: here feature_function returns a dense vector and
    # weights is a numpy array of the same length.
    return np.dot(weights, feature_function(x, y))
The weights θ are learned from a labeled dataset {(x^(i), y^(i))}, i = 1, …, N.
Naïve Bayes defines the score as a log joint probability, learned by maximizing the
likelihood of the dataset:

  ψ(x, y) = log p(x, y) = log p(y | x) + C,

where C is constant in y. This ensures that:

  ŷ = argmax_y ψ(x, y) = argmax_y p(y | x).

■ First, assume each instance ((x, y) pair) is independent of the others:
    p(x^(1:N), y^(1:N)) = ∏_{i=1}^N p(x^(i), y^(i)).
■ Apply the chain rule of probability:
    p(x, y) = p(x | y) × p(y)
■ Define the parametric form of each probability:
    p(y) = Categorical(μ),   p(x | y) = Multinomial(φ, T)
■ The multinomial is a distribution over vectors of counts: what is the probability
  that this word appears 3 times?
■ The parameters μ and φ are vectors of probabilities.
■ The number of ways of ordering the counted words does not depend on the frequency
  parameter φ.
  Multinomial(x; φ, T) = ( (Σ_{j=1}^V x_j)! / ∏_{j=1}^V x_j! )  ∏_{j=1}^V φ_j^{x_j}
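As a sanity check on this formula, here is a small sketch (mine, not from the slides) that evaluates the multinomial log-probability of a count vector, using math.lgamma for the log-factorials:

import math

def multinomial_log_pmf(x, phi):
    # x: vector of word counts; phi: vector of word probabilities summing to 1.
    # log [ (sum_j x_j)! / prod_j(x_j!) * prod_j phi_j^{x_j} ]
    log_coef = math.lgamma(sum(x) + 1) - sum(math.lgamma(c + 1) for c in x)
    return log_coef + sum(c * math.log(p) for c, p in zip(x, phi) if c > 0)

print(multinomial_log_pmf([2, 1, 0], [0.5, 0.3, 0.2]))  # log(3 * 0.5**2 * 0.3)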
Naïve Bayes is therefore a linear classifier:

  ψ(x, y) = θ · f(x, y) = log p(x | y) + log p(y),

with the weights equal to the log parameters:

  θ = [ log φ_{y1,w1}, log φ_{y1,w2}, …, log φ_{y1,wV}, log μ_{y1},
        log φ_{y2,w1}, …,             log φ_{y2,wV}, log μ_{y2},
        …,
        log φ_{yK,w1}, …,             log φ_{yK,wV}, log μ_{yK} ]       (length K × (V + 1))

where f(x, y) is extended to include an "offset" 1 for each possible label after the word counts.
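To illustrate, a tiny sketch (mine; the block ordering is an assumption) that packs the log parameters into a single weight vector of length K × (V + 1):

import numpy as np

def nb_weights(phi, mu):
    # phi: (K, V) array of word probabilities per label; mu: (K,) label probabilities.
    # Each label's block is [log phi_{y,1}, ..., log phi_{y,V}, log mu_y].
    return np.concatenate([np.append(np.log(phi[k]), np.log(mu[k]))
                           for k in range(len(mu))])

Dotting this θ with the offset-extended f(x, y) recovers log p(x | y) + log p(y), up to the count-ordering term that does not depend on y.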
The parameters are estimated by maximizing the likelihood:

  φ̂, μ̂ = argmax_{φ,μ} ∏_{i=1}^N p(x^(i), y^(i)) = argmax_{φ,μ} Σ_{i=1}^N log p(x^(i), y^(i)),

which has a closed-form solution in terms of relative frequencies:

  φ̂_{y,j} = count(y, j) / Σ_{j'=1}^V count(y, j')
           = Σ_{i: y^(i)=y} x_j^(i)  /  Σ_{j'=1}^V Σ_{i: y^(i)=y} x_{j'}^(i)

  μ̂_y = count(y) / Σ_{y'} count(y').

■ These relative frequencies are the maximum likelihood estimates.
■ Maximum likelihood can overfit to the idiosyncrasies of a finite dataset.
■ A common remedy is to smooth the counts, tuning the amount of smoothing on a
  development set (see the sketch below).
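A minimal sketch (my own, with assumed array shapes) of these relative-frequency estimates, with an optional pseudocount for smoothing:

import numpy as np

def estimate_nb(X, y, num_labels, alpha=0.0):
    # X: (N, V) matrix of word counts; y: length-N array of integer labels.
    # alpha is an optional smoothing pseudocount; alpha = 0 gives the MLE above.
    V = X.shape[1]
    phi = np.zeros((num_labels, V))
    mu = np.zeros(num_labels)
    for k in range(num_labels):
        counts = X[y == k].sum(axis=0) + alpha   # count(y=k, j), optionally smoothed
        phi[k] = counts / counts.sum()           # normalize over the vocabulary
        mu[k] = (y == k).sum() / len(y)          # count(y=k) / N
    return phi, mu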
Problems with Naïve Bayes:
■ It is trained to maximize the joint likelihood, but classification only needs the
  conditional probability p(y | x).
■ It treats each word as equally informative. Suppose "naïve" and "Bayes" always occur
  together. Should we really count them both independently for classification?
■ Modeling effort is spent on the generative probability p(x), which is not needed at
  prediction time.
The perceptron: a simple learning rule.
■ Run the current classifier on an instance in the training data, obtaining
    ŷ = argmax_y ψ(x^(i), y).
■ If the prediction is incorrect, increase the weights for the features of the correct
  label and decrease the weights for the features of the predicted label:
    θ ← θ + f(x^(i), y^(i)) − f(x^(i), ŷ).
■ Repeat until all training instances are correctly classified (or you run out of time);
  a sketch of the full loop appears below.
■ If the dataset is linearly separable (that is, if there is some θ that correctly
  labels all the training instances), then this method is guaranteed to find it.
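Putting the rule together, here is a sketch (mine; predict, the epoch count, and the data format are assumptions) of the full perceptron loop, reusing compute_score and feature_function from earlier:

from collections import defaultdict

def predict(x, labels, weights):
    # Return the label with the highest score under the current weights.
    return max(labels, key=lambda y: compute_score(x, y, weights))

def train_perceptron(data, labels, num_epochs=10):
    # data: list of (x, y_true) pairs, where x is a bag-of-words dict.
    weights = defaultdict(float)
    for _ in range(num_epochs):
        for x, y_true in data:
            y_hat = predict(x, labels, weights)
            if y_hat != y_true:
                # theta <- theta + f(x, y_true) - f(x, y_hat)
                for feature, count in feature_function(x, y_true).items():
                    weights[feature] += count
                for feature, count in feature_function(x, y_hat).items():
                    weights[feature] -= count
    return weights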
Learning can be framed as minimizing a loss function ℓ(θ; x^(i), y^(i)). Such a function
should have two properties: it should reflect the errors the classifier makes, and it
should be easy to optimize.

The zero-one loss simply counts classification errors:

  ℓ_{0-1}(θ; x^(i), y^(i)) = 0 if y^(i) = argmax_y θ · f(x^(i), y), and 1 otherwise.
The perceptron loss:

  ℓ_perceptron(θ; x^(i), y^(i)) = −θ · f(x^(i), y^(i)) + max_{y'≠y^(i)} θ · f(x^(i), y')

Its (sub)gradient recovers the perceptron update:

  ∂/∂θ ℓ_perceptron = −f(x^(i), y^(i)) + f(x^(i), ŷ)

  θ^(t+1) ← θ^(t) − ∂/∂θ ℓ_perceptron = θ^(t) + f(x^(i), y^(i)) − f(x^(i), ŷ).

Gradient descent!
Comparing the classifiers so far:
■ Both are easy to optimize. However, NB can be optimized in closed form, while the
  perceptron requires iterating over the dataset multiple times.
■ … is −inf; some examples will be over-emphasized, others will be under-emphasized.
■ … performance depends on the extent to which this holds.
■ Even when the classifier picks the right answer by a very small margin, the loss is
  still 0.
We would prefer a classifier that separates the correct label from the others by a
large margin:

  γ(θ; x^(i), y^(i)) = θ · f(x^(i), y^(i)) − max_{y'≠y^(i)} θ · f(x^(i), y')

The margin (hinge) loss penalizes any margin smaller than 1:

  ℓ_MARGIN(θ; x^(i), y^(i)) = 0,                        if γ(θ; x^(i), y^(i)) ≥ 1
                              1 − γ(θ; x^(i), y^(i)),   otherwise
                            = max( 0, 1 − γ(θ; x^(i), y^(i)) )
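A small sketch (mine) of the margin and hinge loss for a single instance, using the same dictionary-based scoring as before:

def margin(x, y_true, labels, weights):
    # Score of the true label minus the best score among the other labels.
    best_other = max(compute_score(x, y, weights) for y in labels if y != y_true)
    return compute_score(x, y_true, weights) - best_other

def margin_loss(x, y_true, labels, weights):
    # Hinge loss: zero once the true label wins by a margin of at least 1.
    return max(0.0, 1.0 - margin(x, y_true, labels, weights))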
We can generalize the notion of classification error using a cost function:

  c(y^(i), y) = 1 if y^(i) ≠ y, and 0 otherwise.

This gives a cost-augmented classification rule (cost-augmented decoding):

  ŷ = argmax_{y∈Y} θ · f(x^(i), y) + c(y^(i), y)

and an update whose (1 − λ) factor acts as regularization, shrinking the weights
toward zero:

  θ^(t) ← (1 − λ)θ^(t−1) + f(x^(i), y^(i)) − f(x^(i), ŷ)
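A sketch (mine) of cost-augmented decoding with the same scoring function: every incorrect label gets its cost added to its score, so the true label only wins if it beats the others by more than the cost.

def cost_augmented_predict(x, y_true, labels, weights):
    # argmax_y  theta . f(x, y) + c(y_true, y), with c = 1 for incorrect labels.
    def augmented_score(y):
        return compute_score(x, y, weights) + (0.0 if y == y_true else 1.0)
    return max(labels, key=augmented_score)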
Logistic regression:
■ The perceptron and large-margin classifiers learn weights that discriminate correct
  and incorrect labels, but they do not give probabilities over predictions.
■ Logistic regression instead directly models the conditional probability of the label:

  p(y | x; θ) = exp(θ · f(x, y)) / Σ_{y'∈Y} exp(θ · f(x, y'))
The weights are chosen to maximize the conditional log-likelihood of the labels:

  log p(y^(1:N) | x^(1:N); θ) = Σ_{i=1}^N log p(y^(i) | x^(i); θ)
                              = Σ_{i=1}^N [ θ · f(x^(i), y^(i)) − log Σ_{y'∈Y} exp( θ · f(x^(i), y') ) ]

Equivalently, minimize the logistic loss on each instance:

  ℓ_LogReg(θ; x^(i), y^(i)) = −θ · f(x^(i), y^(i)) + log Σ_{y'∈Y} exp( θ · f(x^(i), y') ).

(Compare to the perceptron loss: the max over competing labels is replaced by a smooth
log-sum-exp.)
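A sketch (my own, assuming dense score vectors) of the conditional probability and the logistic loss for one instance, using the log-sum-exp shift for numerical stability:

import numpy as np

def label_probabilities(scores):
    # scores[k] = theta . f(x, y_k) for each candidate label y_k.
    shifted = scores - scores.max()          # subtract the max for stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

def logistic_loss(scores, true_index):
    # -theta . f(x, y) + log sum_y' exp(theta . f(x, y'))
    shifted = scores - scores.max()
    return -shifted[true_index] + np.log(np.exp(shifted).sum())

scores = np.array([2.0, 0.5, -1.0])          # hypothetical scores for three labels
print(label_probabilities(scores), logistic_loss(scores, true_index=0))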
[Figure: 0/1 loss, margin loss, and logistic loss plotted as functions of
θ · f(x^(i), y^(i)) − θ · f(x^(i), ŷ).]
What if a feature occurs in only a few training instances? What will its weight be?
An unregularized classifier can overfit to idiosyncratic features of the training data.
Regularization discourages large weights. Applied here to the logistic loss (the same
idea works for the large margin loss):

  min_θ Σ_{i=1}^N ℓ_LogReg(θ; x^(i), y^(i)) + c ||θ||₂²,

■ ||θ||₂² = Σ_j θ_j².
■ The scalar c controls the strength of the regularizer.
We have seen that learning can be formulated as minimizing a loss function. A general
strategy for minimization is gradient descent:

  θ^(t+1) ← θ^(t) − η ∂/∂θ Σ_{i=1}^N ℓ(θ^(t); x^(i), y^(i)),

where η is the step size (learning rate).
Stochastic gradient descent approximates the gradient over the whole dataset with the
gradient of a single instance:

  ∂/∂θ Σ_{i=1}^N ℓ(θ^(t); x^(i), y^(i)) ≈ C × ∂/∂θ ℓ(θ^(t); x^(i), y^(i)),

where (x^(i), y^(i)) is sampled at random from the training set.

Minibatch SGD instead computes the gradient over a small number of instances. This is
well suited to modern high-throughput hardware (e.g. GPUs and TPUs) and is commonly
used in deep learning.
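Tying the pieces together, a sketch (mine, with hypothetical helpers and hyperparameters) of stochastic gradient descent on the L2-regularized logistic loss. The per-instance gradient is the expected feature vector under the model minus the observed feature vector, plus 2cθ from the regularizer.

import numpy as np

def sgd_logreg(data, num_labels, feature_function, num_steps=1000, eta=0.1, c=0.01):
    # data: list of (x, y) pairs with integer labels y in {0, ..., num_labels - 1}.
    # feature_function(x, k) must return a dense numpy vector of fixed length D.
    rng = np.random.default_rng(0)
    D = len(feature_function(data[0][0], 0))
    theta = np.zeros(D)
    for _ in range(num_steps):
        x, y = data[rng.integers(len(data))]                    # sample one instance
        feats = np.stack([feature_function(x, k) for k in range(num_labels)])
        scores = feats @ theta
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()
        grad = probs @ feats - feats[y] + 2 * c * theta         # loss gradient + L2 term
        theta -= eta * grad
    return theta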
Summary:
■ Naïve Bayes: simple, probabilistic, fast; but not very accurate.
■ Perceptron: simple, accurate; but not probabilistic, and may overfit.
■ Large margin: error-driven learning, can be regularized; but not probabilistic.
■ Logistic regression: error-driven learning, regularized; but more difficult to
  implement.
Due next Friday (9/11).
questions!
w/ Colloquium.