CS 533: Natural Language Processing
From Log-Linear to Neural Language Models
Karl Stratos
Rutgers University
Agenda

1. Loose ends (STOP symbol, Zipf's law)
2. Log-linear language models
   ◮ Gradient descent
3. Neural language models
   ◮ Feedforward
   ◮ Recurrent
[Figure: Zipf's law on news text. Word types are ranked by frequency; the top ranks are dominated by function words and punctuation ("the", ",", ".", "to", "in", "and", "a", "'s", ...), and frequency falls off roughly as 1/rank. Vertical axis: frequency (ticks at 10000-60000).]
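Zipf's law is easy to check empirically: count token frequencies and compare the frequency at rank r against c/r. A minimal sketch, where the toy corpus stands in for real news text:

```python
from collections import Counter

# Tiny stand-in corpus; in practice use a large tokenized corpus (e.g., news text).
tokens = ("the dog chased the cat the cat chased the mouse "
          "the mouse chased the dog").split()

counts = Counter(tokens)
ranked = counts.most_common()   # word types sorted by frequency, most frequent first

# Zipf's law: the frequency of the rank-r type is roughly c / r.
c = ranked[0][1]                # frequency of the most frequent type
for r, (word, freq) in enumerate(ranked, start=1):
    print(f"rank {r}: {word!r} observed={freq} zipf={c / r:.1f}")
```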
Log-linear language models
◮ Random variables: context x (e.g., previous n words), next word y ∈ V
◮ Assumes a feature function φ(x, y) ∈ {0, 1}^d
◮ Model parameter: weight vector w ∈ R^d
◮ Model: for any (x, y),
    q(y|x) = exp(w⊤φ(x, y)) / Σ_{y′∈V} exp(w⊤φ(x, y′))
◮ Model estimation: minimize cross entropy (≡ MLE):
    min_{w∈R^d} E_{(x,y)∼p_XY} [−ln q(y|x)]
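A minimal NumPy sketch of this model. The hashed indicator features below are an illustrative choice, not the slide's; any φ mapping (x, y) to a binary vector works:

```python
import numpy as np

V = ["the", "dog", "cat", "mouse", "chased"]   # vocabulary
d = 20                                          # number of features

def phi(x, y):
    """Binary feature vector phi(x, y) in {0,1}^d via feature hashing
    (illustrative; real systems enumerate features explicitly)."""
    f = np.zeros(d)
    f[hash(("prev1", x[-1], y)) % d] = 1.0           # indicator on (x[-1], y)
    f[hash(("prev12", x[-2], x[-1], y)) % d] = 1.0   # indicator on (x[-2], x[-1], y)
    return f

def q(x, w):
    """q(.|x) = softmax over y of w . phi(x, y)."""
    scores = np.array([w @ phi(x, y) for y in V])
    scores -= scores.max()            # numerical stability
    p = np.exp(scores)
    return p / p.sum()

w = np.zeros(d)                       # untrained weights: q(.|x) is uniform
print(dict(zip(V, q(["the", "dog", "chased", "the"], w))))
```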
Example corpus:
◮ the dog chased the cat
◮ the cat chased the mouse
◮ the mouse chased the dog
Example feature templates on (context x, next word y):
◮ (x[−1], y)
◮ (x[−2], y)
◮ (x[−2], x[−1], y)
◮ (x[−1][:−2], y)
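Instantiating these templates on the toy corpus, reading x[−1][:−2] as Python slicing (the previous word with its last two characters dropped), which is one interpretation of the notation:

```python
corpus = ["the dog chased the cat",
          "the cat chased the mouse",
          "the mouse chased the dog"]

features = set()
for sentence in corpus:
    words = sentence.split()
    for i in range(2, len(words)):              # start at 2 so x[-2] exists
        x, y = words[:i], words[i]
        features.add(("prev1", x[-1], y))               # (x[-1], y)
        features.add(("prev2", x[-2], y))               # (x[-2], y)
        features.add(("prev12", x[-2], x[-1], y))       # (x[-2], x[-1], y)
        features.add(("chopped", x[-1][:-2], y))        # (x[-1][:-2], y)

print(len(features), "distinct features fired on the corpus")
```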
In practice, the expectation over (x, y) ∼ p_XY is approximated by an empirical average over N samples (x^(1), y^(1)), . . . , (x^(N), y^(N)):

    J(w) = (1/N) Σ_{l=1}^N ( ln Σ_{y∈V} exp(w⊤φ(x^(l), y)) − w⊤φ(x^(l), y^(l)) )
Regularization penalizes large weights to prevent overfitting:
◮ L2: min_{w∈R^d} J(w) + λ ||w||_2^2, where ||w||_2^2 = Σ_{i=1}^d w_i^2
◮ L1: min_{w∈R^d} J(w) + λ ||w||_1, where ||w||_1 = Σ_{i=1}^d |w_i|
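In code, the penalty is just an extra term added to J(w) and its gradient. A sketch with a hypothetical λ:

```python
import numpy as np

lam = 0.1   # regularization strength lambda (a hypothetical value)

def l2_term(w):
    return lam * np.sum(w ** 2)      # lambda ||w||_2^2: shrinks all weights

def l1_term(w):
    return lam * np.sum(np.abs(w))   # lambda ||w||_1: drives weights to exactly 0

# During gradient-based training, the penalty adds to the gradient of J:
def l2_grad(w):
    return 2.0 * lam * w             # "weight decay"

def l1_grad(w):
    return lam * np.sign(w)          # subgradient (||w||_1 is not smooth at 0)
```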
Gradient descent template for minimizing J(θ):
1. Initialize θ_0 ∈ R^n.
2. For t = 0, 1, 2, . . .:
   2.1 Obtain ∆_t ∈ R^n such that J(θ_t + ∆_t) ≤ J(θ_t).
   2.2 Choose some "step size" η_t ∈ R.
   2.3 Set θ_{t+1} = θ_t + η_t ∆_t.
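A direct transcription of this template with ∆_t = −∇J(θ_t) and a fixed step size:

```python
import numpy as np

def gradient_descent(grad_J, theta0, eta=0.1, num_steps=100):
    """The template above with Delta_t = -grad J(theta_t), fixed eta_t = eta."""
    theta = theta0
    for t in range(num_steps):
        theta = theta - eta * grad_J(theta)   # theta_{t+1} = theta_t + eta * Delta_t
    return theta

# Example: J(theta) = ||theta||^2, grad J(theta) = 2 theta, minimizer theta = 0.
print(gradient_descent(lambda th: 2 * th, np.ones(3)))   # ~ [0. 0. 0.]
```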
Why is ∆_t = −∇J(θ_t) a valid choice? For any direction ∆, as ε → 0,

    J(θ + ε∆) = J(θ) + ε ∇J(θ)⊤∆ + o(ε)

so ∆ = −∇J(θ) makes the first-order term −ε ||∇J(θ)||_2^2 < 0: a small enough step along the negative gradient decreases J whenever ∇J(θ) ≠ 0.
Stochastic gradient descent (SGD): the objective decomposes over examples,

    J(w) = (1/N) Σ_{l=1}^N J^(l)(w),   J^(l)(w) = ln Σ_{y∈V} e^{w⊤φ(x^(l), y)} − w⊤φ(x^(l), y^(l))

so instead of computing the full gradient ∇_w J, update on one example at a time:
1. Initialize w_0 ∈ R^d.
2. For each epoch:
   2.1 For l ∈ shuffle({1 . . . N}), w_{t+1} = w_t − η_t ∇_w J^(l)(w_t)
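For the log-linear model, ∇_w J^(l)(w) has the closed form "expected features minus observed features", so one SGD epoch is a few lines of NumPy. This sketch reuses phi and V from the earlier log-linear sketch:

```python
import random
import numpy as np

def sgd_epoch(examples, w, phi, V, eta=0.1):
    """One SGD pass: w <- w - eta * grad J^(l)(w) for each shuffled example l.

    For the log-linear model,
      grad J^(l)(w) = sum_y q(y|x^(l)) phi(x^(l), y) - phi(x^(l), y^(l)),
    i.e., expected feature vector minus observed feature vector.
    """
    random.shuffle(examples)
    for x, y in examples:
        F = np.stack([phi(x, yp) for yp in V])   # |V| x d feature matrix
        s = F @ w
        p = np.exp(s - s.max())
        p /= p.sum()                              # q(.|x^(l)) under current w
        w = w - eta * (p @ F - F[V.index(y)])
    return w
```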
Summary:
◮ Gradient descent is a local search algorithm that can be applied to any differentiable objective; it finds a local, not necessarily global, minimum.
◮ Stochastic gradient descent is the cornerstone of modern machine learning.
Learning word representations
◮ Instead of manually designing features φ, can we learn them from data?
◮ Model parameter: now includes an embedding matrix E ∈ R^{|V|×d}
◮ E_w ∈ R^d: continuous dense representation ("embedding") of word w ∈ V
◮ If we define q(y|x) as a differentiable function of E, we learn these representations as part of training.
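In PyTorch, E is an nn.Embedding whose rows are the vectors E_w, trained by gradients like any other parameter. A minimal lookup sketch (vocabulary and dimension are made up):

```python
import torch

vocab = {"the": 0, "dog": 1, "chased": 2, "cat": 3, "mouse": 4}
d = 8

E = torch.nn.Embedding(len(vocab), d)   # E in R^{|V| x d}, a trainable parameter

ids = torch.tensor([vocab["the"], vocab["dog"]])
vecs = E(ids)                # rows E_the, E_dog; shape (2, d)
context = vecs.view(-1)      # concatenation [E_the; E_dog]; shape (2d,)
```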
A first neural language model (trigram)
◮ Parameters: E ∈ R^{|V|×d}, W ∈ R^{|V|×2d}
◮ Model: score each candidate y against the concatenated context embeddings,
    q(y|x) = exp(W_y · [E_{x[−2]}; E_{x[−1]}]) / Σ_{y′∈V} exp(W_{y′} · [E_{x[−2]}; E_{x[−1]}])
◮ Model estimation: minimize cross entropy (≡ MLE):
    min_{E ∈ R^{|V|×d}, W ∈ R^{|V|×2d}} E_{(x,y)∼p_XY} [−ln q(y|x)]
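A PyTorch sketch with shapes matching the parameters above (the word ids are made up); automatic differentiation then gives gradients with respect to both E and W:

```python
import torch

vocab_size, d = 1000, 50
E = torch.randn(vocab_size, d, requires_grad=True)       # E in R^{|V| x d}
W = torch.randn(vocab_size, 2 * d, requires_grad=True)   # W in R^{|V| x 2d}

def log_q(x2, x1):
    """log q(.|x) = log softmax over y of W_y . [E_{x[-2]}; E_{x[-1]}]."""
    context = torch.cat([E[x2], E[x1]])           # shape (2d,)
    return torch.log_softmax(W @ context, dim=0)  # shape (|V|,)

# -ln q(y|x) for made-up word ids; backward() fills E.grad and W.grad.
loss = -log_q(3, 7)[42]
loss.backward()
```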
Training a neural network means choosing a differentiable loss and minimizing it:
◮ Classification/language modeling: negative log-likelihood −Σ_i log p(y_i|x_i).
◮ Regression: squared loss Σ_i (y_i − f_i(x))^2.
We can train any differentiable model by minimizing −Σ_i log p(y_i|x_i) with (stochastic) gradient descent.
◮ Q. What's the catch?
Backpropagation
◮ J(θ): any loss function differentiable with respect to θ ∈ R^d
◮ The gradient of J with respect to θ at any point θ′ ∈ R^d can be computed mechanically by backpropagation (reverse-mode automatic differentiation), i.e., the chain rule applied over the computation graph of J.
◮ Note/code:
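In modern toolkits this is automatic. A PyTorch sketch with a made-up loss:

```python
import torch

theta = torch.randn(5, requires_grad=True)

# Any loss built from differentiable operations on theta; an arbitrary example:
J = (theta ** 2).sum() + torch.sin(theta).sum()

J.backward()       # reverse-mode autodiff = backpropagation
print(theta.grad)  # equals 2*theta + cos(theta), computed automatically
```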
Feedforward (n-gram) neural language model
◮ Parameters: E ∈ R^{|V|×d}, W ∈ R^{d′×nd}, V ∈ R^{|V|×d′}
◮ Model: embed and concatenate the n context words, apply a nonlinearity f, then score:
    h = f(W [E_{x[−n]}; . . . ; E_{x[−1]}]) ∈ R^{d′}
    q(y|x) = exp(V_y · h) / Σ_{y′∈V} exp(V_{y′} · h)
◮ Model estimation: minimize cross entropy (≡ MLE):
    min_{E, W, V} E_{(x,y)∼p_XY} [−ln q(y|x)]
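A sketch of this architecture as a PyTorch module, with f = tanh; note that nn.Linear adds bias terms not written above:

```python
import torch
import torch.nn as nn

class FeedforwardLM(nn.Module):
    """q(y|x) = softmax(V f(W [E_{x[-n]}; ...; E_{x[-1]}])), f = tanh."""

    def __init__(self, vocab_size, d, d_hidden, n):
        super().__init__()
        self.E = nn.Embedding(vocab_size, d)       # E in R^{|V| x d}
        self.W = nn.Linear(n * d, d_hidden)        # W in R^{d' x nd}
        self.V = nn.Linear(d_hidden, vocab_size)   # V in R^{|V| x d'}

    def forward(self, context):                    # context: (batch, n) word ids
        h = torch.tanh(self.W(self.E(context).flatten(1)))
        return torch.log_softmax(self.V(h), dim=-1)

model = FeedforwardLM(vocab_size=1000, d=50, d_hidden=100, n=2)
context = torch.randint(0, 1000, (4, 2))           # batch of 4 two-word contexts
y = torch.randint(0, 1000, (4,))
loss = nn.NLLLoss()(model(context), y)             # cross entropy on the batch
loss.backward()
```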
Do we need specialized architectures at all?
◮ In principle, we can learn any function with a large enough feedforward network.
◮ For example, using a giant feedforward network to cover instances of every possible context length is possible in principle but hopelessly wasteful.
◮ This tells us nothing about how to get there. How many layers, how many units, how many samples do we need?
◮ Specializing an architecture to a particular type of input (e.g., variable-length sequences) builds in structure that makes learning easier.
◮ "Right" architecture is absolutely critical in practice.
Recurrent neural language model: read the sequence left to right while maintaining a state vector.
◮ For i = 1 . . . N, compute
    h_i = f(W [E_{x_i}; h_{i−1}])   (h_0 is an initial state)
    q(y | x_1 . . . x_i) = exp(V_y · h_i) / Σ_{y′∈V} exp(V_{y′} · h_i)
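A sketch of this recurrence under the concatenation parametrization assumed above, with f = tanh (dimensions and word ids are made up):

```python
import torch
import torch.nn as nn

vocab_size, d, d_h = 1000, 50, 100
E = nn.Embedding(vocab_size, d)
W = nn.Linear(d + d_h, d_h)     # recurrent map [E_{x_i}; h_{i-1}] -> h_i
V = nn.Linear(d_h, vocab_size)  # output map h_i -> scores over the vocabulary

x = torch.randint(0, vocab_size, (6,))  # a sequence of 6 word ids
h = torch.zeros(d_h)                    # initial state h_0
for i in range(len(x)):                 # the same (V, W) at every position
    h = torch.tanh(W(torch.cat([E(x[i]), h])))
    log_q = torch.log_softmax(V(h), dim=0)   # log q(.|x_1 ... x_{i+1})
```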
◮ (V, W) are applied N times: the same parameters are shared across positions, so the model handles arbitrarily long sequences with a fixed number of parameters.
Long short-term memory (LSTM)
◮ An RNN produces a sequence of output vectors h_1 . . . h_N.
◮ An LSTM produces "memory cell vectors" c_1 . . . c_N along with output vectors h_1 . . . h_N.
◮ These c_1 . . . c_N enable the network to keep or drop information over long distances.
At each position i, an LSTM:
◮ Computes a masking vector for the memory cell: q_i ∈ (0, 1)^{d′}, a sigmoid function of E_{x_i} and h_{i−1}.
◮ Uses q_i to keep/forget dimensions in the previous memory cell: c_i = q_i ⊙ c_{i−1} + (new candidate content).
◮ Computes another masking vector for the output: o_i ∈ (0, 1)^{d′}, again a sigmoid function of E_{x_i} and h_{i−1}.
◮ Uses o_i to keep/forget dimensions in the current memory cell: h_i = o_i ⊙ tanh(c_i).
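A sketch of one LSTM step under the standard parametrization, which also includes an input gate that the bullets above do not name; each gate is a sigmoid function of [E_{x_i}; h_{i−1}]:

```python
import torch
import torch.nn as nn

d, d_h = 50, 100   # embedding and state dimensions

def gate():
    """A masking vector: sigmoid layer over [E_{x_i}; h_{i-1}], values in (0,1)."""
    return nn.Sequential(nn.Linear(d + d_h, d_h), nn.Sigmoid())

forget_gate, input_gate, output_gate = gate(), gate(), gate()
candidate = nn.Sequential(nn.Linear(d + d_h, d_h), nn.Tanh())  # new cell content

def lstm_step(e_x, h_prev, c_prev):
    z = torch.cat([e_x, h_prev])
    q = forget_gate(z)                              # q_i: keep/forget old cell dims
    c = q * c_prev + input_gate(z) * candidate(z)   # updated memory cell c_i
    o = output_gate(z)                              # o_i: mask the current cell
    h = o * torch.tanh(c)                           # output vector h_i
    return h, c

h, c = lstm_step(torch.randn(d), torch.zeros(d_h), torch.zeros(d_h))
```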