SLIDE 1

CS 533: Natural Language Processing

From Log-Linear to Neural Language Models

Karl Stratos

Rutgers University

SLIDE 2

Agenda

  • 1. Loose ends (STOP symbol, Zipf’s law)
  • 2. Log-linear language models

◮ Gradient descent

  • 3. Neural language models

◮ Feedforward
◮ Recurrent

SLIDE 3

Zipf’s Law

w_1, . . . , w_{|V|} ∈ V sorted in decreasing probability, with p(w_i) = 2 p(w_{i+1}).

First four words: 93% of the unigram probability mass?
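One way to see where the 93% comes from: under the assumed decay, p(w_i) ≈ 2^{-i} (up to normalization), so

$$\sum_{i=1}^{4} p(w_i) \approx \frac{1}{2} + \frac{1}{4} + \frac{1}{8} + \frac{1}{16} = \frac{15}{16} \approx 93.75\%$$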

[Figure: unigram probability mass of the most frequent words: the, ",", ".", to, of, in, and, a.]
SLIDE 4

Zipf’s Law: Empirical

[Figure: empirical word frequencies by rank (y-axis: Frequency, roughly 10,000 to 60,000) compared against a Zipf curve; the most frequent words are the, ",", ".", to, of, in, and, a, followed by a long tail of lower-frequency words.]
SLIDE 5

Log-Linear Language Model

◮ Random variables: context x (e.g., previous n words), next word y
◮ Assumes a feature function φ(x, y) ∈ {0, 1}^d
◮ Model parameter: weight vector w ∈ R^d
◮ Model: for any (x, y)

$$q_{\phi,w}(y \mid x) = \frac{e^{w^\top \phi(x,y)}}{\sum_{y' \in V} e^{w^\top \phi(x,y')}}$$

◮ Model estimation: minimize cross entropy (≡ MLE)

$$w^* = \operatorname*{arg\,min}_{w \in \mathbb{R}^d} \; \mathbb{E}_{(x,y) \sim p_{XY}}\left[ -\ln q_{\phi,w}(y \mid x) \right]$$
SLIDE 6

Example: Feature Extraction

Corpus:

◮ the dog chased the cat
◮ the cat chased the mouse
◮ the mouse chased the dog

Feature template

◮ (x[−1], y)
◮ (x[−2], y)
◮ (x[−2], x[−1], y)
◮ (x[−1][: −2], y)

How many features do we extract from the corpus (what is d)?

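A minimal sketch of one way to count the distinct features these templates fire on the corpus (assuming a two-word context, ignoring boundary symbols, and reading the last template as the final two characters of the previous word, as in the example on the next slide):

```python
corpus = [
    "the dog chased the cat".split(),
    "the cat chased the mouse".split(),
    "the mouse chased the dog".split(),
]

features = set()
for sent in corpus:
    # Start at the third word so a full two-word context exists
    # (boundary/STOP handling is ignored in this sketch).
    for i in range(2, len(sent)):
        prev2, prev1, y = sent[i - 2], sent[i - 1], sent[i]
        features.add(("x[-1]", prev1, y))
        features.add(("x[-2]", prev2, y))
        features.add(("x[-2],x[-1]", prev2, prev1, y))
        features.add(("suffix(x[-1])", prev1[-2:], y))  # e.g. ("ed", "the")

print(len(features))  # d = number of distinct features extracted
```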
SLIDE 7

Example: Score of (x, y)

For any (x, y), its “score” given by parameter w ∈ R^d is

$$w^\top \phi(x, y) = \sum_{i = 1, \ldots, d:\ \phi_i(x, y) = 1} w_i$$

Example: x = mouse chased

w⊤φ(mouse chased, the) = w_{(-1)chased, the} + w_{(-2)mouse, the} + w_{(-2)mouse (-1)chased, the} + w_{(-1:-2)ed, the}

w⊤φ(mouse chased, chased) = w_{(-1)chased, chased} + w_{(-2)mouse, chased} + w_{(-2)mouse (-1)chased, chased} + w_{(-1:-2)ed, chased}

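Continuing the sketch, the score of a candidate next word is just the sum of the weights of its active features; the weight values below are made up for illustration:

```python
from collections import defaultdict

w = defaultdict(float)  # feature -> weight; unseen features contribute 0
w[("x[-1]", "chased", "the")] = 1.2
w[("x[-2]", "mouse", "the")] = 0.4
w[("x[-2],x[-1]", "mouse", "chased", "the")] = 0.7
w[("suffix(x[-1])", "ed", "the")] = -0.1

def score(prev2, prev1, y):
    """w^T phi(x, y): sum the weights of the features that fire on (x, y)."""
    active = [
        ("x[-1]", prev1, y),
        ("x[-2]", prev2, y),
        ("x[-2],x[-1]", prev2, prev1, y),
        ("suffix(x[-1])", prev1[-2:], y),
    ]
    return sum(w[f] for f in active)

print(score("mouse", "chased", "the"))     # 1.2 + 0.4 + 0.7 - 0.1 = 2.2
print(score("mouse", "chased", "chased"))  # 0.0 (no stored weights fire)
```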
SLIDE 8

Empirical Objective

$$\mathbb{E}_{(x,y) \sim p_{XY}}\left[ -\ln q_{\phi,w}(y \mid x) \right] \approx \frac{1}{N} \sum_{l=1}^{N} -\ln q_{\phi,w}(y^{(l)} \mid x^{(l)}) = \underbrace{\frac{1}{N} \sum_{l=1}^{N} \left[ \ln\left( \sum_{y \in V} e^{w^\top \phi(x^{(l)}, y)} \right) - w^\top \phi(x^{(l)}, y^{(l)}) \right]}_{J(w)}$$

When is J(w) minimized?

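A small numpy sketch (not the course's code) of the per-example term of J(w), with the feature vectors φ(x^{(l)}, y) for all candidate y precomputed as rows of a matrix:

```python
import numpy as np

def example_loss(w, Phi, y_true):
    """Phi[y] = phi(x, y) for every candidate y; returns -ln q(y_true | x)."""
    scores = Phi @ w                         # w . phi(x, y) for all y, shape (|V|,)
    log_Z = np.logaddexp.reduce(scores)      # numerically stable log sum_y exp(score_y)
    return log_Z - scores[y_true]

rng = np.random.default_rng(0)
Phi = (rng.random((5, 8)) < 0.3).astype(float)   # toy binary features: |V| = 5, d = 8
w = rng.normal(size=8)
print(example_loss(w, Phi, y_true=2))
```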
SLIDE 9

Regularization

Ways to make sure w doesn’t overfit training data

  • 1. Early stopping: stop training when validation performance stops improving

  • 2. Explicit regularization term

$$\min_{w \in \mathbb{R}^d} \; J(w) + \lambda \underbrace{\sum_{i=1}^{d} w_i^2}_{\|w\|_2^2} \qquad \text{or} \qquad \min_{w \in \mathbb{R}^d} \; J(w) + \lambda \underbrace{\sum_{i=1}^{d} |w_i|}_{\|w\|_1}$$

  • 3. Other techniques (e.g., dropout)

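In code, the explicit penalty is a one-line addition to the training objective; a small sketch (the helper name and the value of λ are illustrative):

```python
import numpy as np

def regularized_objective(J_w, w, lam=0.1, kind="l2"):
    """Add an L2 (sum of squares) or L1 (sum of absolute values) penalty to J(w)."""
    penalty = np.sum(w ** 2) if kind == "l2" else np.sum(np.abs(w))
    return J_w + lam * penalty
```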
SLIDE 10

Gradient Descent

Minimize f(x) = x³ + 2x² − x − 1 over x (plot courtesy of FooPlot)

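A minimal sketch of gradient descent on this function (starting point and step size are my own choices; since the cubic is unbounded below, the iterates settle at the local minimum near x ≈ 0.215 rather than a global one):

```python
# A sketch of gradient descent on f(x) = x^3 + 2x^2 - x - 1.
def f(x):
    return x**3 + 2 * x**2 - x - 1

def f_prime(x):
    return 3 * x**2 + 4 * x - 1

x = 1.0                       # initial point (illustrative choice)
eta = 0.05                    # step size (illustrative choice)
for _ in range(100):
    x = x - eta * f_prime(x)  # x_{t+1} = x_t - eta * f'(x_t)

print(x, f(x))                # x converges to the local minimum near 0.215
```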
SLIDE 11

Local Search

Input: training objective J(θ) ∈ R, number of iterations T
Output: parameter θ̂ ∈ R^d such that J(θ̂) is small

  • 1. Initialize θ0 (e.g., randomly).
  • 2. For t = 0 . . . T − 1,

2.1 Obtain ∆_t ∈ R^d such that J(θ_t + ∆_t) ≤ J(θ_t).
2.2 Choose some “step size” η_t ∈ R.
2.3 Set θ_{t+1} = θ_t + η_t ∆_t.

  • 3. Return θT .

What is a good ∆t?

SLIDE 12

Gradient of the Objective at the Current Parameter

At θ_t ∈ R^n, the rate of increase (of the value of J) along a direction u ∈ R^n (i.e., ||u||_2 = 1) is given by the directional derivative

$$\nabla_u J(\theta_t) := \lim_{\epsilon \to 0} \frac{J(\theta_t + \epsilon u) - J(\theta_t)}{\epsilon}$$

The gradient of J at θ_t is defined to be a vector ∇J(θ_t) such that

$$\nabla_u J(\theta_t) = \nabla J(\theta_t) \cdot u \qquad \forall u \in \mathbb{R}^n$$

Therefore (by Cauchy-Schwarz, ∇J(θ_t) · u is most negative when u points opposite the gradient), the direction of the greatest rate of decrease is given by −∇J(θ_t) / ||∇J(θ_t)||_2.

SLIDE 13

Gradient Descent

Input: training objective J(θ) ∈ R, number of iterations T
Output: parameter θ̂ ∈ R^n such that J(θ̂) is small

  • 1. Initialize θ0 (e.g., randomly).
  • 2. For t = 0 . . . T − 1,

θ_{t+1} = θ_t − η_t ∇J(θ_t)

  • 3. Return θT .

When J(θ) is additionally convex (as in linear regression), gradient descent converges to an optimal solution (for appropriate step sizes).

SLIDE 14

Stochastic Gradient Descent for Log-Linear Model

Input: training objective

$$J(w) = \frac{1}{N} \sum_{l=1}^{N} J^{(l)}(w), \qquad J^{(l)}(w) = \ln\left( \sum_{y \in V} e^{w^\top \phi(x^{(l)}, y)} \right) - w^\top \phi(x^{(l)}, y^{(l)})$$

number of iterations T (“epochs”)

  • 1. Initialize w0 (e.g., randomly).
  • 2. For t = 0 . . . T − 1,

2.1 For l ∈ shuffle({1 . . . N}): w_{t+1} = w_t − η_t ∇_w J^{(l)}(w_t)

  • 3. Return wT .

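A skeletal version of this loop in Python, assuming w is a numpy array and grad_example computes ∇_w J^{(l)}(w) (derived on the next slide); all names are illustrative:

```python
import random

def sgd(w, data, grad_example, epochs=10, eta=0.1):
    """data: list of (x, y) pairs; grad_example(w, x, y) returns the gradient of J^(l) at w."""
    for t in range(epochs):
        indices = list(range(len(data)))
        random.shuffle(indices)                   # l in shuffle({1 ... N})
        for l in indices:
            x, y = data[l]
            w = w - eta * grad_example(w, x, y)   # w <- w - eta * grad J^(l)(w)
        # (decaying eta over epochs, e.g. eta / (1 + t), is a common refinement)
    return w
```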
SLIDE 15

Gradient Derivation

Board

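For reference, the standard result of that derivation (differentiate J^{(l)} and use the definition of q_{φ,w}): the gradient is the model's expected feature vector minus the observed one,

$$\nabla_w J^{(l)}(w) = \sum_{y \in V} q_{\phi,w}(y \mid x^{(l)})\, \phi(x^{(l)}, y) \; - \; \phi(x^{(l)}, y^{(l)})$$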
SLIDE 16

Summary of Gradient Descent

◮ Gradient descent is a local search algorithm that can be used to optimize any differentiable objective function.

◮ Stochastic gradient descent is the cornerstone of modern large-scale machine learning.

SLIDE 17

Word Vectors

◮ Instead of manually designing features φ, can we learn features themselves?

◮ Model parameter: now includes E ∈ R^{|V|×d}

◮ E_w ∈ R^d: continuous dense representation of word w ∈ V

◮ If we define q(y|x) as a differentiable function of E, we learn E itself.

SLIDE 18

Simple Model?

◮ Parameters: E ∈ R^{|V|×d}, W ∈ R^{|V|×2d}
◮ Model:

$$q_{E,W}(y \mid x) = \operatorname{softmax}_y\!\left( W \begin{bmatrix} E_{x[-1]} \\ E_{x[-2]} \end{bmatrix} \right)$$

◮ Model estimation: minimize cross entropy (≡ MLE)

$$E^*, W^* = \operatorname*{arg\,min}_{E \in \mathbb{R}^{|V| \times d},\; W \in \mathbb{R}^{|V| \times 2d}} \; \mathbb{E}_{(x,y) \sim p_{XY}}\left[ -\ln q_{E,W}(y \mid x) \right]$$
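A numpy sketch of this model's forward computation (sizes, initialization, and the word indices below are placeholders; in practice E and W are trained with SGD and backprop):

```python
import numpy as np

V_size, d = 1000, 50                               # vocabulary size |V|, embedding dim
rng = np.random.default_rng(0)
E = rng.normal(scale=0.1, size=(V_size, d))        # word embeddings, |V| x d
W = rng.normal(scale=0.1, size=(V_size, 2 * d))    # output weights, |V| x 2d

def q(x_prev1, x_prev2):
    """q(. | x) = softmax(W [E_{x[-1]}; E_{x[-2]}]) as a length-|V| vector."""
    h = np.concatenate([E[x_prev1], E[x_prev2]])   # stacked embeddings, 2d-dim
    scores = W @ h
    scores -= scores.max()                         # subtract max for numerical stability
    p = np.exp(scores)
    return p / p.sum()

print(q(x_prev1=3, x_prev2=17).sum())              # sums to 1.0
```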
SLIDE 19

Neural Network

Just a composition of linear/nonlinear functions.

$$f(x) = W^{(L)} \tanh\!\left( W^{(L-1)} \cdots \tanh\!\left( W^{(1)} x \right) \cdots \right)$$

More like a paradigm than a specific model:

  • 1. Transform your input x → f(x).
  • 2. Define a loss between f(x) and the target label y.
  • 3. Train parameters by minimizing the loss.

SLIDE 20

You’ve Already Seen Some Neural Networks. . .

A log-linear model is a neural network with zero hidden layers and a softmax output layer:

$$p(y \mid x) := \frac{\exp([Wx]_y)}{\sum_{y'} \exp([Wx]_{y'})} = \operatorname{softmax}_y(Wx)$$

Get W by minimizing $L(W) = -\sum_i \log p(y_i \mid x_i)$.

Linear regression is a neural network with zero hidden layers and the identity output layer: $f(x) := Wx$. Get W by minimizing $L(W) = \sum_i (y_i - f(x_i))^2$.

SLIDE 21

Feedforward Network

Think: log-linear with extra transformation.

With 1 hidden layer:

$$h^{(1)} = \tanh(W^{(1)} x), \qquad p(y \mid x) = \operatorname{softmax}_y(h^{(1)})$$

With 2 hidden layers:

$$h^{(1)} = \tanh(W^{(1)} x), \qquad h^{(2)} = \tanh(W^{(2)} h^{(1)}), \qquad p(y \mid x) = \operatorname{softmax}_y(h^{(2)})$$

Again, get parameters W^{(l)} by minimizing $-\sum_i \log p(y_i \mid x_i)$.

◮ Q. What’s the catch?

SLIDE 22

Training = Loss Minimization

We can decrease any differentiable loss by following the gradient.

  • 1. Differentiate the loss wrt. model parameters (backprop)
  • 2. Take a gradient step

SLIDE 23

Backpropagation

◮ J(θ): any loss function differentiable with respect to θ ∈ R^d
◮ The gradient of J with respect to θ at some point θ′ ∈ R^d, ∇_θ J(θ′) ∈ R^d, can be calculated automatically by backpropagation.

◮ Note/code: http://karlstratos.com/notes/backprop.pdf

SLIDE 24

Bengio et al. (2003)

◮ Parameters: E ∈ R^{|V|×d}, W ∈ R^{d′×nd}, V ∈ R^{|V|×d′}
◮ Model:

$$q_{E,W,V}(y \mid x) = \operatorname{softmax}_y\!\left( V \tanh\!\left( W \begin{bmatrix} E_{x[-1]} \\ \vdots \\ E_{x[-n]} \end{bmatrix} \right) \right)$$

◮ Model estimation: minimize cross entropy (≡ MLE)

$$E^*, W^*, V^* = \operatorname*{arg\,min}_{E \in \mathbb{R}^{|V| \times d},\; W \in \mathbb{R}^{d' \times nd},\; V \in \mathbb{R}^{|V| \times d'}} \; \mathbb{E}_{(x,y) \sim p_{XY}}\left[ -\ln q_{E,W,V}(y \mid x) \right]$$
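A numpy sketch of the forward pass (the output matrix is renamed V_out to avoid clashing with the vocabulary V; sizes and values are placeholders, and the original model also includes bias terms and an optional direct connection that are omitted here):

```python
import numpy as np

V_size, d, d_hidden, n = 1000, 50, 100, 3      # |V|, embedding dim d, hidden dim d', context size n
rng = np.random.default_rng(0)
E = rng.normal(scale=0.1, size=(V_size, d))             # embeddings, |V| x d
W = rng.normal(scale=0.1, size=(d_hidden, n * d))       # hidden layer, d' x nd
V_out = rng.normal(scale=0.1, size=(V_size, d_hidden))  # output layer, |V| x d'

def q(context):
    """context: n word indices [x[-1], ..., x[-n]]; returns q(. | x) over the vocabulary."""
    e = np.concatenate([E[i] for i in context])   # stacked embeddings, nd-dim
    h = np.tanh(W @ e)                            # hidden representation, d'-dim
    scores = V_out @ h
    scores -= scores.max()                        # numerical stability
    p = np.exp(scores)
    return p / p.sum()

print(q([3, 17, 42]).argmax())                    # index of the most likely next word
```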
SLIDE 25

Bengio et al. (2003): Continued

SLIDE 26

Collobert and Weston (2008)

Nearest neighbors of trained word embeddings E ∈ R^{|V|×d}: https://ronan.collobert.com/pub/matos/2008_nlp_icml.pdf

SLIDE 27

Neural Networks are (Finite-Sample) Universal Learners!

  • Theorem. (Zhang et al., 2016) Give me any
  • 1. Set of n samples S = {x^{(1)}, . . . , x^{(n)}} ⊂ R^d
  • 2. Function f : S → R that assigns some arbitrary value f(x^{(i)}) to each i = 1 . . . n

Then I can specify a 1-hidden-layer feedforward network C : S → R with 2n + d parameters such that C(x^{(i)}) = f(x^{(i)}) for all i = 1 . . . n.

Proof.

Define C(x) = w⊤ relu((a⊤x, . . . , a⊤x) − b), where w, b ∈ R^n and a ∈ R^d are the network parameters. Choose a, b so that the matrix A_{i,j} := max(0, a⊤x^{(i)} − b_j) is triangular with a nonzero diagonal. Solve for w in

$$\begin{pmatrix} f(x^{(1)}) \\ \vdots \\ f(x^{(n)}) \end{pmatrix} = A w$$

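A small numpy check of this construction on random data (the particular choice of b, half the smallest gap between the sorted projections, is just one way to make A triangular; the projections are distinct with probability one):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 5
X = rng.normal(size=(n, d))                # n samples in R^d
f = rng.normal(size=n)                     # arbitrary target values f(x^(i))

a = rng.normal(size=d)
z = X @ a                                  # z_i = a^T x^(i)
order = np.argsort(z)
X, f, z = X[order], f[order], z[order]     # relabel so z_1 < z_2 < ... < z_n
eps = np.min(np.diff(z)) / 2               # half the smallest gap between sorted projections
b = z - eps                                # then a^T x^(i) - b_j > 0 exactly when i >= j
A = np.maximum(0.0, z[:, None] - b[None, :])  # A_{ij} = max(0, a^T x^(i) - b_j), lower triangular
w = np.linalg.solve(A, f)                  # solve A w = f

def C(x):
    return w @ np.maximum(0.0, a @ x - b)  # C(x) = w^T relu(a^T x - b), 2n + d parameters

print(np.max(np.abs(np.array([C(x) for x in X]) - f)))  # ~0: exact fit of all n values
```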
SLIDE 28

So Why Not Use a Simple Feedforward for Everything?

Computational reasons

◮ For example, using a giant feedforward to cover instances of different sizes is clearly inefficient.

Empirical reasons

◮ In principle, we can learn any function.
◮ This tells us nothing about how to get there. How many samples do we need? How can we find the right parameters?
◮ Specializing an architecture to a particular type of computation allows us to incorporate inductive bias.

◮ “Right” architecture is absolutely critical in practice.

SLIDE 29

Recurrent Neural Network (RNN)

Think: HMM (or Kalman filter) with extra transformation.

Input: sequence x_1 . . . x_N ∈ R^d

◮ For i = 1 . . . N,

$$h_i = \tanh(W x_i + V h_{i-1})$$

Output: sequence h_1 . . . h_N ∈ R^{d'}

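A minimal numpy sketch of the recurrence (dimensions are illustrative, and h_0 is taken to be the zero vector, one common convention):

```python
import numpy as np

d, d_hidden, N = 8, 16, 5
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(d_hidden, d))
V = rng.normal(scale=0.1, size=(d_hidden, d_hidden))
xs = rng.normal(size=(N, d))        # input sequence x_1 ... x_N

h = np.zeros(d_hidden)              # h_0 (zero initial state)
hs = []
for x in xs:
    h = np.tanh(W @ x + V @ h)      # h_i = tanh(W x_i + V h_{i-1})
    hs.append(h)

print(len(hs), hs[-1].shape)        # N output vectors h_1 ... h_N in R^{d'}
```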
SLIDE 30

RNN ≈ Deep Feedforward

Unroll the expression for the last output vector h_N:

$$h_N = \tanh\!\left( W x_N + V \tanh\!\left( \cdots + V \tanh\!\left( W x_1 + V h_0 \right) \cdots \right) \right)$$

It’s just a deep “feedforward network” with one important difference: parameters are reused.

◮ (V, W) are applied N times

Training: do backprop on this unrolled network, update parameters

SLIDE 31

LSTM

◮ RNN produces a sequence of output vectors: x_1 . . . x_N → h_1 . . . h_N

◮ LSTM produces “memory cell vectors” along with output: x_1 . . . x_N → c_1 . . . c_N, h_1 . . . h_N

◮ These c_1 . . . c_N enable the network to keep or drop information from previous states.

SLIDE 32

LSTM: Details

At each time step i,

◮ Compute a masking vector for the memory cell:

$$q_i = \sigma\!\left( U^{q} x + V^{q} h_{i-1} + W^{q} c_{i-1} \right) \in [0, 1]^{d'}$$

◮ Use q_i to keep/forget dimensions in previous memory cell:

$$c_i = (1 - q_i) \odot c_{i-1} + q_i \odot \tanh\!\left( U^{c} x + V^{c} h_{i-1} \right)$$

◮ Compute another masking vector for the output:

$$o_i = \sigma\!\left( U^{o} x + V^{o} h_{i-1} + W^{o} c_i \right) \in [0, 1]^{d'}$$

◮ Use o_i to keep/forget dimensions in current memory cell:

$$h_i = o_i \odot \tanh(c_i)$$

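A numpy sketch of one step of this update, packing the weight matrices into a dict (shapes, initialization, and names are my own choices; a standard LSTM also has bias terms and separate input/forget gates rather than the single coupled mask q_i):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, P):
    """P is a dict of weight matrices U*, V*, W* keyed as on the slide."""
    q = sigmoid(P["Uq"] @ x + P["Vq"] @ h_prev + P["Wq"] @ c_prev)   # memory-cell mask
    c = (1 - q) * c_prev + q * np.tanh(P["Uc"] @ x + P["Vc"] @ h_prev)
    o = sigmoid(P["Uo"] @ x + P["Vo"] @ h_prev + P["Wo"] @ c)        # output mask
    h = o * np.tanh(c)
    return h, c

d, d_hidden = 8, 16
rng = np.random.default_rng(0)
P = {k: rng.normal(scale=0.1, size=(d_hidden, d if k[0] == "U" else d_hidden))
     for k in ["Uq", "Vq", "Wq", "Uc", "Vc", "Uo", "Vo", "Wo"]}
h, c = np.zeros(d_hidden), np.zeros(d_hidden)
h, c = lstm_step(rng.normal(size=d), h, c, P)
print(h.shape, c.shape)
```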