

SLIDE 1

Neural Language Models

CMSC 723 / LING 723 / INST 725 MARINE CARPUAT

marine@cs.umd.edu

With slides from Graham Neubig and Philipp Koehn

SLIDE 2

Roadmap

  • Modeling Sequences

– First example: language model
– What are n-gram models?
– How to estimate them?
– How to evaluate them?
– Neural language models

SLIDE 3

Probabilistic Language Modeling

  • Goal: compute the probability of a sentence or sequence of words: P(W) = P(w1, w2, w3, w4, w5 … wn)

  • Related task: probability of an upcoming word: P(w5 | w1, w2, w3, w4)

  • A model that computes either of these, P(W) or P(wn | w1, w2 … wn-1), is called a language model.
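To make the definition concrete, here is a minimal Python sketch (an assumed example, not from the slides) of a toy bigram language model that scores a sentence as a product of conditional probabilities:

```python
# Minimal sketch (assumed example): scoring a sentence with a toy bigram model,
# P(W) = prod_i P(w_i | w_{i-1}), using made-up probabilities.
bigram_prob = {
    ("<s>", "i"): 0.2, ("i", "saw"): 0.1, ("saw", "a"): 0.3, ("a", "cat"): 0.05,
}

def sentence_prob(words):
    prob = 1.0
    prev = "<s>"  # sentence-start symbol
    for w in words:
        prob *= bigram_prob.get((prev, w), 1e-10)  # unseen bigrams get a tiny probability
        prev = w
    return prob

print(sentence_prob(["i", "saw", "a", "cat"]))  # 0.2 * 0.1 * 0.3 * 0.05 = 3e-4
```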

SLIDE 4

Evaluation: How good is our model?

  • Does our language model prefer good sentences to bad ones?

– Assign higher probability to “real” or “frequently observed” sentences
– than to “ungrammatical” or “rarely observed” sentences
  • Extrinsic vs intrinsic evaluation
SLIDE 5

Intrinsic evaluation: intuition

  • The Shannon Game:

– How well can we predict the next word?
– Unigrams are terrible at this game. (Why?)

  • A better model of a text assigns a higher probability to the word that actually occurs

I always order pizza with cheese and ____
The 33rd President of the US was ____
I saw a ____

Candidate continuations for the first blank:
mushrooms 0.1
pepperoni 0.1
anchovies 0.01
….
fried rice 0.0001
….
and 1e-100

SLIDE 6

Intrinsic evaluation metric: perplexity

The best language model is one that best predicts an unseen test set:

  • Gives the highest P(sentence)

Perplexity is the inverse probability of the test set, normalized by the number of words:

PP(W) = P(w_1 w_2 \dots w_N)^{-1/N} = \sqrt[N]{\frac{1}{P(w_1 w_2 \dots w_N)}}

Chain rule: PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 \dots w_{i-1})}}

For bigrams: PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}}

Minimizing perplexity is the same as maximizing probability.
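A minimal Python sketch (assumed code, not from the slides) of computing perplexity from the per-word probabilities a model assigns to a test set:

```python
import math

def perplexity(word_probs):
    """Perplexity from the probabilities a model assigns to each word of the test set:
    PP = exp(-(1/N) * sum(log p_i)), i.e. the N-th root of 1 / P(w_1 ... w_N)."""
    n = len(word_probs)
    log_prob = sum(math.log(p) for p in word_probs)
    return math.exp(-log_prob / n)

# Hypothetical per-word probabilities for a 4-word test set
print(perplexity([0.2, 0.1, 0.05, 0.3]))  # about 7.6
```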

SLIDE 7

Perplexity as branching factor

  • Let’s suppose a sentence consisting of N random digits

  • What is the perplexity of this sentence according to a model that assigns P = 1/10 to each digit?
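Working through the perplexity formula from slide 6 gives the answer:

PP(W) = P(w_1 \dots w_N)^{-1/N} = \left(\left(\tfrac{1}{10}\right)^{N}\right)^{-1/N} = 10

So the model is, on average, as uncertain as if it were choosing uniformly among 10 alternatives at every position, which is why perplexity is read as a branching factor.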

SLIDE 8

Lower perplexity = better model

  • Training 38 million words, test 1.5 million words, WSJ

N-gram Order:   Unigram   Bigram   Trigram
Perplexity:     962       170      109

SLIDE 9

Pros and cons of n-gram models

  • N-gram models

– Really easy to build, can train on billions and billions of words
– Smoothing helps generalize to new data
– Only work well for word prediction if the test corpus looks like the training corpus
– Only capture short-distance context

“Smarter” LMs can address some of these issues, but they are orders of magnitude slower…

SLIDE 10

Roadmap

  • Modeling Sequences

– First example: language model
– What are n-gram models?
– How to estimate them?
– How to evaluate them?
– Neural language models

SLIDE 11

NEURAL NETWORKS

Aside

SLIDE 12

Recall the person/not-person classification problem

Given an introductory sentence in Wikipedia, predict whether the article is about a person

SLIDE 13

Formalizing binary prediction

SLIDE 14

The Perceptron:

a “machine” to calculate a weighted sum

y = \mathrm{sign}\left( \sum_{j=1}^{J} w_j \cdot \phi_j(x) \right)

Example feature values (word counts from the example sentence):
φ“A” = 1, φ“site” = 1, φ“,” = 2, φ“located” = 1, φ“in” = 1, φ“Maizuru” = 1, φ“Kyoto” = 1, φ“priest” = 0, φ“black” = 0
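A minimal Python sketch (assumed code with hypothetical weights, not from the slides) of the perceptron decision rule applied to the feature counts above:

```python
def perceptron_predict(weights, features):
    """Return sign(sum_j w_j * phi_j), the perceptron decision rule."""
    score = sum(weights.get(name, 0.0) * value for name, value in features.items())
    return 1 if score >= 0 else -1

# Hypothetical weights, plus the feature counts from the example sentence
weights = {"priest": 2.0, "site": -3.0, "located": -1.0}
features = {"A": 1, "site": 1, ",": 2, "located": 1, "in": 1, "Maizuru": 1, "Kyoto": 1}
print(perceptron_predict(weights, features))  # -1: predicted "not a person"
```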
SLIDE 15

The Perceptron: Geometric interpretation

(Figure: examples labeled O and X in feature space.)

SLIDE 16

The Perceptron: Geometric interpretation

(Figure: the same O and X examples, separated by a line.)

SLIDE 17

Limitation of perceptron

  • Can only find linear separations between positive and negative examples

(Figure: an XOR-like arrangement of X and O examples that no single line can separate.)

SLIDE 18

Neural Networks

  • Connect together multiple perceptrons

  • Motivation: Can represent non-linear functions!
SLIDE 19

Neural Networks: key terms

  • Input (aka features)
  • Output
  • Nodes
  • Layers
  • Activation function (non-linear)
  • Multi-layer perceptron

SLIDE 20

Example

  • Create two classifiers

Four points in the original space (an XOR-like arrangement of X and O):
φ0(x1) = {-1, 1}, φ0(x2) = {1, 1}, φ0(x3) = {-1, -1}, φ0(x4) = {1, -1}

(Figure: two sign units; each takes φ0[0], φ0[1] and a bias input of 1, with weights w0,0, b0,0 and w0,1, b0,1, and outputs φ1[0] and φ1[1] respectively.)

SLIDE 21

Example

  • These classifiers map to a new space

φ0(x1) = {-1, 1}, φ0(x2) = {1, 1}, φ0(x3) = {-1, -1}, φ0(x4) = {1, -1}

are mapped to

φ1(x1) = {-1, -1}, φ1(x2) = {1, -1}, φ1(x3) = {-1, 1}, φ1(x4) = {-1, -1}

(Figure: the original O/X points and their images in the new φ1 space.)

SLIDE 22

Example

  • In new space, the examples are linearly separable!

φ1(x1) = {-1, -1}, φ1(x2) = {1, -1}, φ1(x3) = {-1, 1}, φ1(x4) = {-1, -1}

(Figure: a final sign unit takes φ1[0], φ1[1] and a bias input of 1 and outputs φ2[0] = y, separating the two classes with a line.)

SLIDE 23

Example wrap-up: Forward propagation

  • The final net

(Figure: the final network. Two tanh units map φ0[0], φ0[1] and a bias of 1 to φ1[0] and φ1[1]; a third tanh unit maps φ1[0], φ1[1] and a bias of 1 to the output φ2[0].)
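A minimal Python sketch (assumed code; the weights are one choice that reproduces the example, not necessarily the slide's exact values) of forward propagation through this net:

```python
import numpy as np

def forward(phi0):
    """Forward propagation: two tanh hidden units, then one tanh output unit."""
    h1 = np.tanh(1 * phi0[0] + 1 * phi0[1] - 1)   # phi1[0]: fires for {1, 1}
    h2 = np.tanh(-1 * phi0[0] - 1 * phi0[1] - 1)  # phi1[1]: fires for {-1, -1}
    return np.tanh(1 * h1 + 1 * h2 + 1)           # phi2[0] = y

for point in ([-1, 1], [1, 1], [1, -1], [-1, -1]):
    print(point, round(float(forward(point)), 2))
# {1,1} and {-1,-1} get positive outputs; {-1,1} and {1,-1} get negative outputs
```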

SLIDE 24


Softmax Function for multiclass classification

  • Sigmoid function for multiple classes
  • Can be expressed using matrix/vector ops

P(y = z \mid x) = \frac{e^{\,w_z \cdot \phi(x)}}{\sum_{\tilde z} e^{\,w_{\tilde z} \cdot \phi(x)}}

(numerator: score of the current class; denominator: sum over all classes)

Matrix/vector form:

\mathbf{s} = \exp\big(W \cdot \phi(x)\big), \qquad \mathbf{q} = \frac{\mathbf{s}}{\sum_{s \in \mathbf{s}} s}
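A minimal numpy sketch (assumed code, not from the slides) of the matrix/vector form: exponentiate the scores of all classes, then normalize so the result sums to one:

```python
import numpy as np

def softmax(scores):
    """Turn a vector of class scores into probabilities: exp(s) / sum(exp(s))."""
    s = np.exp(scores - np.max(scores))  # subtract max for numerical stability
    return s / s.sum()

W = np.array([[0.5, -1.0], [0.2, 0.3], [-0.4, 0.8]])  # one weight row per class (hypothetical)
phi = np.array([1.0, 2.0])                            # feature vector
print(softmax(W @ phi))                               # probabilities over 3 classes, sums to 1
```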

SLIDE 25

Stochastic Gradient Descent

Online training algorithm for probabilistic models:

    w = 0
    for I iterations:
        for each labeled pair (x, y) in the data:
            w += α * dP(y|x)/dw

In other words:

  • For every training example, calculate the gradient (the direction that will increase the probability of y)

  • Move in that direction, multiplied by learning rate α
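A minimal Python sketch (assumed code, not from the slides) of this procedure for the binary sigmoid model, using the gradient derived on the next slide:

```python
import numpy as np

def sgd_train(data, n_features, iterations=10, alpha=0.1):
    """SGD for the sigmoid model P(y=1|x) = e^{w.phi(x)} / (1 + e^{w.phi(x)}), y in {-1, +1}.
    Each step moves w in the direction that increases P(y|x)."""
    w = np.zeros(n_features)
    for _ in range(iterations):
        for phi, y in data:
            p = 1.0 / (1.0 + np.exp(-np.dot(w, phi)))  # P(y=1 | x)
            grad = y * phi * p * (1.0 - p)             # dP(y|x)/dw for y = +1 or -1
            w += alpha * grad
    return w

# Toy usage: two labeled feature vectors
data = [(np.array([1.0, 0.0]), 1), (np.array([0.0, 1.0]), -1)]
print(sgd_train(data, n_features=2))
```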
SLIDE 26

Gradient of the Sigmoid Function

Take the derivative of the probability

\frac{d\,P(y{=}1 \mid x)}{dw} \;=\; \frac{d}{dw}\,\frac{e^{\,w\cdot\phi(x)}}{1+e^{\,w\cdot\phi(x)}} \;=\; \frac{\phi(x)\,e^{\,w\cdot\phi(x)}}{\left(1+e^{\,w\cdot\phi(x)}\right)^{2}}

\frac{d\,P(y{=}{-}1 \mid x)}{dw} \;=\; \frac{d}{dw}\left(1-\frac{e^{\,w\cdot\phi(x)}}{1+e^{\,w\cdot\phi(x)}}\right) \;=\; \frac{-\,\phi(x)\,e^{\,w\cdot\phi(x)}}{\left(1+e^{\,w\cdot\phi(x)}\right)^{2}}
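A quick numerical check (assumed code, not from the slides) that the closed-form derivative above agrees with a finite-difference estimate:

```python
import numpy as np

def p_y1(w, phi):
    """P(y=1|x) = e^{w.phi} / (1 + e^{w.phi})."""
    return np.exp(w @ phi) / (1.0 + np.exp(w @ phi))

w, phi = np.array([0.5, -0.2]), np.array([1.0, 2.0])  # hypothetical weights and features
analytic = phi * np.exp(w @ phi) / (1.0 + np.exp(w @ phi)) ** 2  # formula above
eps = 1e-6
numeric = np.array([(p_y1(w + eps * np.eye(2)[i], phi) - p_y1(w, phi)) / eps for i in range(2)])
print(analytic, numeric)  # the two estimates should agree closely
```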

SLIDE 27

Learning: We Don't Know the Derivative for Hidden Units!

For NNs, only know correct tag for last layer

For the output unit (weights w_5, taking the hidden-layer outputs h(x) as its input), the gradient has the familiar form:

\frac{d\,P(y{=}1 \mid x)}{dw_5} \;=\; \frac{h(x)\,e^{\,w_5\cdot h(x)}}{\left(1+e^{\,w_5\cdot h(x)}\right)^{2}}

But for the hidden units there is no observed “correct” output, so:

\frac{d\,P(y{=}1 \mid x)}{dw_2} = ? \qquad \frac{d\,P(y{=}1 \mid x)}{dw_3} = ? \qquad \frac{d\,P(y{=}1 \mid x)}{dw_4} = ?

SLIDE 28

Answer: Back-Propagation

Calculate derivative with chain rule

\frac{d\,P(y{=}1 \mid x)}{dw_2} \;=\; \frac{d\,P(y{=}1 \mid x)}{d\,(w_5\cdot h(x))}\;\cdot\;\frac{d\,(w_5\cdot h(x))}{d\,h_1(x)}\;\cdot\;\frac{d\,h_1(x)}{dw_2}

where the first factor is e^{\,w_5\cdot h(x)} / (1+e^{\,w_5\cdot h(x)})^{2}, the second is just the connecting weight, and the last is the local gradient of the hidden unit.

In general, the gradient for unit j is computed from the errors δ_k of the next units k:

\frac{d\,P(y{=}1 \mid x)}{dw_j} \;=\; \frac{d\,h_j(x)}{dw_j}\;\sum_{k}\delta_k\, w_{j,k}

(error of the next unit δ_k, times the connecting weight, times the gradient of this unit)
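A minimal numpy sketch (assumed code with hypothetical weights, not from the slides) of one forward and backward pass through a tiny network, showing how the output unit's error is pushed back through the connecting weights and the local tanh gradient:

```python
import numpy as np

phi0 = np.array([1.0, -1.0])         # input features
W1 = np.array([[0.5, 0.3],           # hidden-layer weights (2 hidden units, hypothetical)
               [-0.2, 0.8]])
w2 = np.array([0.7, -0.4])           # output-layer weights

# Forward pass
h = np.tanh(W1 @ phi0)               # hidden activations
p = 1.0 / (1.0 + np.exp(-(w2 @ h)))  # P(y=1 | x)

# Backward pass (back-propagation)
delta_out = p * (1.0 - p)            # error of the output unit: dp / d(w2 . h)
grad_w2 = delta_out * h              # gradient for the output weights
delta_hidden = delta_out * w2 * (1.0 - h ** 2)  # push error back: weight * tanh'(...)
grad_W1 = np.outer(delta_hidden, phi0)          # gradient for the hidden weights
print(grad_w2, grad_W1)
```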
SLIDE 29

Backpropagation = Gradient descent + Chain rule

SLIDE 30

Feed Forward Neural Nets

All connections point forward

(Figure: a feed-forward network from input features φ(x) to output y, with all edges pointing forward.)

It is a directed acyclic graph (DAG)

SLIDE 31

Neural Networks

  • Non-linear classification
  • Prediction: forward propagation

– Vector/matrix operations + non-linearities

  • Training: backpropagation + stochastic gradient descent

For more details, see Cho chap 3 or CIML Chap 7

SLIDE 32

NEURAL NETWORKS

Aside

SLIDE 33

Back to language modeling…

SLIDE 34

Representing words

  • “one hot vector”

dog = [ 0, 0, 0, 0, 1, 0, 0, 0 …]
cat = [ 0, 0, 0, 0, 0, 0, 1, 0 …]
eat = [ 0, 1, 0, 0, 0, 0, 0, 0 …]

  • That’s a large vector! Practical solutions:

– limit to most frequent words (e.g., top 20000)
– cluster words into classes

  • WordNet classes, frequency binning, etc.
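A minimal Python sketch (assumed code, not from the slides) of one-hot vectors over a restricted vocabulary, with out-of-vocabulary words mapped to an unknown-word token:

```python
import numpy as np

vocab = ["<unk>", "the", "dog", "cat", "eat"]   # hypothetical top-k vocabulary + unknown token
word_to_id = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Vector of zeros with a single 1 at the word's index; unknown words map to <unk>."""
    v = np.zeros(len(vocab))
    v[word_to_id.get(word, 0)] = 1.0
    return v

print(one_hot("dog"))    # [0. 0. 1. 0. 0.]
print(one_hot("zebra"))  # [1. 0. 0. 0. 0.]  (out-of-vocabulary)
```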
SLIDE 35
SLIDE 36

Feed-Forward Neural Language Model

Map each word into a lower-dimensional real-valued space using a shared weight matrix C (the embedding layer). [Bengio et al. 2003]
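A minimal numpy sketch (assumed code; the dimensions and variable names are hypothetical) of the idea: the same matrix C embeds every context word, and the concatenated embeddings feed a hidden layer and a softmax over the vocabulary:

```python
import numpy as np

vocab_size, embed_dim, hidden_dim = 20000, 100, 200
rng = np.random.default_rng(0)
C = rng.normal(size=(vocab_size, embed_dim))        # shared embedding matrix
W_h = rng.normal(size=(hidden_dim, 3 * embed_dim))  # hidden layer for a 3-word context
W_out = rng.normal(size=(vocab_size, hidden_dim))   # output layer over the vocabulary

def next_word_distribution(context_ids):
    """P(next word | 3 previous words) for a feed-forward neural LM."""
    x = np.concatenate([C[i] for i in context_ids])  # look up and concatenate embeddings
    h = np.tanh(W_h @ x)                             # hidden representation
    scores = W_out @ h
    e = np.exp(scores - scores.max())                # softmax over the whole vocabulary
    return e / e.sum()

probs = next_word_distribution([12, 431, 7])         # hypothetical word ids
print(probs.shape, probs.sum())                      # (20000,) 1.0
```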

SLIDE 37

Word Embeddings

  • Neural language models produce word embeddings as a by-product

  • Words that occur in similar contexts tend to have similar embeddings

  • Embeddings are useful features in many NLP tasks

[Turian et al. 2009]

SLIDE 38

Word embeddings illustrated

SLIDE 39

Recurrent Neural Networks

SLIDE 40

Recurrent Neural Nets (RNN)

Some of the node outputs are fed back in as inputs at the next time step.

Why? This makes it possible to “memorize” earlier inputs.
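A minimal numpy sketch (assumed code, not from the slides) of a vanilla recurrent unit: each new hidden state is computed from the current input and the previous hidden state, which is what lets the network "memorize":

```python
import numpy as np

embed_dim, hidden_dim = 4, 3
rng = np.random.default_rng(0)
W_x = rng.normal(size=(hidden_dim, embed_dim))   # input-to-hidden weights
W_h = rng.normal(size=(hidden_dim, hidden_dim))  # hidden-to-hidden (recurrent) weights

def rnn_forward(inputs):
    """Run a vanilla RNN over a sequence of input vectors, returning all hidden states."""
    h = np.zeros(hidden_dim)
    states = []
    for x in inputs:
        h = np.tanh(W_x @ x + W_h @ h)  # new state depends on current input AND previous state
        states.append(h)
    return states

sequence = [rng.normal(size=embed_dim) for _ in range(5)]  # e.g., 5 word embeddings
print(len(rnn_forward(sequence)))  # 5 hidden states
```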

SLIDE 41

Training: backpropagation through time

After processing a few training examples, update the weights by backpropagating through the unfolded recurrent neural network

SLIDE 42

Recurrent neural language models

  • Hidden layer plays double duty

– Memory of the network
– Continuous-space representation used to predict output words

  • Other more elaborate architectures

– Long Short-Term Memory
– Gated Recurrent Units

SLIDE 43

Neural Language Models in practice

  • Much more expensive to train than n-grams!
  • But yielded dramatic improvement in hard extrinsic tasks

– speech recognition (Mikolov et al. 2011)
– and more recently machine translation (Devlin et al. 2014)

  • Key practical issue:

– the softmax requires normalizing over the sum of scores for all possible words
– What to do?

  • Ignore – a score is a score (Auli and Gao, 2014)
  • Integrate normalization into objective function (Devlin et al. 2014)
SLIDE 44

What we know about modeling sequences so far…

– First example: language model
– What are n-gram models?
– How to estimate them?
– How to evaluate them?
– Neural language models