Lecture 2: N-gram
Kai-Wei Chang, CS @ University of Virginia


SLIDE 1

Lecture 2: N-gram

Kai-Wei Chang, CS @ University of Virginia
kw@kwchang.net
Course webpage: http://kwchang.net/teaching/NLP16

SLIDE 2

This lecture

 Language Models

 What are N-gram models?

 How to use probabilities

 What does P(Y|X) mean?
 How can I manipulate it?
 How can I estimate its value in practice?

SLIDE 3

What is a language model?

 Probability distributions over sentences (i.e., word sequences):
P(W) = P(w1 w2 w3 w4 … wk)
 Can use them to generate strings:
P(wk | w1 w2 w3 … wk−1)
 Rank possible sentences

 P(“Today is Tuesday”) > P(“Tuesday Today is”)
 P(“Today is Tuesday”) > P(“Today is Virginia”)

SLIDE 4

Language model applications

Context-sensitive spelling correction

SLIDE 5

Language model applications

Autocomplete

SLIDE 6

Language model applications

Smart Reply

SLIDE 7

Language model applications

Language generation https://pdos.csail.mit.edu/archive/scigen/

SLIDE 8

Bag-of-Words with N-grams

 N-gram: a contiguous sequence of n tokens from a given piece of text
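A minimal sketch of extracting n-grams from tokenized text (function and variable names are illustrative, not from the lecture):

```python
# Extract all n-grams, i.e. contiguous token sequences of length n,
# from a tokenized sentence.
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Example:
# ngrams("this is a sentence".split(), 2)
# -> [('this', 'is'), ('is', 'a'), ('a', 'sentence')]
```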


http://recognize-speech.com/language-model/n-gram-model/comparison

SLIDE 9

N-Gram Models

 Unigram model: P(w1) P(w2) P(w3) … P(wk)
 Bigram model: P(w1) P(w2|w1) P(w3|w2) … P(wk|wk−1)
 Trigram model: P(w1) P(w2|w1) P(w3|w2,w1) … P(wk|wk−1,wk−2)
 N-gram model: P(w1) P(w2|w1) … P(wk|wk−1,wk−2,…,wk−n+1)
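To make these factorizations concrete, here is a minimal Python sketch; `ngram_sentence_prob` and `cond_prob` are illustrative names, with `cond_prob` standing in for a trained model's conditional probability estimates:

```python
# The n-gram factorization above, for any order n. `cond_prob(w, context)`
# is assumed to return P(w | context) for a context of up to n-1 words.
def ngram_sentence_prob(words, cond_prob, n):
    prob = 1.0
    for i, w in enumerate(words):
        context = tuple(words[max(0, i - n + 1):i])  # previous n-1 words
        prob *= cond_prob(w, context)
    return prob

# n=1 gives the unigram model, n=2 the bigram model, n=3 the trigram model.
```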

SLIDE 10

Random language via n-gram

 http://www.cs.jhu.edu/~jason/465/PowerPoint/lect01,3tr-ngram-gen.pdf
 Behind the scenes: probability theory

SLIDE 11

Sampling with replacement

[Quiz: the slide shows a jar of colored balls and asks nine probability questions about sampling with replacement, e.g., P(red), P(blue), P(red | ·), P(· | red), and the joint probability of drawing a particular multiset of colors. The ball images did not survive extraction.]

SLIDE 12

Sampling words with replacement

Example from Julia Hockenmaier, Intro to NLP

SLIDE 13

Implementation: how to sample?

 Sample from a discrete distribution p(X)
 Assume n outcomes in the event space X
1. Divide the interval [0,1] into n intervals according to the probabilities of the outcomes
2. Generate a random number r between 0 and 1
3. Return xi if r falls into the i-th interval
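A minimal Python sketch of the three steps (parallel lists of outcomes and probabilities are an assumed input format):

```python
import random

# Steps 1-3 above: divide [0, 1] by cumulative probabilities, draw a
# random number r, and return the outcome whose interval contains r.
def sample(outcomes, probs):
    r = random.random()              # step 2: r uniform in [0, 1)
    cumulative = 0.0
    for x, p in zip(outcomes, probs):
        cumulative += p              # step 1: interval boundary for x
        if r < cumulative:           # step 3: r falls in x's interval
            return x
    return outcomes[-1]              # guard against floating-point rounding
```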

SLIDE 14

Conditional on the previous word

Example from Julia Hockenmaier, Intro to NLP

SLIDE 15

Conditional on the previous word

Example from Julia Hockenmaier, Intro to NLP

SLIDE 16

Recap: Probability Theory

 Conditional probability
 P(blue | ·) = ? (the colored-ball images on this slide did not survive extraction)
 P(A | B) = P(A, B) / P(B)
 Bayes' rule: P(A | B) = P(B | A) P(A) / P(B)
 Verify: P(red | ·), P(· | red), P(·), P(red)
 Independence: P(A | B) = P(A)
 Prove: P(A, B) = P(A | B) P(B) = P(A) P(B)
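A quick numeric check of these identities; the joint distribution below is invented for illustration and happens to make A and B independent:

```python
# Joint distribution over two binary events A and B (illustrative numbers).
p_joint = {(True, True): 0.12, (True, False): 0.28,
           (False, True): 0.18, (False, False): 0.42}
p_A = sum(p for (a, b), p in p_joint.items() if a)   # P(A) = 0.40
p_B = sum(p for (a, b), p in p_joint.items() if b)   # P(B) = 0.30
p_A_given_B = p_joint[(True, True)] / p_B            # P(A|B) = P(A,B)/P(B)
p_B_given_A = p_joint[(True, True)] / p_A            # P(B|A) = P(A,B)/P(A)

# Bayes' rule: P(A|B) = P(B|A) P(A) / P(B)
assert abs(p_A_given_B - p_B_given_A * p_A / p_B) < 1e-9
# Independence here: P(A|B) = P(A), hence P(A,B) = P(A) P(B)
assert abs(p_joint[(True, True)] - p_A * p_B) < 1e-9
```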

SLIDE 17

The Chain Rule

 The joint probability can be expressed in terms of the conditional probability:
P(X, Y) = P(X | Y) P(Y)
 More variables:
P(X, Y, Z) = P(X | Y, Z) P(Y, Z) = P(X | Y, Z) P(Y | Z) P(Z)
 In general:
P(X1, X2, …, Xn) = P(X1) P(X2 | X1) P(X3 | X2, X1) … P(Xn | X1, …, Xn−1)
                 = P(X1) ∏ i=2..n P(Xi | X1, …, Xi−1)

SLIDE 18

Language model for text

 Probability distribution over sentences (chain rule: from conditional probability to joint probability):
p(w1 w2 … wn) = p(w1) p(w2 | w1) p(w3 | w1, w2) … p(wn | w1, w2, …, wn−1)
 Complexity: O(V^n*) parameters, where n* is the maximum sentence length

A rough estimate: O(475000^14). The average English sentence length is 14.3 words, and there are 475,000 main headwords in Webster's Third New International Dictionary. How large is this? 475000^14 parameters × 8 bytes / 1024^4 ≈ 3.38 × 10^66 TB.

We need independence assumptions!

SLIDE 19

Probability models

 Building a probability model:

 defining the model (making independence assumptions)
 estimating the model's parameters
 using the model (making inferences)

Trigram model (defined in terms of parameters like P("is" | "today"))

[The slide shows a table of parameter values Θ and their definitions; it did not survive extraction.]

SLIDE 20

Independence assumption

 Independence assumption: even though X and Y are not actually independent, we treat them as independent
 Makes the model compact (e.g., from 100k^14 parameters down to 100k^2)

SLIDE 21

Language model with N-gram

 The chain rule:

P(X1, X2, …, Xn) = P(X1) P(X2 | X1) P(X3 | X2, X1) … P(Xn | X1, …, Xn−1)

 N-gram language model assumes each word depends only on the last n-1 words (Markov assumption)

SLIDE 22

Language model with N-gram

 Example: trigram (3-gram)

P(wn | w1, …, wn−1) = P(wn | wn−2, wn−1)
P(w1, …, wn) = P(w1) P(w2 | w1) … P(wn | wn−2, wn−1)

P("Today is a sunny day") = P("Today") P("is" | "Today") P("a" | "is", "Today") … P("day" | "sunny", "a")

SLIDE 23

Unigram model

SLIDE 24

Bigram model

 Condition on the previous word

SLIDE 25

N-gram model

SLIDE 26

More examples

 Yoav's blog post: http://nbviewer.jupyter.org/gist/yoavg/d76121dfde2618422139
 10-gram character-level LM:


First Citizen: Nay, then, that was hers, It speaks against your other service: But since the youth of the circumstance be spoken: Your uncle and one Baptista's daughter. SEBASTIAN: Do I stand till the break off. BIRON: Hide thy head.

SLIDE 27

More examples

 Yoav's blog post: http://nbviewer.jupyter.org/gist/yoavg/d76121dfde2618422139
 10-gram character-level LM:


~~/* * linux/kernel/time.c * Please report this on hardware. */ void irq_mark_irq(unsigned long old_entries, eval); /* * Divide only 1000 for ns^2 -> us^2 conversion values don't overflow: seq_puts(m, "\ttramp: %pS", (void *)class->contending_point]++; if (likely(t->flags & WQ_UNBOUND)) { /* * Update inode information. If the * slowpath and sleep time (abs or rel) * @rmtp: remaining (either due * to consume the state of ring buffer size. */ header_size - size, in bytes, of the chain. */ BUG_ON(!error); } while (cgrp) { if (old) { if (kdb_continue_catastrophic; #endif

SLIDE 28

Questions?

SLIDE 29

Maximum likelihood estimation

“Best” means “data likelihood reaches maximum”

 Unigram language model: p(w | θ) = ?
 Estimation from a document (a paper with total #words = 100):

Word         Count   ML estimate
text         10      10/100
mining       5       5/100
association  3       3/100
database     3       3/100
algorithm    2       2/100
…
query        1       1/100
efficient    1       1/100
…

θ̂ = argmax_θ P(X | θ)
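These fractions are just relative frequencies, count(w)/total; a minimal sketch of the estimate (names illustrative):

```python
from collections import Counter

# ML estimate for a unigram model: p(w) = count(w) / total word count.
def mle_unigram(tokens):
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# For the document above, mle_unigram would map "text" to 10/100 = 0.1,
# "mining" to 5/100 = 0.05, and so on.
```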

SLIDE 30

 Which bag of words more likely generates: aaaDaaaKoaaaa?

[The slide shows two bags of letters: one containing mostly a's (a, E, K, a, a, a, D, a, a, a, a) and one containing b, K, a, D, E, P, F, n.]

SLIDE 31

Parameter estimation

 General setting:

 Given a (hypothesized & probabilistic) model that governs the random experiment
 The model gives a probability of any data p(X | θ) that depends on the parameter θ
 Now, given actual sample data X = {x1, …, xn}, what can we say about the value of θ?

 Intuitively, take our best guess of θ: "best" means "best explaining/fitting the data"
 Generally an optimization problem

SLIDE 32

Maximum likelihood estimation

 Data: a collection of words, w1, w2, …, wn
 Model: multinomial distribution p(X) with parameters θi = p(wi)
 Maximum likelihood estimator: θ̂ = argmax_{θ ∈ Θ} p(X | θ)

p(X | θ) = (N choose c(w1), …, c(wN)) ∏ i=1..N θi^c(wi) ∝ ∏ i=1..N θi^c(wi)

⇒ log p(X | θ) = Σ i=1..N c(wi) log θi + const

θ̂ = argmax_{θ ∈ Θ} Σ i=1..N c(wi) log θi

where c(wi) is the count of word wi and N is the number of word types.

SLIDE 33

Maximum likelihood estimation

Using a Lagrange multiplier and setting partial derivatives to zero gives the ML estimate:

L(X, θ) = Σ i=1..N c(wi) log θi + λ (Σ i=1..N θi − 1)

∂L/∂θi = c(wi)/θi + λ  ⇒  θi = −c(wi)/λ

Since Σ i=1..N θi = 1 (requirement from probability), λ = −Σ i=1..N c(wi), and so

θi = c(wi) / Σ j=1..N c(wj)

which solves θ̂ = argmax_{θ ∈ Θ} Σ i=1..N c(wi) log θi.

SLIDE 34

Maximum likelihood estimation

 For n-gram language models:

p(wi | wi−1, …, wi−n+1) = c(wi, wi−1, …, wi−n+1) / c(wi−1, …, wi−n+1)

 For the unigram case, c(∅) = N, the length of the document (total number of words in the corpus)
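A minimal sketch of this counting estimate for the bigram case (n = 2); the function name and the sentence-list corpus format are assumptions:

```python
from collections import Counter

# ML estimate for a bigram model:
# p(w_i | w_{i-1}) = c(w_{i-1}, w_i) / c(w_{i-1}),
# where c(w_{i-1}) counts occurrences of w_{i-1} as a history.
def estimate_bigrams(sentences):
    history_counts = Counter()
    bigram_counts = Counter()
    for sent in sentences:
        for prev, w in zip(sent, sent[1:]):
            history_counts[prev] += 1
            bigram_counts[(prev, w)] += 1
    return {bg: c / history_counts[bg[0]] for bg, c in bigram_counts.items()}
```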

SLIDE 35

A bi-gram example

Corpus:
<S> I am Sam </S>
<S> I am legend </S>
<S> Sam I am </S>

P(I | <S>) = ?  P(am | I) = ?  P(Sam | am) = ?  P(</S> | Sam) = ?
P(<S> I am Sam </S> | bigram model) = ?
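Working the quiz out with the hypothetical `estimate_bigrams` sketch from the previous slide:

```python
corpus = [
    "<S> I am Sam </S>".split(),
    "<S> I am legend </S>".split(),
    "<S> Sam I am </S>".split(),
]
probs = estimate_bigrams(corpus)  # from the sketch above

print(probs[("<S>", "I")])     # P(I | <S>)    = 2/3
print(probs[("I", "am")])      # P(am | I)     = 3/3 = 1.0
print(probs[("am", "Sam")])    # P(Sam | am)   = 1/3
print(probs[("Sam", "</S>")])  # P(</S> | Sam) = 1/2

# Whole-sentence probability under the bigram model:
sent = "<S> I am Sam </S>".split()
p = 1.0
for prev, w in zip(sent, sent[1:]):
    p *= probs[(prev, w)]
print(p)  # 2/3 * 1 * 1/3 * 1/2 = 1/9 ≈ 0.111
```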

SLIDE 36

Practical Issues

 We do everything in the log space

 Avoid underflow
 Adding is faster than multiplying:
log(p1 × p2) = log p1 + log p2

 Toolkits

 KenLM: https://kheafield.com/code/kenlm/  SRILM: http://www.speech.sri.com/projects/srilm

SLIDE 37

More resources

 Google n-gram: https://research.googleblog.com/2006/08/all-our-n-gram-are-belong-to-you.html


File sizes: approx. 24 GB compressed (gzip'ed) text files
Number of tokens: 1,024,908,267,229
Number of sentences: 95,119,665,584
Number of unigrams: 13,588,391
Number of bigrams: 314,843,401
Number of trigrams: 977,069,902
Number of fourgrams: 1,313,818,354
Number of fivegrams: 1,176,470,663

SLIDE 38

More resources

 Google n-gram viewer: https://books.google.com/ngrams/
 Data: http://storage.googleapis.com/books/ngrams/books/datasetsv2.html


Sample rows (ngram, year, match_count, volume_count):
circumvallate 1978 335 91
circumvallate 1979 261 91

SLIDE 39

SLIDE 40

SLIDE 41

SLIDE 42

SLIDE 43

How about unseen words/phrases

 Example: the Shakespeare corpus consists of N = 884,647 word tokens and a vocabulary of V = 29,066 word types
 Only 30,000 word types occurred
 Words not in the training data ⇒ 0 probability
 Only 0.04% of all possible bigrams occurred

SLIDE 44

Next Lecture

 Dealing with unseen n-grams
 Key idea: reserve some probability mass for events that don't occur in the training data
 How much probability mass should we reserve?

SLIDE 45

Recap

 N-gram language models

 How to generate text from a language model

 How to estimate a language model
 Reading: Speech and Language Processing, Chapter 4: N-Grams
