

SLIDE 1

CS 4650/7650: Natural Language Processing

Language Modeling

Diyi Yang

1

Some slides borrowed from Yulia Tsvetkov at CMU and Kai-Wei Chang at UCLA

SLIDE 2

Logistics

- HW 1 Due
- HW 2 Out: Feb 3rd, 2020, 3:00pm

2

SLIDE 3

Piazza & Office Hours

- ~11 min response time

3

SLIDE 4

Review

- L2: Text classification
- L3: Neural networks for text classification

4

SLIDE 5

This Lecture

- Language Models
- What are N-gram models
- How to use probabilities

5

SLIDE 6

This Lecture

- What is the probability of “I like Georgia Tech at Atlanta”?
- What is the probability of “like I Atlanta at Georgia Tech”?

6

SLIDE 7

Language Models Play the Role of …

- A judge of grammaticality
- A judge of semantic plausibility
- An enforcer of stylistic consistency
- A repository of knowledge (?)

7

SLIDE 8

The Language Modeling Problem

- Assign a probability to every sentence (or any string of words)
- Finite vocabulary (e.g., words or characters): {the, a, telescope, …}
- Infinite set of sequences:
  - A telescope STOP
  - A STOP
  - The the the STOP
  - I saw a woman with a telescope STOP
  - STOP
  - …

8

SLIDE 9

Example

- P(disseminating so much currency STOP) is orders of magnitude smaller than P(spending so much currency STOP)

9

SLIDE 10

What Is A Language Model?

- Probability distributions over sentences (i.e., word sequences)

  P(W) = P(w_1 w_2 w_3 w_4 … w_n)

- Can use them to generate strings

  P(w_n | w_1 w_2 w_3 … w_{n-1})

- Rank possible sentences
  - P(“Today is Tuesday”) > P(“Tuesday Today is”)
  - P(“Today is Tuesday”) > P(“Today is Atlanta”)

10

SLIDE 11

Language Model Applications

- Machine Translation
  - p(strong winds) > p(large winds)
- Spell Correction
  - The office is about 15 minutes from my house
  - p(15 minutes from my house) > p(15 minuets from my house)
- Speech Recognition
  - p(I saw a van) >> p(eyes awe of an)
- Summarization, question answering, handwriting recognition, etc.

11

SLIDE 12

Language Model Applications

12

SLIDE 13

Language Model Applications

Language generation

https://pdos.csail.mit.edu/archive/scigen/

13

SLIDE 14

Bag-of-Words with N-grams

- N-gram: a contiguous sequence of n tokens from a given piece of text

http://recognize-speech.com/language-model/n-gram-model/comparison

14

SLIDE 15

N-grams Models

- Unigram model: p(w_1) p(w_2) p(w_3) … p(w_n)
- Bigram model: p(w_1) p(w_2 | w_1) p(w_3 | w_2) … p(w_n | w_{n-1})
- Trigram model: p(w_1) p(w_2 | w_1) p(w_3 | w_2, w_1) … p(w_n | w_{n-1}, w_{n-2})
- N-gram model: p(w_1) p(w_2 | w_1) … p(w_n | w_{n-1}, w_{n-2}, …, w_{n-N+1})
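Not part of the original slides: a minimal Python sketch of how these factorizations score a sentence, using made-up probability tables purely for illustration.

```python
# Toy probability tables, invented for illustration; a real model would
# estimate these from a corpus (see the MLE slides later in the deck).
unigram_p = {"today": 0.05, "is": 0.10, "tuesday": 0.01}
bigram_p = {("today", "is"): 0.4, ("is", "tuesday"): 0.05}

def unigram_score(words):
    """p(w_1) p(w_2) ... p(w_n) under the unigram model."""
    prob = 1.0
    for w in words:
        prob *= unigram_p.get(w, 0.0)
    return prob

def bigram_score(words):
    """p(w_1) p(w_2 | w_1) ... p(w_n | w_{n-1}) under the bigram model."""
    prob = unigram_p.get(words[0], 0.0)
    for prev, cur in zip(words, words[1:]):
        prob *= bigram_p.get((prev, cur), 0.0)
    return prob

print(unigram_score(["today", "is", "tuesday"]))  # 0.05 * 0.10 * 0.01 = 5e-05
print(bigram_score(["today", "is", "tuesday"]))   # 0.05 * 0.4  * 0.05 = 0.001
```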

15

SLIDE 16

The Language Modeling Problem

- Assign a probability to every sentence (or any string of words)
- Finite vocabulary (e.g., words or characters)
- Infinite set of sequences

  ∑_{x ∈ Σ*} p(x) = 1
  p(x) ≥ 0, ∀ x ∈ Σ*

16

SLIDE 17

A Trivial Model

- Assume we have N training sentences
- Let x_1, x_2, …, x_n be a sentence, and c(x_1, x_2, …, x_n) be the number of times it appeared in the training data
- Define a language model:

  p(x_1, x_2, …, x_n) = c(x_1, x_2, …, x_n) / N

- No generalization!
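A sketch (not in the slides) of this trivial model: it just normalizes raw sentence counts, so any sentence that never occurred in training gets probability zero.

```python
from collections import Counter

# A tiny invented "training set" of whole sentences.
training_sentences = [
    "the dog barks STOP",
    "the dog barks STOP",
    "a telescope STOP",
    "i saw a woman with a telescope STOP",
]
counts = Counter(training_sentences)
N = len(training_sentences)

def trivial_p(sentence):
    # p(x_1, ..., x_n) = c(x_1, ..., x_n) / N
    return counts[sentence] / N

print(trivial_p("the dog barks STOP"))  # 0.5
print(trivial_p("the cat barks STOP"))  # 0.0 -- no generalization
```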

17

SLIDE 18

Markov Processes

- Markov processes:
  - Given a sequence of n random variables X_1, X_2, …, X_n (e.g., n = 100), X_i ∈ V
  - We want a sequence probability model

  P(X_1 = x_1, X_2 = x_2, …, X_n = x_n)

18

SLIDE 19

Markov Processes

- Markov processes:
  - Given a sequence of n random variables X_1, X_2, …, X_n, X_i ∈ V
  - We want a sequence probability model

  P(X_1 = x_1, X_2 = x_2, …, X_n = x_n)

- There are |V|^n possible sequences

19

SLIDE 20

First-order Markov Processes

- Chain rule:

  P(X_1 = x_1, X_2 = x_2, …, X_n = x_n)
    = P(X_1 = x_1) ∏_{i=2}^{n} P(X_i = x_i | X_1 = x_1, …, X_{i-1} = x_{i-1})

20

SLIDE 21

First-order Markov Processes

- Chain rule:

  P(X_1 = x_1, X_2 = x_2, …, X_n = x_n)
    = P(X_1 = x_1) ∏_{i=2}^{n} P(X_i = x_i | X_1 = x_1, …, X_{i-1} = x_{i-1})
    = P(X_1 = x_1) ∏_{i=2}^{n} P(X_i = x_i | X_{i-1} = x_{i-1})    (Markov assumption)

21

SLIDE 22

First-order Markov Processes

- Chain rule:

  P(X_1 = x_1, X_2 = x_2, …, X_n = x_n)
    = P(X_1 = x_1) ∏_{i=2}^{n} P(X_i = x_i | X_1 = x_1, …, X_{i-1} = x_{i-1})
    = P(X_1 = x_1) ∏_{i=2}^{n} P(X_i = x_i | X_{i-1} = x_{i-1})    (Markov assumption)

22

SLIDE 23

First-order Markov Processes

23

SLIDE 24

Second-order Markov Processes

- P(X_1 = x_1, X_2 = x_2, …, X_n = x_n)
    = P(X_1 = x_1) × P(X_2 = x_2 | X_1 = x_1) × ∏_{i=3}^{n} P(X_i = x_i | X_{i-2} = x_{i-2}, X_{i-1} = x_{i-1})
- Simplify notation: x_0 = x_{-1} = *
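An illustrative sketch (not on the slide) of the x_0 = x_{-1} = * convention: pad the sentence with two * symbols so that every position has a full two-word history.

```python
def trigram_contexts(words):
    """Yield (x_{i-2}, x_{i-1}, x_i) triples, using '*' for x_0 and x_{-1}."""
    padded = ["*", "*"] + list(words)
    for i in range(2, len(padded)):
        yield padded[i - 2], padded[i - 1], padded[i]

print(list(trigram_contexts(["the", "dog", "barks", "STOP"])))
# [('*', '*', 'the'), ('*', 'the', 'dog'), ('the', 'dog', 'barks'), ('dog', 'barks', 'STOP')]
```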

24

SLIDE 25

Details: Variable Length

- We want a probability distribution over sequences of any length

25

SLIDE 26

Details: Variable Length

- Always define x_n = STOP, where STOP is a special symbol
- Then use a Markov process as before:

  P(X_1 = x_1, X_2 = x_2, …, X_n = x_n) = ∏_{i=1}^{n} P(X_i = x_i | X_{i-2} = x_{i-2}, X_{i-1} = x_{i-1})

- We now have a probability distribution over all sequences
- Intuition: at every step there is some probability of generating STOP (and stopping), and the complementary probability of keeping going

26

SLIDE 27

The Process of Generating Sentences

Step 1: Initialize i = 1 and x_0 = x_{-1} = *
Step 2: Generate x_i from the distribution P(X_i = x_i | X_{i-2} = x_{i-2}, X_{i-1} = x_{i-1})
Step 3: If x_i = STOP, return the sequence x_1 ⋯ x_i. Otherwise, set i = i + 1 and return to Step 2.

27
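A minimal sampling sketch of the three-step procedure above. The table q of trigram parameters is invented for illustration; a real one would be estimated from data.

```python
import random

# Invented trigram parameters q(w | u, v), indexed by the history (u, v).
q = {
    ("*", "*"):       {"the": 1.0},
    ("*", "the"):     {"dog": 0.6, "cat": 0.4},
    ("the", "dog"):   {"barks": 1.0},
    ("the", "cat"):   {"meows": 1.0},
    ("dog", "barks"): {"STOP": 1.0},
    ("cat", "meows"): {"STOP": 1.0},
}

def generate():
    u, v = "*", "*"                 # Step 1: x_0 = x_{-1} = *
    sentence = []
    while True:
        dist = q[(u, v)]            # Step 2: sample x_i given the history
        word = random.choices(list(dist), weights=list(dist.values()))[0]
        if word == "STOP":          # Step 3: stop, or shift the history and repeat
            return sentence
        sentence.append(word)
        u, v = v, word

print(generate())  # e.g. ['the', 'dog', 'barks']
```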

SLIDE 28

3-gram LMs

- A trigram language model contains:
  - A vocabulary V
  - A non-negative parameter q(w | u, v) for every trigram, such that w ∈ V ∪ {STOP} and u, v ∈ V ∪ {*}
- The probability of a sentence x_1, x_2, …, x_n, where x_n = STOP, is

  p(x_1, …, x_n) = ∏_{i=1}^{n} q(x_i | x_{i-2}, x_{i-1})

28

SLIDE 29

3-gram LMs

- A trigram language model contains:
  - A vocabulary V
  - A non-negative parameter q(w | u, v) for every trigram, such that w ∈ V ∪ {STOP} and u, v ∈ V ∪ {*}
- The probability of a sentence x_1, x_2, …, x_n, where x_n = STOP, is

  p(x_1, …, x_n) = ∏_{i=1}^{n} q(x_i | x_{i-2}, x_{i-1})

29

SLIDE 30

3-gram LMs: Example

p(the dog barks STOP) = q(the | *, *) × …

30

SLIDE 31

3-gram LMs: Example

p(the dog barks STOP) = q(the | *, *)
                      × q(dog | *, the)
                      × q(barks | the, dog)
                      × q(STOP | dog, barks)
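The same product written as a tiny function; the q values here are placeholders chosen only to make the arithmetic concrete.

```python
# Placeholder parameter values, purely illustrative.
q = {
    ("the", "*", "*"):        0.5,
    ("dog", "*", "the"):      0.1,
    ("barks", "the", "dog"):  0.05,
    ("STOP", "dog", "barks"): 0.4,
}

def trigram_sentence_prob(words):
    """p(x_1 ... x_n) = product over i of q(x_i | x_{i-2}, x_{i-1}), with * padding."""
    padded = ["*", "*"] + list(words)
    prob = 1.0
    for i in range(2, len(padded)):
        prob *= q[(padded[i], padded[i - 2], padded[i - 1])]
    return prob

print(trigram_sentence_prob(["the", "dog", "barks", "STOP"]))
# 0.5 * 0.1 * 0.05 * 0.4 = 0.001
```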

31

SLIDE 32

Limitations

- The Markovian assumption is false:

  He is from France, so it makes sense that his first language is …

- We want to model longer dependencies

32

SLIDE 33

N-gram model

33

SLIDE 34

More Examples

34

- Yoav’s blog post: http://nbviewer.jupyter.org/gist/yoavg/d76121dfde2618422139
- 10-gram character-level LM:

First Citizen: Nay, then, that was hers, It speaks against your other service: But since the youth of the circumstance be spoken: Your uncle and one Baptista's daughter. SEBASTIAN: Do I stand till the break off.

SLIDE 35

Maximum Likelihood Estimation

35

- “Best” means “data likelihood reaches maximum”:

  θ̂ = argmax_θ p(X | θ)

SLIDE 36

Maximum Likelihood Estimation

36

Unigram language model: estimate θ, i.e., p(w | θ) = ?, from a document (a paper with total #words = 100):

  word          count   estimate
  text            10     10/100
  mining           5      5/100
  association      3      3/100
  database         3      3/100
  algorithm        2      2/100
  …                …          …
  query            1      1/100
  efficient        1      1/100
  …                …          …
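Not on the slide, but the estimation column is just count divided by the total number of words; a two-line sketch with the counts above:

```python
counts = {"text": 10, "mining": 5, "association": 3, "database": 3,
          "algorithm": 2, "query": 1, "efficient": 1}
total = 100  # total number of words in the paper

p = {w: c / total for w, c in counts.items()}
print(p["text"], p["mining"], p["query"])  # 0.1 0.05 0.01
```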

SLIDE 37

Which Bag of Words Is More Likely to Be Generated?

37

(Figure: two bags of characters; one is almost entirely “a”s with a few other letters, the other is a diverse mix of letters such as b, K, a, D, E, P, F, n)

SLIDE 38

Parameter Estimation

38

- General setting:
  - Given a (hypothesized and probabilistic) model that governs the random experiment
  - The model gives a probability p(X | θ) of any data X, which depends on the parameter θ
  - Now, given actual sample data X = {x1, …, xn}, what can we say about the value of θ?
- Intuitively, take our best guess of θ
  - “best” means “best explaining/fitting the data”
- Generally an optimization problem

SLIDE 39

Maximum Likelihood Estimation

39

- Data: a collection of words, w_1, w_2, …, w_n
- Model: multinomial distribution p(W) with parameters θ_i = p(w_i)
- Maximum likelihood estimator:

  θ̂ = argmax_{θ ∈ Θ} p(W | θ)

SLIDE 40

Maximum Likelihood Estimation

40

p(X | θ) ∝ ∏_{i=1}^{N} θ_i^{c(w_i)}
(the proportionality constant depends only on the counts c(w_1), …, c(w_N), not on θ)

⇒ log p(X | θ) = ∑_{i=1}^{N} c(w_i) log θ_i + const

θ̂ = argmax_{θ ∈ Θ} ∑_{i=1}^{N} c(w_i) log θ_i

SLIDE 41

Maximum Likelihood Estimation

41

θ̂ = argmax_{θ ∈ Θ} ∑_{i=1}^{N} c(w_i) log θ_i

Lagrange multiplier:

  L(X, θ) = ∑_{i=1}^{N} c(w_i) log θ_i + λ (∑_{i=1}^{N} θ_i − 1)

Set partial derivatives to zero:

  ∂L/∂θ_i = c(w_i)/θ_i + λ = 0  →  θ_i = −c(w_i)/λ

Requirement from probability, ∑_{i=1}^{N} θ_i = 1:

  λ = −∑_{i=1}^{N} c(w_i)

ML estimate:

  θ_i = c(w_i) / ∑_{i=1}^{N} c(w_i)

SLIDE 42

Maximum Likelihood Estimation

42

- For N-gram language models:

  p(w_i | w_{i-1}, …, w_{i-N+1}) = c(w_i, w_{i-1}, …, w_{i-N+1}) / c(w_{i-1}, …, w_{i-N+1})
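A sketch of this count-ratio estimate for the bigram case (N = 2), on a toy two-sentence corpus rather than real data:

```python
from collections import Counter

corpus = ["i want to eat".split(), "i want chinese food".split()]

bigram_counts, unigram_counts = Counter(), Counter()
for sent in corpus:
    unigram_counts.update(sent)
    bigram_counts.update(zip(sent, sent[1:]))

def p_mle(word, prev):
    # p(w_i | w_{i-1}) = c(w_{i-1}, w_i) / c(w_{i-1})
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(p_mle("want", "i"))   # 2/2 = 1.0
print(p_mle("to", "want"))  # 1/2 = 0.5
```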

SLIDE 43

Practical Issues

43

- We do everything in the log space
  - Avoid underflow
  - Adding is faster than multiplying

  log(p_1 × p_2) = log p_1 + log p_2
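A quick illustration of the underflow point: the product of many small probabilities collapses to 0.0 in floating point, while the sum of their logs stays perfectly representable.

```python
import math

probs = [1e-5] * 100            # 100 tokens, each with probability 1e-5

product = 1.0
for p in probs:
    product *= p
print(product)                   # 0.0 -- underflows (the true value is 1e-500)

log_prob = sum(math.log(p) for p in probs)
print(log_prob)                  # about -1151.3, no underflow
```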

SLIDE 44

More Resources

44

- Google n-gram: https://research.googleblog.com/2006/08/all-our-n-gram-are-belong-to-you.html

  File sizes: approx. 24 GB compressed (gzip'ed) text files
  Number of tokens: 1,024,908,267,229
  Number of sentences: 95,119,665,584
  Number of unigrams: 13,588,391
  Number of bigrams: 314,843,401
  Number of trigrams: 977,069,902
  Number of fourgrams: 1,313,818,354
  Number of fivegrams: 1,176,470,663

SLIDE 45

More Resources

45

- Google n-gram viewer: https://books.google.com/ngrams/
- Data: http://storage.googleapis.com/books/ngrams/books/datasetsv2.html

SLIDE 46

46

SLIDE 47

47

SLIDE 48

48

SLIDE 49

49

SLIDE 50

How about Unseen Words/Phrases

50

- Example: the Shakespeare corpus consists of N = 884,647 word tokens and a vocabulary of V = 29,066 word types
- Only 30,000 word types occurred
  - Words not in the training data ⇒ 0 probability
- Only 0.04% of all possible bigrams occurred
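A sketch of the failure mode: with unsmoothed maximum likelihood estimates, a single unseen bigram zeroes out the whole sentence.

```python
from collections import Counter

train = ["i want to eat".split(), "i want chinese food".split()]
bigrams, contexts = Counter(), Counter()
for sent in train:
    bigrams.update(zip(sent, sent[1:]))
    contexts.update(sent[:-1])          # count each word as a bigram context

def p(word, prev):
    return bigrams[(prev, word)] / contexts[prev] if contexts[prev] else 0.0

test = "i want thai food".split()
prob = 1.0
for prev, cur in zip(test, test[1:]):
    prob *= p(cur, prev)
print(prob)  # 0.0 -- ("want", "thai") never occurred in training
```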

SLIDE 51

How to Estimate Parameters from Training Data

- How do we know p(w | history)?
- Use statistics from data (examples using Google N-Grams)
  - E.g., what is p(door | the)?

51

SLIDE 52

Increasing N-Gram Order

- Higher orders capture more dependencies

52

SLIDE 53

Berkeley Restaurant Project Sentences

- can you tell me about any good cantonese restaurants close by
- mid priced thai food is what i’m looking for
- tell me about chez panisse
- can you give me a listing of the kinds of food that are available
- i’m looking for a good place to eat breakfast
- when is cafe venezia open during the day

53

SLIDE 54

Bigram Counts (~10K Sentences)

54

SLIDE 55

Bigram Probabilities

55

SLIDE 56

What Did We Learn

- p(English | want) < p(Chinese | want): people like Chinese stuff more, at least in this corpus
- English behaves in a certain way
  - p(to | want) = 0.66
  - p(eat | to) = 0.28

56

SLIDE 57

Sparseness

- Maximum likelihood for estimating q
  - Let c(w_1, w_2, …, w_n) be the number of times that n-gram appears in a corpus

  q(w_i | w_{i-2}, w_{i-1}) = c(w_{i-2}, w_{i-1}, w_i) / c(w_{i-2}, w_{i-1})

- If the vocabulary has 20,000 words, the number of parameters is 8 × 10^12!
- Most n-grams will never be observed, even if they are linguistically plausible
- Most sentences will have zero or undefined probabilities
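The 8 × 10^12 figure is just |V|^3; a quick check of the arithmetic:

```python
V = 20_000                     # vocabulary size
print(f"{V ** 3:.1e}")         # 8.0e+12 trigram parameters q(w | u, v)
print(f"{V ** 2:.1e}")         # 4.0e+08 bigram parameters, already very large
```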

57

SLIDE 58

How To Evaluate

- Extrinsic: build a new language model, use it for some task (MT, ASR, etc.)
- Intrinsic: measure how good we are at modeling language

58

SLIDE 59

Intrinsic Evaluation

- Intuitively, language models should assign high probability to real language they have not seen before
- Want to maximize likelihood on test data, not training data
- Models derived from counts / sufficient statistics require generalization parameters to be tuned on held-out data to simulate test generalization
- Set hyperparameters to maximize the likelihood of the held-out data (usually with grid search or EM)

59

SLIDE 60

Intrinsic Evaluation

- Intuitively, language models should assign high probability to real language they have not seen before

60