CS 4650/7650: Natural Language Processing
Language Modeling (2)
Diyi Yang
Many slides from Dan Jurafsky and Jason Eisner
Recap: Language Model
• Unigram model: $P(w_1)\,P(w_2)\,P(w_3)\cdots P(w_n)$
• Bigram model: $P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_2)\cdots P(w_n \mid w_{n-1})$
• Trigram model: $P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1, w_2)\cdots P(w_n \mid w_{n-2}, w_{n-1})$
• N-gram model: $P(w_1)\cdots P(w_n \mid w_{n-N+1}, \ldots, w_{n-1})$
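To make these factorizations concrete, here is a minimal Python sketch (not from the slides) that collects bigram counts and scores a sentence with the chain rule; the toy corpus and the `<s>`/`</s>` boundary markers are illustrative choices.

```python
from collections import defaultdict

def train_bigram_mle(corpus):
    """Collect unigram and bigram counts from tokenized sentences."""
    unigrams, bigrams = defaultdict(int), defaultdict(int)
    for sent in corpus:
        tokens = ["<s>"] + sent + ["</s>"]
        for w in tokens:
            unigrams[w] += 1
        for w1, w2 in zip(tokens, tokens[1:]):
            bigrams[(w1, w2)] += 1
    return unigrams, bigrams

def bigram_prob(sentence, unigrams, bigrams):
    """P(sentence) = product of P(w_i | w_{i-1}) under the MLE bigram model."""
    tokens = ["<s>"] + sentence + ["</s>"]
    p = 1.0
    for w1, w2 in zip(tokens, tokens[1:]):
        if unigrams[w1] == 0:
            return 0.0  # unseen history: MLE assigns zero probability
        p *= bigrams[(w1, w2)] / unigrams[w1]
    return p

uni, bi = train_bigram_mle([["the", "cat", "sat"], ["the", "dog", "sat"]])
print(bigram_prob(["the", "cat", "sat"], uni, bi))  # 0.5, since P(cat | the) = 1/2
```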
[Figure (credit: Dan Klein): bar charts of counts for words in a fixed context (allegations, reports, claims, attack, request, man) before and after smoothing. Smoothing shaves counts off the seen words ('allegations', 'reports', 'claims', 'request') and redistributes that mass to unseen ones ('attack', 'man').]
$$P_{\text{MLE}}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i)}{c(w_{i-1})}$$

$$P_{\text{Add-1}}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) + 1}{c(w_{i-1}) + V}$$
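A sketch of the two estimators side by side, reusing the count dictionaries from the earlier bigram sketch (`V` is the vocabulary size):

```python
def p_mle(w, h, bigrams, unigrams):
    """MLE estimate c(h, w) / c(h); assumes the history h was seen in training."""
    return bigrams[(h, w)] / unigrams[h]

def p_add1(w, h, bigrams, unigrams, V):
    """Add-1 (Laplace) estimate (c(h, w) + 1) / (c(h) + V); never zero."""
    return (bigrams[(h, w)] + 1) / (unigrams[h] + V)
```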
$$c^*(w_{i-1}, w_i) = P^*_{\text{Add-1}}(w_i \mid w_{i-1}) \cdot c(w_{i-1}) = \frac{c(w_{i-1}, w_i) + 1}{c(w_{i-1}) + V} \cdot c(w_{i-1})$$
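A quick way to see how much mass add-1 moves around is to compute these reconstituted counts directly; a sketch under the same assumptions as the earlier snippets:

```python
def reconstituted_count(w, h, bigrams, unigrams, V):
    """c*(h, w): the count the smoothed model behaves as if it had seen.

    Comparing c*(h, w) with the raw c(h, w) shows how much probability
    mass add-1 takes from observed bigrams and gives to unseen ones.
    """
    return (bigrams[(h, w)] + 1) / (unigrams[h] + V) * unigrams[h]
```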
[Table (Eisner's add-1 example): counts and probabilities for 'see the ___' over a 20,000-word vocabulary: 'see the abacus', 'see the abbot', 'see the abduct', 'see the above', 'see the Abram', …, 'see the zygote', with a Total of 20003/20003 after add-1 smoothing.]
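Reading off the totals: if the history "see the" occurred 3 times in training and the vocabulary has $V = 20{,}000$ word types (the per-word counts below are illustrative), add-1 gives, for example,

$$P_{\text{Add-1}}(\text{abacus} \mid \text{see the}) = \frac{1 + 1}{3 + 20000} = \frac{2}{20003},
\qquad
P_{\text{Add-1}}(\text{abbot} \mid \text{see the}) = \frac{0 + 1}{3 + 20000} = \frac{1}{20003},$$

and the smoothed counts over all $20{,}000$ words sum to $3 + 20000 = 20003$, which is where the Total of $20003/20003$ comes from.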
[Table: the same construction with an open-ended vocabulary of arbitrary strings ('see the aaaaa', 'see the aaaab', 'see the aaaac', 'see the aaaad', 'see the aaaae', …, 'see the zzzzz'), where the Total becomes (∞+3)/(∞+3): with infinitely many possible events, add-1 gives essentially all probability mass to novel events.]
• Adding a smaller constant λ < 1 instead of 1 gives much less probability to novel events.
• That is, how much should we smooth (what should λ be)?
• It depends on how likely novel events really are!
• Which may depend on the type of text, the size of the training corpus, …
• We'll look at a few methods for deciding how much to smooth.
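One standard recipe, as a sketch rather than necessarily the method the slides cover next: generalize add-1 to add-λ and pick the λ that maximizes the likelihood of held-out data. This reuses `uni` and `bi` from the earlier sketch; the held-out corpus here is illustrative.

```python
import math

def add_lambda_logprob(corpus, bigrams, unigrams, lam, V):
    """Total log-probability of a corpus under add-lambda bigram smoothing."""
    total = 0.0
    for sent in corpus:
        tokens = ["<s>"] + sent + ["</s>"]
        for h, w in zip(tokens, tokens[1:]):
            total += math.log((bigrams[(h, w)] + lam) / (unigrams[h] + lam * V))
    return total

heldout = [["the", "cat", "sat"]]   # illustrative held-out data
V = len(uni)                        # vocabulary size from the training counts
# Pick the lambda with the best held-out likelihood.
best_lam = max([0.001, 0.01, 0.1, 0.5, 1.0],
               key=lambda lam: add_lambda_logprob(heldout, bi, uni, lam, V))
```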
$$P_{\text{AbsoluteDiscounting}}(w_i \mid w_{i-1}) = \underbrace{\frac{c(w_{i-1}, w_i) - d}{c(w_{i-1})}}_{\text{discounted bigram}} + \underbrace{\lambda(w_{i-1})}_{\text{interpolation weight}}\,\underbrace{P(w_i)}_{\text{unigram}}$$
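A direct transcription of the formula into Python, again reusing the earlier count dictionaries; `p_unigram` is an assumed dictionary of unigram probabilities, and a real implementation would precompute the type counts instead of rescanning on every call:

```python
def p_absolute_discount(w, h, bigrams, unigrams, p_unigram, d=0.75):
    """max(c(h,w) - d, 0)/c(h) + lambda(h) * P(w), for a history h seen in training."""
    # lambda(h) = d * (# distinct words observed after h) / c(h):
    # exactly the mass freed by discounting, so the distribution sums to 1.
    n_types_after_h = len({w2 for (h2, w2) in bigrams if h2 == h})
    lam = d * n_types_after_h / unigrams[h]
    return max(bigrams[(h, w)] - d, 0) / unigrams[h] + lam * p_unigram[w]
```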
Although "Francisco" is frequent, it is frequent mainly in the single phrase "San Francisco".
How likely is $w$ to appear as a novel continuation? Count the distinct histories it completes:

$$P_{\text{CONTINUATION}}(w) \propto \left|\{\, w_{i-1} : c(w_{i-1}, w) > 0 \,\}\right|$$

Normalizing by the total number of distinct bigram types:

$$P_{\text{CONTINUATION}}(w) = \frac{\left|\{\, w_{i-1} : c(w_{i-1}, w) > 0 \,\}\right|}{\left|\{\, (w'_{i-1}, w') : c(w'_{i-1}, w') > 0 \,\}\right|}$$
$$P_{\text{KN}}(w_i \mid w_{i-1}) = \frac{\max\!\left(c(w_{i-1}, w_i) - d,\; 0\right)}{c(w_{i-1})} + \lambda(w_{i-1})\, P_{\text{CONTINUATION}}(w_i)$$
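Putting the discounting and the continuation probability together, a minimal (and deliberately slow) bigram Kneser-Ney sketch, again reusing the count dictionaries from earlier:

```python
def p_kneser_ney(w, h, bigrams, unigrams, d=0.75):
    """Interpolated Kneser-Ney for bigrams; production code precomputes
    the continuation and type counts instead of rescanning every call."""
    # Continuation probability: distinct histories that precede w,
    # normalized by the total number of distinct bigram types.
    n_histories_of_w = len({h2 for (h2, w2) in bigrams if w2 == w})
    p_cont = n_histories_of_w / len(bigrams)
    # Interpolation weight: the mass freed by discounting h's bigrams.
    n_types_after_h = len({w2 for (h2, w2) in bigrams if h2 == h})
    lam = d * n_types_after_h / unigrams[h]
    return max(bigrams[(h, w)] - d, 0) / unigrams[h] + lam * p_cont
```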
• e.g., store the n-gram counts in a trie
https://en.wikipedia.org/wiki/Trie
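A minimal trie for n-gram counts might look like the sketch below (illustrative, not a particular library's API): each node maps a word to a child node, so shared prefixes such as "see the" are stored only once.

```python
class NgramTrie:
    """Each path from the root spells an n-gram; the node stores its count."""
    def __init__(self):
        self.children = {}
        self.count = 0

    def add(self, ngram):
        node = self
        for w in ngram:
            node = node.children.setdefault(w, NgramTrie())
        node.count += 1

    def lookup(self, ngram):
        node = self
        for w in ngram:
            if w not in node.children:
                return 0
            node = node.children[w]
        return node.count

trie = NgramTrie()
trie.add(("see", "the", "abacus"))
print(trie.lookup(("see", "the", "abacus")))  # 1
```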
For efficient n-gram language model implementations, see Pauls and Klein (2011) and Heafield (2011).
[Figure: a feedforward neural language model: words as one-hot vectors → concatenated word embeddings → hidden layer → output distribution over the vocabulary. Slide credit: Greg Durrett.]
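As a sketch of that architecture in PyTorch (an assumed framework; the layer sizes are illustrative): the embedding lookup plays the role of multiplying a one-hot vector by an embedding matrix.

```python
import torch
import torch.nn as nn

class FeedforwardLM(nn.Module):
    """One-hot words -> concatenated embeddings -> hidden layer -> output distribution."""
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=128, context=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)   # one-hot x embedding matrix
        self.hidden = nn.Linear(context * emb_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, context_ids):
        # context_ids: (batch, context) integer word indices
        e = self.embed(context_ids)                    # (batch, context, emb_dim)
        e = e.reshape(e.size(0), -1)                   # concatenate the embeddings
        h = torch.tanh(self.hidden(e))                 # hidden layer
        return torch.log_softmax(self.out(h), dim=-1)  # log P(w | context)
```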