CS 533: Natural Language Processing
Language Modeling
Karl Stratos
Rutgers University
Motivation

How likely are the following sentences?
◮ the dog barked
◮ the cat barked
◮ dog the barked
◮ oqc shgwqw#w 1g0
Outline
◮ Unigram, Bigram, Trigram Models
◮ Estimation from Data
◮ Evaluation
◮ Smoothing
◮ We’ll assume a finite vocabulary V (i.e., the set of all possible word types).
◮ Sample space: Ω = {x1 . . . xm ∈ V^m : m ≥ 1}
◮ Task: Design a function p over Ω such that p(x) ≥ 0 for all x ∈ Ω and Σ_{x∈Ω} p(x) = 1
◮ What are some challenges?
◮ Can we “break up” the probability of a sentence into smaller pieces that are easier to model?
◮ Yes: Assume a generative process.
◮ We may assume that each sentence x1 . . . xm is generated as follows (a code sketch follows after the steps):
  (1) x1 is drawn from p(·)
  (2) x2 is drawn from p(·|x1)
  (3) x3 is drawn from p(·|x1, x2)
  . . .
  (m) xm is drawn from p(·|x1, . . . , xm−1)
  (m + 1) xm+1 is drawn from p(·|x1, . . . , xm)
◮ The chain rule then expresses the probability of a sentence as a product of per-word conditional probabilities.
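To make the generative process concrete, here is a minimal Python sampling sketch. Everything in it is illustrative: the toy conditional table, the <s>/</s> start and stop markers, and the simplification that each draw conditions only on the previous word rather than the full history x1, . . . , xj−1.

```python
import random

# Toy conditional distributions p(next word | previous word) -- illustrative only.
# "<s>" marks the start of a sentence; drawing "</s>" ends generation.
COND = {
    "<s>":    {"the": 1.0},
    "the":    {"dog": 0.5, "cat": 0.5},
    "dog":    {"barked": 1.0},
    "cat":    {"barked": 1.0},
    "barked": {"</s>": 1.0},
}

def sample_sentence(max_len=20):
    """Draw x1, x2, ... one at a time, each conditioned on the history so far."""
    words, prev = [], "<s>"
    for _ in range(max_len):
        dist = COND[prev]
        prev = random.choices(list(dist), weights=list(dist.values()))[0]
        if prev == "</s>":
            break
        words.append(prev)
    return " ".join(words)

print(sample_sentence())  # e.g. "the dog barked"
```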
◮ Sample space of each conditional distribution = finite V (not the infinite Ω)
◮ The model still defines a proper distribution over all sentences.
Unigram, Bigram, Trigram Models
◮ An n-gram model makes a Markov assumption: each word depends only on the previous n − 1 words, e.g., a bigram model sets p(xj|x1 . . . xj−1) = q(xj|xj−1).
◮ Is this a reasonable assumption for language modeling?
Estimation from Data
◮ Summary so far: We have designed probabilistic language models under n-gram assumptions.
◮ Bigram model: Stores a table of O(|V|^2) values q(x′|x)
◮ Q. But where do we get these values?
◮ Our data is a corpus of N sentences x^{(1)} . . . x^{(N)}.
◮ Define count(x, x′) to be the number of times x, x′ appear consecutively in the corpus:

  count(x, x′) = Σ_{i=1}^{N} Σ_{j=2}^{l_i+1} 1[x_{j−1}^{(i)} = x, x_j^{(i)} = x′]

  where l_i is the length of the i-th sentence (position l_i + 1 holding a special stop symbol).
◮ Define count(x) := Σ_{x′} count(x, x′) (called “unigram counts”).
Example corpus (N = 3):
◮ the dog chased the cat
◮ the cat chased the mouse
◮ the mouse chased the dog
◮ For all x, x′ with count(x, x′) > 0, set

  q(x′|x) = count(x, x′) / count(x)

◮ In the previous example: q(dog|the) = count(the, dog)/count(the) = 2/6, and so on.
◮ Called maximum likelihood estimation (MLE); see the code sketch below.
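A minimal sketch of these count-based estimates on the toy corpus above; the variable and function names are my own, and for brevity the counts skip the stop-symbol position at l_i + 1.

```python
from collections import Counter

corpus = [
    "the dog chased the cat",
    "the cat chased the mouse",
    "the mouse chased the dog",
]

# count(x, x'): number of times x is immediately followed by x'.
bigram_counts = Counter()
for sentence in corpus:
    words = sentence.split()
    for x, x_next in zip(words, words[1:]):
        bigram_counts[(x, x_next)] += 1

# count(x) := sum over x' of count(x, x').
unigram_counts = Counter()
for (x, _), c in bigram_counts.items():
    unigram_counts[x] += c

def q(x_next, x):
    """MLE bigram estimate q(x'|x) = count(x, x') / count(x)."""
    return bigram_counts[(x, x_next)] / unigram_counts[x]

print(q("dog", "the"))  # count(the, dog) / count(the) = 2/6 = 0.333...
```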
◮ MLE solves a constrained optimization problem:

  q* = argmax_{q : q(x′|x) ≥ 0 ∀x,x′ and Σ_{x′} q(x′|x) = 1 ∀x}  Σ_{i=1}^{N} Σ_{j=2}^{l_i+1} log q(x_j^{(i)}|x_{j−1}^{(i)})

◮ Its closed-form solution is exactly the count-based estimate q*(x′|x) = count(x, x′)/count(x).
Evaluation
◮ True context-word pairs distributed as (c, w) ∼ pCW
◮ We define some language model qW|C
◮ Learning: minimize the cross entropy between pW|C and qW|C:

  H(pW|C, qW|C) = E_{(c,w)∼pCW}[−ln qW|C(w|c)]

◮ Evaluation of qW|C: check how small H(pW|C, qW|C) is!
◮ If the model class of qW|C is universally expressive, an optimal model q*W|C will satisfy H(pW|C, q*W|C) = H(pW|C) with q*W|C = pW|C.
◮ Perplexity: exponentiated cross entropy

  perplexity(q) = exp(H(pW|C, qW|C))

◮ Interpretation: effective vocabulary size
◮ Empirical estimation (a code sketch follows below): given samples (c1, w1) . . . (cN, wN) ∼ pCW,

  perplexity(q) ≈ exp(−(1/N) Σ_{i=1}^{N} ln qW|C(wi|ci))
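As a sanity check on this formula, a tiny sketch; the probabilities below are made-up stand-ins for the values qW|C(wi|ci) a model assigns to N = 5 held-out pairs.

```python
import math

# Hypothetical model probabilities q(w_i | c_i) on held-out context-word pairs.
probs = [0.10, 0.02, 0.30, 0.05, 0.08]

# perplexity ~= exp( -(1/N) * sum_i ln q(w_i | c_i) )
avg_nll = -sum(math.log(p) for p in probs) / len(probs)
print(math.exp(avg_nll))  # ~13.3: as confused as a uniform choice over ~13 words
```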
◮ Using vocabulary size |V| = 50,000 (Goodman, 2001):
  Unigram: 955, Bigram: 137, Trigram: 74
◮ Modern neural language models: probably ≪ 20
◮ The big question: what is the minimum perplexity achievable?
Smoothing
◮ MLE assigns zero probability to n-grams unseen in training; smoothing redistributes probability mass to them.
◮ Linear interpolation of trigram, bigram, and unigram estimates (a code sketch follows below):

  q_smoothed(x′′|x, x′) = λ1 q(x′′|x, x′) + λ2 q(x′′|x′) + λ3 q(x′′)

  where λ1 + λ2 + λ3 = 1 and each λi ≥ 0 (typically tuned on held-out data)
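A minimal sketch of the interpolation, assuming the trigram/bigram/unigram MLE estimates are available as functions; the stand-in probability values and λ settings are illustrative, not tuned.

```python
# Stand-in MLE estimates (in practice these come from corpus counts).
def q_trigram(x2, x0, x1):  # q(x''|x, x'); pretend this trigram was never seen
    return 0.0
def q_bigram(x2, x1):       # q(x''|x')
    return 0.20
def q_unigram(x2):          # q(x'')
    return 0.01

LAMBDAS = (0.6, 0.3, 0.1)   # nonnegative, sum to 1; tune on held-out data

def q_smoothed(x2, x0, x1):
    """q_smoothed(x''|x, x') = l1*q(x''|x, x') + l2*q(x''|x') + l3*q(x'')."""
    l1, l2, l3 = LAMBDAS
    return l1 * q_trigram(x2, x0, x1) + l2 * q_bigram(x2, x1) + l3 * q_unigram(x2)

# The trigram estimate is zero, but the smoothed probability is not:
print(q_smoothed("barked", "the", "cat"))  # 0.3*0.20 + 0.1*0.01 = 0.061
```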
◮ Kneser-Ney smoothing: Section 3.5 of Jurafsky and Martin’s Speech and Language Processing (3rd ed. draft)
◮ Good-Turing estimator: the “missing mass” problem
  ◮ On the Convergence Rate of Good-Turing Estimators (McAllester and Schapire, 2001)
◮ Text: initially a single string
“Call me Ishmael.”
◮ Naive tokenization: by space
[Call, me, Ishmael.]
◮ English-specific tokenization using rules or statistical model:
[Call, me, Ishmael, .]
◮ Language-independent tokenization learned from data (see the BPE sketch after this list):
  ◮ WordPiece: https://arxiv.org/pdf/1609.08144.pdf
  ◮ Byte-Pair Encoding (BPE): https://arxiv.org/pdf/1508.07909.pdf
  ◮ SentencePiece: https://www.aclweb.org/anthology/D18-2012.pdf
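To make the data-driven idea concrete, here is a sketch of the BPE merge-learning loop, in the spirit of the pseudocode in the BPE paper linked above; the toy vocabulary and the </w> end-of-word marker are illustrative conventions.

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count how often each adjacent symbol pair occurs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs

def merge_pair(pair, vocab):
    """Rewrite every word so the chosen pair becomes a single merged symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Toy corpus: each word is a space-separated symbol sequence with an end marker.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}

for _ in range(10):                      # learn 10 merges
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)     # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    print(best)                          # e.g. ('e', 's'), then ('es', 't'), ...
```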
Log-Linear Language Models
◮ Model parameters: probabilities q(x′|x)
◮ Training requires constrained optimization
  q* = argmax_{q : q(x′|x) ≥ 0 ∀x,x′ and Σ_{x′} q(x′|x) = 1 ∀x}  Σ_{i=1}^{N} Σ_{j=2}^{l_i+1} log q(x_j^{(i)}|x_{j−1}^{(i)})

◮ Though easy to solve, it is not clear how to develop more complex functions
◮ Brittle: function of raw n-gram identities
◮ Generalizing to unseen n-grams requires explicit smoothing (cumbersome)
◮ Design a feature representation φ(x1 . . . xn) ∈ R^d of any word sequence x1 . . . xn
◮ For example, φ might contain indicator features for the identities of the words in the sequence.
◮ Given any v ∈ R^D, we define softmax(v) ∈ [0, 1]^D to be a vector whose i-th entry is

  softmax(v)_i = e^{v_i} / Σ_{j=1}^{D} e^{v_j}

◮ Check nonnegativity and normalization
◮ Softmax transforms any length-D vector into a distribution (a short sketch follows below)
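A short sketch of softmax; the max-subtraction step is a standard numerical-stability trick added here (it leaves the output unchanged, since shifting v by a constant cancels in the ratio).

```python
import numpy as np

def softmax(v):
    """softmax(v)_i = exp(v_i) / sum_j exp(v_j), computed stably."""
    shifted = v - np.max(v)     # shift-invariance: same output, no overflow
    e = np.exp(shifted)
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, -1.0]))
print(p)        # nonnegative entries
print(p.sum())  # 1.0: a valid distribution over D = 3 outcomes
```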
◮ Model parameters: a weight vector w_x ∈ R^d for each word x ∈ V, stacked into W ∈ R^{|V|×d}
◮ Given a context x1 . . . xn, defines a conditional distribution over V by

  q(x|x1 . . . xn; W) = softmax(Wφ(x1 . . . xn))_x = e^{w_x · φ(x1 . . . xn)} / Σ_{x′∈V} e^{w_{x′} · φ(x1 . . . xn)}

◮ Reason it’s called log-linear: log q(x|x1 . . . xn; W) = w_x · φ(x1 . . . xn) − log Σ_{x′∈V} e^{w_{x′} · φ(x1 . . . xn)}, which is linear in the features up to the normalization term
◮ Assume N samples (x_1^{(i)} . . . x_n^{(i)}, x^{(i)}) of context-word pairs; find

  W* = argmax_{W ∈ R^{|V|×d}} J(W)   where   J(W) = (1/N) Σ_{i=1}^{N} log q(x^{(i)}|x_1^{(i)} . . . x_n^{(i)}; W)

◮ Unlike n-gram models, there is no closed-form solution for W*
◮ But this optimization problem is more “standard”: continuous and unconstrained
◮ It can be checked that J(W) is concave, so gradient ascent will get us W* (a code sketch follows below)
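A minimal end-to-end sketch of this training procedure; the random features standing in for φ(x1 . . . xn), the dimensions, the learning rate, and the step count are all arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, N = 5, 8, 100             # vocabulary size, feature dim, sample count

Phi = rng.normal(size=(N, d))   # row i stands in for phi(x_1 ... x_n) of sample i
targets = rng.integers(0, V, size=N)  # target word ids x^(i)

W = np.zeros((V, d))            # parameters W in R^{|V| x d}
lr = 0.5

def softmax_rows(S):
    S = S - S.max(axis=1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)

for _ in range(200):
    Q = softmax_rows(Phi @ W.T)                    # Q[i, x] = q(x | context_i; W)
    J = np.log(Q[np.arange(N), targets]).mean()    # J(W) = avg log-likelihood
    onehot = np.zeros((N, V))
    onehot[np.arange(N), targets] = 1.0
    grad = (onehot - Q).T @ Phi / N                # gradient of the concave J(W)
    W += lr * grad                                 # gradient ascent step

print(J)  # climbs toward the maximum achievable log-likelihood
```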