CS224N NLP
Bill MacCartney Gerald Penn Winter 2011
Borrows slides from Chris Manning, Bob Carpenter, Dan Klein, Roger Levy, Josh Goodman, Dan Jurafsky
Speech Recognition: Acoustic Waves
Human speech generates a wave – like a loudspeaker moving
[Waveform figure: phone labels "s p ee ch l a b"; zoomed view of the "l" to "a" transition]
Graphs from Simon Arnfield’s web tutorial on speech, Sheffield: http://www.psyc.leeds.ac.uk/research/cogn/speech/tutorial/
Acoustic Sampling
signal processing – it lets you see formants
Frames: a ~25 ms window every 10 ms (overlapping)
Result: acoustic feature vectors a1, a2, a3, … (after transformation, numbers in roughly R^14)
Spectral Analysis
– sampling at ~8 kHz phone, ~16 kHz mic (kHz=1000 cycles/sec)
– darkness indicates energy at each frequency – hundreds to thousands of frequency samples
[Spectrogram figure for "speech lab": phone labels s p ee ch l a b; axes labeled frequency and amplitude]
The Speech Recognition Problem
– Build a generative model of the encoding: we started with English words, they were encoded as an audio signal, and we now wish to decode.
– Find the most likely sequence w of "words" given the sequence of acoustic observation vectors a.
– Use Bayes’ rule to create a generative model and then decode:
  argmax_w P(w|a) = argmax_w P(a|w) P(w) / P(a) = argmax_w P(a|w) P(w)
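To make the decision rule concrete, here is a minimal Python sketch of noisy-channel decoding over a toy candidate set; the candidate transcriptions and all probabilities are made up for illustration.

```python
# Toy sketch of noisy-channel decoding: pick the word sequence w maximizing
# P(a|w) * P(w); P(a) is constant across candidates and can be dropped.

candidates = {
    # w                     P(a|w): acoustic model score for the observed audio a (made up)
    "recognize speech":     1e-5,
    "wreck a nice beach":   3e-5,
}
language_model = {
    # w                     P(w): prior probability of the word sequence (made up)
    "recognize speech":     1e-6,
    "wreck a nice beach":   1e-9,
}

best_w = max(candidates, key=lambda w: candidates[w] * language_model[w])
print(best_w)   # "recognize speech": the LM overrides the slightly better acoustic score
```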
A probabilistic theory
“… guessed and inferred considerable about, the powerful new mechanized methods in cryptography—methods which I believe succeed even when one does not know what language has been coded—one naturally wonders if the problem of translation could conceivably be treated as a problem in cryptography. When I look at an article in Russian, I say: ‘This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.’ ”
(Warren Weaver, 1947)
[Noisy channel diagram: source P(e) → e → channel P(f|e) → f → decoder → best ê]
ê = argmax_e P(e|f) = argmax_e P(f|e) P(e)
P(e): Language Model    P(f|e): Translation Model
P(text | strokes) ∝ P(text) P(strokes | text)
P(text | pixels) ∝ P(text) P(pixels | text)
P(text | typos) ∝ P(text) P(typos | text)
world?
regarded as a different existence from men unfairly.
questions
from this data, and can then do NLP tasks?
recombined in other ways
the chain rule
linear history (a Markov assumption)
Chain rule: P(w1 w2 … wn) = ∏_i P(wi | w1 w2 … wi−1)
Markov assumption (limit to the last k words): P(w1 w2 … wn) ≈ ∏_i P(wi | wi−k … wi−1)
STOP symbol last. (Why?)
never, consider, fall, bungled, davison, that, obtain, price, lines, the, to, sass, the, the, further, board, a, details, machinists, the, companies, which, rivals, an, because, longer, oakes, percent, a, they, three, edward, it, currier, an, within, in, three, wrote, is, you, s., longer, institute, dentistry, pay, however, said, possible, to, rooms, hiding, eggs, approximate, financial, canada, the, so, workers, advancers, half, between, nasdaq]
Unigram model: P(w1 w2 … wn) = ∏_i P(wi)
Generative picture: w1  w2  …  wn−1  STOP (each word generated independently)
gurria, mexico, 's, motion, control, proposal, without, permission, from, five, hundred, fifty, five, yen]
seconds, at, the, greatest, play, disingenuous, to, be, reset, annually, the, buy, out,
believe, chemical, prices, undoubtedly, will, be, as, much, is, scheduled, to, conscientious, teaching]
Bigram model: P(w1 w2 … wn) = ∏_i P(wi | wi−1)
Generative picture: START → w1 → w2 → … → wn−1 → STOP
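A minimal sketch of this bigram model in Python, with MLE estimates and START/STOP padding; the two-sentence corpus is made up and there is no smoothing, so unseen bigrams get probability 0.

```python
from collections import defaultdict

# Count bigrams and contexts over a toy corpus, padded with <s> and </s>.
corpus = [
    ["I", "want", "Chinese", "food"],
    ["I", "want", "to", "eat"],
]

bigram_counts = defaultdict(int)
context_counts = defaultdict(int)
for sentence in corpus:
    tokens = ["<s>"] + sentence + ["</s>"]
    for prev, cur in zip(tokens, tokens[1:]):
        bigram_counts[(prev, cur)] += 1
        context_counts[prev] += 1

def p_bigram(cur, prev):
    """MLE estimate: P(cur | prev) = c(prev, cur) / c(prev)."""
    return bigram_counts[(prev, cur)] / context_counts[prev] if context_counts[prev] else 0.0

def sentence_prob(sentence):
    """P(w1 ... wn) = product over i of P(wi | wi-1), including the STOP transition."""
    tokens = ["<s>"] + sentence + ["</s>"]
    prob = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        prob *= p_bigram(cur, prev)
    return prob

print(sentence_prob(["I", "want", "Chinese", "food"]))   # 0.5: only P(Chinese | want) = 1/2 is below 1
```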
“The frog sat on the rock in the hot sun eating a ___.” “The student sat on the rock in the hot sun eating a ___.”
involving, IRS, leaders, and, transportation, prices, .]
from, slightly, more, than, 12, stocks, .]
Maximum likelihood estimation: choose the parameters which maximize P(Training set | Model)
restaurants close by
are available
discriminate correct utterances from noisy inputs?
model’s performance on some new data
how our model performs on data we haven’t seen
training set. Preferably totally unseen/unused.
good text, not mistake text
Correct answer: Andy saw a part of the movie
Recognizer output: And he saw apart of the movie
Word Error Rate = (insertions + deletions + substitutions) / (true sentence size)
WER: 4/7 = 57%
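A small sketch of the WER computation via word-level edit distance (dynamic programming); it reproduces the 4/7 figure for the example above.

```python
def word_error_rate(reference, hypothesis):
    """WER = (insertions + deletions + substitutions) / len(reference),
    computed via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("Andy saw a part of the movie",
                      "And he saw apart of the movie"))   # 4/7 ≈ 0.57
```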
to the corpus text
When I order pizza, I wipe off the ____
Many children are allergic to ____
I saw a ____
H(S|M) = −(1/|S|) log2 P_M(S) = −( ∑_i log2 P_M(si) ) / ( ∑_i |si| )
grease 0.5
sauce 0.4
dust 0.05
…
mice 0.0001
…
the 1e-100
H(S|M) = −(1/n) ∑_j log2 P_M(wj | wj−1)
so good
chosen IID and uniformly
helpful as a first thing to measure and optimize on)
don’t count it as a symbol when taking these averages.
PP(S|M) = 2^H(S|M) = [ ∏_{i=1..n} P_M(wi | h) ]^(−1/n)
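A minimal sketch of the cross-entropy and perplexity computation on a test sequence, given the model's probability for each token in its context; the four per-token probabilities are made up.

```python
import math

token_probs = [0.2, 0.1, 0.25, 0.05]    # P_M(w_i | h_i) for each test token (made up)

n = len(token_probs)
cross_entropy = -sum(math.log2(p) for p in token_probs) / n   # H(S|M), in bits per token
perplexity = 2 ** cross_entropy                               # equals (product of p_i) ** (-1/n)

print(cross_entropy, perplexity)    # ≈ 2.99 bits, perplexity ≈ 7.95
```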
Bigrams: I want · want to · to eat · eat Chinese · Chinese food · food </s>
Tom Sawyer (71,370 words)
and: 2972, a: 1775, to: 1725, of: 1440, was: 1161, it: 1027, in: 906, that: 877, he: 877, I: 783, his: 772, you: 686, Tom: 679
Word frequency : number of word types with that frequency
1: 3993, 2: 1292, 3: 664, 4: 410, 5: 243, 6: 199, 7: 172, 8: 131, 9: 82, 10: 91, 11–50: 540, 51–100: 99, >100: 102
this property (try it!)
P(w | denied the): allegations 3, reports 2, claims 1, request 1   (7 total)
[Bar charts: P(w | denied the) before and after smoothing; seen words (allegations, reports, claims, request) give up some mass to unseen words (attack, man, …)]
P(w | denied the) after smoothing: allegations 2.5, reports 1.5, claims 0.5, request 0.5, other 2   (7 total)
P(w | denied the): allegations 3, reports 2, claims 1, speculation 1, …, request 1   (13 total)
P(w | affirmed the): award 1
have to read the textbook!
uniform Dirichlet prior
P_ADD−δ(w | w−1) = ( c(w−1, w) + δ·(1/V) ) / ( c(w−1) + δ )
c : number of word tokens in training data
c(w) : count of word w in training data
c(w−1, w) : joint count of the (w−1, w) bigram
V : total vocabulary size (assumed known)
Nk : number of word types with count k
[Laplace]
as having count r* < r
P(w | h) = ( c(w, h) + 1 ) / ( c(h) + V )
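A minimal sketch of the add-one (Laplace) estimate above; the counts and vocabulary size are toy numbers echoing the "denied the" example.

```python
def p_laplace(w, h, joint_counts, history_counts, V):
    """Add-one (Laplace) smoothed estimate: P(w | h) = (c(w, h) + 1) / (c(h) + V)."""
    return (joint_counts.get((h, w), 0) + 1) / (history_counts.get(h, 0) + V)

# Toy numbers: "denied the" seen 7 times, "denied the allegations" 3 times, V = 10,000 word types.
print(p_laplace("allegations", "denied the",
                {("denied the", "allegations"): 3}, {"denied the": 7}, V=10_000))
# (3 + 1) / (7 + 10000) ≈ 0.0004 -- add-one shifts a lot of mass toward unseen words
```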
vocabulary size of 20,000 words
across 10 times
again, most, in)
a) 5/10 b) 6/11 c) 6/15 d) 6/16 e) 6/20010
in N samples?
doubletons on average might occur 1.78 times)
will really treat it as if I saw it fewer than n times
cases, I will condition on less of my context than in other cases”
contexts that you haven’t learned much about
bigram, otherwise unigram
use less-sparse (n−1)-gram estimates
approach
constant as a function of the conditioning context
P̂(w | w−1) = [1 − λ] P(w | w−1) + λ P(w)
P̂(w | w−1) = [1 − λ(w−1)] P(w | w−1) + λ(w−1) P(w)    (the weight λ can depend on the context w−1)
smoothing, we set them to maximize the (log-)likelihood of held-out data
Training Data Held-Out Data Test Data
LL(w1 … wn | M(λ1 … λk)) = ∑_i log P_M(λ1…λk)(wi | wi−1)
Choose the λs to maximize this log-likelihood on held-out data.
P̂(w | w−1) = [1 − λ] P(w | w−1) + λ P(w)
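A minimal sketch of fitting the interpolation weight on held-out data by grid search; p_bigram and p_unigram stand for already-estimated component models, and the toy usage numbers are made up.

```python
import math

def interpolated(w, prev, lam, p_bigram, p_unigram):
    """P_hat(w | prev) = (1 - lam) * P(w | prev) + lam * P(w)."""
    return (1 - lam) * p_bigram(w, prev) + lam * p_unigram(w)

def held_out_log_likelihood(pairs, lam, p_bigram, p_unigram):
    """LL = sum over held-out (prev, w) pairs of log P_hat(w | prev)."""
    return sum(math.log(interpolated(w, prev, lam, p_bigram, p_unigram))
               for prev, w in pairs)

def tune_lambda(pairs, p_bigram, p_unigram, grid=None):
    """Grid search over candidate lambda values; return the one with highest held-out LL."""
    grid = grid or [i / 20 for i in range(1, 20)]   # 0.05, 0.10, ..., 0.95
    return max(grid, key=lambda lam: held_out_log_likelihood(pairs, lam, p_bigram, p_unigram))

# Toy usage with made-up component models and two held-out (prev, w) pairs.
uni = {"food": 0.05, "the": 0.2}
bi = {("want", "food"): 0.5}
best = tune_lambda([("want", "food"), ("want", "the")],
                   p_bigram=lambda w, prev: bi.get((prev, w), 0.0),
                   p_unigram=lambda w: uni.get(w, 0.0))
print(best)   # ~0.55 on these toy numbers
```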
You have caught 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel = 18 fish
How likely is it that the next fish caught is a new species (e.g., catfish or bass)?
How likely is it that the next fish caught is a trout?
[Slide adapted from Josh Goodman]
training?
times in training?
tokens to be those with training count k
[Diagram: each count class is re-estimated from the class one level up: N1 N2 N3 … N3511 … N4417 map down to N0 N1 N2 … N3510 … N4416]
Replace the empirical Nk with a best-fit power law once the counts of counts become unreliable
P_KATZ(w | w−1) = c*(w, w−1) / ∑_w' c(w', w−1)    if c(w, w−1) > 0
P_KATZ(w | w−1) = α(w−1) P(w)                      otherwise
Count in 22M Words    Actual c* (Next 22M)    GT's c*
1            0.448    0.446
2            1.25     1.26
3            2.24     2.24
4            3.23     3.24
Mass on New  9.2%     9.2%
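A minimal sketch of the Good-Turing adjusted counts c* = (c + 1) N(c+1) / N(c), run on the fishing example (assuming the standard catch of 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel).

```python
from collections import Counter

counts = {"carp": 10, "perch": 3, "whitefish": 2, "trout": 1, "salmon": 1, "eel": 1}
N = sum(counts.values())          # 18 tokens
Nk = Counter(counts.values())     # N_k = number of types seen exactly k times

def good_turing_cstar(c):
    """Adjusted count c* = (c + 1) * N_{c+1} / N_c (undefined when N_c = 0, and noisy
    when N_{c+1} = 0, which is why real implementations smooth the N_k curve first)."""
    return (c + 1) * Nk[c + 1] / Nk[c]

print(Nk[1] / N)                   # P(next fish is a new species) = N1/N = 3/18
print(good_turing_cstar(1) / N)    # P(next fish is a trout) = c*(1)/N = (2 * 1/3)/18 ≈ 0.037
```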
trigrams?
us
different words in the vocabulary? Use the (n – 1)-gram estimates to tell us
training (i.e., not in our vocabulary)
P_CONTINUATION(w) = |{w−1 : c(w, w−1) > 0}| / |{(w, w−1) : c(w, w−1) > 0}|
Count in 22M Words    Actual c* (Next 22M)    GT's c*
1    0.448    0.446
2    1.25     1.26
3    2.24     2.24
4    3.23     3.24
P_KN(w | w−1) = ( c(w, w−1) − D ) / ∑_w' c(w', w−1) + α(w−1) P_CONTINUATION(w)
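A minimal sketch of the Kneser-Ney estimate in its interpolated form, taking α(w−1) to be the probability mass freed by the absolute discount; the tiny corpus and D = 0.75 are made up.

```python
D = 0.75
pairs = [("San", "Francisco"), ("San", "Francisco"), ("San", "Francisco"),
         ("San", "Francisco"), ("San", "Diego"),
         ("the", "glasses"), ("my", "glasses"), ("new", "glasses")]

bigram_counts = {}
for prev, w in pairs:
    bigram_counts[(prev, w)] = bigram_counts.get((prev, w), 0) + 1

def p_continuation(w):
    """|{w-1 : c(w, w-1) > 0}| / |{(w, w-1) : c(w, w-1) > 0}|: how many contexts w completes."""
    numer = sum(1 for (_, w2) in bigram_counts if w2 == w)
    return numer / len(bigram_counts)

def p_kn(w, prev):
    """(c(w, prev) - D)+ / c(prev)  +  alpha(prev) * P_CONTINUATION(w); assumes prev was seen."""
    context_count = sum(c for (p, _), c in bigram_counts.items() if p == prev)  # c(prev)
    seen_types = sum(1 for (p, _) in bigram_counts if p == prev)                # distinct continuations of prev
    discounted = max(bigram_counts.get((prev, w), 0) - D, 0) / context_count
    alpha = D * seen_types / context_count      # mass freed by discounting the seen bigrams
    return discounted + alpha * p_continuation(w)

# "Francisco" is frequent here but only ever follows "San"; "glasses" follows three
# different contexts, so its continuation probability is higher.
print(p_continuation("Francisco"), p_continuation("glasses"))   # 0.2 vs 0.6
print(p_kn("Francisco", "San"))                                  # (4 - 0.75)/5 + 0.3 * 0.2 = 0.71
```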
context
there’s enough data)
worth the cost (which is more than it seems, due to how speech recognizers are constructed)
count adjustment
Good-Turing, held-out estimation, Witten-Bell
lower-order models
for tons of graphs!
[Graphs from Joshua Goodman]
[Graph: y-axis Entropy, range roughly 6 to 10 bits]
P_CACHE(w | history) = λ P(w | w−1, w−2) + (1 − λ) · c(w ∈ history) / |history|
P_SKIP(w | w−1, w−2) = λ1 P(w | w−1, w−2) + λ2 P(w | w−1, __) + λ3 P(w | __, w−2)
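A minimal sketch of the cache-model interpolation; p_trigram stands for an already-trained n-gram model, and the history text, lambda, and probabilities are made up.

```python
from collections import Counter

def p_cache(w, w1, w2, history, p_trigram, lam=0.9):
    """P_CACHE(w | history) = lam * P(w | w-1, w-2)
                              + (1 - lam) * count(w in history) / len(history)"""
    cache = Counter(history)
    cache_prob = cache[w] / len(history) if history else 0.0
    return lam * p_trigram(w, w1, w2) + (1 - lam) * cache_prob

# Toy usage: a word already used in this document gets a boost.
history = "the patient was given insulin because the patient is diabetic".split()
print(p_cache("insulin", "given", "was", history, p_trigram=lambda w, w1, w2: 1e-5))
# 0.9 * 1e-5 + 0.1 * (1/10) ≈ 0.01
```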
[Can we work out right size for it??]
(e.g., numbers & proper names)
model thus – it’s a union of basic outcomes)
can use spelling models for them, but people usually don’t