 
Lecture 2: N-gram
Kai-Wei Chang
CS @ University of Virginia
kw@kwchang.net
Course webpage: http://kwchang.net/teaching/NLP16
CS 6501: Natural Language Processing

This lecture
• Language models
  • What are N-gram models?
• How to use probabilities
  • What does P(Y|X) mean?
  • How can I manipulate it?
  • How can I estimate its value in practice?

What is a language model?
• Probability distributions over sentences (i.e., word sequences): $P(W) = P(w_1 w_2 w_3 w_4 \ldots w_k)$
• Can use them to generate strings: $P(w_k \mid w_1 w_2 w_3 \ldots w_{k-1})$
• Rank possible sentences
  • P("Today is Tuesday") > P("Tuesday Today is")
  • P("Today is Tuesday") > P("Today is Virginia")

Language model applications
• Context-sensitive spelling correction
• Autocomplete
• Smart Reply
• Language generation: https://pdos.csail.mit.edu/archive/scigen/

Bag-of-Words with N-grams
• N-grams: a contiguous sequence of n tokens from a given piece of text
  http://recognize-speech.com/language-model/n-gram-model/comparison

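The definition above can be sketched in a few lines of Python. This is a minimal illustration, assuming the text has already been split into tokens on whitespace; the function name `ngrams` and the example sentence are chosen here for illustration only.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token windows of `tokens` as tuples."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # A sequence of length L has L - n + 1 windows (none if L < n).
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "today is a sunny day".split()
print(ngrams(tokens, 1))  # unigrams
print(ngrams(tokens, 2))  # bigrams
print(ngrams(tokens, 3))  # trigrams
```

A sequence shorter than n simply yields no n-grams.
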
N-Gram Models
• Unigram model: $P(w_1)\, P(w_2)\, P(w_3) \cdots P(w_n)$
• Bigram model: $P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_2) \cdots P(w_n \mid w_{n-1})$
• Trigram model: $P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_2, w_1) \cdots P(w_n \mid w_{n-1}, w_{n-2})$
• N-gram model: $P(w_1)\, P(w_2 \mid w_1) \cdots P(w_n \mid w_{n-1}, w_{n-2}, \ldots, w_{n-N+1})$

Random language via n-gram
• http://www.cs.jhu.edu/~jason/465/PowerPoint/lect01,3tr-ngram-gen.pdf
• Behind the scenes: probability theory

Sampling with replacement
1. P([symbol]) = ?
2. P([symbol]) = ?
3. P(red, [symbol]) = ?
4. P(blue) = ?
5. P(red | [symbol]) = ?
6. P([symbol] | red) = ?
7. P([symbol]) = ?
8. P([symbol]) = ?
9. P(2 x [symbol], 3 x [symbol], 4 x [symbol]) = ?

Sampling words with replacement
Example from Julia Hockenmaier, Intro to NLP

Implementation: how to sample?
• Sample from a discrete distribution $p(X)$
  • Assume $n$ outcomes in the event space $X$
1. Divide the interval [0,1] into $n$ intervals according to the probabilities of the outcomes
2. Generate a random number $r$ between 0 and 1
3. Return the outcome $x_i$ whose interval $r$ falls into

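The three steps above can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name `sample`, the `rng` parameter, and the toy word distribution are assumptions made for the example.

```python
import random
from collections import Counter
from itertools import accumulate


def sample(outcomes, probs, rng=random):
    """Draw one outcome by locating a uniform random number in [0, 1)."""
    cumulative = list(accumulate(probs))  # step 1: right edges of the n intervals
    r = rng.random()                      # step 2: r is uniform in [0, 1)
    for outcome, edge in zip(outcomes, cumulative):
        if r < edge:                      # step 3: r falls into this interval
            return outcome
    return outcomes[-1]  # guards against round-off in the last edge


# Toy distribution (hypothetical values) and an empirical check.
words = ["the", "cat", "sat"]
probs = [0.5, 0.3, 0.2]
draws = Counter(sample(words, probs) for _ in range(10_000))
print(draws)
```

With enough draws, the relative frequencies should approach the given probabilities. Python's standard library offers `random.choices(population, weights=...)` for the same purpose.
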
Conditional on the previous word
Example from Julia Hockenmaier, Intro to NLP

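Sampling conditional on the previous word can be sketched as below. The three-sentence corpus and the boundary markers `[BOS]` and `[EOS]` are assumptions made for illustration; they do not come from the slides.

```python
import random
from collections import Counter, defaultdict

BOS, EOS = "[BOS]", "[EOS]"  # hypothetical sentence-boundary markers


def bigram_counts(sentences):
    """Count how often each word follows each previous word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = [BOS] + sentence.split() + [EOS]
        for prev, cur in zip(words, words[1:]):
            counts[prev][cur] += 1
    return counts


def generate(counts, rng=random, max_len=20):
    """Sample a sentence one word at a time, conditioning on the previous word."""
    prev, output = BOS, []
    for _ in range(max_len):
        followers = counts[prev]
        # Draw the next word in proportion to how often it followed `prev`.
        cur = rng.choices(list(followers), weights=list(followers.values()))[0]
        if cur == EOS:
            break
        output.append(cur)
        prev = cur
    return " ".join(output)


corpus = ["today is tuesday", "today is a sunny day", "tuesday is a day"]  # toy data
counts = bigram_counts(corpus)
print(generate(counts, random.Random(0)))
```

Each draw reuses the weighted sampling idea from the unconditional case; only the distribution changes with the previous word.
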
Recap: probability theory
• Conditional probability
  • P(blue | [symbol]) = ?
  • $P(B \mid A) = P(B, A) / P(A)$
• Bayes' rule: $P(B \mid A) = \frac{P(A \mid B)\, P(B)}{P(A)}$
  • Verify: P(red | [symbol]), P([symbol] | red), P([symbol]), P(red)
• Independent: $P(B \mid A) = P(B)$
  • Prove: $P(A, B) = P(A)\, P(B)$

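The "Prove" item follows in two steps from the definition of conditional probability, assuming $P(A) > 0$ so that the conditional is defined:

```latex
\begin{align*}
P(A, B) &= P(B \mid A)\, P(A) && \text{(definition of conditional probability)} \\
        &= P(B)\, P(A)        && \text{(independence: } P(B \mid A) = P(B)\text{)}
\end{align*}
```
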
The Chain Rule
• The joint probability can be expressed in terms of the conditional probability: P(X, Y) = P(X | Y) P(Y)
• More variables: P(X, Y, Z) = P(X | Y, Z) P(Y, Z) = P(X | Y, Z) P(Y | Z) P(Z)
• $P(X_1, X_2, \ldots, X_n) = P(X_1)\, P(X_2 \mid X_1)\, P(X_3 \mid X_2, X_1) \cdots P(X_n \mid X_1, \ldots, X_{n-1}) = P(X_1) \prod_{i=2}^{n} P(X_i \mid X_1, \ldots, X_{i-1})$

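As a sanity check, the chain rule can be verified numerically on a small joint distribution. The distribution below is hypothetical (arbitrary weights over three binary variables), and exact fractions are used to avoid floating-point noise.

```python
from fractions import Fraction
from itertools import product

# Hypothetical joint distribution over three binary variables (X1, X2, X3).
weights = [1, 2, 3, 4, 5, 6, 7, 8]
total = sum(weights)
joint = {xs: Fraction(w, total) for xs, w in zip(product([0, 1], repeat=3), weights)}


def marginal(prefix):
    """P(first len(prefix) variables equal prefix), summing out the rest."""
    return sum(p for xs, p in joint.items() if xs[:len(prefix)] == prefix)


def chain_rule(xs):
    """P(X1) P(X2 | X1) P(X3 | X1, X2), each conditional as a ratio of marginals."""
    result = Fraction(1)
    for i in range(1, len(xs) + 1):
        result *= marginal(xs[:i]) / marginal(xs[:i - 1])
    return result


# The product of conditionals recovers the joint probability for every outcome.
assert all(chain_rule(xs) == joint[xs] for xs in joint)
```
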
Language model for text
• Probability distribution over sentences:
  $p(w_1 w_2 \ldots w_n) = p(w_1)\, p(w_2 \mid w_1)\, p(w_3 \mid w_1, w_2) \cdots p(w_n \mid w_1, w_2, \ldots, w_{n-1})$
  (Chain rule: from conditional probability to joint probability)
• Complexity: $O(V^{n^*})$, where $V$ is the vocabulary size
  • $n^*$: maximum sentence length
• 475,000 main headwords in Webster's Third New International Dictionary
• Average English sentence length is 14.3 words
• A rough estimate: $O(475000^{14})$
• How large is this? $\frac{475000^{14}}{8\,\text{bytes} \times 1024^4} \approx 3.38 \times 10^{66}$ TB
• We need independence assumptions!

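The size of the estimate is easy to check with arbitrary-precision integers. The sketch below evaluates the slide's expression exactly as written and, for comparison, the table size a bigram model would need; the variable names are chosen here for illustration.

```python
V = 475_000  # vocabulary size (headword count quoted above)
n = 14       # sentence length, rounded down from 14.3

full_history = V ** n  # entries needed to condition on the whole sentence so far
bigram = V ** 2        # entries in a bigram table, for comparison

# The slide's expression, evaluated exactly as written.
slide_estimate = full_history / (8 * 1024 ** 4)

print(f"V^{n} = {full_history:.3e}")
print(f"V^2 = {bigram:.3e}")
print(f"slide estimate = {slide_estimate:.3e}")
```

The gap between the two table sizes is the motivation for the independence assumptions.
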
Probability models
• Building a probability model:
  • defining the model (making independence assumptions)
  • estimating the model's parameters
  • using the model (making inference)
[Diagram: a trigram model, defined in terms of parameters like P("is"|"today"); the parameter values Θ give the definition of P]

Independence assumption
• Even though X and Y are not actually independent, we treat them as independent
• Makes the model compact (e.g., from $(100\text{k})^{14}$ to $(100\text{k})^{2}$)

Language model with N-gram
• The chain rule: $P(X_1, X_2, \ldots, X_n) = P(X_1)\, P(X_2 \mid X_1)\, P(X_3 \mid X_2, X_1) \cdots P(X_n \mid X_1, \ldots, X_{n-1})$
• An N-gram language model assumes each word depends only on the last N-1 words (Markov assumption)
• Example: trigram (3-gram)
  $P(w_n \mid w_1, \ldots, w_{n-1}) = P(w_n \mid w_{n-2}, w_{n-1})$
  $P(w_1, \ldots, w_n) = P(w_1)\, P(w_2 \mid w_1) \cdots P(w_n \mid w_{n-2}, w_{n-1})$
  P("Today is a sunny day") = P("Today") P("is"|"Today") P("a"|"is", "Today") … P("day"|"sunny", "a")

Unigram model

Bigram model
• Condition on the previous word

N-gram model

More examples
• Yoav's blog post: http://nbviewer.jupyter.org/gist/yoavg/d76121dfde2618422139
• 10-gram character-level LM, sample output:

    First Citizen:
    Nay, then, that was hers,
    It speaks against your other service:
    But since the youth of the circumstance be spoken:
    Your uncle and one Baptista's daughter.

    SEBASTIAN:
    Do I stand till the break off.

    BIRON:
    Hide thy head.

• 10-gram character-level LM, sample output:

    ~~/*
     * linux/kernel/time.c
     * Please report this on hardware.
     */
    void irq_mark_irq(unsigned long old_entries, eval);
    /*
     * Divide only 1000 for ns^2 -> us^2 conversion values don't overflow:
    seq_puts(m, "\ttramp: %pS",
     (void *)class->contending_point]++;
    if (likely(t->flags & WQ_UNBOUND)) {
    /*
     * Update inode information. If the
     * slowpath and sleep time (abs or rel)
     * @rmtp: remaining (either due
     * to consume the state of ring buffer size. */
     header_size - size, in bytes, of the chain.
     */
    BUG_ON(!error);
    } while (cgrp) {
    if (old) {
    if (kdb_continue_catastrophic;
    #endif

Questions?

Maximum likelihood estimation
"Best" means "data likelihood reaches maximum":
$\hat{\theta} = \arg\max_{\theta} P(X \mid \theta)$

Unigram language model $p(w \mid \theta)$, estimated from a document (a paper, total #words = 100):

    Word          Count   Estimate of p(w|θ)
    text          10      10/100
    mining        5       5/100
    association   3       3/100
    database      3       3/100
    algorithm     2       …
    …
    query         1       1/100
    efficient     1       …

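The estimates in the example are relative frequencies, which can be sketched as below. The function name `unigram_mle` is hypothetical; because the slide lists only some of the 100 words, the total is passed in explicitly rather than summed from the listed counts.

```python
from fractions import Fraction


def unigram_mle(counts, total=None):
    """Relative-frequency estimate: p(w) = c(w) / total number of words."""
    if total is None:
        total = sum(counts.values())
    return {w: Fraction(c, total) for w, c in counts.items()}


# Counts listed on the slide; the remaining words of the paper are not shown.
counts = {"text": 10, "mining": 5, "association": 3, "database": 3, "query": 1}
probs = unigram_mle(counts, total=100)
for word, p in probs.items():
    print(f"p({word}) = {p}")
```

`Fraction` prints each estimate in lowest terms, for example 10/100 as 1/10.
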
• Which bag of words is more likely to generate: aaaDaaaKoaaaa
[Figure: two bags of letters; extracted contents: a K a K a o o P D a a a a D F E b a E a n]

Parameter estimation
• General setting:
  • Given a (hypothesized and probabilistic) model that governs the random experiment
  • The model gives a probability of any data $p(X \mid \theta)$ that depends on the parameter $\theta$
  • Now, given actual sample data $X = \{x_1, \ldots, x_n\}$, what can we say about the value of $\theta$?
• Intuitively, take our best guess of $\theta$: "best" means "best explaining/fitting the data"
• Generally an optimization problem

Maximum likelihood estimation
• Data: a collection of words, $w_1, w_2, \ldots, w_n$
• Model: multinomial distribution $p(W)$ with parameters $\theta_i = p(w_i)$
• Maximum likelihood estimator: $\hat{\theta} = \arg\max_{\theta \in \Theta} p(W \mid \theta)$

  $p(W \mid \theta) = \binom{N}{c(w_1), \ldots, c(w_N)} \prod_{i=1}^{N} \theta_i^{c(w_i)} \propto \prod_{i=1}^{N} \theta_i^{c(w_i)}$

  $\Rightarrow \log p(W \mid \theta) = \sum_{i=1}^{N} c(w_i) \log \theta_i + \text{const}$

  $\hat{\theta} = \arg\max_{\theta \in \Theta} \sum_{i=1}^{N} c(w_i) \log \theta_i$

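One standard way to solve this maximization is with a Lagrange multiplier for the constraint that the parameters sum to one. The derivation below is a sketch under the assumption that $i$ ranges over the $N$ distinct word types and that every $c(w_i) > 0$; the multiplier $\lambda$ is introduced here and does not appear in the slides.

```latex
\begin{align*}
L(\theta, \lambda) &= \sum_{i=1}^{N} c(w_i) \log \theta_i
                      + \lambda \Bigl( \sum_{i=1}^{N} \theta_i - 1 \Bigr) \\
\frac{\partial L}{\partial \theta_i} &= \frac{c(w_i)}{\theta_i} + \lambda = 0
  \;\Rightarrow\; \theta_i = -\frac{c(w_i)}{\lambda} \\
\sum_{i=1}^{N} \theta_i = 1 &\;\Rightarrow\; \lambda = -\sum_{i=1}^{N} c(w_i) \\
\hat{\theta}_i &= \frac{c(w_i)}{\sum_{j=1}^{N} c(w_j)}
\end{align*}
```

The result is the relative frequency of each word, which matches the count-over-total estimates in the unigram example.
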