SLIDE 1

Topics in Computational Linguistics

Week 5: N-grams and Language Models
Shu-Kai Hsieh

Lab of Ontologies, Language Processing and e-Humanities, GIL, National Taiwan University

March 28, 2014

SLIDE 2

Outline

1. N-grams model
   • Evaluation
   • Smoothing Techniques
2. Web-scaled N-grams
3. Related Topics
4. The Entropy of Natural Languages
5. Lab

SLIDE 3

Language models

  • Statistical (probabilistic) language models aim to compute either the probability of a sentence or sequence of words,

    P(S) = P(w_1, w_2, w_3, ..., w_n),

    or the probability of an upcoming word,

    P(w_n | w_1, w_2, w_3, ..., w_{n-1}),

    which turns out to be closely related to computing the probability of a sequence of words.

  • The N-gram model is one of the most important tools in speech and language processing.
  • It has varied applications: spelling checkers, MT, speech recognition, QA, etc.

SLIDE 4

Section 1: N-grams model

SLIDE 5

Simple n-gram model

  • Let's start by calculating P(S), say,

    P(S) = P(學, 語言, 很, 有趣)

    (a Chinese sentence meaning "learning languages is fun")

SLIDE 6

Review of Joint and Conditional Probability

  • Recall that the conditional probability of X given Y, P(X|Y), is defined in terms of the probability of Y, P(Y), and the joint probability of X and Y, P(X, Y):

    P(X|Y) = P(X, Y) / P(Y)

SLIDE 7

Review of Chain Rule of Probability

Conversely, the joint probability P(X, Y) can be expressed in terms of the conditional probability P(X|Y):

    P(X, Y) = P(X|Y) P(Y)

which leads to the chain rule:

    P(X_1, X_2, X_3, ..., X_n) = P(X_1) P(X_2|X_1) P(X_3|X_1, X_2) ... P(X_n|X_1, ..., X_{n-1})
                               = P(X_1) ∏_{i=2}^{n} P(X_i | X_1, ..., X_{i-1})

SLIDE 8

The Chain Rule Applied to the Joint Probability of Words in a Sentence

Chain rule of probability:

    P(S) = P(w_1^n) = P(w_1) P(w_2 | w_1) P(w_3 | w_1^2) ... P(w_n | w_1^{n-1})
         = ∏_{k=1}^{n} P(w_k | w_1^{k-1})
         = P(學) × P(語言 | 學) × P(很 | 學 語言) × P(有趣 | 學 語言 很)

SLIDE 9

How to Estimate these Probabilities?

  • Maximum Likelihood Estimation (MLE): simply count occurrences in a corpus and normalize the counts so that they lie between 0 and 1. (There are of course more sophisticated algorithms; MLE is sometimes called relative frequency estimation.)

Count and divide:

    P(嗎 | 學 語言 很 有趣) = Count(學 語言 很 有趣 嗎) / Count(學 語言 很 有趣)
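A minimal count-and-divide sketch in plain Python (the toy corpus and the names unigrams, bigrams, and p_mle are illustrative assumptions, not from the slides; the later sketches reuse these Counters):

    from collections import Counter

    # Toy corpus: two pre-tokenized Chinese sentences, for illustration only.
    corpus = [["學", "語言", "很", "有趣", "嗎"],
              ["學", "語言", "很", "有趣"]]

    unigrams = Counter()
    bigrams = Counter()
    for sent in corpus:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))

    def p_mle(w, prev):
        """MLE bigram estimate: P(w | prev) = Count(prev, w) / Count(prev)."""
        return bigrams[(prev, w)] / unigrams[prev]

    print(p_mle("很", "語言"))   # 2/2 = 1.0
    print(p_mle("嗎", "有趣"))   # 1/2 = 0.5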

SLIDE 10

Markov Assumption: Don’t look too far into the past

Simplified idea: instead of computing the probability of a word given its entire history, we approximate the history by just the last few words:

    P(嗎 | 學 語言 很 有趣) ≈ P(嗎 | 有趣)        (bigram)
or
    P(嗎 | 學 語言 很 有趣) ≈ P(嗎 | 很 有趣)     (trigram)

SLIDE 11

In other words

  • Bi-gram model: approximate the probability of a word given all the previous words, P(w_n | w_1^{n-1}), by the conditional probability given only the preceding word, P(w_n | w_{n-1}). Generalized to N-grams:

    P(w_n | w_1^{n-1}) ≈ P(w_n | w_{n-N+1}^{n-1})

  • Tri-gram: (your turn)
  • We can extend to trigrams, 4-grams, 5-grams, knowing that in general this is an insufficient model of language, because language has long-distance dependencies:

    我 在 一 個 非常 奇特 的 機緣巧合 之下 學 梵文
    ("By a very peculiar twist of fate, I learned Sanskrit"; the final verb phrase depends on words far back in the sentence.)

SLIDE 12

In other words

  • So given the bi-gram assumption for the probability of an individual word, we can compute the probability of the entire sentence as

    P(S) = P(w_1^n) ≈ ∏_{k=1}^{n} P(w_k | w_{k-1})

  • Recall the MLE equations (4.13)-(4.14) in the JM book. A code sketch follows below.
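Building on the toy counts from the MLE sketch above, a minimal illustration of the bigram sentence probability (note: no sentence-boundary symbols here, although JM's equations include them):

    def sentence_prob(sent, unigrams, bigrams):
        """P(S) ~ product of P(w_k | w_{k-1}) under the bigram assumption."""
        p = 1.0
        for prev, w in zip(sent, sent[1:]):
            if unigrams[prev] == 0 or bigrams[(prev, w)] == 0:
                return 0.0   # MLE assigns zero to unseen contexts or bigrams
            p *= bigrams[(prev, w)] / unigrams[prev]
        return p

    print(sentence_prob(["學", "語言", "很", "有趣"], unigrams, bigrams))  # 1.0 on the toy data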

SLIDE 13

Example: Language Modeling of Alice.txt

SLIDE 14

[Figure-only slide: no text content survived extraction.]

SLIDE 15

Exercise

  • Walk through the example of the Berkeley Restaurant Project sentences (JM pp. 90-91).
  • By the way, we usually do everything in log space to avoid underflow (adding is also faster than multiplying):

    log(p1 × p2 × p3) = log p1 + log p2 + log p3
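The same score computed in log space, as a sketch (again reusing the assumed unigrams and bigrams Counters from the earlier sketches):

    import math

    def sentence_logprob(sent, unigrams, bigrams):
        """log P(S): sum log probabilities instead of multiplying probabilities."""
        logp = 0.0
        for prev, w in zip(sent, sent[1:]):
            if bigrams[(prev, w)] == 0:
                return float("-inf")   # zero probability stays zero in log space
            logp += math.log(bigrams[(prev, w)] / unigrams[prev])
        return logp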

SLIDE 16

Google n-gram and Google Suggestion

SLIDE 17

Generating the Wall Street Journal vs Generating Shakespeare

SLIDE 18

Generating the Wall Street Journal vs Generating Shakespeare

SLIDE 19

  • Quadrigrams look like Shakespeare because they are Shakespeare.
  • The N-gram model is very sensitive to the training corpus!

Overfitting issue

  • N-grams only work well for word prediction if the test corpus looks like the training corpus, but in real life it often doesn't.
  • We need to train a more robust model that generalizes, e.g. to handle the zeros issue: things that never occur in the training set but do occur in the test set.

SLIDE 20

Section 1: N-grams model (Evaluation)

SLIDE 21

Evaluating n-gram models

How good is our model? How can we make it better (more robust)?

  • N-gram language models are evaluated by separating the corpus into a training set and a test set, training the model on the training set, and evaluating it on the test set. An evaluation metric tells us how well our model does on the test set.
  • Extrinsic (in vivo) evaluation: embed the model in an application and measure how much the application improves.
  • Intrinsic evaluation: perplexity (2^H, where H is the entropy of the language model on a test set) is used to compare language models.

SLIDE 22

Evaluating the N-gram Model

But the model relies heavily on the corpus it was trained on, and thus often results in overfitting!

Example

  • Given a vocabulary of 20,000 types, the potential number of bigrams is 20,000^2 = 400,000,000, and with tri-grams it amounts to the astronomical figure of 20,000^3. No corpus yet has the size to cover all the corresponding word combinations.
  • MLE gives no hint on how to estimate the probability of the unseen combinations.
  • Here we use smoothing (or discounting) techniques to estimate the probabilities of unseen n-grams, presumably because a distribution without zeros is smoother than one with zeros.

SLIDE 23

Perplexity

  • The best language model is one that best predicts an unseen test set (i.e., gives the highest P(sentence)).
  • Perplexity is defined as the inverse probability of the test set, normalized by the number of words:

    PP(W) = P(w_1 w_2 ... w_N)^(-1/N)
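A sketch of perplexity built on the sentence_logprob helper from the log-space example above (the helper names are assumptions, not from the slides):

    import math

    def perplexity(test_sentences, unigrams, bigrams):
        """PP = exp(-(1/N) * total log probability of the test set)."""
        total_logp, n_predictions = 0.0, 0
        for sent in test_sentences:
            total_logp += sentence_logprob(sent, unigrams, bigrams)
            n_predictions += len(sent) - 1   # one bigram prediction per word pair
        return math.exp(-total_logp / n_predictions)

Lower perplexity means the model is less "surprised" by the test set.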

SLIDE 24

The intuition of smoothing (from Dan Klein)

SLIDE 25

Smoothing Techniques

Smoothing n-gram probabilities

  • Sparse data: the corpus is not big enough to cover all the bigrams with realistic estimates.
  • Smoothing algorithms provide a better way of estimating the probability of n-grams than Maximum Likelihood Estimation.

SLIDE 26

Smoothing Techniques

  • Laplace Smoothing (a.k.a. the add-one method)
  • Interpolation
  • Backoff
  • Good-Turing Estimation (Discounting)
  • Kneser-Ney Smoothing

SLIDE 27

Laplace Smoothing

  • Pretend we saw each word one more time than we actually did.
  • Re-estimate by simply adding one to all the counts:

    P_Laplace(w_n | w_{n-1}) = (C(w_{n-1} w_n) + 1) / (C(w_{n-1}) + V),   where V is the vocabulary size

  • Read the BeRP examples (JM pp. 99-100). A code sketch follows below.
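A minimal add-one sketch over the toy counts from the earlier sketches (p_laplace and the use of len(unigrams) as V are illustrative choices):

    def p_laplace(w, prev, unigrams, bigrams, vocab_size):
        """Add-one bigram estimate: (C(prev, w) + 1) / (C(prev) + V)."""
        return (bigrams[(prev, w)] + 1) / (unigrams[prev] + vocab_size)

    V = len(unigrams)                                    # vocabulary size of the toy corpus
    print(p_laplace("學", "有趣", unigrams, bigrams, V))  # nonzero even for an unseen bigram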

SLIDE 28

Laplace Smoothing: Comparing with Raw Bigram Counts

SLIDE 29

Laplace Smoothing: It's a Blunt Estimate

  • Too much probability mass is moved to all the zeros.
  • The fix upstages the estimate (喧賓奪主): to deal with the large number of zeros, the count for "Chinese food" can drop by a factor of 10!

SLIDE 30

(Katz) Backoff and Interpolation

Intuition

Sometimes it helps to use less context: condition on less context for contexts you haven't learned much about.

  • Backoff and interpolation are two further strategies that utilize n-grams of variable length.
  • Backoff: use the trigram if you have good evidence, otherwise the bigram, otherwise the unigram.
  • Interpolation: mix unigram, bigram, and trigram estimates.

SLIDE 31

Katz Back-off

  • The idea is to use the frequency of the longest available n-gram, and if no such n-gram is available, to back off to the (n-1)-gram, then to the (n-2)-gram, and so on.
  • If n = 3, we first try trigrams, then bigrams, and finally unigrams.

SLIDE 32

P∗ and α?

  • P∗: the discounted probability (rather than the MLE probability), computed with a discounting method such as Good-Turing.
  • α: the normalizing factor that distributes the left-over probability mass to the lower-order n-grams.
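Putting P∗ and α together, the Katz backoff estimate for bigrams takes roughly the following standard form (a reconstruction based on JM, since the slide's equation did not survive extraction):

    P_katz(w_n | w_{n-1}) = P∗(w_n | w_{n-1})       if C(w_{n-1} w_n) > 0
                          = α(w_{n-1}) P(w_n)        otherwise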

SLIDE 33

Linear Interpolation

Linearly combine the higher-order and lower-order models.

  • Simple interpolation:

    P̂(w_n | w_{n-2} w_{n-1}) = λ_1 P(w_n | w_{n-2} w_{n-1}) + λ_2 P(w_n | w_{n-1}) + λ_3 P(w_n),   where Σ_i λ_i = 1

  • Lambdas conditional on context: make each λ depend on the preceding words. A code sketch follows below.
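A minimal two-way interpolation sketch with fixed lambdas (the 0.7/0.3 weights are illustrative; in practice the lambdas are tuned on held-out data; unigrams and bigrams are the toy Counters from the first sketch):

    def p_interp(w, prev, unigrams, bigrams, total, lambdas=(0.7, 0.3)):
        """Linearly mix the bigram and unigram MLE estimates."""
        l_bi, l_uni = lambdas
        p_bi = bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0
        p_uni = unigrams[w] / total
        return l_bi * p_bi + l_uni * p_uni

    total_tokens = sum(unigrams.values())
    print(p_interp("嗎", "有趣", unigrams, bigrams, total_tokens))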

SLIDE 34

Advanced Discounting Techniques

Intuition

Use the count of things you've seen once to help estimate the count of things you've never seen.

  • Good-Turing
  • Witten-Bell
  • Kneser-Ney

SLIDE 35

Good-Turing Smoothing: Notations

  • A word or N-gram (or any event) that occurs once is called a singleton, or a hapax legomenon.
  • N_c: the number of things we've seen c times, i.e., the frequency of frequency c.

Example (in terms of bigrams)

N_0 is the number of bigrams with count 0, N_1 the number of bigrams with count 1 (singletons), etc.
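The core Good-Turing re-estimate uses these N_c counts as c∗ = (c + 1) N_{c+1} / N_c. A sketch (real implementations also smooth the N_c table itself, which this toy version skips):

    from collections import Counter

    def good_turing_counts(ngram_counts):
        """Re-estimate each count c as c* = (c + 1) * N_{c+1} / N_c."""
        freq_of_freq = Counter(ngram_counts.values())   # the N_c table
        adjusted = {}
        for ngram, c in ngram_counts.items():
            if freq_of_freq[c + 1] > 0:
                adjusted[ngram] = (c + 1) * freq_of_freq[c + 1] / freq_of_freq[c]
            else:
                adjusted[ngram] = c   # no N_{c+1} available: keep the raw count
        return adjusted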

SLIDE 36

Good-Turing Smoothing: Intuition

See [2], pp. 101-102.

SLIDE 37

Good-Turing Smoothing: Answer

SLIDE 38

Other Advanced Smoothing Techniques

SLIDE 39

Section 2: Web-scaled N-grams

SLIDE 40

How to Deal with Huge Web-scaled N-grams

How might one build a language model (n-gram model) that scales to very large amounts of training data?

  • Naive pruning: only store N-grams with count ≥ some threshold, and remove singletons of higher-order n-grams.
  • Entropy-based pruning

SLIDE 41

Smoothing for Web-scaled N-grams

“Standard backoff” uses variants of context-dependent backoff, where the p are pre-computed and stored probabilities and the λ are back-off weights.
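The referenced formula did not survive extraction; context-dependent backoff is standardly of roughly this form (a reconstruction, not the slide's exact equation):

    P(w_i | w_{i-k+1}^{i-1}) = p(w_i | w_{i-k+1}^{i-1})                       if w_{i-k+1}^{i} is stored
                             = λ(w_{i-k+1}^{i-1}) · P(w_i | w_{i-k+2}^{i-1})   otherwise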

SLIDE 42

Smoothing for Web-scaled N-grams

“Stupid backoff” [1] doesn't apply any discounting; instead it directly uses the relative frequencies (S is used instead of P to emphasize that these are not probabilities but scores):

    S(w_i | w_{i-k+1}^{i-1}) = count(w_{i-k+1}^{i}) / count(w_{i-k+1}^{i-1})   if count(w_{i-k+1}^{i}) > 0
                             = λ S(w_i | w_{i-k+2}^{i-1})                       otherwise

with a fixed backoff factor (λ = 0.4 in [1]) and S(w_i) = count(w_i) / N.
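A compact sketch of stupid backoff over bigram and unigram counts (the 0.4 factor follows [1]; unigrams and bigrams are the toy Counters from the earlier sketches):

    def stupid_backoff(w, prev, unigrams, bigrams, total, lam=0.4):
        """Score (not a probability): relative frequency, backing off with factor lam."""
        if bigrams[(prev, w)] > 0:
            return bigrams[(prev, w)] / unigrams[prev]
        return lam * unigrams[w] / total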

SLIDE 43

LM Tools and n-gram Resources

  • CMU Statistical Language Modeling Toolkit
    http://www.speech.cs.cmu.edu/SLM/toolkit.html
  • SRILM
    http://www.speech.sri.com/projects/srilm/
  • Google Web1T 5-gram
    http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
  • Google Books N-grams
  • Chinese Web 5-gram
    http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2010T06

SLIDE 44

Quick demo of CMU-LM

SLIDE 45

Google Books Ngrams

SLIDE 46

From Corpus-based to Google-based Linguistics

Enhancing Linguistic Search with the Google Books Ngram Viewer

SLIDE 47

From Corpus-based to Google-based Linguistics

Syntactic N-grams are coming out too!
http://commondatastorage.googleapis.com/books/syntactic-ngrams/index.html

SLIDE 48

Exercise

The Google Web 1T 5-Gram Database — SQLite Index & Web Interface

SLIDE 49

Applications

What can next-word prediction (based on probabilistic language models) do today?

source: fandywang, 2012

Example

Products: SwiftKey, XT9 by Nuance / pierre...

SLIDE 50

You’d definitely like to try this

An Automatic CS Paper Generator http://pdos.csail.mit.edu/scigen/

SLIDE 51

Collocations

  • Collocations are recurrent combinations of words.

Example

  • Simple collocations are fixed n-grams, such as "the Wall Street".
  • Collocations with predicative relations involve morpho-syntactic variation, such as the one linking make and decision: to make a decision, decisions to be made, made an important decision, etc.

SLIDE 52

Collocations

  • Statistically, collocates are events that co-occur more often than would be expected by chance.
  • Measures used to quantify the strength of word preference include Mutual Information (MI), the t-score, and the likelihood ratio.
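The MI formula on the slide was lost in extraction; pointwise mutual information for a word pair is standardly PMI(x, y) = log2( P(x, y) / (P(x) P(y)) ). A short NLTK sketch ranking bigram collocations by PMI (the corpus choice, Alice in Wonderland, is ours, echoing the earlier Alice.txt example):

    import nltk
    from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

    nltk.download("gutenberg")                     # one-time corpus download
    words = nltk.corpus.gutenberg.words("carroll-alice.txt")

    bigram_measures = BigramAssocMeasures()
    finder = BigramCollocationFinder.from_words(words)
    finder.apply_freq_filter(3)                    # ignore bigrams seen fewer than 3 times
    print(finder.nbest(bigram_measures.pmi, 10))   # top 10 bigrams by PMI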

SLIDE 53

Lab

  • ngramr (an R package) for Google Books Ngram data
  • Python NLTK [see the extra IPython notebook]

Example

  • For newbies in Python: https://www.coursera.org/course/interactivepython
  • For a quick start (develop and host Python from your browser): https://www.pythonanywhere.com/

SLIDE 54

Homework.week5

  • 80%: Exercise 4.3 in the JM book, p. 122.
  • 20%: Preview chapter 5 of [2].

SLIDE 55

Homework.week6

  • 20%: Read the Academia Sinica Balanced Corpus manual (http://app.sinica.edu.tw/kiwi/mkiwi/98-04.pdf) and preview chapter 6.
  • 80%: Build a language model of the cross-strait service trade agreement debate (data will be provided), and use it to build an automatic PRO/CON text generator.

SLIDE 56

References

[1] Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och, and Jeffrey Dean. Large language models in machine translation. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), 2007.

[2] Dan Jurafsky and James H. Martin. Speech and Language Processing. Pearson Education India, 2000.
