
SLIDE 1

Introduction to Markov Models

Estimating the probability of phrases of words, sentences, etc.

SLIDE 2

But first: A few preliminaries


SLIDE 3

What counts as a word? A tricky question….

SLIDE 4

How to find Sentences??

SLIDE 5

Q1: How to estimate the probability of a given sentence W?

 A crucial step in speech recognition (and lots of other applications)

 First guess: products of unigrams


\hat{P}(W) = \prod_{w \in W} P(w)

Given word lattice:

  form     subsidy     for
  farm     subsidies   far

Unigram counts (in 1.7 × 10^6 words of AP text):

  form      183        farm        74
  subsidy    15        subsidies   55
  for     18185        far        570

Not quite right…
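To make the unigram product concrete, here is a minimal sketch (mine, not from the slides; the counts and the 1.7 × 10^6 total are copied from the table above) that scores the two lattice paths:

```python
# A minimal sketch (not from the slides): scoring the two lattice paths with a
# unigram product model. Counts and N = 1.7e6 are taken from the table above.
N = 1.7e6  # total tokens of AP text

unigram_counts = {
    "form": 183, "subsidy": 15, "for": 18185,
    "farm": 74, "subsidies": 55, "far": 570,
}

def p_unigram_sentence(words):
    """First guess: P(W) = product over w in W of Count(w)/N."""
    p = 1.0
    for w in words:
        p *= unigram_counts[w] / N
    return p

print(p_unigram_sentence(["form", "subsidy", "for"]))
print(p_unigram_sentence(["farm", "subsidies", "far"]))
# The unigram product favors individually frequent words and ignores word
# order entirely, which is why it is "not quite right".
```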

SLIDE 6

Predicting a word sequence II

 Next guess: products of bigrams

  • For W = w1w2w3…wn,


\hat{P}(W) = \prod_{i=1}^{n} P(w_i \mid w_{i-1})

Given word lattice:

  form     subsidy     for
  farm     subsidies   far

Bigram counts (in 1.7 × 10^6 words of AP text):

  form subsidy     0        subsidy for      2
  form subsidies   0        subsidy far      0
  farm subsidy     0        subsidies for    6
  farm subsidies   4        subsidies far    0

Better (if not quite right)… (But the counts are tiny! Why?)
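A companion sketch (again illustrative, not from the slides; counts copied from the tables above) for the product-of-bigrams guess, with each factor estimated as Count(w_{i-1} w_i) / Count(w_{i-1}):

```python
# A minimal sketch: product-of-bigrams scoring using the AP counts shown above.
unigram_counts = {"form": 183, "subsidy": 15, "for": 18185,
                  "farm": 74, "subsidies": 55, "far": 570}
bigram_counts = {("subsidy", "for"): 2, ("subsidies", "for"): 6,
                 ("farm", "subsidies"): 4}   # every other pair in the table: 0

def p_bigram_sentence(words):
    """Approximate P(W) as the product of Count(prev, w) / Count(prev)."""
    p = 1.0
    for prev, w in zip(words, words[1:]):
        p *= bigram_counts.get((prev, w), 0) / unigram_counts[prev]
    return p

print(p_bigram_sentence(["farm", "subsidies", "for"]))  # > 0: both bigrams were seen
print(p_bigram_sentence(["form", "subsidy", "far"]))    # 0.0: unseen bigrams zero out the product
# Even in 1.7 million words most of these bigrams never occur, which is why
# the counts are tiny and zeros are everywhere.
```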

SLIDE 7

How can we estimate P correctly?

 Problem: Naïve Bayes model for bigrams violates independence assumptions.

Let’s do this right….

 Let W = w1w2w3…wn. Then, by the chain rule,


P(W) = P(w_1) \cdot P(w_2 \mid w_1) \cdot P(w_3 \mid w_1 w_2) \cdots P(w_n \mid w_1 w_2 \ldots w_{n-1})

 We can estimate P(w2|w1) by the Maximum Likelihood Estimator

\hat{P}(w_2 \mid w_1) = \frac{\mathrm{Count}(w_1 w_2)}{\mathrm{Count}(w_1)}

 and P(w3|w1w2) by

\hat{P}(w_3 \mid w_1 w_2) = \frac{\mathrm{Count}(w_1 w_2 w_3)}{\mathrm{Count}(w_1 w_2)}

 and so on…
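A minimal sketch of this exact chain-rule decomposition, with each factor estimated by its MLE from raw counts (the toy corpus string below is a made-up stand-in, not from the lecture):

```python
# A minimal sketch: P(W) = P(w1) * P(w2|w1) * ... * P(wn|w1...wn-1),
# with each factor estimated as Count(w1...wi) / Count(w1...wi-1).
def count(seq, corpus):
    """How many times the word tuple `seq` occurs in the token list `corpus`."""
    n = len(seq)
    return sum(tuple(corpus[i:i + n]) == tuple(seq) for i in range(len(corpus) - n + 1))

def p_chain_rule(words, corpus):
    p = count(words[:1], corpus) / len(corpus)                 # P(w1)
    for i in range(2, len(words) + 1):
        p *= count(words[:i], corpus) / count(words[:i - 1], corpus)
    return p

corpus = "heat up the oil in a large pot and heat the soup".split()  # toy data
print(p_chain_rule(["heat", "up", "the"], corpus))
# The catch: the later factors need counts of very long histories, which go to
# zero even in huge corpora; that is exactly the problem the next slides show.
```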

SLIDE 8

and finally, Estimating P(wn|w1w2…wn-1)

Again, we can estimate P(wn|w1w2…wn-1) with the MLE below. So to decide pat vs. pot in "Heat up the oil in a large p?t", compute for pot:

\hat{P}(w_n \mid w_1 w_2 \ldots w_{n-1}) = \frac{\mathrm{Count}(w_1 w_2 \ldots w_{n-1} w_n)}{\mathrm{Count}(w_1 w_2 \ldots w_{n-1})}

\frac{\mathrm{Count}(\text{"Heat up the oil in a large pot"})}{\mathrm{Count}(\text{"Heat up the oil in a large"})}

SLIDE 9

Hmm… The Web Changes Things (2008 or so)

Even the web in 2008 yields low counts!

SLIDE 10

Statistics and the Web II

So, P("pot" | "heat up the oil in a large ___") = 8/49 ≈ 0.16

SLIDE 11

But the web has grown!!!


SLIDE 12

….


165/891 ≈ 0.185

SLIDE 13

So….

 A larger corpus won't help much unless it's HUGE… but the web is!!!

But what if we only have 100 million words for our estimates??

SLIDE 14

A BOTEC Estimate of What We Can Estimate

What parameters can we estimate with 100 million words of training data??

Assuming (for now) a uniform distribution over only 5,000 words: even with 10^8 words of data, for even trigrams we encounter the sparse data problem…
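As a worked version of this back-of-the-envelope calculation (the arithmetic below is computed here from the stated assumptions of a 5,000-word vocabulary and 10^8 training words; it is not copied from the slide's table): there are 5,000^2 = 2.5 × 10^7 bigram parameters, so on average each bigram could be observed about 10^8 / (2.5 × 10^7) = 4 times; but there are 5,000^3 = 1.25 × 10^11 trigram parameters, so on average each trigram could be observed only 10^8 / (1.25 × 10^11) ≈ 0.0008 times, i.e. the vast majority of trigrams are never seen at all.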


SLIDE 15

The Markov Assumption: Only the Immediate Past Matters

SLIDE 16

The Markov Assumption: Estimation

We estimate the probability of each wi given the previous context by the approximation below, which can in turn be estimated by the bigram/unigram count ratio that follows. So we're back to counting only unigrams and bigrams!! AND we have a correct, practical estimation method for P(W) given the Markov assumption!


P(wi|w1w2…wi-1) = P(wi|wi-1)

\hat{P}(w_i \mid w_{i-1}) = \frac{\mathrm{Count}(w_{i-1} w_i)}{\mathrm{Count}(w_{i-1})}

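A minimal sketch of this recipe (illustrative only; the toy corpus is made up): count unigrams and bigrams once, then estimate each P(wi | wi-1) as Count(wi-1 wi) / Count(wi-1):

```python
# A minimal sketch: the Markov assumption in practice, estimating
# P(w_i | w_{i-1}) = Count(w_{i-1} w_i) / Count(w_{i-1}) from counts alone.
from collections import Counter

def train_bigram_model(tokens):
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    def p(word, prev):
        return bigrams[(prev, word)] / unigrams[prev]
    return p

tokens = "heat up the oil in a large pot and heat up the soup".split()  # toy corpus
p = train_bigram_model(tokens)
print(p("up", "heat"))   # Count(heat up) / Count(heat) = 2/2 = 1.0
print(p("oil", "the"))   # Count(the oil) / Count(the) = 1/2 = 0.5
```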
SLIDE 17

Markov Models

SLIDE 18

Visualizing an n-gram based language model: the Shannon/Miller/Selfridge method

 To generate a sequence of n words given unigram estimates:

  • Fix some ordering of the vocabulary v1 v2 v3 …vk.
  • For each word wi , 1 ≤ i ≤ n

    - Choose a random value ri between 0 and 1
    - wi = the first vj such that

\sum_{m=1}^{j} P(v_m) \ge r_i
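A minimal sketch of this unigram generation procedure (the tiny probability table is made up for illustration):

```python
# A minimal sketch of the Shannon/Miller/Selfridge unigram method:
# w_i = the first v_j such that sum_{m<=j} P(v_m) >= r_i.
import random

vocab_probs = [("the", 0.5), ("cat", 0.2), ("sat", 0.2), ("mat", 0.1)]  # made-up P(v)

def sample_word(probs):
    r = random.random()              # random value between 0 and 1
    cumulative = 0.0
    for v, p in probs:
        cumulative += p
        if cumulative >= r:
            return v
    return probs[-1][0]              # guard against floating-point round-off

print(" ".join(sample_word(vocab_probs) for _ in range(10)))
```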

SLIDE 19

Visualizing an n-gram based language model: the Shannon/Miller/Selfridge method

 To generate a sequence of n words given a 1st-order Markov model (i.e. conditioned on one previous word):

  • Fix some ordering of the vocabulary v1 v2 v3 …vk.
  • Use unigram method to generate an initial word w1
  • For each remaining wi , 2 ≤ i ≤ n

    - Choose a random value ri between 0 and 1
    - wi = the first vj such that

\sum_{m=1}^{j} P(v_m \mid w_{i-1}) \ge r_i

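And a companion sketch for the 1st-order Markov version (the conditional probability tables are made up): each draw is conditioned on the previously generated word:

```python
# A minimal sketch of 1st-order Markov generation:
# w_i = the first v_j such that sum_{m<=j} P(v_m | w_{i-1}) >= r_i.
import random

bigram_probs = {                     # made-up P(v | previous word)
    "the": [("cat", 0.6), ("mat", 0.4)],
    "cat": [("sat", 0.7), ("the", 0.3)],
    "sat": [("on", 1.0)],
    "on":  [("the", 1.0)],
    "mat": [("the", 1.0)],
}

def sample_next(prev):
    r = random.random()
    cumulative = 0.0
    for v, p in bigram_probs[prev]:
        cumulative += p
        if cumulative >= r:
            return v
    return bigram_probs[prev][-1][0]

words = ["the"]                      # w1 would come from the unigram method
for _ in range(9):
    words.append(sample_next(words[-1]))
print(" ".join(words))
```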
SLIDE 20

The Shannon/Miller/Selfridge method trained on Shakespeare

(This and next two slides from Jurafsky)

SLIDE 21

Wall Street Journal just isn’t Shakespeare

SLIDE 22

 Shakespeare as corpus: N = 884,647 tokens, V = 29,066 word types

 Shakespeare produced 300,000 bigram types
  • out of V² = 844 million possible bigrams.
  • So 99.96% of the possible bigrams were never seen (have zero entries in the table)

 Quadrigrams are worse: what's coming out looks like Shakespeare because it is Shakespeare
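As a quick check of these figures (arithmetic done here, not quoted from the slide): V² = 29,066² ≈ 844.8 million possible bigrams, and 300,000 / 844.8 million ≈ 0.036%, so indeed about 99.96% of the possible bigrams never occur.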

SLIDE 23

The Sparse Data Problem Again

 How likely is a 0 count? Much more likely than I let on!!!


SLIDE 24

English word frequencies well described by Zipf’s Law

 Zipf (1949) characterized the relation between word frequency and rank as

f = C / r   (for constant C),   equivalently   r \cdot f = C,   so   \log(f) = \log(C) - \log(r)

 Purely Zipfian data plots as a straight line on a log-log scale

*Rank (r): the numerical position of a word in a list sorted by decreasing frequency (f).
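A minimal sketch of how one might eyeball this (the file name is a hypothetical stand-in for any large plain-text corpus): under Zipf's law, rank × frequency stays roughly constant, and log(f) against log(r) is roughly a straight line:

```python
# A minimal sketch: check whether rank * frequency is roughly constant.
from collections import Counter

def zipf_table(tokens, top=10):
    freqs = sorted(Counter(tokens).values(), reverse=True)
    for rank, f in enumerate(freqs[:top], start=1):
        print(f"rank={rank:3d}  freq={f:8d}  rank*freq={rank * f:10d}")

# "corpus.txt" is a hypothetical plain-text file; any large corpus will do.
zipf_table(open("corpus.txt", encoding="utf-8").read().lower().split())
```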


SLIDE 25

Word frequency & rank in Brown Corpus vs Zipf


From: Interactive Mathematics, http://www.intmath.com

Lots of area under the tail of this curve!
SLIDE 26

Zipf’s law for the Brown corpus

SLIDE 27

Smoothing

"This black art is why NLP is taught in the engineering school" – Jason Eisner

SLIDE 28

Smoothing

 At least one unknown word is likely per sentence, given Zipf!!

 To fix the 0's this causes, we can smooth the data.

  • Assume we know how many types never occur in the data.
  • Steal probability mass from types that occur at least once.
  • Distribute this probability mass over the types that never occur.
SLIDE 29

Smoothing

….is like Robin Hood:

  • it steals from the rich
  • and gives to the poor
SLIDE 30

Review: Add-One Smoothing

 Estimate probabilities by assuming every possible word type v ∈ V actually occurred one extra time (as if by appending an unabridged dictionary)

 So if there were N words in our corpus, then instead of estimating


\hat{P}(w) = \frac{\mathrm{Count}(w)}{N}

we estimate

\hat{P}(w) = \frac{\mathrm{Count}(w) + 1}{N + V}
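A minimal sketch of add-one smoothing for unigrams (illustrative, not the lecture's code), where V is the size of a vocabulary that includes types never seen in training:

```python
# A minimal sketch of add-one (Laplace) smoothing:
# P(w) = (Count(w) + 1) / (N + V), so unseen types get a small nonzero share.
from collections import Counter

def add_one_unigram_model(tokens, vocabulary):
    counts = Counter(tokens)
    N, V = len(tokens), len(vocabulary)
    return {w: (counts[w] + 1) / (N + V) for w in vocabulary}

tokens = "the cat sat on the mat".split()
vocabulary = set(tokens) | {"dog", "hat"}        # includes types never seen
model = add_one_unigram_model(tokens, vocabulary)
print(model["the"])          # (2 + 1) / (6 + 7)
print(model["dog"])          # (0 + 1) / (6 + 7): mass "stolen" for unseen types
print(sum(model.values()))   # still sums to 1.0 over the vocabulary
```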

SLIDE 31

Add-One Smoothing (again)

 Pro: Very simple technique

 Cons:

  • Probability of frequent n-grams is underestimated
  • Probability of rare (or unseen) n-grams is overestimated
  • Therefore, too much probability mass is shifted towards unseen n-grams

  • All unseen n-grams are smoothed in the same way

 Using a smaller added count improves things, but only somewhat

 More advanced techniques (Kneser-Ney, Witten-Bell) use properties of the component (n-1)-grams and the like... (Hint for this homework)