
Language Modeling

Professor Marie Roch

For details on N-gram models, see chapter 4 of Jurafsky, D., and Martin, J. H. (2009). Speech and Language Processing. Pearson Prentice Hall, Upper Saddle River, NJ.

Narrowing search with a language model

  • Don’t move or I’ll …
  • Get ‘er …
  • What will she think of …
  • This enables …


Applications

  • Speech recognition
  • Handwriting recognition
  • Spelling correction
  • Augmentative communication

and more…


Constituencies

  • Groupings of words
  • I didn’t see you behind the bush.
  • She ate quickly as she was late for the meeting.
  • Movement within the sentence:

    As she was late for the meeting, she ate quickly. (the constituent moves as a unit)
    As she was late for, she ate quickly the meeting. (splitting the constituent fails)

  • Constituencies aid in prediction.


Strategies for construction

  • Formal grammar
    – Requires intimate knowledge of the language
    – Usually context free and cannot be represented by a regular language
    – We will not be covering this in detail


N-gram models

  • Suppose we wish to compute the probability of the sentence: She sells seashells down by the seashore.
  • We can think of this as a sequence of words:


 

  

$P(\text{She sells seashells down by the seashore}) = P(w_1, w_2, w_3, w_4, w_5, w_6, w_7)$

where $w_1 = \text{She},\ w_2 = \text{sells},\ \ldots,\ w_7 = \text{seashore}$.

Think further: How would you determine if a die is fair?


Estimating word probability

  • Suppose we wish to compute the probability of $w_2$ (sells in the previous example).
  • We could estimate it using a relative frequency, but this ignores what we could have learned from the first word.


$P(w_2) = \frac{\#\ \text{times}\ w_2\ \text{occurs}}{\#\ \text{times all words occur}}$

Conditional probability

By definition of conditional probability:

$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$

or, in our problem:

$P(w_2 \mid w_1) = \frac{P(w_1 \cap w_2)}{P(w_1)} = \frac{P(w_1, w_2)}{P(w_1)} \qquad \text{(defn. for words: } P(w_1, w_2) = P(w_1 \cap w_2)\text{)}$


Conditional probability

Next, consider $P(w_1, w_2)$.

Since $P(w_2 \mid w_1) = \frac{P(w_1, w_2)}{P(w_1)}$, clearly $P(w_1, w_2) = P(w_1)\,P(w_2 \mid w_1)$.

Chain rule

  • Now let us consider $P(w_1, w_2, w_3)$.
  • By applying conditional probability repeatedly, we end up with the chain rule:


$P(w_1, w_2, w_3) = P(w_3 \mid w_1, w_2)\,\underbrace{P(w_1, w_2)}_{\text{we just did this part}} = P(w_3 \mid w_1, w_2)\,P(w_2 \mid w_1)\,P(w_1)$

$P(W) = P(w_1^n) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1^2) \cdots P(w_n \mid w_1^{n-1}) = \prod_{i=1}^{n} P(w_i \mid w_1^{i-1})$
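For example, applying the chain rule to the first three words of the seashells sentence:

$P(\text{She sells seashells}) = P(\text{She})\,P(\text{sells} \mid \text{She})\,P(\text{seashells} \mid \text{She sells})$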


Sparse problem space

  • Suppose V distinct words.
  • A sequence of $i$ words has $V^i$ possible realizations.
  • Tokens $N$: the number of N-grams (including repetitions) occurring in a corpus.
  • Problem: in general, the number of unique N-grams observed is much smaller than the number of valid N-grams for the language.

“The gently rolling hills were covered with bluebonnets” had no hits on Google at the time this slide was published.

Markov assumption

  • A prediction is dependent on the current state but independent of previous conditions.
  • In our context:


Andrei Markov (1856–1922)

$P(w_n \mid w_1^{n-1}) = P(w_n \mid w_{n-1})$ by the Markov assumption,

which we at times relax to $N-1$ words of history:

$P(w_n \mid w_1^{n-1}) = P(w_n \mid w_{n-N+1}^{n-1})$
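For example, under the bigram (first-order) version of this assumption, the seashells sentence from earlier factors using only one-word histories:

$P(w_1^7) \approx P(\text{She})\,P(\text{sells} \mid \text{She})\,P(\text{seashells} \mid \text{sells}) \cdots P(\text{seashore} \mid \text{the})$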


Special N-grams

  • Unigram
    – Depends only upon the word itself: $P(w_i)$
  • Bigram
    – $P(w_i \mid w_{i-1})$
  • Trigram
    – $P(w_i \mid w_{i-1}, w_{i-2})$
  • Quadrigram
    – $P(w_i \mid w_{i-1}, w_{i-2}, w_{i-3})$

Preparing a corpus

  • Make case independent
  • Remove punctuation and add start & end of sentence markers <s> </s>
  • Other possibilities
    – part-of-speech tagging
    – lemmas: mapping of words with similar roots, e.g. sing, sang, sung → sing
    – stemming: mapping of derived words to their root, e.g. parted → part, ostriches → ostrich
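A minimal Python sketch of the first two preparation steps (case folding, punctuation removal, sentence markers); the function name and regular expression are illustrative choices, not from the slides:

```python
import re

def normalize(sentence):
    """Lower-case, strip punctuation, and add sentence markers."""
    sentence = sentence.lower()                   # make case independent
    sentence = re.sub(r"[^\w\s']", "", sentence)  # drop punctuation, keep apostrophes
    return ["<s>"] + sentence.split() + ["</s>"]  # start/end of sentence markers

print(normalize("Don't move or I'll shoot!"))
# ['<s>', "don't", 'move', 'or', "i'll", 'shoot', '</s>']
```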


An Example

<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>

$P(w_n \mid w_{n-N+1}^{n-1}) = \frac{C(w_{n-N+1}^{n-1}\, w_n)}{C(w_{n-N+1}^{n-1})}$

  • Dr. Seuss, Green Eggs and Ham, 1960.
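A short Python sketch of this maximum-likelihood estimate applied to the corpus above; the helper names are illustrative:

```python
from collections import Counter

corpus = [
    "<s> I am Sam </s>".split(),
    "<s> Sam I am </s>".split(),
    "<s> I do not like green eggs and ham </s>".split(),
]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter(pair for sent in corpus for pair in zip(sent, sent[1:]))

def p_bigram(w, prev):
    """MLE estimate P(w | prev) = C(prev w) / C(prev)."""
    return bigrams[(prev, w)] / unigrams[prev]

print(p_bigram("I", "<s>"))   # 2/3: "<s> I" occurs in 2 of the 3 sentences
print(p_bigram("Sam", "am"))  # 1/2: "am" occurs twice, "am Sam" once
```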


Berkeley Restaurant Project Sentences

  • can you tell me about any good cantonese restaurants close by
  • mid priced thai food is what i’m looking for
  • tell me about chez panisse
  • can you give me a listing of the kinds of food that are available
  • i’m looking for a good place to eat breakfast
  • when is caffe venezia open during the day

Bigram Counts from 9,222 sentences

[Table: bigram counts $C(w_{i-1}, w_i)$ for Berkeley Restaurant Project words; rows are $w_{i-1}$, columns are $w_i$; e.g. the cell for “i want”.]


Bigram Probabilities

[Table: bigram probabilities $P(w_i \mid w_{i-1})$, computed from the bigram and unigram counts.]

$P(\text{want} \mid \text{i}) = \frac{C(\text{i want})}{C(\text{i})} = \frac{827}{2533} \approx 0.33$


Bigram Estimates of Sentence Probabilities


$P(\text{<s> I want english food </s>}) = P(\text{I} \mid \text{<s>})\,P(\text{want} \mid \text{I})\,P(\text{english} \mid \text{want})\,P(\text{food} \mid \text{english})\,P(\text{</s>} \mid \text{food}) = .000\,031$
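Because a sentence probability is a product of many values less than one, longer sentences quickly underflow floating point; implementations typically sum log probabilities instead. A minimal sketch, assuming a bigram estimator like the p_bigram sketched earlier:

```python
import math

def sentence_logprob(words, p_bigram):
    """Sum log bigram probabilities to avoid numerical underflow."""
    return sum(math.log(p_bigram(w, prev))
               for prev, w in zip(words, words[1:]))

# recover the probability if needed:
# math.exp(sentence_logprob("<s> i want english food </s>".split(), p_bigram))
```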

Shakespeare: N = 884,647 tokens, V = 29,066 distinct words


How will this work on Huckleberry Finn?


The need for n-gram smoothing

  • Data for estimation is sparse.
  • On a sample text with several million words:
    – 50% of trigrams occurred only once
    – 80% of trigrams occurred less than 5 times

  • Example: When pigs fly

$P(\text{fly} \mid \text{when, pigs}) = \frac{C(\text{when pigs fly})}{C(\text{when pigs})} = \frac{0}{C(\text{when pigs})} = 0 \quad \text{if “when pigs fly” is unseen}$

Smoothing strategies

  • Suppose P(fly | when, pigs) = 0
  • Backoff strategies do the following:
    – When estimating P(Z | X, Y) where C(XYZ) > 0, don’t assign all of the probability; save some of it for the cases we haven’t seen.
    – This is called discounting and is based on Good-Turing counts.


Smoothing strategies

  • For things that have C(X, Y, Z) = 0, use P(Z | Y), but scale it by the amount of leftover probability.
  • To handle C(Y, Z) = 0, this process can be computed recursively, as in the sketch below.
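A heavily simplified Python sketch of this recursive idea; the fixed discount `alpha` is a crude stand-in for the properly normalized, Good-Turing-based leftover probability that a real backoff model (e.g. Katz) would compute per context:

```python
def p_backoff(w, history, counts, alpha=0.4):
    """Recursive backoff: use the longest history with a nonzero count,
    scaling each backoff step by the leftover mass (here a fixed alpha)."""
    if not history:  # base case: unigram relative frequency
        total = sum(c for ngram, c in counts.items() if len(ngram) == 1)
        return counts.get((w,), 0) / total
    if counts.get(history + (w,), 0) > 0:  # n-gram seen: discounted estimate
        return (1 - alpha) * counts[history + (w,)] / counts[history]
    return alpha * p_backoff(w, history[1:], counts, alpha)  # shorter history

# usage with a hypothetical count table:
# counts = {("when",): 10, ("pigs",): 5, ("when", "pigs"): 2, ...}
# p_backoff("fly", ("when", "pigs"), counts)
```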


Neural language models

  • Advantages
    – As the net learns a representation, similarities can be captured. Example: consider food
      • Possible to learn common things about foods
      • Yet the individual items can still be considered distinct
    – There are approaches to capture commonality in N-gram models (e.g. Kneser-Ney), but they lose the ability to distinguish the words.


Perplexity

  • Measure of the ability of a language model to predict the next word
  • Related to the cross entropy of the language, $H(L)$; perplexity is $2^{H(L)}$
  • Lower perplexity indicates better modeling (theoretically)


$H(L) = \lim_{n \to \infty} \frac{1}{n} H(w_1, w_2, \ldots, w_n) = \lim_{n \to \infty} -\frac{1}{n} \sum_{W \in L} P(w_1, w_2, \ldots, w_n) \log P(w_1, w_2, \ldots, w_n)$
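In practice this is estimated on a test set: perplexity is 2 raised to the average negative log2 probability per predicted word. A minimal sketch, assuming a (smoothed) bigram estimator like the p_bigram sketched earlier:

```python
import math

def perplexity(words, p_bigram):
    """2 ** (cross entropy): average per-word surprisal under the model."""
    log2_sum = sum(math.log2(p_bigram(w, prev))      # fails on zero probs,
                   for prev, w in zip(words, words[1:]))  # hence smoothing
    n_predictions = len(words) - 1
    return 2 ** (-log2_sum / n_predictions)
```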

Neural language models

    – Word embeddings can learn low-dimensional representations of words that can capture semantic information

  • Disadvantages
    – Traditional prediction uses one-hot vectors over the vocabulary:
      • High-dimensional output space
      • Computationally expensive


Consider a softmax output layer

  • Suppose
    – $V$ words in vocabulary 𝕎
    – $n_h$ units in the last hidden layer
  • ∴ softmax input to and output of each unit:
  • Total cost of the softmax layer is $O(V n_h)$:


$a_i = b_i + \sum_j W_{ij} h_j, \quad \text{where } 1 \le i \le V$

$\hat{y}_i = \frac{e^{a_i}}{\sum_{k=1}^{V} e^{a_k}}$

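A numpy sketch of this output layer with illustrative (hypothetical) sizes, making the $O(V n_h)$ matrix-vector product explicit:

```python
import numpy as np

V, n_h = 50_000, 512                # hypothetical vocabulary and hidden sizes
W = 0.01 * np.random.randn(V, n_h)  # output weights: V x n_h parameters
b = np.zeros(V)
h = np.random.randn(n_h)            # activation of the last hidden layer

a = b + W @ h                       # V dot products of length n_h: O(V * n_h)
y_hat = np.exp(a - a.max())         # subtract max for numerical stability
y_hat /= y_hat.sum()                # normalize: softmax over all V words
```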

Short List (hybrid neural/n-gram)

  • Create a subset 𝕄 of frequently used words
  • Train a neural net on 𝕄
  • Treat the tail (remainder of the vocabulary) using n-gram models: 𝕌 = 𝕎\𝕄
  • Reduces complexity, but the words we don’t model are the hard ones…


Hierarchical softmax

  • Addresses the large-$V$ problem
  • Basic idea:
    – build a binary hierarchy of word categories
    – words are assigned to classes
    – each subnet has a small softmax layer
    – the last subnet has manageable size
  • Higher perplexity than a non-hierarchical model


Importance sampling

  • Consider large V softmax layer


$\frac{\partial}{\partial \theta} \log P(y \mid C) = \frac{\partial}{\partial \theta} \log \mathrm{softmax}_y(a) = \frac{\partial}{\partial \theta} \log \frac{e^{a_y}}{\sum_i e^{a_i}} = \frac{\partial a_y}{\partial \theta} - \sum_i P(y = i \mid C)\,\frac{\partial a_i}{\partial \theta}$

The last term requires computing $P$ of every other word.


Importance sampling

  • What if we could approximate the second half of $\frac{\partial a_y}{\partial \theta} - \sum_i P(y = i \mid C)\,\frac{\partial a_i}{\partial \theta}$?
  • We could sample, but to sample we would need to know $P(y = i \mid C)$… seems like a dead end…


Importance sampling

  • Importance sampling lets us sample from a different distribution.
  • Suppose we want to estimate the expectation of a function $f$ over elements drawn from distribution $p$, e.g. $E_p[f(X)] = \sum_x p(x)\,f(x)$, but we cannot draw from $p$.



Importance sampling

  • Consider a new distribution q:
  • We can sample x based on q


$E_p[f(X)] = \sum_{x_i} p(x_i)\,f(x_i) = \sum_{x_i} q(x_i)\,\frac{p(x_i)}{q(x_i)}\,f(x_i) = E_q\!\left[\frac{p(X)}{q(X)}\,f(X)\right]$

$E_q\!\left[\frac{p(X)}{q(X)}\,f(X)\right] \approx \frac{1}{N} \sum_{x_i \sim q} \frac{p(x_i)}{q(x_i)}\,f(x_i)$
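A small numpy demonstration of this estimator on a toy discrete distribution; the particular p, q, and f are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.7, 0.2, 0.1])   # target distribution (assume we cannot sample it)
q = np.array([1/3, 1/3, 1/3])   # proposal distribution we can sample
f = np.array([1.0, 5.0, 10.0])  # function whose expectation under p we want

xs = rng.choice(3, size=100_000, p=q)   # draw from q, not p
est = np.mean(p[xs] / q[xs] * f[xs])    # reweight each sample by p/q
print(est, (p * f).sum())               # estimate vs exact value 2.7
```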

Importance sampling

  • We can use an n-gram model as $q$.
  • Now we can sample and produce a cheaper estimate of the probability.
  • See Goodfellow et al. §17.2 for more details on importance sampling.
