Language Modeling
Professor Marie Roch
For details on N-gram models, see chapter 4: Jurafsky, D., and Martin, J. H. (2009). Speech and Language Processing (Pearson Prentice Hall, Upper Saddle River, NJ).


  1. Language Modeling. Professor Marie Roch. For details on N-gram models, see chapter 4: Jurafsky, D., and Martin, J. H. (2009). Speech and Language Processing (Pearson Prentice Hall, Upper Saddle River, NJ).
     Narrowing search with a language model
     • Don’t move or I’ll …
     • Get ‘er …
     • What will she think of …
     • This enables …

  2. Applications
     • Speech recognition
     • Handwriting recognition
     • Spelling correction
     • Augmentative communication, and more…
     Constituencies
     • Groupings of words
     • I didn’t see you behind the bush.
     • She ate quickly as she was late for the meeting.
     • Movement within the sentence: a constituent can move as a unit, “As she was late for the meeting, she ate quickly,” but moving only part of it fails, “As she was late for, she ate quickly the meeting.”
     • Constituencies aid in prediction.

  3. Strategies for construction
     • Formal grammar
       – Requires intimate knowledge of the language
       – Usually context free, and so cannot be captured by a regular language
       – We will not be covering this in detail
     N-gram models
     • Suppose we wish to compute the probability of the sentence: She sells seashells down by the seashore.
     • We can think of this as a sequence of words $w_1 w_2 w_3 w_4 w_5 w_6 w_7$, so that $P(w_1^7) = P(w_1, w_2, w_3, w_4, w_5, w_6, w_7)$.
     • Think further: How would you determine if a die is fair?

  4. Estimating word probability
     • Suppose we wish to compute the probability of $w_2$ (sells in the previous example). We could estimate it with a relative frequency:
       $P(w_2) = \frac{\#\text{ times } w_2 \text{ occurs}}{\#\text{ times all words occur}}$
       but this ignores what we could have learned from the first word.
     Conditional probability
     • By definition of conditional probability, $P(A \mid B) = \frac{P(A \cap B)}{P(B)}$, or in our problem:
       $P(w_2 \mid w_1) = \frac{P(w_1 \cap w_2)}{P(w_1)} = \frac{P(w_1, w_2)}{P(w_1)}$
       where, for words, the joint $P(w_1, w_2)$ is by definition the probability of the intersection.
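A minimal sketch of both estimates on a toy corpus (the corpus and word choices below are illustrative assumptions, not from the slides):

```python
from collections import Counter

# Toy corpus, assumed for illustration only.
tokens = "she sells seashells down by the seashore she sells shells".split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

# Relative-frequency estimate P(w) = C(w) / total number of tokens.
p_sells = unigrams["sells"] / len(tokens)

# Conditional estimate P(w2 | w1) = C(w1 w2) / C(w1).
p_sells_given_she = bigrams[("she", "sells")] / unigrams["she"]

print(f"P(sells) = {p_sells:.3f}")                  # 2/10 = 0.200
print(f"P(sells | she) = {p_sells_given_she:.3f}")  # 2/2  = 1.000
```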

  5. Conditional probability (continued)
     • Next, consider $P(w_1, w_2)$. Since $P(w_2 \mid w_1) = \frac{P(w_1, w_2)}{P(w_1)}$, clearly $P(w_1, w_2) = P(w_2 \mid w_1) P(w_1)$.
     Chain rule
     • Now let us consider:
       $P(w_1, w_2, w_3) = P(w_3 \mid w_1, w_2) P(w_1, w_2) = P(w_3 \mid w_1, w_2) P(w_2 \mid w_1) P(w_1)$
       where the second step is what we just did.
     • By applying conditional probability repeatedly we end up with the chain rule:
       $P(W) = P(w_1 w_2 \cdots w_n) = P(w_1) P(w_2 \mid w_1) P(w_3 \mid w_1 w_2) \cdots P(w_n \mid w_1 w_2 \cdots w_{n-1}) = \prod_{i=1}^{n} P(w_i \mid w_1 w_2 \cdots w_{i-1})$
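A small sketch that spells out the chain-rule factorization for the slide's example sentence; it only prints the factors rather than estimating them, and is purely illustrative:

```python
def chain_rule_factors(words):
    """Return the chain-rule factors P(w1), P(w2|w1), ..., P(wn|w1..wn-1) as strings."""
    factors = []
    for i, w in enumerate(words):
        history = " ".join(words[:i])
        factors.append(f"P({w} | {history})" if history else f"P({w})")
    return factors

sentence = "she sells seashells down by the seashore".split()
print(" * ".join(chain_rule_factors(sentence)))
# P(she) * P(sells | she) * P(seashells | she sells) * ...
#   * P(seashore | she sells seashells down by the)
```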

  6. Sparse problem space
     • Suppose there are V distinct words. A sequence of i words then has $V^i$ possible word sequences.
     • Tokens, N – the number of N-grams (including repetitions) occurring in a corpus.
     • Problem: in general, the number of unique N-grams observed is far smaller than the number of valid token sequences in the language. “The gently rolling hills were covered with bluebonnets” had no hits on Google at the time this slide was published.
     Markov assumption
     • A prediction is dependent on the current state but independent of previous conditions (Andrei Markov, 1856-1922).
     • In our context: $P(w_n \mid w_1^{n-1}) = P(w_n \mid w_{n-1})$ by the Markov assumption, which at times we relax to N-1 words of history: $P(w_n \mid w_1^{n-1}) = P(w_n \mid w_{n-N+1}^{n-1})$.
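A rough sketch of why the assumption matters: truncating the history shrinks the number of contexts a model must distinguish. The vocabulary size below reuses the Shakespeare figure quoted later in these slides; everything else is an illustrative assumption.

```python
# The Markov assumption replaces the full history with only the last N-1 words.
V = 29_066                 # vocabulary size (the Shakespeare figure quoted later)
sentence_len = 7

full_history_contexts = V ** (sentence_len - 1)  # contexts a full-history model must distinguish
bigram_contexts = V ** 1
trigram_contexts = V ** 2

print(f"full history: {full_history_contexts:.2e} contexts")
print(f"bigram:       {bigram_contexts:.2e} contexts")
print(f"trigram:      {trigram_contexts:.2e} contexts")

def truncate_history(history, N):
    """Keep only the last N-1 words of the history."""
    return tuple(history[-(N - 1):]) if N > 1 else ()

history = "she sells seashells down by the".split()
print(truncate_history(history, 2))  # ('the',)        bigram context
print(truncate_history(history, 3))  # ('by', 'the')   trigram context
```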

  7. Special N-grams
     • Unigram – depends only upon the word itself: $P(w_i)$
     • Bigram – $P(w_i \mid w_{i-1})$
     • Trigram – $P(w_i \mid w_{i-1}, w_{i-2})$
     • Quadrigram – $P(w_i \mid w_{i-1}, w_{i-2}, w_{i-3})$
     Preparing a corpus
     • Make case independent
     • Remove punctuation and add start & end of sentence markers <s> </s>
     • Other possibilities
       – part of speech tagging
       – lemmas: mapping of words with similar roots, e.g. sing, sang, sung → sing
       – stemming: mapping of derived words to their root, e.g. parted → part, ostriches → ostrich
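A minimal corpus-preparation sketch covering just the first two steps (lowercasing, punctuation removal, boundary markers); tagging, lemmatization, and stemming would need additional tooling:

```python
import re

def prepare(sentence):
    """Lowercase, strip punctuation, and add sentence boundary markers."""
    sentence = sentence.lower()
    sentence = re.sub(r"[^\w\s]", "", sentence)  # remove punctuation
    return ["<s>"] + sentence.split() + ["</s>"]

print(prepare("I am Sam."))
# ['<s>', 'i', 'am', 'sam', '</s>']
```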

  8. An example
     <s> I am Sam </s>
     <s> Sam I am </s>
     <s> I do not like green eggs and ham </s>
     Dr. Seuss, Green Eggs and Ham, 1960.
     The maximum likelihood estimate of an N-gram probability is
     $P(w_n \mid w_{n-N+1}^{n-1}) = \frac{C(w_{n-N+1}^{n-1} w_n)}{C(w_{n-N+1}^{n-1})}$
     (a short computation of these estimates on the corpus above follows this slide).
     Berkeley Restaurant Project sentences
     • can you tell me about any good cantonese restaurants close by
     • mid priced thai food is what i ’ m looking for
     • tell me about chez panisse
     • can you give me a listing of the kinds of food that are available
     • i ’ m looking for a good place to eat breakfast
     • when is caffe venezia open during the day
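A sketch of the bigram maximum likelihood estimate on the Green Eggs and Ham corpus (lowercased, following the corpus-preparation step); the printed values follow directly from the counts:

```python
from collections import Counter

corpus = [
    "<s> i am sam </s>",
    "<s> sam i am </s>",
    "<s> i do not like green eggs and ham </s>",
]
tokens = [sent.split() for sent in corpus]

unigrams = Counter(w for sent in tokens for w in sent)
bigrams = Counter(pair for sent in tokens for pair in zip(sent, sent[1:]))

def p_bigram(w, prev):
    """Maximum-likelihood bigram estimate P(w | prev) = C(prev w) / C(prev)."""
    return bigrams[(prev, w)] / unigrams[prev]

print(p_bigram("i", "<s>"))    # 2/3 ≈ 0.667
print(p_bigram("am", "i"))     # 2/3 ≈ 0.667
print(p_bigram("sam", "am"))   # 1/2 = 0.500
```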

  9. Bigram counts from 9,222 sentences
     [Tables of bigram counts $C(w_{i-1}, w_i)$ and unigram counts not reproduced; e.g. the count of “i want” is 827 and C(i) = 2533.]
     Bigram probabilities
     • Dividing each bigram count by the unigram count of its first word gives the probability estimate:
       $P(\text{want} \mid \text{i}) = \frac{C(\text{i want})}{C(\text{i})} = \frac{827}{2533} \approx 0.33$
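The normalization step on the one row quoted above (the remaining entries of the count tables are not reproduced here):

```python
# Convert the quoted counts into a probability estimate.
c_i = 2533        # unigram count C(i)
c_i_want = 827    # bigram count C(i, want)
print(f"P(want | i) = {c_i_want / c_i:.2f}")   # 0.33
```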

  10. Bigram estimates of sentence probabilities (a log-space sketch follows this slide)
      $P(\text{<s> I want english food </s>}) = P(\text{I} \mid \text{<s>}) \, P(\text{want} \mid \text{I}) \, P(\text{english} \mid \text{want}) \, P(\text{food} \mid \text{english}) \, P(\text{</s>} \mid \text{food}) \approx 0.000031$
      Shakespeare as a corpus: N = 884,647 tokens, V = 29,066 word types. How will this work on Huckleberry Finn?
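A sketch of the product, accumulated in log space to avoid underflow on longer sentences. The individual factor values are taken from the Jurafsky and Martin restaurant-corpus example, since the slide quotes only the final product; treat them as assumed here.

```python
import math

# Bigram probabilities for each factor; these particular values come from the
# Jurafsky & Martin restaurant-corpus example (the slide quotes only the product).
factors = {
    "P(I | <s>)":        0.25,
    "P(want | I)":       0.33,
    "P(english | want)": 0.0011,
    "P(food | english)": 0.50,
    "P(</s> | food)":    0.68,
}

# Multiply in log space to avoid underflow on longer sentences.
log_p = sum(math.log(p) for p in factors.values())
print(f"P(<s> I want english food </s>) ≈ {math.exp(log_p):.6f}")   # ≈ 0.000031
```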

  11. The need for N-gram smoothing
      • Data for estimation is sparse.
      • On a sample text with several million words:
        – 50% of trigrams occurred only once
        – 80% of trigrams occurred fewer than 5 times
      • Example: when pigs fly
        $P(\text{fly} \mid \text{when}, \text{pigs}) = \frac{C(\text{when}, \text{pigs}, \text{fly})}{C(\text{when}, \text{pigs})} = \frac{0}{C(\text{when}, \text{pigs})}$ if “when pigs fly” is unseen.
      Smoothing strategies
      • Suppose P(fly | when, pigs) = 0.
      • Backoff strategies do the following:
        – When estimating P(Z | X, Y) where C(X, Y, Z) > 0, don’t assign all of the probability; save some of it for the cases we haven’t seen. This is called discounting and is based on Good-Turing counts.
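A simplified backoff sketch. It uses a fixed backoff weight in the style of "stupid backoff" rather than the Good-Turing discounting the slide describes, so the weight and helper names are assumptions for illustration only:

```python
from collections import Counter

def backoff_prob(w, context, trigrams, bigrams, unigrams, alpha=0.4):
    """Back off from trigram to bigram to unigram. A fixed weight alpha stands
    in for the leftover probability mass; real schemes (Katz backoff with
    Good-Turing discounting) compute that mass from the counts."""
    x, y = context
    if trigrams[(x, y, w)] > 0:
        return trigrams[(x, y, w)] / bigrams[(x, y)]
    if bigrams[(y, w)] > 0:
        return alpha * bigrams[(y, w)] / unigrams[y]
    return alpha * alpha * unigrams[w] / sum(unigrams.values())

tokens = "when pigs sleep when pigs eat pigs fly south".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))

# "when pigs fly" is unseen, so the estimate backs off to P(fly | pigs).
print(backoff_prob("fly", ("when", "pigs"), trigrams, bigrams, unigrams))  # 0.4 * 1/3
```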

  12. Smoothing strategies (continued)
      • For cases where C(X, Y, Z) = 0, use P(Z | Y), but scale it by the amount of leftover probability.
      • To handle C(Y, Z) = 0, this process can be applied recursively.
      Neural language models
      • Advantages
        – Because the network learns a representation, similarities between words can be captured. Example: consider food.
          • It is possible to learn common things about foods,
          • yet the individual items can still be considered distinct.
        – There are approaches that capture commonality in N-gram models (e.g. Kneser-Ney), but they lose the ability to distinguish the words.

  13. Perplexity (a sketch of the computation follows this slide)
      • A measure of the ability of a language model to predict the next word.
      • Perplexity is related to the cross entropy of the language, H(L): perplexity is $2^{H(L)}$, where
        $H(L) = \lim_{n \to \infty} \frac{1}{n} H(w_1, w_2, \ldots, w_n) = -\lim_{n \to \infty} \frac{1}{n} \sum_{W \in L} P(w_1, \ldots, w_n) \log P(w_1, \ldots, w_n)$
      • Lower perplexity indicates better modeling (theoretically).
      Neural language models (continued)
      • Advantages (continued)
        – Word embeddings can learn low-dimensional representations of words that capture semantic information.
      • Disadvantages
        – Traditional prediction uses one-hot vectors over the vocabulary:
          • high-dimensional output space
          • computationally expensive
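In practice the limit is approximated on a held-out test set: perplexity is 2 raised to the average negative log2 probability per token. A minimal sketch with assumed, illustrative numbers:

```python
def perplexity(total_log2_prob, n_tokens):
    """Perplexity = 2^H, where H is the average negative log2 probability per
    token on a held-out set (a finite-sample stand-in for H(L))."""
    H = -total_log2_prob / n_tokens
    return 2 ** H

# Illustrative: a 12-token test set assigned a total log2 probability of -60 bits.
print(perplexity(-60.0, 12))   # 2^5 = 32.0
```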

  14. Consider a softmax output layer (a sketch of the cost follows this slide)
      • Suppose V words in vocabulary 𝕎 and $n_h$ units in the last hidden layer.
      • The softmax input and output per unit are
        $a_i = b_i + W_{i,:} h$ and $\hat{y}_i = \frac{e^{a_i}}{\sum_{j=1}^{V} e^{a_j}}$ for $1 \le i \le V$.
      • The total cost of the softmax layer is $O(V n_h)$, which is expensive because both V and $n_h$ are large.
      Short list (hybrid neural/n-gram)
      • Create a subset 𝕄 of frequently used words.
      • Train a neural net on 𝕄.
      • Treat the tail (the remainder of the vocabulary), 𝕌 = 𝕎 \ 𝕄, using n-gram models.
      • Reduces complexity, but the words we don’t model are the hard ones…
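A sketch of that output layer in NumPy; the sizes are illustrative assumptions, but the V-by-n_h matrix-vector product is where the O(V n_h) cost shows up:

```python
import numpy as np

rng = np.random.default_rng(0)

V, n_h = 10_000, 512           # illustrative sizes; real vocabularies are often larger
W = rng.normal(size=(V, n_h))  # output weight matrix
b = np.zeros(V)
h = rng.normal(size=n_h)       # last hidden layer activation

a = W @ h + b                  # O(V * n_h) multiply-adds: the dominant cost
a -= a.max()                   # stabilize before exponentiating
y_hat = np.exp(a) / np.exp(a).sum()   # softmax over the whole vocabulary

print(y_hat.shape, y_hat.sum())       # (10000,) and a sum of ~1.0
```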

  15. Hierarchical softmax
      • Addresses the large-V problem.
      • Basic idea:
        – build a binary hierarchy of word categories
        – words are assigned to classes
        – each subnet has a small softmax layer
        – the last subnet has a manageable size
      • Higher perplexity than the non-hierarchical model.
      Importance sampling
      • Consider the gradient of the log-softmax for a large-V softmax layer:
        $\frac{\partial \log P(y \mid C)}{\partial \theta} = \frac{\partial}{\partial \theta} \log \operatorname{softmax}(a)_y = \frac{\partial}{\partial \theta} \left( a_y - \log \sum_i e^{a_i} \right) = \frac{\partial a_y}{\partial \theta} - \sum_i P(y = i \mid C) \frac{\partial a_i}{\partial \theta}$
        The second term requires computing $P(y = i \mid C)$ for every other word.
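A small numerical check of that gradient with respect to the activations a (the dependence on θ enters through ∂a/∂θ by the chain rule); the indicator-minus-probability form makes clear why the full distribution P(y = i | C) is needed. The sizes below are assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
V = 8                              # tiny vocabulary for the check
a = rng.normal(size=V)             # pre-softmax activations
y = 3                              # index of the observed word

p = np.exp(a - a.max())
p /= p.sum()                       # P(y = i | C) for every word i

# d log softmax(a)_y / d a_i = 1{i == y} - P(y = i | C):
# the first term is cheap, the second needs the full distribution.
grad = -p
grad[y] += 1.0

# Finite-difference check of one coordinate.
def log_softmax_y(vec):
    return vec[y] - (vec.max() + np.log(np.exp(vec - vec.max()).sum()))

eps = 1e-6
a_plus, a_minus = a.copy(), a.copy()
a_plus[0] += eps
a_minus[0] -= eps
numeric = (log_softmax_y(a_plus) - log_softmax_y(a_minus)) / (2 * eps)
print(grad[0], numeric)            # the two values agree closely
```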

  16. Importance sampling (continued)
      • What if we could approximate the second half of $\frac{\partial a_y}{\partial \theta} - \sum_i P(y = i \mid C) \frac{\partial a_i}{\partial \theta}$?
      • We could sample, but to sample we would need to know $P(y = i \mid C)$… seems like a dead end…
      • Importance sampling lets us sample from a different distribution.
      • Suppose we want to estimate the expectation of a function f over elements drawn from distribution p, e.g. $E_p[f(X)] = \sum_i p(x_i) f(x_i)$, but we cannot draw from p.

  17. Importance sampling (continued)
      • Consider a new distribution q:
        $E_p[f(X)] = \sum_i q(x_i) \frac{p(x_i)}{q(x_i)} f(x_i)$
      • We can sample x based on q:
        $\hat{E}_p[f(X)] = \frac{1}{N} \sum_{x_i \sim q} \frac{p(x_i)}{q(x_i)} f(x_i)$
      • We can use an n-gram model as q.
      • Now we can sample and produce a cheaper estimate of the probability.
      • See Goodfellow et al., section 17.2, for more details on importance sampling.
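A minimal sketch of the estimator on a three-element toy distribution (all numbers are assumed for illustration): samples drawn from q and reweighted by p/q converge to the expectation under p.

```python
import numpy as np

rng = np.random.default_rng(2)

# Target distribution p (which we pretend we cannot sample from directly),
# proposal distribution q, and a function f whose expectation under p we want.
p = np.array([0.70, 0.20, 0.10])
q = np.array([1/3, 1/3, 1/3])          # a simpler stand-in, e.g. an n-gram model
f = np.array([1.0, 5.0, 10.0])

exact = np.sum(p * f)                  # E_p[f(X)] = 0.7*1 + 0.2*5 + 0.1*10 = 2.7

# Importance-sampled estimate: draw from q, weight each sample by p(x)/q(x).
N = 100_000
samples = rng.choice(3, size=N, p=q)
estimate = np.mean((p[samples] / q[samples]) * f[samples])

print(exact, estimate)                 # estimate ≈ 2.7
```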
