  1. Count-based Language Modeling CMSC 473/673 UMBC Some slides adapted from 3SLP, Jason Eisner

  2. Outline Defining Language Models Breaking & Fixing Language Models Evaluating Language Models

  3. Goal of Language Modeling: p_θ([…text…]). Learn a probabilistic model of text. Accomplished through observing text and updating model parameters to make the observed text more likely.

  4. Goal of Language Modeling: p_θ([…text…]). Learn a probabilistic model of text, i.e., a proper distribution: 0 ≤ p_θ([…text…]) ≤ 1, and Σ_{t : t is valid text} p_θ(t) = 1. Accomplished through observing text and updating model parameters to make the observed text more likely.
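To make these two constraints concrete, here is a minimal Python sketch (not from the slides; the texts and probabilities are invented for illustration) of a toy model that satisfies them:

    # Toy "language model": an explicit table of probabilities over valid texts.
    # A real p_theta would be learned from data; these values are made up.
    p_theta = {
        "the film was great": 0.4,
        "the film was bad": 0.3,
        "colorless green ideas sleep furiously": 0.3,
    }

    # Constraint 1: every probability lies in [0, 1].
    assert all(0.0 <= p <= 1.0 for p in p_theta.values())

    # Constraint 2: the probabilities of all valid texts sum to 1.
    assert abs(sum(p_theta.values()) - 1.0) < 1e-12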

  5. Design Question 1: What Part of Language Do We Estimate? p_θ([…text…]): is […text…] a full document? A sequence of sentences? A sequence of words? A sequence of characters? A: It's task-dependent!

  6. Design Question 2: How do we estimate robustly? p_θ([…typo-text…]): what if […text…] has a typo?

  7. Design Question 3: How do we generalize? p_θ([…synonymous-text…]): what if […text…] has a word (or character, or …) we've never seen before?

  8. “The Unreasonable Effectiveness of Recurrent Neural Networks” http://karpathy.github.io/2015/05/21/rnn-effectiveness/

  9. “The Unreasonable Effectiveness of Recurrent Neural Networks” http://karpathy.github.io/2015/05/21/rnn-effectiveness/ “The Unreasonable Effectiveness of Character-level Language Models” (and why RNNs are still cool) http://nbviewer.jupyter.org/gist/yoavg/d76121dfde2618422139

  10. Simple Count-Based: p(item)

  11. Simple Count-Based: p(item) ∝ count(item), where “∝” means “proportional to”

  12. Simple Count-Based: p(item) ∝ count(item), i.e., p(item) = count(item) / Σ_{all items y} count(y)

  13. Simple Count-Based: p(item) = count(item) / Σ_{all items y} count(y); the denominator is a constant (the same for every item).
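A minimal sketch of this count-and-normalize recipe (the function name count_based_probs is mine, not the slides'):

    from collections import Counter

    def count_based_probs(items):
        """p(item) = count(item) / sum of all counts; the denominator is constant."""
        counts = Counter(items)
        total = sum(counts.values())  # the constant normalizer
        return {item: c / total for item, c in counts.items()}

    # count_based_probs("a b a c".split()) -> {'a': 0.5, 'b': 0.25, 'c': 0.25}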

  14. In Simple Count-Based Models, What Do We Count? p(item) ∝ count(item). A sequence of characters → pseudo-words; a sequence of words → pseudo-phrases.

  15. Shakespearian Sequences of Characters

  16. Shakespearian Sequences of Words

  17. Novel Words, Novel Sentences: “Colorless green ideas sleep furiously” – Chomsky (1957). Let’s observe and record all sentences with our big, bad supercomputer. Red ideas? Read ideas?

  18. Probability Chain Rule: p(w_1, w_2, …, w_T) = p(w_1) p(w_2 | w_1) p(w_3 | w_1, w_2) ⋯ p(w_T | w_1, …, w_{T-1}) = ∏_{t=1}^{T} p(w_t | w_1, …, w_{t-1})

  19. N-Grams Maintaining an entire inventory over sentences could be too much to ask Store “smaller” pieces? p(Colorless green ideas sleep furiously)

  20. N-Grams Maintaining an entire joint inventory over sentences could be too much to ask Store “smaller” pieces? p(Colorless green ideas sleep furiously) = p(Colorless) *

  21. N-Grams Maintaining an entire joint inventory over sentences could be too much to ask Store “smaller” pieces? p(Colorless green ideas sleep furiously) = p(Colorless) * p(green | Colorless) *

  22. N-Grams Maintaining an entire joint inventory over sentences could be too much to ask Store “smaller” pieces? p(Colorless green ideas sleep furiously) = p(Colorless) * p(green | Colorless) * p(ideas | Colorless green) * p(sleep | Colorless green ideas) * p(furiously | Colorless green ideas sleep)

  23. N-Grams Maintaining an entire joint inventory over sentences could be too much to ask Store “smaller” pieces? Apply the chain rule: p(Colorless green ideas sleep furiously) = p(Colorless) * p(green | Colorless) * p(ideas | Colorless green) * p(sleep | Colorless green ideas) * p(furiously | Colorless green ideas sleep)

  25. N-Grams p(furiously | Colorless green ideas sleep) How much does “Colorless” influence the choice of “furiously?”

  26. N-Grams p(furiously | Colorless green ideas sleep) How much does “Colorless” influence the choice of “furiously?” Remove history and contextual info

  28. N-Grams p(furiously | Colorless green ideas sleep) How much does “Colorless” influence the choice of “furiously?” Remove history and contextual info p(furiously | Colorless green ideas sleep) ≈ p(furiously | ideas sleep)
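In code, this approximation is just a truncation of the conditioning history to the most recent n-1 words; a sketch (helper name is mine):

    def truncate_history(history, n):
        """Keep only the last n-1 words of the conditioning context."""
        return history[-(n - 1):] if n > 1 else []

    # Trigram (n=3) view of p(furiously | Colorless green ideas sleep):
    # truncate_history(["Colorless", "green", "ideas", "sleep"], 3) -> ["ideas", "sleep"]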

  29. N-Grams p(Colorless green ideas sleep furiously) = p(Colorless) * p(green | Colorless) * p(ideas | Colorless green) * p(sleep | Colorless green ideas) * p(furiously | Colorless green ideas sleep)

  31. Trigrams p(Colorless green ideas sleep furiously) = p(Colorless) * p(green | Colorless) * p(ideas | Colorless green) * p(sleep | green ideas) * p(furiously | ideas sleep)

  33. Trigrams p(Colorless green ideas sleep furiously) = p(Colorless | <BOS> <BOS>) * p(green | <BOS> Colorless) * p(ideas | Colorless green) * p(sleep | green ideas) * p(furiously | ideas sleep). Consistent notation: pad the left with <BOS> (beginning of sentence) symbols.

  34. Trigrams p(Colorless green ideas sleep furiously) = p(Colorless | <BOS> <BOS>) * p(green | <BOS> Colorless) * p(ideas | Colorless green) * p(sleep | green ideas) * p(furiously | ideas sleep) * p(<EOS> | sleep furiously). Consistent notation: pad the left with <BOS> symbols. Fully proper distribution: pad the right with a single <EOS> (end of sentence) symbol.
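A sketch of this padding convention (function name is mine):

    def pad_sentence(words, n):
        """Pad with n-1 <BOS> symbols on the left and one <EOS> on the right."""
        return ["<BOS>"] * (n - 1) + words + ["<EOS>"]

    # pad_sentence("Colorless green ideas sleep furiously".split(), 3)
    # -> ['<BOS>', '<BOS>', 'Colorless', 'green', 'ideas', 'sleep', 'furiously', '<EOS>']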

  35. N-Gram Terminology

      n   Commonly called    History size (Markov order)   Example
      1   unigram            0                             p(furiously)
      2   bigram             1                             p(furiously | sleep)
      3   trigram (3-gram)   2                             p(furiously | ideas sleep)
      4   4-gram             3                             p(furiously | green ideas sleep)
      n   n-gram             n-1                           p(w_i | w_{i-n+1} … w_{i-1})
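Putting the terminology and the padding together, a sketch (names are mine) that enumerates the n-grams of a sentence:

    def ngrams(words, n):
        """All length-n windows of the padded sentence, as tuples."""
        padded = ["<BOS>"] * (n - 1) + words + ["<EOS>"]
        return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]

    # ngrams("ideas sleep furiously".split(), 2)
    # -> [('<BOS>', 'ideas'), ('ideas', 'sleep'), ('sleep', 'furiously'), ('furiously', '<EOS>')]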

  39. N-Gram Probability: p(w_1, w_2, w_3, ⋯, w_T) = ∏_{t=1}^{T} p(w_t | w_{t-N+1}, ⋯, w_{t-1})
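Given a table of conditional probabilities, this product is computed term by term; a sketch, assuming a dict cond_prob mapping (history tuple, word) pairs to probabilities:

    import math

    def sentence_log_prob(words, cond_prob, n):
        """Sum log p(w_t | w_{t-N+1}, ..., w_{t-1}) over the padded sentence."""
        padded = ["<BOS>"] * (n - 1) + words + ["<EOS>"]
        logp = 0.0
        for t in range(n - 1, len(padded)):
            history = tuple(padded[t - n + 1:t])
            logp += math.log(cond_prob[(history, padded[t])])
        return logp  # log space avoids underflow from multiplying many small terms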

  40. Count-Based N-Grams (Unigrams): p(item) ∝ count(item)

  41. Count-Based N-Grams (Unigrams): p(z) ∝ count(z)

  42. Count-Based N-Grams (Unigrams): p(z) ∝ count(z), i.e., p(z) = count(z) / Σ_v count(v)

  43. Count-Based N-Grams (Unigrams): p(z) = count(z) / Σ_v count(v), where z and v are word types (the sum runs over all word types v).

  44. Count-Based N-Grams (Unigrams): p(z) = count(z) / W, where W is the number of tokens observed (so Σ_v count(v) = W).
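A minimal sketch of this unigram estimator (function name is mine):

    from collections import Counter

    def unigram_probs(tokens):
        """p(z) = count(z) / W, with W the number of observed tokens."""
        counts = Counter(tokens)
        W = len(tokens)  # equivalently, sum(counts.values())
        return {z: c / W for z, c in counts.items()}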

  45. Count-Based N-Grams (Unigrams): The film got a great opening and the film went on to become a hit . (16 tokens, so the normalization constant is 16)

      Word (type) z   Raw count count(z)   Probability p(z)
      The             1                    1/16
      film            2                    1/8
      got             1                    1/16
      a               2                    1/8
      great           1                    1/16
      opening         1                    1/16
      and             1                    1/16
      the             1                    1/16
      went            1                    1/16
      on              1                    1/16
      to              1                    1/16
      become          1                    1/16
      hit             1                    1/16
      .               1                    1/16
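Running the unigram_probs sketch from above on this sentence reproduces the table:

    tokens = "The film got a great opening and the film went on to become a hit .".split()
    probs = unigram_probs(tokens)   # 16 tokens, so the normalizer is 16
    assert probs["film"] == 2 / 16  # = 1/8
    assert probs["The"] == 1 / 16   # note: "The" and "the" are distinct word types here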

  48. Count-Based N-Grams (Trigrams): p(z | x, y) ∝ count(x, y, z), the count of the sequence of items “x y z”. Order matters, both in the conditioning and in the count.

  49. Count-Based N-Grams (Trigrams): p(z | x, y) ∝ count(x, y, z). Order matters in the count: count(x, y, z) ≠ count(x, z, y) ≠ count(y, x, z) ≠ …
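A sketch of the trigram estimator (names are mine; it assumes the token stream is padded with <EOS> so every (x, y) pair has a successor):

    from collections import Counter

    def trigram_probs(tokens):
        """p(z | x, y) = count(x, y, z) / count(x, y); tuple order matters."""
        tri = Counter(zip(tokens, tokens[1:], tokens[2:]))  # counts of ordered (x, y, z)
        bi = Counter(zip(tokens, tokens[1:]))               # counts of ordered (x, y)
        return {(x, y, z): c / bi[(x, y)] for (x, y, z), c in tri.items()}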
