  1. CS 4650/7650: Natural Language Processing. Language Modeling. Diyi Yang. Some slides borrowed from Yulia Tsvetkov at CMU and Kai-Wei Chang at UCLA.

  2. Logistics
      • HW 1 due
      • HW 2 out: Feb 3rd, 2020, 3:00pm

  3. Piazza & Office Hours
      • ~11 min response time

  4. Review
      • L2: Text classification
      • L3: Neural networks for text classification

  5. This Lecture
      • Language models
      • What are N-gram models
      • How to use probabilities

  6. This Lecture
      • What is the probability of “I like Georgia Tech at Atlanta”?
      • What is the probability of “like I Atlanta at Georgia Tech”?

  7. Language Models Play the Role of …
      • A judge of grammaticality
      • A judge of semantic plausibility
      • An enforcer of stylistic consistency
      • A repository of knowledge (?)

  8. The Language Modeling Problem
      • Assign a probability to every sentence (or any string of words)
      • Finite vocabulary (e.g., words or characters): {the, a, telescope, …}
      • Infinite set of sequences:
      • A telescope STOP
      • A STOP
      • The the the STOP
      • I saw a woman with a telescope STOP
      • STOP
      • …

  9. Example
      • P(disseminating so much currency STOP) ≪ P(spending so much currency STOP): both probabilities are tiny powers of ten, but the more natural phrasing receives the larger one.

  10. What Is A Language Model?
      • Probability distributions over sentences (i.e., word sequences): P(W) = P(w_1 w_2 w_3 w_4 … w_n)
      • Can use them to generate strings: P(w_n | w_1 w_2 … w_{n-1})
      • Rank possible sentences:
      • P(“Today is Tuesday”) > P(“Tuesday Today is”)
      • P(“Today is Tuesday”) > P(“Today is Atlanta”)

  11. Language Model Applications
      • Machine translation: p(strong winds) > p(large winds)
      • Spell correction: “The office is about 15 minutes from my house” → p(15 minutes from my house) > p(15 minuets from my house)
      • Speech recognition: p(I saw a van) >> p(eyes awe of an)
      • Summarization, question answering, handwriting recognition, etc.

  12. Language Model Applications

  13. Language Model Applications: Language generation (https://pdos.csail.mit.edu/archive/scigen/)

  14. Bag-of-Words with N-grams
      • N-grams: a contiguous sequence of n tokens from a given piece of text (http://recognize-speech.com/language-model/n-gram-model/comparison)
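
As a minimal sketch of this definition (the function name and example sentence are illustrative, not from the slides), n-grams can be extracted with a sliding window over the token list:

```python
def ngrams(tokens, n):
    """Return all contiguous n-token sequences (n-grams) in `tokens`."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I like Georgia Tech at Atlanta".split()
print(ngrams(tokens, 2))  # bigrams:  ('I', 'like'), ('like', 'Georgia'), ...
print(ngrams(tokens, 3))  # trigrams: ('I', 'like', 'Georgia'), ...
```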

  15. N-gram Models
      • Unigram model: P(w_1) P(w_2) P(w_3) … P(w_n)
      • Bigram model: P(w_1) P(w_2 | w_1) P(w_3 | w_2) … P(w_n | w_{n-1})
      • Trigram model: P(w_1) P(w_2 | w_1) P(w_3 | w_2, w_1) … P(w_n | w_{n-1}, w_{n-2})
      • N-gram model: P(w_1) P(w_2 | w_1) … P(w_n | w_{n-1}, w_{n-2}, …, w_{n-N+1})
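
A sketch of the bigram factorization in code; the probability table is hypothetical (made-up numbers), just to make the product, and the ranking idea from slide 10, concrete:

```python
# Hypothetical bigram probabilities P(w_i | w_{i-1}); '*' marks the sentence start.
bigram_p = {
    ("*", "today"): 0.1, ("today", "is"): 0.5,
    ("is", "tuesday"): 0.05, ("is", "atlanta"): 0.001,
}

def bigram_prob(tokens):
    """P(w_1) P(w_2|w_1) ... P(w_n|w_{n-1}), with a start symbol '*'."""
    p, prev = 1.0, "*"
    for w in tokens:
        p *= bigram_p.get((prev, w), 0.0)  # unseen bigrams get probability 0 here
        prev = w
    return p

print(bigram_prob(["today", "is", "tuesday"]))  # 0.0025
print(bigram_prob(["today", "is", "atlanta"]))  # 5e-05 -- ranked lower
```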

  16. The Language Modeling Problem
      • Assign a probability to every sentence (or any string of words)
      • Finite vocabulary (e.g., words or characters)
      • Infinite set of sequences
      • Σ_{x ∈ Σ*} p(x) = 1, and p(x) ≥ 0 for all x ∈ Σ*

  17. A Trivial Model
      • Assume we have N training sentences
      • Let x_1, x_2, …, x_n be a sentence, and c(x_1, x_2, …, x_n) be the number of times it appeared in the training data
      • Define a language model: p(x_1, x_2, …, x_n) = c(x_1, x_2, …, x_n) / N
      • No generalization!
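
A minimal sketch of this trivial model on a made-up three-sentence corpus; it makes the lack of generalization visible, since any sentence not seen verbatim gets probability zero:

```python
from collections import Counter

train = [
    "a telescope STOP",
    "a telescope STOP",
    "I saw a woman with a telescope STOP",
]
counts, N = Counter(train), len(train)

def p(sentence):
    # p(x_1, ..., x_n) = c(x_1, ..., x_n) / N
    return counts[sentence] / N

print(p("a telescope STOP"))                # 2/3
print(p("the woman saw a telescope STOP"))  # 0.0 -- unseen, so no generalization
```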

  18. Markov Processes
      • Given a sequence of n random variables X_1, X_2, …, X_n (e.g., n = 100), with X_i ∈ V
      • We want a sequence probability model P(X_1 = x_1, X_2 = x_2, …, X_n = x_n)

  19. Markov Processes
      • Given a sequence of n random variables X_1, X_2, …, X_n, with X_i ∈ V
      • We want a sequence probability model P(X_1 = x_1, X_2 = x_2, …, X_n = x_n)
      • There are |V|^n possible sequences

  20. First-order Markov Processes
      • Chain rule: P(X_1 = x_1, X_2 = x_2, …, X_n = x_n) = P(X_1 = x_1) ∏_{i=2}^{n} P(X_i = x_i | X_1 = x_1, …, X_{i-1} = x_{i-1})

  21. First-order Markov Processes
      • Chain rule: P(X_1 = x_1, …, X_n = x_n) = P(X_1 = x_1) ∏_{i=2}^{n} P(X_i = x_i | X_1 = x_1, …, X_{i-1} = x_{i-1})
      • Markov assumption: = P(X_1 = x_1) ∏_{i=2}^{n} P(X_i = x_i | X_{i-1} = x_{i-1})

  23. First-order Markov Processes

  24. Second-order Markov Processes
      • P(X_1 = x_1, X_2 = x_2, …, X_n = x_n) = P(X_1 = x_1) × P(X_2 = x_2 | X_1 = x_1) × ∏_{i=3}^{n} P(X_i = x_i | X_{i-2} = x_{i-2}, X_{i-1} = x_{i-1})
      • Simplify notation: x_0 = x_{-1} = *

  25. Details: Variable Length
      • We want a probability distribution over sequences of any length

  26. Details: Variable Length
      • Always define x_n = STOP, where STOP is a special symbol
      • Then use a Markov process as before: P(X_1 = x_1, X_2 = x_2, …, X_n = x_n) = ∏_{i=1}^{n} P(X_i = x_i | X_{i-2} = x_{i-2}, X_{i-1} = x_{i-1})
      • We now have a probability distribution over all sequences
      • Intuition: at every step there is some probability of generating STOP (and stopping) and the remaining probability of continuing

  27. The Process of Generating Sentences
      Step 1: Initialize i = 1 and x_0 = x_{-1} = *
      Step 2: Generate x_i from the distribution P(X_i = x_i | X_{i-2} = x_{i-2}, X_{i-1} = x_{i-1})
      Step 3: If x_i = STOP, return the sequence x_1 … x_i. Otherwise, set i = i + 1 and return to Step 2.
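
A sketch of this loop in code; the trigram distribution `q` below is a small hypothetical table (not from the slides), and `random.choices` draws x_i in Step 2:

```python
import random

# Hypothetical trigram distributions q(w | u, v); '*' pads the sentence start.
q = {
    ("*", "*"):        {"the": 1.0},
    ("*", "the"):      {"dog": 0.6, "cat": 0.4},
    ("the", "dog"):    {"barks": 1.0},
    ("the", "cat"):    {"sleeps": 1.0},
    ("dog", "barks"):  {"STOP": 1.0},
    ("cat", "sleeps"): {"STOP": 1.0},
}

def generate():
    prev2, prev1 = "*", "*"          # Step 1: x_0 = x_{-1} = *
    sentence = []
    while True:
        dist = q[(prev2, prev1)]
        w = random.choices(list(dist), weights=list(dist.values()))[0]  # Step 2
        if w == "STOP":              # Step 3: the stop symbol ends the sentence
            return sentence
        sentence.append(w)
        prev2, prev1 = prev1, w

print(generate())  # e.g. ['the', 'dog', 'barks'] or ['the', 'cat', 'sleeps']
```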

  28. 3-gram LMs
      • A trigram language model contains:
      • A vocabulary V
      • A non-negative parameter q(w | u, v) for every trigram, such that w ∈ V ∪ {STOP} and u, v ∈ V ∪ {*}
      • The probability of a sentence x_1, x_2, …, x_n, where x_n = STOP, is p(x_1, …, x_n) = ∏_{i=1}^{n} q(x_i | x_{i-2}, x_{i-1})

  30. 3-gram LMs: Example
      p(the dog barks STOP) = q(the | *, *) × …

  31. 3-gram LMs: Example
      p(the dog barks STOP) = q(the | *, *) × q(dog | *, the) × q(barks | the, dog) × q(STOP | dog, barks)
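
The same product written out numerically, with made-up values for the four q parameters (the numbers are purely illustrative):

```python
# Hypothetical values for the four trigram parameters in the example above.
q = {
    ("*", "*", "the"):        0.4,
    ("*", "the", "dog"):      0.1,
    ("the", "dog", "barks"):  0.3,
    ("dog", "barks", "STOP"): 0.8,
}

# p(the dog barks STOP) = q(the|*,*) * q(dog|*,the) * q(barks|the,dog) * q(STOP|dog,barks)
p = (q[("*", "*", "the")] * q[("*", "the", "dog")]
     * q[("the", "dog", "barks")] * q[("dog", "barks", "STOP")])
print(p)  # 0.4 * 0.1 * 0.3 * 0.8 ≈ 0.0096
```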

  32. Limitations
      • The Markov assumption is false: “He is from France, so it makes sense that his first language is …”
      • We want to model longer dependencies

  33. N-gram model

  34. More Examples
      • Yoav’s blog post: http://nbviewer.jupyter.org/gist/yoavg/d76121dfde2618422139
      • A 10-gram character-level LM generates, for example:
      First Citizen: Nay, then, that was hers, It speaks against your other service: But since the youth of the circumstance be spoken: Your uncle and one Baptista's daughter. SEBASTIAN: Do I stand till the break off.

  35. Maximum Likelihood Estimation
      • “Best” means “data likelihood reaches maximum”: θ̂ = argmax_θ P(X | θ)

  36. Maximum Likelihood Estimation: Unigram Language Model
      • Document: a paper with total #words = 100 and word counts: text 10, mining 5, association 3, database 3, algorithm 2, …, query 1, efficient 1, …
      • Estimation of p(w | θ): text 10/100, mining 5/100, association 3/100, database 3/100, …, query 1/100, …
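
A small sketch of the estimate on this slide: each word's ML probability is simply its count divided by the document length (only the counts listed above are included):

```python
# Word counts from the example document (total #words = 100); remaining words omitted.
counts = {"text": 10, "mining": 5, "association": 3, "database": 3,
          "algorithm": 2, "query": 1, "efficient": 1}
total = 100

theta = {w: c / total for w, c in counts.items()}  # p(w | theta) = c(w) / N
print(theta["text"])    # 0.1
print(theta["mining"])  # 0.05
```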

  37. Which Bag of Words Is More Likely to Be Generated?
      [figure: a short document of letters and two candidate bags of letters with different letter distributions]

  38. Parameter Estimation
      • General setting:
      • Given a (hypothesized & probabilistic) model that governs the random experiment
      • The model gives a probability of any data P(X | θ) that depends on the parameter θ
      • Now, given actual sample data X = {x_1, …, x_n}, what can we say about the value of θ?
      • Intuitively, take our best guess of θ
      • “Best” means “best explaining/fitting the data”
      • Generally an optimization problem

  39. Maximum Likelihood Estimation
      • Data: a collection of words w_1, w_2, …, w_N
      • Model: multinomial distribution p(W) with parameters θ_i = p(w_i)
      • Maximum likelihood estimator: θ̂ = argmax_{θ ∈ Θ} p(X | θ)

  40. Maximum Likelihood Estimation
      p(X | θ) = p(w_1, …, w_N | θ) ∝ ∏_{i} θ_i^{c(w_i)}
      ⇒ log p(X | θ) = Σ_{i} c(w_i) log θ_i + const
      θ̂ = argmax_{θ ∈ Θ} Σ_{i} c(w_i) log θ_i

  41. Maximum Likelihood Estimation
      θ̂ = argmax_{θ ∈ Θ} Σ_{i=1}^{N} c(w_i) log θ_i
      L(X, θ) = Σ_{i=1}^{N} c(w_i) log θ_i + λ (Σ_{i=1}^{N} θ_i − 1)   (Lagrange multiplier)
      ∂L/∂θ_i = c(w_i)/θ_i + λ = 0  →  θ_i = − c(w_i)/λ   (set partial derivatives to zero)
      Since Σ_{i=1}^{N} θ_i = 1 (requirement from probability), λ = − Σ_{i=1}^{N} c(w_i)
      ML estimate: θ_i = c(w_i) / Σ_{i=1}^{N} c(w_i)
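
As a quick numerical sanity check of this closed-form result (using a made-up subset of counts), the count-ratio estimate yields a higher log-likelihood than another valid distribution:

```python
import math

counts = {"text": 10, "mining": 5, "association": 3}      # illustrative subset of counts
N = sum(counts.values())

def log_lik(theta):
    # log p(X | theta) = sum_i c(w_i) * log theta_i  (up to an additive constant)
    return sum(c * math.log(theta[w]) for w, c in counts.items())

mle = {w: c / N for w, c in counts.items()}                # theta_i = c(w_i) / sum_j c(w_j)
other = {"text": 0.4, "mining": 0.4, "association": 0.2}   # some other valid distribution

print(log_lik(mle) > log_lik(other))  # True: the count-ratio estimate has higher likelihood
```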

  42. Maximum Likelihood Estimation
      • For N-gram language models: P(w_i | w_{i-1}, …, w_{i-N+1}) = c(w_i, w_{i-1}, …, w_{i-N+1}) / c(w_{i-1}, …, w_{i-N+1})
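
A sketch of this count-ratio estimate for the bigram case (N = 2) on a tiny made-up corpus, padding each sentence with '*' and STOP as in the earlier slides:

```python
from collections import Counter

corpus = [["the", "dog", "barks"], ["the", "cat", "sleeps"], ["the", "dog", "sleeps"]]

bigram_counts, context_counts = Counter(), Counter()
for sent in corpus:
    tokens = ["*"] + sent + ["STOP"]
    for prev, w in zip(tokens, tokens[1:]):
        bigram_counts[(prev, w)] += 1   # c(w_{i-1}, w_i)
        context_counts[prev] += 1       # c(w_{i-1})

def q(w, prev):
    # q(w | prev) = c(prev, w) / c(prev)
    return bigram_counts[(prev, w)] / context_counts[prev]

print(q("dog", "the"))    # 2/3: "the" is followed by "dog" in 2 of its 3 occurrences
print(q("barks", "dog"))  # 1/2
```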

  43. Practical Issues
      • We do everything in log space
      • Avoid underflow
      • Adding is faster than multiplying
      • log(p_1 × p_2) = log p_1 + log p_2
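
A small sketch of the underflow issue: multiplying many small probabilities collapses to 0.0 in floating point, while summing their logs stays well-behaved:

```python
import math

probs = [1e-5] * 100             # 100 conditional probabilities of 1e-5 each

product = 1.0
for p in probs:
    product *= p
print(product)                   # 0.0 -- the product underflows

log_prob = sum(math.log(p) for p in probs)
print(log_prob)                  # about -1151.3, the correct log probability
```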
