  1. Language Modeling Michael Collins, Columbia University

  2. Overview ◮ The language modeling problem ◮ Trigram models ◮ Evaluating language models: perplexity ◮ Estimation techniques: ◮ Linear interpolation ◮ Discounting methods

  3. The Language Modeling Problem ◮ We have some (finite) vocabulary, say V = { the, a, man, telescope, Beckham, two, . . . } ◮ We have an (infinite) set of strings, V†, for example: the STOP; a STOP; the fan STOP; the fan saw Beckham STOP; the fan saw saw STOP; the fan saw Beckham play for Real Madrid STOP; . . .

  4. The Language Modeling Problem (Continued) ◮ We have a training sample of example sentences in English

  5. The Language Modeling Problem (Continued) ◮ We have a training sample of example sentences in English ◮ We need to “learn” a probability distribution p, i.e., p is a function that satisfies p(x) ≥ 0 for all x ∈ V†, and ∑_{x ∈ V†} p(x) = 1

  6. The Language Modeling Problem (Continued) ◮ We have a training sample of example sentences in English ◮ We need to “learn” a probability distribution p, i.e., p is a function that satisfies p(x) ≥ 0 for all x ∈ V†, and ∑_{x ∈ V†} p(x) = 1, for example: p(the STOP) = 10^{-12}, p(the fan STOP) = 10^{-8}, p(the fan saw Beckham STOP) = 2 × 10^{-8}, p(the fan saw saw STOP) = 10^{-15}, . . . , p(the fan saw Beckham play for Real Madrid STOP) = 2 × 10^{-9}, . . .

  7. Why on earth would we want to do this?! ◮ Speech recognition was the original motivation. (Related problems are optical character recognition, handwriting recognition.)

  8. Why on earth would we want to do this?! ◮ Speech recognition was the original motivation. (Related problems are optical character recognition, handwriting recognition.) ◮ The estimation techniques developed for this problem will be VERY useful for other problems in NLP

  9. A Naive Method ◮ We have N training sentences ◮ For any sentence x_1 . . . x_n, c(x_1 . . . x_n) is the number of times the sentence is seen in our training data ◮ A naive estimate: p(x_1 . . . x_n) = c(x_1 . . . x_n) / N
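
A minimal sketch of this naive estimator in Python (the helper name naive_estimate and the tiny training set are illustrative, not from the slides):

```python
from collections import Counter

def naive_estimate(training_sentences):
    """Return p(x_1 ... x_n) = c(x_1 ... x_n) / N, where c counts how often
    the exact sentence appears in the training data and N is the number of
    training sentences."""
    counts = Counter(training_sentences)
    N = len(training_sentences)
    return lambda sentence: counts[sentence] / N

# Illustrative usage: any sentence never seen in training gets probability 0,
# which suggests why the estimate is naive: it cannot generalize to unseen sentences.
train = ["the fan saw Beckham STOP", "the fan saw Beckham STOP", "the STOP"]
p = naive_estimate(train)
print(p("the fan saw Beckham STOP"))       # 2/3
print(p("the fan saw Beckham play STOP"))  # 0.0 for an unseen sentence
```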

  10. Overview ◮ The language modeling problem ◮ Trigram models ◮ Evaluating language models: perplexity ◮ Estimation techniques: ◮ Linear interpolation ◮ Discounting methods

  11. Markov Processes ◮ Consider a sequence of random variables X_1, X_2, . . . , X_n. Each random variable can take any value in a finite set V. For now we assume the length n is fixed (e.g., n = 100). ◮ Our goal: model P(X_1 = x_1, X_2 = x_2, . . . , X_n = x_n)

  12. First-Order Markov Processes P(X_1 = x_1, X_2 = x_2, . . . , X_n = x_n)

  13. First-Order Markov Processes P(X_1 = x_1, X_2 = x_2, . . . , X_n = x_n) = P(X_1 = x_1) ∏_{i=2}^{n} P(X_i = x_i | X_1 = x_1, . . . , X_{i-1} = x_{i-1})

  14. First-Order Markov Processes P(X_1 = x_1, X_2 = x_2, . . . , X_n = x_n) = P(X_1 = x_1) ∏_{i=2}^{n} P(X_i = x_i | X_1 = x_1, . . . , X_{i-1} = x_{i-1}) = P(X_1 = x_1) ∏_{i=2}^{n} P(X_i = x_i | X_{i-1} = x_{i-1})

  15. First-Order Markov Processes P(X_1 = x_1, X_2 = x_2, . . . , X_n = x_n) = P(X_1 = x_1) ∏_{i=2}^{n} P(X_i = x_i | X_1 = x_1, . . . , X_{i-1} = x_{i-1}) = P(X_1 = x_1) ∏_{i=2}^{n} P(X_i = x_i | X_{i-1} = x_{i-1}) The first-order Markov assumption: for any i ∈ {2 . . . n}, for any x_1 . . . x_i, P(X_i = x_i | X_1 = x_1, . . . , X_{i-1} = x_{i-1}) = P(X_i = x_i | X_{i-1} = x_{i-1})
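
As a worked instance of this factorization (an illustration added here, not a slide), take n = 3; the chain rule is exact, and the first-order Markov assumption then simplifies the last factor:

```latex
% Chain rule (exact):
P(X_1 = x_1, X_2 = x_2, X_3 = x_3)
  = P(X_1 = x_1)\,P(X_2 = x_2 \mid X_1 = x_1)\,P(X_3 = x_3 \mid X_1 = x_1, X_2 = x_2)
% Under the first-order Markov assumption:
  = P(X_1 = x_1)\,P(X_2 = x_2 \mid X_1 = x_1)\,P(X_3 = x_3 \mid X_2 = x_2)
```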

  16. Second-Order Markov Processes P(X_1 = x_1, X_2 = x_2, . . . , X_n = x_n)

  17. Second-Order Markov Processes P(X_1 = x_1, X_2 = x_2, . . . , X_n = x_n) = P(X_1 = x_1) × P(X_2 = x_2 | X_1 = x_1) × ∏_{i=3}^{n} P(X_i = x_i | X_{i-2} = x_{i-2}, X_{i-1} = x_{i-1})

  18. Second-Order Markov Processes P(X_1 = x_1, X_2 = x_2, . . . , X_n = x_n) = P(X_1 = x_1) × P(X_2 = x_2 | X_1 = x_1) × ∏_{i=3}^{n} P(X_i = x_i | X_{i-2} = x_{i-2}, X_{i-1} = x_{i-1}) = ∏_{i=1}^{n} P(X_i = x_i | X_{i-2} = x_{i-2}, X_{i-1} = x_{i-1}) (For convenience we assume x_0 = x_{-1} = *, where * is a special “start” symbol.)

  19. Modeling Variable Length Sequences ◮ We would like the length of the sequence, n, to also be a random variable ◮ A simple solution: always define X_n = STOP, where STOP is a special symbol

  20. Modeling Variable Length Sequences ◮ We would like the length of the sequence, n, to also be a random variable ◮ A simple solution: always define X_n = STOP, where STOP is a special symbol ◮ Then use a Markov process as before: P(X_1 = x_1, X_2 = x_2, . . . , X_n = x_n) = ∏_{i=1}^{n} P(X_i = x_i | X_{i-2} = x_{i-2}, X_{i-1} = x_{i-1}) (For convenience we assume x_0 = x_{-1} = *, where * is a special “start” symbol.)
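
One way to see why the STOP symbol makes the length a random variable is to sample from the model until STOP is generated. The sketch below assumes a second-order model q; the function sample_sentence, the uniform q, and the tiny vocabulary are made up purely for illustration.

```python
import random

def sample_sentence(q, vocab, max_len=100):
    """Sample a variable-length sequence from a second-order (trigram) model.

    q(w, u, v) stands in for P(X_i = w | X_{i-2} = u, X_{i-1} = v); vocab is
    V together with STOP. Generation halts when STOP is drawn, so the length
    of the sampled sentence is itself random.
    """
    u, v = "*", "*"              # x_0 = x_{-1} = * (start symbols)
    words = []
    for _ in range(max_len):     # cap only as a safety net for this sketch
        probs = [q(w, u, v) for w in vocab]
        w = random.choices(vocab, weights=probs)[0]
        words.append(w)
        if w == "STOP":
            break
        u, v = v, w
    return words

# Hypothetical toy model: uniform over a tiny vocabulary.
vocab = ["the", "dog", "barks", "STOP"]
q = lambda w, u, v: 1.0 / len(vocab)
print(sample_sentence(q, vocab))  # e.g. ['dog', 'the', 'STOP']
```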

  21. Trigram Language Models ◮ A trigram language model consists of: 1. A finite set V 2. A parameter q(w | u, v) for each trigram u, v, w such that w ∈ V ∪ {STOP}, and u, v ∈ V ∪ {*}.

  22. Trigram Language Models ◮ A trigram language model consists of: 1. A finite set V 2. A parameter q(w | u, v) for each trigram u, v, w such that w ∈ V ∪ {STOP}, and u, v ∈ V ∪ {*}. ◮ For any sentence x_1 . . . x_n where x_i ∈ V for i = 1 . . . (n − 1), and x_n = STOP, the probability of the sentence under the trigram language model is p(x_1 . . . x_n) = ∏_{i=1}^{n} q(x_i | x_{i-2}, x_{i-1}), where we define x_0 = x_{-1} = *.
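
A minimal sketch of evaluating this product in Python, assuming the parameters q(w | u, v) are stored in a dictionary keyed by (u, v, w); the function name and the particular parameter values are hypothetical:

```python
def sentence_probability(sentence, q):
    """Compute p(x_1 ... x_n) = prod_i q(x_i | x_{i-2}, x_{i-1}) for a
    tokenized sentence that ends in STOP, with x_0 = x_{-1} = '*'."""
    prob = 1.0
    u, v = "*", "*"
    for w in sentence:
        prob *= q.get((u, v, w), 0.0)  # an unseen trigram gives probability 0 here
        u, v = v, w
    return prob

# Hypothetical parameter values, mirroring the worked example on the next slide.
q = {
    ("*", "*", "the"): 0.5,
    ("*", "the", "dog"): 0.1,
    ("the", "dog", "barks"): 0.05,
    ("dog", "barks", "STOP"): 0.4,
}
print(sentence_probability(["the", "dog", "barks", "STOP"], q))  # 0.5*0.1*0.05*0.4 = 0.001
```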

  23. An Example For the sentence the dog barks STOP we would have p(the dog barks STOP) = q(the | *, *) × q(dog | *, the) × q(barks | the, dog) × q(STOP | dog, barks)

  24. The Trigram Estimation Problem Remaining estimation problem: q(w_i | w_{i-2}, w_{i-1}) For example: q(laughs | the, dog)

  25. The Trigram Estimation Problem Remaining estimation problem: q(w_i | w_{i-2}, w_{i-1}) For example: q(laughs | the, dog) A natural estimate (the “maximum likelihood estimate”): q(w_i | w_{i-2}, w_{i-1}) = Count(w_{i-2}, w_{i-1}, w_i) / Count(w_{i-2}, w_{i-1}), e.g., q(laughs | the, dog) = Count(the, dog, laughs) / Count(the, dog)
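
A sketch of this maximum likelihood estimate computed from raw counts (the function name mle_trigram_estimator and the two-sentence corpus are illustrative):

```python
from collections import Counter

def mle_trigram_estimator(corpus):
    """Build q(w | u, v) = Count(u, v, w) / Count(u, v) from a list of
    tokenized sentences, each ending in STOP."""
    trigram_counts, bigram_counts = Counter(), Counter()
    for sentence in corpus:
        tokens = ["*", "*"] + sentence            # x_0 = x_{-1} = *
        for i in range(2, len(tokens)):
            u, v, w = tokens[i - 2], tokens[i - 1], tokens[i]
            trigram_counts[(u, v, w)] += 1
            bigram_counts[(u, v)] += 1

    def q(w, u, v):
        if bigram_counts[(u, v)] == 0:
            return 0.0    # undefined under pure MLE; see the sparse data discussion
        return trigram_counts[(u, v, w)] / bigram_counts[(u, v)]

    return q

corpus = [["the", "dog", "barks", "STOP"], ["the", "dog", "laughs", "STOP"]]
q = mle_trigram_estimator(corpus)
print(q("laughs", "the", "dog"))  # Count(the, dog, laughs) / Count(the, dog) = 1/2
```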

  26. Sparse Data Problems A natural estimate (the “maximum likelihood estimate”): q(w_i | w_{i-2}, w_{i-1}) = Count(w_{i-2}, w_{i-1}, w_i) / Count(w_{i-2}, w_{i-1}), e.g., q(laughs | the, dog) = Count(the, dog, laughs) / Count(the, dog) Say our vocabulary size is N = |V|; then there are N^3 parameters in the model, e.g., N = 20,000 ⇒ 20,000^3 = 8 × 10^{12} parameters

  27. Overview ◮ The language modeling problem ◮ Trigram models ◮ Evaluating language models: perplexity ◮ Estimation techniques: ◮ Linear interpolation ◮ Discounting methods

  28. Evaluating a Language Model: Perplexity ◮ We have some test data, m sentences s_1, s_2, s_3, . . . , s_m

  29. Evaluating a Language Model: Perplexity ◮ We have some test data, m sentences s_1, s_2, s_3, . . . , s_m ◮ We could look at the probability under our model, ∏_{i=1}^{m} p(s_i). Or more conveniently, the log probability log ∏_{i=1}^{m} p(s_i) = ∑_{i=1}^{m} log p(s_i)

  30. Evaluating a Language Model: Perplexity ◮ We have some test data, m sentences s_1, s_2, s_3, . . . , s_m ◮ We could look at the probability under our model, ∏_{i=1}^{m} p(s_i). Or more conveniently, the log probability log ∏_{i=1}^{m} p(s_i) = ∑_{i=1}^{m} log p(s_i) ◮ In fact the usual evaluation measure is perplexity: Perplexity = 2^{-l}, where l = (1/M) ∑_{i=1}^{m} log_2 p(s_i) and M is the total number of words in the test data.
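
A short sketch of this computation, assuming each test sentence is tokenized, ends in STOP, is counted toward M including STOP, and receives non-zero probability from the model (the function name and toy model are illustrative):

```python
import math

def perplexity(test_sentences, p):
    """Perplexity = 2^(-l), where l = (1/M) * sum_i log2 p(s_i) and M is the
    total number of words in the test data (STOP included in this sketch)."""
    M = sum(len(s) for s in test_sentences)
    l = sum(math.log2(p(s)) for s in test_sentences) / M
    return 2 ** (-l)

# Toy model for illustration: every word gets probability 1/2, so each
# three-word sentence has probability 1/8 and the perplexity is 2.
test = [["the", "dog", "STOP"], ["a", "dog", "STOP"]]
uniform_p = lambda s: (1 / 2) ** len(s)
print(perplexity(test, uniform_p))  # 2.0 -- the effective branching factor
```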

  31. Some Intuition about Perplexity ◮ Say we have a vocabulary V, and N = |V| + 1, and a model that predicts q(w | u, v) = 1/N for all w ∈ V ∪ {STOP}, for all u, v ∈ V ∪ {*}. ◮ Easy to calculate the perplexity in this case: Perplexity = 2^{-l} where l = log_2 (1/N) ⇒ Perplexity = N. Perplexity is a measure of effective “branching factor”
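
Spelling out that calculation (a worked derivation of the slide's claim, added here): under the uniform model each of the M test words contributes log_2 (1/N), so

```latex
l = \frac{1}{M}\sum_{i=1}^{m}\log_2 p(s_i)
  = \frac{1}{M}\cdot M\cdot \log_2\frac{1}{N}
  = -\log_2 N,
\qquad
\text{Perplexity} = 2^{-l} = 2^{\log_2 N} = N.
```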

  32. Typical Values of Perplexity ◮ Results from Goodman (“A bit of progress in language modeling”), where |V| = 50,000 ◮ A trigram model: p(x_1 . . . x_n) = ∏_{i=1}^{n} q(x_i | x_{i-2}, x_{i-1}). Perplexity = 74

  33. Typical Values of Perplexity ◮ Results from Goodman (“A bit of progress in language modeling”), where |V| = 50,000 ◮ A trigram model: p(x_1 . . . x_n) = ∏_{i=1}^{n} q(x_i | x_{i-2}, x_{i-1}). Perplexity = 74 ◮ A bigram model: p(x_1 . . . x_n) = ∏_{i=1}^{n} q(x_i | x_{i-1}). Perplexity = 137

  34. Typical Values of Perplexity ◮ Results from Goodman (“A bit of progress in language modeling”), where |V| = 50,000 ◮ A trigram model: p(x_1 . . . x_n) = ∏_{i=1}^{n} q(x_i | x_{i-2}, x_{i-1}). Perplexity = 74 ◮ A bigram model: p(x_1 . . . x_n) = ∏_{i=1}^{n} q(x_i | x_{i-1}). Perplexity = 137 ◮ A unigram model: p(x_1 . . . x_n) = ∏_{i=1}^{n} q(x_i). Perplexity = 955

  35. Some History ◮ Shannon conducted experiments on entropy of English i.e., how good are people at the perplexity game? C. Shannon. Prediction and entropy of printed English. Bell Systems Technical Journal, 30:50–64, 1951.
