
Language Models: Prof. Srijan Kumar with Roshan Pati and Arindum Roy



  1. CSE 6240: Web Search and Text Mining, Spring 2020. Language Models. Prof. Srijan Kumar with Roshan Pati and Arindum Roy

  2. Language Models • What are language models? • Statistical language models – Unigram, bigram, and n-gram language models • Neural language models

  3. Language Models: Objective • Key question: How well does a model represent the language? – Character language model: Given alphabet vocabulary V, models the probability of generating strings in the language – Word language model: Given word vocabulary V, models the probability of generating sentences in the language

  4. Language Model: Applications • Assign a probability to sentences – Machine translation: P(high wind tonight) > P(large wind tonight) – Spell correction: "The office is about fifteen minuets from my house": P(about fifteen minutes from) > P(about fifteen minuets from) – Speech recognition: P(I saw a van) >> P(eyes awe of an) – Information retrieval: use words that you expect to find in matching documents as your query – Many more: summarization, question answering, etc.

  5. Language Models • What are language models? • Statistical language models • Neural language models

  6. Language Model: Definition • Goal: Compute the probability of a sentence or sequence of words: P(s) = P(w1, w2, …, wn) • Related task: probability of an upcoming word: P(w5 | w1, w2, w3, w4) • A model that computes either of these is a language model • How to compute the joint probability? – Intuition: apply the chain rule

  7. How To Compute Sentence Probability? • Given a sentence s = t1 t2 t3 t4 • Applying the chain rule under language model M
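
  For s = t1 t2 t3 t4, the chain rule factorizes the sentence probability into a product of per-word conditional probabilities:
      P(s | M) = P(t1 | M) x P(t2 | t1, M) x P(t3 | t1, t2, M) x P(t4 | t1, t2, t3, M)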

  8. Complexity of Language Models • The complexity of language models depends on the window of the word-word or character-character dependency they can handle • Common types are: – Unigram language model – Bigram language model – N-gram language model

  9. Unigram Model • A unigram language model models only the probability of each word in isolation – Does NOT model word-word dependency – The word order is irrelevant – Akin to the “bag of words” model

  10. Bigram Model • A bigram language model models the dependency between consecutive words – Does NOT model longer dependencies – Word order is relevant here

  11. N-gram Model • An n-gram language model models longer sequences of word dependencies – Most complex among all three

  12. Unigram Language Model: Example • What is the probability of the sentence s under language model M? • Example: s = “the man likes the woman”
      Language Model M:
      Word     Probability
      the      0.2
      a        0.1
      man      0.01
      woman    0.01
      said     0.03
      likes    0.02
      P(s | M) = 0.2 x 0.01 x 0.02 x 0.2 x 0.01 = 0.00000008

  13. Comparing Language Models • Given two language models, how can we decide which language model is better? • Solution: – Take a set S of sentences we desire to model – For each language model: find the probability of each sentence and average the probability scores – The language model with the highest average probability is the better fit for the language

  14. Comparing Language Models • s: “the man likes the woman” • M1: 0.2 x 0.01 x 0.02 x 0.2 x 0.01, so P(s|M1) = 0.00000008 • M2: 0.1 x 0.1 x 0.01 x 0.1 x 0.1, so P(s|M2) = 0.000001 • P(s|M2) > P(s|M1), so M2 is the better language model
      Word     Model M1     Model M2
      the      0.2          0.1
      a        0.1          0.02
      man      0.01         0.1
      woman    0.01         0.1
      said     0.03         0.02
      likes    0.02         0.01
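
  A minimal Python sketch of this comparison, assuming the two unigram models are simply word-to-probability tables (the function and variable names are illustrative, not from the slides):

      # Unigram models as word -> probability dictionaries (values from the slide)
      m1 = {"the": 0.2, "a": 0.1, "man": 0.01, "woman": 0.01, "said": 0.03, "likes": 0.02}
      m2 = {"the": 0.1, "a": 0.02, "man": 0.1, "woman": 0.1, "said": 0.02, "likes": 0.01}

      def sentence_prob(sentence, model):
          """Unigram probability of a sentence: product of per-word probabilities."""
          prob = 1.0
          for word in sentence.split():
              prob *= model.get(word, 0.0)  # unseen words get probability 0 here
          return prob

      s = "the man likes the woman"
      p1 = sentence_prob(s, m1)  # 8e-08
      p2 = sentence_prob(s, m2)  # 1e-06
      print("M2 is better" if p2 > p1 else "M1 is better")

  In practice the product of many small probabilities underflows, so implementations usually sum log-probabilities instead of multiplying raw probabilities.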

  15. Estimating Probabilities • N-gram conditional probabilities can be estimated from the raw occurrence counts in the observed corpus • Unigram • Bigram • N-gram
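
  In terms of corpus counts C(·), with N the total number of word tokens, the standard maximum-likelihood estimates (the same count-based form used in the case study on the next slides) are:
      Unigram: P(w) = C(w) / N
      Bigram: P(wi | wi-1) = C(wi-1, wi) / C(wi-1)
      N-gram: P(wi | wi-n+1, …, wi-1) = C(wi-n+1, …, wi) / C(wi-n+1, …, wi-1)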

  16. Estimating Bigram Probabilities: Case Study • Corpus: Berkeley Restaurant Project sentences

  17. Raw Bigram Counts: Case Study • Bigram matrix created from 9222 sentences

  18. Raw Bigram Probabilities: Case Study • Unigram counts • Normalize by unigram counts: P(want | i) = C(i, want) / C(i) = 827 / 2533 = 0.33
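
  A minimal Python sketch of this normalization, assuming the corpus is given as tokenized sentences (the helper name bigram_probabilities and the toy two-sentence corpus are illustrative, not from the slides):

      from collections import Counter, defaultdict

      def bigram_probabilities(sentences):
          """Estimate P(w2 | w1) = C(w1, w2) / C(w1) from tokenized sentences."""
          unigram_counts = Counter()
          bigram_counts = defaultdict(Counter)
          for tokens in sentences:
              unigram_counts.update(tokens)
              for w1, w2 in zip(tokens, tokens[1:]):
                  bigram_counts[w1][w2] += 1
          # Normalize each row of raw bigram counts by the unigram count of w1
          return {w1: {w2: c / unigram_counts[w1] for w2, c in nexts.items()}
                  for w1, nexts in bigram_counts.items()}

      probs = bigram_probabilities([["i", "want", "english", "food"],
                                    ["i", "want", "chinese", "food"]])
      print(probs["i"]["want"])  # 1.0 in this toy corpus; ~0.33 in the Berkeley Restaurant data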

  19. Language Models • What are language models? • Statistical language models – Unigram, bigram, and n-gram language models • Neural language models • Language models for IR

  20. Neural Language Models • So far, the language models have been statistics- and count-based • Now, language models are built with neural networks/deep learning • Key question: how to model sequences?

  21. Neural-based Bigram Language Model • Each input word is represented as a 1-hot encoding • Problem: does not model sequential information (too local)
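
  A minimal numpy sketch of such a model, assuming the architecture is a single learned weight matrix applied to the 1-hot previous word followed by a softmax (an illustrative assumption, not taken from the slide):

      import numpy as np

      vocab = ["the", "a", "man", "woman", "said", "likes"]
      V = len(vocab)

      def one_hot(word):
          vec = np.zeros(V)
          vec[vocab.index(word)] = 1.0
          return vec

      rng = np.random.default_rng(0)
      W = rng.normal(scale=0.1, size=(V, V))  # learned weights (random placeholders here)

      def next_word_distribution(prev_word):
          logits = one_hot(prev_word) @ W      # effectively looks up the row for prev_word
          exp = np.exp(logits - logits.max())  # softmax over the vocabulary
          return exp / exp.sum()

      # Only the immediately preceding word influences the prediction,
      # which is exactly the locality problem noted on the slide.
      print(next_word_distribution("the"))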

  22. Sequences in Inputs or Outputs?

  23. Sequences in Inputs or Outputs?

  24. Key Conceptual Ideas • Parameter Sharing – in computational graphs = adding gradients • “Unrolling” – in computational graphs with parameter sharing • Parameter Sharing + “Unrolling” – Allows modeling arbitrary length sequences! – Keeps number of parameters in check

  25. Recurrent Neural Network

  26. Recurrent Neural Network • We can process a sequence of vectors x by applying a recurrence formula at every time step • The same function f_W is used at every time step and shared across all data
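
  In its standard form, the recurrence is h_t = f_W(h_{t-1}, x_t): the new hidden state h_t is a function, with parameters W, of the previous hidden state h_{t-1} and the input x_t at the current time step.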

  27. (Vanilla) Recurrent Neural Network • The recurrence uses learned weight matrices
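
  A single vanilla RNN step is commonly written as h_t = tanh(W_hh h_{t-1} + W_xh x_t) with output y_t = W_hy h_t; a minimal numpy sketch of one step (dimensions and variable names are illustrative assumptions, not copied from the slide):

      import numpy as np

      def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy):
          """One time step: new hidden state and output from input x_t and previous state."""
          h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t)  # h_t = tanh(W_hh h_{t-1} + W_xh x_t)
          y_t = W_hy @ h_t                           # y_t = W_hy h_t
          return h_t, y_t

      rng = np.random.default_rng(0)
      input_dim, hidden_dim, output_dim = 4, 3, 4
      W_xh = rng.normal(size=(hidden_dim, input_dim))
      W_hh = rng.normal(size=(hidden_dim, hidden_dim))
      W_hy = rng.normal(size=(output_dim, hidden_dim))

      h = np.zeros(hidden_dim)                    # initial hidden state
      for x in np.eye(input_dim):                 # a toy sequence of 1-hot inputs
          h, y = rnn_step(x, h, W_xh, W_hh, W_hy)  # the same weights are reused at every step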

  28. RNN Computational Graph • Unrolling the recurrence gives a graph that runs from an initial hidden state, through the inputs at times 1, 2, 3, …, to a final hidden state

  29. RNN Computational Graph • The same weight matrix W is shared across all time steps

  30. RNN Computational Graph: Many to Many • A many-to-many architecture produces one output at every time step (outputs at times 1, 2, 3, …)

  31. RNN Computational Graph: Many to Many • A loss is computed from the output at each time step (losses at times 1, 2, 3, …)

  32. RNN Computational Graph: Many to Many • The per-time-step losses are combined into a total loss

  33. RNN Computational Graph: Many to one • Many-to-one architecture has one final output

  34. RNN Computational Graph: One to many • One-to-many architecture has one input and several outputs

  35. Example: Character-level Language Model • Input: one-hot representation of the characters • Vocabulary: [‘h’, ‘e’, ‘l’, ‘o’]

  36. Example: Character-level Language Model • Transform every input into a hidden vector

  37. Example: Character-level Language Model • Transform each hidden vector into an output vector
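
  A minimal numpy sketch of this pipeline for the vocabulary [‘h’, ‘e’, ‘l’, ‘o’], predicting each next character of “hello” (the weights are random, untrained placeholders, so the predictions are illustrative only; variable names are assumptions, not from the slides):

      import numpy as np

      vocab = ["h", "e", "l", "o"]
      char_to_idx = {c: i for i, c in enumerate(vocab)}

      def one_hot(ch):
          v = np.zeros(len(vocab))
          v[char_to_idx[ch]] = 1.0
          return v

      rng = np.random.default_rng(0)
      hidden_dim = 3
      W_xh = rng.normal(scale=0.1, size=(hidden_dim, len(vocab)))
      W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
      W_hy = rng.normal(scale=0.1, size=(len(vocab), hidden_dim))

      h = np.zeros(hidden_dim)                        # initial hidden state
      for ch, target in zip("hell", "ello"):          # inputs and next-character targets
          h = np.tanh(W_hh @ h + W_xh @ one_hot(ch))  # hidden vector for this input
          logits = W_hy @ h                           # output vector (scores over vocab)
          probs = np.exp(logits) / np.exp(logits).sum()
          predicted = vocab[int(np.argmax(probs))]
          print(f"after '{ch}': predict '{predicted}' (target '{target}')")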
