

SLIDE 1

CSE 6240: Web Search and Text Mining. Spring 2020

Language Models

  • Prof. Srijan Kumar, with Roshan Pati and Arindum Roy

SLIDE 2

Language Models

  • What are language models?
  • Statistical language models

– Unigram, bigram and n-gram language model

  • Neural language models
SLIDE 3

Language Models: Objective

  • Key question: How well does a model represent the language?

– Character language model: Given alphabet vocabulary V, models the probability of generating strings in the language
– Word language model: Given word vocabulary V, models the probability of generating sentences in the language

SLIDE 4

Language Model: Applications

  • Assign a probability to sentences

– Machine translation:

  • P(high wind tonight) > P(large wind tonight)

– Spell correction:

  • The office is about fifteen minuets from my house
  • P(about fifteen minutes from) > P(about fifteen minuets from)

– Speech recognition:

  • P(I saw a van) >> P(eyes awe of an)

– Information retrieval: use words that you expect to find in matching documents as your query
– Many more: summarization, question-answering, and more

SLIDE 5

Language Models

  • What are language models?
  • Statistical language models
  • Neural language models
SLIDE 6

Language Model: Definition

  • Goal: Compute the probability of a sentence or sequence of words: P(s) = P(w1, w2, …, wn)
  • Related task: Probability of an upcoming word: P(w5 | w1, w2, w3, w4)

  • A model that computes either of these is a language model
  • How to compute the joint probability?

– Intuition: apply the chain rule

SLIDE 7

How To Compute Sentence Probability?

  • Given sentence s = t1 t2 t3 t4
  • Applying the chain rule under language model M:

P(s | M) = P(t1 | M) P(t2 | t1, M) P(t3 | t1, t2, M) P(t4 | t1, t2, t3, M)
SLIDE 8

Complexity of Language Models

  • The complexity of language models depends on the window of the word-word or character-character dependency they can handle

  • Common types are:

– Unigram language model
– Bigram language model
– N-gram language model

SLIDE 9

Unigram Model

  • Unigram language model only models the probability of each word according to the model

– Does NOT model word-word dependency
– The word order is irrelevant
– Akin to the “bag of words” model

SLIDE 10

Bigram Model

  • Bigram language model models the consecutive word dependency

– Does NOT model longer dependency
– Word order is relevant here

SLIDE 11

N-gram Model

  • N-gram language model models longer sequences of word dependency

– Most complex among all three

SLIDE 12

Unigram Language Model: Example

  • What is the probability of the sentence s under language model M?
  • Example:

s = “the man likes the woman”
P(s | M) = 0.2 x 0.01 x 0.02 x 0.2 x 0.01 = 0.00000008

Language Model M:

Word     Probability
the      0.2
a        0.1
man      0.01
woman    0.01
said     0.03
likes    0.02
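As an illustration (not from the slides), a minimal Python sketch of the same calculation; the word probabilities are the ones from model M above, and unigram_sentence_prob is a hypothetical helper.

```python
# Hypothetical sketch: probability of a sentence under a unigram language model.
unigram_probs = {"the": 0.2, "a": 0.1, "man": 0.01,
                 "woman": 0.01, "said": 0.03, "likes": 0.02}

def unigram_sentence_prob(sentence, probs):
    """P(s|M) is the product of the unigram probabilities of the words in s."""
    p = 1.0
    for word in sentence.split():
        p *= probs.get(word, 0.0)  # words outside the vocabulary get probability 0
    return p

print(unigram_sentence_prob("the man likes the woman", unigram_probs))
# ~8e-08, matching the 0.00000008 on the slide
```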

SLIDE 13

Comparing Language Models

  • Given two language models, how can we decide which language model is better?
  • Solution:

– Take a set S of sentences we desire to model
– For each language model:

  • Find the probability of each sentence
  • Average the probability scores

– The language model with the highest average probability is the best fit for the language

SLIDE 14

Comparing Language Models

  • s: “the man likes the woman”
  • M1: 0.2 x 0.01 x 0.02 x 0.2 x 0.01 ⇒ P(s|M1) = 0.00000008
  • M2: 0.1 x 0.1 x 0.01 x 0.1 x 0.1 ⇒ P(s|M2) = 0.000001
  • P(s|M2) > P(s|M1) ⇒ M2 is a better language model

Language Model M1          Language Model M2
Word     Probability       Word     Probability
the      0.2               the      0.1
a        0.1               a        0.02
man      0.01              man      0.1
woman    0.01              woman    0.1
said     0.03              said     0.02
likes    0.02              likes    0.01
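A minimal sketch of the comparison recipe above, assuming the two unigram tables and a small made-up sentence set S; in practice, average log-probability (or perplexity) is used instead to avoid numerical underflow.

```python
# Hypothetical sketch: pick the better of two unigram language models by the
# average probability they assign to a set S of sentences.
M1 = {"the": 0.2, "a": 0.1, "man": 0.01, "woman": 0.01, "said": 0.03, "likes": 0.02}
M2 = {"the": 0.1, "a": 0.02, "man": 0.1, "woman": 0.1, "said": 0.02, "likes": 0.01}

def sentence_prob(sentence, model):
    p = 1.0
    for w in sentence.split():
        p *= model.get(w, 0.0)
    return p

def average_prob(sentences, model):
    return sum(sentence_prob(s, model) for s in sentences) / len(sentences)

S = ["the man likes the woman", "a woman said"]   # made-up evaluation set
better = max([("M1", M1), ("M2", M2)], key=lambda nm: average_prob(S, nm[1]))
print(better[0])   # "M2" for this toy set, consistent with the comparison above
```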

SLIDE 15

Estimating Probabilities

  • N-gram conditional probabilities can be estimated from the raw occurrence counts in the observed corpus
  • Unigram: P(wi) = c(wi) / N, where N is the total number of word tokens
  • Bigram: P(wi | wi-1) = c(wi-1, wi) / c(wi-1)
  • N-gram: P(wi | wi-n+1, …, wi-1) = c(wi-n+1, …, wi) / c(wi-n+1, …, wi-1)
SLIDE 16

Estimating Bigram Probabilities: Case Study

  • Corpus: Berkeley Restaurant Project sentences


SLIDE 17

Raw Bigram Counts: Case Study

  • Bigram matrix created from 9222 sentences
SLIDE 18

Raw Bigram Probabilities: Case Study

  • Unigram counts
  • Normalize by unigrams

P(want | i) = C(i, want)/C(i) = 827 / 2533 = 0.33
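A small sketch of the same maximum-likelihood estimate in Python; the toy corpus below is made up, since the Berkeley Restaurant Project data is not reproduced here.

```python
# Hypothetical sketch: estimate bigram probabilities by maximum likelihood
# from raw counts, P(w2 | w1) = C(w1, w2) / C(w1).
from collections import Counter

corpus = [
    "i want to eat chinese food",
    "i want english food",
    "tell me about chinese food",
]

unigram_counts, bigram_counts = Counter(), Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def bigram_prob(w1, w2):
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

print(bigram_prob("i", "want"))  # 2/2 = 1.0 on this toy corpus
```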

SLIDE 19

Language Models

  • What are language models?
  • Statistical language models

– Unigram, bigram, and n-gram language models

  • Neural language models
  • Language models for IR
SLIDE 20

Neural Language Models

  • So far, the language models have been statistics and counting based
  • Now, language models are created using neural networks/deep learning

  • Key question: how to model sequences?
SLIDE 21

Neural-based Bigram Language Model

Problem: Does not model sequential information (too local)

Each input word is represented as a 1-hot encoding

SLIDE 22

Sequences in Inputs or Outputs?

SLIDE 23

Sequences in Inputs or Outputs?

SLIDE 24

Key Conceptual Ideas

  • Parameter Sharing

– in computational graphs = adding gradients

  • “Unrolling”

– in computational graphs with parameter sharing

  • Parameter Sharing + “Unrolling”

– Allows modeling arbitrary length sequences!
– Keeps number of parameters in check

SLIDE 25

Recurrent Neural Network

SLIDE 26

Recurrent Neural Network

  • We can process a sequence of vectors x by applying a recurrence formula at every time step: ht = fW(ht-1, xt)
  • The same function fW is used at every time step and shared across all data
SLIDE 27

(Vanilla) Recurrent Neural Network

  • Vanilla RNN with learned weight matrices: ht = tanh(Whh ht-1 + Wxh xt), output yt = Why ht
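A minimal numpy sketch (not from the slides) of one vanilla RNN step under the recurrence above; the sizes and random weights are arbitrary placeholders.

```python
# Hypothetical sketch: one forward step of a vanilla RNN,
# h_t = tanh(W_hh @ h_prev + W_xh @ x_t), y_t = W_hy @ h_t.
import numpy as np

hidden_size, input_size, output_size = 8, 4, 4
rng = np.random.default_rng(0)
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))
W_hy = rng.normal(scale=0.1, size=(output_size, hidden_size))

def rnn_step(h_prev, x_t):
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t)   # new hidden state
    y_t = W_hy @ h_t                            # output (e.g., scores over a vocabulary)
    return h_t, y_t

# Process a toy sequence: the same weights are reused at every time step.
h = np.zeros(hidden_size)
for x in rng.normal(size=(5, input_size)):
    h, y = rnn_step(h, x)
```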

SLIDE 28

RNN Computational Graph

Figure: input at time 1, input at time 2, input at time 3; initial hidden state; final hidden state

SLIDE 29

RNN Computational Graph

  • The same weight matrix W is shared across all time steps

Shared weight matrix

SLIDE 30

RNN Computational Graph: Many to Many

Figure: output at time 1, output at time 2, output at time 3; final output

  • Many-to-many architecture has one output per time step
SLIDE 31

RNN Computational Graph: Many to Many

Figure: loss at time 1, loss at time 2, loss at time 3; final loss

SLIDE 32

RNN Computational Graph: Many to Many

Total loss

SLIDE 33

RNN Computational Graph: Many to one

  • Many-to-one architecture has one final output
SLIDE 34

RNN Computational Graph: One to many

  • One-to-many architecture has one input and several outputs
SLIDE 35

Example: Character-level Language Model

  • Input: one hot representation of the characters
  • Vocabulary: [‘h’, ‘e’, ‘l’, ‘o’]
SLIDE 36

Example: Character-level Language Model

  • Transform every input into the hidden vector

SLIDE 37

Example: Character-level Language Model

  • Transform each hidden vector into an output vector

SLIDE 38

Example: Generating Output via Sampling

  • Generating output: sample from the vocabulary based on the normalized output layer
  • At test time, sample one character at a time and feed back into the RNN model at the next time step
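A rough sketch (not from the slides) of this test-time sampling loop over the ['h', 'e', 'l', 'o'] vocabulary from the earlier slide; the weights here are untrained and random, so the output is gibberish, but a trained model would use the same loop.

```python
# Hypothetical sketch: character-level sampling, feeding each sampled
# character back in as the next input.
import numpy as np

vocab = ['h', 'e', 'l', 'o']
hidden_size = 8
rng = np.random.default_rng(1)
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
W_xh = rng.normal(scale=0.1, size=(hidden_size, len(vocab)))
W_hy = rng.normal(scale=0.1, size=(len(vocab), hidden_size))

def one_hot(index, size):
    v = np.zeros(size)
    v[index] = 1.0
    return v

def sample_text(first_char='h', length=10):
    h = np.zeros(hidden_size)
    idx = vocab.index(first_char)
    out = [first_char]
    for _ in range(length):
        x = one_hot(idx, len(vocab))
        h = np.tanh(W_hh @ h + W_xh @ x)
        scores = W_hy @ h
        probs = np.exp(scores) / np.exp(scores).sum()   # normalized output layer
        idx = rng.choice(len(vocab), p=probs)           # sample the next character
        out.append(vocab[idx])
    return ''.join(out)

print(sample_text())
```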

SLIDE 39

Calculating Loss: BackProp Through Time

SLIDE 40

Truncated BackProp Through Time
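As a rough sketch of the idea (assuming PyTorch, which the lecture does not mention): truncated backpropagation through time carries the hidden state forward across fixed-length chunks of the sequence, but detaches it between chunks so that gradients only flow within each chunk.

```python
# Hypothetical sketch: truncated BPTT over a long sequence in fixed-length chunks.
import torch
import torch.nn as nn

vocab_size, hidden_size, chunk_len = 4, 16, 25
rnn = nn.RNN(vocab_size, hidden_size, batch_first=True)
head = nn.Linear(hidden_size, vocab_size)
optimizer = torch.optim.Adam(list(rnn.parameters()) + list(head.parameters()), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Toy data: a long random sequence with next-step prediction targets.
seq_len = 200
targets = torch.randint(vocab_size, (1, seq_len))
inputs = torch.nn.functional.one_hot(targets, vocab_size).float()

h = None
for start in range(0, seq_len - 1, chunk_len):
    x = inputs[:, start:start + chunk_len]
    y = targets[:, start + 1:start + chunk_len + 1]
    out, h = rnn(x, h)
    loss = loss_fn(head(out[:, :y.shape[1]]).reshape(-1, vocab_size), y.reshape(-1))
    optimizer.zero_grad()
    loss.backward()       # backprop only within this chunk
    optimizer.step()
    h = h.detach()        # truncate: do not backprop into earlier chunks
```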

SLIDE 41

Truncated BackProp Through Time

SLIDE 42

Truncated BackProp Through Time

SLIDE 43

Example: Learning to Write Like Shakespeare

SLIDE 44

Example: Learning to Write Like Shakespeare

SLIDE 45

Example: Learning to Write Like Shakespeare

SLIDE 46

Example: Learning to Code

SLIDE 47

Example: Generating Rap Lyrics

SLIDE 48

Example: Movie generated by AI

SLIDE 49

Complex RNNs: Multilayer

  • Multilayer RNNs: create multiple hidden layers stacked on top of one another

SLIDE 50

Long Short-Term Memory (LSTM)

  • Problem with RNNs: they cannot model long sequences well

– Vanishing gradient problem: the gradient of the loss function decays exponentially with time

  • LSTM overcomes this issue
SLIDE 51

LSTM Architecture

SLIDE 52

Long Short-Term Memory (LSTM)

  • Cell State: long-term memory of the information
SLIDE 53

LSTM Intuition: Forget Gate

  • Forget gate: should we remember the past information?
SLIDE 54

LSTM Intuition: Input Gate

  • Input gate: should we update the memory using the new information bit? If so, by how much?

SLIDE 55

LSTM Intuition: Memory Update

  • Forget what needs to be forgotten, update what needs to be updated

SLIDE 56

LSTM Intuition: Output Gate

  • Output gate: should we output this bit of information, e.g., to deeper LSTM layers?
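Putting the four gates together, a minimal numpy sketch (not from the slides) of one LSTM cell step; the single stacked weight matrix is one illustrative layout, not the lecture's notation.

```python
# Hypothetical sketch: one LSTM cell step with forget, input, and output gates.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """W maps the concatenated [h_prev, x_t] to the 4 gate pre-activations."""
    z = W @ np.concatenate([h_prev, x_t]) + b
    H = h_prev.shape[0]
    f = sigmoid(z[0:H])          # forget gate: keep past memory?
    i = sigmoid(z[H:2*H])        # input gate: admit new information?
    g = np.tanh(z[2*H:3*H])      # candidate memory content
    o = sigmoid(z[3*H:4*H])      # output gate: expose memory to the output?
    c_t = f * c_prev + i * g     # additive memory update
    h_t = o * np.tanh(c_t)       # hidden output
    return h_t, c_t

# Toy dimensions and random weights, just to show the shapes involved.
H, D = 8, 4
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4 * H, H + D))
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
h, c = lstm_step(rng.normal(size=D), h, c, W, b)
```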

SLIDE 57

LSTM Intuition: Additive Updates

  • Backpropagation from ct to ct-1 requires only elementwise multiplication by f, no matrix multiply by W

SLIDE 58

LSTM Intuition: Additive Updates

SLIDE 59

Gated Recurrent Unit (GRU)

  • Simpler than LSTM
  • No separate memory unit, memory = hidden output
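For comparison, a minimal sketch (not from the slides) of one GRU step; the gate names and weight layout follow one common convention.

```python
# Hypothetical sketch: one GRU cell step. Unlike the LSTM, there is no
# separate cell state; the hidden output itself is the memory.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, Wz, Wr, Wh):
    hx = np.concatenate([h_prev, x_t])
    z = sigmoid(Wz @ hx)                                        # update gate
    r = sigmoid(Wr @ hx)                                        # reset gate
    h_tilde = np.tanh(Wh @ np.concatenate([r * h_prev, x_t]))   # candidate state
    return (1 - z) * h_prev + z * h_tilde                       # interpolated memory/output

H, D = 8, 4
rng = np.random.default_rng(0)
Wz, Wr, Wh = (rng.normal(scale=0.1, size=(H, H + D)) for _ in range(3))
h = gru_step(rng.normal(size=D), np.zeros(H), Wz, Wr, Wh)
```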
SLIDE 60

Language Models

  • What are language models?
  • Statistical language models

– Unigram, bigram, and n-gram language models

  • Neural language models
  • Language models for IR

Parts of this lecture are inspired by ChengXiang Zhai

SLIDE 61

Probability of Relevance

  • Three random variables: Query Q, Document D, Relevance R ∈ {0, 1}
  • Key question: what is the probability that THIS document is relevant to THIS query?
  • Goal: Given a particular query q and a particular document d, compute p(R=1 | Q=q, D=d)

– Then, rank D based on P(R=1 | Q, D)

  • Solution: Language modeling based
SLIDE 62

Defining P(R=1|Q,D): Generative models

  • Basic idea

– Define P(Q,D|R)
– Compute O(R=1|Q,D) using Bayes’ rule

  • Types of models

– Query “generation” model: P(Q,D|R) = P(Q|D,R) P(D|R)
– Document “generation” model: P(Q,D|R) = P(D|Q,R) P(Q|R)

$$O(R=1 \mid Q,D) = \frac{P(R=1 \mid Q,D)}{P(R=0 \mid Q,D)} = \frac{P(Q,D \mid R=1)}{P(Q,D \mid R=0)} \cdot \underbrace{\frac{P(R=1)}{P(R=0)}}_{\text{ignored for ranking } D}$$

SLIDE 63

Query Generation: A Language Model for IR

$$O(R=1 \mid Q,D) \propto \frac{P(Q,D \mid R=1)}{P(Q,D \mid R=0)} = \frac{P(Q \mid D,R=1)\, P(D \mid R=1)}{P(Q \mid D,R=0)\, P(D \mid R=0)} \propto \underbrace{P(Q \mid D,R=1)}_{\text{query likelihood}} \cdot \underbrace{\frac{P(D \mid R=1)}{P(D \mid R=0)}}_{\text{document prior}} \qquad \text{(assume } P(Q \mid D,R=0) \approx P(Q \mid R=0)\text{)}$$

  • Assuming a uniform document prior, we have

$$O(R=1 \mid Q,D) \propto P(Q \mid D, R=1)$$

  • P(Q|D, R=1) = probability that a user who likes D would pose query Q
  • How to estimate it?

SLIDE 64

Estimating Probabilities

  • How to compute P(Q|D, R=1)?
  • The Basic LM Approach, by Ponte & Croft 1998
  • Generally involves two steps:
  • 1. Estimate a language model based on D
  • 2. Compute the query likelihood according to the estimated model
SLIDE 65

Ranking Docs by Query Likelihood

Documents d1, d2, …, dN and query q
Doc LMs: θd1, θd2, …, θdN
Query likelihoods: p(q | θd1), p(q | θd2), …, p(q | θdN)

Step 1: Given a document, generate a language model
Step 2: Compute the query likelihood from the document language model

SLIDE 66

Example

Document               Language Model
Text mining paper      text ?, mining ?, association ?, clustering ?, …, food ?, …
Food nutrition paper   food ?, nutrition ?, healthy ?, diet ?, …

Query = “data mining algorithms”

Which model would most likely have generated this query?
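A small Python sketch of this example; the document texts and resulting counts are made up, since the slide leaves the word probabilities as “?”.

```python
# Hypothetical sketch: estimate each document's unigram LM by maximum
# likelihood and look at the per-word query probabilities.
from collections import Counter

docs = {
    "text mining paper": "text mining association clustering text mining food",
    "food nutrition paper": "food nutrition healthy diet food nutrition",
}

def doc_lm(text):
    counts = Counter(text.split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

query = "data mining algorithms".split()
for name, text in docs.items():
    lm = doc_lm(text)
    print(name, [round(lm.get(w, 0.0), 3) for w in query])
# The text mining paper gives "mining" a much higher probability, so it is the
# more likely source of the query; the zeros for unseen words ("data",
# "algorithms") are exactly the problem that smoothing addresses later.
```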

SLIDE 67

Modeling Queries: Multi-Bernoulli Model

  • Multi-Bernoulli: Modeling word presence/absence

– q = (x1, …, x|V|)
– xi = 1 for presence of word wi; xi = 0 for absence
– Parameters: {p(wi=1|d), p(wi=0|d)}

  • p(wi=1|d) + p(wi=0|d) = 1

$$p\big(q=(x_1,\ldots,x_{|V|}) \mid d\big) = \prod_{i=1}^{|V|} p(w_i = x_i \mid d) = \prod_{i=1,\,x_i=1}^{|V|} p(w_i=1 \mid d) \prod_{i=1,\,x_i=0}^{|V|} p(w_i=0 \mid d)$$

SLIDE 68

Modeling Queries: Multinomial Model

  • Multinomial Language Model: Modeling word frequency

– q = (q1, …, qm), where qj is a query word
– c(wi, q) is the count of word wi in query q
– Parameters: {p(wi|d)}

  • p(w1|d) + … + p(w|V||d) = 1
  • The multinomial language model has performed better than the multi-Bernoulli model

$$p(q = q_1 \ldots q_m \mid d) = \prod_{j=1}^{m} p(q_j \mid d) = \prod_{i=1}^{|V|} p(w_i \mid d)^{c(w_i, q)}$$

SLIDE 69

Retrieval as LM Estimation

  • Using Multinomial Language Model
  • Document ranking based on query likelihood
  • Retrieval problem ≈ estimation of p(wi|d)

$$\log p(q \mid d) = \sum_{j=1}^{m} \log p(q_j \mid d) = \sum_{i=1}^{|V|} c(w_i, q)\, \log \underbrace{p(w_i \mid d)}_{\text{document language model}}, \quad \text{where } q = q_1 q_2 \ldots q_m$$

$$p(q = q_1 \ldots q_m \mid d) = \prod_{j=1}^{m} p(q_j \mid d) = \prod_{i=1}^{|V|} p(w_i \mid d)^{c(w_i, q)}$$

SLIDE 70

How To Estimate P(w|d)?

  • Simplest solution: Maximum Likelihood Estimator

– P(w|d) = relative frequency of word w in d

  • What if a word doesn’t appear in the text? Then P(w|d)=0

– What probability should we give a word that has not been observed?

– Requires smoothing


SLIDE 71

How To Smooth A Language Model

  • Key question: what probability should be assigned to an unseen word?
  • Solution: let the probability of an unseen word be proportional to its probability given by a reference LM
  • Example: reference LM = collection LM

$$p(w \mid d) = \begin{cases} p_{Seen}(w \mid d) & \text{if } w \text{ is seen in } d \\ \alpha_d\, p(w \mid C) & \text{otherwise} \end{cases}$$

where pSeen(w|d) is the discounted ML estimate, p(w|C) is the collection language model, and αd is a document-dependent constant that makes the probabilities sum to one.
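A minimal sketch of smoothed query-likelihood ranking. The slides only define the general scheme; Jelinek-Mercer interpolation with λ = 0.7 is used here as one concrete choice, and the documents are made up.

```python
# Hypothetical sketch: query-likelihood ranking where each document's ML
# estimate is interpolated with the collection language model p(w|C).
import math
from collections import Counter

docs = {
    "d1": "text mining association clustering text mining",
    "d2": "food nutrition healthy diet food nutrition",
}
collection = Counter(" ".join(docs.values()).split())
coll_total = sum(collection.values())

def smoothed_log_likelihood(query, doc_text, lam=0.7):
    counts = Counter(doc_text.split())
    total = sum(counts.values())
    score = 0.0
    for w in query.split():
        p_ml = counts[w] / total              # ML estimate from the document
        p_coll = collection[w] / coll_total   # collection (reference) LM
        score += math.log(lam * p_ml + (1 - lam) * p_coll)
    return score

query = "mining text"
ranking = sorted(docs, key=lambda d: smoothed_log_likelihood(query, docs[d]), reverse=True)
print(ranking)   # d1 ranks first for this query
```

Dirichlet prior smoothing is another common choice; either way, query words unseen in a document fall back to the collection model instead of zeroing out the whole likelihood.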

SLIDE 72

Rewriting the Ranking Function with Smoothing

Split the sum into query words matched and not matched in d:

$$\log p(q \mid d) = \sum_{w \in V} c(w,q)\, \log p(w \mid d) = \underbrace{\sum_{w \in V,\ c(w,d)>0} c(w,q)\, \log p_{Seen}(w \mid d)}_{\text{query words matched in } d} \;+\; \underbrace{\sum_{w \in V,\ c(w,d)=0} c(w,q)\, \log \alpha_d\, p(w \mid C)}_{\text{query words not matched in } d}$$

Rewrite the unmatched-word sum as a sum over all query words minus the matched ones:

$$\sum_{w \in V,\ c(w,d)=0} c(w,q)\, \log \alpha_d\, p(w \mid C) = \underbrace{\sum_{w \in V} c(w,q)\, \log \alpha_d\, p(w \mid C)}_{\text{all query words}} \;-\; \underbrace{\sum_{w \in V,\ c(w,d)>0} c(w,q)\, \log \alpha_d\, p(w \mid C)}_{\text{query words matched in } d}$$

Putting the two together:

$$\log p(q \mid d) = \sum_{w \in V,\ c(w,d)>0} c(w,q)\, \log \frac{p_{Seen}(w \mid d)}{\alpha_d\, p(w \mid C)} \;+\; |q| \log \alpha_d \;+\; \sum_{w \in V} c(w,q)\, \log p(w \mid C)$$

SLIDE 73

Benefit of Rewriting

  • Better understanding of the ranking function

– Smoothing with p(w|C) ⇒ TF-IDF weighting + document length normalization

  • Enables efficient computation (the sum runs only over the matched query terms)

$$\log p(q \mid d) = \underbrace{\sum_{w_i \in d,\ w_i \in q} c(w_i,q)\, \log \frac{\overbrace{p_{Seen}(w_i \mid d)}^{\text{TF weighting}}}{\alpha_d\, \underbrace{p(w_i \mid C)}_{\text{IDF weighting}}}}_{\text{matched query terms}} \;+\; \underbrace{n \log \alpha_d}_{\text{doc length normalization}} \;+\; \underbrace{\sum_{i=1}^{n} \log p(w_i \mid C)}_{\text{ignore for ranking}}$$

where n = |q| is the query length.