SLIDE 1

Language Modeling Recap

CMSC 473/673 UMBC

Some slides adapted from 3SLP

SLIDE 2

n-grams = Chain Rule + Backoff (Markov assumption)

SLIDE 3

N-Gram Terminology

n = 1: unigram, history size (Markov order) 0, e.g. p(furiously)
n = 2: bigram, history size 1, e.g. p(furiously | sleep)
n = 3: trigram (3-gram), history size 2, e.g. p(furiously | ideas sleep)
n = 4: 4-gram, history size 3, e.g. p(furiously | green ideas sleep)
general n: n-gram, history size n-1, p(w_i | w_{i-n+1} … w_{i-1})

How to (efficiently) compute p(Colorless green ideas sleep furiously)?
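As a sketch of what that question asks for (the <BOS>/<EOS> padding follows the convention used later in these slides): apply the chain rule and then a trigram (history size 2) Markov assumption,

p(Colorless green ideas sleep furiously)
  ≈ p(Colorless | <BOS> <BOS>)
  × p(green | <BOS> Colorless)
  × p(ideas | Colorless green)
  × p(sleep | green ideas)
  × p(furiously | ideas sleep)
  × p(<EOS> | sleep furiously)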

SLIDE 4

Language Models & Smoothing

  • Maximum likelihood (MLE): simple counting
  • Laplace smoothing, add-λ
  • Interpolation models
  • Discounted backoff
  • Interpolated (modified) Kneser-Ney
  • Good-Turing
  • Witten-Bell

SLIDE 5

Language Models & Smoothing

  • Maximum likelihood (MLE): simple counting
  • Laplace smoothing, add-λ
  • Interpolation models
  • Discounted backoff
  • Interpolated (modified) Kneser-Ney
  • Good-Turing
  • Witten-Bell

Q: Why do we have all these options? Why is MLE not sufficient?

SLIDE 6

Language Models & Smoothing

  • Maximum likelihood (MLE): simple counting
  • Laplace smoothing, add-λ
  • Interpolation models
  • Discounted backoff
  • Interpolated (modified) Kneser-Ney
  • Good-Turing
  • Witten-Bell

Q: Why do we have all these options? Why is MLE not sufficient? A: Do we trust our training corpus? (insufficient counts → 0s; corpora have lexical biases; …)
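A minimal sketch of the zero-count problem (toy counts of my own, not from the slides): the MLE estimate is just a ratio of counts, so any trigram unseen in training gets probability exactly 0, which zeroes out every sentence containing it.

    from collections import Counter

    # toy training corpus, already tokenized
    tokens = "the film got a great opening".split()
    trigram_counts = Counter(zip(tokens, tokens[1:], tokens[2:]))
    bigram_counts = Counter(zip(tokens, tokens[1:]))

    def p_mle(w, x, y):
        # MLE: count(x y w) / count(x y); 0 when the trigram (or its history) is unseen
        return trigram_counts[(x, y, w)] / bigram_counts[(x, y)] if bigram_counts[(x, y)] else 0.0

    print(p_mle("got", "the", "film"))   # 1.0 (seen in training)
    print(p_mle("went", "the", "film"))  # 0.0 (unseen: the whole sentence probability becomes 0)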

SLIDE 7

Language Models & Smoothing

  • Maximum likelihood (MLE): simple counting
  • Laplace smoothing, add-λ
  • Interpolation models
  • Discounted backoff
  • Interpolated (modified) Kneser-Ney
  • Good-Turing
  • Witten-Bell

Q: What are the parameters we learn?

SLIDE 8

Language Models & Smoothing

  • Maximum likelihood (MLE): simple counting
  • Laplace smoothing, add-λ
  • Interpolation models
  • Discounted backoff
  • Interpolated (modified) Kneser-Ney
  • Good-Turing
  • Witten-Bell

Q: What are the parameters we learn? A: The counts or normalized probability values

SLIDE 9

Language Models & Smoothing

  • Maximum likelihood (MLE): simple counting
  • Laplace smoothing, add-λ
  • Interpolation models
  • Discounted backoff
  • Interpolated (modified) Kneser-Ney
  • Good-Turing
  • Witten-Bell

Q: What are the hyperparameters?

SLIDE 10

Language Models & Smoothing

  • Maximum likelihood (MLE): simple counting
  • Laplace smoothing, add-λ
  • Interpolation models
  • Discounted backoff
  • Interpolated (modified) Kneser-Ney
  • Good-Turing
  • Witten-Bell

Q: What are the hyperparameters? A: Laplace, backoff, KN: the adjustments to counts; Interpolation: the reweighting values
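For concreteness, a sketch of where those hyperparameters sit (standard formulations, not copied from the slides):

Add-λ (Laplace when λ = 1): p(w | h) = (count(h, w) + λ) / (count(h) + λV), where V is the number of word types; λ is the hyperparameter.

Interpolation (trigram example): p(w | x y) = λ3 · p_MLE(w | x y) + λ2 · p_MLE(w | y) + λ1 · p_MLE(w), with λ1 + λ2 + λ3 = 1; the λs are the hyperparameters, tuned on held-out data (next slides).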

SLIDE 11

Evaluation Framework

Training Data: acquire primary statistics for learning model parameters

Dev Data: fine-tune any secondary (hyper)parameters

Test Data: perform final evaluation

DO NOT ITERATE ON THE TEST DATA

SLIDE 12

Setting Hyperparameters

Use a development corpus. Choose hyperparameters to maximize the likelihood of the dev data:

  • Fix the N-gram probabilities/counts (on the training data)
  • Search for the λs that give the largest probability to the held-out set (see the sketch below)

Training Data

Dev Data Test Data
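A minimal sketch of that search, assuming bigram and unigram probabilities already estimated on the training data and a single interpolation weight λ; the function and variable names are illustrative, not from the slides:

    import numpy as np

    def dev_log_likelihood(lam, dev_bigrams, p_bigram, p_unigram):
        # interpolated model: lam * p(w | h) + (1 - lam) * p(w)
        # assumes OOV words were already mapped to <UNK>, so p_unigram is never zero
        return sum(np.log(lam * p_bigram.get((h, w), 0.0) + (1 - lam) * p_unigram[w])
                   for (h, w) in dev_bigrams)

    def pick_lambda(dev_bigrams, p_bigram, p_unigram):
        # grid search: keep the lambda that gives the dev set the highest (log-)likelihood
        grid = np.linspace(0.05, 0.95, 19)
        return max(grid, key=lambda lam: dev_log_likelihood(lam, dev_bigrams, p_bigram, p_unigram))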

SLIDE 13

Evaluating Language Models

What is “correct?” What is working “well?”

  • Extrinsic: evaluate the LM in a downstream task
    Test an MT, ASR, etc. system and see which LM does better
    Propagate & conflate errors
  • Intrinsic: treat the LM as its own downstream task
    Use perplexity (from information theory)

SLIDE 14

Perplexity

Lower is better: lower perplexity --> less surprised

perplexity = exp( -(1/N) * Σ_{j=1}^{N} log q(x_j | h_j) )

where h_j is the n-gram history (n-1 items)

SLIDE 15

Implementation: Unknown words

Create an unknown word token <UNK>

Training:

  • 1. Create a fixed lexicon L of size V
  • 2. Change any word not in L to <UNK>
  • 3. Train LM as normal

Evaluation:

Use UNK probabilities for any word not in training
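A minimal sketch of this preprocessing, assuming the fixed lexicon is the V most frequent training types (that cutoff choice and the names are mine, not from the slides):

    from collections import Counter

    def build_lexicon(train_tokens, V):
        # 1. fixed lexicon L of size V: here, the V most frequent training word types
        return {w for w, _ in Counter(train_tokens).most_common(V)}

    def unk_replace(tokens, lexicon):
        # 2. change any word not in L to <UNK>; used on the training text, and again at
        #    evaluation time so unseen words pick up the <UNK> probabilities
        return [w if w in lexicon else "<UNK>" for w in tokens]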

SLIDE 16

Implementation: EOS Padding

Create an end of sentence (“chunk”) token <EOS>
Don’t estimate p(<BOS> | <EOS>)

Training & Evaluation:

  • 1. Identify “chunks” that are relevant (sentences, paragraphs, documents)
  • 2. Append the <EOS> token to the end of the chunk

  • 3. Train or evaluate LM as normal
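A minimal sketch of the padding, assuming the chunks are already split out (sentences here); the <BOS> padding mirrors the trigram examples on the next slides:

    def pad_chunk(tokens, n):
        # append one <EOS> to the chunk; prepend n-1 <BOS> markers so the first words
        # have full-length histories; padding each chunk separately means no n-gram
        # ever crosses a chunk boundary
        return ["<BOS>"] * (n - 1) + tokens + ["<EOS>"]

    # trigram (n=3) padding of one sentence-sized chunk
    print(pad_chunk("the film got a great opening".split(), n=3))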


SLIDE 17

An Extended Example

The film got a great opening and the film went on to become a hit .

Context: x y | Word (Type): z | Raw Count | Add-1 count | Norm. Probability p(z | x y)
The film | The
The film | film
The film | got | 1
The film | went
The film | OOV
The film | EOS
…
a great | great
a great | opening | 1
a great | and
a great | the
…

Q: With OOV, EOS, and BOS, how many types (for normalization)?

SLIDE 18

An Extended Example

The film got a great opening and the film went on to become a hit .

Context: x y | Word (Type): z | Raw Count | Add-1 count | Norm. Probability p(z | x y)
The film | The
The film | film
The film | got | 1
The film | went
The film | OOV
The film | EOS
…
a great | great
a great | opening | 1
a great | and
a great | the
…

Q: With OOV, EOS, and BOS, how many types (for normalization)? A: 16 (why don’t we count BOS?)

SLIDE 19

An Extended Example

The film got a great opening and the film went on to become a hit .

Context: x y | Word (Type): z | Raw Count | Add-1 count | Norm. Probability p(z | x y)
The film | The | 0 | 1 | 1/17
The film | film | 0 | 1 | 1/17
The film | got | 1 | 2 | 2/17
The film | went | 0 | 1 | 1/17
The film | OOV | 0 | 1 | 1/17
The film | EOS | 0 | 1 | 1/17
…
a great | great | 0 | 1 | 1/17
a great | opening | 1 | 2 | 2/17
a great | and | 0 | 1 | 1/17
a great | the | 0 | 1 | 1/17
…
(denominator 17 = 1 + 16*1: the context count plus one add-1 pseudo-count for each of the 16 types)

Q: With OOV, EOS, and BOS, how many types (for normalization)? A: 16 (why don’t we count BOS?)

SLIDE 20

An Extended Example

The film got a great opening and the film went on to become a hit .

Context: x y | Word (Type): z | Raw Count | Add-1 count | Norm. Probability p(z | x y)
The film | The | 0 | 1 | 1/17
The film | film | 0 | 1 | 1/17
The film | got | 1 | 2 | 2/17
The film | went | 0 | 1 | 1/17
The film | OOV | 0 | 1 | 1/17
The film | EOS | 0 | 1 | 1/17
…
a great | great | 0 | 1 | 1/17
a great | opening | 1 | 2 | 2/17
a great | and | 0 | 1 | 1/17
a great | the | 0 | 1 | 1/17
…

Q: What is the perplexity for the sentence “The film , a hit !”?

SLIDE 21

What are the trigrams for “The film , a hit !”?

Trigrams | MLE p(trigram)
<BOS> <BOS> The | 1
<BOS> The film | 1
The film ,
film , a
, a hit
a hit !
hit ! <EOS>

Perplexity: ???

SLIDE 22

What are the trigrams for “The film , a hit !”?

Trigrams | MLE p(trigram)
<BOS> <BOS> The | 1
<BOS> The film | 1
The film , | 0
film , a | 0
, a hit | 0
a hit ! | 0
hit ! <EOS> | 0

Perplexity: Infinity

SLIDE 23

What are the trigrams for “The film , a hit !”?

Trigrams | MLE p(trigram) | UNK-ed trigrams
<BOS> <BOS> The | 1 | <BOS> <BOS> The
<BOS> The film | 1 | <BOS> The film
The film , | 0 | The film <UNK>
film , a | 0 | film <UNK> a
, a hit | 0 | <UNK> a hit
a hit ! | 0 | a hit <UNK>
hit ! <EOS> | 0 | hit <UNK> <EOS>

Perplexity (MLE): Infinity

SLIDE 24

What are the trigrams for “The film , a hit !”?

Trigrams | MLE p(trigram) | UNK-ed trigrams | Smoothed p(trigram)
<BOS> <BOS> The | 1 | <BOS> <BOS> The | 2/17
<BOS> The film | 1 | <BOS> The film | 2/17
The film , | 0 | The film <UNK> | 1/17
film , a | 0 | film <UNK> a | 1/16
, a hit | 0 | <UNK> a hit | 1/16
a hit ! | 0 | a hit <UNK> | 1/17
hit ! <EOS> | 0 | hit <UNK> <EOS> | 1/16

Perplexity (MLE): Infinity | Perplexity (smoothed): ???

SLIDE 25

What are the trigrams for “The film , a hit !”?

Trigrams | MLE p(trigram) | UNK-ed trigrams | Smoothed p(trigram)
<BOS> <BOS> The | 1 | <BOS> <BOS> The | 2/17
<BOS> The film | 1 | <BOS> The film | 2/17
The film , | 0 | The film <UNK> | 1/17
film , a | 0 | film <UNK> a | 1/16
, a hit | 0 | <UNK> a hit | 1/16
a hit ! | 0 | a hit <UNK> | 1/17
hit ! <EOS> | 0 | hit <UNK> <EOS> | 1/16

Perplexity (MLE): Infinity | Perplexity (smoothed): 13.59
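As a check on that number (my own arithmetic, using the perplexity definition from the earlier slide): there are 7 smoothed trigram probabilities, so

perplexity = exp( -(1/7) * (2*log(2/17) + 2*log(1/17) + 3*log(1/16)) ) ≈ exp(2.609) ≈ 13.59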

SLIDE 26

How to Compute Perplexity

  • If you have a list of the probabilities for each observed n-gram “token”:

numpy.exp(-numpy.mean(numpy.log(probs_per_trigram_token)))

  • If you have a list of observed n-gram “types” t and their counts c, and a log-probability function lp:

numpy.exp(-sum(c * lp(t) for (t, c) in ngram_types.items()) / sum(ngram_types.values()))
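As a quick sanity check against the 13.59 example above (the probability values are taken from that slide; numpy is assumed imported as np):

    import numpy as np

    # per-token smoothed trigram probabilities for "The film , a hit !"
    probs_per_trigram_token = [2/17, 2/17, 1/17, 1/16, 1/16, 1/17, 1/16]
    print(np.exp(-np.mean(np.log(probs_per_trigram_token))))  # ≈ 13.59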