Count-based Language Modeling CMSC 473/673 UMBC Some slides - PowerPoint PPT Presentation

Count-based Language Modeling CMSC 473/673 UMBC Some slides adapted from 3SLP, Jason Eisner

Outline Defining Language Models Breaking & Fixing Language Models Evaluating Language Models

Goal of Language Modeling p θ ( ) […text..] Learn a probabilistic model of text Accomplished through observing text and updating model parameters to make text more likely

Goal of Language Modeling p θ ( ) […text..] Learn a probabilistic model of 0 ≤ 𝑞 𝜄 [… 𝑢𝑓𝑦𝑢 … ] ≤ 1 text Accomplished through ෍ 𝑞 𝜄 𝑢 = 1 observing text and updating model parameters to make 𝑢:𝑢 is valid text text more likely

Design Question 1: What Part of Language Do We Estimate? p θ ( ) […text..] Is […text..] a • Full document? A: It’s task - • Sequence of sentences? dependent! • Sequence of words? • Sequence of characters?

Design Question 2: How do we estimate robustly? p θ ( ) […typo -text..] What if […text..] has a typo?

Design Question 3: How do we generalize? p θ ( ) […synonymous -text..] What if […text..] has a word (or character or…) we’ve never seen before?

“The Unreasonable Effectiveness of Recurrent Neural Networks” http://karpathy.github.io/2015/05/21/rnn-effectiveness/

“The Unreasonable Effectiveness of Recurrent Neural Networks” http://karpathy.github.io/2015/05/21/rnn-effectiveness/ “The Unreasonable Effectiveness of Character - level Language Models” (and why RNNs are still cool) http://nbviewer.jupyter.org/gist/yoavg/d76121dfde2618422139

Simple Count-Based 𝑞 item

Simple Count-Based “proportional to” 𝑞 item ∝ 𝑑𝑝𝑣𝑜𝑢(item)

Simple Count-Based “proportional to” 𝑞 item ∝ 𝑑𝑝𝑣𝑜𝑢 item 𝑑𝑝𝑣𝑜𝑢(item) = σ all items 𝑧 𝑑𝑝𝑣𝑜𝑢(y)

Simple Count-Based “proportional to” 𝑞 item ∝ 𝑑𝑝𝑣𝑜𝑢 item 𝑑𝑝𝑣𝑜𝑢(item) = σ all items 𝑧 𝑑𝑝𝑣𝑜𝑢(y) constant

In Simple Count-Based Models, What Do We Count? 𝑞 item ∝ 𝑑𝑝𝑣𝑜𝑢(item) sequence of characters → pseudo-words sequence of words → pseudo-phrases

Shakespearian Sequences of Characters

Shakespearian Sequences of Words

Novel Words, Novel Sentences “Colorless green ideas sleep furiously” – Chomsky (1957) Let’s observe and record all sentences with our big, bad supercomputer Red ideas? Read ideas?

Probability Chain Rule 𝑞 𝑦 1 , 𝑦 2 , … , 𝑦 𝑇 = 𝑞 𝑦 1 𝑞 𝑦 2 𝑦 1 )𝑞 𝑦 3 𝑦 1 , 𝑦 2 ) ⋯ 𝑞 𝑦 𝑇 𝑦 1 , … , 𝑦 𝑗 = 𝑇 ෑ 𝑞 𝑦 𝑗 𝑦 1 , … , 𝑦 𝑗−1 ) 𝑗

N-Grams Maintaining an entire inventory over sentences could be too much to ask Store “smaller” pieces? p(Colorless green ideas sleep furiously)

N-Grams Maintaining an entire joint inventory over sentences could be too much to ask Store “smaller” pieces? p(Colorless green ideas sleep furiously) = p(Colorless) *

N-Grams Maintaining an entire joint inventory over sentences could be too much to ask Store “smaller” pieces? p(Colorless green ideas sleep furiously) = p(Colorless) * p(green | Colorless) *

N-Grams Maintaining an entire joint inventory over sentences could be too much to ask Store “smaller” pieces? p(Colorless green ideas sleep furiously) = p(Colorless) * p(green | Colorless) * p(ideas | Colorless green) * p(sleep | Colorless green ideas) * p(furiously | Colorless green ideas sleep)

N-Grams Maintaining an entire joint inventory over sentences could be too much to ask Store “smaller” pieces? p(Colorless green ideas sleep furiously) = p(Colorless) * p(green | Colorless) * apply the p(ideas | Colorless green) * chain rule p(sleep | Colorless green ideas) * p(furiously | Colorless green ideas sleep)

N-Grams p(furiously | Colorless green ideas sleep) How much does “Colorless” influence the choice of “furiously?”

N-Grams p(furiously | Colorless green ideas sleep) How much does “Colorless” influence the choice of “furiously?” Remove history and contextual info

N-Grams p(furiously | Colorless green ideas sleep) How much does “Colorless” influence the choice of “furiously?” Remove history and contextual info p(furiously | Colorless green ideas sleep) ≈ p(furiously | Colorless green ideas sleep)

N-Grams p(furiously | Colorless green ideas sleep) How much does “Colorless” influence the choice of “furiously?” Remove history and contextual info p(furiously | Colorless green ideas sleep) ≈ p(furiously | ideas sleep)

N-Grams p(Colorless green ideas sleep furiously) = p(Colorless) * p(green | Colorless) * p(ideas | Colorless green) * p(sleep | Colorless green ideas) * p(furiously | Colorless green ideas sleep)

Trigrams p(Colorless green ideas sleep furiously) = p(Colorless) * p(green | Colorless) * p(ideas | Colorless green) * p(sleep | green ideas) * p(furiously | ideas sleep)

Trigrams p(Colorless green ideas sleep furiously) = p(Colorless | <BOS> <BOS> ) * p(green | <BOS> Colorless) * p(ideas | Colorless green) * p(sleep | green ideas) * p(furiously | ideas sleep) Consistent notation : Pad the left with <BOS> (beginning of sentence) symbols

Trigrams p(Colorless green ideas sleep furiously) = p(Colorless | <BOS> <BOS> ) * p(green | <BOS> Colorless) * p(ideas | Colorless green) * p(sleep | green ideas) * p(furiously | ideas sleep) * p( <EOS> | sleep furiously) Consistent notation : Pad the left with <BOS> (beginning of sentence) symbols Fully proper distribution : Pad the right with a single <EOS> symbol

N-Gram Terminology Commonly History Size n Example called (Markov order) 1 unigram 0 p(furiously)

N-Gram Terminology Commonly History Size n Example called (Markov order) 1 unigram 0 p(furiously) 2 bigram 1 p(furiously | sleep)

N-Gram Terminology Commonly History Size n Example called (Markov order) 1 unigram 0 p(furiously) 2 bigram 1 p(furiously | sleep) trigram 3 2 p(furiously | ideas sleep) (3-gram)

N-Gram Terminology Commonly History Size n Example called (Markov order) 1 unigram 0 p(furiously) 2 bigram 1 p(furiously | sleep) trigram 3 2 p(furiously | ideas sleep) (3-gram) 4 4-gram 3 p(furiously | green ideas sleep) n n-gram n-1 p(w i | w i-n+1 … w i-1 )

N-Gram Probability 𝑞 𝑥 1 , 𝑥 2 , 𝑥 3 , ⋯ , 𝑥 𝑇 = 𝑇 ෑ 𝑞 𝑥 𝑗 𝑥 𝑗−𝑂+1 , ⋯ , 𝑥 𝑗−1 ) 𝑗=1

Count-Based N-Grams (Unigrams) 𝑞 item ∝ 𝑑𝑝𝑣𝑜𝑢(item)

Count-Based N-Grams (Unigrams) 𝑞 z ∝ 𝑑𝑝𝑣𝑜𝑢(z)

Count-Based N-Grams (Unigrams) 𝑞 z ∝ 𝑑𝑝𝑣𝑜𝑢 z 𝑑𝑝𝑣𝑜𝑢 z = σ 𝑤 𝑑𝑝𝑣𝑜𝑢(v)

Count-Based N-Grams (Unigrams) word type word type 𝑞 z ∝ 𝑑𝑝𝑣𝑜𝑢 z 𝑑𝑝𝑣𝑜𝑢 z = σ 𝑤 𝑑𝑝𝑣𝑜𝑢(v) word type

Count-Based N-Grams (Unigrams) word type word type 𝑞 z ∝ 𝑑𝑝𝑣𝑜𝑢 z = 𝑑𝑝𝑣𝑜𝑢 z 𝑋 number of tokens observed

Count-Based N-Grams (Unigrams) The film got a great opening and the film went on to become a hit . Word (Type) z Raw Count count(z) Normalization Probability p(z) The 1 film 2 got 1 a 2 great 1 opening 1 and 1 the 1 went 1 on 1 to 1 become 1 hit 1 . 1

Count-Based N-Grams (Unigrams) The film got a great opening and the film went on to become a hit . Word (Type) z Raw Count count(z) Normalization Probability p(z) The 1 film 2 got 1 a 2 great 1 opening 1 and 1 16 the 1 went 1 on 1 to 1 become 1 hit 1 . 1

Count-Based N-Grams (Unigrams) The film got a great opening and the film went on to become a hit . Word (Type) z Raw Count count(z) Normalization Probability p(z) The 1 1/16 film 2 1/8 got 1 1/16 a 2 1/8 great 1 1/16 opening 1 1/16 and 1 1/16 16 the 1 1/16 went 1 1/16 on 1 1/16 to 1 1/16 become 1 1/16 hit 1 1/16 . 1 1/16

Count-Based N-Grams (Trigrams) order matters in order matters in conditioning count 𝑞 z|x, y ∝ 𝑑𝑝𝑣𝑜𝑢(x, y, z) Count of the sequence of items “x y z”

Count-Based N-Grams (Trigrams) order matters in order matters in conditioning count 𝑞 z|x, y ∝ 𝑑𝑝𝑣𝑜𝑢(x, y, z) count(x, y, z) ≠ count(x, z, y) ≠ count(y, x, z) ≠ …

Count-based Language Modeling CMSC 473/673 UMBC Some slides - PowerPoint PPT Presentation

Count-based Language Modeling CMSC 473/673 UMBC Some slides adapted from 3SLP, Jason Eisner Outline Defining Language Models Breaking & Fixing Language Models Evaluating Language Models Goal of Language Modeling p ( ) [text..]

Making Every Contact Count (MECC) Content What is Making Every Contact Count? Who is

Recitation 4 Question 3: Flying off the handle Parent Child fork() count++; print(count); 1

Language Modeling CSE354 - Spring 2020 Task Language Modeling Probabilistic Modeling

FIT Count Training (Flower Insect Timed Count) Denise McGowan (Government of Jersey) Nadine

What is the Point-In- Time Count? The Point-in-Time (PIT) count is a count of sheltered and

Count Controlled CSCI-UA.0002-008 Loops Count Controlled Loops A count controlled loop is a

WHAT IS THE POINT-IN-TIME COUNT? The Point-in-Time (PIT) count is a count of sheltered and

Count 2020 2020 Count The 2020 Everybody Counts Point-in-Time Count effort is one way the City of

2019 Annual Passenger Count JPB Board of Directors September 5, 2019 Agenda Item #13 OVERVIEW

Systems You Can Systems You Can Count On Count On Mission Statement Mission Statement ! To

Merge and Count Merge and count step. Given two sorted halves, count number of inversions

pop-count update draft-ietf-pim-pop-count-03 pop-count version 3 changes Mainly changes to

2019 Annual Passenger Count Citizens Advisory Committee (CAC) June 19 th , 2019 Agenda Item #8

The Count of Monte Cristo by Alexandre Dumas, pre Chapter 41: The Presentation The Count of

Complete Count Committee Profiling the Hard to Count December 3, 2018 Purpose of this

C C Census 2010 Census 2010 2010 2010 Wyoming Complete Count Committee Complete Count

Gods Character Set the table What is Law? What is Law? A system of principles

3. Lattice QCD Or: Using Large Computers for Fun References: [(Path Integral: Ryd 5; Sakurai:

Introd u ction to dates W OR K IN G W ITH DATE S AN D TIME S IN R Charlo e Wickham Instr

Taking differences of datetimes W OR K IN G W ITH DATE S AN D TIME S IN R Charlo e

Principles of Parallel Algorithm Design Ananth Grama, Anshul Gupta, George Karypis, and Vipin

Topic 8 Iterators "First things first, but not necessarily in that order " -Dr. Who

THE DECLARATION OF INDEPENDENCE Bell Work: 1. Describe one cause of the American Revolution

Parallel Numerical Algorithms Chapter 7 Differential Equations Section 7.4 Electronic

Count-based Language Modeling CMSC 473/673 UMBC Some slides - PowerPoint PPT Presentation

Count-based Language Modeling CMSC 473/673 UMBC Some slides adapted from 3SLP, Jason Eisner Outline Defining Language Models Breaking & Fixing Language Models Evaluating Language Models Goal of Language Modeling p ( ) [text..]

Making Every Contact Count (MECC) Content What is Making Every Contact Count? Who is

Recitation 4 Question 3: Flying off the handle Parent Child fork() count++; print(count); 1

Language Modeling CSE354 - Spring 2020 Task Language Modeling Probabilistic Modeling

FIT Count Training (Flower Insect Timed Count) Denise McGowan (Government of Jersey) Nadine

What is the Point-In- Time Count? The Point-in-Time (PIT) count is a count of sheltered and

Count Controlled CSCI-UA.0002-008 Loops Count Controlled Loops A count controlled loop is a

WHAT IS THE POINT-IN-TIME COUNT? The Point-in-Time (PIT) count is a count of sheltered and

Count 2020 2020 Count The 2020 Everybody Counts Point-in-Time Count effort is one way the City of

2019 Annual Passenger Count JPB Board of Directors September 5, 2019 Agenda Item #13 OVERVIEW

Systems You Can Systems You Can Count On Count On Mission Statement Mission Statement ! To

Merge and Count Merge and count step. Given two sorted halves, count number of inversions

pop-count update draft-ietf-pim-pop-count-03 pop-count version 3 changes Mainly changes to

2019 Annual Passenger Count Citizens Advisory Committee (CAC) June 19 th , 2019 Agenda Item #8

The Count of Monte Cristo by Alexandre Dumas, pre Chapter 41: The Presentation The Count of

Complete Count Committee Profiling the Hard to Count December 3, 2018 Purpose of this

C C Census 2010 Census 2010 2010 2010 Wyoming Complete Count Committee Complete Count

Gods Character Set the table What is Law? What is Law? A system of principles

3. Lattice QCD Or: Using Large Computers for Fun References: [(Path Integral: Ryd 5; Sakurai:

Introd u ction to dates W OR K IN G W ITH DATE S AN D TIME S IN R Charlo e Wickham Instr

Taking differences of datetimes W OR K IN G W ITH DATE S AN D TIME S IN R Charlo e

Principles of Parallel Algorithm Design Ananth Grama, Anshul Gupta, George Karypis, and Vipin

Topic 8 Iterators &quot;First things first, but not necessarily in that order &quot; -Dr. Who

THE DECLARATION OF INDEPENDENCE Bell Work: 1. Describe one cause of the American Revolution

Parallel Numerical Algorithms Chapter 7 Differential Equations Section 7.4 Electronic

Topic 8 Iterators "First things first, but not necessarily in that order " -Dr. Who