Language Modeling (CSE354, Spring 2020) - PowerPoint PPT Presentation

  1. Language Modeling CSE354 - Spring 2020

  2. Task ● Language Modeling (i.e. auto-complete) ● How? Probabilistic Modeling: ○ Probability Theory ○ Logistic Regression ○ Sequence Modeling

  3. Language Modeling -- assigning a probability to sequences of words. Version 1: Compute P(w1, w2, w3, w4, w5) = P(W): probability of a sequence of words

  4. Language Modeling -- assigning a probability to sequences of words. Version 1: Compute P(w1, w2, w3, w4, w5) = P(W): probability of a sequence of words. Version 2: Compute P(w5 | w1, w2, w3, w4) = P(wn | w1, w2, …, wn-1): probability of the next word given its history

  5. Language Modeling. Version 1: Compute P(w1, w2, w3, w4, w5) = P(W): probability of a sequence of words. P(He ate the cake with the fork) = ? Version 2: Compute P(w5 | w1, w2, w3, w4) = P(wn | w1, w2, …, wn-1): probability of the next word given its history. P(fork | He ate the cake with the) = ?
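Below is a minimal sketch of these two query types as code; the interface and names are my own, not from the slides. It also shows that Version 2 follows from Version 1 by the definition of conditional probability.

```python
from abc import ABC, abstractmethod

class LanguageModel(ABC):
    """Hypothetical interface for the two queries above (names are illustrative)."""

    @abstractmethod
    def sequence_prob(self, words: list[str]) -> float:
        """Version 1: P(w1, w2, ..., wn), the probability of a whole sequence."""

    def next_word_prob(self, word: str, history: list[str]) -> float:
        """Version 2: P(wn | w1, ..., wn-1). By the definition of conditional
        probability this is P(w1..wn) / P(w1..wn-1)."""
        return self.sequence_prob(history + [word]) / self.sequence_prob(history)
```

So any model that can score whole sequences (Version 1) can also answer next-word queries (Version 2).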

  6. Language Modeling. Version 1: Compute P(w1, w2, w3, w4, w5) = P(W): probability of a sequence of words. P(He ate the cake with the fork) = ? Version 2: Compute P(w5 | w1, w2, w3, w4) = P(wn | w1, w2, …, wn-1): probability of the next word given its history. P(fork | He ate the cake with the) = ? Applications: ● Auto-complete: What word is next? ● Machine Translation: Which translation is most likely? ● Spell Correction: Which word is most likely given the error? ● Speech Recognition: What did they just say? ("eyes aw of an") (example from Jurafsky, 2017)

  8. Simple Solution. Version 1: Compute P(w1, w2, w3, w4, w5) = P(W): probability of a sequence of words. P(He ate the cake with the fork) = count(He ate the cake with the fork) / count(* * * * * * *)

  9. Simple Solution: The Maximum Likelihood Estimate. Version 1: Compute P(w1, w2, w3, w4, w5) = P(W): probability of a sequence of words. P(He ate the cake with the fork) = count(He ate the cake with the fork) / count(* * * * * * *), where the denominator is the total number of observed 7-grams

  10. Simple Solution: The Maximum Likelihood Estimate. Version 1: Compute P(w1, w2, w3, w4, w5) = P(W): probability of a sequence of words. P(He ate the cake with the fork) = count(He ate the cake with the fork) / count(* * * * * * *). P(fork | He ate the cake with the) = count(He ate the cake with the fork) / count(He ate the cake with the *)

  11. Simple Solution: The Maximum Likelihood Estimate. Version 1: Compute P(w1, w2, w3, w4, w5) = P(W): probability of a sequence of words. P(He ate the cake with the fork) = count(He ate the cake with the fork) / count(* * * * * * *). P(fork | He ate the cake with the) = count(He ate the cake with the fork) / count(He ate the cake with the *). Problem: even the Web isn’t large enough to enable good estimates of most phrases.
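As a rough sketch of this maximum likelihood estimate by direct counting (the corpus format and helper names below are my own assumptions), note how it also makes the problem above concrete: almost every 7-gram has a count of zero, no matter how large the corpus.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-word windows in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def mle_sequence_prob(corpus_sentences, sequence):
    """Version 1 MLE: count(sequence) / count(* * ... *), over all observed n-grams."""
    n = len(sequence)
    counts = Counter(g for sent in corpus_sentences for g in ngrams(sent, n))
    total = sum(counts.values())
    return counts[tuple(sequence)] / total if total else 0.0

def mle_next_word_prob(corpus_sentences, history, word):
    """Version 2 MLE: count(history word) / count(history *)."""
    n = len(history) + 1
    counts = Counter(g for sent in corpus_sentences for g in ngrams(sent, n))
    numerator = counts[tuple(history) + (word,)]
    denominator = sum(c for g, c in counts.items() if g[:-1] == tuple(history))
    return numerator / denominator if denominator else 0.0
```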

  12. Problem: even the Web isn’t large enough to enable good estimates of most phrases. Solution: Estimate from shorter sequences, use more sophisticated probability theory.

  13. Problem: even the Web isn’t large enough to enable good estimates of most phrases. Solution: Estimate from shorter sequences, use more sophisticated probability theory. P(B|A) = P(B, A) / P(A) ⇔ P(A)P(B|A) = P(B,A) = P(A,B) Example from (Jurafsky, 2017)

  14. Problem: even the Web isn’t large enough to enable good estimates of most phrases. Solution: Estimate from shorter sequences, use more sophisticated probability theory. P(B|A) = P(B, A) / P(A) ⇔ P(A)P(B|A) = P(B,A) = P(A,B) P(A, B, C) = P(A)P(B|A)P(C| A, B) Example from (Jurafsky, 2017)

  15. Problem: even the Web isn’t large enough to enable good estimates of most phrases. Solution: Estimate from shorter sequences, use more sophisticated probability theory. P(B|A) = P(B, A) / P(A) ⇔ P(A)P(B|A) = P(B,A) = P(A,B) P(A, B, C) = P(A)P(B|A)P(C| A, B) The Chain Rule: P(X1, X2,…, Xn) = P(X1)P(X2|X1)P(X3|X1, X2)...P(Xn|X1, ..., Xn-1) Example from (Jurafsky, 2017)

  16. Problem: even the Web isn’t large enough to enable good estimates of most phrases. Solution: Estimate from shorter sequences, use more sophisticated probability theory. P(B|A) = P(B, A) / P(A) ⇔ P(A)P(B|A) = P(B,A) = P(A,B) P(A, B, C) = P(A)P(B|A)P(C| A, B) The Chain Rule: P(X1, X2,…, Xn) = P(X1)P(X2|X1)P(X3|X1, X2)...P(Xn|X1, ..., Xn-1)
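As a worked instance of the chain rule on a shortened version of the running example (my own expansion of the formula above):

```latex
P(\text{He ate the cake}) =
  P(\text{He}) \, P(\text{ate} \mid \text{He}) \,
  P(\text{the} \mid \text{He ate}) \, P(\text{cake} \mid \text{He ate the})
```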

  17. Markov Assumption: Problem: even the Web isn’t large enough to enable good estimates of most phrases. Solution: Estimate from shorter sequences, use more sophisticated probability theory. P(B|A) = P(B, A) / P(A) ⇔ P(A)P(B|A) = P(B,A) = P(A,B) P(A, B, C) = P(A)P(B|A)P(C| A, B) The Chain Rule: P(X1, X2,…, Xn) = P(X1)P(X2|X1)P(X3|X1, X2)...P(Xn|X1, ..., Xn-1)

  18. Markov Assumption: P(Xn | X1, …, Xn-1) ≈ P(Xn | Xn-k, …, Xn-1), where k < n. Problem: even the Web isn’t large enough to enable good estimates of most phrases. Solution: Estimate from shorter sequences, use more sophisticated probability theory. P(B|A) = P(B, A) / P(A) ⇔ P(A)P(B|A) = P(B,A) = P(A,B) P(A, B, C) = P(A)P(B|A)P(C|A, B) The Chain Rule: P(X1, X2, …, Xn) = P(X1)P(X2|X1)P(X3|X1, X2)...P(Xn|X1, ..., Xn-1)

  19. Markov Assumption: P(Xn | X1, …, Xn-1) ≈ P(Xn | Xn-k, …, Xn-1), where k < n. What about Logistic Regression? Y = next word: P(Y|X) = P(Xn | Xn-1, Xn-2, Xn-3, ...). Not a terrible option, but Xn-1 through Xn-k would be modeled as independent dimensions. Let’s revisit later. Problem: even the Web isn’t large enough to enable good estimates of most phrases. Solution: Estimate from shorter sequences, use more sophisticated probability theory. P(B|A) = P(B, A) / P(A) ⇔ P(A)P(B|A) = P(B,A) = P(A,B) P(A, B, C) = P(A)P(B|A)P(C|A, B) The Chain Rule: P(X1, X2, …, Xn) = P(X1)P(X2|X1)P(X3|X1, X2)...P(Xn|X1, ..., Xn-1)
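Below is a hedged sketch of the logistic-regression option mentioned on this slide (the feature scheme, toy data, and use of scikit-learn are my own assumptions, not the course's): multinomial logistic regression predicting the next word from one-hot features of the previous k words. Each history position becomes its own independent feature dimension, which is exactly the limitation noted above.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def make_examples(sentences, k=2):
    """Turn each position into (features of the previous k words, next word)."""
    X, y = [], []
    for sent in sentences:
        padded = ["<s>"] * k + sent
        for i in range(k, len(padded)):
            # each previous-word position is its own independent feature
            X.append({f"w-{j}={padded[i - j]}": 1 for j in range(1, k + 1)})
            y.append(padded[i])
    return X, y

# toy corpus, just to make the sketch runnable
sentences = [["he", "ate", "the", "cake"], ["he", "ate", "the", "pie"]]
X, y = make_examples(sentences, k=2)
vec = DictVectorizer()
clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(X), y)

# estimated P(Y = next word | previous two words were "ate the")
probs = clf.predict_proba(vec.transform([{"w-1=the": 1, "w-2=ate": 1}]))[0]
print(dict(zip(clf.classes_, probs)))
```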

  20. Markov Assumption: P(Xn | X1, …, Xn-1) ≈ P(Xn | Xn-k, …, Xn-1), where k < n. Unigram Model: k = 0 (each word is modeled independently of its history). Problem: even the Web isn’t large enough to enable good estimates of most phrases. P(B|A) = P(B, A) / P(A) ⇔ P(A)P(B|A) = P(B,A) = P(A,B) P(A, B, C) = P(A)P(B|A)P(C|A, B) The Chain Rule: P(X1, X2, …, Xn) = P(X1)P(X2|X1)P(X3|X1, X2)...P(Xn|X1, ..., Xn-1)

  21. Markov Assumption: P(Xn | X1, …, Xn-1) ≈ P(Xn | Xn-k, …, Xn-1), where k < n. Bigram Model: k = 1. Example generated sentence: outside, new, car, parking, lot, of, the, agreement, reached. P(X1 = "outside", X2 = "new", X3 = "car", ...) ≈ P(X1 = "outside") * P(X2 = "new" | X1 = "outside") * P(X3 = "car" | X2 = "new") * ... Example from (Jurafsky, 2017)
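Below is a small sketch of sampling a sentence word by word from a bigram model, in the spirit of the generated example above; the probability table is a toy I made up for illustration (real values come from corpus counts, as the following slides show).

```python
import random

def generate_from_bigrams(bigram_probs, start="<s>", max_len=10):
    """Sample a sequence where each next word depends only on the previous word (k = 1)."""
    word, output = start, []
    for _ in range(max_len):
        next_dist = bigram_probs.get(word)
        if not next_dist:
            break
        words, weights = zip(*next_dist.items())
        word = random.choices(words, weights=weights)[0]
        if word == "</s>":
            break
        output.append(word)
    return output

# toy P(second word | first word) table, invented for illustration only
bigram_probs = {
    "<s>": {"outside": 0.6, "the": 0.4},
    "outside": {"new": 1.0},
    "new": {"car": 0.7, "agreement": 0.3},
    "car": {"parking": 1.0},
    "parking": {"lot": 1.0},
    "lot": {"of": 1.0},
    "of": {"the": 1.0},
    "the": {"agreement": 0.5, "car": 0.5},
    "agreement": {"reached": 1.0},
    "reached": {"</s>": 1.0},
}
print(generate_from_bigrams(bigram_probs))
```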

  22. Language Modeling. Building a model (or system / API) that, given a sequence of natural language, can answer the following: How common is this sequence? What is the next word in the sequence?

  23. Language Modeling. Building a model (or system / API) that, given a sequence of natural language, can answer the following: How common is this sequence? What is the next word in the sequence? How to build?

  24. Language Modeling. Building a model (or system / API) that, given a sequence of natural language, can answer the following: How common is this sequence? What is the next word in the sequence? How to build? Train (fit, learn) the Language Model on a Training Corpus.

  25. Language Modeling. Building a model (or system / API) that, given a sequence of natural language, can answer the following: How common is this sequence? What is the next word in the sequence? Train (fit, learn) the Language Model on a Training Corpus.

  26. Language Modeling. [Table: Bigram Counts -- rows: first word, columns: second word; example from (Jurafsky, 2017)] Building a model (or system / API) that, given a sequence of natural language, can answer the following: How common is this sequence? What is the next word in the sequence? Train (fit, learn) the Language Model on a Training Corpus.

  28. Language Modeling. [Table: Bigram Counts -- rows: first word, columns: second word; example from (Jurafsky, 2017)] Building a model (or system / API) that, given a sequence of natural language, can answer the following: How common is this sequence? What is the next word in the sequence? Train (fit, learn) the Language Model on a Training Corpus. Bigram model: need to estimate P(Xi | Xi-1) = count(Xi-1 Xi) / count(Xi-1)

  29. Language Modeling. [Table: P(Xi | Xi-1) -- rows: first word (xi-1), columns: second word (xi); example from (Jurafsky, 2017)] Building a model (or system / API) that, given a sequence of natural language, can answer the following: How common is this sequence? What is the next word in the sequence? Train (fit, learn) the Language Model on a Training Corpus. Bigram model: need to estimate P(Xi | Xi-1) = count(Xi-1 Xi) / count(Xi-1)
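A minimal sketch of the estimate on this slide, P(Xi | Xi-1) = count(Xi-1 Xi) / count(Xi-1), computed from a training corpus; the tokenization and the <s>/</s> boundary markers are simplifying assumptions of mine.

```python
from collections import Counter

def train_bigram_model(sentences):
    """MLE bigram estimates: P(xi | xi-1) = count(xi-1 xi) / count(xi-1)."""
    unigram_counts, bigram_counts = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        unigram_counts.update(tokens[:-1])               # count(xi-1)
        bigram_counts.update(zip(tokens, tokens[1:]))    # count(xi-1 xi)
    return {(prev, cur): c / unigram_counts[prev]
            for (prev, cur), c in bigram_counts.items()}

corpus = [["he", "ate", "the", "cake"],
          ["he", "ate", "the", "cake", "with", "the", "fork"]]
probs = train_bigram_model(corpus)
print(probs[("the", "cake")])   # count(the cake) / count(the) = 2 / 3
```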
