language modeling CS 685, Fall 2020 Introduction to Natural Language - PowerPoint PPT Presentation

language modeling CS 685, Fall 2020 Introduction to Natural Language Processing http://people.cs.umass.edu/~miyyer/cs685/ Mohit Iyyer College of Information and Computer Sciences University of Massachusetts Amherst some slides from Dan Jurafsky and Richard Socher

questions from last time… • Cheating concerns on exam? • we’re still thinking about ways to mitigate this • Final project group size? • Will be 4 with few exceptions • Please use Piazza to form teams by 9/4 (otherwise we will randomly assign you) • HW0? • Out today, due 9/4. Start early especially if you have a limited coding / math background! 2

Let’s say I want to train a model for sentiment analysis 3

Let’s say I want to train a model for sentiment analysis In the past, I would simply train a supervised model on labeled sentiment examples (i.e., review text / score pairs from IMDB) Sentiment model supervised training Labeled reviews from IMDB 4

Let’s say I want to train a model for sentiment analysis Nowadays, however, we use transfer learning : A huge self- supervised model step 1: unsupervised pretraining A ton of unlabeled text 5

Let’s say I want to train a model for sentiment analysis Nowadays, however, we use transfer learning : A huge self- Sentiment- supervised specialized model model step 1: step 2: unsupervised supervised pretraining fine-tuning A Labeled ton of reviews from unlabeled text IMDB 6

This lecture: language modeling , which forms the core of most self-supervised NLP approaches A huge self- Sentiment- supervised specialized model model step 1: step 2: unsupervised supervised pretraining fine-tuning A Labeled ton of reviews from unlabeled text IMDB 7

Language models assign a probability to a piece of text • why would we ever want to do this? • translation: • P(i flew to the movies) <<<<< P(i went to the movies) • speech recognition: • P(i saw a van) >>>>> P(eyes awe of an)

You use Language Models every day! 9

You use Language Models every day! 10

Probabilistic Language Modeling • Goal: compute the probability of a sentence or sequence of words: P(W) = P(w 1 ,w 2 ,w 3 ,w 4 ,w 5 …w n ) • Related task: probability of an upcoming word: P(w 5 |w 1 ,w 2 ,w 3 ,w 4 ) • A model that computes either of these: P(W) or P(w n |w 1 ,w 2 …w n-1 ) is called a language model or LM 11

How to compute P(W) • How to compute this joint probability: • P(its, water, is, so, transparent, that) • Intuition: let’s rely on the Chain Rule of Probability 12

The Chain Rule applied to compute joint probability of words in sentence P(“its water is so transparent”) = P(its) × P(water|its) × P(is|its water) × P(so|its water is) × P(transparent|its water is so) 14

The Chain Rule applied to compute joint In HW0, we refer to probability of words in sentence this as a “prefix” } P(“its water is so transparent”) = P(its) × P(water|its) × P(is|its water) × P(so|its water is) × P(transparent|its water is so) 15

How to estimate these probabilities • Could we just count and divide? P (t he | i t s w a t e r i s s o t ra ns pa re nt t ha t ) = Count (i t s w a t e r i s s o t ra ns pa re nt t ha t t he ) Count (i t s w a t e r i s s o t ra ns pa re nt t ha t ) 16

How to estimate these probabilities • Could we just count and divide? P (t he | i t s w a t e r i s s o t ra ns pa re nt t ha t ) = Count (i t s w a t e r i s s o t ra ns pa re nt t ha t t he ) Count (i t s w a t e r i s s o t ra ns pa re nt t ha t ) • No! Too many possible sentences! • We’ll never see enough data for estimating these 17

        Markov Assumption • Simplifying assumption:   Andrei Markov (1856~1922) P (t he | i t s w a t e r i s s o t ra ns pa re nt t ha t ) ≈ P (t he | t ha t ) • Or maybe P (t he | i t s w a t e r i s s o t ra ns pa re nt t ha t ) ≈ P (t he | t ra ns pa re nt t ha t ) 18

Markov Assumption • In other words, we approximate each component in the product 19

Simplest case: Unigram model Some automatically generated sentences from a unigram model: fifth, an, of, futures, the, an, incorporated, a, a, the, inflation, most, dollars, quarter, in, is, mass thrift, did, eighty, said, hard, 'm, july, bullish that, or, limited, the How can we generate text from a language model? 20

Approximating'Shakespeare –To him swallowed confess hear both. Which. Of save on trail for are ay device and 1 rote life have gram –Hill he late speaks; or! a more to leg less first you enter –Why dost stand forth thy canopy, forsooth; he is this palpable hit the King Henry. Live 2 king. Follow. gram –What means, sir. I confess she? then all sorts, he is trim, captain. –Fly, and will rid me these news of price. Therefore the sadness of parting, as they say, 3 ’tis done. gram –This shall forbid it should be branded, if renown made it empty. –King Henry. What! I will go seek the traitor Gloucester. Exeunt some of the watch. A 4 great banquet serv’d in; gram –It cannot be but so. Figure 4.3 Eight sentences randomly generated from four N -grams computed from Shakespeare’s works. All 21

N-gram models • We can extend to trigrams, 4-grams, 5-grams • In general this is an insufficient model of language • because language has long-distance dependencies : “The computer which I had just put into the machine room on the fifth floor crashed.” • But we can often get away with N-gram models In the next video, we will look at some models that can theoretically handle some of these longer-term dependencies 22

Estimating bigram probabilities • The Maximum Likelihood Estimate (MLE) - relative frequency based on the empirical counts on a training set P ( w i | w i − 1 ) = c ount ( w i − 1 , w i ) c ount ( w i − 1 ) P ( w i | w i − 1 ) = c ( w i − 1 , w i ) c — count c ( w i − 1 ) 23

An example <s> I am Sam </s> P ( w i | w i − 1 ) = c ( w i − 1 , w i ) MLE <s> Sam I am </s> c ( w i − 1 ) <s> I do not like green eggs and ham </s> ??? ??? 24

An example <s> I am Sam </s> P ( w i | w i − 1 ) = c ( w i − 1 , w i ) MLE <s> Sam I am </s> c ( w i − 1 ) <s> I do not like green eggs and ham </s> 25

Important terminology: a word type is a unique word in our vocabulary, while a token is an occurrence of a An example word type in a dataset. <s> I am Sam </s> P ( w i | w i − 1 ) = c ( w i − 1 , w i ) MLE <s> Sam I am </s> c ( w i − 1 ) <s> I do not like green eggs and ham </s> 26

A bigger example:   Berkeley Restaurant Project sentences • can you tell me about any good cantonese restaurants close by • mid priced thai food is what i’m looking for • tell me about chez panisse • can you give me a listing of the kinds of food that are available • i’m looking for a good place to eat breakfast • when is caffe venezia open during the day 27

note: this is only a subset of Raw bigram counts the (much bigger) bigram count table • Out of 9222 sentences 28

P ( w i | w i − 1 ) = c ( w i − 1 , w i ) MLE Raw bigram probabilities c ( w i − 1 ) • Normalize by unigrams: • Result: 29

Bigram estimates of sentence probabilities P(<s> I want english food </s>) = P(I|<s>) × P(want|I) × P(english|want) × P(food|english) × P(</s>|food) = .000031 these probabilities get super tiny when we have longer inputs w/ more infrequent words… how can we get around this? 30

logs to avoid underflow log ∏ p ( w i | w i − 1 ) = ∑ log p ( w i | w i − 1 ) Example with unigram model on a sentiment dataset: 31

logs to avoid underflow log ∏ p ( w i | w i − 1 ) = ∑ log p ( w i | w i − 1 ) Example with unigram model on a sentiment dataset: p ( i ) ⋅ p ( love ) 5 ⋅ p ( the ) ⋅ p ( movie ) = 5.95374181e-7 log p ( i ) + 5 log p ( love ) + log p ( the ) + log p ( movie ) = -14.3340757538 31

Language Modeling Toolkits • SRILM • http://www.speech.sri.com/projects/ srilm/ • KenLM • https://kheafield.com/code/kenlm/ 33

Evaluation: How good is our model? • Does our language model prefer good sentences to bad ones? • Assign higher probability to “real” or “frequently observed” sentences • Than “ungrammatical” or “rarely observed” sentences? • We train parameters of our model on a training set . • We test the model’s performance on data we haven’t seen. • A test set is an unseen dataset that is different from our training set, totally unused. • An evaluation metric tells us how well our model does on the test set. 34

language modeling CS 685, Fall 2020 Introduction to Natural Language - PowerPoint PPT Presentation

language modeling CS 685, Fall 2020 Introduction to Natural Language Processing http://people.cs.umass.edu/~miyyer/cs685/ Mohit Iyyer College of Information and Computer Sciences University of Massachusetts Amherst some slides from Dan Jurafsky

Language Modeling CSE354 - Spring 2020 Task Language Modeling Probabilistic Modeling

Language Modeling CSE392 - Spring 2019 Special Topic in CS Task Probabilistic Modeling

Modeling of proteins and complexes High resolution Low resolution Modeling of domains Modeling

Virtual Reality Modeling Virtual Reality Modeling from http://www.okino.com/ Modeling Modeling

Language Modeling CS 6956: Deep Learning for NLP Overview What is a language model? How

Language Modeling Michael Collins, Columbia University Overview The language modeling problem

Outline Language learning Computers Computers Computers Topic 6: CALL Topic 6: CALL Topic 6:

Count-based Language Modeling CMSC 473/673 UMBC Some slides adapted from 3SLP, Jason Eisner

NEST Modeling Language: A modeling language for spiking neuron and synapse models for NEST

Topics Why E Field Modeling What is E Field Modeling Case Studies Questions 2 Why

Outline 1 The topic 2 Decision support systems 3 Modeling 3.3 Advanced modeling

Verilog HDL:Digital Design and Modeling Chapter 5 Gate-Level Modeling Chapter 5 Gate-Level

Developmental Developmental Disorders affecting Disorders affecting language language

Language and Computers Relation to language Encoding written language Prologue: Encoding

Language and Computers Relation to language Encoding written Prologue: Encoding Language

CS11-737: Multilingual Natural Language Processing Language contact Yulia Tsvetkov Language

TRANSFORMING AGRICULTURE WITH INTELLIGENT INFRASTRUCTURE SHASHI SHEKHAR AAAS Fellow McKnight

ICELAND Slkeradreifing ehf Established in 1906 Introduction n Company established in 1906

Educating Text Autoencoders: Latent Representation Guidance via Denoising Tianxiao Shen Jonas

Purdue BGSA Black Graduate Student Association What Can BGSA Do for Y ou? BGSA Mission I.

Super-AI : controversy, conflict and cooperation(?) 1..Computing and Artificial Intelligence

ETHICS OVERVIEW October 17 & 24, 2017 This presentation is posted at:

By Nigussie Abadi, Ataklti Techane and Girmay Tesfay Mekelle University, Mekelle , Ethiopia

WELCOME! Building a Food Systems Network for Niagara April 20, 2017 People working in

language modeling CS 685, Fall 2020 Introduction to Natural Language - PowerPoint PPT Presentation

language modeling CS 685, Fall 2020 Introduction to Natural Language Processing http://people.cs.umass.edu/~miyyer/cs685/ Mohit Iyyer College of Information and Computer Sciences University of Massachusetts Amherst some slides from Dan Jurafsky

Language Modeling CSE354 - Spring 2020 Task Language Modeling Probabilistic Modeling

Language Modeling CSE392 - Spring 2019 Special Topic in CS Task Probabilistic Modeling

Modeling of proteins and complexes High resolution Low resolution Modeling of domains Modeling

Virtual Reality Modeling Virtual Reality Modeling from http://www.okino.com/ Modeling Modeling

Language Modeling CS 6956: Deep Learning for NLP Overview What is a language model? How

Language Modeling Michael Collins, Columbia University Overview The language modeling problem

Outline Language learning Computers Computers Computers Topic 6: CALL Topic 6: CALL Topic 6:

Count-based Language Modeling CMSC 473/673 UMBC Some slides adapted from 3SLP, Jason Eisner

NEST Modeling Language: A modeling language for spiking neuron and synapse models for NEST

Topics Why E Field Modeling What is E Field Modeling Case Studies Questions 2 Why

Outline 1 The topic 2 Decision support systems 3 Modeling 3.3 Advanced modeling

Verilog HDL:Digital Design and Modeling Chapter 5 Gate-Level Modeling Chapter 5 Gate-Level

Developmental Developmental Disorders affecting Disorders affecting language language

Language and Computers Relation to language Encoding written language Prologue: Encoding

Language and Computers Relation to language Encoding written Prologue: Encoding Language

CS11-737: Multilingual Natural Language Processing Language contact Yulia Tsvetkov Language

TRANSFORMING AGRICULTURE WITH INTELLIGENT INFRASTRUCTURE SHASHI SHEKHAR AAAS Fellow McKnight

ICELAND Slkeradreifing ehf Established in 1906 Introduction n Company established in 1906

Educating Text Autoencoders: Latent Representation Guidance via Denoising Tianxiao Shen Jonas

Purdue BGSA Black Graduate Student Association What Can BGSA Do for Y ou? BGSA Mission I.

Super-AI : controversy, conflict and cooperation(?) 1..Computing and Artificial Intelligence

ETHICS OVERVIEW October 17 &amp; 24, 2017 This presentation is posted at:

By Nigussie Abadi, Ataklti Techane and Girmay Tesfay Mekelle University, Mekelle , Ethiopia

WELCOME! Building a Food Systems Network for Niagara April 20, 2017 People working in

ETHICS OVERVIEW October 17 & 24, 2017 This presentation is posted at: