Language Modeling Michael Collins, Columbia University Overview - PowerPoint PPT Presentation

Language Modeling Michael Collins, Columbia University

Overview ◮ The language modeling problem ◮ Trigram models ◮ Evaluating language models: perplexity ◮ Estimation techniques: ◮ Linear interpolation ◮ Discounting methods

The Language Modeling Problem ◮ We have some (finite) vocabulary, say V = { the, a, man, telescope, Beckham, two , . . . } ◮ We have an (infinite) set of strings, V † the STOP a STOP the fan STOP the fan saw Beckham STOP the fan saw saw STOP the fan saw Beckham play for Real Madrid STOP

The Language Modeling Problem (Continued) ◮ We have a training sample of example sentences in English

The Language Modeling Problem (Continued) ◮ We have a training sample of example sentences in English ◮ We need to “learn” a probability distribution p i.e., p is a function that satisfies � p ( x ) ≥ 0 for all x ∈ V † p ( x ) = 1 , x ∈V †

The Language Modeling Problem (Continued) ◮ We have a training sample of example sentences in English ◮ We need to “learn” a probability distribution p i.e., p is a function that satisfies � p ( x ) ≥ 0 for all x ∈ V † p ( x ) = 1 , x ∈V † p ( the STOP ) = 10 − 12 p ( the fan STOP ) = 10 − 8 p ( the fan saw Beckham STOP ) = 2 × 10 − 8 p ( the fan saw saw STOP ) = 10 − 15 . . . p ( the fan saw Beckham play for Real Madrid STOP ) = 2 × 10 − 9 . . .

Why on earth would we want to do this?! ◮ Speech recognition was the original motivation. (Related problems are optical character recognition, handwriting recognition.)

Why on earth would we want to do this?! ◮ Speech recognition was the original motivation. (Related problems are optical character recognition, handwriting recognition.) ◮ The estimation techniques developed for this problem will be VERY useful for other problems in NLP

A Naive Method ◮ We have N training sentences ◮ For any sentence x 1 . . . x n , c ( x 1 . . . x n ) is the number of times the sentence is seen in our training data ◮ A naive estimate: p ( x 1 . . . x n ) = c ( x 1 . . . x n ) N

Markov Processes ◮ Consider a sequence of random variables X 1 , X 2 , . . . X n . Each random variable can take any value in a finite set V . For now we assume the length n is fixed (e.g., n = 100 ). ◮ Our goal: model P ( X 1 = x 1 , X 2 = x 2 , . . . , X n = x n )

First-Order Markov Processes P ( X 1 = x 1 , X 2 = x 2 , . . . X n = x n )

First-Order Markov Processes P ( X 1 = x 1 , X 2 = x 2 , . . . X n = x n ) n � = P ( X 1 = x 1 ) P ( X i = x i | X 1 = x 1 , . . . , X i − 1 = x i − 1 ) i =2

First-Order Markov Processes P ( X 1 = x 1 , X 2 = x 2 , . . . X n = x n ) n � = P ( X 1 = x 1 ) P ( X i = x i | X 1 = x 1 , . . . , X i − 1 = x i − 1 ) i =2 n � = P ( X 1 = x 1 ) P ( X i = x i | X i − 1 = x i − 1 ) i =2

First-Order Markov Processes P ( X 1 = x 1 , X 2 = x 2 , . . . X n = x n ) n � = P ( X 1 = x 1 ) P ( X i = x i | X 1 = x 1 , . . . , X i − 1 = x i − 1 ) i =2 n � = P ( X 1 = x 1 ) P ( X i = x i | X i − 1 = x i − 1 ) i =2 The first-order Markov assumption: For any i ∈ { 2 . . . n } , for any x 1 . . . x i , P ( X i = x i | X 1 = x 1 . . . X i − 1 = x i − 1 ) = P ( X i = x i | X i − 1 = x i − 1 )

Second-Order Markov Processes P ( X 1 = x 1 , X 2 = x 2 , . . . X n = x n )

Second-Order Markov Processes P ( X 1 = x 1 , X 2 = x 2 , . . . X n = x n ) = P ( X 1 = x 1 ) × P ( X 2 = x 2 | X 1 = x 1 ) n � × P ( X i = x i | X i − 2 = x i − 2 , X i − 1 = x i − 1 ) i =3

Second-Order Markov Processes P ( X 1 = x 1 , X 2 = x 2 , . . . X n = x n ) = P ( X 1 = x 1 ) × P ( X 2 = x 2 | X 1 = x 1 ) n � × P ( X i = x i | X i − 2 = x i − 2 , X i − 1 = x i − 1 ) i =3 n � = P ( X i = x i | X i − 2 = x i − 2 , X i − 1 = x i − 1 ) i =1 (For convenience we assume x 0 = x − 1 = *, where * is a special “start” symbol.)

Modeling Variable Length Sequences ◮ We would like the length of the sequence, n , to also be a random variable ◮ A simple solution: always define X n = STOP where STOP is a special symbol

Modeling Variable Length Sequences ◮ We would like the length of the sequence, n , to also be a random variable ◮ A simple solution: always define X n = STOP where STOP is a special symbol ◮ Then use a Markov process as before: P ( X 1 = x 1 , X 2 = x 2 , . . . X n = x n ) n � = P ( X i = x i | X i − 2 = x i − 2 , X i − 1 = x i − 1 ) i =1 (For convenience we assume x 0 = x − 1 = *, where * is a special “start” symbol.)

Trigram Language Models ◮ A trigram language model consists of: 1. A finite set V 2. A parameter q ( w | u, v ) for each trigram u, v, w such that w ∈ V ∪ { STOP } , and u, v ∈ V ∪ { * } .

Trigram Language Models ◮ A trigram language model consists of: 1. A finite set V 2. A parameter q ( w | u, v ) for each trigram u, v, w such that w ∈ V ∪ { STOP } , and u, v ∈ V ∪ { * } . ◮ For any sentence x 1 . . . x n where x i ∈ V for i = 1 . . . ( n − 1) , and x n = STOP, the probability of the sentence under the trigram language model is n � p ( x 1 . . . x n ) = q ( x i | x i − 2 , x i − 1 ) i =1 where we define x 0 = x − 1 = *.

An Example For the sentence the dog barks STOP we would have p ( the dog barks STOP ) = q ( the | *, * ) × q ( dog | *, the ) × q ( barks | the, dog ) × q ( STOP | dog, barks )

The Trigram Estimation Problem Remaining estimation problem: q ( w i | w i − 2 , w i − 1 ) For example: q ( laughs | the, dog )

The Trigram Estimation Problem Remaining estimation problem: q ( w i | w i − 2 , w i − 1 ) For example: q ( laughs | the, dog ) A natural estimate (the “maximum likelihood estimate”): q ( w i | w i − 2 , w i − 1 ) = Count ( w i − 2 , w i − 1 , w i ) Count ( w i − 2 , w i − 1 ) q ( laughs | the, dog ) = Count ( the, dog, laughs ) Count ( the, dog )

Sparse Data Problems A natural estimate (the “maximum likelihood estimate”): q ( w i | w i − 2 , w i − 1 ) = Count ( w i − 2 , w i − 1 , w i ) Count ( w i − 2 , w i − 1 ) q ( laughs | the, dog ) = Count ( the, dog, laughs ) Count ( the, dog ) Say our vocabulary size is N = |V| , then there are N 3 parameters in the model. 20 , 000 3 = 8 × 10 12 parameters e.g., N = 20 , 000 ⇒

Evaluating a Language Model: Perplexity ◮ We have some test data, m sentences s 1 , s 2 , s 3 , . . . , s m

Evaluating a Language Model: Perplexity ◮ We have some test data, m sentences s 1 , s 2 , s 3 , . . . , s m ◮ We could look at the probability under our model � m i =1 p ( s i ) . Or more conveniently, the log probability m m � � log p ( s i ) = log p ( s i ) i =1 i =1

Evaluating a Language Model: Perplexity ◮ We have some test data, m sentences s 1 , s 2 , s 3 , . . . , s m ◮ We could look at the probability under our model � m i =1 p ( s i ) . Or more conveniently, the log probability m m � � log p ( s i ) = log p ( s i ) i =1 i =1 ◮ In fact the usual evaluation measure is perplexity m l = 1 Perplexity = 2 − l � where log p ( s i ) M i =1 and M is the total number of words in the test data.

Some Intuition about Perplexity ◮ Say we have a vocabulary V , and N = |V| + 1 and model that predicts q ( w | u, v ) = 1 N for all w ∈ V ∪ { STOP } , for all u, v ∈ V ∪ { * } . ◮ Easy to calculate the perplexity in this case: l = log 1 Perplexity = 2 − l where N ⇒ Perplexity = N Perplexity is a measure of effective “branching factor”

Typical Values of Perplexity ◮ Results from Goodman (“A bit of progress in language modeling”), where |V| = 50 , 000 ◮ A trigram model: p ( x 1 . . . x n ) = � n i =1 q ( x i | x i − 2 , x i − 1 ) . Perplexity = 74

Typical Values of Perplexity ◮ Results from Goodman (“A bit of progress in language modeling”), where |V| = 50 , 000 ◮ A trigram model: p ( x 1 . . . x n ) = � n i =1 q ( x i | x i − 2 , x i − 1 ) . Perplexity = 74 ◮ A bigram model: p ( x 1 . . . x n ) = � n i =1 q ( x i | x i − 1 ) . Perplexity = 137

Typical Values of Perplexity ◮ Results from Goodman (“A bit of progress in language modeling”), where |V| = 50 , 000 ◮ A trigram model: p ( x 1 . . . x n ) = � n i =1 q ( x i | x i − 2 , x i − 1 ) . Perplexity = 74 ◮ A bigram model: p ( x 1 . . . x n ) = � n i =1 q ( x i | x i − 1 ) . Perplexity = 137 ◮ A unigram model: p ( x 1 . . . x n ) = � n i =1 q ( x i ) . Perplexity = 955

Some History ◮ Shannon conducted experiments on entropy of English i.e., how good are people at the perplexity game? C. Shannon. Prediction and entropy of printed English. Bell Systems Technical Journal, 30:50–64, 1951.

Language Modeling Michael Collins, Columbia University Overview - PowerPoint PPT Presentation

Language Modeling Michael Collins, Columbia University Overview The language modeling problem Trigram models Evaluating language models: perplexity Estimation techniques: Linear interpolation Discounting methods The

Language Modeling CSE354 - Spring 2020 Task Language Modeling Probabilistic Modeling

Language Modeling CSE392 - Spring 2019 Special Topic in CS Task Probabilistic Modeling

Modeling of proteins and complexes High resolution Low resolution Modeling of domains Modeling

Virtual Reality Modeling Virtual Reality Modeling from http://www.okino.com/ Modeling Modeling

Language Modeling CS 6956: Deep Learning for NLP Overview What is a language model? How

Outline Language learning Computers Computers Computers Topic 6: CALL Topic 6: CALL Topic 6:

Count-based Language Modeling CMSC 473/673 UMBC Some slides adapted from 3SLP, Jason Eisner

NEST Modeling Language: A modeling language for spiking neuron and synapse models for NEST

Topics Why E Field Modeling What is E Field Modeling Case Studies Questions 2 Why

Outline 1 The topic 2 Decision support systems 3 Modeling 3.3 Advanced modeling

Verilog HDL:Digital Design and Modeling Chapter 5 Gate-Level Modeling Chapter 5 Gate-Level

Developmental Developmental Disorders affecting Disorders affecting language language

Language and Computers Relation to language Encoding written language Prologue: Encoding

Language and Computers Relation to language Encoding written Prologue: Encoding Language

CS11-737: Multilingual Natural Language Processing Language contact Yulia Tsvetkov Language

Models of Language Evolution models thereof its evolution language Models of Language Evolution

K + Analyses Gary Cheng Columbia University SciBooNE May 2010 Collaboration Meeting 1

Factor Analysis and Beyond Chris Williams School of Informatics, University of Edinburgh October

MATH 12002 - CALCULUS I 3.3: Graphing Example Professor Donald L. White Department of

Introduction to R Week 2: Making gures Louisa Smith July 20 - July 24 Let's make our data...

Lionel Riou Fransca Univariate & bivariate Two kind of analysis Univariate

SEVEN WAYS CHURCHES ARE RECLAIMING ATTENDANCE AS THEY REGATHER SO WHAT IS ? Presented by SO

Social Media Analysis so far Fabio Giglietto (fabio.giglietto@uniurb.it) Dealing with platforms

INF580 Large-scale Mathematical Programming TD4 Distance Geometry, Part I Leo Liberti

Language Modeling Michael Collins, Columbia University Overview - PowerPoint PPT Presentation

Language Modeling Michael Collins, Columbia University Overview The language modeling problem Trigram models Evaluating language models: perplexity Estimation techniques: Linear interpolation Discounting methods The

Language Modeling CSE354 - Spring 2020 Task Language Modeling Probabilistic Modeling

Language Modeling CSE392 - Spring 2019 Special Topic in CS Task Probabilistic Modeling

Modeling of proteins and complexes High resolution Low resolution Modeling of domains Modeling

Virtual Reality Modeling Virtual Reality Modeling from http://www.okino.com/ Modeling Modeling

Language Modeling CS 6956: Deep Learning for NLP Overview What is a language model? How

Outline Language learning Computers Computers Computers Topic 6: CALL Topic 6: CALL Topic 6:

Count-based Language Modeling CMSC 473/673 UMBC Some slides adapted from 3SLP, Jason Eisner

NEST Modeling Language: A modeling language for spiking neuron and synapse models for NEST

Topics Why E Field Modeling What is E Field Modeling Case Studies Questions 2 Why

Outline 1 The topic 2 Decision support systems 3 Modeling 3.3 Advanced modeling

Verilog HDL:Digital Design and Modeling Chapter 5 Gate-Level Modeling Chapter 5 Gate-Level

Developmental Developmental Disorders affecting Disorders affecting language language

Language and Computers Relation to language Encoding written language Prologue: Encoding

Language and Computers Relation to language Encoding written Prologue: Encoding Language

CS11-737: Multilingual Natural Language Processing Language contact Yulia Tsvetkov Language

Models of Language Evolution models thereof its evolution language Models of Language Evolution

K + Analyses Gary Cheng Columbia University SciBooNE May 2010 Collaboration Meeting 1

Factor Analysis and Beyond Chris Williams School of Informatics, University of Edinburgh October

MATH 12002 - CALCULUS I 3.3: Graphing Example Professor Donald L. White Department of

Introduction to R Week 2: Making gures Louisa Smith July 20 - July 24 Let's make our data...

Lionel Riou Fransca Univariate &amp; bivariate Two kind of analysis Univariate

SEVEN WAYS CHURCHES ARE RECLAIMING ATTENDANCE AS THEY REGATHER SO WHAT IS ? Presented by SO

Social Media Analysis so far Fabio Giglietto (fabio.giglietto@uniurb.it) Dealing with platforms

INF580 Large-scale Mathematical Programming TD4 Distance Geometry, Part I Leo Liberti

Lionel Riou Fransca Univariate & bivariate Two kind of analysis Univariate