Natural Language Processing Info 159/259 Lecture 6: Language models - - PowerPoint PPT Presentation



SLIDE 1

Natural Language Processing

Info 159/259
 Lecture 6: Language models 1 (Sept 12, 2017) David Bamman, UC Berkeley

SLIDE 2

Language Model

  • Vocabulary 𝒲 is a finite set of discrete symbols (e.g., words, characters); V = |𝒲|
  • 𝒲+ is the infinite set of sequences of symbols from 𝒲; each sequence ends with STOP
  • x ∈ 𝒲+
SLIDE 3

Language Model

P(w) = P(w1, . . . , wn)

P(“Call me Ishmael”) = P(w1 = “call”, w2 = “me”, w3 = “Ishmael”) × P(STOP)

  • Σ_{w ∈ 𝒲+} P(w) = 1 (over all sequence lengths!)
  • 0 ≤ P(w) ≤ 1
SLIDE 4
Language Model

  • Language models provide us with a way to quantify the likelihood of a sequence — i.e., how plausible a sentence is.

SLIDE 5

OCR

  • to fee great Pompey paffe the Areets of Rome:
  • to see great Pompey passe the streets of Rome:
SLIDE 6

Machine translation

  • Fidelity (to source text)
  • Fluency (of the translation)
SLIDE 7
SLIDE 8

Speech Recognition

  • 'Scuse me while I kiss the sky.
  • 'Scuse me while I kiss this guy
  • 'Scuse me while I kiss this fly.
  • 'Scuse me while my biscuits fry
SLIDE 9

Li et al. (2016), "Deep Reinforcement Learning for Dialogue Generation" (EMNLP)

Dialogue generation

SLIDE 10

Information theoretic view

Y = “One morning I shot an elephant in my pajamas”

Y → encode(Y) → decode(encode(Y))

Shannon 1948

SLIDE 11

Noisy Channel

       X                 Y
ASR    speech signal     transcription
MT     target text       source text
OCR    pixel densities   transcription

P(Y | X) ∝ P(X | Y) × P(Y)

  • P(X | Y): channel model
  • P(Y): source model

SLIDE 12
  • Language modeling is the task of estimating P(w)
  • Why is this hard?

Language Model

P(“It was the best of times, it was the worst of times”)

SLIDE 13

Chain rule (of probability)

P(x1, x2, x3, x4, x5) = P(x1) × P(x2 | x1) × P(x3 | x1, x2) × P(x4 | x1, x2, x3) × P(x5 | x1, x2, x3, x4)

SLIDE 14

P(“It was the best of times, it was the worst of times”)

Chain rule (of probability)

SLIDE 15

Chain rule (of probability)

P(“It”) × P(w2 | w1) × P(w3 | w1, w2) × P(w4 | w1, w2, w3) × … × P(wn | w1, . . . , wn−1)

this is easy: P(“was” | “It”)
this is hard: P(“times” | “It was the best of times, it was the worst of”)

SLIDE 16

Markov assumption

first-order: P(xi | x1, . . . , xi−1) ≈ P(xi | xi−1)

second-order: P(xi | x1, . . . , xi−1) ≈ P(xi | xi−2, xi−1)

SLIDE 17

Markov assumption

bigram model (first-order Markov):

P(w) = ∏_{i=1}^{n} P(wi | wi−1) × P(STOP | wn)

trigram model (second-order Markov):

P(w) = ∏_{i=1}^{n} P(wi | wi−2, wi−1) × P(STOP | wn−1, wn)

SLIDE 18

“It was the best of times, it was the worst of times”

P(It | START1, START2) × P(was | START2, It) × P(the | It, was) × … × P(times | worst, of) × P(STOP | of, times)
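As a minimal sketch of the factorization above: the sentence is padded with two start symbols and one STOP symbol, and each token is conditioned on the two tokens before it. The tokenization (whitespace split) and the symbol names START1/START2/STOP are illustrative assumptions, not from any particular library.

```python
# Sketch: factor a sentence into trigram (context, word) conditionals
# under a second-order Markov assumption, with START/STOP padding.

def trigram_factors(tokens):
    """Return the (context, word) pairs whose conditional probabilities
    are multiplied together under a second-order Markov model."""
    padded = ["START1", "START2"] + tokens + ["STOP"]
    return [((padded[i - 2], padded[i - 1]), padded[i])
            for i in range(2, len(padded))]

factors = trigram_factors("It was the best of times".split())
# First factor conditions only on the padding:
# (("START1", "START2"), "It"); the last is (("of", "times"), "STOP")
```

Multiplying the model's probability for each of these pairs (something no real padding scheme needs to change) gives P(w) for the whole sentence.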

SLIDE 19

Estimation

Maximum likelihood estimates:

unigram: P(w) = ∏_{i=1}^{n} P(wi) × P(STOP), with P(wi) = c(wi) / N

bigram: P(w) = ∏_{i=1}^{n} P(wi | wi−1) × P(STOP | wn), with P(wi | wi−1) = c(wi−1, wi) / c(wi−1)

trigram: P(w) = ∏_{i=1}^{n} P(wi | wi−2, wi−1) × P(STOP | wn−1, wn), with P(wi | wi−2, wi−1) = c(wi−2, wi−1, wi) / c(wi−2, wi−1)
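The bigram case of these MLE ratios can be sketched in a few lines. The two-sentence corpus and the START/STOP padding here are toy assumptions for illustration.

```python
from collections import Counter

# Sketch: maximum likelihood estimation for a bigram model
# as the ratio c(w_{i-1}, w_i) / c(w_{i-1}).

corpus = [["the", "dog", "walked"], ["the", "dog", "ran"]]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    tokens = ["START"] + sent + ["STOP"]
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def p_mle(word, prev):
    # P(w_i | w_{i-1}) = c(w_{i-1}, w_i) / c(w_{i-1})
    return bigrams[(prev, word)] / unigrams[prev]

p_mle("dog", "the")     # c(the, dog) / c(the) = 2/2 = 1.0
p_mle("walked", "dog")  # c(dog, walked) / c(dog) = 1/2 = 0.5
```

Any bigram never seen in training gets probability exactly zero here, which is the sparsity problem the smoothing slides later address.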

SLIDE 20

Generating

  • What we learn in estimating language models is P(word | context), where context — at least here — is the previous n−1 words (for an ngram model of order n)
  • We have one multinomial over the vocabulary (including STOP) for each context

[Bar chart: one such multinomial, P(word | context), over vocabulary items — “a”, “amazing”, “bad”, “best”, “good”, “like”, “love”, “movie”, “not”, “of”, “sword”, “the”, “worst” — with probabilities between 0.00 and 0.06]

SLIDE 21
Generating

  • As we sample, the words we generate form the new context we condition on

context1   context2   generated word
START      START      The
START      The        dog
The        dog        walked
dog        walked     in
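The generation loop can be sketched as follows: draw a word from the multinomial for the current context, then shift the context forward. The hand-built bigram `model` dictionary below is a toy assumption for illustration.

```python
import random

# Sketch: sampling from a bigram model. Each generated word becomes
# the context for the next draw; generation ends at STOP.

model = {
    "START":  {"The": 1.0},
    "The":    {"dog": 0.7, "cat": 0.3},
    "dog":    {"walked": 0.5, "ran": 0.5},
    "cat":    {"slept": 1.0},
    "walked": {"STOP": 1.0},
    "ran":    {"STOP": 1.0},
    "slept":  {"STOP": 1.0},
}

def generate(model, max_len=20):
    context, out = "START", []
    while len(out) < max_len:
        dist = model[context]
        word = random.choices(list(dist), weights=list(dist.values()))[0]
        if word == "STOP":
            break
        out.append(word)
        context = word
    return out

generate(model)  # e.g., ["The", "dog", "walked"]
```

For a trigram model the context would be the last two generated words rather than one, but the loop is otherwise identical.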

SLIDE 22

Aside: sampling?

SLIDE 23

Sampling from a Multinomial

Probability mass function (PMF): P(z = x), the probability that z equals x exactly

[Bar chart: PMF over outcomes x = 1…5, with P(z = x) on the vertical axis from 0.0 to 0.6]

SLIDE 24

Sampling from a Multinomial

Cumulative distribution function (CDF): P(z ≤ x)

[Step plot: CDF over outcomes x = 1…5, with P(z ≤ x) rising from 0.0 to 1.0]

SLIDE 25

Sampling from a Multinomial

Sample p uniformly in [0,1]; find the point CDF⁻¹(p). Here p = 0.78.

[Step plot: CDF with the draw p = 0.78 mapped back to an outcome]

SLIDE 26

Sampling from a Multinomial

Sample p uniformly in [0,1]; find the point CDF⁻¹(p). Here p = 0.06.

[Step plot: CDF with the draw p = 0.06 mapped back to an outcome]

SLIDE 27

Sampling from a Multinomial

[Step plot: CDF with cumulative values ≤0.008, ≤0.059, ≤0.071, ≤0.703, ≤1.000 for outcomes x = 1…5]

Sample p uniformly in [0,1]; find the point CDF⁻¹(p).
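Inverse-CDF sampling, as in these slides, can be sketched directly. The PMF values below are assumptions chosen so the cumulative sums match the breakpoints on the slide (0.008, 0.059, 0.071, 0.703, 1.000).

```python
import random
from itertools import accumulate

# Sketch: inverse-CDF sampling from a multinomial. Draw p ~ Uniform[0,1],
# then return the first outcome whose cumulative probability is >= p.

outcomes = [1, 2, 3, 4, 5]
pmf = [0.008, 0.051, 0.012, 0.632, 0.297]
cdf = list(accumulate(pmf))  # ≈ [0.008, 0.059, 0.071, 0.703, 1.000]

def sample(outcomes, cdf):
    p = random.random()
    for x, c in zip(outcomes, cdf):
        if p <= c:
            return x
    return outcomes[-1]  # guard against floating-point rounding

sample(outcomes, cdf)  # p = 0.78 would land on outcome 5; p = 0.06 on outcome 2
```

This is exactly what samples a word from one of the per-context multinomials during generation.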

SLIDE 28

Unigram model

  • the around, she They I blue talking “Don’t to and little

come of

  • on fallen used there. young people to Lázaro
  • of the
  • the of of never that ordered don't avoided to

complaining.

  • words do had men flung killed gift the one of but thing

seen I plate Bradley was by small Kingmaker.

SLIDE 29

Bigram Model

  • “What the way to feel where we’re all those ancients

called me one of the Council member, and smelled Tales of like a Korps peaks.”

  • Tuna battle which sold or a monocle, I planned to help

and distinctly.

  • “I lay in the canoe ”
  • She started to be able to the blundering collapsed.
  • “Fine.”
SLIDE 30

Trigram Model

  • “I’ll worry about it.”
  • Avenue Great-Grandfather Edgeworth hasn’t gotten there.
  • “If you know what. It was a photograph of seventeenth-century

flourishin’ To their right hands to the fish who would not care at all. Looking at the clock, ticking away like electronic warnings about wonderfully SAT ON FIFTH

  • Democratic Convention in rags soaked and my past life, I managed

to wring your neck a boss won’t so David Pritchet giggled.

  • He humped an argument but her bare He stood next to Larry, these

days it will have no trouble Jay Grayer continued to peer around the Germans weren’t going to faint in the

SLIDE 31

4gram Model

  • Our visitor in an idiot sister shall be blotted out in bars and

flirting with curly black hair right marble, wallpapered on screen credit.”

  • You are much instant coffee ranges of hills.
  • Madison might be stored here and tell everyone about was

tight in her pained face was an old enemy, trading-posts of the

  • Outdoors watching Anyog extended On my lips moved feebly.
  • said.
  • “I’m in my mind, threw dirt in an inch,’ the Director.
SLIDE 32
  • The best evaluation metrics are external — how does a better language model influence the application you care about?
  • Speech recognition (word error rate), machine translation (BLEU score), topic models (sensemaking)

Evaluation

SLIDE 33

Evaluation

  • A good language model should judge unseen real language to have high probability
  • Perplexity = inverse probability of the test data, averaged by word
  • To be reliable, the test data must be truly unseen (including knowledge of its vocabulary)

perplexity = P(w1, . . . , wn)^(−1/N)

SLIDE 34

Experiment design

             size   purpose
training     80%    training models
development  10%    model selection; hyperparameter tuning
testing      10%    evaluation; never look at it until the very end

SLIDE 35

Evaluation

log P(w1, . . . , wn) = Σ_{i=1}^{N} log P(wi)

average log probability per word: (1/N) Σ_{i=1}^{N} log P(wi)

perplexity = exp( −(1/N) Σ_{i=1}^{N} log P(wi) )
SLIDE 36

Perplexity

For a trigram model (second-order Markov):

perplexity = exp( −(1/N) Σ_{i=1}^{N} log P(wi | wi−2, wi−1) )
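The perplexity formula above is a one-liner in code: exponentiate the negative mean log-probability. The `log_probs` list below stands in for the per-token values log P(wi | wi−2, wi−1) on held-out data; the specific numbers are made up for illustration.

```python
import math

# Sketch: perplexity = exp( -(1/N) * sum of per-token log probabilities ).

def perplexity(log_probs):
    n = len(log_probs)
    return math.exp(-sum(log_probs) / n)

log_probs = [math.log(0.1), math.log(0.2), math.log(0.05)]
perplexity(log_probs)  # geometric mean of the probs is 0.1, so this is ≈ 10.0
```

Working in log space avoids the numerical underflow you would get from multiplying many small probabilities directly.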

SLIDE 37

Perplexity

Model     Perplexity
Unigram   962
Bigram    170
Trigram   109

SLP3 4.3

SLIDE 38

Smoothing

  • When estimating a language model, we’re relying on the data we’ve observed in a training corpus.
  • Training data is a small (and biased) sample of the creativity of language.

SLIDE 39

Data sparsity

SLP3 4.1

SLIDE 40
  • As in Naive Bayes, a single P(wi) = 0 causes P(w) = 0. (What happens to perplexity?)

P(w) = ∏_{i=1}^{n} P(wi | wi−1) × P(STOP | wn)

SLIDE 41

Smoothing in NB

  • One solution: add a little probability mass to every element.

maximum likelihood estimate:                   P(xi | y) = ni,y / ny
smoothed estimate (same α for all xi):         P(xi | y) = (ni,y + α) / (ny + Vα)
smoothed estimate (possibly different αi):     P(xi | y) = (ni,y + αi) / (ny + Σ_{j=1}^{V} αj)

ni,y = count of word i in class y; ny = number of words in y; V = size of vocabulary

SLIDE 42

Additive smoothing

P(wi) = (c(wi) + α) / (N + Vα)

P(wi | wi−1) = (c(wi−1, wi) + α) / (c(wi−1) + Vα)

Laplace smoothing: α = 1
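The add-α bigram estimate above can be sketched as follows. The four-token corpus and its vocabulary are toy assumptions; with the default α = 1 this is Laplace smoothing.

```python
from collections import Counter

# Sketch: add-alpha smoothing for bigram probabilities,
# (c(prev, word) + alpha) / (c(prev) + V * alpha).

tokens = ["the", "dog", "the", "cat"]
V = len(set(tokens))  # vocabulary size
bigrams = Counter(zip(tokens, tokens[1:]))
unigrams = Counter(tokens)

def p_smoothed(word, prev, alpha=1.0):
    return (bigrams[(prev, word)] + alpha) / (unigrams[prev] + V * alpha)

p_smoothed("dog", "the")  # (1 + 1) / (2 + 3) = 0.4
p_smoothed("dog", "cat")  # unseen bigram still gets (0 + 1) / (1 + 3) = 0.25
```

No context can now assign zero probability to any vocabulary word, at the cost of shifting mass away from observed counts.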

SLIDE 43

Smoothing

[Two bar charts over outcomes 1–6: under the MLE, unseen outcomes get zero probability; with smoothing at α = 1, every outcome receives some mass]

Smoothing is the re-allocation of probability mass
SLIDE 44
  • How can we best re-allocate probability mass?

Smoothing

Stanley F. Chen and Joshua Goodman. An empirical study of smoothing techniques for language modeling. Technical Report TR-10-98, Center for Research in Computing Technology, Harvard University, 1998.

SLIDE 45

Interpolation

  • As ngram order rises, we have the potential for higher precision but also higher variability in our estimates.
  • A linear interpolation of any two language models p and q (with λ ∈ [0,1]) is also a valid language model: λp + (1 − λ)q
  • e.g., p = the web, q = political speeches

SLIDE 46

Interpolation

  • We can use this fact to make higher-order language models more robust.

P(wi | wi−2, wi−1) = λ1 P(wi | wi−2, wi−1) + λ2 P(wi | wi−1) + λ3 P(wi)

λ1 + λ2 + λ3 = 1

SLIDE 47
  • How do we pick the best values of λ?
  • Grid search over a development corpus
  • Expectation-Maximization algorithm (treat the λs as missing parameters to be estimated to maximize the probability of the data we see)

Interpolation
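The interpolation above is a weighted sum of the three estimates. In this sketch, `p_tri`, `p_bi`, and `p_uni` stand in for the trigram, bigram, and unigram MLE values; the λ values shown are arbitrary placeholders that would in practice be tuned on a development corpus (grid search or EM).

```python
# Sketch: linear interpolation of trigram, bigram, and unigram estimates,
# with lambdas constrained to sum to 1.

def p_interp(p_tri, p_bi, p_uni, lambdas=(0.6, 0.3, 0.1)):
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9  # must stay a valid distribution
    return l1 * p_tri + l2 * p_bi + l3 * p_uni

# Even when the trigram estimate is zero, the interpolated probability
# stays positive as long as a lower-order estimate is nonzero:
p_interp(0.0, 0.2, 0.05)  # 0.3 * 0.2 + 0.1 * 0.05 = 0.065
```

Because each component is a valid distribution over the vocabulary and the λs sum to 1, the result is also a valid distribution.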

SLIDE 48

Kneser-Ney smoothing

  • Intuition: when backing off to a lower-order ngram, the overall ngram frequency may not be our best guess.

I can’t see without my reading ____________

P(“Francisco”) > P(“glasses”)

  • “Francisco” is more frequent, but shows up in fewer unique bigrams (“San Francisco”) — so we shouldn’t expect it in new contexts; “glasses”, however, shows up in many different bigrams.

SLIDE 49

Kneser-Ney smoothing

  • Intuition: estimate how likely a word is to show up as a new continuation.
  • In how many different bigram types does a word type w show up (normalized by all bigram types that are seen)?

PKN(w) = |{v ∈ V : c(v, w) > 0}| / |{(v, w′) : c(v, w′) > 0}|

continuation probability: of all bigram types in the training data, for how many is w the suffix?

SLIDE 50

PKN(w) = |{v ∈ V : c(v, w) > 0}| / |{(v, w′) : c(v, w′) > 0}|

PKN(w) is the continuation probability for the unigram w (the frequency with which it appears as the suffix in distinct bigram types)

SLIDE 51

PKN(wi | wi−1) = max{c(wi−1, wi) − d, 0} / c(wi−1) + λ(wi−1) PKN(wi)

  • discounted bigram probability: max{c(wi−1, wi) − d, 0} / c(wi−1)
  • discounted mass: λ(wi−1)
  • continuation probability: PKN(wi)

Kneser-Ney smoothing

SLIDE 52

PKN(wi | wi−1) = max{c(wi−1, wi) − d, 0} / c(wi−1) + λ(wi−1) PKN(wi)

discounted bigram probability: d is a discount factor (usually between 0 and 1), how much we discount the observed counts by

Kneser-Ney smoothing

SLIDE 53

λ(wi−1) = d × |{v ∈ V : c(wi−1, v) > 0}| / c(wi−1)

  • numerator: number of distinct word types following the prefix wi−1 (prefix types), scaled by d
  • denominator: count of prefix tokens, c(wi−1)

λ here captures the discounted mass we’re reallocating from the prefix wi−1

SLIDE 54

Kneser-Ney smoothing

wi−1   wi      c(wi−1, wi)   c(wi−1, wi) − d  (d = 1)
red    hook    3             2
red    car     2             1
red    watch   10            9
sum            15            12

λ(red) = 1 × 3/15

12/15 of the probability mass stays with the original counts; 3/15 is reallocated

SLIDE 55

PKN(w) = |{v ∈ V : c(v, w) > 0}| / |{(v, w′) : c(v, w′) > 0}|

PKN(wi | wi−1) = max{c(wi−1, wi) − d, 0} / c(wi−1) + λ(wi−1) PKN(wi)

discounted bigram probability + discounted mass × continuation probability

SLIDE 56

PKN(wi | wi−1) = max{c(wi−1, wi) − d, 0} / c(wi−1) + λ(wi−1) PKN(wi)

we move all of the mass subtracted in the first term over to the second term and distribute it according to the continuation probability

SLIDE 57

Stanley F. Chen and Joshua Goodman. An empirical study of smoothing techniques for language modeling. Technical Report TR-10-98, Center for Research in Computing Technology, Harvard University, 1998.

SLIDE 58

“Stupid backoff”

S(wi | wi−k+1, . . . , wi−1) = c(wi−k+1, . . . , wi) / c(wi−k+1, . . . , wi−1)   if the full sequence is observed

                            = λ S(wi | wi−k+2, . . . , wi−1)                    otherwise

Brants et al. (2007), “Large Language Models in Machine Translation”

No discounting here: just back off to the lower-order ngram if the higher-order one is not observed. Note that S is a score, not a normalized probability. Cheap to calculate; works almost as well as Kneser-Ney when there is a lot of data.
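The backoff recursion above can be sketched directly. The counts below are toy assumptions; λ = 0.4 is the constant Brants et al. report working well in practice.

```python
from collections import Counter

# Sketch: stupid backoff scoring. If the full ngram was observed, use its
# relative frequency; otherwise recurse on the shorter context, scaled by
# lambda. S is a score, not a normalized probability.

ngram_counts = Counter({
    ("the",): 5, ("dog",): 3, ("barked",): 1,
    ("the", "dog"): 2, ("dog", "barked"): 1,
})
total_unigrams = sum(c for k, c in ngram_counts.items() if len(k) == 1)

def score(word, context, lam=0.4):
    if not context:  # base case: unigram relative frequency
        return ngram_counts[(word,)] / total_unigrams
    full = context + (word,)
    if ngram_counts[full] > 0:
        return ngram_counts[full] / ngram_counts[context]
    return lam * score(word, context[1:], lam)

score("barked", ("dog",))  # observed bigram: 1/3
score("barked", ("the",))  # unseen bigram: 0.4 * c(barked)/N = 0.4 * 1/9
```

Because no mass is redistributed, scores over the vocabulary need not sum to 1, which is why this cannot be plugged into a perplexity computation directly.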

SLIDE 59

You should feel comfortable:

  • Calculating the probability of a sentence given a trained model
  • Estimating an (e.g., trigram) language model
  • Evaluating perplexity on held-out data
  • Sampling a sentence from a trained model
SLIDE 60

Tools

  • SRILM
    http://www.speech.sri.com/projects/srilm/
  • KenLM
    https://kheafield.com/code/kenlm/
  • Berkeley LM
    https://code.google.com/archive/p/berkeleylm/