

SLIDE 1

Neural Language Models

CMSC 723 / LING 723 / INST 725 MARINE CARPUAT

marine@cs.umd.edu

With slides from Graham Neubig and Philipp Koehn

SLIDE 2

Roadmap

  • Modeling Sequences

– First example: language model
– What are n-gram models?
– How to estimate them?
– How to evaluate them?
– Neural language models

SLIDE 3

Probabilistic Language Modeling

  • Goal: compute the probability of a sentence or sequence of words: P(W) = P(w1, w2, w3, w4, w5 … wn)

  • Related task: probability of an upcoming word: P(w5 | w1, w2, w3, w4)

  • A model that computes either of these, P(W) or P(wn | w1, w2 … wn-1), is called a language model.
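To make the definition concrete, here is a minimal Python sketch (an assumed example, not from the slides) of a toy bigram language model that scores a sentence as a product of conditional probabilities:

```python
# Minimal sketch (assumed example): scoring a sentence with a toy bigram model,
# P(W) = prod_i P(w_i | w_{i-1}), using made-up probabilities.
bigram_prob = {
    ("<s>", "i"): 0.2, ("i", "saw"): 0.1, ("saw", "a"): 0.3, ("a", "cat"): 0.05,
}

def sentence_prob(words):
    prob = 1.0
    prev = "<s>"  # sentence-start symbol
    for w in words:
        prob *= bigram_prob.get((prev, w), 1e-10)  # unseen bigrams get a tiny probability
        prev = w
    return prob

print(sentence_prob(["i", "saw", "a", "cat"]))  # 0.2 * 0.1 * 0.3 * 0.05 = 3e-4
```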

SLIDE 4

Evaluation: How good is our model?

  • Does our language model prefer good sentences to bad ones?

– Assign higher probability to “real” or “frequently observed” sentences
– than to “ungrammatical” or “rarely observed” sentences
  • Extrinsic vs intrinsic evaluation
SLIDE 5

Intrinsic evaluation: intuition

  • The Shannon Game:

– How well can we predict the next word?
– Unigrams are terrible at this game. (Why?)

  • A better model of a text assigns a higher probability to the word that actually occurs

I always order pizza with cheese and ____
The 33rd President of the US was ____
I saw a ____

Candidate continuations for the first blank:
mushrooms 0.1
pepperoni 0.1
anchovies 0.01
….
fried rice 0.0001
….
and 1e-100

SLIDE 6

Intrinsic evaluation metric: perplexity

The best language model is one that best predicts an unseen test set:

  • Gives the highest P(sentence)

Perplexity is the inverse probability of the test set, normalized by the number of words:

PP(W) = P(w_1 w_2 \dots w_N)^{-1/N} = \sqrt[N]{\frac{1}{P(w_1 w_2 \dots w_N)}}

Chain rule: PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 \dots w_{i-1})}}

For bigrams: PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}}

Minimizing perplexity is the same as maximizing probability.
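A minimal Python sketch (assumed code, not from the slides) of computing perplexity from the per-word probabilities a model assigns to a test set:

```python
import math

def perplexity(word_probs):
    """Perplexity from the probabilities a model assigns to each word of the test set:
    PP = exp(-(1/N) * sum(log p_i)), i.e. the N-th root of 1 / P(w_1 ... w_N)."""
    n = len(word_probs)
    log_prob = sum(math.log(p) for p in word_probs)
    return math.exp(-log_prob / n)

# Hypothetical per-word probabilities for a 4-word test set
print(perplexity([0.2, 0.1, 0.05, 0.3]))  # about 7.6
```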

SLIDE 7

Perplexity as branching factor

  • Let’s suppose a sentence consisting of N random digits

  • What is the perplexity of this sentence according to a model that assigns P = 1/10 to each digit?
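Working through the perplexity formula from slide 6 gives the answer:

PP(W) = P(w_1 \dots w_N)^{-1/N} = \left(\left(\tfrac{1}{10}\right)^{N}\right)^{-1/N} = 10

So the model is, on average, as uncertain as if it were choosing uniformly among 10 alternatives at every position, which is why perplexity is read as a branching factor.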

SLIDE 8

Lower perplexity = better model

  • Training 38 million words, test 1.5 million words, WSJ

N-gram Order:   Unigram   Bigram   Trigram
Perplexity:     962       170      109

SLIDE 9

Pros and cons of n-gram models

  • N-gram models

– Really easy to build, can train on billions and billions of words
– Smoothing helps generalize to new data
– Only work well for word prediction if the test corpus looks like the training corpus
– Only capture short-distance context

“Smarter” LMs can address some of these issues, but they are orders of magnitude slower…

SLIDE 10

Roadmap

  • Modeling Sequences

– First example: language model
– What are n-gram models?
– How to estimate them?
– How to evaluate them?
– Neural language models

SLIDE 11

NEURAL NETWORKS

Aside

SLIDE 12

Recall the person/not-person classification problem

Given an introductory sentence in Wikipedia, predict whether the article is about a person

SLIDE 13

Formalizing binary prediction

SLIDE 14

The Perceptron:

a “machine” to calculate a weighted sum

y = \mathrm{sign}\left( \sum_{j=1}^{J} w_j \cdot \phi_j(x) \right)

Example feature values (word counts from the example sentence):
φ“A” = 1, φ“site” = 1, φ“,” = 2, φ“located” = 1, φ“in” = 1, φ“Maizuru” = 1, φ“Kyoto” = 1, φ“priest” = 0, φ“black” = 0
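A minimal Python sketch (assumed code with hypothetical weights, not from the slides) of the perceptron decision rule applied to the feature counts above:

```python
def perceptron_predict(weights, features):
    """Return sign(sum_j w_j * phi_j), the perceptron decision rule."""
    score = sum(weights.get(name, 0.0) * value for name, value in features.items())
    return 1 if score >= 0 else -1

# Hypothetical weights, plus the feature counts from the example sentence
weights = {"priest": 2.0, "site": -3.0, "located": -1.0}
features = {"A": 1, "site": 1, ",": 2, "located": 1, "in": 1, "Maizuru": 1, "Kyoto": 1}
print(perceptron_predict(weights, features))  # -1: predicted "not a person"
```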
SLIDE 15

The Perceptron: Geometric interpretation

(Figure: examples labeled O and X in feature space.)

SLIDE 16

The Perceptron: Geometric interpretation

(Figure: the same O and X examples, separated by a line.)

SLIDE 17

Limitation of perceptron

  • Can only find linear separations between positive and negative examples

(Figure: an XOR-like arrangement of X and O examples that no single line can separate.)

SLIDE 18

Neural Networks

  • Connect together multiple perceptrons

  • Motivation: Can represent non-linear functions!
SLIDE 19

Neural Networks: key terms

  • Input (aka features)
  • Output
  • Nodes
  • Layers
  • Activation function (non-linear)
  • Multi-layer perceptron

SLIDE 20

Example

  • Create two classifiers

Four points in the original space (an XOR-like arrangement of X and O):
φ0(x1) = {-1, 1}, φ0(x2) = {1, 1}, φ0(x3) = {-1, -1}, φ0(x4) = {1, -1}

(Figure: two sign units; each takes φ0[0], φ0[1] and a bias input of 1, with weights w0,0, b0,0 and w0,1, b0,1, and outputs φ1[0] and φ1[1] respectively.)

SLIDE 21

Example

  • These classifiers map to a new space

φ0(x1) = {-1, 1}, φ0(x2) = {1, 1}, φ0(x3) = {-1, -1}, φ0(x4) = {1, -1}

are mapped to

φ1(x1) = {-1, -1}, φ1(x2) = {1, -1}, φ1(x3) = {-1, 1}, φ1(x4) = {-1, -1}

(Figure: the original O/X points and their images in the new φ1 space.)

SLIDE 22

Example

  • In new space, the examples are linearly separable!

φ1(x1) = {-1, -1}, φ1(x2) = {1, -1}, φ1(x3) = {-1, 1}, φ1(x4) = {-1, -1}

(Figure: a final sign unit takes φ1[0], φ1[1] and a bias input of 1 and outputs φ2[0] = y, separating the two classes with a line.)

SLIDE 23

Example wrap-up: Forward propagation

  • The final net

(Figure: the final network. Two tanh units map φ0[0], φ0[1] and a bias of 1 to φ1[0] and φ1[1]; a third tanh unit maps φ1[0], φ1[1] and a bias of 1 to the output φ2[0].)
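A minimal Python sketch (assumed code; the weights are one choice that reproduces the example, not necessarily the slide's exact values) of forward propagation through this net:

```python
import numpy as np

def forward(phi0):
    """Forward propagation: two tanh hidden units, then one tanh output unit."""
    h1 = np.tanh(1 * phi0[0] + 1 * phi0[1] - 1)   # phi1[0]: fires for {1, 1}
    h2 = np.tanh(-1 * phi0[0] - 1 * phi0[1] - 1)  # phi1[1]: fires for {-1, -1}
    return np.tanh(1 * h1 + 1 * h2 + 1)           # phi2[0] = y

for point in ([-1, 1], [1, 1], [1, -1], [-1, -1]):
    print(point, round(float(forward(point)), 2))
# {1,1} and {-1,-1} get positive outputs; {-1,1} and {1,-1} get negative outputs
```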

SLIDE 24


Softmax Function for multiclass classification

  • Sigmoid function for multiple classes
  • Can be expressed using matrix/vector ops

P(y = z \mid x) = \frac{e^{\,w_z \cdot \phi(x)}}{\sum_{\tilde z} e^{\,w_{\tilde z} \cdot \phi(x)}}

(numerator: score of the current class; denominator: sum over all classes)

Matrix/vector form:

\mathbf{s} = \exp\big(W \cdot \phi(x)\big), \qquad \mathbf{q} = \frac{\mathbf{s}}{\sum_{s \in \mathbf{s}} s}
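A minimal numpy sketch (assumed code, not from the slides) of the matrix/vector form: exponentiate the scores of all classes, then normalize so the result sums to one:

```python
import numpy as np

def softmax(scores):
    """Turn a vector of class scores into probabilities: exp(s) / sum(exp(s))."""
    s = np.exp(scores - np.max(scores))  # subtract max for numerical stability
    return s / s.sum()

W = np.array([[0.5, -1.0], [0.2, 0.3], [-0.4, 0.8]])  # one weight row per class (hypothetical)
phi = np.array([1.0, 2.0])                            # feature vector
print(softmax(W @ phi))                               # probabilities over 3 classes, sums to 1
```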

SLIDE 25

Stochastic Gradient Descent

Online training algorithm for probabilistic models:

    w = 0
    for I iterations:
        for each labeled pair (x, y) in the data:
            w += α * dP(y|x)/dw

In other words:

  • For every training example, calculate the gradient (the direction that will increase the probability of y)

  • Move in that direction, multiplied by learning rate α
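A minimal Python sketch (assumed code, not from the slides) of this procedure for the binary sigmoid model, using the gradient derived on the next slide:

```python
import numpy as np

def sgd_train(data, n_features, iterations=10, alpha=0.1):
    """SGD for the sigmoid model P(y=1|x) = e^{w.phi(x)} / (1 + e^{w.phi(x)}), y in {-1, +1}.
    Each step moves w in the direction that increases P(y|x)."""
    w = np.zeros(n_features)
    for _ in range(iterations):
        for phi, y in data:
            p = 1.0 / (1.0 + np.exp(-np.dot(w, phi)))  # P(y=1 | x)
            grad = y * phi * p * (1.0 - p)             # dP(y|x)/dw for y = +1 or -1
            w += alpha * grad
    return w

# Toy usage: two labeled feature vectors
data = [(np.array([1.0, 0.0]), 1), (np.array([0.0, 1.0]), -1)]
print(sgd_train(data, n_features=2))
```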
SLIDE 26

Gradient of the Sigmoid Function

Take the derivative of the probability

\frac{d\,P(y{=}1 \mid x)}{dw} \;=\; \frac{d}{dw}\,\frac{e^{\,w\cdot\phi(x)}}{1+e^{\,w\cdot\phi(x)}} \;=\; \frac{\phi(x)\,e^{\,w\cdot\phi(x)}}{\left(1+e^{\,w\cdot\phi(x)}\right)^{2}}

\frac{d\,P(y{=}{-}1 \mid x)}{dw} \;=\; \frac{d}{dw}\left(1-\frac{e^{\,w\cdot\phi(x)}}{1+e^{\,w\cdot\phi(x)}}\right) \;=\; \frac{-\,\phi(x)\,e^{\,w\cdot\phi(x)}}{\left(1+e^{\,w\cdot\phi(x)}\right)^{2}}
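A quick numerical check (assumed code, not from the slides) that the closed-form derivative above agrees with a finite-difference estimate:

```python
import numpy as np

def p_y1(w, phi):
    """P(y=1|x) = e^{w.phi} / (1 + e^{w.phi})."""
    return np.exp(w @ phi) / (1.0 + np.exp(w @ phi))

w, phi = np.array([0.5, -0.2]), np.array([1.0, 2.0])  # hypothetical weights and features
analytic = phi * np.exp(w @ phi) / (1.0 + np.exp(w @ phi)) ** 2  # formula above
eps = 1e-6
numeric = np.array([(p_y1(w + eps * np.eye(2)[i], phi) - p_y1(w, phi)) / eps for i in range(2)])
print(analytic, numeric)  # the two estimates should agree closely
```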

SLIDE 27

Learning: We Don't Know the Derivative for Hidden Units!

For NNs, only know correct tag for last layer

For the output unit (weights w_5, taking the hidden-layer outputs h(x) as its input), the gradient has the familiar form:

\frac{d\,P(y{=}1 \mid x)}{dw_5} \;=\; \frac{h(x)\,e^{\,w_5\cdot h(x)}}{\left(1+e^{\,w_5\cdot h(x)}\right)^{2}}

But for the hidden units there is no observed “correct” output, so:

\frac{d\,P(y{=}1 \mid x)}{dw_2} = ? \qquad \frac{d\,P(y{=}1 \mid x)}{dw_3} = ? \qquad \frac{d\,P(y{=}1 \mid x)}{dw_4} = ?

SLIDE 28

Answer: Back-Propagation

Calculate derivative with chain rule

\frac{d\,P(y{=}1 \mid x)}{dw_2} \;=\; \frac{d\,P(y{=}1 \mid x)}{d\,(w_5\cdot h(x))}\;\cdot\;\frac{d\,(w_5\cdot h(x))}{d\,h_1(x)}\;\cdot\;\frac{d\,h_1(x)}{dw_2}

where the first factor is e^{\,w_5\cdot h(x)} / (1+e^{\,w_5\cdot h(x)})^{2}, the second is just the connecting weight, and the last is the local gradient of the hidden unit.

In general, the gradient for unit j is computed from the errors δ_k of the next units k:

\frac{d\,P(y{=}1 \mid x)}{dw_j} \;=\; \frac{d\,h_j(x)}{dw_j}\;\sum_{k}\delta_k\, w_{j,k}

(error of the next unit δ_k, times the connecting weight, times the gradient of this unit)
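A minimal numpy sketch (assumed code with hypothetical weights, not from the slides) of one forward and backward pass through a tiny network, showing how the output unit's error is pushed back through the connecting weights and the local tanh gradient:

```python
import numpy as np

phi0 = np.array([1.0, -1.0])         # input features
W1 = np.array([[0.5, 0.3],           # hidden-layer weights (2 hidden units, hypothetical)
               [-0.2, 0.8]])
w2 = np.array([0.7, -0.4])           # output-layer weights

# Forward pass
h = np.tanh(W1 @ phi0)               # hidden activations
p = 1.0 / (1.0 + np.exp(-(w2 @ h)))  # P(y=1 | x)

# Backward pass (back-propagation)
delta_out = p * (1.0 - p)            # error of the output unit: dp / d(w2 . h)
grad_w2 = delta_out * h              # gradient for the output weights
delta_hidden = delta_out * w2 * (1.0 - h ** 2)  # push error back: weight * tanh'(...)
grad_W1 = np.outer(delta_hidden, phi0)          # gradient for the hidden weights
print(grad_w2, grad_W1)
```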
SLIDE 29

Backpropagation = Gradient descent + Chain rule

SLIDE 30

Feed Forward Neural Nets

All connections point forward

(Figure: a feed-forward network from input features φ(x) to output y, with all edges pointing forward.)

It is a directed acyclic graph (DAG)

SLIDE 31

Neural Networks

  • Non-linear classification
  • Prediction: forward propagation

– Vector/matrix operations + non-linearities

  • Training: backpropagation + stochastic gradient descent

For more details, see Cho chap 3 or CIML Chap 7

SLIDE 32

NEURAL NETWORKS

Aside

SLIDE 33

Back to language modeling…

SLIDE 34

Representing words

  • “one hot vector”

dog = [ 0, 0, 0, 0, 1, 0, 0, 0 …]
cat = [ 0, 0, 0, 0, 0, 0, 1, 0 …]
eat = [ 0, 1, 0, 0, 0, 0, 0, 0 …]

  • That’s a large vector! Practical solutions:

– limit to most frequent words (e.g., top 20000)
– cluster words into classes

  • WordNet classes, frequency binning, etc.
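A minimal Python sketch (assumed code, not from the slides) of one-hot vectors over a restricted vocabulary, with out-of-vocabulary words mapped to an unknown-word token:

```python
import numpy as np

vocab = ["<unk>", "the", "dog", "cat", "eat"]   # hypothetical top-k vocabulary + unknown token
word_to_id = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Vector of zeros with a single 1 at the word's index; unknown words map to <unk>."""
    v = np.zeros(len(vocab))
    v[word_to_id.get(word, 0)] = 1.0
    return v

print(one_hot("dog"))    # [0. 0. 1. 0. 0.]
print(one_hot("zebra"))  # [1. 0. 0. 0. 0.]  (out-of-vocabulary)
```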
SLIDE 35
SLIDE 36

Feed-Forward Neural Language Model

Map each word into a lower-dimensional real-valued space using a shared weight matrix C (the embedding layer). [Bengio et al. 2003]
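A minimal numpy sketch (assumed code; the dimensions and variable names are hypothetical) of the idea: the same matrix C embeds every context word, and the concatenated embeddings feed a hidden layer and a softmax over the vocabulary:

```python
import numpy as np

vocab_size, embed_dim, hidden_dim = 20000, 100, 200
rng = np.random.default_rng(0)
C = rng.normal(size=(vocab_size, embed_dim))        # shared embedding matrix
W_h = rng.normal(size=(hidden_dim, 3 * embed_dim))  # hidden layer for a 3-word context
W_out = rng.normal(size=(vocab_size, hidden_dim))   # output layer over the vocabulary

def next_word_distribution(context_ids):
    """P(next word | 3 previous words) for a feed-forward neural LM."""
    x = np.concatenate([C[i] for i in context_ids])  # look up and concatenate embeddings
    h = np.tanh(W_h @ x)                             # hidden representation
    scores = W_out @ h
    e = np.exp(scores - scores.max())                # softmax over the whole vocabulary
    return e / e.sum()

probs = next_word_distribution([12, 431, 7])         # hypothetical word ids
print(probs.shape, probs.sum())                      # (20000,) 1.0
```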

SLIDE 37

Word Embeddings

  • Neural language models produce word embeddings as a by-product

  • Words that occur in similar contexts tend to have similar embeddings

  • Embeddings are useful features in many NLP tasks

[Turian et al. 2009]

SLIDE 38

Word embeddings illustrated

SLIDE 39

Recurrent Neural Networks

SLIDE 40

Recurrent Neural Nets (RNN)

Some of the node outputs are fed back in as inputs at the next time step.

Why? This makes it possible to “memorize” earlier inputs.
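A minimal numpy sketch (assumed code, not from the slides) of a vanilla recurrent unit: each new hidden state is computed from the current input and the previous hidden state, which is what lets the network "memorize":

```python
import numpy as np

embed_dim, hidden_dim = 4, 3
rng = np.random.default_rng(0)
W_x = rng.normal(size=(hidden_dim, embed_dim))   # input-to-hidden weights
W_h = rng.normal(size=(hidden_dim, hidden_dim))  # hidden-to-hidden (recurrent) weights

def rnn_forward(inputs):
    """Run a vanilla RNN over a sequence of input vectors, returning all hidden states."""
    h = np.zeros(hidden_dim)
    states = []
    for x in inputs:
        h = np.tanh(W_x @ x + W_h @ h)  # new state depends on current input AND previous state
        states.append(h)
    return states

sequence = [rng.normal(size=embed_dim) for _ in range(5)]  # e.g., 5 word embeddings
print(len(rnn_forward(sequence)))  # 5 hidden states
```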

SLIDE 41

Training: backpropagation through time

After processing a few training examples, update the weights by backpropagating through the unfolded recurrent neural network

SLIDE 42

Recurrent neural language models

  • Hidden layer plays double duty

– Memory of the network
– Continuous-space representation used to predict output words

  • Other more elaborate architectures

– Long Short-Term Memory
– Gated Recurrent Units

SLIDE 43

Neural Language Models in practice

  • Much more expensive to train than n-grams!
  • But yielded dramatic improvement in hard extrinsic tasks

– speech recognition (Mikolov et al. 2011)
– and more recently machine translation (Devlin et al. 2014)

  • Key practical issue:

– the softmax requires normalizing over the sum of scores for all possible words
– What to do?

  • Ignore – a score is a score (Auli and Gao, 2014)
  • Integrate normalization into objective function (Devlin et al. 2014)
SLIDE 44

What we know about modeling sequences so far…

– First example: language model
– What are n-gram models?
– How to estimate them?
– How to evaluate them?
– Neural language models