SLIDE 1

CS 533: Natural Language Processing

Conditional Neural Language Models

Karl Stratos

Rutgers University

SLIDE 2

Language Models Considered So Far

$p_{Y|X}(y \mid x_{1:100})$

◮ Classical trigram models: $q_{Y|X}(y \mid x_{99}, x_{100})$

◮ Training: closed-form solution

◮ Log-linear models: $\mathrm{softmax}_y\big([w^\top \phi((x_{99}, x_{100}), y')]_{y'}\big)$

◮ Training: gradient descent on convex loss

◮ Neural models

◮ Feedforward: $\mathrm{softmax}_y(\mathrm{FF}([E_{x_{99}}, E_{x_{100}}]))$
◮ Recurrent: $\mathrm{softmax}_y(\mathrm{FF}(h(x_{1:99}), E_{x_{100}}))$
◮ Training: gradient descent on nonconvex loss
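As a concrete reference point, here is a minimal PyTorch sketch of the feedforward case above; the sizes and tensor names are illustrative, not from the slides.

```python
# Minimal sketch of the feedforward neural LM:
# p(y | x_99, x_100) = softmax_y(FF([E_{x_99}, E_{x_100}])).
import torch
import torch.nn as nn

V, d, d_hid = 10000, 64, 128               # vocab size, embedding dim, hidden dim
E = nn.Embedding(V, d)                     # one embedding vector per word type
FF = nn.Sequential(nn.Linear(2 * d, d_hid), nn.Tanh(), nn.Linear(d_hid, V))

x99, x100 = torch.tensor([7]), torch.tensor([42])    # ids of the two context words
logits = FF(torch.cat([E(x99), E(x100)], dim=-1))    # shape (1, V)
p_next = torch.softmax(logits, dim=-1)               # distribution over the next word
```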

SLIDE 3

Conditional Language Models

◮ Machine translation

And the programme has been implemented ⇒ Le programme a été mis en application

◮ Summarization

russian defense minister ivanov called sunday for the creation of a joint front for combating global terrorism ⇒ russia calls for joint front against terrorism

◮ Data-to-text generation

(Wiseman et al., 2017)

◮ Image captioning

the dog saw the cat

SLIDE 4

Encoder-Decoder Models

Much of machine learning is learning a mapping x → y where x and y are complicated structures. Encoder-decoder models are conditional models that handle this wide class of problems in two steps:

1. Encode the given input x using some architecture.
2. Decode the output y.

Training: again, minimize cross entropy:

$$\min_\theta \; \mathbb{E}_{(\text{input},\, \text{output}) \sim p_{Y|X}} \left[ -\ln q^\theta_{Y|X}(\text{output} \mid \text{input}) \right]$$

SLIDE 5

Agenda

1. MT
2. Attention in detail
3. Beam search

SLIDE 6

Machine Translation (MT)

◮ Goal: Translate text from one language to another.
◮ One of the oldest problems in artificial intelligence.

SLIDE 7

Some History

◮ Early ’90s: Rise of statistical MT (SMT)
◮ Exploit parallel text.

And the programme has been implemented
Le programme a été mis en application

◮ Infer word alignment (“IBM” models, Brown et al., 1993)

SLIDE 8

SMT: Huge Pipeline

1. Use IBM models to extract word alignment, phrase alignment (Koehn et al., 2003).
2. Use syntactic analyzers (e.g., parsers) to extract features and manipulate text (e.g., phrase re-ordering).
3. Use a separate language model to enforce fluency.
4. . . .

Multiple independently trained models patched together

◮ Really complicated, prone to error propagation

SLIDE 9

Rise of Neural MT

Started taking off around 2014

◮ Replaced the entire pipeline with a single model
◮ Called “end-to-end” training/prediction

Input: Le programme a été mis en application
Output: And the programme has been implemented

◮ Revolution in MT
◮ Better performance, way simpler system
◮ A hallmark of the recent neural domination in NLP
◮ Key: attention mechanism

SLIDE 10

Recap: Recurrent Neural Network (RNN)

◮ Always think of an RNN as a mapping $\phi : \mathbb{R}^d \times \mathbb{R}^{d'} \to \mathbb{R}^{d'}$

Input: an input vector $x \in \mathbb{R}^d$ and a state vector $h \in \mathbb{R}^{d'}$
Output: a new state vector $h' \in \mathbb{R}^{d'}$

◮ A left-to-right RNN processes an input sequence $x_1 \ldots x_m \in \mathbb{R}^d$ as

$$h_i = \phi(x_i, h_{i-1})$$

where $h_0$ is an initial state vector.

◮ Idea: $h_i$ is a representation of $x_i$ that has incorporated all inputs to the left.

$$h_i = \phi(x_i, \phi(x_{i-1}, \phi(x_{i-2}, \cdots \phi(x_1, h_0) \cdots)))$$

SLIDE 11

Variety 1: “Simple” RNN

◮ Parameters $U \in \mathbb{R}^{d' \times d}$ and $V \in \mathbb{R}^{d' \times d'}$

$$h_i = \tanh(U x_i + V h_{i-1})$$
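A direct transcription of this update in PyTorch, with illustrative dimensions and random parameters, might look as follows.

```python
# Simple RNN step: h_i = tanh(U x_i + V h_{i-1}).
import torch

d, d_prime = 64, 128
U = 0.1 * torch.randn(d_prime, d)
V = 0.1 * torch.randn(d_prime, d_prime)

def rnn_step(x, h):
    """Map (x, h) in R^d x R^d' to the new state h' in R^d'."""
    return torch.tanh(U @ x + V @ h)

# Left-to-right processing of x_1 ... x_m from the initial state h_0 = 0.
h = torch.zeros(d_prime)
for x in [torch.randn(d) for _ in range(5)]:
    h = rnn_step(x, h)    # h now summarizes all inputs seen so far
```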

SLIDE 12

Picture

SLIDE 13

Stacked Simple RNN

◮ Parameters $U^{(1)} \in \mathbb{R}^{d' \times d}$, $U^{(2)}, \ldots, U^{(L)} \in \mathbb{R}^{d' \times d'}$, and $V^{(1)}, \ldots, V^{(L)} \in \mathbb{R}^{d' \times d'}$

$$h^{(1)}_i = \tanh\big(U^{(1)} x_i + V^{(1)} h^{(1)}_{i-1}\big)$$
$$h^{(2)}_i = \tanh\big(U^{(2)} h^{(1)}_i + V^{(2)} h^{(2)}_{i-1}\big)$$
$$\vdots$$
$$h^{(L)}_i = \tanh\big(U^{(L)} h^{(L-1)}_i + V^{(L)} h^{(L)}_{i-1}\big)$$

◮ Think of it as a mapping $\phi : \mathbb{R}^d \times \mathbb{R}^{Ld'} \to \mathbb{R}^{Ld'}$:

$$\big(x_i, (h^{(1)}_{i-1}, \ldots, h^{(L)}_{i-1})\big) \mapsto \big(h^{(1)}_i, \ldots, h^{(L)}_i\big)$$
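A sketch of the stacked update, with illustrative dimensions: layer l consumes layer l−1's output at the same time step together with its own previous state.

```python
# Stacked simple RNN: h^(l)_i = tanh(U^(l) input + V^(l) h^(l)_{i-1}), where the
# input is x_i for l = 1 and h^(l-1)_i for l > 1.
import torch

d, d_prime, L = 64, 128, 3
U = [0.1 * torch.randn(d_prime, d if l == 0 else d_prime) for l in range(L)]
V = [0.1 * torch.randn(d_prime, d_prime) for l in range(L)]

def stacked_step(x, hs):
    """Map (x_i, (h^(1)_{i-1} ... h^(L)_{i-1})) to (h^(1)_i ... h^(L)_i)."""
    new_hs, inp = [], x
    for l in range(L):
        h = torch.tanh(U[l] @ inp + V[l] @ hs[l])
        new_hs.append(h)
        inp = h                       # this layer's output feeds the next layer
    return new_hs

hs = [torch.zeros(d_prime) for _ in range(L)]
for x in [torch.randn(d) for _ in range(5)]:
    hs = stacked_step(x, hs)
```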

SLIDE 14

Variety 2: Long Short-Term Memory (LSTM)

◮ Parameters $U^q, U^c, U^o \in \mathbb{R}^{d' \times d}$ and $V^q, V^c, V^o, W^q, W^o \in \mathbb{R}^{d' \times d'}$

$$q_i = \sigma(U^q x_i + V^q h_{i-1} + W^q c_{i-1})$$
$$c_i = (1 - q_i) \odot c_{i-1} + q_i \odot \tanh(U^c x_i + V^c h_{i-1})$$
$$o_i = \sigma(U^o x_i + V^o h_{i-1} + W^o c_i)$$
$$h_i = o_i \odot \tanh(c_i)$$

◮ Idea: “memory cells” $c_i$ can carry long-range information.

◮ What happens if $q_i$ is close to zero?

◮ Can be stacked as in the simple RNN.
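Transcribed directly, the update looks as follows. Note that this variant (a single gate $q_i$ coupling input and forget, with peephole terms) differs from torch.nn.LSTM, so the cell is written out by hand; dimensions are illustrative.

```python
# LSTM step as defined on this slide. If q_i is near zero, c_i ≈ c_{i-1}:
# the memory cell carries old information forward unchanged.
import torch

d, dp = 64, 128
Uq, Uc, Uo = (0.1 * torch.randn(dp, d) for _ in range(3))
Vq, Vc, Vo, Wq, Wo = (0.1 * torch.randn(dp, dp) for _ in range(5))

def lstm_step(x, h, c):
    q = torch.sigmoid(Uq @ x + Vq @ h + Wq @ c)             # coupled gate q_i
    c = (1 - q) * c + q * torch.tanh(Uc @ x + Vc @ h)       # memory cell c_i
    o = torch.sigmoid(Uo @ x + Vo @ h + Wo @ c)             # output gate o_i
    return o * torch.tanh(c), c                             # new (h_i, c_i)

h, c = torch.zeros(dp), torch.zeros(dp)
for x in [torch.randn(d) for _ in range(5)]:
    h, c = lstm_step(x, h, c)
```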

SLIDE 15

Translation Problem

◮ Vocabulary of the source language V^src:

V^src = {그, 개가, 보았다, 소식, 2017, 5월, . . .}

◮ Vocabulary of the target language V^trg:

V^trg = {the, dog, cat, 2021, May, . . .}

◮ Task. Given any sentence $x_1 \ldots x_m \in V^{\mathrm{src}}$, produce a corresponding translation $y_1 \ldots y_n \in V^{\mathrm{trg}}$.

개가 짖었다 ⇒ the dog barked

SLIDE 16

Evaluating Machine Translation

◮ $T$: human-translated sentences
◮ $\hat{T}$: machine-translated sentences
◮ $p_n$: precision of $n$-grams in $\hat{T}$ against $n$-grams in $T$ (sentence-wise)
◮ BLEU: controversial but popular scheme to automatically evaluate translation quality

$$\mathrm{BLEU} = \min\left(1, \frac{|\hat{T}|}{|T|}\right) \times \left(\prod_{n=1}^{4} p_n\right)^{1/4}$$
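A sketch of this sentence-level formula (real BLEU is usually computed at the corpus level with a smoother brevity penalty; this follows the slide's simplified version):

```python
# Slide's simplified BLEU: geometric mean of clipped n-gram precisions p_1..p_4,
# times the brevity factor min(1, |T_hat| / |T|).
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hyp, ref):
    """hyp: machine-translated tokens (T_hat); ref: human-translated tokens (T)."""
    score = 1.0
    for n in range(1, 5):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        clipped = sum(min(count, r[g]) for g, count in h.items())
        p_n = clipped / max(sum(h.values()), 1)    # precision of hyp n-grams in ref
        score *= p_n ** 0.25                       # contributes to (prod p_n)^(1/4)
    return min(1.0, len(hyp) / len(ref)) * score

hyp = "russia calls for joint front against terrorism".split()
ref = "russia calls for joint front against global terrorism".split()
print(round(bleu(hyp, ref), 3))   # ~0.736
```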

SLIDE 17

Translation Model: Conditional Language Model

A translation model defines a probability distribution $p(y_1 \ldots y_n \mid x_1 \ldots x_m)$ over all sentences $y_1 \ldots y_n \in V^{\mathrm{trg}}$, conditioned on any sentence $x_1 \ldots x_m \in V^{\mathrm{src}}$.

Goal: Design a good translation model, e.g.,

p(the dog barked | 개가 짖었다)
> p(the cat barked | 개가 짖었다)
> p(dog the barked | 개가 짖었다)
> p(oqc shgwqw#w 1g0 | 개가 짖었다)

How can we use an RNN to build a translation model?

SLIDE 18

Basic Encoder-Decoder Framework

Model parameters

◮ Vector $e_x \in \mathbb{R}^d$ for every $x \in V^{\mathrm{src}}$
◮ Vector $e_y \in \mathbb{R}^d$ for every $y \in V^{\mathrm{trg}} \cup \{*\}$
◮ Encoder RNN $\psi : \mathbb{R}^d \times \mathbb{R}^{d'} \to \mathbb{R}^{d'}$ for $V^{\mathrm{src}}$
◮ Decoder RNN $\phi : \mathbb{R}^d \times \mathbb{R}^{d'} \to \mathbb{R}^{d'}$ for $V^{\mathrm{trg}}$
◮ Feedforward $f : \mathbb{R}^{d'} \to \mathbb{R}^{|V^{\mathrm{trg}}|+1}$

Basic idea

1. Transform $x_1 \ldots x_m \in V^{\mathrm{src}}$ with $\psi$ into some representation $\xi$.
2. Build a language model with $\phi$ over $V^{\mathrm{trg}}$ conditioning on $\xi$.

SLIDE 19

Encoder

For $i = 1 \ldots m$,

$$h^\psi_i = \psi\big(e_{x_i}, h^\psi_{i-1}\big)$$

so that

$$h^\psi_m = \psi\big(e_{x_m}, \psi\big(e_{x_{m-1}}, \psi\big(e_{x_{m-2}}, \cdots \psi\big(e_{x_1}, h^\psi_0\big) \cdots\big)\big)\big)$$
SLIDE 20

Decoder

Initialize $h^\phi_0 = h^\psi_m$ and $y_0 = *$.

For $i = 1, 2, \ldots$, the decoder defines a probability distribution over $V^{\mathrm{trg}} \cup \{\mathrm{STOP}\}$ as

$$h^\phi_i = \phi\big(e_{y_{i-1}}, h^\phi_{i-1}\big)$$
$$p_\Theta(y \mid x_1 \ldots x_m, y_0 \ldots y_{i-1}) = \mathrm{softmax}_y\big(f(h^\phi_i)\big)$$

Probability of translation $y_1 \ldots y_n$ given $x_1 \ldots x_m$:

$$p_\Theta(y_1 \ldots y_n \mid x_1 \ldots x_m) = \left[\prod_{i=1}^{n} p_\Theta(y_i \mid x_1 \ldots x_m, y_0 \ldots y_{i-1})\right] \times p_\Theta(\mathrm{STOP} \mid x_1 \ldots x_m, y_0 \ldots y_n)$$
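A sketch of the full model in PyTorch, with nn.GRUCell standing in for ψ and φ (the slides leave the RNN variety open) and illustrative vocabulary sizes. By assumption here, the start symbol * takes index TRG_V in the target embedding table, and f's last output coordinate is the STOP logit.

```python
import torch
import torch.nn as nn

d, dp = 32, 64
SRC_V, TRG_V = 1000, 1200
START = TRG_V                        # embedding index of the start symbol *
STOP = TRG_V                         # index of the STOP logit in f's output

e_src = nn.Embedding(SRC_V, d)       # e_x for x in V_src
e_trg = nn.Embedding(TRG_V + 1, d)   # e_y for y in V_trg plus *
psi = nn.GRUCell(d, dp)              # encoder RNN
phi = nn.GRUCell(d, dp)              # decoder RNN
f = nn.Linear(dp, TRG_V + 1)         # f : R^d' -> R^{|V_trg| + 1}

def log_p(x_ids, y_ids):
    """log p(y_1 ... y_n | x_1 ... x_m), including the final STOP term."""
    h = torch.zeros(1, dp)
    for x in x_ids:                                   # encode: h becomes h^psi_m
        h = psi(e_src(torch.tensor([x])), h)
    total, prev = torch.zeros(()), START              # h^phi_0 = h^psi_m, y_0 = *
    for y in y_ids + [STOP]:
        h = phi(e_trg(torch.tensor([prev])), h)       # h^phi_i
        total = total + torch.log_softmax(f(h), dim=-1)[0, y]
        prev = y if y < TRG_V else START              # STOP only ends the loop
    return total

print(log_p([3, 14, 15], [9, 26]))                    # a toy source/target pair
```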

SLIDE 21

Slide Credit: Danqi Chen & Karthik Narasimhan

Encoder

(figure: an RNN encoder unrolled over the sentence “This cat is cute”, mapping word embeddings $x_t$ to hidden states $h_t$)

SLIDE 22

Slide Credit: Danqi Chen & Karthik Narasimhan

Encoder

(figure: the same encoder after its first step, computing $h_1$ from $x_1$ and $h_0$)

SLIDE 23

Slide Credit: Danqi Chen & Karthik Narasimhan

Encoder

(figure: the encoder after processing all four words, producing states $h_1 \ldots h_4$)

SLIDE 24

Slide Credit: Danqi Chen & Karthik Narasimhan

Encoder

(figure: the complete encoder; the final state, labeled $h^{\mathrm{enc}}$, is the encoded representation of “This cat is cute”)

SLIDE 25

Slide Credit: Danqi Chen & Karthik Narasimhan

Decoder

(figure: an RNN decoder initialized with $h^{\mathrm{enc}}$; from the input “&lt;s&gt; ce chat est mignon” it produces states $z_1 \ldots z_5$ and emits “ce chat est mignon &lt;e&gt;”)

SLIDE 26

Slide Credit: Danqi Chen & Karthik Narasimhan

Decoder

(figure: the same decoder emitting its first output $y_1$)

SLIDE 27

Slide Credit: Danqi Chen & Karthik Narasimhan

Decoder

(figure: the decoder emitting $y_1$ and $y_2$)

SLIDE 28

Slide Credit: Danqi Chen & Karthik Narasimhan

Decoder

(figure: the fully unrolled decoder emitting $y_1 \ldots y_5$)

A conditioned language model

SLIDE 29

Training

Given parallel text of $N$ sentence-translation pairs $(x^{(1)}, y^{(1)}), \ldots, (x^{(N)}, y^{(N)})$, find parameters $\Theta^*$ that maximize the log likelihood of the data:

$$\Theta^* \approx \arg\max_\Theta \sum_{i=1}^{N} \log p_\Theta\big(y^{(i)} \mid x^{(i)}\big)$$

In PyTorch: compute the negative log likelihood as `loss`, then call `loss.backward()` and `optim.step()`.

Training is not trivial due to exploding/vanishing gradients.
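A sketch of the corresponding loop, reusing the log_p function and module names from the encoder-decoder sketch above (`pairs` is a stand-in for the parallel text); gradient clipping is one standard guard against the exploding-gradient problem.

```python
import torch

modules = (e_src, e_trg, psi, phi, f)
params = [p for m in modules for p in m.parameters()]
optim = torch.optim.Adam(params, lr=1e-3)

for x_ids, y_ids in pairs:                       # pairs: iterable of (x^(i), y^(i))
    loss = -log_p(x_ids, y_ids)                  # cross entropy: -log p(y | x)
    optim.zero_grad()
    loss.backward()                              # the slide's loss.backward()
    torch.nn.utils.clip_grad_norm_(params, 5.0)  # guard against exploding gradients
    optim.step()                                 # the slide's optim.step()
```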

SLIDE 30

Sequence-to-Sequence (Seq2Seq) Learning (Sutskever et al., 2014)

Problems?

SLIDE 31

Decoder with Attention

◮ Instead of using one fixed vector to encode all of $x_1 \ldots x_m$, the decoder decides which words to pay attention to at every step.

◮ For $i = 0, 1, \ldots$,

$$p_\Theta(y \mid x_1 \ldots x_m, y_0 \ldots y_i) = \mathrm{softmax}_y\left(\mathrm{FF}\left(\sum_{j=1}^{m} \alpha_{i,j} h^\psi_j\right)\right)$$

SLIDE 32

Attention Weights

$$\sum_{j=1}^{m} \alpha_{i,j} h^\psi_j$$

◮ $\alpha_{i,j}$: importance of $x_j$ for predicting the $i$-th translation word
◮ Various options:

$$\beta_{i,j} = u^\top \tanh\big(W h^\phi_i + V h^\psi_j\big)$$
$$\beta_{i,j} = \big(h^\phi_i\big)^\top B h^\psi_j$$

Typically take a softmax to make them probabilities: $(\alpha_{i,1}, \ldots, \alpha_{i,m}) = \mathrm{softmax}(\beta_{i,1}, \ldots, \beta_{i,m})$
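Both scoring options, computed for one decoder state against all encoder states at once; dimensions and initializations are illustrative.

```python
import torch

dp, m = 64, 7
h_enc = torch.randn(m, dp)          # rows are h^psi_1 ... h^psi_m
h_dec = torch.randn(dp)             # h^phi_i
W, V, B = (0.1 * torch.randn(dp, dp) for _ in range(3))
u = torch.randn(dp)

# Option 1 (additive): beta_{i,j} = u^T tanh(W h^phi_i + V h^psi_j)
beta = torch.tanh(h_dec @ W.T + h_enc @ V.T) @ u        # shape (m,)

# Option 2 (bilinear): beta_{i,j} = (h^phi_i)^T B h^psi_j
beta_bilinear = h_enc @ (B.T @ h_dec)                   # shape (m,)

alpha = torch.softmax(beta, dim=0)  # attention weights, sum to 1
context = alpha @ h_enc             # sum_j alpha_{i,j} h^psi_j, fed to FF
```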

SLIDES 33-42

Encoder-Decoder with Attention: Slides by Sasha Rush (sequence of figures)

SLIDE 43

Input-Feeding Approach (Luong et al., 2015)

Feeds explicit alignment information back into the decoder. Very large computation graph; parallel computation across time steps during training is no longer possible.

SLIDE 44

Matrix Form of Input-Feeding Attention

Bank $X \in \mathbb{R}^{d \times T}$, hidden state $h_{t-1} \in \mathbb{R}^d$, current word $y_t \in V$ (worked out on the board)
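The slide works this out on the board, so the following is only one plausible rendering, assuming a bilinear score: with the bank in matrix form, all T attention scores come out of a single product.

```python
import torch

d, T = 64, 9
X = torch.randn(d, T)        # bank: one encoder state per column
h = torch.randn(d)           # decoder hidden state h_{t-1}
B = 0.1 * torch.randn(d, d)  # bilinear scoring parameters (an assumption here)

alpha = torch.softmax(X.T @ (B @ h), dim=0)   # all T scores in one matrix product
context = X @ alpha                           # weighted combination of the bank
```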

SLIDE 45

Greedy Decoding

Given sentence $x$: for $t = 1 \ldots T_{\max}$,

$$y_t \leftarrow \arg\max_{y \in V \cup \{\mathrm{STOP}\}} \log q(y \mid y_{<t}, x)$$

and stop if $y_t = \mathrm{STOP}$. Is this what we want? What we actually want is

$$y^* = \arg\max_{y \in V^+ :\, |y| \le T_{\max}} \log q(y \mid x)$$
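A sketch of the loop, assuming a hypothetical helper step_log_probs(prefix, x) that returns the model's log-distribution over V plus STOP for the next position:

```python
def greedy_decode(x, step_log_probs, T_max=50):
    """Pick the single best next word at each step; no lookahead."""
    prefix = []
    for _ in range(T_max):
        log_q = step_log_probs(prefix, x)   # dict: token -> log q(token | prefix, x)
        y = max(log_q, key=log_q.get)       # argmax over V union {STOP}
        if y == "STOP":
            break
        prefix.append(y)
    return prefix
```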

SLIDE 46

Beam Search: Idea

◮ Instead of enumerating $|V|^{T_{\max}}$ candidates, keep the $K$ (called beam size) highest scoring partial structures at every step. Only an approximation.

◮ Applicable to any decomposable score function
◮ Score function in seq2seq:

$$\mathrm{score}(y_1 \ldots y_t) = \log q(y_1 \ldots y_t \mid x) = \sum_{i=1}^{t} \log q(y_i \mid y_{<i}, x) = \mathrm{score}(y_1 \ldots y_{t-1}) + \log q(y_t \mid y_{<t}, x)$$

◮ Runtime: $O(|V| \, T_{\max} K^2 \log K)$
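A sketch with the same hypothetical step_log_probs helper as in the greedy sketch; finished hypotheses are set aside and, as the next slides discuss, compared under a length-normalized score.

```python
def beam_search(x, step_log_probs, K=5, T_max=50):
    beam = [(0.0, [])]                         # (log q(prefix | x), prefix)
    finished = []
    for _ in range(T_max):
        candidates = []
        for score, prefix in beam:
            log_q = step_log_probs(prefix, x)
            for y, lq in log_q.items():        # decomposable: just add log q(y | ...)
                if y == "STOP":
                    finished.append((score + lq, prefix))
                else:
                    candidates.append((score + lq, prefix + [y]))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = candidates[:K]                  # keep the K best partial structures
    finished.extend(beam)                      # hypotheses that never emitted STOP
    # length-normalize so shorter hypotheses are not favored automatically
    best = max(finished, key=lambda c: c[0] / max(len(c[1]), 1))
    return best[1]
```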

SLIDES 47-50

Beam Search (sequence of figures; the last one shows backtracking to recover the best hypothesis)

SLIDE 51

Beam Search: More Details

◮ Different hypotheses may stop at different time steps (place them aside)
◮ Continue beam search until
  ◮ all $K$ hypotheses stop, or
  ◮ we hit the max length limit $T_{\max}$
◮ Select the top hypothesis using the normalized likelihood score

$$\frac{1}{M} \sum_{i=1}^{M} \log q(y_i \mid y_{<i}, x)$$

Otherwise hypotheses get higher scores just for being shorter.

SLIDE 52

Copy Mechanism

$$q(y_t \mid y_{<t}, x) = \sum_{z \in \{0,1\}} q(y_t, z \mid y_{<t}, x)$$

where the binary latent variable $z$ switches between generating $y_t$ from the target vocabulary and copying it from the source sentence.
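A sketch of the mixture this marginalization yields. Here p_z, p_gen, and p_copy are illustrative stand-ins; in See et al. (2017) they are all computed from the decoder state and attention weights.

```python
def copy_mixture(y, p_z, p_gen, p_copy):
    """q(y_t | y_<t, x) = p(z=1) p_gen(y_t) + p(z=0) p_copy(y_t)."""
    return p_z * p_gen.get(y, 0.0) + (1 - p_z) * p_copy.get(y, 0.0)

p_gen = {"the": 0.5, "dog": 0.3, "barked": 0.2}  # distribution over V_trg
p_copy = {"dog": 0.6, "barked": 0.4}             # attention mass over source words
print(copy_mixture("dog", p_z=0.7, p_gen=p_gen, p_copy=p_copy))  # 0.7*0.3 + 0.3*0.6
```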

SLIDE 53

Picture (Credit: See et al., 2017)
