CS224N/Ling284 Lecture 8: Machine Translation, Sequence-to-sequence and Attention – PowerPoint PPT Presentation



slide-1
SLIDE 1

Natural Language Processing with Deep Learning CS224N/Ling284

Lecture 8: Machine Translation, Sequence-to-sequence and Attention Abigail See

slide-2
SLIDE 2

Announcements

  • We are taking attendance today
  • Sign in with the TAs outside the auditorium
  • No need to get up now – there will be plenty of time to sign in after the lecture ends

  • For attendance policy special cases, see Piazza post for clarification
  • Assignment 4 content covered today
  • Get started early! The model takes 4 hours to train!
  • Mid-quarter feedback survey:
  • Will be sent out sometime in the next few days (watch Piazza).
  • Complete it for 0.5% credit

2

slide-3
SLIDE 3

Overview

Today we will:

  • Introduce a new task: Machine Translation
  • Introduce a new neural architecture: sequence-to-sequence
  • Introduce a new neural technique: attention

(Machine Translation is a major use-case of sequence-to-sequence; sequence-to-sequence is improved by attention)

3

slide-4
SLIDE 4

Section 1: Pre-Neural Machine Translation

4

slide-5
SLIDE 5

Machine Translation

Machine Translation (MT) is the task of translating a sentence x from one language (the source language) to a sentence y in another language (the target language).

x: L'homme est né libre, et partout il est dans les fers
y: Man is born free, but everywhere he is in chains

  – Rousseau

5

slide-6
SLIDE 6

1950s: Early Machine Translation

Machine Translation research began in the early 1950s.

  • Russian → English

(motivated by the Cold War!)

  • Systems were mostly rule-based, using a bilingual dictionary to map Russian words to their English counterparts

1 minute video showing 1954 MT: https://youtu.be/K-HfpsHPmvw

6

slide-7
SLIDE 7

1990s-2010s: Statistical Machine Translation

  • Core idea: Learn a probabilistic model from data
  • Suppose we’re translating French → English.
  • We want to find the best English sentence y, given French sentence x
  • Use Bayes Rule to break this down into two components to be learnt separately:

Translation Model: models how words and phrases should be translated (fidelity). Learnt from parallel data.
Language Model: models how to write good English (fluency). Learnt from monolingual data.
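Written out, the Bayes Rule decomposition described above is:

```latex
\hat{y} = \operatorname*{argmax}_{y} P(y \mid x)
        = \operatorname*{argmax}_{y} \underbrace{P(x \mid y)}_{\text{Translation Model}}\;\underbrace{P(y)}_{\text{Language Model}}
```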

7

slide-8
SLIDE 8

1990s-2010s: Statistical Machine Translation

  • Question: How to learn the translation model P(x|y)?
  • First, need large amount of parallel data (e.g. pairs of human-translated French/English sentences)

[image: The Rosetta Stone, bearing the same text in Ancient Egyptian, Demotic, and Ancient Greek]

8

slide-9
SLIDE 9

Learning alignment for SMT

  • Question: How to learn the translation model P(x|y) from the parallel corpus?
  • Break it down further: we actually want to consider P(x, a|y), where a is the alignment, i.e. word-level correspondence between French sentence x and English sentence y

9

slide-10
SLIDE 10

What is alignment?

Alignment is the correspondence between particular words in the translated sentence pair.

  • Note: Some words have no counterpart

[diagram: word alignment between words in f and words in e]

Example: “Japan shaken by two new quakes” ↔ “Le Japon secoué par deux nouveaux séismes”

“Le” is a spurious word: it has no counterpart in the English sentence.

10

Examples from: “The Mathematics of Statistical Machine Translation: Parameter Estimation", Brown et al, 1993. http://www.aclweb.org/anthology/J93-2003

slide-11
SLIDE 11

Alignment is complex

Alignment can be many-to-one

Example: “The balance was the territory of the aboriginal people” ↔ “Le reste appartenait aux autochtones”

Many-to-one alignments: multiple English words align to a single French word.

11

Examples from: “The Mathematics of Statistical Machine Translation: Parameter Estimation", Brown et al, 1993. http://www.aclweb.org/anthology/J93-2003

slide-12
SLIDE 12

Alignment is complex

Alignment can be one-to-many

Example: “And the program has been implemented” ↔ “Le programme a été mis en application”

“And” is a zero-fertility word: it is not translated. “implemented” aligns one-to-many to “mis en application”.

12

We call this a fertile word

Examples from: “The Mathematics of Statistical Machine Translation: Parameter Estimation", Brown et al, 1993. http://www.aclweb.org/anthology/J93-2003

slide-13
SLIDE 13

Alignment is complex

Some words are very fertile!

13

Example: “il a m’ entarté” ↔ “he hit me with a pie”

“entarté” has no single-word equivalent in English.

slide-14
SLIDE 14

Alignment is complex

Alignment can be many-to-many (phrase-level)

Example: “The poor don’t have any money” ↔ “Les pauvres sont démunis”

Many-to-many (phrase-level) alignment: “don’t have any money” ↔ “sont démunis”.

14

Examples from: “The Mathematics of Statistical Machine Translation: Parameter Estimation", Brown et al, 1993. http://www.aclweb.org/anthology/J93-2003

slide-15
SLIDE 15

Learning alignment for SMT

  • We learn P(x, a|y) as a combination of many factors, including:
  • Probability of particular words aligning (also depends on position in sentence)
  • Probability of particular words having particular fertility (number of corresponding words)
  • etc.

15

slide-16
SLIDE 16

Decoding for SMT

  • Question: How to compute the argmax over y of P(x|y) P(y) (Translation Model × Language Model)?
  • We could enumerate every possible y and calculate the probability? → Too expensive!
  • Answer: Use a heuristic search algorithm to search for the best translation, discarding hypotheses that are too low-probability
  • This process is called decoding

16

slide-17
SLIDE 17

Decoding for SMT

17

Source: ”Statistical Machine Translation", Chapter 6, Koehn, 2009. https://www.cambridge.org/core/books/statistical-machine-translation/94EADF9F680558E13BE759997553CDE5

slide-18
SLIDE 18

Decoding for SMT

18

Source: ”Statistical Machine Translation", Chapter 6, Koehn, 2009. https://www.cambridge.org/core/books/statistical-machine-translation/94EADF9F680558E13BE759997553CDE5

slide-19
SLIDE 19

1990s-2010s: Statistical Machine Translation

  • SMT was a huge research field
  • The best systems were extremely complex
  • Hundreds of important details we haven’t mentioned here
  • Systems had many separately-designed subcomponents
  • Lots of feature engineering
  • Need to design features to capture particular language phenomena
  • Require compiling and maintaining extra resources
  • Like tables of equivalent phrases
  • Lots of human effort to maintain
  • Repeated effort for each language pair!

19

slide-20
SLIDE 20

Section 2: Neural Machine Translation

20

slide-21
SLIDE 21

2014

(dramatic reenactment)

21

slide-22
SLIDE 22

2014

(dramatic reenactment)

22

slide-23
SLIDE 23

What is Neural Machine Translation?

  • Neural Machine Translation (NMT) is a way to do Machine Translation with a single neural network
  • The neural network architecture is called sequence-to-sequence (aka seq2seq) and it involves two RNNs.

23

slide-24
SLIDE 24

Neural Machine Translation (NMT)

The sequence-to-sequence model

[diagram: the Encoder RNN reads the source sentence (input) “il a m’ entarté”; starting from <START>, the Decoder RNN generates the target sentence (output) “he hit me with a pie <END>”, taking an argmax over the vocabulary on each step]

  • Encoder RNN produces an encoding of the source sentence.
  • The encoding of the source sentence provides the initial hidden state for the Decoder RNN.
  • Decoder RNN is a Language Model that generates the target sentence, conditioned on the encoding.
  • Note: This diagram shows test time behavior: decoder output is fed in as next step’s input

24

slide-25
SLIDE 25

Sequence-to-sequence is versatile!

  • Sequence-to-sequence is useful for more than just MT
  • Many NLP tasks can be phrased as sequence-to-sequence:
  • Summarization (long text → short text)
  • Dialogue (previous utterances → next utterance)
  • Parsing (input text → output parse as sequence)
  • Code generation (natural language → Python code)

25

slide-26
SLIDE 26

Neural Machine Translation (NMT)

  • The sequence-to-sequence model is an example of a Conditional Language Model.
  • Language Model because the decoder is predicting the next word of the target sentence y
  • Conditional because its predictions are also conditioned on the source sentence x
  • NMT directly calculates P(y|x):

P(y|x) = P(y1|x) P(y2|y1, x) P(y3|y1, y2, x) … P(yT|y1, …, yT−1, x)

(each term is the probability of the next target word, given target words so far and source sentence x)

  • Question: How to train a NMT system?
  • Answer: Get a big parallel corpus…

26

slide-27
SLIDE 27

Training a Neural Machine Translation system

[diagram: the Encoder RNN reads the source sentence (from corpus) “il a m’ entarté”; the Decoder RNN reads the target sentence (from corpus) “<START> he hit me with a pie” and produces predictions ŷ1, …, ŷ7, incurring losses J1, …, J7]

Seq2seq is optimized as a single system. Backpropagation operates “end-to-end”.

Jt = negative log probability of the true next word at step t
(e.g. J1 = negative log prob of “he”, J4 = negative log prob of “with”, J7 = negative log prob of <END>)

J = (1/T) ∑_{t=1}^{T} Jt = (J1 + J2 + J3 + J4 + J5 + J6 + J7) / T

27
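As a sanity check of the loss above, here is a tiny sketch; the per-step probabilities are made up for illustration (a real system would get them from the decoder's softmax):

```python
import math

# Probabilities the decoder assigned to the true next words
# "he", "hit", "me", "with", "a", "pie", "<END>" (made-up numbers)
p_true = [0.6, 0.5, 0.8, 0.4, 0.9, 0.7, 0.95]

# J_t = negative log probability of the true next word at step t
J_steps = [-math.log(p) for p in p_true]

# Total loss J: average of the per-step losses
J = sum(J_steps) / len(J_steps)
```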

slide-28
SLIDE 28

Greedy decoding

  • We saw how to generate (or “decode”) the target sentence by taking argmax on each step of the decoder
  • This is greedy decoding (take most probable word on each step)
  • Problems with this method?

[diagram: starting from <START>, repeated argmax steps produce “he hit me with a pie <END>”, each output fed in as the next input]

28
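A minimal sketch of greedy decoding over a hand-built toy distribution table (all words and probabilities here are invented for illustration; a real decoder would query the RNN's softmax at each step). Note how it commits to "a" and can never go back:

```python
# Toy conditional next-word distributions (invented for illustration)
TOY_LM = {
    ("<START>",): {"he": 0.9, "I": 0.1},
    ("<START>", "he"): {"hit": 0.8, "was": 0.2},
    # greedy commits to "a" here, even though "me" leads to the better sentence
    ("<START>", "he", "hit"): {"a": 0.55, "me": 0.45},
    ("<START>", "he", "hit", "a"): {"pie": 1.0},
    ("<START>", "he", "hit", "a", "pie"): {"<END>": 1.0},
}

def greedy_decode(lm, max_len=10):
    out = ["<START>"]
    while len(out) < max_len:
        dist = lm[tuple(out)]
        word = max(dist, key=dist.get)  # take most probable word on each step
        out.append(word)
        if word == "<END>":
            break
    return out
```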

slide-29
SLIDE 29

Problems with greedy decoding

  • Greedy decoding has no way to undo decisions!
  • Input: il a m’entarté

(he hit me with a pie)

  • → he ____
  • → he hit ____
  • → he hit a ____

(whoops! no going back now…)

  • How to fix this?

29

slide-30
SLIDE 30

Exhaustive search decoding

  • Ideally we want to find a (length T) translation y that maximizes P(y|x)
  • We could try computing all possible sequences y
  • This means that on each step t of the decoder, we’re tracking V^t possible partial translations, where V is vocab size
  • This O(V^T) complexity is far too expensive!

30

slide-31
SLIDE 31

Beam search decoding

  • Core idea: On each step of decoder, keep track of the k most probable partial translations (which we call hypotheses)
  • k is the beam size (in practice around 5 to 10)
  • A hypothesis y1, …, yt has a score which is its log probability:

score(y1, …, yt) = log PLM(y1, …, yt | x) = ∑_{i=1}^{t} log PLM(yi | y1, …, yi−1, x)

  • Scores are all negative, and higher score is better
  • We search for high-scoring hypotheses, tracking top k on each step
  • Beam search is not guaranteed to find optimal solution
  • But much more efficient than exhaustive search!

31
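A compact sketch of beam search with k = 2 over a toy conditional distribution table (the BEAM_LM entries are invented; a real system would score continuations with the decoder). Completed hypotheses are set aside, then the best is chosen with the length normalization discussed on a later slide:

```python
import math

# Toy conditional next-word distributions (invented for illustration)
BEAM_LM = {
    ("<START>",): {"he": 0.7, "I": 0.3},
    ("<START>", "he"): {"hit": 0.8, "struck": 0.2},
    ("<START>", "I"): {"was": 0.6, "got": 0.4},
    ("<START>", "he", "hit"): {"me": 0.9, "<END>": 0.1},
    ("<START>", "I", "was"): {"hit": 0.5, "struck": 0.5},
    ("<START>", "he", "hit", "me"): {"<END>": 1.0},
    ("<START>", "I", "was", "hit"): {"<END>": 1.0},
    ("<START>", "I", "was", "struck"): {"<END>": 1.0},
}

def beam_search(lm, k=2, max_len=6):
    beams = [(("<START>",), 0.0)]  # (tokens, sum of log-probs)
    completed = []
    for _ in range(max_len):
        candidates = []
        for toks, score in beams:
            for word, p in lm.get(toks, {}).items():
                candidates.append((toks + (word,), score + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for toks, score in candidates:
            if toks[-1] == "<END>":
                completed.append((toks, score))  # set completed hypotheses aside
            elif len(beams) < k:
                beams.append((toks, score))      # keep the k best partial hypotheses
        if not beams:
            break
    # pick the best completed hypothesis, normalizing score by length
    best = max(completed, key=lambda c: c[1] / (len(c[0]) - 1))
    return best[0]
```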

slide-32
SLIDE 32

Beam search decoding: example

Beam size = k = 2. Blue numbers = scores.

[diagram: beam search starts from <START>]

32

Calculate prob dist of next word

slide-33
SLIDE 33

Beam search decoding: example

Beam size = k = 2. Blue numbers = scores.

[diagram: <START> branches to “he” (score −0.7) and “I” (score −0.9)]

33

Take top k words and compute scores:
−0.7 = log PLM(he|<START>)
−0.9 = log PLM(I|<START>)

slide-34
SLIDE 34

Beam search decoding: example

Beam size = k = 2. Blue numbers = scores.

[diagram: “he” (−0.7) expands to “hit” and “struck”; “I” (−0.9) expands to “was” and “got”; the new scores are −1.6, −1.7, −1.8 and −2.9]

34

For each of the k hypotheses, find top k next words and calculate scores:
score = log PLM(hit|<START> he) + (−0.7)
score = log PLM(struck|<START> he) + (−0.7)
score = log PLM(was|<START> I) + (−0.9)
score = log PLM(got|<START> I) + (−0.9)

slide-35
SLIDE 35

Beam search decoding: example

Beam size = k = 2. Blue numbers = scores.

[diagram: of the k² = 4 expansions, “he hit” (−1.7) and “I was” (−1.6) have the highest scores]

35

Of these k² hypotheses, just keep k with highest scores

slide-36
SLIDE 36

Beam search decoding: example

Beam size = k = 2. Blue numbers = scores.

[diagram: “he hit” (−1.7) expands to “a” and “me”; “I was” (−1.6) expands to “hit” and “struck”; the new scores are −2.5, −2.8, −2.9 and −3.8]

36

For each of the k hypotheses, find top k next words and calculate scores:
score = log PLM(a|<START> he hit) + (−1.7)
score = log PLM(me|<START> he hit) + (−1.7)
score = log PLM(hit|<START> I was) + (−1.6)
score = log PLM(struck|<START> I was) + (−1.6)

slide-37
SLIDE 37

Beam search decoding: example

Beam size = k = 2. Blue numbers = scores.

[diagram: of the k² = 4 expansions, the two highest-scoring hypotheses are kept]

37

Of these k² hypotheses, just keep k with highest scores

slide-38
SLIDE 38

Beam search decoding: example

Beam size = k = 2. Blue numbers = scores.

[diagram: each of the two kept hypotheses expands with its top k next words (“tart”, “pie”, “with”, …); the new scores are −3.3, −3.4, −3.5 and −4.0]

38

For each of the k hypotheses, find top k next words and calculate scores

slide-39
SLIDE 39

Beam search decoding: example

Beam size = k = 2. Blue numbers = scores.

[diagram: of the k² = 4 expansions, the two highest-scoring hypotheses are kept]

39

Of these k² hypotheses, just keep k with highest scores

slide-40
SLIDE 40

Beam search decoding: example

Beam size = k = 2. Blue numbers = scores.

[diagram: each kept hypothesis expands again (“in”, “with”, “a”, …); the new scores are −3.7, −4.3, −4.5 and −4.8]

40

For each of the k hypotheses, find top k next words and calculate scores

slide-41
SLIDE 41

Beam search decoding: example

Beam size = k = 2. Blue numbers = scores.

[diagram: again, only the k = 2 highest-scoring hypotheses are kept]

41

Of these k² hypotheses, just keep k with highest scores

slide-42
SLIDE 42

Beam search decoding: example

Beam size = k = 2. Blue numbers = scores.

[diagram: the kept hypotheses expand to “pie” and “tart”; the new scores are −4.3, −4.6, −5.0 and −5.3]

42

For each of the k hypotheses, find top k next words and calculate scores

slide-43
SLIDE 43

Beam search decoding: example

Beam size = k = 2. Blue numbers = scores.

[diagram: the hypothesis ending in “pie” (score −4.3) has the highest score]

43

This is the top-scoring hypothesis!

slide-44
SLIDE 44

Beam search decoding: example

Beam size = k = 2. Blue numbers = scores.

[diagram: backtracking through the search tree recovers the full hypothesis “he hit me with a pie”]

44

Backtrack to obtain the full hypothesis

slide-45
SLIDE 45

Beam search decoding: stopping criterion

  • In greedy decoding, usually we decode until the model produces a <END> token
  • For example: <START> he hit me with a pie <END>
  • In beam search decoding, different hypotheses may produce <END> tokens on different timesteps
  • When a hypothesis produces <END>, that hypothesis is complete.
  • Place it aside and continue exploring other hypotheses via beam search.
  • Usually we continue beam search until:
  • We reach timestep T (where T is some pre-defined cutoff), or
  • We have at least n completed hypotheses (where n is pre-defined cutoff)

45

slide-46
SLIDE 46

Beam search decoding: finishing up

  • We have our list of completed hypotheses.
  • How to select top one with highest score?
  • Each hypothesis y1, …, yt on our list has a score:

score(y1, …, yt) = log PLM(y1, …, yt | x) = ∑_{i=1}^{t} log PLM(yi | y1, …, yi−1, x)

  • Problem with this: longer hypotheses have lower scores
  • Fix: Normalize by length. Use this to select top one instead:

(1/t) ∑_{i=1}^{t} log PLM(yi | y1, …, yi−1, x)

46
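A tiny numeric illustration of why this fix matters (the per-word log-probabilities are made up): the raw sum favors the short hypothesis, while the length-normalized score prefers the longer one.

```python
# Per-word log-probabilities for two completed hypotheses (invented numbers)
short = [-1.0, -1.2]                     # 2 words, raw score -2.2
longer = [-0.5, -0.6, -0.4, -0.7, -0.3]  # 5 words, raw score -2.5

raw = (sum(short), sum(longer))
normalized = (sum(short) / len(short), sum(longer) / len(longer))
```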

slide-47
SLIDE 47

Advantages of NMT

Compared to SMT, NMT has many advantages:

  • Better performance
  • More fluent
  • Better use of context
  • Better use of phrase similarities
  • A single neural network to be optimized end-to-end
  • No subcomponents to be individually optimized
  • Requires much less human engineering effort
  • No feature engineering
  • Same method for all language pairs

47

slide-48
SLIDE 48

Disadvantages of NMT?

Compared to SMT:

  • NMT is less interpretable
  • Hard to debug
  • NMT is difficult to control
  • For example, can’t easily specify rules or guidelines for

translation

  • Safety concerns!

48

slide-49
SLIDE 49

How do we evaluate Machine Translation?

BLEU (Bilingual Evaluation Understudy)

  • BLEU compares the machine-written translation to one or several human-written translation(s), and computes a similarity score based on:
  • n-gram precision (usually for 1, 2, 3 and 4-grams)
  • Plus a penalty for too-short system translations
  • BLEU is useful but imperfect
  • There are many valid ways to translate a sentence
  • So a good translation can get a poor BLEU score because it has low n-gram overlap with the human translation 
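A simplified single-reference BLEU sketch (real BLEU clips n-gram counts against multiple references and uses a corpus-level brevity penalty; this only shows the ingredients named above):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum((cand & ref).values())  # clipped n-gram matches
        precisions.append(overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0:
        return 0.0
    # geometric mean of the n-gram precisions
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # brevity penalty for too-short system translations
    brevity = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return brevity * geo_mean
```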

49

You’ll see BLEU in detail in Assignment 4!

Source: ” BLEU: a Method for Automatic Evaluation of Machine Translation", Papineni et al, 2002.

slide-50
SLIDE 50

MT progress over time

[chart: Cased BLEU score (y-axis, 5–25) by year (2013–2016) for Phrase-based SMT, Syntax-based SMT and Neural MT]

Source: http://www.meta-net.eu/events/meta-forum-2016/slides/09_sennrich.pdf

[Edinburgh En-De WMT newstest2013 Cased BLEU; NMT 2015 from U. Montréal]

50

slide-51
SLIDE 51

NMT: the biggest success story of NLP Deep Learning

Neural Machine Translation went from a fringe research activity in 2014 to the leading standard method in 2016

  • 2014: First seq2seq paper published
  • 2016: Google Translate switches from SMT to NMT
  • This is amazing!
  • SMT systems, built by hundreds of engineers over many years, outperformed by NMT systems trained by a handful of engineers in a few months

51

slide-52
SLIDE 52

So is Machine Translation solved?

  • Nope!
  • Many difficulties remain:
  • Out-of-vocabulary words
  • Domain mismatch between train and test data
  • Maintaining context over longer text
  • Low-resource language pairs

52

Further reading: “Has AI surpassed humans at translation? Not even close!” https://www.skynettoday.com/editorials/state_of_nmt

slide-53
SLIDE 53

So is Machine Translation solved?

  • Nope!
  • Using common sense is still hard

?

53

slide-54
SLIDE 54

So is Machine Translation solved?

  • Nope!
  • NMT picks up biases in training data

Source: https://hackernoon.com/bias-sexist-or-this-is-the-way-it-should-be-ce1f7c8c683c

Didn’t specify gender

54

slide-55
SLIDE 55

So is Machine Translation solved?

  • Nope!
  • Uninterpretable systems do strange things

Picture source: https://www.vice.com/en_uk/article/j5npeg/why-is-google-translate-spitting-out-sinister-religious-prophecies Explanation: https://www.skynettoday.com/briefs/google-nmt-prophecies

55

slide-56
SLIDE 56

NMT research continues

NMT is the flagship task for NLP Deep Learning

  • NMT research has pioneered many of the recent innovations of NLP Deep Learning
  • In 2019: NMT research continues to thrive
  • Researchers have found many, many improvements to the “vanilla” seq2seq NMT system we’ve presented today
  • But one improvement is so integral that it is the new vanilla…

ATTENTION

56

slide-57
SLIDE 57

Section 3: Attention

57

slide-58
SLIDE 58

Sequence-to-sequence: the bottleneck problem

[diagram: the Encoder RNN reads the source sentence (input) “il a m’ entarté” and produces an encoding of the source sentence; the Decoder RNN generates the target sentence (output) “he hit me with a pie <END>”]

Problems with this architecture?

58

slide-59
SLIDE 59

Sequence-to-sequence: the bottleneck problem

[diagram: same architecture; the single encoding of the source sentence needs to capture all information about the source sentence. Information bottleneck!]

59

slide-60
SLIDE 60

Attention

  • Attention provides a solution to the bottleneck problem.
  • Core idea: on each step of the decoder, use direct connection to the encoder to focus on a particular part of the source sequence
  • First we will show via diagram (no equations), then we will show with equations

60

slide-61
SLIDE 61

Sequence-to-sequence with attention

Encoder RNN Source sentence (input)

<START> il a m’ entarté

Decoder RNN. Attention scores: dot product of the decoder hidden state with each encoder hidden state.

61

slide-62
SLIDE 62

Sequence-to-sequence with attention

Encoder RNN Source sentence (input)

<START> il a m’ entarté

Decoder RNN. Attention scores: dot product of the decoder hidden state with each encoder hidden state.

62

slide-63
SLIDE 63

Sequence-to-sequence with attention

Encoder RNN Source sentence (input)

<START> il a m’ entarté

Decoder RNN. Attention scores: dot product of the decoder hidden state with each encoder hidden state.

63

slide-64
SLIDE 64

Sequence-to-sequence with attention

Encoder RNN Source sentence (input)

<START> il a m’ entarté

Decoder RNN. Attention scores: dot product of the decoder hidden state with each encoder hidden state.

64

slide-65
SLIDE 65

Sequence-to-sequence with attention

Encoder RNN Source sentence (input)

<START> il a m’ entarté

Decoder RNN. Attention scores → attention distribution: take softmax to turn the scores into a probability distribution. On this decoder timestep, we’re mostly focusing on the first encoder hidden state (“he”).

65

slide-66
SLIDE 66

Sequence-to-sequence with attention

Encoder RNN Source sentence (input)

<START> il a m’ entarté

Decoder RNN. Attention scores → attention distribution → attention output.

Use the attention distribution to take a weighted sum of the encoder hidden states. The attention output mostly contains information from the hidden states that received high attention.

66

slide-67
SLIDE 67

Sequence-to-sequence with attention

Encoder RNN Source sentence (input)

<START> il a m’ entarté

Decoder RNN. Attention scores → attention distribution → attention output.

Concatenate attention output with decoder hidden state, then use to compute ŷ1 as before

ŷ1 = he

67

slide-68
SLIDE 68

Sequence-to-sequence with attention

Encoder RNN Source sentence (input)

<START> il a m’ entarté

Decoder RNN. Attention scores → attention distribution → attention output.

ŷ2 = hit (the previous output “he” is fed in as the decoder input)

68

Sometimes we take the attention output from the previous step, and also feed it into the decoder (along with the usual decoder input). We do this in Assignment 4.

slide-69
SLIDE 69

Sequence-to-sequence with attention

Encoder RNN Source sentence (input)

<START> il a m’ entarté

Decoder RNN. Attention scores → attention distribution → attention output.

Outputs so far: he hit; ŷ3 = me

69

slide-70
SLIDE 70

Sequence-to-sequence with attention

Encoder RNN Source sentence (input)

<START> il a m’ entarté

Decoder RNN. Attention scores → attention distribution → attention output.

Outputs so far: he hit me; ŷ4 = with

70

slide-71
SLIDE 71

Sequence-to-sequence with attention

Encoder RNN Source sentence (input)

<START> il a m’ entarté

Decoder RNN. Attention scores → attention distribution → attention output.

Outputs so far: he hit me with; ŷ5 = a

71

slide-72
SLIDE 72

Sequence-to-sequence with attention

Encoder RNN Source sentence (input)

<START> il a m’ entarté

Decoder RNN. Attention scores → attention distribution → attention output.

Outputs so far: he hit me with a; ŷ6 = pie

72

slide-73
SLIDE 73

Attention: in equations

  • We have encoder hidden states h1, …, hN ∈ ℝ^h
  • On timestep t, we have decoder hidden state st ∈ ℝ^h
  • We get the attention scores e^t for this step: e^t = [st·h1, …, st·hN] ∈ ℝ^N
  • We take softmax to get the attention distribution α^t for this step (this is a probability distribution and sums to 1): α^t = softmax(e^t) ∈ ℝ^N
  • We use α^t to take a weighted sum of the encoder hidden states to get the attention output at: at = ∑_{i=1}^{N} αᵢ^t hᵢ ∈ ℝ^h
  • Finally we concatenate the attention output at with the decoder hidden state st and proceed as in the non-attention seq2seq model: [at; st] ∈ ℝ^{2h}

73
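The equations above as a small NumPy sketch (dimensions are arbitrary; this assumes basic dot-product attention, so encoder and decoder hidden sizes match):

```python
import numpy as np

def dot_product_attention(s_t, H):
    """s_t: decoder hidden state, shape (h,). H: encoder hidden states, shape (N, h)."""
    e = H @ s_t                          # attention scores e^t, shape (N,)
    e = e - e.max()                      # stabilize the softmax
    alpha = np.exp(e) / np.exp(e).sum()  # attention distribution (sums to 1)
    a = alpha @ H                        # weighted sum of encoder hidden states
    return np.concatenate([a, s_t]), alpha  # [attention output; decoder state]

# Toy example: 3 encoder states of size 2; the query points at the first one
H = np.array([[10.0, 0.0], [0.0, 10.0], [0.0, 0.0]])
s_t = np.array([1.0, 0.0])
out, alpha = dot_product_attention(s_t, H)
```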

slide-74
SLIDE 74

Attention is great

  • Attention significantly improves NMT performance
  • It’s very useful to allow decoder to focus on certain parts of the source
  • Attention solves the bottleneck problem
  • Attention allows decoder to look directly at source; bypass bottleneck
  • Attention helps with vanishing gradient problem
  • Provides shortcut to faraway states
  • Attention provides some interpretability
  • By inspecting attention distribution, we can see what the decoder was focusing on
  • We get (soft) alignment for free!
  • This is cool because we never explicitly trained an alignment system
  • The network just learned alignment by itself

74

[figure: attention alignment matrix between “he hit me with a pie” and “il a m’ entarté”]

slide-75
SLIDE 75

Attention is a general Deep Learning technique

  • We’ve seen that attention is a great way to improve the sequence-to-sequence model for Machine Translation.
  • However: You can use attention in many architectures (not just seq2seq) and many tasks (not just MT)
  • More general definition of attention:
  • Given a set of vector values, and a vector query, attention is a technique to compute a weighted sum of the values, dependent on the query.
  • We sometimes say that the query attends to the values.
  • For example, in the seq2seq + attention model, each decoder hidden state (query) attends to all the encoder hidden states (values).

75

slide-76
SLIDE 76

Attention is a general Deep Learning technique

More general definition of attention: Given a set of vector values, and a vector query, attention is a technique to compute a weighted sum of the values, dependent on the query.

76

Intuition:

  • The weighted sum is a selective summary of the information contained in the values, where the query determines which values to focus on.
  • Attention is a way to obtain a fixed-size representation of an arbitrary set of representations (the values), dependent on some other representation (the query).

slide-77
SLIDE 77

There are several attention variants

  • We have some values h1, …, hN ∈ ℝ^{d1} and a query s ∈ ℝ^{d2}
  • Attention always involves:
  • 1. Computing the attention scores e ∈ ℝ^N (there are multiple ways to do this)
  • 2. Taking softmax to get attention distribution ⍺: ⍺ = softmax(e) ∈ ℝ^N
  • 3. Using attention distribution to take weighted sum of values: a = ∑_{i=1}^{N} ⍺ᵢ hᵢ ∈ ℝ^{d1}, thus obtaining the attention output a (sometimes called the context vector)

77

slide-78
SLIDE 78

Attention variants

There are several ways you can compute the scores e ∈ ℝ^N from h1, …, hN ∈ ℝ^{d1} and s ∈ ℝ^{d2}:

  • Basic dot-product attention: eᵢ = s·hᵢ ∈ ℝ
  • Note: this assumes d1 = d2
  • This is the version we saw earlier
  • Multiplicative attention: eᵢ = sᵀW hᵢ ∈ ℝ
  • Where W ∈ ℝ^{d2×d1} is a weight matrix
  • Additive attention: eᵢ = vᵀ tanh(W1 hᵢ + W2 s) ∈ ℝ
  • Where W1 ∈ ℝ^{d3×d1}, W2 ∈ ℝ^{d3×d2} are weight matrices and v ∈ ℝ^{d3} is a weight vector.
  • d3 (the attention dimensionality) is a hyperparameter

78

More information: “Deep Learning for NLP Best Practices”, Ruder, 2017. http://ruder.io/deep-learning-nlp-best-practices/index.html#attention “Massive Exploration of Neural Machine Translation Architectures”, Britz et al, 2017, https://arxiv.org/pdf/1703.03906.pdf

You’ll think about the relative advantages/disadvantages of these in Assignment 4!

slide-79
SLIDE 79

Summary of today’s lecture

  • We learned some history of Machine Translation (MT)
  • Since 2014, Neural MT rapidly replaced intricate Statistical MT
  • Sequence-to-sequence is the architecture for NMT (uses 2 RNNs)
  • Attention is a way to focus on particular parts of the input
  • Improves sequence-to-sequence a lot!

79