Natural Language Processing with Deep Learning CS224N/Ling284 Lecture 8: Machine Translation, Sequence-to-sequence and Attention Abigail See, Matthew Lamm
Announcements
• We are taking attendance today
  • Sign in with the TAs outside the auditorium
  • No need to get up now – there will be plenty of time to sign in after the lecture ends
  • For attendance policy special cases, see the Piazza post for clarification
• Assignment 4 content is covered today
  • Get started early! The model takes 4 hours to train!
• Mid-quarter feedback survey:
  • Will be sent out sometime in the next few days (watch Piazza)
  • Complete it for 0.5% credit
Overview
Today we will:
• Introduce a new task: Machine Translation
  …which is a major use-case of…
• …a new neural architecture: sequence-to-sequence
  …which is improved by…
• …a new neural technique: attention
Section 1: Pre-Neural Machine Translation
Machine Translation
Machine Translation (MT) is the task of translating a sentence x from one language (the source language) to a sentence y in another language (the target language).
x: L'homme est né libre, et partout il est dans les fers
y: Man is born free, but everywhere he is in chains
– Rousseau
1950s: Early Machine Translation
Machine Translation research began in the early 1950s.
• Russian → English (motivated by the Cold War!)
• Systems were mostly rule-based, using a bilingual dictionary to map Russian words to their English counterparts
1 minute video showing 1954 MT: https://youtu.be/K-HfpsHPmvw
1990s-2010s: Statistical Machine Translation
• Core idea: Learn a probabilistic model from data
• Suppose we're translating French → English.
• We want to find the best English sentence y, given French sentence x: argmax_y P(y|x)
• Use Bayes' Rule to break this down into two components to be learnt separately: argmax_y P(x|y) P(y)
  • Translation Model P(x|y): models how words and phrases should be translated (fidelity). Learnt from parallel data.
  • Language Model P(y): models how to write good English (fluency). Learnt from monolingual data.
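As a toy illustration of this noisy-channel decomposition, the sketch below scores a handful of candidate English translations by combining a translation model and a language model. All sentences and probabilities are invented for illustration; real SMT systems learn these from parallel and monolingual corpora.

```python
import math

# Toy noisy-channel SMT: pick y maximizing P(x|y) * P(y).
# Every probability below is a made-up illustrative number, not a learned value.

x = "il a m'entarté"

# Translation model P(x|y): fidelity, learnt from parallel data (here: invented).
translation_model = {
    "he hit me with a pie": 0.3,
    "he pied me": 0.4,
    "pie he me hit with a": 0.3,   # faithful words, terrible order
}

# Language model P(y): fluency, learnt from monolingual data (here: invented).
language_model = {
    "he hit me with a pie": 0.02,
    "he pied me": 0.001,
    "pie he me hit with a": 1e-9,  # the fluency prior rejects word salad
}

def score(y):
    # Work in log space, as real systems do, to avoid numerical underflow.
    return math.log(translation_model[y]) + math.log(language_model[y])

best = max(translation_model, key=score)
print(best)  # the candidate that best balances fidelity and fluency
```

Note how the language model vetoes the ungrammatical candidate even though its translation-model score is competitive; this division of labor is exactly why the two components can be learnt from different data sources.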
1990s-2010s: Statistical Machine Translation
• Question: How to learn the translation model P(x|y)?
• First, need a large amount of parallel data (e.g. pairs of human-translated French/English sentences)
[Image: The Rosetta Stone, with parallel text in Ancient Egyptian, Demotic, and Ancient Greek]
Learning alignment for SMT
• Question: How to learn the translation model P(x|y) from the parallel corpus?
• Break it down further: introduce a latent variable a into the model: P(x, a | y)
  where a is the alignment, i.e. the word-level correspondence between source sentence x and target sentence y
What is alignment?
Alignment is the correspondence between particular words in the translated sentence pair.
• Typological differences between languages lead to complicated alignments!
• Note: Some words have no counterpart
Examples from: "The Mathematics of Statistical Machine Translation: Parameter Estimation", Brown et al., 1993. http://www.aclweb.org/anthology/J93-2003
Alignment is complex
Alignment can be many-to-one
Alignment is complex
Alignment can be one-to-many
Alignment is complex
Some words are very fertile!
[Alignment diagram: "il a m'entarté" ↔ "he hit me with a pie"; "entarté" aligns to the whole phrase "hit … with a pie"]
This word ("entarté") has no single-word equivalent in English
Alignment is complex
Alignment can be many-to-many (phrase-level)
Learning alignment for SMT
• We learn P(x, a | y) as a combination of many factors, including:
  • Probability of particular words aligning (also depends on position in sentence)
  • Probability of particular words having a particular fertility (number of corresponding words)
  • etc.
• Alignments a are latent variables: they aren't explicitly specified in the data!
  • This requires special learning algorithms (like Expectation-Maximization) for learning the parameters of distributions with latent variables (covered in CS 228)
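The Expectation-Maximization idea can be sketched with a minimal IBM Model 1-style estimator. This is a deliberately simplified sketch (no NULL word, uniform alignment prior); the tiny French/English corpus and the iteration count are illustrative assumptions, not from the lecture.

```python
from collections import defaultdict

# Minimal IBM Model 1 EM sketch for learning word translation probabilities
# from parallel data, with alignments treated as latent variables.
corpus = [
    ("il a faim".split(),   "he is hungry".split()),
    ("il a soif".split(),   "he is thirsty".split()),
    ("elle a faim".split(), "she is hungry".split()),
]

# t[f][e] = P(source word f | target word e), initialized uniformly.
t = defaultdict(lambda: defaultdict(lambda: 1.0))

for _ in range(20):  # EM iterations
    count = defaultdict(lambda: defaultdict(float))
    total = defaultdict(float)
    for f_sent, e_sent in corpus:
        for f in f_sent:
            # E-step: expected (fractional) alignment counts under current t
            z = sum(t[f][e] for e in e_sent)
            for e in e_sent:
                c = t[f][e] / z
                count[f][e] += c
                total[e] += c
    # M-step: re-estimate translation probabilities from expected counts
    for f in count:
        for e in count[f]:
            t[f][e] = count[f][e] / total[e]

# After training, "faim" should align most strongly with "hungry".
best = max(t["faim"], key=t["faim"].get)
print(best)
```

Even on this three-sentence corpus, EM disambiguates the alignments: words that consistently co-occur across sentence pairs accumulate probability mass, exactly the "probability of particular words aligning" factor above.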
Decoding for SMT
• Question: How to compute the argmax over y of P(x|y) P(y) (Translation Model × Language Model)?
• We could enumerate every possible y and calculate the probability? → Too expensive!
• Answer: Impose strong independence assumptions in the model, use dynamic programming for globally optimal solutions (e.g. Viterbi algorithm).
• This process is called decoding
Viterbi: Decoding with Dynamic Programming
• Impose strong independence assumptions in the model
Source: "Speech and Language Processing", Chapter A, Jurafsky and Martin, 2019.
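As a sketch of the kind of dynamic programming involved, here is a generic Viterbi decoder over a toy two-state model. The states, observations, and probabilities are invented for illustration; the point is how the Markov independence assumption lets us find the globally optimal sequence without enumerating all paths.

```python
import math

# Generic Viterbi sketch: dynamic programming over a lattice of states,
# exploiting strong independence assumptions. Toy model, illustrative numbers.
states = ["A", "B"]
start = {"A": 0.6, "B": 0.4}
trans = {"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.4, "B": 0.6}}
emit = {"A": {"x": 0.5, "y": 0.5}, "B": {"x": 0.1, "y": 0.9}}

def viterbi(obs):
    # delta[s] = log-prob of the best path ending in state s
    delta = {s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states}
    back = []
    for o in obs[1:]:
        prev = delta
        delta, ptr = {}, {}
        for s in states:
            # Only the best predecessor matters, thanks to the Markov assumption
            best_p = max(states, key=lambda p: prev[p] + math.log(trans[p][s]))
            delta[s] = prev[best_p] + math.log(trans[best_p][s]) + math.log(emit[s][o])
            ptr[s] = best_p
        back.append(ptr)
    # Trace back the globally optimal state sequence
    last = max(states, key=delta.get)
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

print(viterbi(["x", "y", "y"]))  # globally best state sequence for these observations
```

The cost is linear in sequence length (times the squared number of states), instead of exponential, which is what makes decoding under these assumptions tractable.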
1990s-2010s: Statistical Machine Translation
• SMT was a huge research field
• The best systems were extremely complex
  • Hundreds of important details we haven't mentioned here
  • Systems had many separately-designed subcomponents
  • Lots of feature engineering
    • Need to design features to capture particular language phenomena
  • Required compiling and maintaining extra resources
    • Like tables of equivalent phrases
  • Lots of human effort to maintain
    • Repeated effort for each language pair!
Section 2: Neural Machine Translation
What is Neural Machine Translation?
• Neural Machine Translation (NMT) is a way to do Machine Translation with a single neural network
• The neural network architecture is called sequence-to-sequence (aka seq2seq) and it involves two RNNs.
Neural Machine Translation (NMT)
The sequence-to-sequence model
[Diagram: an Encoder RNN reads the source sentence "il a m'entarté"; its final hidden state, the encoding of the source sentence, provides the initial hidden state for the Decoder RNN, which generates the target sentence "he hit me with a pie <END>" one word at a time by taking an argmax at each step, starting from <START>.]
• Encoder RNN produces an encoding of the source sentence.
• Decoder RNN is a Language Model that generates the target sentence, conditioned on the encoding.
• Note: This diagram shows test-time behavior: the decoder output is fed in as the next step's input.
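A minimal forward-pass sketch of this encoder-decoder architecture, assuming a vanilla RNN, a tiny made-up vocabulary, and random untrained weights. A real NMT system would train these weights end-to-end and typically use LSTMs or GRUs, so the output here is meaningless; the sketch only shows the information flow.

```python
import numpy as np

# Minimal seq2seq sketch: vanilla RNN encoder + decoder, random untrained
# weights. Vocabulary, dimensions, and weights are illustrative assumptions.
rng = np.random.default_rng(0)
src_vocab = {"il": 0, "a": 1, "m'entarté": 2}
tgt_vocab = ["<START>", "he", "hit", "me", "with", "a", "pie", "<END>"]
h, d = 8, 4  # hidden size, embedding size

E_src = rng.normal(size=(len(src_vocab), d))
E_tgt = rng.normal(size=(len(tgt_vocab), d))
W_enc, U_enc = rng.normal(size=(h, h)), rng.normal(size=(h, d))
W_dec, U_dec = rng.normal(size=(h, h)), rng.normal(size=(h, d))
W_out = rng.normal(size=(len(tgt_vocab), h))

def encode(words):
    s = np.zeros(h)
    for w in words:  # final hidden state = encoding of the source sentence
        s = np.tanh(W_enc @ s + U_enc @ E_src[src_vocab[w]])
    return s

def greedy_decode(encoding, max_len=10):
    s, tok, out = encoding, 0, []  # decoder's initial hidden state = encoding
    for _ in range(max_len):
        s = np.tanh(W_dec @ s + U_dec @ E_tgt[tok])
        tok = int(np.argmax(W_out @ s))  # test time: take the most probable word
        if tgt_vocab[tok] == "<END>":
            break
        out.append(tgt_vocab[tok])      # ...and feed it in as the next input
    return out

out = greedy_decode(encode(["il", "a", "m'entarté"]))
print(out)
```

Note how the only bridge between the two RNNs is the single encoding vector: the decoder sees the source sentence exclusively through its initial hidden state, a bottleneck that attention (later in this lecture) is designed to relieve.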
Sequence-to-sequence is versatile!
• Sequence-to-sequence is useful for more than just MT
• Many NLP tasks can be phrased as sequence-to-sequence:
  • Summarization (long text → short text)
  • Dialogue (previous utterances → next utterance)
  • Parsing (input text → output parse as sequence)
  • Code generation (natural language → Python code)
Neural Machine Translation (NMT)
• The sequence-to-sequence model is an example of a Conditional Language Model.
  • Language Model because the decoder is predicting the next word of the target sentence y
  • Conditional because its predictions are also conditioned on the source sentence x
• NMT directly calculates P(y|x) = P(y_1|x) P(y_2|y_1, x) … P(y_T|y_1, …, y_{T-1}, x)
  • Each term is the probability of the next target word, given target words so far and the source sentence x
• Question: How to train an NMT system?
• Answer: Get a big parallel corpus…
Training a Neural Machine Translation system
• Loss: J = (1/T) Σ_{t=1}^{T} J_t, where J_t = −log P(y*_t | y*_{<t}, x) is the negative log probability of the t-th gold target word (e.g. J_1 = negative log prob of "he", J_5 = negative log prob of "with", J_7 = negative log prob of <END>)
[Diagram: the Encoder RNN reads the source sentence "il a m'entarté" (from corpus); the Decoder RNN is fed "<START> he hit me with a pie" (the target sentence from the corpus) and produces distributions ŷ_1 … ŷ_7, whose per-step losses J_1 + … + J_7 are averaged into J.]
• Seq2seq is optimized as a single system. Backpropagation operates "end-to-end".
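The per-sentence training loss can be sketched as follows. The per-step probabilities below stand in for the decoder's softmax probability of each gold word and are illustrative numbers only.

```python
import math

# Sketch of the NMT training loss: average negative log probability
# assigned by the decoder to each gold target word.
gold = ["he", "hit", "me", "with", "a", "pie", "<END>"]
# p[t] = decoder's predicted probability of the gold word at step t (invented)
p = [0.6, 0.4, 0.5, 0.3, 0.7, 0.5, 0.8]

J = sum(-math.log(p_t) for p_t in p) / len(gold)  # J = (1/T) Σ_t J_t
print(round(J, 4))
```

Each term J_t penalizes the decoder for assigning low probability to the correct next word, so minimizing J by backpropagation trains the encoder and decoder jointly, end-to-end.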
Greedy decoding
• We saw how to generate (or "decode") the target sentence by taking the argmax on each step of the decoder
[Diagram: starting from <START>, the decoder produces "he hit me with a pie <END>", feeding each argmax output in as the next step's input]
• This is greedy decoding (take the most probable word on each step)
• Problems with this method?
Problems with greedy decoding
• Greedy decoding has no way to undo decisions!
  • Input: il a m'entarté (he hit me with a pie)
  • → he ____
  • → he hit ____
  • → he hit a ____ (whoops! no going back now…)
• How to fix this?
Exhaustive search decoding
• Ideally we want to find a (length T) translation y that maximizes P(y|x)
• We could try computing all possible sequences y
  • This means that on each step t of the decoder, we're tracking V^t possible partial translations, where V is the vocab size
  • This O(V^T) complexity is far too expensive!
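A quick back-of-the-envelope calculation makes the gap concrete. The vocab size and sentence length below are typical illustrative numbers, not figures from the lecture.

```python
# Why exhaustive search decoding is infeasible: on step t we track V^t
# partial translations, V^T in total, versus V*T scores for greedy decoding.
V, T = 50_000, 10
exhaustive = V ** T   # sequences scored by exhaustive search: O(V^T)
greedy = V * T        # softmax entries scored by greedy decoding: O(V*T)
print(exhaustive > 10**46, greedy)
```

Roughly 10^47 candidate sequences for a ten-word sentence versus half a million greedy scores: the motivation for a middle ground between the two.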