Natural Language Processing with Deep Learning Sequence-to-sequence - PowerPoint PPT Presentation

Natural Language Processing with Deep Learning Sequence-to-sequence Models with Attention Navid Rekab-Saz navid.rekabsaz@jku.at Institute of Computational Perception Institute of Computational Perception

Agenda • Sequence-to-sequence models • Attention Mechanism • seq2seq with Attention Some slides are adopted from http://web.stanford.edu/class/cs224n/

Agenda • Sequence-to-sequence models • Attention Mechanism • seq2seq with Attention

Sequence in – sequence out! § Several NLP tasks are defined as: - Given the source sequence 𝑌 = {𝑦 (") , 𝑦 ($) , … , 𝑦 (%) } - Create/Generate the target sequence 𝑍 = {𝑧 (") , 𝑧 ($) , … , 𝑧 (&) } 𝑌 𝑍 Was mich nicht umbringt, macht What does not kill me makes Machine Translation mich stärker. me stronger. F. Nietzsche Then the woman went to to the RB DT NN VBD TO DT NN TO POS Tagging bank to deposit her cash . VB PRP$ NN . Semantic How tall is Stephansdom? [Heightof, ., Stephansdom] parsing 4

Sequence in – sequence out! § Tasks such as: - Machine Translation (source language → target language) - Summarization (long text → short text) - Dialogue (previous utterances → next utterance) - Code generation (natural language → SQL/Python code) - Named entity recognition - Dependency/semantic/ POS Parsing (input text → output parse as sequence) but also … - Image captioning (image → caption) - Automatic Speech Recognition (speech → manuscript) some elephants standing Image around a tall tree captioning 5

Machine Translation (MT) § A long-history (since 1950) § Statistical Machine Translation (1990-2010) – and also Neural MT – use large amount of parallel data to calculate: argmax 𝑄(𝑍|𝑌) ! § Challenges: - Alignment - Common sense - Idioms! - Low-resource language pairs https://en.wikipedia.org/wiki/Rosetta_Stone 6

Machine Translation (MT) – Evaluation § BLEU (Bilingual Evaluation Understudy) § BLEU computes a similarity score between the machine-written translation to one or several human- written translation(s), based on: - n -gram precision (usually for 1, 2, 3 and 4-grams) - plus a penalty for too-short machine translations § BLEU is precision-based, while ROUGE is recall-based Details of how to calculate BLEU: https://www.coursera.org/lecture/nlp-sequence-models/bleu-score- optional-kC2HD 7

Sequence-to-sequence model § Sequence-to-sequence model (aka seq2seq) is the neural network architecture to approach … - given the source sequence 𝑌 = {𝑦 (") , 𝑦 ($) , … , 𝑦 (%) } , - generate the target sequence 𝑍 = {𝑧 (") , 𝑧 ($) , … , 𝑧 (&) } § A seq2seq model first creates a model to estimate the conditional probability: 𝑄(𝑍|𝑌) § and then generates a new sequence 𝑍 ∗ by solving: 𝑍 ∗ = argmax 𝑄(𝑍|𝑌) ! 8

Seq2seq model § In fact, a seq2seq model is a conditional Language Model § It calculates the probability of the next word of target sequence, conditioned on the previous words of target sequence and the source sequence: for 𝑧 (") → 𝑄(𝑧 (") |𝑌) for 𝑧 ($) → 𝑄(𝑧 ($) |𝑌, 𝑧 (") ) … for 𝑧 (() → 𝑄(𝑧 (() |𝑌, 𝑧 (") , … , 𝑧 (()") ) … and for whole the target sequence: 𝑄 𝑍 𝑌 = 𝑄 𝑧 ! 𝑌 ×𝑄 𝑧 " 𝑌, 𝑧 ! × ⋯×𝑄 𝑧 # 𝑌, 𝑧 ! , … , 𝑧 #$! # 𝑄(𝑧 % |𝑌, 𝑧 ! , … , 𝑧 %$! ) 𝑄 𝑍 𝑌 = * %&! 9

Seq2seq – steps § Like Language Modeling, we … § … design a model that predicts the probabilities of the next words of the target sequence, one after each other (in auto- regressive fashion): 𝑄(𝑧 (() |𝑌, 𝑧 (") , … , 𝑧 (()") ) § We train the model by maximizing these probabilities for the correct next words, appearing in training data § At inference time (or during decoding), we use the model to generate new target sequences, that have high generation probabilities: 𝑄 𝑍 𝑌 10

Seq2seq with two RNNs EN ENCOD ODER ER DE DECODE DER 𝒛 (") : predicted probability distribution of ) Probability of appearance of the the next target word, given the source next target word: sequence and previous target words 𝑄 𝑧 ) 𝑌, 𝑧 ! , 𝑧 " , 𝑧 * (*) = 0 𝑧 + ! 𝒛 (') 1 𝑿 𝒕 ($) 𝒕 (&) 𝒊 ($) 𝒕 (') 𝒊 (&) 𝒊 (') 𝒊 (() … RNN ( RNN ( RNN ( RNN ' RNN ' RNN ' RNN ' 𝒛 ($) 𝒛 (&) 𝒛 (') 𝒚 (() 𝒚 ($) 𝒚 (&) 𝒚 (') 𝑽 𝑽 𝑽 𝑭 𝑭 𝑭 𝑭 𝑦 ($) 𝑦 (() 𝑧 ($) 𝑧 (&) 𝑧 (') 𝑦 (&) 𝑦 (') < sos > < eos > < sos > 11

Seq2seq with two RNNs – formulation § There are two sets of vocabularies - 𝕎 / is the set of vocabularies for source sequences - 𝕎 0 is the set of vocabularies for target sequences EN ENCODER ER § Encoder embedding - Encoder embeddings for source words ( 𝕎 / ) → 𝑭 - Embedding of the source word 𝑦 (.) (at time step 𝑚 ) → 𝒚 (1) § Encoder RNN: 𝒊 (1) = RNN(𝒊 1)" , 𝒚 (1) ) Parameters are shown in red 12

Seq2seq with two RNNs – formulation DE DECODE DER § Decoder embedding - Decoder embeddings at input for target words ( 𝕎 0 ) → 𝑽 - Embedding of the target word 𝑧 (2) (at time step 𝑢 ) → 𝒛 (2) § Decoder RNN 𝒕 (+) = RNN(𝒕 +,$ , 𝒛 (+) ) - The values of the last hidden state of the encoder RNN are passed to the initial hidden state of the decoder RNN: 𝒕 (-) = 𝒊 . Parameters are shown in red 13

Seq2seq with two RNNs – formulation DE DECODE DER § Decoder output prediction - Predicted probability distribution of words at the next time step: 𝒛 (+) = softmax 𝑿𝒕 + + 𝒄 ∈ ℝ 𝕎 ! 1 - Probability of the next target word (at time step 𝑢 + 1 ): 𝑄 𝑧 (+0$) 𝑌, 𝑧 $ , … , 𝑧 (+,$) , 𝑧 (+) =C (+) 𝑧 1 (#$%) Parameters are shown in red 14

Training Seq2seq § Training a seq2seq is the same as training a Language Model - We predict the next word, calculate loss, backpropagate, and update parameters - Since seq2seq is an end-to-end model, gradient flows from loss to all parameters (both RNNs and embeddings) § Loss function: Negative Log Likelihood of the predicted probability of the correct next target word 𝑧 23" ℒ (2) = − log < = − log 𝑄 𝑧 23" 𝑌, 𝑧 " , … , 𝑧 (2) 2 𝑧 4 $%& " & ℒ (2) & ∑ 25" § Overall loss: ℒ = 15

Training Seq2seq 𝒊 ($) 𝒊 (&) 𝒊 (') 𝒊 (() RNN ' RNN ' RNN ' RNN ' 𝒚 (() 𝒚 ($) 𝒚 (&) 𝒚 (') 𝑭 𝑭 𝑭 𝑭 𝑦 ($) 𝑧 ($) 𝑧 (&) 𝑧 (') 𝑦 (() 𝑦 (&) 𝑦 (') < sos > < sos > < eos > 16

Training Seq2seq NLL of 𝑧 (') ℒ ($) 𝒛 ($) 1 𝑿 𝒕 ($) 𝒊 ($) 𝒊 (&) 𝒊 (') 𝒊 (() RNN ( RNN ' RNN ' RNN ' RNN ' 𝒛 ($) 𝒚 (() 𝒚 ($) 𝒚 (&) 𝒚 (') 𝑽 𝑭 𝑭 𝑭 𝑭 𝑦 ($) 𝑧 ($) 𝑧 (&) 𝑧 (') 𝑦 (() 𝑦 (&) 𝑦 (') < sos > < sos > < eos > 17

Training Seq2seq NLL of 𝑧 (() ℒ ($) ℒ (&) 𝒛 ($) 𝒛 (&) 1 1 𝑿 𝑿 𝒕 ($) 𝒕 (&) 𝒊 ($) 𝒊 (&) 𝒊 (') 𝒊 (() RNN ( RNN ( RNN ' RNN ' RNN ' RNN ' 𝒛 ($) 𝒛 (&) 𝒚 (() 𝒚 ($) 𝒚 (&) 𝒚 (') 𝑽 𝑽 𝑭 𝑭 𝑭 𝑭 𝑦 ($) 𝑧 ($) 𝑧 (&) 𝑧 (') 𝑦 (() 𝑦 (&) 𝑦 (') < sos > < sos > < eos > 18

Training Seq2seq NLL of 𝑧 ()) ℒ ($) ℒ (&) ℒ (') 𝒛 ($) 𝒛 (&) 𝒛 (') 1 1 1 𝑿 𝑿 𝑿 𝒕 ($) 𝒕 (&) 𝒊 ($) 𝒕 (') 𝒊 (&) 𝒊 (') 𝒊 (() … RNN ( RNN ( RNN ( RNN ' RNN ' RNN ' RNN ' 𝒛 ($) 𝒛 (&) 𝒛 (') 𝒚 (() 𝒚 ($) 𝒚 (&) 𝒚 (') 𝑽 𝑽 𝑽 𝑭 𝑭 𝑭 𝑭 𝑦 ($) 𝑧 ($) 𝑧 (&) 𝑧 (') 𝑦 (() 𝑦 (&) 𝑦 (') < sos > < sos > < eos > 19

Parameters § Encoder embeddings 𝑭 → 𝕎 3 ×𝑒 3 § Encoder RNN parameters § Decoder embeddings 𝑽 → 𝕎 4 ×𝑒 5 § Decoder RNN parameters § Decoder output projection 𝑿 → 𝑒 6 × 𝕎 4 § bias terms are discarded 𝑒 " , 𝑒 # , 𝑒 $ are embedding dimensions § § RNNs can be an LSTM, GRU, or vanilla (Elman) RNN 20

Practical points: vocabs & embeddings § In Machine Translation - Encoder and decoder vocabularies belong to two different languages § In summarization - Encoder and decoder vocabularies are typically the same set (as they are in the same language) - Encoder and decoder embeddings ( 𝑭 and 𝑽 ) can also share parameters § Weight tying - can be done by sharing the parameters of 𝑽 and 𝑿 in decoder 21

Decoding Recap § After training, we use the model to generate a target sequence given the source sequence (decoding). We aim to find the optimal output sequence 𝑍 ∗ that maximizes 𝑄(𝑍|𝑌) : 𝑍 ∗ = argmax 𝑄(𝑍|𝑌) 0 where 𝑄(𝑍|𝑌) for any arbitrary 𝑍 = {𝑧 (!) , 𝑧 (") , … , 𝑧 (#) } is: - 𝑄(𝑧 * |𝑌, 𝑧 , , … , 𝑧 *., ) 𝑄 𝑍 𝑌 = ; *+, § Question: among all possible 𝑍 sequences, how can we find 𝑍 ∗ ? 22

Natural Language Processing with Deep Learning Sequence-to-sequence - PowerPoint PPT Presentation

Natural Language Processing with Deep Learning Sequence-to-sequence Models with Attention Navid Rekab-Saz navid.rekabsaz@jku.at Institute of Computational Perception Institute of Computational Perception Agenda Sequence-to-sequence

Protein Sequence Analysis Protein Sequence Analysis Protein sequence motifs Protein sequence

Sequence to Sequence models: Attention Models 1 Sequence-to-sequence modelling Problem:

Sequence to Sequence models: Attention Models 1 Sequence-to-sequence modelling Problem:

Sequence to Sequence models: Connectionist Temporal Classification 1 Sequence-to-sequence

Natural Language Processing: Part II Overview of Natural Language Processing (L90): ACS Lecture

Natural Language Processing: Part II Overview of Natural Language Processing (L90): ACS Lecture

Natural Language Processing: Part II Overview of Natural Language Processing (L90): ACS Lecture

Natural Language Processing: Part II Overview of Natural Language Processing (L90): ACS Paula

Natural Language Processing: Part II Overview of Natural Language Processing (L90): ACS Lecture

Information Extraction Industrial Natural Language Processing Industrial Natural Language

Deep learning for natural language processing A short primer on deep learning Benoit Favre <

Deep learning for natural language processing Introduction to natural language processing

Sequence-to-Sequence Learning with Neural Networks Ilya Sutskever, Oriol Vinyals, Quoc V. Le,

Natural Language Processing 1 Lecture 11: Language generation and summarisation Katia Shutova

Natural Language Processing 1 Lecture 10: Language generation and summarisation Katia Shutova

61A Lecture 30 Announcements Efficient Sequence Processing Sequence Operations 4 Sequence

A Mathematical View of Attention Models in Deep Learning Shuiwang Ji, Yaochen Xie Department of

1 Planning Conside r ations F unding is c o ming o ut Co uld b e re c e iving ne w q uic

CS480/680 Lecture 18: July 8, 2019 Recurrent and Recursive Neural Networks [GBC] Chap. 10

Tacotron: End-to-End TTS Tacotron [Wang 2017]: Neural Vocoder Convert spectrogram to

Investigating positional information in the Transformer Group 9 Outline Background &

Spatial Transformers in Feed-Forward Networks Max Jaederberg, Karen Simonyan, Andrew Zisserman

Some examples of issue- definitions and their relation to the politics of attention POLI 195

Transformer Networks Amir Ali Moinfar - M. Soleymani Deep Learning Sharif University of

Natural Language Processing with Deep Learning Sequence-to-sequence - PowerPoint PPT Presentation

Natural Language Processing with Deep Learning Sequence-to-sequence Models with Attention Navid Rekab-Saz navid.rekabsaz@jku.at Institute of Computational Perception Institute of Computational Perception Agenda Sequence-to-sequence

Protein Sequence Analysis Protein Sequence Analysis Protein sequence motifs Protein sequence

Sequence to Sequence models: Attention Models 1 Sequence-to-sequence modelling Problem:

Sequence to Sequence models: Attention Models 1 Sequence-to-sequence modelling Problem:

Sequence to Sequence models: Connectionist Temporal Classification 1 Sequence-to-sequence

Natural Language Processing: Part II Overview of Natural Language Processing (L90): ACS Lecture

Natural Language Processing: Part II Overview of Natural Language Processing (L90): ACS Lecture

Natural Language Processing: Part II Overview of Natural Language Processing (L90): ACS Lecture

Natural Language Processing: Part II Overview of Natural Language Processing (L90): ACS Paula

Natural Language Processing: Part II Overview of Natural Language Processing (L90): ACS Lecture

Information Extraction Industrial Natural Language Processing Industrial Natural Language

Deep learning for natural language processing A short primer on deep learning Benoit Favre &lt;

Deep learning for natural language processing Introduction to natural language processing

Sequence-to-Sequence Learning with Neural Networks Ilya Sutskever, Oriol Vinyals, Quoc V. Le,

Natural Language Processing 1 Lecture 11: Language generation and summarisation Katia Shutova

Natural Language Processing 1 Lecture 10: Language generation and summarisation Katia Shutova

61A Lecture 30 Announcements Efficient Sequence Processing Sequence Operations 4 Sequence

A Mathematical View of Attention Models in Deep Learning Shuiwang Ji, Yaochen Xie Department of

1 Planning Conside r ations F unding is c o ming o ut Co uld b e re c e iving ne w q uic

CS480/680 Lecture 18: July 8, 2019 Recurrent and Recursive Neural Networks [GBC] Chap. 10

Tacotron: End-to-End TTS Tacotron [Wang 2017]: Neural Vocoder Convert spectrogram to

Investigating positional information in the Transformer Group 9 Outline Background &amp;

Spatial Transformers in Feed-Forward Networks Max Jaederberg, Karen Simonyan, Andrew Zisserman

Some examples of issue- definitions and their relation to the politics of attention POLI 195

Transformer Networks Amir Ali Moinfar - M. Soleymani Deep Learning Sharif University of

Deep learning for natural language processing A short primer on deep learning Benoit Favre <

Investigating positional information in the Transformer Group 9 Outline Background &