
Natural Language Processing with Deep Learning: Sequence-to-sequence Models with Attention - PowerPoint PPT Presentation



  1. Natural Language Processing with Deep Learning: Sequence-to-sequence Models with Attention. Navid Rekab-Saz, navid.rekabsaz@jku.at, Institute of Computational Perception.

  2. Agenda • Sequence-to-sequence models • Attention Mechanism • seq2seq with Attention. Some slides are adapted from http://web.stanford.edu/class/cs224n/

  3. Agenda • Sequence-to-sequence models • Attention Mechanism • seq2seq with Attention

  4. Sequence in – sequence out!
     § Several NLP tasks are defined as:
       - Given the source sequence Y = {y^(1), y^(2), …, y^(N)}
       - Create/generate the target sequence Z = {z^(1), z^(2), …, z^(M)}
     § Examples (Y → Z):
       - Machine translation: "Was mich nicht umbringt, macht mich stärker." (F. Nietzsche) → "What does not kill me makes me stronger."
       - POS tagging: "Then the woman went to the bank to deposit her cash ." → RB DT NN VBD TO DT NN TO VB PRP$ NN .
       - Semantic parsing: "How tall is Stephansdom?" → [Heightof, ., Stephansdom]
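To make the notation concrete, here is a toy Python view of one machine-translation training pair (not from the slides; whitespace tokenization is a simplification, real systems use subword tokenizers):

```python
# Toy illustration of one (Y, Z) pair from the machine-translation example above.
Y = "Was mich nicht umbringt , macht mich stärker .".split()   # source sequence, length N
Z = "What does not kill me makes me stronger .".split()        # target sequence, length M
print(len(Y), len(Z))   # N and M need not be equal in general
```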

  5. Sequence in – sequence out!
     § Tasks such as:
       - Machine Translation (source language → target language)
       - Summarization (long text → short text)
       - Dialogue (previous utterances → next utterance)
       - Code generation (natural language → SQL/Python code)
       - Named entity recognition
       - Dependency/semantic/POS parsing (input text → output parse as a sequence)
     § but also …
       - Image captioning (image → caption), e.g. "some elephants standing around a tall tree"
       - Automatic Speech Recognition (speech → transcript)

  6. Machine Translation (MT)
     § A long history (since the 1950s)
     § Statistical Machine Translation (1990–2010) – and also Neural MT – use large amounts of parallel data to calculate: argmax_Z Q(Z|Y)
     § Challenges:
       - Alignment
       - Common sense
       - Idioms!
       - Low-resource language pairs
     https://en.wikipedia.org/wiki/Rosetta_Stone

  7. Machine Translation (MT) – Evaluation
     § BLEU (Bilingual Evaluation Understudy)
     § BLEU computes a similarity score between the machine-written translation and one or several human-written translation(s), based on:
       - n-gram precision (usually for 1-, 2-, 3- and 4-grams)
       - plus a penalty for too-short machine translations
     § BLEU is precision-based, while ROUGE is recall-based
     Details of how to calculate BLEU: https://www.coursera.org/lecture/nlp-sequence-models/bleu-score-optional-kC2HD
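The slides only reference an external walkthrough; as an illustration, here is a minimal stdlib-only Python sketch of the recipe above (clipped n-gram precision plus a brevity penalty). Real toolkits such as sacreBLEU add tokenization and smoothing details that are omitted here:

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU sketch with a single reference; both are token lists."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
        ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
        overlap = sum((cand & ref).values())           # clipped n-gram matches
        total = max(sum(cand.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)  # tiny floor avoids log(0) in this sketch
    # brevity penalty: punish machine translations shorter than the reference
    bp = 1.0 if len(candidate) > len(reference) else math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

print(bleu("what does not kill me makes me stronger".split(),
           "what does not kill me makes me stronger".split()))   # 1.0 for a perfect match
```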

  8. Sequence-to-sequence model
     § A sequence-to-sequence model (aka seq2seq) is the neural network architecture used to approach such tasks:
       - given the source sequence Y = {y^(1), y^(2), …, y^(N)},
       - generate the target sequence Z = {z^(1), z^(2), …, z^(M)}
     § A seq2seq model first creates a model to estimate the conditional probability: Q(Z|Y)
     § and then generates a new sequence Z* by solving: Z* = argmax_Z Q(Z|Y)

  9. Seq2seq model
     § In fact, a seq2seq model is a conditional Language Model
     § It calculates the probability of the next word of the target sequence, conditioned on the previous words of the target sequence and on the source sequence:
       - for z^(1) → Q(z^(1) | Y)
       - for z^(2) → Q(z^(2) | Y, z^(1))
       - …
       - for z^(u) → Q(z^(u) | Y, z^(1), …, z^(u−1))
     § … and for the whole target sequence:
       Q(Z|Y) = Q(z^(1) | Y) × Q(z^(2) | Y, z^(1)) × ⋯ × Q(z^(M) | Y, z^(1), …, z^(M−1))
       Q(Z|Y) = ∏_{u=1}^{M} Q(z^(u) | Y, z^(1), …, z^(u−1))
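As an illustration of the chain rule above (not from the slides), a small Python sketch that scores a whole target sequence by summing per-step log-probabilities; `step_prob` is a hypothetical stand-in for the model's Q(z^(u) | Y, z^(1), …, z^(u−1)):

```python
import math

def sequence_log_prob(step_prob, Y, Z):
    """log Q(Z|Y) = sum_u log Q(z^(u) | Y, z^(1), ..., z^(u-1)).

    step_prob(Y, prefix, z) is a hypothetical callable returning the model's
    probability of the next target word z given the source Y and the target
    prefix generated so far. Summing logs avoids underflow of the raw product.
    """
    total = 0.0
    for u in range(len(Z)):
        total += math.log(step_prob(Y, Z[:u], Z[u]))
    return total
```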

  10. Seq2seq – steps
     § Like Language Modeling, we …
     § … design a model that predicts the probabilities of the next words of the target sequence, one after the other (in an autoregressive fashion): Q(z^(u) | Y, z^(1), …, z^(u−1))
     § We train the model by maximizing these probabilities for the correct next words appearing in the training data
     § At inference time (or during decoding), we use the model to generate new target sequences that have a high generation probability Q(Z|Y)

  11. Seq2seq with two RNNs
      [Figure: ENCODER / DECODER. The encoder RNN reads the source words y^(1) … y^(N) through the embeddings F and produces the hidden states i^(1) … i^(N); the decoder RNN starts from <sos>, reads the target words through the embeddings V and produces the hidden states t^(u), which the output projection X and a softmax turn into ẑ^(u): the predicted probability distribution of the next target word, given the source sequence and the previous target words. The probability of appearance of the next target word is the corresponding entry of this distribution, e.g. Q(z^(4) | Y, z^(1), z^(2), z^(3)) = ẑ^(3)_{z^(4)}.]

  12. Seq2seq with two RNNs – formulation
     § There are two vocabularies
       - 𝕎_enc is the vocabulary of the source sequences
       - 𝕎_dec is the vocabulary of the target sequences
     ENCODER
     § Encoder embedding
       - Encoder embedding matrix for the source words (𝕎_enc) → F
       - Embedding of the source word y^(m) (at time step m) → 𝒚^(m)
     § Encoder RNN: 𝒊^(m) = RNN(𝒊^(m−1), 𝒚^(m))
     (Parameters are shown in red in the original slides.)
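As an illustration (not part of the slides), a minimal PyTorch sketch of this encoder, using a GRU as one possible choice of RNN and hypothetical names (`src_vocab_size`, `emb_dim`, `hidden_dim`):

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Encoder embedding F plus encoder RNN: i^(m) = RNN(i^(m-1), y^(m))."""
    def __init__(self, src_vocab_size, emb_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(src_vocab_size, emb_dim)    # F
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)  # one choice of RNN

    def forward(self, src_ids):            # src_ids: (batch, N) word indices
        emb = self.embedding(src_ids)      # y^(1..N): (batch, N, emb_dim)
        states, last = self.rnn(emb)       # i^(1..N) and the last state i^(N)
        return states, last                # last: (1, batch, hidden_dim)
```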

  13. Seq2seq with two RNNs – formulation
     DECODER
     § Decoder embedding
       - Decoder embedding matrix at the input, for the target words (𝕎_dec) → V
       - Embedding of the target word z^(u) (at time step u) → 𝒛^(u)
     § Decoder RNN: 𝒕^(u) = RNN(𝒕^(u−1), 𝒛^(u))
       - The last hidden state of the encoder RNN is passed to the decoder RNN as its initial hidden state: 𝒕^(0) = 𝒊^(N)
     (Parameters are shown in red in the original slides.)
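Continuing the illustrative PyTorch sketch (again with assumed names), the decoder embedding and decoder RNN, initialized with the encoder's last hidden state:

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Decoder embedding V plus decoder RNN: t^(u) = RNN(t^(u-1), z^(u))."""
    def __init__(self, trg_vocab_size, emb_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(trg_vocab_size, emb_dim)    # V
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)  # decoder RNN

    def forward(self, trg_ids, enc_last):       # trg_ids: (batch, M); enc_last: i^(N)
        emb = self.embedding(trg_ids)           # z^(1..M): (batch, M, emb_dim)
        states, last = self.rnn(emb, enc_last)  # initial hidden state t^(0) = i^(N)
        return states, last                     # t^(1..M) and the final decoder state
```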

  14. Seq2seq with two RNNs – formulation
     DECODER
     § Decoder output prediction
       - Predicted probability distribution over the words at the next time step: ẑ^(u) = softmax(X 𝒕^(u) + 𝒄) ∈ ℝ^|𝕎_dec|
       - Probability of the next target word (at time step u+1): Q(z^(u+1) | Y, z^(1), …, z^(u−1), z^(u)) = ẑ^(u)_{z^(u+1)}
     (Parameters are shown in red in the original slides.)
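A small standalone PyTorch illustration of this output step, with toy sizes and random tensors purely to show the shapes; the hypothetical `out` layer plays the role of X and c:

```python
import torch
import torch.nn as nn

hidden_dim, trg_vocab_size = 64, 1000                 # assumed toy sizes
dec_states = torch.randn(2, 5, hidden_dim)            # stand-in for t^(1..5), batch of 2
next_ids = torch.randint(0, trg_vocab_size, (2, 5))   # indices of the correct next words

out = nn.Linear(hidden_dim, trg_vocab_size)           # X and the bias c
z_hat = torch.softmax(out(dec_states), dim=-1)        # ẑ^(u): a distribution over 𝕎_dec
# Q(z^(u+1) | Y, z^(1..u)) is the entry of ẑ^(u) at the index of the correct next word
probs = z_hat.gather(-1, next_ids.unsqueeze(-1)).squeeze(-1)   # shape (2, 5)
print(probs.shape)
```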

  15. Training Seq2seq
     § Training a seq2seq model is the same as training a Language Model
       - We predict the next word, calculate the loss, backpropagate, and update the parameters
       - Since seq2seq is an end-to-end model, the gradient flows from the loss to all parameters (both RNNs and the embeddings)
     § Loss function: the Negative Log Likelihood (NLL) of the predicted probability of the correct next target word z^(u+1):
       ℒ^(u) = −log ẑ^(u)_{z^(u+1)} = −log Q(z^(u+1) | Y, z^(1), …, z^(u))
     § Overall loss: ℒ = (1/M) ∑_{u=1}^{M} ℒ^(u)
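Putting the pieces together, a hedged sketch of one teacher-forced training step, reusing the Encoder/Decoder sketches above plus an `out` projection (the role of X); `F.cross_entropy` combines the softmax and the NLL of the correct next word, which matches the per-step loss described on the slide:

```python
import torch
import torch.nn.functional as F

def train_step(encoder, decoder, out, optimizer, src_ids, trg_in, trg_out):
    """src_ids: (batch, N) source word ids
    trg_in : (batch, M) decoder inputs     <sos> z^(1) ... z^(M-1)
    trg_out: (batch, M) correct next words z^(1) ... z^(M-1) <eos>
    """
    optimizer.zero_grad()
    _, enc_last = encoder(src_ids)                 # i^(N)
    dec_states, _ = decoder(trg_in, enc_last)      # t^(1..M)
    logits = out(dec_states)                       # X t^(u) + c, shape (batch, M, |W_dec|)
    # average NLL of the correct next words over all decoder steps
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), trg_out.reshape(-1))
    loss.backward()                                # gradient flows through both RNNs and embeddings
    optimizer.step()
    return loss.item()
```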

  16. Training Seq2seq
      [Figure: the unrolled model. The encoder RNN reads the source words y^(1) … y^(4) through the embeddings F, producing i^(1) … i^(4); the decoder RNN is fed <sos> and the target words z^(1), z^(2), z^(3) through the embeddings V.]

  17. Training Seq2seq
      [Figure: the first decoder step. The decoder produces ẑ^(1) via the projection X; the loss ℒ^(1) is the NLL of the correct next target word.]

  18. Training Seq2seq
      [Figure: the second decoder step, adding ẑ^(2) and the loss ℒ^(2), again the NLL of the correct next target word.]

  19. Training Seq2seq
      [Figure: the third decoder step, adding ẑ^(3) and the loss ℒ^(3); the unrolling continues in the same way until <eos>.]

  20. Parameters
     § Encoder embeddings F → |𝕎_enc| × e_enc
     § Encoder RNN parameters
     § Decoder embeddings V → |𝕎_dec| × e_dec
     § Decoder RNN parameters
     § Decoder output projection X → |𝕎_dec| × d (with d the decoder hidden dimension)
     § Bias terms are left out of this list
     § e_enc and e_dec are the embedding dimensions
     § The RNNs can be LSTMs, GRUs, or vanilla (Elman) RNNs
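For concreteness (sizes are assumptions, not from the slides), the parameter groups above can be instantiated and counted in PyTorch, in line with the Encoder/Decoder sketches earlier:

```python
import torch.nn as nn

W_enc, W_dec = 8000, 6000            # assumed vocabulary sizes
e_enc, e_dec, d = 256, 256, 512      # assumed embedding / hidden dimensions

F_emb = nn.Embedding(W_enc, e_enc)   # encoder embeddings F: |W_enc| x e_enc
V_emb = nn.Embedding(W_dec, e_dec)   # decoder embeddings V: |W_dec| x e_dec
enc_rnn = nn.GRU(e_enc, d)           # encoder RNN parameters
dec_rnn = nn.GRU(e_dec, d)           # decoder RNN parameters
X_proj = nn.Linear(d, W_dec)         # decoder output projection X (plus bias c)

total = sum(p.numel() for m in (F_emb, V_emb, enc_rnn, dec_rnn, X_proj) for p in m.parameters())
print(f"total trainable parameters: {total:,}")
```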

  21. Practical points: vocabularies & embeddings
     § In Machine Translation
       - the encoder and decoder vocabularies belong to two different languages
     § In summarization
       - the encoder and decoder vocabularies are typically the same set (as they are in the same language)
       - the encoder and decoder embeddings (F and V) can then also share parameters
     § Weight tying
       - can be done by sharing the parameters of V and X in the decoder (see the sketch below)
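A minimal PyTorch sketch of one common way to tie these weights (an implementation assumption, not prescribed by the slides); tying V and X requires the decoder hidden size to equal the embedding size:

```python
import torch.nn as nn

trg_vocab_size = 1000                     # assumed toy size
emb_dim = hidden_dim = 256                # tying requires these to match

embedding = nn.Embedding(trg_vocab_size, emb_dim)         # V: (|W_dec|, emb_dim)
out = nn.Linear(hidden_dim, trg_vocab_size, bias=False)   # X: (|W_dec|, hidden_dim)
out.weight = embedding.weight             # both modules now share one parameter tensor
```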

  22. Decoding – Recap
     § After training, we use the model to generate a target sequence given the source sequence (decoding). We aim to find the optimal output sequence Z* that maximizes Q(Z|Y):
       Z* = argmax_Z Q(Z|Y)
       where Q(Z|Y) for any arbitrary Z = {z^(1), z^(2), …, z^(M)} is:
       Q(Z|Y) = ∏_{u=1}^{M} Q(z^(u) | Y, z^(1), …, z^(u−1))
     § Question: among all possible sequences Z, how can we find Z*? (A simple greedy sketch follows below.)
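One simple (and only approximate) answer is greedy decoding: pick the most probable word at each step. Here is a hedged sketch reusing the earlier Encoder/Decoder/`out` sketches; other search strategies exist but are not shown here:

```python
import torch

@torch.no_grad()
def greedy_decode(encoder, decoder, out, src_ids, sos_id, eos_id, max_len=50):
    """Approximate Z* = argmax_Z Q(Z|Y) by taking the argmax word at every step."""
    _, state = encoder(src_ids)                             # i^(N) becomes t^(0)
    token = torch.full((src_ids.size(0), 1), sos_id, dtype=torch.long)  # start with <sos>
    generated = []
    for _ in range(max_len):
        dec_states, state = decoder(token, state)           # one decoder step
        token = out(dec_states[:, -1]).argmax(-1, keepdim=True)  # most probable next word
        generated.append(token)
        if (token == eos_id).all():                         # stop once every sequence emits <eos>
            break
    return torch.cat(generated, dim=1)                      # generated z^(1), z^(2), ...
```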
