SLIDE 1

Sequence-to-sequence models

Murat Apishev, Katya Artemova

Computational Pragmatics Lab, HSE

December 2, 2019


SLIDE 2

Machine translation

Today

1. Machine translation
2. Task oriented chat-bots
3. Constituency parsing
4. Spelling correction
5. Summarization
6. Question answering
7. IR-based QA (Datasets, Models)

SLIDE 3

Machine translation

Sequence-to-sequence

Neural encoder-decoder architectures

Achieves strong results on machine translation, spelling correction, summarization and other NLP tasks
The encoder reads a sequence of tokens $x_{1:n}$ and outputs hidden states $h^E_{1:n}$
The decoder generates an output sequence of tokens $y_{1:n}$, starting from the last encoder hidden state $h^D_0 = h^E_n$
seq2seq architectures are trained on parallel corpora

Image source: jeddy92

SLIDE 4

Machine translation

Seq-to-seq for MT

Both encoder and decoder are recurrent networks
Input words $x_i$ ($i \in [1, n]$) are represented as word embeddings (w2v, for example)
The context vector $h_n$, the last hidden state of the RNN encoder, turns out to be a bottleneck
It is challenging for the models to deal with long sentences, as the impact of the last words is higher
The attention mechanism is one of the possible solutions
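To make the setup concrete, here is a minimal sketch (not the lecture's code) of such an encoder-decoder in PyTorch: a GRU encoder produces $h^E_{1:n}$ and a GRU decoder, initialized with $h^E_n$, emits a distribution over the target vocabulary at each step. All sizes, the <bos> token id and the single-step usage are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hid_dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)

    def forward(self, x):                        # x: (batch, src_len)
        outputs, h_n = self.rnn(self.emb(x))     # outputs: all states h^E_{1:n}
        return outputs, h_n                      # h_n: last state h^E_n (the bottleneck)

class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hid_dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, y_prev, h_prev):           # one decoding step
        out, h = self.rnn(self.emb(y_prev), h_prev)
        return self.out(out), h                  # logits over the target vocabulary

# usage: the decoder starts from the encoder's last hidden state
enc, dec = Encoder(1000), Decoder(1200)
src = torch.randint(0, 1000, (2, 7))             # batch of 2 toy source sentences
enc_states, h = enc(src)
y = torch.zeros(2, 1, dtype=torch.long)          # <bos> token id assumed to be 0
logits, h = dec(y, h)                            # first decoding step
```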

SLIDE 5

Machine translation

Seq-to-seq for MT + attention

The attention mechanism allows the model to align input and output words
The encoder passes all the hidden states to the decoder: not just $h^E_n$, but rather all $h^E_i$, $i \in [1, n]$
The hidden states can be treated as context-aware word embeddings
The hidden states are used to produce a context vector $c$ for the decoder

SLIDE 6

Machine translation

Seq-to-seq for MT + attention

At step $j$ the decoder takes as input its previous hidden state $h^D_{j-1}$, $j \in [n+1, m]$, and a context vector $c_j$ from the encoder
The context vector $c_j$ is a linear combination of the encoder hidden states: $c_j = \sum_i \alpha_i h^E_i$
$\alpha_i$ are attention weights which help the decoder to focus on the relevant part of the encoder input

Image source: jalammar

SLIDE 7

Machine translation

Seq-to-seq MT + attention

To generate a new word at step $j$, the decoder:
◮ takes $h^D_{j-1}$ as input and produces $h^D_j$
◮ concatenates $h^D_j$ with $c_j$
◮ passes the concatenated vector through a linear layer with softmax activation to get a probability distribution over the target vocabulary
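A hedged numpy sketch of one such decoding step, assuming the RNN update producing $h^D_j$ has already happened, a dot-product similarity is used, and W_out, b_out are stand-ins for the output projection parameters:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def decode_step(enc_states, h_dec, W_out, b_out):
    """enc_states: (n, d) encoder states h^E_i; h_dec: (d,) decoder state h^D_j."""
    scores = enc_states @ h_dec                   # similarity of each h^E_i to h^D_j
    alpha = softmax(scores)                       # attention weights over input positions
    c = alpha @ enc_states                        # context vector c_j = sum_i alpha_i h^E_i
    z = np.concatenate([h_dec, c])                # concatenate h^D_j with c_j
    return softmax(W_out @ z + b_out)             # distribution over the target vocabulary

rng = np.random.default_rng(0)
enc_states, h_dec = rng.normal(size=(5, 8)), rng.normal(size=8)
W_out, b_out = rng.normal(size=(30, 16)), np.zeros(30)    # toy target vocabulary of 30 words
print(decode_step(enc_states, h_dec, W_out, b_out).sum()) # probabilities sum to 1
```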

SLIDE 8

Machine translation

Attention weights

The attention weights $\alpha_{ij}$ measure the similarity between the encoder hidden state $h^E_i$ and the decoder state while generating word $j$:

$$\alpha_{ij} = \frac{\exp(\mathrm{sim}(h^E_i, h^D_j))}{\sum_k \exp(\mathrm{sim}(h^E_k, h^D_j))}$$

The similarity $\mathrm{sim}$ can be computed by:
◮ dot product attention: $\mathrm{sim}(h, s) = h^T s$
◮ additive attention: $\mathrm{sim}(h, s) = w^T \tanh(W_h h + W_s s)$
◮ multiplicative attention: $\mathrm{sim}(h, s) = h^T W s$

The weights are trained jointly with the whole model.
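The three scoring functions can be written down directly; in the sketch below the weights w, W_h, W_s, W are random stand-ins for parameters that would in practice be trained jointly with the rest of the model:

```python
import numpy as np

def sim_dot(h, s):
    return h @ s                                   # dot product attention: h^T s

def sim_additive(h, s, w, W_h, W_s):
    return w @ np.tanh(W_h @ h + W_s @ s)          # additive attention: w^T tanh(W_h h + W_s s)

def sim_multiplicative(h, s, W):
    return h @ W @ s                               # multiplicative attention: h^T W s

rng = np.random.default_rng(1)
h, s = rng.normal(size=4), rng.normal(size=4)
w, W_h, W_s = rng.normal(size=6), rng.normal(size=(6, 4)), rng.normal(size=(6, 4))
W = rng.normal(size=(4, 4))
print(sim_dot(h, s), sim_additive(h, s, w, W_h, W_s), sim_multiplicative(h, s, W))
```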

SLIDE 9

Machine translation

Attention map

Figure: Visualisation of attention weights


SLIDE 10

Machine translation

MT metrics

BLEU compares the system output to a reference translation.
Reference translation: "E-mail was sent on Tuesday"
System output: "The letter was sent on Tuesday"
For each $N$ ($N \in [1, 4]$), count the $N$-grams present in both the system output and the reference translation, relative to the number of $N$-grams in the system output:
$N = 1 \Rightarrow 4/6$, $N = 2 \Rightarrow 3/5$, $N = 3 \Rightarrow 2/4$, $N = 4 \Rightarrow 1/3$
Take the geometric mean: $\mathrm{score} = \sqrt[4]{\frac{4}{6} \cdot \frac{3}{5} \cdot \frac{2}{4} \cdot \frac{1}{3}}$
Brevity penalty: $BP = \min(1, 6/5) = 1$
Finally, $\mathrm{BLEU} = BP \cdot \mathrm{score} \approx 0.5081$
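The example can be reproduced in a few lines of Python. This is a hedged simplification of BLEU: a single reference, Counter-based n-gram overlap, and the simplified brevity penalty min(1, c/r) used on the slide.

```python
from collections import Counter
from math import prod

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    c, r = candidate.lower().split(), reference.lower().split()
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = Counter(ngrams(c, n)), Counter(ngrams(r, n))
        overlap = sum((cand & ref).values())       # n-grams shared with the reference
        precisions.append(overlap / max(1, sum(cand.values())))
    score = prod(precisions) ** (1 / max_n)        # geometric mean of the N-gram precisions
    bp = min(1.0, len(c) / len(r))                 # simplified brevity penalty from the slide
    return bp * score

print(bleu("The letter was sent on Tuesday", "E-mail was sent on Tuesday"))  # ≈ 0.5081
```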

SLIDE 11

Task oriented chat-bots

Today

1. Machine translation
2. Task oriented chat-bots
3. Constituency parsing
4. Spelling correction
5. Summarization
6. Question answering
7. IR-based QA (Datasets, Models)

SLIDE 12

Task oriented chat-bots

Natural language understanding

Two tasks (intent detection and slot filling): identify speaker’s intent and extract semantic constituents from the natural language query

Figure: ATIS corpus sample with intent and slot annotation

Intent detection is a classification task
Slot filling is a sequence labelling task
NLU datasets: ATIS [1], Snips [2]

SLIDE 13

Task oriented chat-bots

Joint intent detection and slot filling [3]

1. The encoder is a biLSTM
2. The decoder is a unidirectional LSTM
3. At each step the decoder state $s_i$ is $s_i = f(s_{i-1}, y_{i-1}, h_i, c_i)$, where $c_i = \sum_{j=1}^{T} \alpha_{i,j} h_j$, $\alpha_{i,j} = \frac{\exp(e_{i,j})}{\sum_{k=1}^{T} \exp(e_{i,k})}$, $e_{i,k} = g(s_{i-1}, h_k)$

The inputs are explicitly aligned. Costs from both decoders are back-propagated to the encoder.

Figure: Encoder-decoder models

SLIDE 14

Task oriented chat-bots

Joint intent detection and slot filling [3]

BiLSTM reads the source sequence forward
RNN models slot label dependencies
The hidden state $h_i$ at each step is a concatenation of the forward state $fh_i$ and the backward state $bh_i$
The hidden state $h_i$ is combined with the context vector $c_i$
$c_i$ is calculated as a weighted average of $h = (h_1, \dots, h_T)$

Figure: RNN-based model. Figure: Attention weights
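For illustration, a hedged PyTorch sketch of the joint setup (a simplified biLSTM tagger, not the exact attention-based architecture of [3]): one shared encoder, an intent classifier on its final state, and a per-token slot classifier. The vocabulary, intent and slot counts are made up.

```python
import torch
import torch.nn as nn

class JointNLU(nn.Module):
    def __init__(self, vocab_size, n_intents, n_slots, emb_dim=64, hid_dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.enc = nn.LSTM(emb_dim, hid_dim, batch_first=True, bidirectional=True)
        self.intent_head = nn.Linear(2 * hid_dim, n_intents)   # utterance-level classification
        self.slot_head = nn.Linear(2 * hid_dim, n_slots)       # per-token sequence labelling

    def forward(self, x):                           # x: (batch, seq_len)
        h, _ = self.enc(self.emb(x))                # h: (batch, seq_len, 2*hid_dim), fw/bw concatenated
        intent_logits = self.intent_head(h[:, -1])  # last hidden state predicts the intent
        slot_logits = self.slot_head(h)             # one slot distribution per input token
        return intent_logits, slot_logits

model = JointNLU(vocab_size=5000, n_intents=21, n_slots=120)
intent_logits, slot_logits = model(torch.randint(0, 5000, (2, 9)))
```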

SLIDE 15

Constituency parsing

Today

1. Machine translation
2. Task oriented chat-bots
3. Constituency parsing
4. Spelling correction
5. Summarization
6. Question answering
7. IR-based QA (Datasets, Models)

SLIDE 16

Constituency parsing

Grammar as a Foreign Language [4]

Figure: Example parsing task and its linearization


SLIDE 17

Constituency parsing

Grammar as a Foreign Language [4]

Figure: LSTM+attention encoder-decoder model for parsing


SLIDE 18

Constituency parsing

Grammar as a Foreign Language [4]

The encoder LSTM is used to encode the sequence of input words $A_i$, $|A| = T_A$
The decoder LSTM is used to output symbols $B_i$, $|B| = T_B$
The attention vector at each output time step $t$ over the input words:

$$u^t_i = v^T \tanh(W_1 h^E_i + W_2 h^D_t), \quad a^t_i = \mathrm{softmax}(u^t_i), \quad d'_t = \sum_{i=1}^{T_A} a^t_i h^E_i,$$

where the vector $v$ and the matrices $W_1$, $W_2$ are learnable parameters of the model.

SLIDE 19

Spelling correction

Today

1. Machine translation
2. Task oriented chat-bots
3. Constituency parsing
4. Spelling correction
5. Summarization
6. Question answering
7. IR-based QA (Datasets, Models)

SLIDE 20

Spelling correction

Neural Language Correction with Character-Based Attention [5]

Trained on a parallel corpus of "bad" ($x$) and "good" ($y$) sentences
The encoder has a pyramid structure:

$$f^{(j)}_t = \mathrm{GRU}(f^{(j-1)}_{t-1}, c^{(j-1)}_t), \quad b^{(j)}_t = \mathrm{GRU}(b^{(j-1)}_{t+1}, c^{(j-1)}_t),$$
$$h^{(j)}_t = f^{(j)}_t + b^{(j)}_t, \quad c^{(j)}_t = \tanh(W^{(j)}_{pyr} [h^{(j-1)}_{2t}, h^{(j-1)}_{2t+1}]^\top + b^{(j)}_{pyr})$$

Figure: An encoder-decoder neural network model with two encoder hidden layers and one decoder hidden layer

SLIDE 21

Spelling correction

Neural Language Correction with Character-Based Attention [5]

Decoder network: $d^{(j)}_t = \mathrm{GRU}(d^{(j-1)}_{t-1}, d^{(j-1)}_t)$
Attention mechanism: $u_{tk} = \varphi_1(d^{(M)}_t)^\top \varphi_2(c_k)$, where $\varphi(\cdot) = \tanh(W \cdot)$, and
$$\alpha_{tk} = \frac{u_{tk}}{\sum_j u_{tj}}, \quad a_t = \sum_j \alpha_{tj} c_j$$
Loss: $L(x, y) = \sum_{t=1}^{T} \log P(y_t \mid x, y_{<t})$

Figure: An encoder-decoder neural network model with two encoder hidden layers and one decoder hidden layer

SLIDE 22

Spelling correction

Neural Language Correction with Character-Based Attention [5]

Beam search for decoding: $s_k(y_{1:k} \mid x) = \log P_{NN}(y_{1:k} \mid x) + \lambda \log P_{LM}(y_{1:k})$
Synthesizing errors: article or determiner errors (ArtOrDet) and noun number errors (Nn)

Figure: An encoder-decoder neural network model with two encoder hidden layers and one decoder hidden layer
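A hedged sketch of the rescoring rule above, assuming log_p_nn and log_p_lm are scoring callables provided by the correction model and by a language model, and that λ has been tuned (0.3 here is arbitrary):

```python
def rescore(hypotheses, log_p_nn, log_p_lm, lam=0.3):
    """Pick the hypothesis maximizing log P_NN(y|x) + lambda * log P_LM(y)."""
    return max(hypotheses, key=lambda y: log_p_nn(y) + lam * log_p_lm(y))

# toy usage with stand-in scores; a real system would query the seq2seq model and an LM
scores_nn = {"I like this dog .": -1.2, "I like these dog .": -1.1}
scores_lm = {"I like this dog .": -3.0, "I like these dog .": -6.0}
print(rescore(list(scores_nn), log_p_nn=scores_nn.get, log_p_lm=scores_lm.get))
```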

SLIDE 23

Summarization

Today

1. Machine translation
2. Task oriented chat-bots
3. Constituency parsing
4. Spelling correction
5. Summarization
6. Question answering
7. IR-based QA (Datasets, Models)

SLIDE 24

Summarization

Summarization & Simplification

Summarization

Summarization is the task of producing a shorter version of one or several documents that preserves most of the input's meaning.
1. Abstractive summarization: paraphrase the corpus using novel sentences
2. Extractive summarization: concatenate extracts taken from a corpus into a summary

Simplification

Simplification consists of modifying the content and structure of a text in order to make it easier to read and understand, while preserving its main idea and approximating its original meaning.

Image source: nlpprogress.com

SLIDE 25

Summarization

Metrics: ROUGE [6]

Recall-Oriented Understudy for Gisting Evaluation

ROUGE is used to compare a system summary or translation against a set of reference human summaries:

$$\mathrm{ROUGE}_n = \frac{\text{number of overlapping } n\text{-grams}}{\text{number of } n\text{-grams in the reference summary}}$$

$$R_{LCS} = \frac{LCS(X, Y)}{|X|}, \quad P_{LCS} = \frac{LCS(X, Y)}{|Y|}, \quad \mathrm{ROUGE}_L = \frac{(1 + \beta^2) R_{LCS} P_{LCS}}{R_{LCS} + \beta^2 P_{LCS}},$$

where $LCS(X, Y)$ is the length of a longest common subsequence of $X$ and $Y$.
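A hedged sketch of ROUGE-N as defined above (single reference, whitespace tokenization, no stemming):

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n(system, reference, n=2):
    sys_counts = Counter(ngrams(system.lower().split(), n))
    ref_counts = Counter(ngrams(reference.lower().split(), n))
    overlap = sum((sys_counts & ref_counts).values())     # n-grams shared with the reference
    return overlap / max(1, sum(ref_counts.values()))     # recall: divide by reference n-grams

print(rouge_n("the cat was found under the bed", "the cat was under the bed", n=2))
```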

SLIDE 26

Summarization

Metrics: METEOR [7]

Metric for Evaluation of Translation with Explicit ORdering

METEOR is used to compare a system summary or translation against a set of reference human summaries:

$$P = \frac{\text{number of overlapping words}}{\text{number of words in the system summary}}, \quad R = \frac{\text{number of overlapping words}}{\text{number of words in the reference summary}},$$

$$F_{mean} = \frac{10 P R}{R + 9 P}, \quad \text{penalty} = 0.5 \left(\frac{\text{number of chunks}}{\text{number of overlapping words}}\right)^3, \quad M = F_{mean}(1 - \text{penalty})$$
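A hedged sketch of the METEOR formula above, assuming the unigram match count and the number of contiguous match "chunks" have already been computed by the aligner (the numbers in the usage line are made up):

```python
def meteor(matches, system_len, reference_len, chunks):
    if matches == 0:
        return 0.0
    p = matches / system_len
    r = matches / reference_len
    f_mean = 10 * p * r / (r + 9 * p)             # recall-weighted harmonic mean
    penalty = 0.5 * (chunks / matches) ** 3       # fragmentation penalty
    return f_mean * (1 - penalty)

print(meteor(matches=6, system_len=7, reference_len=6, chunks=2))
```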

SLIDE 27

Summarization

Datasets: CNN / Daily Mail [8], [9]

The dataset contains online news articles (781 tokens on average) paired with multi-sentence summaries (3.75 sentences or 56 tokens on average). The processed version contains 287,226 training pairs, 13,368 validation pairs and 11,490 test pairs.


SLIDE 28

Summarization

Datasets: Webis-TLDR-17 [10]

The dataset contains 4 million content-summary pairs from Reddit.


SLIDE 29

Summarization

Datasets: headline generation

1. Gigaword summarization dataset [11]
2. RIA news dataset [12]

SLIDE 30

Summarization

Datasets: WikiSmall [13]

The main source of simplified sentences is Simple English Wikipedia. WikiSmall is a parallel corpus with more than 108K sentence pairs from 65,133 Wikipedia articles, allowing 1-to-1 and 1-to-N alignments.

SLIDE 31

Summarization

Get to the Point [14]

Sequence-to-sequence attentional model

Bahdanau attention: $e^t_i = v^T \tanh(W_h h_i + W_s s_t + b_{attn})$, $a^t = \mathrm{softmax}(e^t)$
Context vector: $h^*_t = \sum_i a^t_i h_i$
Vocabulary distribution: $P_{vocab} = \mathrm{softmax}(V'(V[s_t, h^*_t] + b) + b')$
NLL loss: $-\frac{1}{T} \sum_{t=0}^{T} \log P(w^*_t)$

SLIDE 32

Summarization

Get to the Point [14]

Pointer-generator model


SLIDE 33

Summarization

Get to the Point [14]

Pointer-generator model

Generation probability: $p_{gen} = \sigma(w^T_{h^*} h^*_t + w^T_s s_t + w^T_x x_t + b_{ptr})$
$p_{gen}$ is used to switch between sampling from $P_{vocab}$ and copying from the attention distribution $a^t$:
$$P(w) = p_{gen} P_{vocab}(w) + (1 - p_{gen}) \sum_{i: w_i = w} a^t_i$$
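A hedged numpy sketch of this final distribution: the attention mass is scattered onto the source-token ids and mixed with the vocabulary distribution using $p_{gen}$ (all sizes and values below are toy examples):

```python
import numpy as np

def final_distribution(p_gen, p_vocab, attention, src_ids, vocab_size):
    """p_vocab: (vocab_size,) softmax output; attention: (src_len,); src_ids: source token ids."""
    copy_dist = np.zeros(vocab_size)
    np.add.at(copy_dist, src_ids, attention)       # sum attention mass of repeated source tokens
    return p_gen * p_vocab + (1.0 - p_gen) * copy_dist

vocab_size = 10
p_vocab = np.full(vocab_size, 1.0 / vocab_size)
attention = np.array([0.7, 0.2, 0.1])              # attention over a 3-token source
print(final_distribution(p_gen=0.6, p_vocab=p_vocab, attention=attention,
                         src_ids=np.array([4, 4, 7]), vocab_size=vocab_size))
```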

SLIDE 34

Summarization

A Deep Reinforced Model for Abstractive Summarization [15]


SLIDE 35

Summarization

A Deep Reinforced Model for Abstractive Summarization [15]

Intra-temporal attention: $e_{ti} = h^{dT}_t W^e_{attn} h^e_i$, $\alpha^e_{ti} = \mathrm{softmax}(e_{ti})$, $c^e_t = \sum_{i=1}^{n} \alpha^e_{ti} h^e_i$

Intra-decoder attention: $e_{tt'} = h^{dT}_t W^d_{attn} h^d_{t'}$, $\alpha^d_{tt'} = \mathrm{softmax}(e_{tt'})$, $c^d_t = \sum_{j=1}^{t-1} \alpha^d_{tj} h^d_j$

SLIDE 36

Summarization

A Deep Reinforced Model for Abstractive Summarization [15]

Token generation: $p(y_t \mid u_t = 0) = \mathrm{softmax}(W_{out}[h^d_t, c^e_t, c^d_t] + b_{out})$
Pointer: $p(y_t = x_i \mid u_t = 1) = \alpha^e_{ti}$
$p(u_t = 1) = \sigma(W_u[h^d_t, c^e_t, c^d_t] + b_u)$
Probability distribution for the output token: $p(y_t) = p(u_t = 1)\, p(y_t \mid u_t = 1) + p(u_t = 0)\, p(y_t \mid u_t = 0)$
Sharing decoder weights: $W_{out} = \tanh(W_{emb} W_{proj})$

SLIDE 37

Summarization

A Deep Reinforced Model for Abstractive Summarization [15]

Hybrid learning objective: $L_{mixed} = \gamma L_{rl} + (1 - \gamma) L_{ml}$
Teacher forcing: $L_{ml} = -\sum_{t=1}^{n'} \log p(y_t \mid y_1, \dots, y_{t-1}, x)$
Policy learning: $L_{rl} = (r(\hat{y}) - r(y^s)) \sum_{t=1}^{n'} \log p(y^s_t \mid y^s_1, \dots, y^s_{t-1}, x),$

where $r$ is a reward function and $\hat{y}$ is the baseline output, obtained by maximizing the output probability distribution at each time step.
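A hedged PyTorch sketch of the mixed objective, assuming per-token log-probabilities for the teacher-forced and sampled sequences and sequence-level rewards (e.g. ROUGE) are already available; the γ value here is an arbitrary placeholder:

```python
import torch

def mixed_loss(logp_teacher_forced, logp_sampled, reward_sampled, reward_baseline, gamma=0.998):
    """logp_*: (batch, steps) per-token log-probs; reward_*: (batch,) sequence-level scores."""
    l_ml = -logp_teacher_forced.sum(dim=1).mean()             # teacher forcing (NLL)
    advantage = reward_baseline - reward_sampled               # (r(y_hat) - r(y_s))
    l_rl = (advantage * logp_sampled.sum(dim=1)).mean()        # self-critical policy gradient term
    return gamma * l_rl + (1.0 - gamma) * l_ml

logp_tf = torch.log(torch.rand(2, 5))                          # stand-in log-probs (teacher forcing)
logp_s = torch.log(torch.rand(2, 5))                           # stand-in log-probs (sampled sequence)
loss = mixed_loss(logp_tf, logp_s, reward_sampled=torch.tensor([0.3, 0.5]),
                  reward_baseline=torch.tensor([0.4, 0.4]))
```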

SLIDE 38

Question answering

Today

1. Machine translation
2. Task oriented chat-bots
3. Constituency parsing
4. Spelling correction
5. Summarization
6. Question answering
7. IR-based QA (Datasets, Models)

SLIDE 39

Question answering

Types of questions

1. Factoid questions:
◮ What is the dress code for the Vatican?
◮ Who is the President of the United States?
◮ What are the dots in Hebrew called?
2. Commonsense questions:
◮ What do all humans want to experience in their own home? (a) feel comfortable, (b) work hard, (c) fall in love, (d) lay eggs, (e) live forever
3. Opinion questions:
◮ Can anyone recommend a good coffee shop near HSE campus?
4. Cloze-style questions

SLIDE 40

Question answering

Types of questions

Types of answers:
◮ binary (yes / no)
◮ find a span of text
◮ multiple choice

SLIDE 41

Question answering

Major paradigms for factoid question answering

1. Information retrieval (IR)-based QA: find a span of text which answers a question
2. Open-domain Question Answering (ODQA): answer questions about nearly anything
3. Knowledge (KB)-based QA: build a semantic representation of the question, which is used to query knowledge bases, e.g. "When did Bernardo Bertolucci die?" → death-year(Bernardo Bertolucci, ?x)

SLIDE 42

IR-based QA

Today

1. Machine translation
2. Task oriented chat-bots
3. Constituency parsing
4. Spelling correction
5. Summarization
6. Question answering
7. IR-based QA (Datasets, Models)

SLIDE 43

IR-based QA

IR-based QA

1. Question processing
◮ answer type (PER, LOC, TIME)
◮ focus
◮ question type
2. Query formulation
◮ question reformulation: remove wh-words, change word order
◮ query expansion

SLIDE 44

IR-based QA

IR-based QA

3. Document and passage retrieval
4. Answer extraction

Example: "What are the dots in Hebrew called?" → "In Hebrew orthography, niqqud or nikkud is a system of diacritical signs used to represent vowels or distinguish between alternative pronunciations of letters of the Hebrew alphabet."

SLIDE 45

IR-based QA Datasets

1. Machine translation
2. Task oriented chat-bots
3. Constituency parsing
4. Spelling correction
5. Summarization
6. Question answering
7. IR-based QA (Datasets, Models)

SLIDE 46

IR-based QA Datasets

Datasets for IR-based QA

1. Stanford Question Answering Dataset (SQuAD)
2. NewsQA
3. WikiQA
4. WebQuestions
5. WikiMovies
6. Russian: SberQuAD
7. MedQuAD [16]

SLIDE 47

IR-based QA Datasets

SQuAD2.0 [17], [18]

100,000 questions in SQuAD 1.1 and over 50,000 unanswerable questions added in SQuAD 2.0

1. Project Nayuki's Wikipedia internal PageRanks were used to obtain the top 10,000 articles of English Wikipedia, from which 536 articles were sampled uniformly at random
2. Articles were split into individual paragraphs
3. Crowdsourcing: ask and answer up to 5 questions on the content of each paragraph
4. Crowdworkers were encouraged to ask questions in their own words, without copying phrases from the paragraph
5. Analysis: (i) the diversity of answer types, (ii) the difficulty of questions in terms of the type of reasoning required to answer them, and (iii) the degree of syntactic divergence between the question and answer sentences

https://rajpurkar.github.io/SQuAD-explorer/

SLIDE 48

IR-based QA Datasets

RACE [19]

Figure: An example from RACE dataset

RACE consists of nearly 28k passages and nearly 100k questions generated by human experts (English instructors), and covers a variety of topics carefully designed to evaluate students' ability in understanding and reasoning.

SLIDE 49

IR-based QA Datasets

RACE [19]

Figure: Statistics on reasoning types in different datasets

RACE includes five classes of questions: word matching, paraphrasing, single-sentence reasoning, multi-sentence reasoning, and insufficient or ambiguous questions.

http://www.cs.cmu.edu/~glai1/data/race/

SLIDE 50

IR-based QA Datasets

MS Marco

Figure: The final dataset format for MS MARCO

Three tasks:
1. Predict whether a question can be answered and, if so, generate the correct answer
2. Generate answers that are well-formed
3. Passage re-ranking

http://www.msmarco.org

SLIDE 51

IR-based QA Models

1. Machine translation
2. Task oriented chat-bots
3. Constituency parsing
4. Spelling correction
5. Summarization
6. Question answering
7. IR-based QA (Datasets, Models)

SLIDE 52

IR-based QA Models

DrQA [20]

Document Retriever: return 5 Wikipedia articles using simple tf-idf based retrieval
Document Reader: given a query $q = q_1, \dots, q_l$ and a paragraph of $m$ tokens $p_1, \dots, p_m$
Question encoding: weighted sum of $\mathrm{RNN}(q_1, \dots, q_l)$
Paragraph encoding: $\mathrm{RNN}(p_1, \dots, p_m)$, where each $p_i$ is comprised of:
◮ word embedding $f_{emb}$
◮ exact match $f_{exact\ match}$
◮ token features (POS, NER, TF) $f_{token\ features}$
◮ aligned question embedding $f_{align} = \sum_j a_{ij} E(q_j)$, with $a_{ij} = \frac{\exp(\alpha(E(p_i)) \cdot \alpha(E(q_j)))}{\sum_{j'} \exp(\alpha(E(p_i)) \cdot \alpha(E(q_{j'})))}$

SLIDE 53

IR-based QA Models

DrQA [20]

Prediction: $P_{start}(i) \propto \exp(p_i W_s q)$, $P_{end}(i) \propto \exp(p_i W_e q)$
Choose the best span from token $i$ to token $i'$ such that $i \leq i' \leq i + 15$ and $P_{start}(i) \times P_{end}(i')$ is maximized.
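A hedged sketch of this span-selection rule: a brute-force search over spans of length at most 15 that maximizes $P_{start}(i) \times P_{end}(i')$ (the probabilities below are made up):

```python
import numpy as np

def best_span(p_start, p_end, max_len=15):
    best, best_score = (0, 0), -1.0
    for i in range(len(p_start)):
        for j in range(i, min(i + max_len, len(p_end) - 1) + 1):   # enforce i <= i' <= i + 15
            score = p_start[i] * p_end[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best, best_score

p_start = np.array([0.1, 0.6, 0.1, 0.1, 0.1])
p_end = np.array([0.05, 0.1, 0.7, 0.1, 0.05])
print(best_span(p_start, p_end))                 # ((1, 2), 0.42): span from token 1 to token 2
```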

SLIDE 54

IR-based QA Models

R-NET [21]


SLIDE 55

IR-based QA Models

R-NET [21]

1. Question and passage encoder: a BiRNN converts the words to their respective word-level and character-level embeddings
2. Gated attention-based recurrent networks: incorporate question information into the passage representation
3. Self-matching attention: the whole passage context is necessary to infer the answer
4. Output: pointer networks predict the start and end position of the answer. The initial hidden vector for the pointer network is generated by attention-pooling over the question representation
5. Training: minimize the sum of the negative log probabilities of the ground-truth start and end positions under the predicted distributions

SLIDE 56

IR-based QA Models

BiDAF [22]


SLIDE 57

IR-based QA Models

BiDAF [22]

1. Character Embedding Layer maps each word to a vector space using character-level CNNs
2. Word Embedding Layer maps each word to a vector space using a pre-trained word embedding model
3. Contextual Embedding Layer utilizes contextual cues from surrounding words to refine the embeddings of the words. These first three layers are applied to both the query and the context
4. Attention Flow Layer couples the query and context vectors and produces a set of query-aware feature vectors for each word in the context
5. Modeling Layer employs a recurrent neural network to scan the context
6. Output Layer provides an answer to the query

SLIDE 58

IR-based QA Models

Next generation of QA models

1. S-NET [23]: extraction-then-synthesis framework
2. QANet [24]: benefits from data augmentation techniques, such as paraphrasing and back-translation
3. V-NET [25]: end-to-end neural model that enables answer candidates from different passages to verify each other based on their content representations
4. Deep Cascade QA [26]: deep cascade model consisting of document retrieval, paragraph retrieval and answer extraction modules

SLIDE 59

IR-based QA Models

Take away messages

1. seq2seq architectures are exploited in a variety of NLP tasks
2. The attention mechanism helps to find soft alignments
3. The evaluation metrics are rarely differentiable, hence reinforcement learning is used

SLIDE 60

IR-based QA Models

Reference I

[1] C. T. Hemphill, J. J. Godfrey, and G. R. Doddington, "The ATIS spoken language systems pilot corpus," in Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley, Pennsylvania, June 24–27, 1990, 1990.

[2] A. Coucke, A. Saade, A. Ball, T. Bluche, A. Caulier, D. Leroy, C. Doumouro, T. Gisselbrecht, F. Caltagirone, T. Lavril, M. Primet, and J. Dureau, "Snips voice platform: An embedded spoken language understanding system for private-by-design voice interfaces," CoRR, vol. abs/1805.10190, 2018. arXiv: 1805.10190. Available: http://arxiv.org/abs/1805.10190.

[3] B. Liu and I. Lane, "Attention-based recurrent neural network models for joint intent detection and slot filling," arXiv preprint arXiv:1609.01454, 2016.

SLIDE 61

IR-based QA Models

Reference II

[4] O. Vinyals, L. Kaiser, T. Koo, S. Petrov, I. Sutskever, and G. Hinton, "Grammar as a foreign language," in Advances in Neural Information Processing Systems, 2015, pp. 2773–2781.

[5] Z. Xie, A. Avati, N. Arivazhagan, D. Jurafsky, and A. Y. Ng, "Neural language correction with character-based attention," 2016. arXiv: 1603.09727 [cs.CL].

[6] C.-Y. Lin, "ROUGE: A package for automatic evaluation of summaries," in Text Summarization Branches Out, 2004, pp. 74–81.

[7] S. Banerjee and A. Lavie, "METEOR: An automatic metric for MT evaluation with improved correlation with human judgments," in Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 2005, pp. 65–72.

SLIDE 62

IR-based QA Models

Reference III

[8] K. M. Hermann, T. Kočiský, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, and P. Blunsom, "Teaching machines to read and comprehend," 2015. arXiv: 1506.03340 [cs.CL].

[9] R. Nallapati, B. Zhou, C. dos Santos, Ç. Gülçehre, and B. Xiang, "Abstractive text summarization using sequence-to-sequence RNNs and beyond," in Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, Berlin, Germany: Association for Computational Linguistics, Aug. 2016, pp. 280–290. doi: 10.18653/v1/K16-1028. Available: https://www.aclweb.org/anthology/K16-1028.

SLIDE 63

IR-based QA Models

Reference IV

[10] M. Völske, M. Potthast, S. Syed, and B. Stein, "TL;DR: Mining Reddit to learn automatic summarization," in Proceedings of the Workshop on New Frontiers in Summarization, Copenhagen, Denmark: Association for Computational Linguistics, Sep. 2017, pp. 59–63. doi: 10.18653/v1/W17-4508. Available: https://www.aclweb.org/anthology/W17-4508.

[11] A. M. Rush, S. Chopra, and J. Weston, "A neural attention model for abstractive sentence summarization," 2015. arXiv: 1509.00685 [cs.CL].

[12] D. Gavrilov, P. Kalaidin, and V. Malykh, "Self-attentive model for headline generation," 2019. arXiv: 1901.07786 [cs.CL].

SLIDE 64

IR-based QA Models

Reference V

[13] Z. Zhu, D. Bernhard, and I. Gurevych, "A monolingual tree-based translation model for sentence simplification," in Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), Beijing, China: Coling 2010 Organizing Committee, Aug. 2010, pp. 1353–1361. Available: https://www.aclweb.org/anthology/C10-1152.

[14] A. See, P. J. Liu, and C. D. Manning, "Get to the point: Summarization with pointer-generator networks," in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada: Association for Computational Linguistics, Jul. 2017, pp. 1073–1083. doi: 10.18653/v1/P17-1099. Available: https://www.aclweb.org/anthology/P17-1099.

[15] R. Paulus, C. Xiong, and R. Socher, "A deep reinforced model for abstractive summarization," 2017. arXiv: 1705.04304 [cs.CL].

SLIDE 65

IR-based QA Models

Reference VI

[16] A. Ben Abacha and D. Demner-Fushman, "A question-entailment approach to question answering," arXiv e-prints, Jan. 2019. arXiv: 1901.08079 [cs.CL]. Available: https://arxiv.org/abs/1901.08079.

[17] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang, "SQuAD: 100,000+ questions for machine comprehension of text," arXiv preprint arXiv:1606.05250, 2016.

[18] P. Rajpurkar, R. Jia, and P. Liang, "Know what you don't know: Unanswerable questions for SQuAD," arXiv preprint arXiv:1806.03822, 2018.

[19] G. Lai, Q. Xie, H. Liu, Y. Yang, and E. Hovy, "RACE: Large-scale reading comprehension dataset from examinations," arXiv preprint arXiv:1704.04683, 2017.

SLIDE 66

IR-based QA Models

Reference VII

[20] D. Chen, A. Fisch, J. Weston, and A. Bordes, "Reading Wikipedia to answer open-domain questions," arXiv preprint arXiv:1704.00051, 2017.

[21] W. Wang, N. Yang, F. Wei, B. Chang, and M. Zhou, "Gated self-matching networks for reading comprehension and question answering," in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, 2017, pp. 189–198.

[22] M. Seo, A. Kembhavi, A. Farhadi, and H. Hajishirzi, "Bidirectional attention flow for machine comprehension," 2016.

[23] C. Tan, F. Wei, N. Yang, B. Du, W. Lv, and M. Zhou, "S-Net: From answer extraction to answer synthesis for machine reading comprehension," in AAAI, 2018.

SLIDE 67

IR-based QA Models

Reference VIII

[24] A. W. Yu, D. Dohan, M.-T. Luong, R. Zhao, K. Chen, M. Norouzi, and Q. V. Le, "QANet: Combining local convolution with global self-attention for reading comprehension," arXiv preprint arXiv:1804.09541, 2018.

[25] Y. Wang, K. Liu, J. Liu, W. He, Y. Lyu, H. Wu, S. Li, and H. Wang, "Multi-passage machine reading comprehension with cross-passage answer verification," arXiv preprint arXiv:1805.02220, 2018.

[26] M. Yan, J. Xia, C. Wu, B. Bi, Z. Zhao, J. Zhang, L. Si, R. Wang, W. Wang, and H. Chen, "A deep cascade model for multi-document reading comprehension," Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 7354–7361, Jul. 2019. doi: 10.1609/aaai.v33i01.33017354. Available: http://dx.doi.org/10.1609/aaai.v33i01.33017354.