Sharp Nearby, Fuzzy Far Away: How Neural Language Models Use Context - PowerPoint PPT Presentation



SLIDE 1

Sharp Nearby, Fuzzy Far Away: How Neural Language Models Use Context

Zhen Yang, Wei Chen, Feng Wang, Bo Xu

Chinese Academy of Sciences

Jing Ye

NLP Lab, Soochow University

SLIDE 2

Outline


• Task Definition
• Challenges and Solution
• Approach
• Experiments
• Conclusion

SLIDE 3

Task Definition

• Unsupervised NMT

Unsupervised neural machine translation (NMT) is an approach to machine translation that uses no labeled (parallel) data during training.

• Translation Tasks

• English-German
• English-French
• Chinese-English

SLIDE 4

Challenges and Solution

• Challenges

• Only one encoder is shared by the source and target languages

• A shared encoder is weak at preserving the uniqueness and internal characteristics of each language (style, terminology, sentence structure)

• Solution

• Two independent encoders and decoders, one for each language

• A weight-sharing constraint on the two auto-encoders (AEs)

• Two different generative adversarial networks (GANs)

SLIDE 5

Approach

• Model Architecture

• Encoders: Enc_s, Enc_t
• Decoders: Dec_s, Dec_t
• Local discriminator: D_l
• Global discriminators: D_g1, D_g2

SLIDE 6

Approach

• Model Architecture

• Encoders: Enc_s, Enc_t

The encoder is composed of a stack of four identical layers. Each layer consists of a multi-head self-attention sub-layer and a simple position-wise fully connected feed-forward network.
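Below is a minimal PyTorch sketch of such an encoder stack (the deck itself gives no code): d_model=512, 8 heads, and dropout 0.1 follow the hyper-parameter table on a later slide, while the feed-forward inner size of 2048 is an assumption.

```python
import torch
import torch.nn as nn

# Minimal sketch of the four-layer Transformer encoder described above.
# d_model=512, nhead=8, dropout=0.1 follow the hyper-parameter table on
# slide 18; dim_feedforward=2048 is an assumption.
encoder_layer = nn.TransformerEncoderLayer(
    d_model=512,           # hidden / embedding size
    nhead=8,               # attention heads per layer
    dim_feedforward=2048,  # position-wise FFN inner size (assumed)
    dropout=0.1,
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=4)

src = torch.rand(20, 1, 512)  # (seq_len, batch, d_model)
memory = encoder(src)         # contextualized states, same shape
```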

SLIDE 7

Approach

• Model Architecture

• Decoders: Dec_s, Dec_t

The decoder is composed of four identical layers. In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack.
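A matching PyTorch sketch of the decoder side, under the same assumed dimensions; memory stands in for the encoder-stack output that the third sub-layer attends over.

```python
import torch
import torch.nn as nn

# Sketch of the four-layer decoder: nn.TransformerDecoderLayer contains the
# masked self-attention, the cross-attention over the encoder output (the
# third sub-layer mentioned above), and the feed-forward sub-layer.
decoder_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8, dropout=0.1)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=4)

memory = torch.rand(20, 1, 512)  # encoder-stack output
tgt = torch.rand(15, 1, 512)     # shifted target embeddings
out = decoder(tgt, memory)       # (15, 1, 512)
```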

SLIDE 8

Approach

• Model Architecture

• Local discriminator D_l: a multi-layer perceptron (MLP)

• Global discriminators D_g1, D_g2: convolutional neural networks (CNNs)
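Hedged PyTorch sketches of the two discriminator types: the slides state only "MLP" and "CNN", so every layer size below is an assumption.

```python
import torch
import torch.nn as nn

# Local discriminator D_l: an MLP that scores a 512-dim encoder state.
local_discriminator = nn.Sequential(
    nn.Linear(512, 256),
    nn.LeakyReLU(0.2),
    nn.Linear(256, 1),
    nn.Sigmoid(),  # probability the state comes from the source encoder
)

# Global discriminator: a CNN that scores a whole (embedded) sentence.
class GlobalDiscriminator(nn.Module):
    def __init__(self, d_model=512, n_filters=64):
        super().__init__()
        self.conv = nn.Conv1d(d_model, n_filters, kernel_size=3, padding=1)
        self.fc = nn.Linear(n_filters, 1)

    def forward(self, sent):              # sent: (batch, d_model, seq_len)
        h = torch.relu(self.conv(sent))   # (batch, n_filters, seq_len)
        h = h.max(dim=2).values           # max-over-time pooling
        return torch.sigmoid(self.fc(h))  # probability of being real
```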

SLIDE 9

Approach

• Model Architecture

SLIDE 10

Approach

• Directional Self-attention

• The forward mask: a later token only makes attention connections to the earlier tokens in the sequence

• The backward mask: the mirror image, where an earlier token only attends to later tokens (see the mask sketch below)
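A small sketch of the two masks as boolean attention masks, following PyTorch's attn_mask convention (True marks a disallowed connection); the backward mask is simply the transpose of the forward one.

```python
import torch

seq_len = 5
# Forward mask: token i may attend only to positions j <= i (earlier tokens),
# so everything strictly above the diagonal is masked out.
forward_mask = torch.triu(
    torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1
)
# Backward mask: the mirror image, masking everything strictly below the
# diagonal, so token i may attend only to positions j >= i.
backward_mask = forward_mask.T
```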

SLIDE 11

Approach

• Weight sharing

• Sharing the weights of the last few layers of Enc_s and Enc_t, which are responsible for extracting high-level representations of the input sentences

• Sharing the weights of the first few layers of Dec_s and Dec_t, which are expected to decode the high-level representations that are vital for reconstructing the input sentences

• The shared weights are utilized to map the hidden features extracted by the independent weights to the shared latent space (a minimal sketch follows)
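A minimal sketch of one way to realize the constraint, assuming a single shared encoder layer (the setting the experiments below find best): the same layer object is reused by both encoders, so one set of weights receives gradients from both languages.

```python
import torch.nn as nn

def private_layers(n):
    # Independent (non-shared) encoder layers for one language.
    return [nn.TransformerEncoderLayer(d_model=512, nhead=8) for _ in range(n)]

# One shared top layer, reused by both encoders.
shared_top = nn.TransformerEncoderLayer(d_model=512, nhead=8)

enc_s = nn.ModuleList(private_layers(3) + [shared_top])  # source encoder
enc_t = nn.ModuleList(private_layers(3) + [shared_top])  # target encoder

assert enc_s[-1] is enc_t[-1]  # a single set of weights, updated by both
```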

SLIDE 12

Approach

• Embedding reinforced encoder

• Using pretrained cross-lingual embeddings in the encoders that are kept fixed during training

• E = (e_1, ..., e_n): the input sequence embedding vectors
• H = (h_1, ..., h_n): the initial output sequence of the encoder stack
• H_r = g ⊙ H + (1 − g) ⊙ E: the final output sequence of the encoder
• g = σ(W_1 E + W_2 H + b) is a gate unit

(W_1, W_2, and b are trainable parameters and they are shared by the two encoders)
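A sketch of the gate unit as reconstructed above; the original symbols were lost in extraction, so the exact functional form should be treated as an assumption.

```python
import torch

def reinforce(E, H, W1, W2, b):
    """Fuse fixed cross-lingual embeddings E with encoder output H.

    Implements the reconstructed gate: g = sigmoid(E @ W1 + H @ W2 + b),
    output = g * H + (1 - g) * E. W1, W2, b are trainable and shared by
    both encoders.
    """
    g = torch.sigmoid(E @ W1 + H @ W2 + b)  # gate in (0, 1), same shape as H
    return g * H + (1.0 - g) * E            # final encoder output

# Toy shapes: n tokens, d-dim states.
n, d = 7, 512
E, H = torch.rand(n, d), torch.rand(n, d)
W1, W2, b = torch.rand(d, d), torch.rand(d, d), torch.rand(d)
out = reinforce(E, H, W1, W2, b)  # (n, d)
```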

SLIDE 14

Approach

• Local GAN

• Trained to classify between the encoding of source sentences and the encoding of target sentences

• Local discriminator loss and encoder (adversarial) loss, sketched below

(θ_{D_l}, θ_{Enc_s}, and θ_{Enc_t} represent the parameters of the local discriminator and the two encoders)
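The loss equations themselves did not survive extraction; the sketch below shows standard binary cross-entropy adversarial objectives as one plausible form, not necessarily the exact one used in the paper.

```python
import torch
import torch.nn.functional as F

def local_discriminator_loss(d_src, d_tgt):
    # D_l (parameters θ_{D_l}) learns to label source encodings 1 and
    # target encodings 0. d_src/d_tgt are D_l's sigmoid outputs.
    return (F.binary_cross_entropy(d_src, torch.ones_like(d_src)) +
            F.binary_cross_entropy(d_tgt, torch.zeros_like(d_tgt)))

def encoder_adversarial_loss(d_src, d_tgt):
    # The encoders (parameters θ_{Enc_s}, θ_{Enc_t}) are trained to fool
    # D_l, pushing the two encoding distributions toward each other.
    return (F.binary_cross_entropy(d_src, torch.zeros_like(d_src)) +
            F.binary_cross_entropy(d_tgt, torch.ones_like(d_tgt)))
```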

SLIDE 15

Approach

• Global GAN

• Aims to distinguish between the true sentences and the generated sentences

• Updates the whole parameters of the proposed model, including the parameters of the encoders and decoders

• In the first global GAN, Enc_t and Dec_s act as the generator, which generates a sentence from the target-language input

• The global discriminator, implemented based on a CNN, assesses whether the generated sentence is a true target-language sentence

SLIDE 17

Experiments Setup

• Datasets

• English-French: WMT14
  • a parallel corpus of about 30M pairs of sentences
  • selecting English sentences from 15M random pairs, and French sentences from the complementary set

• English-German: WMT16
  • two monolingual training sets of 1.8M sentences each

• Chinese-English: LDC
  • 1.6M sentence pairs randomly extracted from LDC corpora

SLIDE 18

Experiments Setup

• Tools

• Train the embeddings for each language independently using word2vec (Mikolov et al., 2013)

• Apply the public implementation of the method proposed by Artetxe et al. (2017a) to map these embeddings to a shared latent space (a sketch of the embedding step follows)
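A sketch of the embedding step with gensim's word2vec; the 512-dimension setting follows the hyper-parameter table, the toy corpus is purely illustrative, and the mapping step would then use Artetxe et al.'s public vecmap toolkit.

```python
from gensim.models import Word2Vec

# Train monolingual embeddings independently for each language.
# `vector_size` is the gensim 4.x name (older releases call it `size`);
# the tiny corpus below is illustrative only.
toy_corpus = [["the", "cat", "sat"], ["the", "dog", "ran"]]
model = Word2Vec(sentences=toy_corpus, vector_size=512, min_count=1)
vec = model.wv["cat"]  # one 512-dim word embedding
# The per-language embeddings would then be mapped into a shared space
# with Artetxe et al.'s public implementation (the vecmap toolkit).
```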

• Hyper-parameters


Parameters       Values   Parameters   Values
Word embedding   512      Beam size    4
Dropout          0.1      α            0.6
Head number      8        GPU          four K80 GPUs

SLIDE 19

Experiments Setup

• Model selection

• Stop training when the model achieves no improvement for ten consecutive evaluations on the development set (a sketch of this rule follows)

• Development set: 3000 source and target sentences extracted randomly from the monolingual training corpora
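A small sketch of that stopping rule, with train_step and evaluate as stand-ins for one training interval and one dev-set BLEU evaluation.

```python
def train_with_early_stopping(train_step, evaluate, patience=10):
    """Stop once `patience` consecutive evaluations bring no improvement."""
    best, stale = float("-inf"), 0
    while stale < patience:
        train_step()        # train until the next evaluation point
        score = evaluate()  # e.g. BLEU on the 3000-sentence dev set
        if score > best:
            best, stale = score, 0
        else:
            stale += 1
    return best
```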

• Evaluation metrics

• BLEU (Papineni et al., 2002)
• Chinese-English: the script mteval-v11b.pl
• English-German, English-French: the script multi-bleu.pl

SLIDE 20

Experiments Setup

• Baseline Systems

• Word-by-word translation (WBW)
Translates a sentence word by word, replacing each word with its nearest neighbor in the other language

• Lample et al. (2017)
Uses the same training and testing sets as this paper; encoder: Bi-LSTM; decoder: a forward LSTM

• Supervised training
The same model as ours, but trained using the standard cross-entropy loss on the original parallel sentences

SLIDE 21

Experiments Results

• Number of weight-sharing layers

• We find that the number of weight-sharing layers has a considerable effect on translation performance.

SLIDE 22

Experiments Results

• Number of weight-sharing layers

• The best translation performance is achieved when only one layer is shared in our system.

SLIDE 23

Experiments Results

• Number of weight-sharing layers

• When all four layers are shared, we get poor translation performance on all three translation tasks.

SLIDE 24

Experiments Results

• Number of weight-sharing layers

• We notice that using two completely independent encoders also results in poor translation performance.

SLIDE 25

Experiments Results

• Translation Results

• The proposed approach obtains significant improvements over the word-by-word baseline system, with at least +5.01 BLEU points in English-to-German translation and up to +13.37 BLEU points in English-to-French translation.

SLIDE 26

Experiments Results

• Translation Results

• Compared to the work of Lample et al. (2017), our model achieves up to +1.92 BLEU points improvement on the English-to-French translation task.

• However, there is still large room for improvement compared to the supervised upper bound.

SLIDE 27

Experiments Results

• Ablation Study

• The best performance is obtained with the simultaneous use of all the tested elements.

• The weight-sharing constraint is vital for mapping sentences of different languages to the shared latent space.

SLIDE 28

Experiments Results

• Ablation Study

• The embedding-reinforced encoder, directional self-attention, local GANs, and global GANs are all important components of the proposed system.

SLIDE 29

Conclusion

We propose the weight-sharing constraint in unsupervised NMT.

We also introduce embedding-reinforced encoders, a local GAN, and a global GAN into the proposed system.

The experimental results reveal that our approach achieves significant improvements.

However, there is still large room for improvement compared to supervised NMT.

SLIDE 30

Q&A

Thank you!
