
  1. Sharp Nearby, Fuzzy Far Away: How Neural Language Models Use Context
     Zhen Yang, Wei Chen, Feng Wang, Bo Xu (Chinese Academy of Sciences)
     Jing Ye, NLP Lab, Soochow University

  2. Outline
     - Task Definition
     - Challenges and Solution
     - Approach
     - Experiments
     - Conclusion

  3. Task Definition
     - NMT
       - Unsupervised neural machine translation (NMT) is an approach to machine
         translation that uses no labeled (parallel) data during training.
     - Translation tasks
       - English-German
       - English-French
       - Chinese-English

  4. Challenges and Solution
     - Challenges
       - Only one encoder is shared by the source and target languages.
       - A single shared encoder is weak at preserving the uniqueness and internal
         characteristics of each language (style, terminology, sentence structure).
     - Solution
       - Two independent encoders and two independent decoders, one for each language.
       - A weight-sharing constraint on the two auto-encoders (AEs).
       - Two different kinds of generative adversarial networks (GANs).

  5. Approach: Model Architecture
     - Encoders: Enc_s, Enc_t
     - Decoders: Dec_s, Dec_t
     - Local discriminator: D_l
     - Global discriminators: D_g1, D_g2

  6. Approach: Model Architecture
     - Encoders Enc_s, Enc_t: each encoder is composed of a stack of four identical
       layers. Each layer consists of a multi-head self-attention sub-layer and a
       simple position-wise fully connected feed-forward network.
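
A minimal PyTorch sketch of one such encoder layer (not the authors' code; the layer sizes, dropout rate and feed-forward width are illustrative assumptions):

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One Transformer encoder layer: multi-head self-attention + position-wise FFN."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads,
                                               dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, attn_mask=None):
        # Multi-head self-attention sub-layer with residual connection.
        attn_out, _ = self.self_attn(x, x, x, attn_mask=attn_mask)
        x = self.norm1(x + self.dropout(attn_out))
        # Position-wise feed-forward sub-layer with residual connection.
        x = self.norm2(x + self.dropout(self.ffn(x)))
        return x

# The encoder is a stack of four identical layers.
encoder = nn.ModuleList([EncoderLayer() for _ in range(4)])
```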

  7. Approach: Model Architecture
     - Decoders Dec_s, Dec_t: each decoder is also composed of four identical layers.
       In addition to the two sub-layers of each encoder layer, the decoder inserts a
       third sub-layer, which performs multi-head attention over the output of the
       encoder stack.
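
PyTorch's built-in nn.TransformerDecoderLayer already bundles exactly these three sub-layers (masked self-attention, cross-attention over the encoder output, and a feed-forward network), so a four-layer stack can be sketched directly; sizes are assumptions, not taken from the slides:

```python
import torch.nn as nn

# Each decoder layer = masked self-attention + multi-head attention over the
# encoder output (cross-attention) + position-wise feed-forward network.
decoder_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8,
                                           dim_feedforward=2048,
                                           dropout=0.1, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=4)

# Usage (hypothetical tensors): decoder(tgt_embeddings, encoder_output, tgt_mask=causal_mask)
```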

  8. Approach: Model Architecture
     - Local discriminator D_l: a multi-layer perceptron (MLP).
     - Global discriminators D_g1, D_g2: convolutional neural networks (CNNs).
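
A hedged sketch of what the two discriminator shapes could look like: a small MLP scoring pooled encoder states, and a CNN text classifier scoring whole sentences. The pooling choice, filter widths and layer sizes are invented for illustration and are not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class LocalDiscriminator(nn.Module):
    """MLP that scores a (pooled) encoder representation as source vs. target."""
    def __init__(self, d_model=512, d_hidden=1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_hidden), nn.LeakyReLU(0.2),
            nn.Linear(d_hidden, d_hidden), nn.LeakyReLU(0.2),
            nn.Linear(d_hidden, 1), nn.Sigmoid())

    def forward(self, enc_states):          # (batch, seq, d_model)
        pooled = enc_states.mean(dim=1)     # mean-pool over time (an assumption)
        return self.mlp(pooled)             # probability in (0, 1)

class GlobalDiscriminator(nn.Module):
    """CNN text classifier scoring a sentence (as embeddings) as real vs. generated."""
    def __init__(self, d_emb=512, n_filters=64, widths=(3, 4, 5)):
        super().__init__()
        self.convs = nn.ModuleList([nn.Conv1d(d_emb, n_filters, w) for w in widths])
        self.out = nn.Sequential(nn.Linear(n_filters * len(widths), 1), nn.Sigmoid())

    def forward(self, sent_emb):            # (batch, seq, d_emb)
        x = sent_emb.transpose(1, 2)        # Conv1d expects (batch, channels, seq)
        feats = [conv(x).max(dim=2).values for conv in self.convs]
        return self.out(torch.cat(feats, dim=1))
```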

  9. Approach: Model Architecture (figure: overall architecture of the model)

  10. Approach: Directional Self-Attention
     - The forward mask M_f: a later token only makes attention connections to the
       earlier tokens in the sequence.
     - The backward mask M_b: the mirror image, in which an earlier token only attends
       to the later tokens in the sequence.
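
A small sketch of how the two masks might be built (PyTorch boolean-mask convention assumed: True marks a position attention must not look at):

```python
import torch

def directional_masks(seq_len):
    """Forward mask M_f: token i may attend only to positions j <= i.
    Backward mask M_b: token i may attend only to positions j >= i."""
    base = torch.ones(seq_len, seq_len)
    m_f = torch.triu(base, diagonal=1).bool()   # block strictly-future positions
    m_b = torch.tril(base, diagonal=-1).bool()  # block strictly-past positions
    return m_f, m_b

m_f, m_b = directional_masks(5)
# m_f[2] == [False, False, False, True, True]  -> token 2 only sees tokens 0..2
```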

  11. Approach: Weight Sharing
     - Share the weights of the last few layers of Enc_s and Enc_t, which are
       responsible for extracting high-level representations of the input sentences.
     - Share the weights of the first few layers of Dec_s and Dec_t, which are expected
       to decode the high-level representations that are vital for reconstructing the
       input sentences.
     - The shared weights map the hidden features extracted by the independent weights
       into the shared latent space.
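
One simple way to realize this constraint is to let both encoders hold the very same module object for the shared top layer(s), so its parameters and gradients are automatically shared. A minimal sketch under that assumption, with nn.TransformerEncoderLayer standing in for the encoder layer:

```python
import torch.nn as nn

def layer():
    # Stand-in for one encoder layer (self-attention + feed-forward).
    return nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)

def build_encoders(n_layers=4, n_shared=1):
    """Two encoders: independent lower layers, shared top layer(s).
    Appending the same module objects to both lists means those
    parameters (and their gradients) are literally shared."""
    shared_top = [layer() for _ in range(n_shared)]
    enc_s = nn.ModuleList([layer() for _ in range(n_layers - n_shared)] + shared_top)
    enc_t = nn.ModuleList([layer() for _ in range(n_layers - n_shared)] + shared_top)
    return enc_s, enc_t

enc_s, enc_t = build_encoders()
assert enc_s[-1] is enc_t[-1]   # the last layer is one and the same module
```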

  12. Approach: Embedding-Reinforced Encoder
     - Pretrained cross-lingual embeddings are used in the encoders and kept fixed
       during training.
     - E = (e_1, ..., e_n): the input sequence of embedding vectors.
     - H_t = (h_1, ..., h_n): the initial output sequence of the encoder stack.
     - H: the final output sequence of the encoder, obtained by mixing E and H_t with a
       gate unit g (W_1, W_2 and b are trainable parameters shared by the two encoders).
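
The gate equation itself did not survive the slide export. A minimal sketch assuming the common gated-mixing form H = g ⊙ H_t + (1 − g) ⊙ E with g = σ(W_1 E + W_2 H_t + b); treat the exact arrangement of W_1, W_2 and b as an approximation of the paper's formula:

```python
import torch
import torch.nn as nn

class EmbeddingReinforcedOutput(nn.Module):
    """Gate that mixes the fixed cross-lingual embeddings E with the encoder-stack
    output H_t to produce the final encoder output H.
    W_1, W_2 and b are trainable and shared by the two encoders."""
    def __init__(self, d_model=512):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_model, bias=False)   # W_1
        self.w2 = nn.Linear(d_model, d_model, bias=True)    # W_2 and b

    def forward(self, E, H_t):
        g = torch.sigmoid(self.w1(E) + self.w2(H_t))   # gate unit g
        return g * H_t + (1.0 - g) * E                 # assumed mixing form
```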

  14. Approach: Local GAN
     - The local discriminator classifies between the encodings of source sentences and
       the encodings of target sentences.
     - Local discriminator loss: θ_Dl, θ_Encs and θ_Enct denote the parameters of the
       local discriminator and the two encoders.
     - Encoder (adversarial) loss: the encoders are trained to fool the local
       discriminator.
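
The loss formulas are missing from the export. In standard GAN form (an assumed reconstruction, not a quote from the paper), the local discriminator and the adversarial encoder objectives would look roughly like:

```latex
% Local discriminator: tell source encodings (label 1) from target encodings (label 0).
\mathcal{L}_{D_l}\!\left(\theta_{D_l}\mid\theta_{Enc_s},\theta_{Enc_t}\right)
  = -\,\mathbb{E}_{x_s}\!\left[\log D_l\!\big(Enc_s(x_s)\big)\right]
    -\,\mathbb{E}_{x_t}\!\left[\log\!\big(1 - D_l(Enc_t(x_t))\big)\right]

% Encoders: fool the local discriminator so the two encodings become indistinguishable.
\mathcal{L}_{Enc}\!\left(\theta_{Enc_s},\theta_{Enc_t}\mid\theta_{D_l}\right)
  = -\,\mathbb{E}_{x_t}\!\left[\log D_l\!\big(Enc_t(x_t)\big)\right]
    -\,\mathbb{E}_{x_s}\!\left[\log\!\big(1 - D_l(Enc_s(x_s))\big)\right]
```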

  15. Approach: Global GAN
     - The global GANs aim to distinguish between true sentences and generated sentences.
     - They update the whole set of parameters of the proposed model, including the
       parameters of the encoders and decoders.
     - In GAN_g1, Enc_t and Dec_s act as the generator, which generates the sentence
       x̃_s from x_t.
     - D_g1, implemented with a CNN, assesses whether the generated sentence x̃_s is a
       true sentence or a generated one.
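
Again as an assumed reconstruction of the lost formulas: with \tilde{x}_s = Dec_s(Enc_t(x_t)), the global discriminator D_{g1} and its generator could be trained with objectives of roughly this shape:

```latex
% D_{g1}: real sentences x_s vs. generated sentences \tilde{x}_s = Dec_s(Enc_t(x_t)).
\mathcal{L}_{D_{g1}} = -\,\mathbb{E}_{x_s}\!\left[\log D_{g1}(x_s)\right]
                       -\,\mathbb{E}_{x_t}\!\left[\log\!\big(1 - D_{g1}(\tilde{x}_s)\big)\right]

% Generator (Enc_t and Dec_s): produce \tilde{x}_s that D_{g1} accepts as real.
\mathcal{L}_{G_1} = -\,\mathbb{E}_{x_t}\!\left[\log D_{g1}(\tilde{x}_s)\right]
```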

  17. Experiments: Setup
     - Datasets
       - English-French: WMT14, a parallel corpus of about 30M sentence pairs; the
         English sentences are taken from 15M randomly selected pairs and the French
         sentences from the complementary set.
       - English-German: WMT16, two monolingual training sets of 1.8M sentences each.
       - Chinese-English: LDC, 1.6M sentence pairs randomly extracted from the LDC
         corpora.

  18. Experiments: Setup
     - Tools
       - Train the embeddings for each language independently with word2vec
         (Mikolov et al., 2013).
       - Apply the public implementation of the method proposed by Artetxe et al.
         (2017a) to map these embeddings into a shared latent space.
     - Hyper-parameters
       - Word embedding size: 512
       - Beam size: 4
       - Dropout: 0.1
       - α: 0.6
       - Number of attention heads: 8
       - Hardware: four K80 GPUs
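
A sketch of the embedding step using gensim's word2vec implementation (a tooling assumption; the slides only name word2vec). The cross-lingual mapping of Artetxe et al. (2017a) is done with their released tool, so it is only indicated as a comment, and the toy corpora below are placeholders:

```python
from gensim.models import Word2Vec

# Monolingual corpora, one tokenized sentence per item (toy placeholders).
en_sentences = [["the", "cat", "sits", "on", "the", "mat"], ["a", "dog", "runs"]]
de_sentences = [["die", "katze", "sitzt", "auf", "der", "matte"], ["ein", "hund", "rennt"]]

# Train 512-dimensional embeddings for each language independently
# (vector_size is the gensim 4.x argument name; older versions use size=).
emb_en = Word2Vec(en_sentences, vector_size=512, window=5, min_count=1, sg=1)
emb_de = Word2Vec(de_sentences, vector_size=512, window=5, min_count=1, sg=1)

emb_en.wv.save_word2vec_format("en.vec")
emb_de.wv.save_word2vec_format("de.vec")
# The two .vec files are then mapped into a shared space with the public
# implementation of Artetxe et al. (2017a).
```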

  19. Experiments: Setup
     - Model selection
       - Stop training when the model shows no improvement for ten consecutive
         evaluations on the development set.
       - Development set: 3,000 source and target sentences extracted randomly from the
         monolingual training corpora.
     - Evaluation metric
       - BLEU (Papineni et al., 2002).
       - Chinese-English: the script mteval-v11b.pl.
       - English-German, English-French: the script multi-bleu.pl.
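
A small sketch of the model-selection rule as described: evaluate periodically on the development set and stop once ten evaluations bring no improvement. sacrebleu stands in for the perl scripts named above (an assumption); `dev_bleu` and `should_stop` are hypothetical helper names:

```python
import sacrebleu

def dev_bleu(hypotheses, references):
    """Corpus BLEU on the development set (sacrebleu as a stand-in for
    mteval-v11b.pl / multi-bleu.pl)."""
    return sacrebleu.corpus_bleu(hypotheses, [references]).score

def should_stop(history, patience=10):
    """Stop when the last `patience` evaluations brought no improvement
    over the best development BLEU seen before them."""
    if len(history) <= patience:
        return False
    best_before = max(history[:-patience])
    return max(history[-patience:]) <= best_before

# Example: dev BLEU plateaus, so after ten flat evaluations training stops.
history = [3.1, 5.4, 7.8, 8.2] + [8.1] * 10
print(should_stop(history))   # True
```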

  20. Experiments: Setup
     - Baseline systems
       - Word-by-word translation (WBW): translates a sentence word by word, replacing
         each word with its nearest neighbor in the other language.
       - Lample et al. (2017): uses the same training and testing sets as this paper;
         encoder: a Bi-LSTM; decoder: a forward LSTM.
       - Supervised training: the same model as ours, but trained with the standard
         cross-entropy loss on the original parallel sentences.
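
A tiny numpy sketch of the WBW baseline: with both vocabularies embedded in a shared cross-lingual space, each source word is replaced by its nearest target-language neighbor under cosine similarity. The vectors below are invented toy values:

```python
import numpy as np

def word_by_word(sentence, src_emb, tgt_emb, tgt_words):
    """Translate by replacing each source word with its nearest neighbor
    (cosine similarity) among the target-language embeddings."""
    T = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    out = []
    for w in sentence.split():
        v = src_emb[w]
        v = v / np.linalg.norm(v)
        out.append(tgt_words[int(np.argmax(T @ v))])
    return " ".join(out)

# Toy shared cross-lingual space (values invented for the example).
src_emb = {"cat": np.array([1.0, 0.1]), "dog": np.array([0.1, 1.0])}
tgt_words = ["Katze", "Hund"]
tgt_emb = np.array([[0.9, 0.2], [0.2, 0.9]])

print(word_by_word("cat dog", src_emb, tgt_emb, tgt_words))  # Katze Hund
```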

  21. Experiments: Results
     - Number of weight-sharing layers
       - We find that the number of weight-sharing layers has a strong effect on
         translation performance.

  22. Experiments: Results
     - Number of weight-sharing layers
       - The best translation performance is achieved when only one layer is shared in
         our system.

  23. Experiments: Results
     - Number of weight-sharing layers
       - When all four layers are shared, we get poor translation performance on all
         three translation tasks.

  24. Experiments: Results
     - Number of weight-sharing layers
       - We also notice that using two completely independent encoders results in poor
         translation performance.

  25. Experiments: Results
     - Translation results
       - The proposed approach obtains significant improvements over the word-by-word
         baseline system, with at least +5.01 BLEU points on English-to-German
         translation and up to +13.37 BLEU points on English-to-French translation.

  26. Experiments: Results
     - Translation results
       - Compared to the work of Lample et al. (2017), our model achieves up to +1.92
         BLEU points of improvement on the English-to-French translation task.
       - However, there is still large room for improvement compared to the supervised
         upper bound.

  27. Experiments: Results
     - Ablation study
       - The best performance is obtained with the simultaneous use of all the tested
         elements.
       - The weight-sharing constraint is vital for mapping sentences of different
         languages into the shared latent space.

  28. Experiments: Results
     - Ablation study
       - The embedding-reinforced encoder, directional self-attention, local GANs and
         global GANs are all important components of the proposed system.

  29. Conclusion
     - We propose a weight-sharing constraint for unsupervised NMT.
     - We also introduce embedding-reinforced encoders, a local GAN and global GANs into
       the proposed system.
     - The experimental results show that our approach achieves significant improvements.
     - However, there is still large room for improvement compared to supervised NMT.

  30. Q&A Thank you! 30 NLPLab, Soochow University
