Unsupervised Neural Machine Translation with Weight Sharing
Zhen Yang, Wei Chen, Feng Wang, Bo Xu (Chinese Academy of Sciences)
Presenter: Jing Ye, NLP Lab, Soochow University
Outline
Task Definition
Challenges and Solution
Approach
Experiments
Conclusion
Task Definition
NMT
Unsupervised neural machine translation (NMT) is an approach to machine translation that uses no labeled (parallel) data during training.
Translation tasks:
English-German
English-French
Chinese-English
Challenges and Solution
Challenges
Only one encoder is shared by the source and target languages.
A single shared encoder is weak at preserving the uniqueness and internal characteristics of each language (style, terminology, sentence structure).
Solution
An independent encoder and decoder for each language
A weight-sharing constraint on the two auto-encoders (AEs)
Two different kinds of generative adversarial networks (GANs): a local GAN and a global GAN
Approach
Model Architecture
Encoders: Enc_s, Enc_t; Decoders: Dec_s, Dec_t; Local discriminator: D_l; Global discriminators: D_g1, D_g2
Approach
Model Architecture
Encoders: Enc_s, Enc_t
The encoder is composed of a stack of four identical layers. Each layer consists of a multi-head self-attention sub-layer and a simple position-wise fully connected feed-forward network.
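As an illustration only, here is a minimal sketch of such an encoder stack using PyTorch's built-in transformer layers; the model dimension 512, 8 heads, and dropout 0.1 follow the hyper-parameters reported later, while the feed-forward width is left at PyTorch's default and is our assumption:

```python
import torch
import torch.nn as nn

# One encoder layer: multi-head self-attention + position-wise feed-forward.
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, dropout=0.1)
# The encoder is a stack of four identical layers.
encoder = nn.TransformerEncoder(layer, num_layers=4)

src = torch.randn(20, 2, 512)  # (sequence length, batch, model dim)
memory = encoder(src)          # encoded representations, same shape
```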
Approach
Model Architecture
Decoders: Dec_s, Dec_t
The decoder is also composed of a stack of four identical layers. In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack.
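A matching sketch for the decoder stack, under the same assumptions as the encoder sketch above:

```python
import torch
import torch.nn as nn

# One decoder layer: masked self-attention, encoder-decoder attention,
# and a position-wise feed-forward network (three sub-layers).
layer = nn.TransformerDecoderLayer(d_model=512, nhead=8, dropout=0.1)
decoder = nn.TransformerDecoder(layer, num_layers=4)

tgt = torch.randn(15, 2, 512)     # shifted target embeddings
memory = torch.randn(20, 2, 512)  # output of the encoder stack
out = decoder(tgt, memory)        # (15, 2, 512)
```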
Approach
Model Architecture
Local discriminator: D_l
a multi-layer perceptron (MLP)
Global discriminators: D_g1, D_g2
convolutional neural networks (CNNs)
Approach
Model Architecture
[Figure: overall architecture of the proposed model]
Approach
Directional Self-attention
The forward mask: a later token only makes attention connections to earlier tokens in the sequence.
The backward mask: an earlier token only makes attention connections to later tokens in the sequence.
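A minimal sketch of how these directional masks can be constructed (our illustration, not the authors' code; positions set to -inf are excluded from attention, following the usual additive-mask convention):

```python
import torch

def directional_masks(n: int):
    """Return (forward, backward) additive attention masks for length n."""
    neg_inf = float("-inf")
    # Forward mask: token i may attend only to positions j <= i,
    # i.e. a later token connects only to earlier tokens.
    forward = torch.full((n, n), neg_inf).triu(diagonal=1)
    # Backward mask: token i may attend only to positions j >= i,
    # i.e. an earlier token connects only to later tokens.
    backward = torch.full((n, n), neg_inf).tril(diagonal=-1)
    return forward, backward

fw, bw = directional_masks(4)  # add to attention logits before softmax
```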
Approach
Weight sharing
Sharing the weights of the last few layers of Enc_s and Enc_t, which are responsible for extracting high-level representations of the input sentences
Sharing the weights of the first few layers of Dec_s and Dec_t, which are expected to decode the high-level representations that are vital for reconstructing the input sentences
The shared weights are utilized to map the hidden features extracted by the independent weights into the shared latent space (see the sketch below)
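One simple way to realize this constraint, sketched with hypothetical structure of our own: let both encoders hold the same module object for the shared positions, so gradients from either language update the same parameters:

```python
import torch.nn as nn

def make_layer():
    return nn.TransformerEncoderLayer(d_model=512, nhead=8, dropout=0.1)

# Three independent lower layers per language plus one shared top layer:
# both stacks reference the *same* module object, so its weights are tied.
shared_top = make_layer()
enc_s = nn.ModuleList([make_layer() for _ in range(3)] + [shared_top])
enc_t = nn.ModuleList([make_layer() for _ in range(3)] + [shared_top])

def encode(stack, x):
    # The last (shared) layer maps language-specific features
    # into the shared latent space.
    for layer in stack:
        x = layer(x)
    return x
```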
Approach
Embedding reinforced encoder
Pretrained cross-lingual embeddings are used in the encoders and kept fixed during training.
E = (e_1, ..., e_n): the input sequence embedding vectors
H = (h_1, ..., h_n): the initial output sequence of the encoder stack
H_r = g ⊙ H + (1 - g) ⊙ E: the final output sequence of the encoder
g = σ(W_1 H + W_2 E + b): a gate unit
(W_1, W_2 and b are trainable parameters, shared by the two encoders)
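A minimal sketch of this gated combination (our own illustration; the dimension 512 matches the reported word-embedding size):

```python
import torch
import torch.nn as nn

class ReinforcedOutput(nn.Module):
    """Gated mix of encoder output H and fixed input embeddings E."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.w1 = nn.Linear(dim, dim, bias=False)
        self.w2 = nn.Linear(dim, dim, bias=False)
        self.b = nn.Parameter(torch.zeros(dim))

    def forward(self, h, e):
        g = torch.sigmoid(self.w1(h) + self.w2(e) + self.b)  # gate unit g
        return g * h + (1 - g) * e  # final encoder output H_r

mix = ReinforcedOutput()
h = torch.randn(20, 2, 512)  # initial output of the encoder stack
e = torch.randn(20, 2, 512)  # fixed cross-lingual embeddings
out = mix(h, e)
```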
Approach
Local GAN
The local discriminator D_l is trained to classify between the encodings of source sentences and the encodings of target sentences.
Local discriminator loss and encoders' adversarial loss: D_l learns to tell the two encodings apart, while the two encoders are trained to fool it (see the sketch below).
(θ_D_l, θ_Enc_s and θ_Enc_t represent the parameters of the local discriminator and the two encoders.)
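A sketch of these adversarial objectives, assuming the standard GAN cross-entropy formulation (the exact losses in the paper may differ in detail):

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()

def local_gan_losses(d_l, enc_src, enc_tgt):
    """enc_src/enc_tgt: pooled encodings of source/target sentences."""
    real = torch.ones(enc_src.size(0), 1)
    fake = torch.zeros(enc_tgt.size(0), 1)
    # Discriminator loss: label source encodings 1, target encodings 0.
    d_loss = bce(d_l(enc_src.detach()), real) + bce(d_l(enc_tgt.detach()), fake)
    # Encoder loss: swap the labels so the two encodings become
    # indistinguishable to the discriminator.
    e_loss = bce(d_l(enc_src), fake) + bce(d_l(enc_tgt), real)
    return d_loss, e_loss
```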
Approach
Global GAN
Aims to distinguish between true sentences and generated sentences.
Updates the whole parameter set of the proposed model, including the parameters of the encoders and decoders.
In GAN_g1, Enc_t and Dec_s act as the generator, which generates the sentence x̃_s from the target-language input x_t.
The discriminator D_g1, implemented as a CNN, assesses whether the generated sentence x̃_s is a true source-language sentence.
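For illustration, a minimal CNN-based sentence discriminator of the kind described; the channel count, kernel size, and max-over-time pooling are our assumptions:

```python
import torch
import torch.nn as nn

class CNNDiscriminator(nn.Module):
    """Scores whether a sentence looks like a true sentence of a language."""
    def __init__(self, dim: int = 512, channels: int = 64):
        super().__init__()
        self.conv = nn.Conv1d(dim, channels, kernel_size=3, padding=1)
        self.out = nn.Linear(channels, 1)

    def forward(self, emb):
        # emb: (batch, length, dim) embeddings of a (generated) sentence
        h = torch.relu(self.conv(emb.transpose(1, 2)))  # (batch, channels, length)
        h = h.max(dim=2).values                         # max-over-time pooling
        return self.out(h)                              # real/fake logit

d_g1 = CNNDiscriminator()
logit = d_g1(torch.randn(2, 20, 512))
```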
Experiments Setup
Datasets
English-French: WMT14
- a parallel corpus of about 30M sentence pairs
- English sentences are selected from 15M random pairs, and French sentences from the complementary set, so the two monolingual corpora contain no parallel sentences
English-German: WMT16
- two monolingual training sets of 1.8M sentences each
Chinese-English: LDC
- 1.6M sentence pairs randomly extracted from LDC corpora
Experiments Setup
Tools
Train the embeddings for each language independently using word2vec (Mikolov et al., 2013).
Apply the public implementation of the method of Artetxe et al. (2017a) to map these embeddings into a shared latent space.
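For illustration, monolingual embeddings of the reported size can be trained with gensim's word2vec implementation (a sketch; the authors' exact training settings beyond the 512-dimensional embedding size are not given):

```python
from gensim.models import Word2Vec

# Each language's corpus: an iterable of tokenized sentences (toy data here).
sentences = [["ein", "kleiner", "satz"], ["noch", "ein", "satz"]]

model = Word2Vec(
    sentences,
    vector_size=512,  # matches the reported word-embedding size
    window=5,
    min_count=1,
    sg=1,             # skip-gram, as introduced by Mikolov et al. (2013)
)
vec = model.wv["satz"]  # 512-dimensional embedding
```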
Hyper-parameters
Parameter: Value
Word embedding: 512
Dropout: 0.1
Head number: 8
Beam size: 4
α: 0.6
GPU: four K80 GPUs
Experiments Setup
Model selection
Stop training when the model shows no improvement for ten consecutive evaluations on the development set (see the sketch below).
Development set: 3,000 source and target sentences extracted randomly from the monolingual training corpora.
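A sketch of this stopping rule with a patience of ten evaluations; train_step, evaluate and save_checkpoint are hypothetical placeholders, not functions from the paper:

```python
def train_with_early_stopping(model, train_step, evaluate, save_checkpoint,
                              patience: int = 10):
    best, bad_evals = float("-inf"), 0
    while bad_evals < patience:
        train_step(model)        # one round of training between evaluations
        score = evaluate(model)  # e.g. BLEU on the development set
        if score > best:
            best, bad_evals = score, 0
            save_checkpoint(model)  # keep the best model so far
        else:
            bad_evals += 1          # no improvement at this evaluation
    return best
```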
Evaluation metrics
BLEU (Papineni et al., 2002)
Chinese-English: the script mteval-v11b.pl
English-German, English-French: the script multi-bleu.pl
Experiments Setup
Baseline Systems
Word-by-word translation (WBW)
It translates a sentence word-by-word, replacing each word with its nearest neighbor in the other language
Lample et al. (2017)
Uses the same training and testing sets as this paper; Encoder: a Bi-LSTM; Decoder: a forward LSTM
Supervised training
The same model as ours, but trained using the standard cross-entropy loss on the original parallel sentences
Experiments Results
Number of the weight-sharing layers
We find that the number of weight-sharing layers has a strong effect on translation performance.
Experiments Results
Number of the weight-sharing layers
The best translation performance is achieved when only one layer is shared in our system.
Experiments Results
Number of the weight-sharing layers
When all four layers are shared, we get poor translation performance on all three translation tasks.
Experiments Results
Number of the weight-sharing layers
We notice that using two completely independent encoders results in poor translation performance too.
Experiments Results
Translation Results
The proposed approach obtains significant improvements over the word-by-word baseline system: at least +5.01 BLEU points on English-to-German translation and up to +13.37 BLEU points on English-to-French translation.
Experiments Results
Translation Results
Compared to the work of Lample et al. (2017), our model achieves up to +1.92 BLEU points of improvement on the English-to-French translation task.
However, there is still large room for improvement compared to the supervised upper bound.
Experiments Results
Ablation Study
The best performance is obtained with the simultaneous use of all the tested elements.
The weight-sharing constraint is vital for mapping sentences of different languages into the shared latent space.
Experiments Results
Ablation Study
The embedding-reinforced encoder, directional self-attention, local GAN and global GAN are all important components of the proposed system.
Conclusion
We propose the weight-sharing constraint in unsupervised NMT.
We also introduce embedding-reinforced encoders, a local GAN and a global GAN into the proposed system.
The experimental results show that our approach achieves significant improvements.
However, there is still large room for improvement compared to supervised NMT.
Q&A
Thank you!