Sharp Nearby, Fuzzy Far Away: How Neural Language Models Use Context - PowerPoint PPT Presentation



SLIDE 1

Sharp Nearby, Fuzzy Far Away: How Neural Language Models Use Context

Zhen Yang, Wei Chen, Feng Wang, Bo Xu

Chinese Academy of Sciences

Jing Ye

NLP Lab, Soochow University

SLIDE 2

Outline


• Task Definition
• Challenges and Solution
• Approach
• Experiments
• Conclusion

SLIDE 3

Task Definition

• Unsupervised NMT

Unsupervised neural machine translation (NMT) is an approach to machine translation that uses no labeled (parallel) data during training.

• Translation Tasks

• English-German
• English-French
• Chinese-English

SLIDE 4

Challenges and Solution

• Challenges

• Only one encoder is shared by the source and target languages

• A shared encoder is weak at preserving the uniqueness and internal characteristics of each language (style, terminology, sentence structure)

• Solution

• Two independent encoders and decoders, one for each language

• A weight-sharing constraint on the two auto-encoders (AEs)

• Two different generative adversarial networks (GANs)

SLIDE 5

Approach

• Model Architecture

• Encoders: Enc_s, Enc_t
• Decoders: Dec_s, Dec_t
• Local discriminator: D_l
• Global discriminators: D_g1, D_g2

SLIDE 6

Approach

• Model Architecture

• Encoders: Enc_s, Enc_t

The encoder is composed of a stack of four identical layers. Each layer consists of a multi-head self-attention sub-layer and a simple position-wise fully connected feed-forward network.
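Below is a minimal PyTorch sketch of such an encoder stack (the deck itself gives no code): d_model=512, 8 heads, and dropout 0.1 follow the hyper-parameter table on a later slide, while the feed-forward inner size of 2048 is an assumption.

```python
import torch
import torch.nn as nn

# Minimal sketch of the four-layer Transformer encoder described above.
# d_model=512, nhead=8, dropout=0.1 follow the hyper-parameter table on
# slide 18; dim_feedforward=2048 is an assumption.
encoder_layer = nn.TransformerEncoderLayer(
    d_model=512,           # hidden / embedding size
    nhead=8,               # attention heads per layer
    dim_feedforward=2048,  # position-wise FFN inner size (assumed)
    dropout=0.1,
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=4)

src = torch.rand(20, 1, 512)  # (seq_len, batch, d_model)
memory = encoder(src)         # contextualized states, same shape
```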

SLIDE 7

Approach

• Model Architecture

• Decoders: Dec_s, Dec_t

The decoder is composed of four identical layers. In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack.
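A matching PyTorch sketch of the decoder side, under the same assumed dimensions; memory stands in for the encoder-stack output that the third sub-layer attends over.

```python
import torch
import torch.nn as nn

# Sketch of the four-layer decoder: nn.TransformerDecoderLayer contains the
# masked self-attention, the cross-attention over the encoder output (the
# third sub-layer mentioned above), and the feed-forward sub-layer.
decoder_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8, dropout=0.1)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=4)

memory = torch.rand(20, 1, 512)  # encoder-stack output
tgt = torch.rand(15, 1, 512)     # shifted target embeddings
out = decoder(tgt, memory)       # (15, 1, 512)
```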

SLIDE 8

Approach

• Model Architecture

• Local discriminator D_l: a multi-layer perceptron (MLP)

• Global discriminators D_g1, D_g2: convolutional neural networks (CNNs)
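Hedged PyTorch sketches of the two discriminator types: the slides state only "MLP" and "CNN", so every layer size below is an assumption.

```python
import torch
import torch.nn as nn

# Local discriminator D_l: an MLP that scores a 512-dim encoder state.
local_discriminator = nn.Sequential(
    nn.Linear(512, 256),
    nn.LeakyReLU(0.2),
    nn.Linear(256, 1),
    nn.Sigmoid(),  # probability the state comes from the source encoder
)

# Global discriminator: a CNN that scores a whole (embedded) sentence.
class GlobalDiscriminator(nn.Module):
    def __init__(self, d_model=512, n_filters=64):
        super().__init__()
        self.conv = nn.Conv1d(d_model, n_filters, kernel_size=3, padding=1)
        self.fc = nn.Linear(n_filters, 1)

    def forward(self, sent):              # sent: (batch, d_model, seq_len)
        h = torch.relu(self.conv(sent))   # (batch, n_filters, seq_len)
        h = h.max(dim=2).values           # max-over-time pooling
        return torch.sigmoid(self.fc(h))  # probability of being real
```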

SLIDE 9

Approach

• Model Architecture

SLIDE 10

Approach

• Directional Self-attention

• The forward mask: a later token only makes attention connections to the earlier tokens in the sequence

• The backward mask: the mirror image, where an earlier token only attends to later tokens (see the mask sketch below)
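A small sketch of the two masks as boolean attention masks, following PyTorch's attn_mask convention (True marks a disallowed connection); the backward mask is simply the transpose of the forward one.

```python
import torch

seq_len = 5
# Forward mask: token i may attend only to positions j <= i (earlier tokens),
# so everything strictly above the diagonal is masked out.
forward_mask = torch.triu(
    torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1
)
# Backward mask: the mirror image, masking everything strictly below the
# diagonal, so token i may attend only to positions j >= i.
backward_mask = forward_mask.T
```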

SLIDE 11

Approach

• Weight sharing

• Sharing the weights of the last few layers of Enc_s and Enc_t, which are responsible for extracting high-level representations of the input sentences

• Sharing the weights of the first few layers of Dec_s and Dec_t, which are expected to decode the high-level representations that are vital for reconstructing the input sentences

• The shared weights are utilized to map the hidden features extracted by the independent weights to the shared latent space (a minimal sketch follows)
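A minimal sketch of one way to realize the constraint, assuming a single shared encoder layer (the setting the experiments below find best): the same layer object is reused by both encoders, so one set of weights receives gradients from both languages.

```python
import torch.nn as nn

def private_layers(n):
    # Independent (non-shared) encoder layers for one language.
    return [nn.TransformerEncoderLayer(d_model=512, nhead=8) for _ in range(n)]

# One shared top layer, reused by both encoders.
shared_top = nn.TransformerEncoderLayer(d_model=512, nhead=8)

enc_s = nn.ModuleList(private_layers(3) + [shared_top])  # source encoder
enc_t = nn.ModuleList(private_layers(3) + [shared_top])  # target encoder

assert enc_s[-1] is enc_t[-1]  # a single set of weights, updated by both
```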

SLIDE 12

Approach

• Embedding reinforced encoder

• Using pretrained cross-lingual embeddings in the encoders that are kept fixed during training

• E = (e_1, ..., e_n): the input sequence embedding vectors
• H = (h_1, ..., h_n): the initial output sequence of the encoder stack
• H_r = g ⊙ H + (1 − g) ⊙ E: the final output sequence of the encoder
• g = σ(W_1 E + W_2 H + b) is a gate unit

(W_1, W_2, and b are trainable parameters and they are shared by the two encoders)
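A sketch of the gate unit as reconstructed above; the original symbols were lost in extraction, so the exact functional form should be treated as an assumption.

```python
import torch

def reinforce(E, H, W1, W2, b):
    """Fuse fixed cross-lingual embeddings E with encoder output H.

    Implements the reconstructed gate: g = sigmoid(E @ W1 + H @ W2 + b),
    output = g * H + (1 - g) * E. W1, W2, b are trainable and shared by
    both encoders.
    """
    g = torch.sigmoid(E @ W1 + H @ W2 + b)  # gate in (0, 1), same shape as H
    return g * H + (1.0 - g) * E            # final encoder output

# Toy shapes: n tokens, d-dim states.
n, d = 7, 512
E, H = torch.rand(n, d), torch.rand(n, d)
W1, W2, b = torch.rand(d, d), torch.rand(d, d), torch.rand(d)
out = reinforce(E, H, W1, W2, b)  # (n, d)
```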

SLIDE 14

Approach

• Local GAN

• Trained to classify between the encoding of source sentences and the encoding of target sentences

• Local discriminator loss and encoder (adversarial) loss, sketched below

(θ_{D_l}, θ_{Enc_s}, and θ_{Enc_t} represent the parameters of the local discriminator and the two encoders)
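The loss equations themselves did not survive extraction; the sketch below shows standard binary cross-entropy adversarial objectives as one plausible form, not necessarily the exact one used in the paper.

```python
import torch
import torch.nn.functional as F

def local_discriminator_loss(d_src, d_tgt):
    # D_l (parameters θ_{D_l}) learns to label source encodings 1 and
    # target encodings 0. d_src/d_tgt are D_l's sigmoid outputs.
    return (F.binary_cross_entropy(d_src, torch.ones_like(d_src)) +
            F.binary_cross_entropy(d_tgt, torch.zeros_like(d_tgt)))

def encoder_adversarial_loss(d_src, d_tgt):
    # The encoders (parameters θ_{Enc_s}, θ_{Enc_t}) are trained to fool
    # D_l, pushing the two encoding distributions toward each other.
    return (F.binary_cross_entropy(d_src, torch.zeros_like(d_src)) +
            F.binary_cross_entropy(d_tgt, torch.ones_like(d_tgt)))
```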

SLIDE 15

Approach

• Global GAN

• Aims to distinguish between the true sentences and the generated sentences

• Updates the whole parameters of the proposed model, including the parameters of the encoders and decoders

• In the first global GAN, Enc_t and Dec_s act as the generator, which generates a sentence from the target-language input

• The global discriminator, implemented based on a CNN, assesses whether the generated sentence is a true target-language sentence

SLIDE 17

Experiments Setup

• Datasets

• English-French: WMT14
  • a parallel corpus of about 30M pairs of sentences
  • selecting English sentences from 15M random pairs, and French sentences from the complementary set

• English-German: WMT16
  • two monolingual training sets of 1.8M sentences each

• Chinese-English: LDC
  • 1.6M sentence pairs randomly extracted from LDC corpora

SLIDE 18

Experiments Setup

• Tools

• Train the embeddings for each language independently using word2vec (Mikolov et al., 2013)

• Apply the public implementation of the method proposed by Artetxe et al. (2017a) to map these embeddings to a shared latent space (a sketch of the embedding step follows)
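A sketch of the embedding step with gensim's word2vec; the 512-dimension setting follows the hyper-parameter table, the toy corpus is purely illustrative, and the mapping step would then use Artetxe et al.'s public vecmap toolkit.

```python
from gensim.models import Word2Vec

# Train monolingual embeddings independently for each language.
# `vector_size` is the gensim 4.x name (older releases call it `size`);
# the tiny corpus below is illustrative only.
toy_corpus = [["the", "cat", "sat"], ["the", "dog", "ran"]]
model = Word2Vec(sentences=toy_corpus, vector_size=512, min_count=1)
vec = model.wv["cat"]  # one 512-dim word embedding
# The per-language embeddings would then be mapped into a shared space
# with Artetxe et al.'s public implementation (the vecmap toolkit).
```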

• Hyper-parameters


Parameters       Values   Parameters   Values
Word embedding   512      Beam size    4
Dropout          0.1      α            0.6
Head number      8        GPU          four K80 GPUs

SLIDE 19

Experiments Setup

• Model selection

• Stop training when the model achieves no improvement for ten consecutive evaluations on the development set (a sketch of this rule follows)

• Development set: 3000 source and target sentences extracted randomly from the monolingual training corpora
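A small sketch of that stopping rule, with train_step and evaluate as stand-ins for one training interval and one dev-set BLEU evaluation.

```python
def train_with_early_stopping(train_step, evaluate, patience=10):
    """Stop once `patience` consecutive evaluations bring no improvement."""
    best, stale = float("-inf"), 0
    while stale < patience:
        train_step()        # train until the next evaluation point
        score = evaluate()  # e.g. BLEU on the 3000-sentence dev set
        if score > best:
            best, stale = score, 0
        else:
            stale += 1
    return best
```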

• Evaluation metrics

• BLEU (Papineni et al., 2002)
• Chinese-English: the script mteval-v11b.pl
• English-German, English-French: the script multi-bleu.pl

SLIDE 20

Experiments Setup

• Baseline Systems

• Word-by-word translation (WBW)
Translates a sentence word by word, replacing each word with its nearest neighbor in the other language

• Lample et al. (2017)
Uses the same training and testing sets as this paper; encoder: Bi-LSTM; decoder: a forward LSTM

• Supervised training
The same model as ours, but trained using the standard cross-entropy loss on the original parallel sentences

SLIDE 21

Experiments Results

• Number of weight-sharing layers

• We find that the number of weight-sharing layers has a considerable effect on translation performance.

SLIDE 22

Experiments Results

• Number of weight-sharing layers

• The best translation performance is achieved when only one layer is shared in our system.

SLIDE 23

Experiments Results

• Number of weight-sharing layers

• When all four layers are shared, we get poor translation performance on all three translation tasks.

SLIDE 24

Experiments Results

• Number of weight-sharing layers

• We notice that using two completely independent encoders also results in poor translation performance.

SLIDE 25

Experiments Results

• Translation Results

• The proposed approach obtains significant improvements over the word-by-word baseline system, with at least +5.01 BLEU points in English-to-German translation and up to +13.37 BLEU points in English-to-French translation.

SLIDE 26

Experiments Results

• Translation Results

• Compared to the work of Lample et al. (2017), our model achieves up to +1.92 BLEU points improvement on the English-to-French translation task.

• However, there is still large room for improvement compared to the supervised upper bound.

SLIDE 27

Experiments Results

• Ablation Study

• The best performance is obtained with the simultaneous use of all the tested elements.

• The weight-sharing constraint is vital for mapping sentences of different languages to the shared latent space.

SLIDE 28

Experiments Results

• Ablation Study

• The embedding-reinforced encoder, directional self-attention, local GANs, and global GANs are all important components of the proposed system.

SLIDE 29

Conclusion

We propose the weight-sharing constraint in unsupervised NMT.

We also introduce embedding-reinforced encoders, a local GAN, and a global GAN into the proposed system.

The experimental results reveal that our approach achieves significant improvements.

However, there is still large room for improvement compared to supervised NMT.

SLIDE 30

Q&A

Thank you!
