Improved Neural Machine Translation with a Syntax-Aware Encoder and Decoder - PowerPoint PPT Presentation


SLIDE 1

Improved Neural Machine Translation with a Syntax-Aware Encoder and Decoder


Huadong Chen¹, Shujian Huang¹, David Chiang², Jiajun Chen¹  {chenhd,huangsj,chenjj}@nlp.nju.edu.cn  dchiang@nd.edu

  • 1. State Key Laboratory of Novel Software Technology (Nanjing University)

  • 2. University of Notre Dame
SLIDE 2

Outline


  • Motivation
  • Approach
  • Experiments
  • Conclusion
SLIDE 3

Part 1

Motivation

SLIDE 4
Neural Machine Translation

  • Encoder-decoder framework (Cho et al., 2014)

SLIDE 5
Neural Machine Translation

  • Attentional NMT (Bahdanau et al., 2015)
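For reference, the attention mechanism referred to here can be written in the standard form below; the notation (decoder state s_{i-1}, source annotations h_j) is the conventional one from Bahdanau et al. (2015), not anything specific to these slides.

```latex
% Standard attentional NMT (Bahdanau et al., 2015):
% alignment score, normalized attention weights, and context vector
\begin{aligned}
e_{ij}      &= v_a^{\top} \tanh\!\left(W_a\, s_{i-1} + U_a\, h_j\right) \\
\alpha_{ij} &= \frac{\exp(e_{ij})}{\sum_{k=1}^{n} \exp(e_{ik})} \\
c_i         &= \sum_{j=1}^{n} \alpha_{ij}\, h_j
\end{aligned}
```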

SLIDE 6
Neural Machine Translation

  • Their success depends on the representation they use to bridge the source and target language sentences.

SLIDE 7
Neural Machine Translation

  • However, this representation, a sequence of fixed-dimensional vectors, differs considerably from
  – most theories about mental representations of sentences;
  – and traditional natural language processing pipelines, in which semantics is built up compositionally using a syntactic structure.
  • It neglects the potentially useful structural information
  – perhaps as evidence of this, current NMT models still suffer from syntactic errors such as attachment errors (Shi et al., 2016).

SLIDE 8
Neural Machine Translation

  • Encoder: building up representation at higher levels, such as phrases, may need structures

[Figure: phrase-level representation built over the word vectors of the source sentence]

SLIDE 9

Neural Machine Translation

  • Decoder: structures could act as the guidance or control for generation

[Example: the source phrase 驻 马尼拉 大使馆 (zhu manila dashiguan, "embassy in manila") may be rendered as "embassy in manila", or as "in embassy of manila", whose words do not match the source structure]

SLIDE 10

Our Work

  • We propose an encoder-decoder framework that takes syntactic structures into consideration, which includes a bidirectional tree-structure encoder and a tree-coverage decoder.

SLIDE 11

Part 2

Syntax-aware Neural Machine Translation

SLIDE 12

Bottom-up Tree Encoder(1/3)

  • Bottom-up tree encoder (Tai et al., 2015; Eriguchi et al., 2016):
  – building the tree structure representations from the bottom up, forming the representations of constituents from their children

SLIDE 13

Bottom-up Tree Encoder(2/3)

  • We assume model consistency is important.
  • Our sequential model is based on bidirectional GRUs.
  • We design a tree-GRU instead of using a tree-LSTM.
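As a rough illustration of the bottom-up composition described above, the sketch below builds a vector for each node of a binary parse tree from its two children with a simplified gated, GRU-style update. It is a minimal sketch only: the class names, the single-gate formulation, and the random parameters are assumptions for illustration, not the paper's exact Tree-GRU equations.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class Node:
    """Binary parse-tree node: a leaf carries a word vector, an internal node two children."""
    def __init__(self, word_vec=None, left=None, right=None):
        self.word_vec, self.left, self.right = word_vec, left, right
        self.h_up = None  # bottom-up hidden state, filled in by encode_bottom_up

class TreeGRUCell:
    """Simplified gated composition of a parent state from its two children."""
    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        # separate transforms for the left and right child
        self.U_L = rng.normal(scale=0.1, size=(3 * dim, dim))
        self.U_R = rng.normal(scale=0.1, size=(3 * dim, dim))
        self.dim = dim

    def __call__(self, h_l, h_r):
        d = self.dim
        g = self.U_L @ h_l + self.U_R @ h_r
        r = sigmoid(g[:d])          # reset gate
        z = sigmoid(g[d:2 * d])     # update gate
        h_tilde = np.tanh(self.U_L[2 * d:] @ (r * h_l) + self.U_R[2 * d:] @ (r * h_r))
        # interpolate between the children's average and the new candidate state
        return (1.0 - z) * 0.5 * (h_l + h_r) + z * h_tilde

def encode_bottom_up(node, cell):
    """Post-order traversal: each node's state is composed from its children's states."""
    if node.word_vec is not None:   # leaf: use the word vector here
        node.h_up = node.word_vec   # (the slides' model would use the sequential GRU state)
    else:
        encode_bottom_up(node.left, cell)
        encode_bottom_up(node.right, cell)
        node.h_up = cell(node.left.h_up, node.right.h_up)
    return node.h_up

# Usage sketch:
#   leaves = [Node(word_vec=v) for v in word_vectors]
#   root   = Node(left=Node(left=leaves[0], right=leaves[1]), right=leaves[2])
#   encode_bottom_up(root, TreeGRUCell(dim=len(word_vectors[0])))
```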

SLIDE 14

Bottom-up Tree Encoder(3/3)

  • Drawbacks of the bottom-up tree encoder:
  – the learned representation of a node is based on its subtree only; it contains no information from higher up in the tree
  – the representation of leaf nodes is still the sequential one, thus no syntactic information is fed into the words.

SLIDE 15

Bidirectional Tree Encoder(1/2)

  • Bi-directional tree encoder
  – also propagating information from the top down, which includes information from outside the current constituent
  – bi-directional Tree-LSTM for classification (Teng and Zhang, 2016)
  – bi-directional Tree-GRU for sentiment analysis (Kokkinos and Potamianos, 2017)

SLIDE 16

Bidirectional Tree Encoder(2/2)

  • The top-down encoder by itself would have no lexical information as input
  – we feed the hidden states of the bottom-up encoder to the top-down encoder as input
  • The information propagated from the parent node to the left and right children is redundant
  – we use different weights for the left and right nodes
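A sketch of the top-down pass described on this slide, reusing the Node objects (with their bottom-up states h_up) from the earlier bottom-up sketch: the parent's top-down state and the child's bottom-up state feed a gated update, and distinct parameters are used for the left versus the right child. The gating details and the concatenation of the two directions are assumptions for illustration, not the paper's exact equations.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class TopDownCell:
    """GRU-style update producing a child's top-down state from the parent's
    top-down state and the child's own bottom-up state."""
    def __init__(self, dim, seed=1):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.1, size=(3 * dim, dim))  # acts on the parent's top-down state
        self.U = rng.normal(scale=0.1, size=(3 * dim, dim))  # acts on the child's bottom-up state
        self.dim = dim

    def __call__(self, h_parent_down, h_child_up):
        d = self.dim
        g = self.W @ h_parent_down + self.U @ h_child_up
        r, z = sigmoid(g[:d]), sigmoid(g[d:2 * d])
        h_tilde = np.tanh(self.W[2 * d:] @ (r * h_parent_down) + self.U[2 * d:] @ h_child_up)
        return (1.0 - z) * h_parent_down + z * h_tilde

def encode_top_down(node, cell_left, cell_right, h_down):
    """Pre-order traversal; assumes node.h_up was filled in by the bottom-up encoder.
    Separate cells are used for left and right children, as the slide suggests."""
    node.h_down = h_down
    node.h = np.concatenate([node.h_up, node.h_down])  # final annotation of this node
    if node.left is not None:
        encode_top_down(node.left, cell_left, cell_right, cell_left(h_down, node.left.h_up))
        encode_top_down(node.right, cell_left, cell_right, cell_right(h_down, node.right.h_up))

# Usage sketch (the zero vector as the root's top-down input is an assumption):
#   encode_top_down(root, TopDownCell(dim), TopDownCell(dim, seed=2), h_down=np.zeros(dim))
```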

SLIDE 17

Tree Attention

  • Treating the representations of tree nodes the same as word representations and performing attention (Eriguchi et al., 2016)
  – Pros: enabling attention at a higher level, i.e. the words in the same sub-tree can receive attention as a whole unit
  – Cons: still missing structural control, i.e. the attention for words and tree nodes may interfere with each other
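Concretely, in this style of tree attention (following Eriguchi et al., 2016) the word annotations and the tree-node annotations are simply pooled into one sequence before computing attention; roughly, for a source sentence of n words whose binarized tree has n-1 internal nodes:

```latex
% Attention over both words (j = 1..n) and tree nodes (j = n+1..2n-1)
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{2n-1} \exp(e_{ik})},
\qquad
c_i = \sum_{j=1}^{2n-1} \alpha_{ij}\, h_j
```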

SLIDE 18

Tree-Coverage Model(1/5)

  • Two observations about translations:
  – a syntactic phrase in the source sentence is often incorrectly translated into discontinuous words in the output
  – the attention model prefers to attend to the non-leaf nodes, which may aggravate the over-translation problem

SLIDE 19

Tree-Coverage Model(2/5)

[Figure: attention with the tree encoder]

SLIDE 20

Tree-Coverage Model(3/5)


  • Coverage model (Tu et al., 2016)

– it could be interpreted as a control mechanism for the attention model

  • Drawbacks
  – the coverage model sees the source-sentence annotations as a bag of vectors
  – it knows nothing about word order, still less about syntactic structure.
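For context, the word-level coverage model of Tu et al. (2016) keeps one coverage vector per source annotation and updates it at every decoding step, roughly as below, so that the alignment score is conditioned on how much each position has already been attended to. This is only a schematic form; the exact parameterization varies.

```latex
% Coverage-augmented attention (Tu et al., 2016), schematically:
\begin{aligned}
C_{i,j} &= \mathrm{GRU}\!\left(C_{i-1,j};\; \alpha_{i,j},\, h_j,\, s_{i-1}\right) \\
e_{ij}  &= v_a^{\top} \tanh\!\left(W_a\, s_{i-1} + U_a\, h_j + V_a\, C_{i-1,j}\right)
\end{aligned}
```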

SLIDE 21

Tree-Coverage Model(4/5)

  • We propose to use prior knowledge to control the attention mechanism
  – in our case, the prior knowledge is the source syntactic information
  – in particular, we build our model on top of the word coverage model proposed by Tu et al. (2016)

SLIDE 22

Tree-Coverage Model(5/5)

[Figure: the tree-coverage update, in which the coverage vector D_{t,j} of node j is computed by a GRU from the previous coverage D_{t-1,j} and the node's annotation, and is combined with the coverage vectors of its left and right children, D_{t-1,L(j)} and D_{t-1,R(j)}, through gating weights β_{t,j}, β_{t,L(j)}, β_{t,R(j)}]
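The figure is hard to reproduce from the slide text, but the general idea can be sketched as follows: every tree node (leaf or internal) keeps its own coverage vector, the per-node update is GRU-style as in Tu et al. (2016), and a gated share of a node's coverage is passed down to its children so that attending to a non-leaf node also "covers" the words under it. The direction of propagation, the gating function, and all names here are assumptions made for illustration, not the exact model from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def update_tree_coverage(nodes, coverage, attn, W, U, w_beta):
    """One decoding step of a GRU-style tree-coverage update (illustrative only).

    nodes    : tree nodes in top-down order; each has .idx, .parent (or None), .h (annotation)
    coverage : dict idx -> coverage vector from the previous decoding step
    attn     : dict idx -> attention weight alpha_{t,idx} at this step
    W, U     : GRU-style parameter matrices of shape (3*dim, dim), assumed
    w_beta   : vector of length 2*dim gating what a child inherits from its parent, assumed
    """
    new_cov = {}
    for node in nodes:  # parents are visited before their children
        d = node.h.shape[0]
        # gated contribution inherited from the parent's coverage (zero for the root)
        if node.parent is None:
            inherited = np.zeros(d)
        else:
            beta = sigmoid(w_beta @ np.concatenate([coverage[node.parent.idx], node.h]))
            inherited = beta * new_cov[node.parent.idx]
        # GRU-style update from previous coverage, attention weight, and annotation
        x = attn[node.idx] * node.h + inherited
        g = W @ coverage[node.idx] + U @ x
        r, z = sigmoid(g[:d]), sigmoid(g[d:2 * d])
        c_tilde = np.tanh(W[2 * d:] @ (r * coverage[node.idx]) + U[2 * d:] @ x)
        new_cov[node.idx] = (1.0 - z) * coverage[node.idx] + z * c_tilde
    return new_cov
```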

SLIDE 23

Part 3

Experiments

SLIDE 24

Data and Settings

  • 1.6 million sentence pairs from LDC for training
  • Using MT02 for held-out dev, MT03, 04, 05, 06 for test
  • Implementation based on the dl4mt package

SLIDE 25

Tree-GRU vs. Tree-LSTM

[Bar chart: BLEU (y-axis roughly 29.5 to 33) for the Sequential, Tree-LSTM, Tree-GRU, Seq-LSTM, and SeqTree-LSTM settings, with gains of +0.75, +1.62, and +1.02 highlighted]

SLIDE 26

Tree-Coverage Model(1/2)


Our tree-coverage model consistently improves performance further (rows 9–11)

SLIDE 27

Tree-Coverage Model(2/2)

[Figure: comparison of "Attention with Tree Encoder" and "+ Tree-Coverage Model"]

SLIDE 28

Analysis By Sentence Length

  • 1. The proposed bidirectional tree encoder outperforms the sequential NMT system and the Tree-GRU encoder across all lengths.
  • 2. The improvements become larger for sentences longer than 20 words, and the biggest improvement is for sentences longer than 50 words (5% ↑ and 10% ↑).

SLIDE 29

Conclusion

  • We have investigated the potential of using explicit source-side syntactic trees in NMT.
  • The improvement could come from:
  – the enrichment of the representation during encoding;
  – the structural control of attention during decoding.
  • In this paper, we only use the binarized structure of the source-side tree. For future work, it will be interesting to make use of target-side structure information or syntactic labels as well.

SLIDE 30

Thanks!
