Improved Neural Machine Translation with a Syntax-Aware Encoder and Decoder - PowerPoint PPT Presentation


SLIDE 1

Improved Neural Machine Translation with a Syntax-Aware Encoder and Decoder


Huadong Chen¹, Shujian Huang¹, David Chiang², Jiajun Chen¹  {chenhd,huangsj,chenjj}@nlp.nju.edu.cn  dchiang@nd.edu

  • 1. State Key Laboratory of Novel Software Technology (Nanjing University)

  • 2. University of Notre Dame
SLIDE 2

Outline


  • Motivation
  • Approach
  • Experiments
  • Conclusion
SLIDE 3

Part 1

Motivation

SLIDE 4
Neural Machine Translation

  • Encoder-decoder framework (Cho et al., 2014)

SLIDE 5
Neural Machine Translation

  • Attentional NMT (Bahdanau et al., 2015)
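For reference, the attention mechanism referred to here can be written in the standard form below; the notation (decoder state s_{i-1}, source annotations h_j) is the conventional one from Bahdanau et al. (2015), not anything specific to these slides.

```latex
% Standard attentional NMT (Bahdanau et al., 2015):
% alignment score, normalized attention weights, and context vector
\begin{aligned}
e_{ij}      &= v_a^{\top} \tanh\!\left(W_a\, s_{i-1} + U_a\, h_j\right) \\
\alpha_{ij} &= \frac{\exp(e_{ij})}{\sum_{k=1}^{n} \exp(e_{ik})} \\
c_i         &= \sum_{j=1}^{n} \alpha_{ij}\, h_j
\end{aligned}
```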

SLIDE 6
Neural Machine Translation

  • Their success depends on the representation they use to bridge the source and target language sentences.

SLIDE 7
Neural Machine Translation

  • However, this representation, a sequence of fixed-dimensional vectors, differs considerably from
  – most theories about mental representations of sentences;
  – and traditional natural language processing pipelines, in which semantics is built up compositionally using a syntactic structure.
  • It neglects the potentially useful structural information
  – perhaps as evidence of this, current NMT models still suffer from syntactic errors such as attachment errors (Shi et al., 2016).

SLIDE 8
Neural Machine Translation

  • Encoder: building up representation at higher levels, such as phrases, may need structures

[Figure: phrase-level representation built over the word vectors of the source sentence]

SLIDE 9

Neural Machine Translation

  • Decoder: structures could act as the guidance or control for generation

[Example: the source phrase 驻 马尼拉 大使馆 (zhu manila dashiguan, "embassy in manila") may be rendered as "embassy in manila", or as "in embassy of manila", whose words do not match the source structure]

SLIDE 10

Our Work

  • We propose an encoder-decoder framework that takes syntactic structures into consideration, which includes a bidirectional tree-structure encoder and a tree-coverage decoder.

SLIDE 11

Part 2

Syntax-aware Neural Machine Translation

SLIDE 12

Bottom-up Tree Encoder(1/3)

  • Bottom-up tree encoder (Tai et al., 2015; Eriguchi et al., 2016):
  – building the tree structure representations from the bottom up, forming the representations of constituents from their children

SLIDE 13

Bottom-up Tree Encoder(2/3)

  • We assume model consistency is important.
  • Our sequential model is based on bidirectional GRUs.
  • We design a tree-GRU instead of using a tree-LSTM.
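As a rough illustration of the bottom-up composition described above, the sketch below builds a vector for each node of a binary parse tree from its two children with a simplified gated, GRU-style update. It is a minimal sketch only: the class names, the single-gate formulation, and the random parameters are assumptions for illustration, not the paper's exact Tree-GRU equations.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class Node:
    """Binary parse-tree node: a leaf carries a word vector, an internal node two children."""
    def __init__(self, word_vec=None, left=None, right=None):
        self.word_vec, self.left, self.right = word_vec, left, right
        self.h_up = None  # bottom-up hidden state, filled in by encode_bottom_up

class TreeGRUCell:
    """Simplified gated composition of a parent state from its two children."""
    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        # separate transforms for the left and right child
        self.U_L = rng.normal(scale=0.1, size=(3 * dim, dim))
        self.U_R = rng.normal(scale=0.1, size=(3 * dim, dim))
        self.dim = dim

    def __call__(self, h_l, h_r):
        d = self.dim
        g = self.U_L @ h_l + self.U_R @ h_r
        r = sigmoid(g[:d])          # reset gate
        z = sigmoid(g[d:2 * d])     # update gate
        h_tilde = np.tanh(self.U_L[2 * d:] @ (r * h_l) + self.U_R[2 * d:] @ (r * h_r))
        # interpolate between the children's average and the new candidate state
        return (1.0 - z) * 0.5 * (h_l + h_r) + z * h_tilde

def encode_bottom_up(node, cell):
    """Post-order traversal: each node's state is composed from its children's states."""
    if node.word_vec is not None:   # leaf: use the word vector here
        node.h_up = node.word_vec   # (the slides' model would use the sequential GRU state)
    else:
        encode_bottom_up(node.left, cell)
        encode_bottom_up(node.right, cell)
        node.h_up = cell(node.left.h_up, node.right.h_up)
    return node.h_up

# Usage sketch:
#   leaves = [Node(word_vec=v) for v in word_vectors]
#   root   = Node(left=Node(left=leaves[0], right=leaves[1]), right=leaves[2])
#   encode_bottom_up(root, TreeGRUCell(dim=len(word_vectors[0])))
```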

SLIDE 14

Bottom-up Tree Encoder(3/3)

  • Drawbacks of the bottom-up tree encoder:
  – the learned representation of a node is based on its subtree only; it contains no information from higher up in the tree
  – the representation of leaf nodes is still the sequential one, thus no syntactic information is fed into the words.

SLIDE 15

Bidirectional Tree Encoder(1/2)

  • Bi-directional tree encoder
  – also propagating information from the top down, which includes information from outside the current constituent
  – bi-directional Tree-LSTM for classification (Teng and Zhang, 2016)
  – bi-directional Tree-GRU for sentiment analysis (Kokkinos and Potamianos, 2017)

SLIDE 16

Bidirectional Tree Encoder(2/2)

  • The top-down encoder by itself would have no lexical information as input
  – we feed the hidden states of the bottom-up encoder to the top-down encoder as input
  • The information propagated from the parent node to the left and right children is redundant
  – we use different weights for the left and right nodes
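A sketch of the top-down pass described on this slide, reusing the Node objects (with their bottom-up states h_up) from the earlier bottom-up sketch: the parent's top-down state and the child's bottom-up state feed a gated update, and distinct parameters are used for the left versus the right child. The gating details and the concatenation of the two directions are assumptions for illustration, not the paper's exact equations.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class TopDownCell:
    """GRU-style update producing a child's top-down state from the parent's
    top-down state and the child's own bottom-up state."""
    def __init__(self, dim, seed=1):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.1, size=(3 * dim, dim))  # acts on the parent's top-down state
        self.U = rng.normal(scale=0.1, size=(3 * dim, dim))  # acts on the child's bottom-up state
        self.dim = dim

    def __call__(self, h_parent_down, h_child_up):
        d = self.dim
        g = self.W @ h_parent_down + self.U @ h_child_up
        r, z = sigmoid(g[:d]), sigmoid(g[d:2 * d])
        h_tilde = np.tanh(self.W[2 * d:] @ (r * h_parent_down) + self.U[2 * d:] @ h_child_up)
        return (1.0 - z) * h_parent_down + z * h_tilde

def encode_top_down(node, cell_left, cell_right, h_down):
    """Pre-order traversal; assumes node.h_up was filled in by the bottom-up encoder.
    Separate cells are used for left and right children, as the slide suggests."""
    node.h_down = h_down
    node.h = np.concatenate([node.h_up, node.h_down])  # final annotation of this node
    if node.left is not None:
        encode_top_down(node.left, cell_left, cell_right, cell_left(h_down, node.left.h_up))
        encode_top_down(node.right, cell_left, cell_right, cell_right(h_down, node.right.h_up))

# Usage sketch (the zero vector as the root's top-down input is an assumption):
#   encode_top_down(root, TopDownCell(dim), TopDownCell(dim, seed=2), h_down=np.zeros(dim))
```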

SLIDE 17

Tree Attention

  • Treating the representations of tree nodes the same as word representations and performing attention (Eriguchi et al., 2016)
  – Pros: enabling attention at a higher level, i.e. the words in the same sub-tree can receive attention as a whole unit
  – Cons: still missing structural control, i.e. the attention for words and tree nodes may interfere with each other
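Concretely, in this style of tree attention (following Eriguchi et al., 2016) the word annotations and the tree-node annotations are simply pooled into one sequence before computing attention; roughly, for a source sentence of n words whose binarized tree has n-1 internal nodes:

```latex
% Attention over both words (j = 1..n) and tree nodes (j = n+1..2n-1)
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{2n-1} \exp(e_{ik})},
\qquad
c_i = \sum_{j=1}^{2n-1} \alpha_{ij}\, h_j
```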

SLIDE 18

Tree-Coverage Model(1/5)

  • Two observations about translations:
  – a syntactic phrase in the source sentence is often incorrectly translated into discontinuous words in the output
  – the attention model prefers to attend to the non-leaf nodes, which may aggravate the over-translation problem

SLIDE 19

Tree-Coverage Model(2/5)

[Figure: attention with the tree encoder]

SLIDE 20

Tree-Coverage Model(3/5)


  • Coverage model (Tu et al., 2016)

– it could be interpreted as a control mechanism for the attention model

  • Drawbacks
  – the coverage model sees the source-sentence annotations as a bag of vectors
  – it knows nothing about word order, still less about syntactic structure.
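For context, the word-level coverage model of Tu et al. (2016) keeps one coverage vector per source annotation and updates it at every decoding step, roughly as below, so that the alignment score is conditioned on how much each position has already been attended to. This is only a schematic form; the exact parameterization varies.

```latex
% Coverage-augmented attention (Tu et al., 2016), schematically:
\begin{aligned}
C_{i,j} &= \mathrm{GRU}\!\left(C_{i-1,j};\; \alpha_{i,j},\, h_j,\, s_{i-1}\right) \\
e_{ij}  &= v_a^{\top} \tanh\!\left(W_a\, s_{i-1} + U_a\, h_j + V_a\, C_{i-1,j}\right)
\end{aligned}
```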

SLIDE 21

Tree-Coverage Model(4/5)

  • We propose to use prior knowledge to control the attention mechanism
  – in our case, the prior knowledge is the source syntactic information
  – in particular, we build our model on top of the word coverage model proposed by Tu et al. (2016)

SLIDE 22

Tree-Coverage Model(5/5)

[Figure: the tree-coverage update, in which the coverage vector D_{t,j} of node j is computed by a GRU from the previous coverage D_{t-1,j} and the node's annotation, and is combined with the coverage vectors of its left and right children, D_{t-1,L(j)} and D_{t-1,R(j)}, through gating weights β_{t,j}, β_{t,L(j)}, β_{t,R(j)}]
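The figure is hard to reproduce from the slide text, but the general idea can be sketched as follows: every tree node (leaf or internal) keeps its own coverage vector, the per-node update is GRU-style as in Tu et al. (2016), and a gated share of a node's coverage is passed down to its children so that attending to a non-leaf node also "covers" the words under it. The direction of propagation, the gating function, and all names here are assumptions made for illustration, not the exact model from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def update_tree_coverage(nodes, coverage, attn, W, U, w_beta):
    """One decoding step of a GRU-style tree-coverage update (illustrative only).

    nodes    : tree nodes in top-down order; each has .idx, .parent (or None), .h (annotation)
    coverage : dict idx -> coverage vector from the previous decoding step
    attn     : dict idx -> attention weight alpha_{t,idx} at this step
    W, U     : GRU-style parameter matrices of shape (3*dim, dim), assumed
    w_beta   : vector of length 2*dim gating what a child inherits from its parent, assumed
    """
    new_cov = {}
    for node in nodes:  # parents are visited before their children
        d = node.h.shape[0]
        # gated contribution inherited from the parent's coverage (zero for the root)
        if node.parent is None:
            inherited = np.zeros(d)
        else:
            beta = sigmoid(w_beta @ np.concatenate([coverage[node.parent.idx], node.h]))
            inherited = beta * new_cov[node.parent.idx]
        # GRU-style update from previous coverage, attention weight, and annotation
        x = attn[node.idx] * node.h + inherited
        g = W @ coverage[node.idx] + U @ x
        r, z = sigmoid(g[:d]), sigmoid(g[d:2 * d])
        c_tilde = np.tanh(W[2 * d:] @ (r * coverage[node.idx]) + U[2 * d:] @ x)
        new_cov[node.idx] = (1.0 - z) * coverage[node.idx] + z * c_tilde
    return new_cov
```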

SLIDE 23

Part 3

Experiments

SLIDE 24

Data and Settings

  • 1.6 million sentence pairs from LDC for training
  • Using MT02 for held-out dev, MT03, 04, 05, 06 for test
  • Implementation based on the dl4mt package

SLIDE 25

Tree-GRU vs. Tree-LSTM

[Bar chart: BLEU (y-axis roughly 29.5 to 33) for the Sequential, Tree-LSTM, Tree-GRU, Seq-LSTM, and SeqTree-LSTM settings, with gains of +0.75, +1.62, and +1.02 highlighted]

SLIDE 26

Tree-Coverage Model(1/2)


Our tree-coverage model consistently improves performance further (rows 9–11)

SLIDE 27

Tree-Coverage Model(2/2)

[Figure: comparison of "Attention with Tree Encoder" and "+ Tree-Coverage Model"]

SLIDE 28

Analysis By Sentence Length

  • 1. The proposed bidirectional tree encoder outperforms the sequential NMT system and the Tree-GRU encoder across all lengths.
  • 2. The improvements become larger for sentences longer than 20 words, and the biggest improvement is for sentences longer than 50 words (5% ↑ and 10% ↑).

SLIDE 29

Conclusion

  • We have investigated the potential of using explicit source-side syntactic trees in NMT.
  • The improvement could come from:
  – the enrichment of the representation during encoding;
  – the structural control of attention during decoding.
  • In this paper, we only use the binarized structure of the source-side tree. For future work, it will be interesting to make use of target-side structure information or syntactic labels as well.

SLIDE 30

Thanks!
