Modelling Multiple Sequences: Explorations, Consequences and Challenges



SLIDE 1 (1/36)

Modelling Multiple Sequences: Explorations, Consequences and Challenges

Orhan Firat, Dagstuhl Seminar - C2NLU, January 2017

SLIDE 2 (2/36)

Before we start!

The Fog of Progress¹ and Artificial General Intelligence

¹Hinton video-lectures, https://www.youtube.com/watch?v=ZuvRXGX8cY8

SLIDE 3 (3/36)

What is going on?

Timeline of milestones, 2014 → 2017 (listed most recent first):

⋆ Google Zero-Shot NMT
⋆ Fully Char-NMT
⋆ Google-NMT
⋆ Zero-Resource NMT
⋆ WMT'16 - NMT
⋆ Character-Dec NMT
⋆ Multi-Source NMT
⋆ Multilingual-NMT
⋆ Subword-NMT
⋆ WMT'15 - NMT
⋆ OpenMT'15 - NMT
⋆ Image Captioning - NMT
⋆ Large-Vocabulary NMT
⋆ Neural-MT

SLIDE 4 (4/36)

Conclusion Slide of Machine Translation Marathon '16

What Lies Ahead?

Perhaps, we’ve only scratched the surface!

▸ Language barrier, surpassing human-level quality.

Revisiting the new territory:

Character-level Larger-Context Multilingual Neural Machine Translation

using:

▸ Multiple modalities
▸ Better error signals
▸ and better GPUs

SLIDE 5 (5/36)

Multi-Sequence Modelling

What is a sequence?

▸ A sequence $(x_1, \ldots, x_T)$ can be:

▸ sentence

(“visa”, “process”, “is”, “taking”, “so”, “long”, “.”)

▸ image
▸ video
▸ speech
▸ ...

SLIDE 6 (6/36)

Multi-Sequence Modelling

What is sequence modelling? “It is all about telling how likely a sequence is.”

— Kyunghyun Cho

▸ Modelling in the sense of predictive modelling.
▸ What is the probability of $(x_1, \ldots, x_T)$?

▸ $p(x_1, x_2, \ldots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_{<t})$ = ?

▸ Example: RNN language models
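To make the factorisation above concrete, here is a minimal sketch (plain PyTorch, with made-up vocabulary and layer sizes) of a GRU language model that scores a sequence by summing the per-step log-probabilities $\log p(x_t \mid x_{<t})$:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RNNLM(nn.Module):
    """Minimal GRU language model: p(x) = prod_t p(x_t | x_<t)."""
    def __init__(self, vocab_size=1000, emb_dim=64, hid_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def sequence_log_prob(self, x):
        # x: (batch, T) token ids; each step is conditioned on the tokens before it
        inputs, targets = x[:, :-1], x[:, 1:]
        h, _ = self.rnn(self.embed(inputs))            # (batch, T-1, hid_dim)
        log_probs = F.log_softmax(self.out(h), dim=-1)
        # pick log p(x_t | x_<t) for the observed next token at every position
        token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
        return token_lp.sum(dim=-1)                    # one log-probability per sequence

# toy usage with random token ids, purely illustrative
model = RNNLM()
x = torch.randint(0, 1000, (2, 7))                     # two sequences of length 7
print(model.sequence_log_prob(x))
```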

SLIDE 7 (7/36)

Multi-Sequence Modelling

What do we mean by multiple sequences? Well, we really mean more than one (multiple) sequence

Conditional-LM

▸ $p(x_1, \ldots, x_T \mid Y) = \prod_{t=1}^{T} p(x_t \mid x_{<t}, Y)$
▸ seq2seq models, NMT
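A similarly hedged sketch of the conditional case: the decoder from the previous sketch, now initialised from an encoding of the other sequence Y. This is a bare-bones seq2seq conditional LM with invented names and toy dimensions, not the exact architecture of any particular NMT system:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalLM(nn.Module):
    """p(x | Y) = prod_t p(x_t | x_<t, Y): score x conditioned on an encoding of Y."""
    def __init__(self, src_vocab=1000, tgt_vocab=1000, emb=64, hid=128):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, emb)
        self.encoder = nn.GRU(emb, hid, batch_first=True)
        self.tgt_embed = nn.Embedding(tgt_vocab, emb)
        self.decoder = nn.GRU(emb, hid, batch_first=True)
        self.out = nn.Linear(hid, tgt_vocab)

    def log_prob(self, y_src, x_tgt):
        # encode Y and use its final hidden state to initialise the decoder
        _, h_y = self.encoder(self.src_embed(y_src))   # (1, batch, hid)
        inputs, targets = x_tgt[:, :-1], x_tgt[:, 1:]
        h, _ = self.decoder(self.tgt_embed(inputs), h_y)
        lp = F.log_softmax(self.out(h), dim=-1)
        return lp.gather(-1, targets.unsqueeze(-1)).squeeze(-1).sum(-1)

# toy usage
model = ConditionalLM()
y = torch.randint(0, 1000, (2, 9))    # the conditioning sequence Y
x = torch.randint(0, 1000, (2, 7))    # the sequence being scored
print(model.log_prob(y, x))
```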

SLIDE 8 (7/36)

Multi-Sequence Modelling

What do we mean by multiple sequences? Well, we really mean more than one (multiple) sequence

Conditional-LM

▸ $p(x_1, \ldots, x_T \mid Y) = \prod_{t=1}^{T} p(x_t \mid x_{<t}, Y)$
▸ seq2seq models, NMT

Multi-Way [diagram: Enc1, Enc2, Enc3 → shared Att → Dec1, Dec2, Dec3]

▸ Multi-lingual models
▸ Multi-modal models

SLIDE 9 (8/36)

Warren Weaver, “Translation”, 1949

Tall towers analogy:

▸ Do not shout from tower to tower,
▸ Go down to the common basement of all towers: interlingua

SLIDE 10 (9/36)

Warren Weaver, “Translation”, 1949

Tall towers analogy:

▸ Red Tower: source language
▸ Blue Tower: target language
▸ Green Car: alignment function

SLIDE 11 (10/36)

Warren Weaver, “Translation”, 1949

Tall towers analogy:

▸ Do NOT model the individual behaviour of a car,
▸ Model how the highway works!

SLIDE 12 (11/36)

Sequence Modelling with Finer Tokens

Issues with tokenization and segmentation

▸ Ineffective way of handling morphological variants:

’run’, ’runs’, ’running’ and ’runner’

▸ How are we doing with compound words?

Issues with treating each and every token separately

▸ Fill the vocabulary with similar words
▸ Vocabulary size grows linearly w.r.t. the corpus size
▸ Rare words, numbers and misspelled words:
  "9/11" carries huge contextual information
▸ We lose the learning signal of words mapped to <UNK>

Slide credit: Junyoung Chung
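A tiny, self-contained illustration of the vocabulary point above, on a made-up toy corpus: every new morphological variant, number or misspelling adds a word type, while the character inventory barely moves:

```python
# Toy comparison of word-level vs character-level vocabulary growth.
corpus = [
    "the visa process is taking so long .",
    "she runs , he ran , they were running ; the runner won on 9/11 .",
    "mis-spelled wrds and numbers like 20017 all become new word types .",
]

word_vocab, char_vocab = set(), set()
for sentence in corpus:
    word_vocab.update(sentence.split())
    char_vocab.update(sentence)   # characters, including the space
    print(f"so far: {len(word_vocab):3d} word types vs {len(char_vocab):3d} character types")
```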

SLIDE 13 (12/36)

Granularity of Input and Output Spaces (finer tokens)

SLIDE 14 (13/36)

Sequence Modelling with Finer Tokens

We are still concerned:

▸ Less data sparsity (it will still remain though; Bengio et al., 2003)
▸ Consequences of increased sequence length!

▸ Capturing long-term dependencies
▸ Harder to train (but wait, we have GRUs, LSTMs and attention)
▸ Speed loss: 2-3 times slower

but ...

▸ No need to worry about segmentation
▸ Open vocabularies, saving us giant matrices or tricks
▸ Naturally embeds multiple languages (Lee et al. '16)
▸ And maybe multiple modalities, with even finer tokens

SLIDE 15 (14/36)

Explorations: Shared Medium

Interlingua as Shared Functional Form

SLIDE 16 (15/36)

Consequences: Shared Medium

Interlingua as Shared Functional Form

▸ Luong et al. 2015 - Examines multi-task sequence-to-sequence learning
▸ One-to-many: MT and Syntactic Parsing
▸ Many-to-one: Translation and Image Captioning
▸ Many-to-many: Unsupervised objectives and MT

SLIDE 17 (16/36)

Consequences: Shared Medium

Interlingua as Shared Functional Form

▸ Firat et al. 2016a - Shared attention mechanism
▸ Notion of a shared function representing the interlingua
▸ Trained using parallel data only
▸ Positive language transfer for low-resource pairs (Firat et al. 2016b)
▸ A single model that can translate 10 pairs
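A heavily simplified sketch of the multi-way setup (in the spirit of Firat et al. 2016a, with invented module names and toy dimensions): per-language encoders and decoders, and a single attention function shared by every language pair, playing the role of the "common basement":

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

HID = 64  # toy dimension

class SharedAttention(nn.Module):
    """One attention function reused by every (source, target) language pair."""
    def forward(self, query, enc_states):
        # query: (batch, hid), enc_states: (batch, T, hid)
        scores = torch.bmm(enc_states, query.unsqueeze(-1)).squeeze(-1)   # (batch, T)
        weights = F.softmax(scores, dim=-1)
        return torch.bmm(weights.unsqueeze(1), enc_states).squeeze(1)     # (batch, hid)

class MultiWayNMT(nn.Module):
    def __init__(self, langs=("en", "de", "fi"), vocab=1000):
        super().__init__()
        self.embed = nn.ModuleDict({l: nn.Embedding(vocab, HID) for l in langs})
        self.encoders = nn.ModuleDict({l: nn.GRU(HID, HID, batch_first=True) for l in langs})
        self.decoders = nn.ModuleDict({l: nn.GRUCell(2 * HID, HID) for l in langs})
        self.outputs = nn.ModuleDict({l: nn.Linear(HID, vocab) for l in langs})
        self.attention = SharedAttention()   # the single shared component

    def step(self, src_lang, tgt_lang, src_ids, prev_tgt_ids, dec_state):
        enc_states, _ = self.encoders[src_lang](self.embed[src_lang](src_ids))
        context = self.attention(dec_state, enc_states)
        dec_in = torch.cat([self.embed[tgt_lang](prev_tgt_ids), context], dim=-1)
        dec_state = self.decoders[tgt_lang](dec_in, dec_state)
        return F.log_softmax(self.outputs[tgt_lang](dec_state), dim=-1), dec_state

# toy usage: one decoding step of a de -> en pair
model = MultiWayNMT()
src = torch.randint(0, 1000, (2, 11))
prev = torch.randint(0, 1000, (2,))
state = torch.zeros(2, HID)
logp, state = model.step("de", "en", src, prev, state)
print(logp.shape)   # (2, 1000)
```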

SLIDE 18 (17/36)

Consequences: Shared Medium

Interlingua as Shared Functional Form

Training with multiple language pairs has encouraged the model to find a common context-vector space (we can exploit flattened manifolds).

▸ Enables multi-source translation:
  ▸ Multi-source training (Zoph and Knight, 2016)
  ▸ Multi-source decoding (Firat et al. 2016c)
▸ Enables zero-resource translation (Firat et al. 2016c)
▸ Easily extendable to larger-context NMT and system combination

SLIDE 19 (18/36)

Consequences: Shared Medium

Interlingua as Shared Functional Form

▸ Johnson et al. 2016 - Google Multilingual NMT
▸ Thanh-Le Ha et al. 2016 - Karlsruhe, universal encoder and decoder
▸ Mixed (multilingual) sub-word vocabularies (not chars)
▸ Enables zero-shot translation
▸ Source-side code-switching (translate from a mixed source)
▸ Target-side language selection (generate a mixed translation)
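The multilingual trick itself is simple enough to sketch directly: one shared model and vocabulary, with an artificial token prepended to the source sentence to select the target language. The helper below is hypothetical and the exact token spelling is illustrative:

```python
def add_target_token(source_sentence: str, target_lang: str) -> str:
    """Prepend an artificial token telling the shared model which language to produce."""
    return f"<2{target_lang}> {source_sentence}"

# the same English sentence routed to two different target languages
print(add_target_token("How are you?", "es"))
print(add_target_token("How are you?", "ja"))

# source-side code-switching: the source itself can mix languages,
# the target token alone decides what comes out
print(add_target_token("Hello , wie geht es dir ?", "fr"))
```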

SLIDE 20 (19/36)

Consequences: Shared Medium

Interlingua as Shared Functional Form

▸ Lee et al. 2016 - Fully Character-Level Multilingual NMT
▸ A character-based decoder was already proposed (Chung et al. 2016)
▸ What makes it challenging?
  ▸ Training time (naive approach = 3 months, Luong et al. 2016)
  ▸ Mapping a character sequence to meaning
  ▸ Long-range dependencies in text
▸ Map character sequences to meaning without sacrificing speed!

SLIDE 21 (20/36)

Fully Character-Level Multilingual NMT

Jason Lee, Kyunghyun Cho and Thomas Hofmann, 2016

Model details:

▸ RNNSearch model
▸ Source-target character level
▸ CNN+RNN encoder
▸ Two-layer simple GRU decoder
▸ {Fi, De, Cs, Ru} → En

Training:

▸ Mix mini-batches
▸ Use bi-text only
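A hedged sketch of how the hybrid character encoder keeps fully character-level translation affordable: convolutions over character embeddings followed by strided max-pooling shorten the sequence before the expensive recurrent layer. This follows the spirit of the Lee et al. encoder, but with made-up sizes and without the multi-width filters and highway layers of the paper:

```python
import torch
import torch.nn as nn

class CharEncoder(nn.Module):
    """Character embeddings -> conv -> strided max-pool -> bidirectional GRU."""
    def __init__(self, n_chars=300, emb=32, channels=128, hid=128, pool_stride=5):
        super().__init__()
        self.embed = nn.Embedding(n_chars, emb)
        self.conv = nn.Conv1d(emb, channels, kernel_size=3, padding=1)
        self.pool = nn.MaxPool1d(kernel_size=pool_stride, stride=pool_stride)
        self.rnn = nn.GRU(channels, hid, batch_first=True, bidirectional=True)

    def forward(self, char_ids):
        x = self.embed(char_ids).transpose(1, 2)   # (batch, emb, T_chars)
        x = torch.relu(self.conv(x))
        x = self.pool(x)                           # sequence length shrinks by pool_stride
        states, _ = self.rnn(x.transpose(1, 2))    # (batch, T_chars / stride, 2 * hid)
        return states

enc = CharEncoder()
chars = torch.randint(0, 300, (2, 250))            # 250 characters per sentence
print(enc(chars).shape)                            # roughly (2, 50, 256)
```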

SLIDE 22 (21/36)

Fully Character-Level Multilingual NMT

Jason Lee, Kyunghyun Cho and Thomas Hofmann, 2016

Hybrid Character Encoder

SLIDE 23 (22/36)

Fully Character-Level Multilingual NMT

Jason Lee, Kyunghyun Cho and Thomas Hofmann, 2016

Experimental results for char2char:

  • 1. Bilingual char2char >= Bilingual bpe2char
  • 2. Multilingual char2char > Multilingual bpe2char

  2.1 More flexible in assigning model capacity to different languages
  2.2 Works better than most bilingual models (as well as being more parameter-efficient)

From Rico Sennrich (comparison of bpe2bpe, bpe2char and char2char):

SLIDE 24 (23/36)

Fully Character-Level Multilingual NMT

Jason Lee, Kyunghyun Cho and Thomas Hofmann, 2016

Human evaluation:

▸ Multilingual char2char >= Bilingual char2char
▸ Bilingual char2char > Bilingual bpe2char

SLIDE 25 (24/36)

Fully Character-Level Multilingual NMT

Jason Lee, Kyunghyun Cho and Thomas Hofmann, 2016

Additional qualitative analysis:

  • 1. Spelling mistakes
  • 2. Rare/long words
  • 3. Morphology
  • 4. Non-sense words
  • 5. Multi-lingual sentences (code-switching)

Also from Rico Sennrich, 2016 (LingEval97: 97,000 contrastive translation pairs):

  • 1. Noun-phrase agreement
  • 2. Subject-verb agreement
  • 3. Separable verb particle
  • 4. Polarity
  • 5. Transliteration

SLIDE 26 (25/36)

SLIDE 27 (26/36)

How far can we extend the existing approaches?

Bigger models, complicated architectures!

RNNs can express/approximate a set of Turing machines, BUT∗ ...

expressivity ≠ learnability

∗Edward Grefenstette: Deep Learning Summer School 2016

SLIDE 28 (27/36)

How far can we extend the existing approaches?

Fast-Forward Connections for NMT (Zhou et al., 2016)

Bigger models are harder to train!

▸ Deep topology for recurrent networks (16 layers)
▸ Performance boost (+6.2 BLEU points)
▸ Fast-forward connections for gradient flow
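A rough sketch of the general idea, using a simple additive skip connection rather than the exact fast-forward block of Zhou et al.: each layer's input is carried forward around the recurrence, so gradients have a short path through a 16-layer stack:

```python
import torch
import torch.nn as nn

class DeepRecurrentEncoder(nn.Module):
    """Stack of GRU layers where each layer also receives its input via a skip path."""
    def __init__(self, dim=128, depth=16):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.GRU(dim, dim, batch_first=True) for _ in range(depth)
        )

    def forward(self, x):
        for gru in self.layers:
            out, _ = gru(x)
            x = x + out        # skip connection: gradients flow straight through the sum
        return x

enc = DeepRecurrentEncoder()
x = torch.randn(2, 30, 128)
print(enc(x).shape)            # (2, 30, 128)
```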

SLIDE 29 (28/36)

How far can we extend the existing approaches?

Multi-task/Multilingual Models

Bigger models are harder to train and behave differently!

▸ Scheduling the learning process
▸ Balanced-batch trick and early-stopping heuristics
  (Firat et al. 2016b, Lee et al. 2016, Johnson et al. 2016)
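A minimal sketch of the balanced-batch idea (function and corpus names are made up): cycle over language pairs round-robin so every update sees each pair equally often, no matter how skewed the corpus sizes are; real systems pair this with per-language early-stopping heuristics:

```python
import itertools
import random

def balanced_batches(corpora, batch_size=4):
    """Yield (pair, batch) tuples, round-robin over language pairs."""
    for pair in itertools.cycle(sorted(corpora)):
        sentences = corpora[pair]
        yield pair, random.sample(sentences, min(batch_size, len(sentences)))

# toy corpora with very different sizes
corpora = {
    ("en", "de"): [f"en-de sent {i}" for i in range(1000)],
    ("en", "fi"): [f"en-fi sent {i}" for i in range(50)],    # low-resource pair
}

for pair, batch in itertools.islice(balanced_batches(corpora), 4):
    print(pair, len(batch))
```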

SLIDE 30 (29/36)

How far can we extend the existing approaches?

Multi-task / Multilingual Models

Bigger models are harder to train and behave differently!

▸ Preventing unlearning (catastrophic forgetting)
▸ Update-scheduling heuristics (!)

SLIDE 31 (30/36)

Interpretability

▸ Information is distributed, which makes it hard to interpret
▸ What is the attention model doing exactly? (Johnson et al. 2016)
▸ How to dissect these giant models?
▸ Which sub-task should we use to evaluate models?
▸ Simultaneous Neural Machine Translation (Gu et al. 2016)
▸ Character-level alignments or importance

SLIDE 32 (31/36)

Longer Sequences

Training Latency

▸ Longer credit-assignment paths (BPTT)
▸ Extended training times:
  ▸ Bilingual bpe2bpe: 1 week
  ▸ Bilingual char2char: 2 weeks
  ▸ Multilingual (10 pairs) bpe2bpe: 3 weeks (2 GPUs)
  ▸ Multilingual (4 pairs) char2char: 2.5 months

▸ Training latency limits the search for:
  ▸ Diverse model architectures
  ▸ Hyper-parameters (limited search)
▸ How do we extend to larger context, document level?

SLIDE 33 (32/36)

What about multiple modalities?

“Multi-modal Attention for Neural Machine Translation” Caglayan, Barrault and Bougares, 2016

SLIDE 34 (33/36)

What about multiple modalities?

“Lip Reading Sentences in the Wild” Chung et al., 2016

SLIDE 35 (34/36)

Why stop at characters?

SLIDE 36 (35/36)

What are we optimizing?

Explorations on the right objective to be optimized (beyond plain NLL):

▸ MERT (Och, 2003), MRT for NMT (Shen et al., 2016)
▸ Scheduled Sampling (Bengio et al., 2015), Sequence-Level Training (Ranzato et al., 2015), Task Loss Estimation (Bahdanau et al., 2015)
▸ Actor-Critic (Bahdanau et al., 2016), Reward Augmented ML (Norouzi et al., 2016)
▸ Seq2Seq as Beam-Search Optimization (Wiseman and Rush, 2016)

The new territory seems to be using new error signals!
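As a schematic example of such an error signal, here is a REINFORCE-style sequence-level surrogate loss (toy values, hypothetical function, not the exact objective of any paper listed above): sampled translations are pushed up in proportion to a task reward such as sentence-level BLEU, instead of minimising per-token NLL against the reference:

```python
import torch

def sequence_level_loss(sample_log_probs, rewards):
    """
    REINFORCE-style surrogate: increase the log-probability of sampled outputs
    in proportion to a task reward (e.g. sentence-level BLEU), rather than
    optimising per-token NLL against the reference.
    """
    # subtracting the batch-mean reward is a common variance-reduction baseline
    advantage = rewards - rewards.mean()
    return -(advantage * sample_log_probs).mean()

# toy values: log p(sampled translation) and a BLEU-like reward per example
log_probs = torch.tensor([-12.3, -8.7, -15.1], requires_grad=True)
rewards = torch.tensor([0.31, 0.55, 0.12])
loss = sequence_level_loss(log_probs, rewards)
loss.backward()
print(loss.item(), log_probs.grad)
```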

SLIDE 37 (36/36)

Thank you!