Modelling Multiple Sequences: Explorations, Consequences and Challenges
Orhan Firat, Dagstuhl Seminar - C2NLU, January 2017
The Fog of Progress¹ and Artificial General Intelligence
¹ Hinton video lectures, https://www.youtube.com/watch?v=ZuvRXGX8cY8
Timeline 2014 to 2017 (milestones, earliest to latest): Neural-MT, Large-Vocabulary NMT, Image Captioning NMT, OpenMT'15 NMT, WMT'15 NMT, Subword-NMT, Multilingual-NMT, Multi-Source NMT, Character-Dec NMT, WMT'16 NMT, Zero-Resource NMT, Google-NMT, Fully Char-NMT, Google Zero-Shot NMT
What Lies Ahead?
Perhaps, we’ve only scratched the surface!
▸ Breaking the language barrier, surpassing human-level quality.
Revisiting the new territory, using:
▸ Multiple modalities
▸ Better error signals
▸ and better GPUs
What is a sequence?
▸ A sequence $(x_1, \dots, x_T)$ can be:
▸ a sentence: (“visa”, “process”, “is”, “taking”, “so”, “long”, “.”)
▸ an image
▸ a video
▸ speech
▸ ...
What is sequence modelling?
“It is all about telling how likely a sequence is.”
— Kyunghyun Cho
▸ Modelling in the sense of predictive modelling.
▸ What is the probability of $(x_1, \dots, x_T)$?
▸ $p(x_1, x_2, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_{<t}) = ?$
▸ Example: RNN language models (a minimal sketch follows below)
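Below is a minimal, hedged sketch of such an RNN language model in PyTorch (the class name and layer sizes are illustrative, not from the talk). It makes the factorization explicit: the network summarizes the prefix $x_{<t}$, scores the next token, and the sequence log-probability is the sum of the per-step terms.

```python
# Minimal RNN language model sketch (PyTorch); names and sizes are illustrative.
import torch
import torch.nn as nn

class RNNLM(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=256, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.proj = nn.Linear(hid_dim, vocab_size)

    def forward(self, tokens):
        # tokens: (batch, T) integer ids; predict x_t from the prefix x_<t
        emb = self.embed(tokens[:, :-1])              # inputs x_1 .. x_{T-1}
        hidden, _ = self.rnn(emb)                     # running summary of x_<t
        logp = torch.log_softmax(self.proj(hidden), dim=-1)
        targets = tokens[:, 1:]                       # the tokens being predicted
        # log p(x) = sum_t log p(x_t | x_<t)
        return logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1).sum(dim=1)

x = torch.randint(0, 10000, (2, 7))                   # two toy "sentences"
print(RNNLM()(x))                                     # per-sentence log-probabilities
```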
What do we mean by multiple sequences? Well, we really mean more than one (multiple) sequence.
Conditional-LM
▸ $p(x_1, \dots, x_T \mid Y) = \prod_{t=1}^{T} p(x_t \mid x_{<t}, Y)$
▸ seq2seq models, NMT (a toy sketch follows below)
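A hedged, toy illustration of this conditional factorization (not any particular NMT system): the decoder is the same kind of RNN language model, but every step is additionally conditioned on a summary $Y$ of the other sequence, here taken to be the final state of a source encoder. All sizes and names are made up.

```python
# Toy conditional LM sketch (PyTorch): log p(tgt | src) with Y = encoded source.
import torch
import torch.nn as nn

V_SRC, V_TGT, E, H = 8000, 10000, 256, 512            # illustrative sizes

src_embed = nn.Embedding(V_SRC, E)
encoder = nn.GRU(E, H, batch_first=True)
tgt_embed = nn.Embedding(V_TGT, E)
decoder = nn.GRU(E, H, batch_first=True)
proj = nn.Linear(H, V_TGT)

def cond_logprob(src, tgt):
    """log p(tgt | src) = sum_t log p(x_t | x_<t, Y)."""
    _, y = encoder(src_embed(src))                    # Y: summary of the source
    hidden, _ = decoder(tgt_embed(tgt[:, :-1]), y)    # decoder initialized with Y
    logp = torch.log_softmax(proj(hidden), dim=-1)
    nxt = tgt[:, 1:]
    return logp.gather(-1, nxt.unsqueeze(-1)).squeeze(-1).sum(dim=1)

src = torch.randint(0, V_SRC, (2, 9))
tgt = torch.randint(0, V_TGT, (2, 7))
print(cond_logprob(src, tgt))                         # per-pair log p(tgt | src)
```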
Multi-Way: Enc1, Enc2, Enc3 → shared Att → Dec1, Dec2, Dec3 (architecture diagram)
▸ Multi-lingual models
▸ Multi-modal models
Tall towers analogy:
▸ Do not shout from tower to tower,
▸ Go down to the common basement of all towers: interlingua
▸ Red Tower: source language
▸ Blue Tower: target language
▸ Green Car: alignment function
▸ Do NOT model the individual behaviour of a car,
▸ Model how the highway works!
Issues with tokenization and segmentation
▸ Ineffective handling of morphological variants:
’run’, ’runs’, ’running’ and ’runner’ are all separate tokens
▸ How are we doing with compound words?
Issues with treating each and every token separately
▸ Similar words fill up the vocabulary
▸ Vocabulary size keeps growing with the corpus size
▸ Rare words, numbers and misspelled words: “9/11” carries huge contextual information
▸ We lose the learning signal of every word mapped to <UNK> (a tiny illustration follows below)
(slide credit: Junyoung Chung)
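A small, self-contained illustration of the last two points (the toy corpus below is made up): with word-level tokens the vocabulary keeps growing, and anything rare, numeric, or morphologically unusual collapses to <UNK>, losing its learning signal.

```python
# Toy illustration of word-level vocabulary growth and <UNK> replacement.
from collections import Counter

corpus = ("the visa process is taking so long . the run was long . "
          "run runs running runner 9/11 was a huge event .").split()

counts = Counter(corpus)
vocab = {w for w, c in counts.items() if c >= 2}       # keep only frequent words
print("corpus tokens:", len(corpus), "| distinct words:", len(counts))

# Rare words, numbers and morphological variants all collapse to <UNK>:
# 'runs', 'running', 'runner' and '9/11' are lost, while 'run' survives.
print([w if w in vocab else "<UNK>" for w in corpus])
```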
We are still concerned:
▸ Data sparsity is reduced but will still remain (Bengio et al., 2003)
▸ Consequences of increased sequence length:
▸ Capturing long-term dependencies
▸ Harder to train (but we have GRUs, LSTMs and attention)
▸ Speed loss, 2-3 times slower
but ...
▸ No need to worry about segmentation,
▸ Open vocabularies, saving us giant matrices or tricks (a small sketch follows below)
▸ Naturally embeds multiple languages (Lee et al. ’16)
▸ And maybe multiple modalities, with even finer tokens.
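A hedged sketch of why character-level modelling gives an open vocabulary: the symbol inventory is tiny and closed, so nothing ever maps to <UNK> and several languages share one inventory. The example sentences are illustrative only; the trade-off is that sequences get much longer.

```python
# Character-level "tokenization": a tiny, closed symbol inventory, no <UNK>.
sentences = [
    "the visa process is taking so long.",
    "Vize işlemleri çok uzun sürüyor.",       # Turkish, same symbol inventory
]

charset = sorted({ch for s in sentences for ch in s})
char2id = {ch: i for i, ch in enumerate(charset)}

print("symbol inventory size:", len(charset))  # stays small, unlike word vocabularies
for s in sentences:
    ids = [char2id[ch] for ch in s]
    print(len(ids), "symbols:", ids[:10], "...")  # sequences become much longer, though
```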
Interlingua as Shared Functional Form
▸ Luong et al. 2015 - Examines multi-task sequence-to-sequence learning
▸ One-to-many: MT and Syntactic Parsing
▸ Many-to-one: Translation and Image Captioning
▸ Many-to-many: Unsupervised objectives and MT
▸ Firat et al. 2016a - Shared attention mechanism (a conceptual sketch follows below)
▸ Notion of a shared function representing interlingua
▸ Trained using parallel data only
▸ Positive language transfer for low-resource pairs (Firat et al. 2016b)
▸ Single model that can translate 10 language pairs
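A conceptual sketch of the multi-way idea, not the authors' released code: per-language encoders and decoders wrapped around a single attention function that every language pair shares. All module names and sizes here are invented for illustration.

```python
# Conceptual multi-way NMT sketch (PyTorch): per-language encoders/decoders,
# one attention module shared across every language pair.
import torch
import torch.nn as nn

E, H, V = 128, 256, 1000                              # illustrative sizes
langs = ["en", "de", "fr"]

embeds   = nn.ModuleDict({l: nn.Embedding(V, E) for l in langs})
encoders = nn.ModuleDict({l: nn.GRU(E, H, batch_first=True) for l in langs})
decoders = nn.ModuleDict({l: nn.GRUCell(E + H, H) for l in langs})
outputs  = nn.ModuleDict({l: nn.Linear(H, V) for l in langs})
shared_attention = nn.Linear(H + H, 1)                # the single shared scoring function

def attend(dec_state, enc_states):
    # Score every source position with the shared function, then mix the states.
    scores = shared_attention(torch.cat(
        [enc_states, dec_state.unsqueeze(1).expand_as(enc_states)], dim=-1))
    weights = torch.softmax(scores, dim=1)
    return (weights * enc_states).sum(dim=1)

def translate_step(src_lang, tgt_lang, src_ids, prev_tgt_id, dec_state):
    enc_states, _ = encoders[src_lang](embeds[src_lang](src_ids))
    context = attend(dec_state, enc_states)           # same attention for any pair
    inp = torch.cat([embeds[tgt_lang](prev_tgt_id), context], dim=-1)
    new_state = decoders[tgt_lang](inp, dec_state)
    return torch.log_softmax(outputs[tgt_lang](new_state), dim=-1), new_state

state = torch.zeros(1, H)
logp, state = translate_step("en", "de", torch.randint(0, V, (1, 6)),
                             torch.tensor([2]), state)
print(logp.shape)                                     # (1, V): next-token distribution
```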
Training with multiple language pairs has encouraged the model to find a common context-vector space (we can exploit flattened manifolds).
▸ Enables multi-source translation:
▸ Multi-source training (Zoph and Knight, 2016)
▸ Multi-source decoding (Firat et al. 2016c)
▸ Enables zero-resource translation (Firat et al. 2016c)
▸ Easily extensible to Larger-Context NMT and System Combination
▸ Johnson et al. 2016 - Google Multilingual NMT
▸ Thanh-Le Ha et al. 2016 - Karlsruhe, universal encoder and decoder
▸ Mixed (multilingual) sub-word vocabularies (not characters)
▸ Enables zero-shot translation (a small data-side sketch follows below)
▸ Source-side code-switching (translate from a mixed source)
▸ Target-side language selection (generate a mixed translation)
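A minimal sketch of the data-side mechanism Johnson et al. 2016 describe: prepend an artificial token naming the desired target language to every source sentence, so one shared model can be steered toward any target, including unseen (zero-shot) pairs. The token format and the helper function are illustrative, not the paper's exact preprocessing.

```python
# Target-language token trick: steer one multilingual model with a prefix token.
def add_target_token(source_sentence: str, target_lang: str) -> str:
    return f"<2{target_lang}> {source_sentence}"

print(add_target_token("How are you?", "es"))
# -> "<2es> How are you?"  (the prefix is present at training and decoding time)

# A code-switched source and an unseen (zero-shot) direction use the same mechanism:
print(add_target_token("How are you? Wie geht es dir?", "ja"))
```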
▸ Lee et al. 2016 - Fully Character-Level Multilingual NMT
▸ A character-based decoder was already proposed (Chung et al. 2016)
▸ What makes it challenging?
▸ Training time (naive approach = 3 months, Luong et al. 2016)
▸ Mapping a character sequence to meaning
▸ Long-range dependencies in text
▸ Goal: map a character sequence to meaning without sacrificing speed!
Jason Lee, Kyunghyun Cho and Thomas Hofmann, 2016
Model details:
▸ RNNSearch model
▸ Character-level source and target
▸ CNN+RNN encoder
▸ Two-layer simple GRU decoder
▸ {Fi, De, Cs, Ru} → En
Training:
▸ Mixed mini-batches
▸ Bi-text only
Hybrid Character Encoder
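A hedged sketch of what a hybrid character-level encoder can look like, in the spirit of Lee et al. 2016: convolutions over character embeddings, strided max-pooling to shorten the sequence, then a recurrent layer over the pooled segments. Layer sizes are invented, and the paper's highway layers and multi-width convolutions are omitted for brevity.

```python
# Sketch of a hybrid character encoder: conv over chars -> strided pooling -> GRU.
import torch
import torch.nn as nn

class CharEncoder(nn.Module):
    def __init__(self, n_chars=300, emb=64, channels=256, hid=512, pool=5):
        super().__init__()
        self.embed = nn.Embedding(n_chars, emb)
        self.conv = nn.Conv1d(emb, channels, kernel_size=3, padding=1)
        self.pool = nn.MaxPool1d(kernel_size=pool, stride=pool)   # shortens the sequence
        self.rnn = nn.GRU(channels, hid, batch_first=True, bidirectional=True)

    def forward(self, char_ids):                      # (batch, n_characters)
        x = self.embed(char_ids).transpose(1, 2)      # (batch, emb, n_characters)
        x = torch.relu(self.conv(x))                  # local character n-gram features
        x = self.pool(x).transpose(1, 2)              # ~5x fewer positions for the RNN
        states, _ = self.rnn(x)
        return states                                 # source states for attention

enc = CharEncoder()
print(enc(torch.randint(0, 300, (2, 200))).shape)     # (2, 40, 1024)
```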
Experimental results for char2char:
▸ More flexible in assigning model capacity to different languages
▸ Works better than most bilingual models (while also being more parameter-efficient)
From Rico Sennrich, a comparison of bpe2bpe vs bpe2char vs char2char:
Human evaluation:
▸ Multilingual char2char ≥ Bilingual char2char
▸ Bilingual char2char > Bilingual bpe2char
Additional qualitative analysis:
Also from Rico Sennrich 2016 (LingEval97: 97,000 contrastive translation pairs):
Bigger models, complicated architectures!
RNNs can express/approximate a set of Turing machines, BUT∗ ...
∗Edward Grefenstette: Deep Learning Summer School 2016
Fast-Forward Connections for NMT (Zhou et al., 2016)
Bigger models are harder to train!
▸ Deep topology for recurrent networks (16 layers)
▸ Performance boost (+6.2 BLEU points)
▸ Fast-forward connections for gradient flow (a simplified sketch follows below)
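A heavily simplified sketch of the motivation: give gradients a short path through a deep recurrent stack by letting each layer's input skip past it. Here the skip is a plain residual addition, a stand-in rather than the paper's exact fast-forward formulation; depth and sizes are illustrative.

```python
# Simplified deep recurrent stack with skip paths between layers
# (a residual-style stand-in for fast-forward connections).
import torch
import torch.nn as nn

class DeepRecurrent(nn.Module):
    def __init__(self, dim=256, layers=8):
        super().__init__()
        self.rnns = nn.ModuleList(
            [nn.GRU(dim, dim, batch_first=True) for _ in range(layers)])

    def forward(self, x):
        for rnn in self.rnns:
            out, _ = rnn(x)
            x = x + out           # skip path: gradients can bypass every recurrent layer
        return x

print(DeepRecurrent()(torch.randn(2, 10, 256)).shape)   # (2, 10, 256)
```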
Multi-task/Multilingual Models
Bigger models are harder to train and behave differently!
▸ Scheduling the learning process
▸ Balanced-batch trick and early-stopping heuristics (Firat et al. 2016b, Lee et al. 2016, Johnson et al. 2016); a toy scheduler sketch follows below
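One possible way to realize a balanced-batch schedule over several language pairs, sketched as a simple round-robin policy; the pair names, the toy batch streams, and the policy itself are illustrative rather than the exact heuristics of the cited papers.

```python
# Round-robin batch scheduler over multiple language pairs (illustrative policy).
import itertools

def balanced_batches(streams_by_pair):
    """Yield (pair, batch), visiting every language pair equally often."""
    for pair in itertools.cycle(streams_by_pair):
        yield pair, next(streams_by_pair[pair])

# Toy "batch streams" per pair; real ones would yield tensor mini-batches.
streams = {("en", "de"): itertools.count(0),
           ("en", "fr"): itertools.count(100),
           ("en", "cs"): itertools.count(200)}

schedule = balanced_batches(streams)
print([next(schedule) for _ in range(6)])
# [(('en','de'), 0), (('en','fr'), 100), (('en','cs'), 200), (('en','de'), 1), ...]
```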
▸ Preventing unlearning (catastrophic forgetting)
▸ Update-scheduling heuristics (!)
▸ Information is distributed, which makes it hard to interpret
▸ What is the attention model doing exactly? (Johnson et al. 2016)
▸ How to dissect these giant models?
▸ Which sub-task should we use to evaluate models?
▸ Simultaneous Neural Machine Translation (Gu et al. 2016)
▸ Character-level alignments or importance
Training Latency
▸ Longer credit-assignment paths (BPTT)
▸ Extended training times
▸ Bilingual bpe2bpe: 1 week
▸ Bilingual char2char: 2 weeks
▸ Multilingual (10 pairs) bpe2bpe: 3 weeks (2 GPUs)
▸ Multilingual (4 pairs) char2char: 2.5 months
▸ Training latency limits exploration:
▸ Fewer diverse model architectures
▸ Limited hyper-parameter search
▸ How do we extend to larger context, document-level models?
“Multi-modal Attention for Neural Machine Translation” Caglayan, Barrault and Bougares, 2016
“Lip Reading Sentences in the Wild” Chung et al., 2016
Explorations on the right objective to be optimized, beyond NLL:
▸ MERT (Och, 2003), MRT for NMT (Shen et al., 2016)
▸ Scheduled Sampling (Bengio et al., 2015), Sequence-Level Training (Ranzato et al., 2015), Task Loss Estimation (Bahdanau et al., 2015)
▸ Actor-Critic (Bahdanau et al., 2016), Reward-Augmented ML (Norouzi et al., 2016)
▸ Seq2Seq as Beam-Search Optimization (Wiseman and Rush, 2016)
The new territory seems to be using new error signals! (a heavily simplified sketch follows below)
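To make the common thread concrete, here is a heavily simplified sketch of one member of this family, closest in spirit to REINFORCE-style sequence-level training: sample a full output from the model, score it with a task reward (e.g. sentence-level BLEU), and scale the sample's log-probability by that reward instead of using per-token NLL. The reward value and the random "model outputs" below are placeholders.

```python
# Simplified sequence-level training step: reward-weighted log-likelihood of a sample.
import torch

def sequence_level_loss(step_logits, reward):
    """step_logits: per-step (vocab,) logit tensors from the model's own sampling loop."""
    logp = 0.0
    for logits in step_logits:
        dist = torch.distributions.Categorical(logits=logits)
        token = dist.sample()                 # the model explores its own output space
        logp = logp + dist.log_prob(token)
    # Higher-reward samples get their probability pushed up; NLL is not used directly.
    return -reward * logp

# Toy usage: random "logits" stand in for a real decoder; the reward is made up.
step_logits = [torch.randn(1000, requires_grad=True) for _ in range(5)]
loss = sequence_level_loss(step_logits, reward=0.42)
loss.backward()                               # gradients flow back to the logits
print(loss.item())
```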