Modelling Multiple Sequences: Explorations, Consequences and Challenges
Orhan Firat, Dagstuhl Seminar C2NLU, January 2017


  1. Modelling Multiple Sequences: Explorations, Consequences and Challenges. Orhan Firat, Dagstuhl Seminar C2NLU, January 2017.

  2. Before we start! The Fog of Progress¹ and Artificial General Intelligence. ¹ Hinton video lectures, https://www.youtube.com/watch?v=ZuvRXGX8cY8

  3. What is going on? 2014 → 2017: ⋆ Neural-MT ⋆ Large-Vocabulary NMT ⋆ Image Captioning - NMT ⋆ OpenMT’15 - NMT ⋆ WMT’15 - NMT ⋆ Subword-NMT ⋆ Multilingual-NMT ⋆ Multi-Source NMT ⋆ Character-Dec NMT ⋆ WMT’16 - NMT ⋆ Zero-Resource NMT ⋆ Google - NMT ⋆ Fully Char-NMT ⋆ Google Zero-Shot NMT

  4. Conclusion slide of Machine Translation Marathon’16: What lies ahead? Perhaps we’ve only scratched the surface! ▸ Language barrier, surpassing human-level quality. Revisiting the new territory of Character-level, Larger-Context, Multilingual Neural Machine Translation using: ▸ multiple modalities ▸ better error signals ▸ and better GPUs

  5. Multi-Sequence Modelling. What is a sequence? ▸ A sequence (x_1, ..., x_T) can be: ▸ a sentence (“visa”, “process”, “is”, “taking”, “so”, “long”, “.”) ▸ an image ▸ a video ▸ speech ▸ ...

  6. Multi-Sequence Modelling. What is sequence modelling? “It is all about telling how likely a sequence is.” — Kyunghyun Cho ▸ Modelling in the sense of predictive modelling. ▸ What is the probability of (x_1, ..., x_T)? ▸ p(x_1, x_2, ..., x_T) = ∏_{t=1}^{T} p(x_t ∣ x_{<t}) = ? ▸ Example: RNN language models
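
To make the factorization concrete, here is a minimal RNN language-model sketch (not from the talk; PyTorch, with illustrative sizes and names assumed throughout):

```python
import torch
import torch.nn as nn

class RNNLanguageModel(nn.Module):
    """Produces logits for p(x_t | x_<t) at every position of a token sequence."""
    def __init__(self, vocab_size, emb_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):
        # tokens: (batch, T). The recurrent state after step t summarizes x_<=t,
        # so the output at step t is used to predict token t+1.
        hidden_states, _ = self.rnn(self.embed(tokens))
        return self.proj(hidden_states)  # (batch, T, vocab_size) logits

# log p(x_1, ..., x_T) = sum_t log p(x_t | x_<t): shift inputs and targets by one.
model = RNNLanguageModel(vocab_size=10000)
x = torch.randint(0, 10000, (2, 20))      # two toy sequences of length 20
log_probs = torch.log_softmax(model(x[:, :-1]), dim=-1)
seq_log_prob = log_probs.gather(-1, x[:, 1:].unsqueeze(-1)).squeeze(-1).sum(dim=1)
```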

  7. Multi-Sequence Modelling. What do we mean by multiple sequences? Well, we really mean more than one (multiple) sequence. Conditional-LM: ▸ p(x_1, ..., x_T ∣ Y) = ∏_{t=1}^{T} p(x_t ∣ x_{<t}, Y) ▸ seq2seq models, NMT
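
The conditional LM changes nothing but the conditioning set. A hedged sketch of one decoder step, assuming Y has already been compressed into a single context vector (as in pre-attention seq2seq models); the module choices are illustrative:

```python
import torch
import torch.nn as nn

class ConditionalDecoder(nn.Module):
    """p(x_t | x_<t, Y): a GRU language model whose every step also sees Y."""
    def __init__(self, vocab_size, emb_dim=256, ctx_dim=512, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Each step consumes the previous token *and* the source context.
        self.rnn = nn.GRUCell(emb_dim + ctx_dim, hidden_dim)
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def step(self, prev_token, context, hidden):
        # prev_token: (batch,), context: (batch, ctx_dim), hidden: (batch, hidden_dim)
        inp = torch.cat([self.embed(prev_token), context], dim=-1)
        hidden = self.rnn(inp, hidden)
        return self.proj(hidden), hidden  # logits for x_t, updated state
```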

  8. Multi-Sequence Modelling. What do we mean by multiple sequences? Well, we really mean more than one (multiple) sequence. Conditional-LM: ▸ p(x_1, ..., x_T ∣ Y) = ∏_{t=1}^{T} p(x_t ∣ x_{<t}, Y) ▸ seq2seq models, NMT. Multi-Way: [diagram: Enc 1 / Enc 2 / Enc 3 → shared Att → Dec 1 / Dec 2 / Dec 3] ▸ Multi-lingual models ▸ Multi-modal models
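
A rough skeleton of the multi-way setup: language-specific encoders and decoders around one shared attention function, the "shared medium". This is an illustrative PyTorch sketch, not the exact architecture of Firat et al. 2016a:

```python
import torch
import torch.nn as nn

class SharedAttention(nn.Module):
    """One attention function reused by every (encoder, decoder) pair."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)

    def forward(self, dec_state, enc_states):
        # enc_states: (batch, T_src, dim), dec_state: (batch, dim)
        query = dec_state.unsqueeze(1).expand_as(enc_states)
        scores = self.score(torch.cat([enc_states, query], dim=-1)).squeeze(-1)
        weights = torch.softmax(scores, dim=-1)                  # soft alignment
        return (weights.unsqueeze(-1) * enc_states).sum(dim=1)  # context vector

# Language-specific ends, one shared middle (the language set is illustrative):
encoders = nn.ModuleDict(
    {lang: nn.GRU(256, 512, batch_first=True) for lang in ("en", "fr", "de")})
attention = SharedAttention(512)  # the single shared component
# decoders would be an analogous ModuleDict keyed by the target language.
```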

  9. Warren Weaver, “Translation”, 1949. Tall towers analogy: ▸ Do not shout from tower to tower; ▸ go down to the common basement of all towers: the interlingua.

  10. Warren Weaver, “Translation”, 1949. Tall towers analogy: ▸ Red Tower: source language ▸ Blue Tower: target language ▸ Green Car: alignment function

  11. Warren Weaver, “Translation”, 1949. Tall towers analogy: ▸ Do NOT model the individual behaviour of a car; ▸ model how the highway works!

  12. Sequence Modelling with Finer Tokens. Issues with tokenization and segmentation: ▸ An ineffective way of handling morphological variants: ‘run’, ‘runs’, ‘running’ and ‘runner’ ▸ How are we doing with compound words? Issues with treating each and every token separately: ▸ We fill the vocabulary with similar words ▸ Vocabulary size grows linearly w.r.t. corpus size ▸ Rare words, numbers and misspelled words: “9/11” carries huge contextual information ▸ We lose the learning signal of words marked as <UNK>. Slide credit: Junyoung Chung
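
The vocabulary-growth and <UNK> problems are easy to see on a toy corpus. A self-contained illustration (the corpus and cutoff are made up):

```python
from collections import Counter

def unk_rate(train_tokens, test_tokens, vocab_size):
    """Fraction of test tokens falling outside a top-k word vocabulary."""
    vocab = {w for w, _ in Counter(train_tokens).most_common(vocab_size)}
    return sum(t not in vocab for t in test_tokens) / len(test_tokens)

train = "the runner runs while running is fun and the run goes on".split()
test = "runners were running the run".split()

# Word-level: morphological variants like 'runners' fall out of a small vocabulary,
print(unk_rate(train, test, vocab_size=5))
# while a character vocabulary is tiny and closed, so nothing maps to <UNK>.
print(len({c for word in train for c in word}))
```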

  13. Granularity of Input and Output Spaces (finer tokens)

  14. Sequence Modelling with Finer Tokens. We are still concerned: ▸ Less data sparsity (though some will still remain; Bengio et al., 2003) ▸ Consequences of increased sequence length! ▸ Capturing long-term dependencies ▸ Will be harder to train (but wait, we have GRU, LSTM and Attention) ▸ Speed loss, 2-3 times slower. But ... ▸ No need to worry about segmentation ▸ Open vocabularies, saving us giant matrices or tricks ▸ Naturally embeds multiple languages (Lee et al. ’16) ▸ And maybe multiple modalities, with even finer tokens.

  15. Explorations: Shared Medium. Interlingua as Shared Functional Form

  16. Consequences: Shared Medium. Interlingua as Shared Functional Form ▸ Luong et al. 2015 examines multi-task sequence-to-sequence learning: ▸ One-to-many: MT and syntactic parsing ▸ Many-to-one: translation and image captioning ▸ Many-to-many: unsupervised objectives and MT

  17. Consequences: Shared Medium. Interlingua as Shared Functional Form ▸ Firat et al. 2016a: a shared attention mechanism ▸ The notion of a shared function representing the interlingua ▸ Trained using parallel data only ▸ Positive language transfer for low-resource pairs (Firat et al. 2016b) ▸ A single model that can translate 10 pairs

  18. Consequences: Shared Medium. Interlingua as Shared Functional Form. *Training with multiple language pairs encourages the model to find a common context-vector space (we can exploit flattened manifolds). ▸ Enables multi-source translation: ▸ multi-source training (Zoph and Knight, 2016) ▸ multi-source decoding (Firat et al. 2016c) ▸ Enables zero-resource translation (Firat et al. 2016c) ▸ Easily extends to larger-context NMT and system combination

  19. Consequences: Shared Medium. Interlingua as Shared Functional Form ▸ Johnson et al. 2016: Google Multilingual NMT ▸ Thanh-Le Ha et al. 2016: Karlsruhe, universal encoder and decoder ▸ Mixed (multilingual) sub-word vocabularies (not characters) ▸ Enables zero-shot translation ▸ Source-side code-switching (translate from a mixed source) ▸ Target-side language selection (generate a mixed translation)
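
Mechanically, this needs no architecture change: Johnson et al. 2016 steer a single shared model by prepending an artificial target-language token to the source sentence. A minimal sketch of that preprocessing step (the <2xx> spelling follows the paper; the example and tokenization are illustrative):

```python
def add_target_token(source_sentence: str, target_lang: str) -> str:
    """Prepend the token that tells the shared model which language to produce."""
    return f"<2{target_lang}> {source_sentence}"

print(add_target_token("Hello , how are you ?", "es"))
# -> "<2es> Hello , how are you ?"
# Zero-shot: train on en<->es and en<->fr only, then request <2fr> on Spanish input.
```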

  20. Consequences: Shared Medium. Interlingua as Shared Functional Form ▸ Lee et al. 2016: fully character-level multilingual NMT ▸ A character-based decoder was already proposed (Chung et al. 2016) ▸ What makes it challenging? ▸ Training time (a naive approach takes 3 months; Luong et al. 2016) ▸ Mapping a character sequence to meaning ▸ Long-range dependencies in text ▸ The goal: map the character sequence to meaning without sacrificing speed!

  21. Fully Character-Level Multilingual NMT. Jason Lee, Kyunghyun Cho and Thomas Hofmann, 2016. Model details: ▸ RNNSearch model ▸ Character level on both source and target sides ▸ CNN+RNN encoder ▸ Two-layer simple GRU decoder ▸ {Fi, De, Cs, Ru} → En. Training: ▸ mix mini-batches ▸ use bi-text only

  22. Fully Character-Level Multilingual NMT. Jason Lee, Kyunghyun Cho and Thomas Hofmann, 2016. Hybrid character encoder:
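
The encoder idea, in a simplified single-filter-width sketch: embed characters, compress the sequence with a convolution plus strided max-pooling, then run a bidirectional RNN over the much shorter segment sequence. The actual model also uses multiple filter widths and highway layers, omitted here:

```python
import torch
import torch.nn as nn

class HybridCharEncoder(nn.Module):
    """CNN compresses the character sequence; the RNN reads the segments."""
    def __init__(self, n_chars, emb_dim=128, conv_dim=256, hidden_dim=512, stride=5):
        super().__init__()
        self.embed = nn.Embedding(n_chars, emb_dim)
        self.conv = nn.Conv1d(emb_dim, conv_dim, kernel_size=5, padding=2)
        self.pool = nn.MaxPool1d(kernel_size=stride, stride=stride)  # T -> T // 5
        self.rnn = nn.GRU(conv_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, chars):                    # chars: (batch, T_chars)
        h = self.embed(chars).transpose(1, 2)    # (batch, emb_dim, T_chars)
        h = self.pool(torch.relu(self.conv(h)))  # (batch, conv_dim, T_chars // 5)
        out, _ = self.rnn(h.transpose(1, 2))     # RNN over far fewer positions
        return out                               # (batch, T // 5, 2 * hidden_dim)
```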

  23. Fully Character-Level Multilingual NMT. Jason Lee, Kyunghyun Cho and Thomas Hofmann, 2016. Experimental results for char2char: 1. Bilingual char2char >= bilingual bpe2char 2. Multilingual char2char > multilingual bpe2char 2.1 More flexible in assigning model capacity to different languages 2.2 Works better than most bilingual models (as well as being more parameter-efficient). From Rico Sennrich, a comparison of bpe2bpe, bpe2char and char2char [results figure not reproduced].

  24. Fully Character-Level Multilingual NMT. Jason Lee, Kyunghyun Cho and Thomas Hofmann, 2016. Human evaluation: ▸ Multilingual char2char >= bilingual char2char ▸ Bilingual char2char > bilingual bpe2char

  25. Fully Character-Level Multilingual NMT. Jason Lee, Kyunghyun Cho and Thomas Hofmann, 2016. Additional qualitative analysis: 1. Spelling mistakes 2. Rare/long words 3. Morphology 4. Nonsense words 5. Multilingual sentences (code-switching). Also from Rico Sennrich, 2016 (LingEval97: 97,000 contrastive translation pairs): 1. Noun-phrase agreement 2. Subject-verb agreement 3. Separable verb particles 4. Polarity 5. Transliteration
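
Contrastive evaluation in the LingEval97 style is straightforward to implement: score the human reference against a minimally corrupted variant, and count the pair as passed when the reference wins. A hedged sketch; the model_score interface and the example pair are assumptions:

```python
def contrastive_accuracy(model_score, pairs):
    """Fraction of pairs where the model prefers the correct translation.
    model_score(src, trg) should return log p(trg | src) under the NMT model."""
    wins = sum(model_score(src, ref) > model_score(src, bad)
               for src, ref, bad in pairs)
    return wins / len(pairs)

# A hypothetical pair probing English->German subject-verb agreement:
pairs = [("he works there",     # source
          "er arbeitet dort",   # correct reference
          "er arbeiten dort")]  # contrastive variant with an agreement error
```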

  26. [Figure-only slide]

  27. How far can we extend the existing approaches? Bigger models, complicated architectures! RNNs can express/approximate a set of Turing machines, BUT* ... expressivity ≠ learnability. *Edward Grefenstette, Deep Learning Summer School 2016

  28. How far can we extend the existing approaches? Fast-Forward Connections for NMT (Zhou et al., 2016). Bigger models are harder to train! ▸ Deep topology for recurrent networks (16 layers) ▸ Performance boost (+6.2 BLEU points) ▸ Fast-forward connections for gradient flow

  29. How far can we extend the existing approaches? Multi-task/Multilingual Models. Bigger models are harder to train and behave differently! ▸ Scheduling the learning process ▸ The balanced-batch trick and early-stopping heuristics (Firat et al. 2016b, Lee et al. 2016, Johnson et al. 2016)
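
One way to realize the balanced-batch trick is a round-robin over per-language-pair data loaders, so every pair contributes updates at the same rate. A sketch under an assumed loader interface:

```python
from itertools import cycle

def balanced_batches(loaders_by_pair):
    """Yield (pair, batch) round-robin so no language pair dominates training.
    Runs indefinitely; stop it from the training loop. The loader API is assumed."""
    iterators = {pair: iter(loader) for pair, loader in loaders_by_pair.items()}
    for pair in cycle(loaders_by_pair):
        try:
            yield pair, next(iterators[pair])
        except StopIteration:                        # pair exhausted: restart it
            iterators[pair] = iter(loaders_by_pair[pair])
            yield pair, next(iterators[pair])
```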

  30. How far can we extend the existing approaches? Multi-task/Multilingual Models. Bigger models are harder to train and behave differently! ▸ Preventing unlearning (catastrophic forgetting) ▸ Update-scheduling heuristics (!)

  31. Interpretability ▸ Information is distributed, which makes it hard to interpret ▸ What exactly is the attention model doing? (Johnson et al. 2016) ▸ How do we dissect these giant models? ▸ Which sub-task should we use to evaluate models? ▸ Simultaneous neural machine translation (Gu et al. 2016) ▸ Character-level alignments or importance
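
One basic probe for the attention question is to save the attention matrix during decoding and plot it against the source and target tokens. A minimal matplotlib sketch; the (T_trg, T_src) matrix layout is an assumption:

```python
import matplotlib.pyplot as plt

def plot_attention(weights, src_tokens, trg_tokens):
    """Visualize soft alignments: rows are target steps, columns source tokens."""
    fig, ax = plt.subplots()
    ax.imshow(weights, cmap="gray_r")  # darker = more attention mass
    ax.set_xticks(range(len(src_tokens)))
    ax.set_xticklabels(src_tokens, rotation=90)
    ax.set_yticks(range(len(trg_tokens)))
    ax.set_yticklabels(trg_tokens)
    plt.show()
```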

  32. Longer Sequences: Training Latency ▸ Longer credit-assignment paths (BPTT) ▸ Extended training times: ▸ bilingual bpe2bpe: 1 week ▸ bilingual char2char: 2 weeks ▸ multilingual (10 pairs) bpe2bpe: 3 weeks (2 GPUs) ▸ multilingual (4 pairs) char2char: 2.5 months ▸ Training latency limits the search over diverse model architectures and restricts hyper-parameter search ▸ How do we extend to larger context, document level?
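
A standard mitigation for the long credit-assignment paths (not proposed on the slide, but common practice): truncated BPTT, which detaches the recurrent state every k steps so gradients never flow further back. A sketch assuming a model(inputs, state) interface that returns (logits, new_state):

```python
import torch

def train_truncated_bptt(model, optimizer, loss_fn, tokens, k=100):
    """tokens: (batch, T) long tensor. Backprop is confined to windows of k steps."""
    state = None
    for start in range(0, tokens.size(1) - 1, k):
        chunk = tokens[:, start:start + k + 1]       # overlap by one token
        logits, state = model(chunk[:, :-1], state)  # predict the next tokens
        loss = loss_fn(logits.reshape(-1, logits.size(-1)), chunk[:, 1:].reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        state = state.detach()  # cut the graph: credit-assignment path <= k
```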

  33. What about multiple modalities? “Multi-modal Attention for Neural Machine Translation”, Caglayan, Barrault and Bougares, 2016

  34. What about multiple modalities? “Lip Reading Sentences in the Wild”, Chung et al., 2016

  35. Why stop at characters?
