Competence-based Curriculum Learning for Neural Machine Translation - PowerPoint PPT Presentation





SLIDE 1

Competence-based Curriculum Learning for Neural Machine Translation

Anthony Platanios

e.a.platanios@cs.cmu.edu

Joint work with Otilia Stretcu, Graham Neubig, Barnabas Poczos, and Tom Mitchell

SLIDE 2

Neural Machine Translation (NMT)

  • NMT represents the state of the art for many machine translation systems.
  • NMT benefits from end-to-end training with large amounts of data.
  • Large-scale NMT systems are often hard to train:
    • Transformers rely on a number of heuristics, such as specialized learning rate schedules and large-batch training.

[Popel 2018]


SLIDE 3

Curriculum Learning

[Figure: training examples ordered by difficulty along training time — Easy: "Thank you!"; Medium: "Thank you, for being so patient!"; Hard: "Thank you, for being so patient today and coming to this talk even though you're probably tired!"]


SLIDE 5

Curriculum Learning

Avoid getting stuck in bad local optima early on!

  • [Elman 1993]: Introduced the idea of curriculum learning.
  • [Kocmi 2017, Bojar 2017]: Empirical evaluation on MT. Final performance is hurt.
  • [Zhang 2018]: Data binning strategy with discrete regimes. The results are highly sensitive to several hyperparameters. Improvements in training time, but no improvements in performance!

[Figure: training examples ordered by difficulty along training time, from Easy to Hard.]

SLIDE 6

Our Approach

We introduce two key concepts:

  • Difficulty: Represents the difficulty of a training example and may depend on the current state of the learner (e.g., sentence length).
  • Competence: A value between 0 and 1 that represents the progress of a learner during its training and can depend on the learner's state (e.g., validation set performance).

SLIDE 7

Our Approach

The training examples are ranked according to their difficulty, and the learner is only allowed to use the top portion of them at each point in time.

[Diagram: curriculum learning loop — the data feeds a difficulty scorer and the model state feeds a competence estimator; a sample is passed to the model trainer only if difficulty(sample) ≤ competence(model).]

SLIDES 10–18

Our Approach — Algorithm

  • 1. Compute the difficulty of each training example (e.g., its sentence length).
  • 2. Compute the cumulative distribution function (CDF) of the difficulties, mapping each example to a value in [0, 1] (e.g., a difficulty CDF value of 0.5 covers the 50% shortest sentences):

    Sentence                  Length    Difficulty (CDF)
    Thank you very much!      4         0.01
    My name is ...            6         0.03
    Barack Obama loves ...    13        0.15
    What did she say ...      123       0.95

  • 3. For each training step:
    i. Compute the model competence.
    ii. Sample a data batch uniformly from all examples whose difficulty does not exceed the current competence.
    iii. Invoke the model trainer using the sampled batch.

We are not changing the relative probability of each training example under the input data distribution. We are constraining the domain of that distribution.
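Sketched as code, the loop above amounts to one extra filtering step in the data pipeline. This is a minimal illustration (not the authors' implementation), using sentence length as the difficulty measure and plain Python lists in place of a real input pipeline:

```python
import random

def difficulty_cdf(corpus, difficulty=len):
    """Map each example to the fraction of the corpus that is at
    most as difficult as it is (its difficulty CDF value)."""
    ranked = sorted(corpus, key=difficulty)
    n = len(ranked)
    return {s: (i + 1) / n for i, s in enumerate(ranked)}

def train_with_curriculum(corpus, competence, num_steps, batch_size, train_step):
    """Curriculum training loop: at step t, sample a batch uniformly
    from the examples whose difficulty CDF value does not exceed the
    model's current competence."""
    cdf = difficulty_cdf(corpus)
    for t in range(1, num_steps + 1):
        c = competence(t)
        eligible = [s for s in corpus if cdf[s] <= c]
        batch = random.choices(eligible, k=batch_size)  # uniform, with replacement
        train_step(batch)
```

Because the batch is drawn uniformly from the eligible examples, their relative probabilities under the data distribution are untouched; only the distribution's domain shrinks and grows with the competence.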

SLIDE 19

Our Approach — Algorithm

[Figure: distribution of difficulties with the competence at the current step marked — at step 1000 only the easiest examples lie below it; by step 10000 most do. Batches are sampled uniformly from the (blue) region below the current competence.]

SLIDE 20

Our Approach — Difficulty

We denote our training corpus as a collection of sentences, where each sentence is a sequence of words.

  • Sentence Length:
  • Word Rarity:
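Both measures can be computed in one pass over the corpus. The sketch below is illustrative only; in particular, the word-rarity form (negative sum of log unigram probabilities, so sentences with rare words score as harder) is our reading of the elided formula and should be treated as an assumption:

```python
import math
from collections import Counter

def sentence_length(sentence):
    """Difficulty as the number of words in the sentence."""
    return len(sentence.split())

def unigram_model(corpus):
    """Estimate unigram word probabilities from the training corpus."""
    counts = Counter(w for s in corpus for w in s.split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def word_rarity(sentence, unigram_probs):
    """Difficulty as the negative log-probability of the sentence's
    words under a unigram language model (assumed form)."""
    return -sum(math.log(unigram_probs[w]) for w in sentence.split())
```

Either score is then converted to a value in [0, 1] via the difficulty CDF before being compared against the competence.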
SLIDE 21

Our Approach — Competence

Competence: a value in [0, 1] that represents the progress of a learner during its training; the proportion of training data the learner is allowed to use at each training step.

SLIDE 22

Our Approach — Competence

Competence: a value in [0, 1] that represents the progress of a learner during its training; the proportion of training data the learner is allowed to use at each training step.

Linear Competence

[Figure: linear competence vs. time, rising from the initial competence to 1; the labeled quantities are the initial competence and the time after which the learner is fully competent.]
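Reading the plot, the linear schedule interpolates from an initial competence c0 at step 0 to full competence at step T; a small sketch (the exact functional form is our reconstruction from the figure, not verbatim from the slide):

```python
def linear_competence(t, c0=0.01, T=1000):
    """Linear competence schedule: starts at the initial competence
    c0 and grows linearly until the learner is fully competent
    (competence 1) at step T; clamped at 1 afterwards."""
    return min(1.0, t * (1.0 - c0) / T + c0)
```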

SLIDE 23

Our Approach — Competence

Competence: a value in [0, 1] that represents the progress of a learner during its training; the proportion of training data the learner is allowed to use at each training step.

Learner-Dependent Competence

E.g., validation set performance. Too expensive!

SLIDE 24

Our Approach — Competence

Competence: a value in [0, 1] that represents the progress of a learner during its training; the proportion of training data the learner is allowed to use at each training step.

Root Competence

Keep the rate at which new examples come in inversely proportional to the training data size.
SLIDE 25

Our Approach — Competence

Root Competence

Keep the rate at which new examples come in inversely proportional to the training data size.

[Figure: competence vs. time for c_linear, c_sqrt, and root competence c_r with p = 3, 5, 10.]
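A schedule with this property grows like a root of t. The sketch below generalizes to the p-th root, where p = 2 gives the square-root curve and p = 3, 5, 10 give the c_r curves in the plot; the exact form is our reconstruction from the figure, not verbatim from the slide:

```python
def root_competence(t, c0=0.01, T=1000, p=2):
    """Root-p competence schedule: rises quickly early on and then
    slows down, so the rate at which new examples become available
    stays inversely proportional to the data already in use.
    Endpoints: c(0) ~= c0 and c(T) = 1, clamped at 1 afterwards."""
    return min(1.0, (t * (1.0 - c0 ** p) / T + c0 ** p) ** (1.0 / p))
```

Larger p front-loads the curriculum more aggressively: with p = 10 the learner already sees most of the data well before the halfway point.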
SLIDE 26

Our Approach

The training examples are ranked according to their difficulty, and the learner is only allowed to use the top portion of them at each point in time.

[Diagram: use a sample only if difficulty(sample) ≤ competence(model).]

DIFFICULTY

  • Sentence Length
  • Word Rarity

COMPETENCE

  • Linear
  • Root
SLIDE 27

Experiments — Datasets

Dataset           # Train   # Dev   # Test
IWSLT-15 En→Vi    133k      768     1268
IWSLT-16 Fr→En    224k      1080    1133
WMT-16 En→De      4.5m      3003    2999

SLIDE 28

Experiments — Setup

  • RNN:
    • 2-layer bidirectional LSTM encoder / 2-layer decoder (4 layers for WMT).
    • 512 hidden units per layer and 512-dimensional word embeddings.
  • Transformer:
    • 6-layer encoder/decoder.
    • 2,048 units for the feed-forward layers and 512-dimensional word embeddings.
  • AMSGrad optimizer (similar to Adam) with learning rate 0.001.
  • Label smoothing factor = 0.1.
  • Batch size = 5,120 tokens (i.e., 256 sentences of length 20).
  • Beam width = 10 (using GNMT length normalization).
  • BPE vocabulary with 32,000 merge operations.

SLIDE 29

Experiments — Setup

Initial Competence: All models start training using the 1% easiest training examples.

Curriculum Length: We train the baseline model without using any curriculum, and compute the number of training steps it takes to reach ~90% of its final BLEU score.

SLIDE 30

Experiments — Results

IWSLT15: En → Vi

[Figure: BLEU vs. training step for the RNN (left) and Transformer (right), comparing Plain, SL Linear, SL Sqrt, SR Linear, and SR Sqrt curricula.]

SLIDE 31

Experiments — Results

IWSLT16: Fr → En

[Figure: BLEU vs. training step for the RNN (left) and Transformer (right), comparing Plain, SL Linear, SL Sqrt, SR Linear, and SR Sqrt curricula.]

SLIDE 32

Experiments — Results

WMT16: En → De

[Figure: BLEU vs. training step for the RNN (left) and Transformer (right), comparing Plain, SL Linear, SL Sqrt, SR Linear, and SR Sqrt curricula.]

SLIDE 33

Experiments — Results

Relative Time to Baseline Performance

[Figure: relative time to reach baseline performance, IWSLT15: En → Vi, Transformer.]

SLIDE 34

Experiments — Results

Relative Time to Baseline Performance

[Figure: relative time to reach baseline performance for RNN and Transformer on IWSLT15: En → Vi, IWSLT16: Fr → En, and WMT16: En → De.]

SLIDE 35

Conclusion — Our Approach

We propose a continuous curriculum learning regime (i.e., no binning) that is:

  • Abstract & Extensible: A generalization of multiple existing approaches.
  • Simple: Can be applied to existing NMT systems with only a small modification to their training data pipelines.
  • Automatic: Has no hyperparameters other than the curriculum length.
  • Efficient: Reduces training time by up to 70%, while improving performance by up to 2.2 BLEU.

We also perform experiments on both RNNs and Transformers; prior work has not evaluated curriculum learning applied to Transformers.

SLIDE 36

Thank You!

Questions?

e.a.platanios@cs.cmu.edu