Competence-based Curriculum Learning for Neural Machine Translation - PowerPoint PPT Presentation





SLIDE 1

Competence-based Curriculum Learning for Neural Machine Translation

Anthony Platanios

e.a.platanios@cs.cmu.edu

Joint work with Otilia Stretcu, Graham Neubig, Barnabas Poczos, and Tom Mitchell

SLIDE 2

Neural Machine Translation (NMT)

  • NMT represents the state of the art for many machine translation systems.
  • NMT benefits from end-to-end training with large amounts of data.
  • Large-scale NMT systems are often hard to train:
    • Transformers rely on a number of heuristics, such as specialized learning rate schedules and large-batch training.

[Popel 2018]


SLIDE 3

Curriculum Learning

[Figure: training examples ordered by difficulty along training time — Easy: "Thank you!"; Medium: "Thank you, for being so patient!"; Hard: "Thank you, for being so patient today and coming to this talk even though you're probably tired!"]


SLIDE 5

Curriculum Learning

Avoid getting stuck in bad local optima early on!

  • [Elman 1993]: Introduced the idea of curriculum learning.
  • [Kocmi 2017, Bojar 2017]: Empirical evaluation on MT. Final performance is hurt.
  • [Zhang 2018]: Data binning strategy with discrete regimes. The results are highly sensitive to several hyperparameters. Improvements in training time, but no improvements in performance!

[Figure: training examples ordered by difficulty along training time, from Easy to Hard.]

SLIDE 6

Our Approach

We introduce two key concepts:

  • Difficulty: Represents the difficulty of a training example and may depend on the current state of the learner (e.g., sentence length).
  • Competence: A value between 0 and 1 that represents the progress of a learner during its training and can depend on the learner's state (e.g., validation set performance).

SLIDE 7

Our Approach

The training examples are ranked according to their difficulty, and the learner is only allowed to use the top portion of them at each point in time.

[Diagram: curriculum learning loop — the data feeds a difficulty scorer and the model state feeds a competence estimator; a sample is passed to the model trainer only if difficulty(sample) ≤ competence(model).]

SLIDES 10–18

Our Approach — Algorithm

  • 1. Compute the difficulty of each training example (e.g., its sentence length).
  • 2. Compute the cumulative distribution function (CDF) of the difficulties, mapping each example to a value in [0, 1] (e.g., a difficulty CDF value of 0.5 covers the 50% shortest sentences):

    Sentence                  Length    Difficulty (CDF)
    Thank you very much!      4         0.01
    My name is ...            6         0.03
    Barack Obama loves ...    13        0.15
    What did she say ...      123       0.95

  • 3. For each training step:
    i. Compute the model competence.
    ii. Sample a data batch uniformly from all examples whose difficulty does not exceed the current competence.
    iii. Invoke the model trainer using the sampled batch.

We are not changing the relative probability of each training example under the input data distribution. We are constraining the domain of that distribution.
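Sketched as code, the loop above amounts to one extra filtering step in the data pipeline. This is a minimal illustration (not the authors' implementation), using sentence length as the difficulty measure and plain Python lists in place of a real input pipeline:

```python
import random

def difficulty_cdf(corpus, difficulty=len):
    """Map each example to the fraction of the corpus that is at
    most as difficult as it is (its difficulty CDF value)."""
    ranked = sorted(corpus, key=difficulty)
    n = len(ranked)
    return {s: (i + 1) / n for i, s in enumerate(ranked)}

def train_with_curriculum(corpus, competence, num_steps, batch_size, train_step):
    """Curriculum training loop: at step t, sample a batch uniformly
    from the examples whose difficulty CDF value does not exceed the
    model's current competence."""
    cdf = difficulty_cdf(corpus)
    for t in range(1, num_steps + 1):
        c = competence(t)
        eligible = [s for s in corpus if cdf[s] <= c]
        batch = random.choices(eligible, k=batch_size)  # uniform, with replacement
        train_step(batch)
```

Because the batch is drawn uniformly from the eligible examples, their relative probabilities under the data distribution are untouched; only the distribution's domain shrinks and grows with the competence.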

SLIDE 19

Our Approach — Algorithm

[Figure: distribution of difficulties with the competence at the current step marked — at step 1000 only the easiest examples lie below it; by step 10000 most do. Batches are sampled uniformly from the (blue) region below the current competence.]

SLIDE 20

Our Approach — Difficulty

We denote our training corpus as a collection of sentences, where each sentence is a sequence of words.

  • Sentence Length:
  • Word Rarity:
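Both measures can be computed in one pass over the corpus. The sketch below is illustrative only; in particular, the word-rarity form (negative sum of log unigram probabilities, so sentences with rare words score as harder) is our reading of the elided formula and should be treated as an assumption:

```python
import math
from collections import Counter

def sentence_length(sentence):
    """Difficulty as the number of words in the sentence."""
    return len(sentence.split())

def unigram_model(corpus):
    """Estimate unigram word probabilities from the training corpus."""
    counts = Counter(w for s in corpus for w in s.split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def word_rarity(sentence, unigram_probs):
    """Difficulty as the negative log-probability of the sentence's
    words under a unigram language model (assumed form)."""
    return -sum(math.log(unigram_probs[w]) for w in sentence.split())
```

Either score is then converted to a value in [0, 1] via the difficulty CDF before being compared against the competence.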
SLIDE 21

Our Approach — Competence

Competence: a value in [0, 1] that represents the progress of a learner during its training; the proportion of training data the learner is allowed to use at each training step.

SLIDE 22

Our Approach — Competence

Competence: a value in [0, 1] that represents the progress of a learner during its training; the proportion of training data the learner is allowed to use at each training step.

Linear Competence

[Figure: linear competence vs. time, rising from the initial competence to 1; the labeled quantities are the initial competence and the time after which the learner is fully competent.]
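Reading the plot, the linear schedule interpolates from an initial competence c0 at step 0 to full competence at step T; a small sketch (the exact functional form is our reconstruction from the figure, not verbatim from the slide):

```python
def linear_competence(t, c0=0.01, T=1000):
    """Linear competence schedule: starts at the initial competence
    c0 and grows linearly until the learner is fully competent
    (competence 1) at step T; clamped at 1 afterwards."""
    return min(1.0, t * (1.0 - c0) / T + c0)
```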

SLIDE 23

Our Approach — Competence

Competence: a value in [0, 1] that represents the progress of a learner during its training; the proportion of training data the learner is allowed to use at each training step.

Learner-Dependent Competence

E.g., validation set performance. Too expensive!

SLIDE 24

Our Approach — Competence

Competence: a value in [0, 1] that represents the progress of a learner during its training; the proportion of training data the learner is allowed to use at each training step.

Root Competence

Keep the rate at which new examples come in inversely proportional to the training data size.
SLIDE 25

Our Approach — Competence

Root Competence

Keep the rate at which new examples come in inversely proportional to the training data size.

[Figure: competence vs. time for c_linear, c_sqrt, and root competence c_r with p = 3, 5, 10.]
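A schedule with this property grows like a root of t. The sketch below generalizes to the p-th root, where p = 2 gives the square-root curve and p = 3, 5, 10 give the c_r curves in the plot; the exact form is our reconstruction from the figure, not verbatim from the slide:

```python
def root_competence(t, c0=0.01, T=1000, p=2):
    """Root-p competence schedule: rises quickly early on and then
    slows down, so the rate at which new examples become available
    stays inversely proportional to the data already in use.
    Endpoints: c(0) ~= c0 and c(T) = 1, clamped at 1 afterwards."""
    return min(1.0, (t * (1.0 - c0 ** p) / T + c0 ** p) ** (1.0 / p))
```

Larger p front-loads the curriculum more aggressively: with p = 10 the learner already sees most of the data well before the halfway point.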
SLIDE 26

Our Approach

The training examples are ranked according to their difficulty, and the learner is only allowed to use the top portion of them at each point in time.

[Diagram: use a sample only if difficulty(sample) ≤ competence(model).]

DIFFICULTY

  • Sentence Length
  • Word Rarity

COMPETENCE

  • Linear
  • Root
SLIDE 27

Experiments — Datasets

Dataset           # Train   # Dev   # Test
IWSLT-15 En→Vi    133k      768     1268
IWSLT-16 Fr→En    224k      1080    1133
WMT-16 En→De      4.5m      3003    2999

SLIDE 28

Experiments — Setup

  • RNN:
    • 2-layer bidirectional LSTM encoder / 2-layer decoder (4 layers for WMT).
    • 512 hidden units per layer and 512-dimensional word embeddings.
  • Transformer:
    • 6-layer encoder/decoder.
    • 2,048 units for the feed-forward layers and 512-dimensional word embeddings.
  • AMSGrad optimizer (similar to Adam) with learning rate 0.001.
  • Label smoothing factor = 0.1.
  • Batch size = 5,120 tokens (i.e., 256 sentences of length 20).
  • Beam width = 10 (using GNMT length normalization).
  • BPE vocabulary with 32,000 merge operations.

SLIDE 29

Experiments — Setup

Initial Competence: All models start training using the 1% easiest training examples.

Curriculum Length: We train the baseline model without using any curriculum, and compute the number of training steps it takes to reach ~90% of its final BLEU score.

SLIDE 30

Experiments — Results

IWSLT15: En → Vi

[Figure: BLEU vs. training step for the RNN (left) and Transformer (right), comparing Plain, SL Linear, SL Sqrt, SR Linear, and SR Sqrt curricula.]

SLIDE 31

Experiments — Results

IWSLT16: Fr → En

[Figure: BLEU vs. training step for the RNN (left) and Transformer (right), comparing Plain, SL Linear, SL Sqrt, SR Linear, and SR Sqrt curricula.]

SLIDE 32

Experiments — Results

WMT16: En → De

[Figure: BLEU vs. training step for the RNN (left) and Transformer (right), comparing Plain, SL Linear, SL Sqrt, SR Linear, and SR Sqrt curricula.]

SLIDE 33

Experiments — Results

Relative Time to Baseline Performance

[Figure: relative time to reach baseline performance, IWSLT15: En → Vi, Transformer.]

SLIDE 34

Experiments — Results

Relative Time to Baseline Performance

[Figure: relative time to reach baseline performance for RNN and Transformer on IWSLT15: En → Vi, IWSLT16: Fr → En, and WMT16: En → De.]

SLIDE 35

Conclusion — Our Approach

We propose a continuous curriculum learning regime (i.e., no binning) that is:

  • Abstract & Extensible: A generalization of multiple existing approaches.
  • Simple: Can be applied to existing NMT systems with only a small modification to their training data pipelines.
  • Automatic: Has no hyperparameters other than the curriculum length.
  • Efficient: Reduces training time by up to 70%, while improving performance by up to 2.2 BLEU.

We also perform experiments on both RNNs and Transformers; prior work has not evaluated curriculum learning applied to Transformers.

SLIDE 36

Thank You!

Questions?

e.a.platanios@cs.cmu.edu