Competence-based Curriculum Learning for Neural Machine Translation

  1. Competence-based Curriculum Learning for Neural Machine Translation
     Anthony Platanios (e.a.platanios@cs.cmu.edu)
     Joint work with Otilia Stretcu, Graham Neubig, Barnabas Poczos, and Tom Mitchell

  2. Neural Machine Translation (NMT)
     • NMT represents the state of the art for many machine translation systems.
     • NMT benefits from end-to-end training with large amounts of data.
     • Large-scale NMT systems are often hard to train: Transformers rely on a number of heuristics, such as specialized learning rate schedules and large-batch training [Popel 2018].

  3. Curriculum Learning
     [Figure: training examples ordered from easy to hard along the training timeline.]
     • Easy: "Thank you!"
     • Medium: "Thank you, for being so patient!"
     • Hard: "Thank you, for being so patient today and coming to this talk even though you're probably tired!"

  5. Curriculum Learning
     [Figure: the same easy-to-hard ordering of training examples over training time.]
     Avoid getting stuck in bad local optima early on!
     - [Elman 1993]: Introduced the idea of curriculum learning.
     - [Kocmi 2017, Bojar 2017]: Empirical evaluation on MT. Final performance is hurt.
     - [Zhang 2018]: Data binning strategy with discrete regimes. The results are highly sensitive to several hyperparameters: improvements in training time, but no improvements in performance!

  6. Our Approach
     We introduce two key concepts:
     • Difficulty: Represents the difficulty of a training example and may depend on the current state of the learner (e.g., sentence length).
     • Competence: A value between 0 and 1 that represents the progress of a learner during its training and can depend on the learner's state (e.g., validation set performance).

  7. Our Approach
     [Diagram: curriculum learning loop. The model state determines a competence value; a candidate sample from the data is passed to the model trainer only if difficulty(sample) ≤ competence(model).]
     The training examples are ranked according to their difficulty, and the learner is only allowed to use the top portion of them at time t.

  10. Our Approach — Algorithm
      1. Compute the difficulty d(s_i) for each training example s_i.

  11. Our Approach — Algorithm
      1. Compute the difficulty d(s_i) for each training example s_i.
      2. Compute the cumulative distribution function (CDF) of the difficulties.

  12. Our Approach — Algorithm
      1. Compute the difficulty d(s_i) for each training example s_i.
      2. Compute the cumulative distribution function (CDF) of the difficulties.

      Sentence                 Length    Difficulty (CDF)
      Thank you very much!     4         0.01
      Barack Obama loves ...   13        0.15
      My name is ...           6         0.03
      What did she say ...     123       0.95

  14. Our Approach — Algorithm
      1. Compute the difficulty d(s_i) for each training example s_i.
      2. Compute the cumulative distribution function (CDF) of the difficulties.

      Sentence                 Length    Difficulty (CDF)
      Thank you very much!     4         0.01
      Barack Obama loves ...   13        0.15
      My name is ...           6         0.03
      What did she say ...     123       0.95

      With a competence of 0.5, only examples whose CDF difficulty is at most 0.5 are used: the 50% shortest sentences.
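The length-to-CDF mapping on the slide above can be sketched in a few lines of Python (a toy corpus and illustrative names, not the authors' code):

```python
# Toy corpus: each sentence is a list of tokens.
sentences = [
    "Thank you very much !".split(),
    "Barack Obama loves music and arts .".split(),
    "My name is Anthony .".split(),
    "What did she say about the very long and winding road ?".split(),
]

# 1. Difficulty: here, sentence length in tokens.
difficulties = [len(s) for s in sentences]

# 2. CDF value of a difficulty score: the fraction of training
#    examples that are at most this difficult.
def difficulty_cdf(value, scores):
    return sum(d <= value for d in scores) / len(scores)

cdf_scores = [difficulty_cdf(d, difficulties) for d in difficulties]
```

With a competence of 0.5, only examples whose CDF value is at most 0.5, i.e. the shortest half of this toy corpus, would be eligible for sampling.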

  15. Our Approach — Algorithm
      1. Compute the difficulty d(s_i) for each training example s_i.
      2. Compute the cumulative distribution function (CDF) of the difficulties.
      3. For training step t = 1, ...:
         i. Compute the model competence c(t).

  16. Our Approach — Algorithm
      1. Compute the difficulty d(s_i) for each training example s_i.
      2. Compute the cumulative distribution function (CDF) of the difficulties.
      3. For training step t = 1, ...:
         i. Compute the model competence c(t).
         ii. Sample a data batch uniformly from all examples s_i such that CDF(d(s_i)) ≤ c(t).

  17. Our Approach — Algorithm
      1. Compute the difficulty d(s_i) for each training example s_i.
      2. Compute the cumulative distribution function (CDF) of the difficulties.
      3. For training step t = 1, ...:
         i. Compute the model competence c(t).
         ii. Sample a data batch uniformly from all examples s_i such that CDF(d(s_i)) ≤ c(t).
         iii. Invoke the model trainer using the sampled batch.

  18. Our Approach — Algorithm
      1. Compute the difficulty d(s_i) for each training example s_i.
      2. Compute the cumulative distribution function (CDF) of the difficulties.
      3. For training step t = 1, ...:
         i. Compute the model competence c(t).
         ii. Sample a data batch uniformly from all examples s_i such that CDF(d(s_i)) ≤ c(t).
         iii. Invoke the model trainer using the sampled batch.
      We are not changing the relative probability of each training example under the input data distribution. We are constraining the domain of that distribution.
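The full loop from the algorithm slides can be sketched as below; `curriculum_train` and its arguments are hypothetical names for illustration, not the authors' implementation:

```python
import bisect
import random

def curriculum_train(examples, difficulty_fn, competence_fn,
                     train_step_fn, num_steps, batch_size, rng=random):
    # 1. Compute the difficulty of every training example.
    scores = [difficulty_fn(s) for s in examples]
    # 2. Map each difficulty to its CDF value in (0, 1].
    sorted_scores = sorted(scores)
    n = len(scores)
    cdf = [bisect.bisect_right(sorted_scores, d) / n for d in scores]
    # 3. Training loop.
    for t in range(1, num_steps + 1):
        c = competence_fn(t)  # i. current model competence
        # ii. Sample uniformly from examples with CDF(difficulty) <= c;
        #     the domain shrinks, but relative probabilities inside it
        #     are unchanged.
        eligible = [s for s, f in zip(examples, cdf) if f <= c]
        batch = rng.choices(eligible, k=batch_size)
        train_step_fn(batch)  # iii. one step of the model trainer
```

A usage sketch: with ten integer "examples" whose difficulty is their own value and a competence that grows linearly over ten steps, the first batches contain only the easiest examples and later batches draw from the whole set.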

  19. Our Approach — Algorithm
      [Figure: the difficulty CDF with the competence at the current step marked. At step 1,000, batches are sampled uniformly from the small blue region of easiest examples; by step 10,000, competence has grown and the sampling region covers most of the data.]

  20. Our Approach — Difficulty
      We denote our training corpus as a collection of M sentences, {s_1, ..., s_M}, where each sentence s_i is a sequence of words: s_i = (w_1^i, ..., w_{N_i}^i).
      • Sentence Length: d_length(s_i) = N_i
      • Word Rarity: d_rarity(s_i) = -∑_{k=1}^{N_i} log p(w_k^i), where p denotes the unigram word probabilities estimated from the training corpus.
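A minimal sketch of the two difficulty heuristics, assuming unigram probabilities estimated by simple corpus counts (function names are illustrative):

```python
import math
from collections import Counter

def sentence_length(sentence):
    """d_length(s_i) = N_i: longer sentences are harder."""
    return len(sentence)

def make_word_rarity(corpus):
    """Build d_rarity from corpus unigram statistics: a sentence is
    harder the less likely its words are under the unigram model."""
    counts = Counter(w for sent in corpus for w in sent)
    total = sum(counts.values())
    def word_rarity(sentence):
        # -sum_k log p(w_k): rare words contribute large positive terms.
        return -sum(math.log(counts[w] / total) for w in sentence)
    return word_rarity
```

Note that word rarity is additive over tokens, so it also grows with sentence length; the two heuristics are correlated but not identical.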

  21. Our Approach — Competence
      c(t): a value in [0, 1] that represents the progress of a learner during its training; the proportion of training data the learner is allowed to use at step t.

  22. Our Approach — Competence
      c(t): a value in [0, 1] that represents the progress of a learner during its training; the proportion of training data the learner is allowed to use at step t.
      Linear Competence: c_linear(t) = min(1, t(1 - c_0)/T + c_0),
      where c_0 is the initial competence and T is the time after which the learner is fully competent.
      [Plot: c_linear(t) rising from c_0 to 1.0 by step T = 1,000.]

  23. Our Approach — Competence
      Learner-Dependent Competence: e.g., validation set performance. Too expensive!

  24. Our Approach — Competence
      Root Competence: keep the rate at which new examples come in inversely proportional to the current training data size:
      c_sqrt(t) = min(1, sqrt(t(1 - c_0^2)/T + c_0^2))

  25. Our Approach — Competence
      Root Competence: keep the rate at which new examples come in inversely proportional to the current training data size.
      [Plot: c_linear, c_sqrt, c_root-3, c_root-5, and c_root-10 schedules over 1,000 steps; larger roots grow fastest early in training.]
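A sketch of the general root-p schedule that the plotted c_root-3, c_root-5, and c_root-10 curves suggest, assuming it follows the same pattern as the linear and square-root schedules above (p = 2 recovers c_sqrt):

```python
def root_competence(t, c0=0.01, T=1000, p=2):
    """Root-p schedule: new examples are added fastest early in
    training, so the inflow of unseen data stays roughly inversely
    proportional to the amount of data the learner already uses."""
    return min(1.0, (t * (1.0 - c0 ** p) / T + c0 ** p) ** (1.0 / p))
```

Larger p front-loads more data: at the same step, c_root-10 is well above c_sqrt, which in turn is above the linear schedule.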

  26. Our Approach
      [Diagram: the curriculum learning loop instantiated with the proposed choices. Difficulty: sentence length or word rarity. Competence: linear or root. A sample is used only if difficulty(sample) ≤ competence(model).]
      The training examples are ranked according to their difficulty, and the learner is only allowed to use the top portion of them at time t.

  27. Experiments — Datasets
      Dataset          # Train   # Dev   # Test
      IWSLT-15 En→Vi   133k      768     1268
      IWSLT-16 Fr→En   224k      1080    1133
      WMT-16 En→De     4.5M      3003    2999

  28. Experiments — Setup
      ‣ RNN:
        - 2-layer bidirectional LSTM encoder / 2-layer decoder (4 layers for WMT).
        - 512 hidden units per layer and word embedding size.
      ‣ Transformer:
        - 6-layer encoder/decoder.
        - 2,048 units for the feed-forward layers and 512 word embedding size.
      ‣ AMSGrad optimizer (similar to Adam) with learning rate 0.001.
      ‣ Label smoothing factor = 0.1.
      ‣ Batch size = 5,120 tokens (i.e., 256 sentences of length 20).
      ‣ Beam width = 10 (using GNMT length normalization).
      ‣ BPE vocabulary with 32,000 merge operations.
