IN5550 – Neural Methods in Natural Language Processing: Ensembles, transfer and multi-task learning


SLIDE 1

IN5550 – Neural Methods in Natural Language Processing: Ensembles, transfer and multi-task learning

Erik Velldal

University of Oslo

31 March 2020

SLIDE 2

This session

◮ No new bricks.
◮ Taking what we already have, putting it together in new ways.
◮ Ensemble learning

◮ Training several models to do one task.

◮ Multi-task learning

◮ Training one model to do several tasks.

◮ Transfer learning

◮ Training a model for a new task based on a model for some other task.

SLIDE 3

Standard approach to model selection

◮ Train a bunch of models.
◮ Keep the model with the best performance on the development set.
◮ Discard the rest.
◮ Some issues:
◮ Best on dev. is not necessarily best on held-out data.
◮ ANNs generally have low bias and high variance; they can be unstable and are prone to overfitting.
◮ Models might have non-overlapping errors.
◮ Ensemble methods may help.

SLIDE 4

Ensemble learning

◮ Combine multiple models to obtain better performance than any of the individual base models alone.
◮ The various base models in the ensemble could be based on the same or different learning algorithms.
◮ Several meta-heuristics are available for how to create the base models and how to combine their predictions. E.g.:

◮ Boosting
◮ Bagging
◮ Stacking

SLIDE 5

Examples of ensembling

Boosting
◮ The base learners are generated sequentially:
◮ Incrementally build the ensemble by training each new model to emphasize training instances that previous models misclassified.
◮ Combine predictions through a weighted majority vote (classification) or a weighted average (regression).

Bagging (Bootstrap AGGregating)
◮ The base learners are generated independently:
◮ Create multiple versions of the training data by sampling with replacement, training a separate model on each.
◮ Combine (‘aggregate’) predictions by voting or averaging.
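The bagging recipe can be sketched in a few lines of plain Python; the base models below are hypothetical stand-ins for classifiers that would each be trained on their own bootstrap replicate:

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """One bootstrap replicate: len(data) items drawn with replacement."""
    return [rng.choice(data) for _ in data]

def bagging_predict(models, x):
    """Aggregate the base models' predictions by majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

# Hypothetical base models; in practice each would be trained on its
# own bootstrap_sample() of the training data.
models = [lambda x: x > 2, lambda x: x > 3, lambda x: x > 4]
print(bagging_predict(models, 5))  # all three base models vote True
```

For regression, `Counter`-based voting would be replaced by averaging the base models' outputs.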

SLIDE 6

Examples of ensembling

Stacking
◮ Train several base-level models on the complete training set,
◮ then train a meta-model with the base models’ predictions as features.
◮ Often used with heterogeneous ensembles.

Drawbacks of ensembling
◮ ANNs are often applied in ensembles to squeeze out some extra F1 points.
◮ But their high leaderboard ranks come at a high computational cost:
◮ Must train, store, and apply several separate models.
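A minimal stacking sketch in plain Python: the base models' outputs become the meta-model's input features. The base scorers and the "trained" meta-model here are hypothetical placeholders:

```python
def stack_features(base_models, x):
    """Level-0 outputs become level-1 (meta-model) features."""
    return [m(x) for m in base_models]

def stacked_predict(base_models, meta_model, x):
    """The meta-model combines the base models' predictions."""
    return meta_model(stack_features(base_models, x))

# Hypothetical heterogeneous base models returning scores in [0, 1].
base = [lambda x: 0.9 if x > 2 else 0.1,
        lambda x: 0.8 if x > 4 else 0.3]

# Stand-in for a trained meta-model: a weighted vote over the scores.
meta = lambda feats: 0.7 * feats[0] + 0.3 * feats[1] > 0.5
```

In a real system the meta-model would itself be trained (typically on held-out predictions of the base models, to avoid leaking training-set fit into the meta-features).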

SLIDE 7

Distillation

◮ High-accuracy/F1 models tend to have a high number of parameters.
◮ Often too inefficient to deploy in real systems.
◮ Knowledge distillation is a technique for reducing model complexity while retaining much of the performance.
◮ Idea: train a (smaller) student model to mimic the behaviour of a (larger) teacher model.
◮ The student is typically trained using the output probabilities of the teacher as soft labels.
◮ Can be used to distill an ensemble into a single model.
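The soft-label idea can be made concrete with a small sketch (plain Python; the temperature value and logits are illustrative assumptions): the student minimizes cross-entropy against the teacher's softened output distribution rather than hard one-hot labels.

```python
import math

def softmax(logits, T=1.0):
    """Softmax with temperature T; T > 1 softens the distribution."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """Cross-entropy of the student against the teacher's soft labels."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    return -sum(p * math.log(q) for p, q in zip(p_teacher, p_student))
```

A student that matches the teacher exactly minimizes this loss; the softened distribution also conveys how the teacher ranks the wrong classes, which hard labels discard.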

SLIDE 8

ML as a one-trick pony

◮ Standard single-task models: (figure)
◮ Ensembles: (figure)

SLIDE 9

Enter multi-task learning

◮ Train one model to solve multiple tasks.
◮ Each task has its own loss function, but the model weights are (partly) shared.
◮ Examples for the different labels can be distinct (the tasks take turns picking examples) or the same.
◮ Most useful for closely related tasks.
◮ Example: PoS-tagging and syntactic chunking.
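A structural sketch of the shared-weights idea (plain Python; the scalar "layers" and squared-error losses are toy stand-ins for real networks):

```python
class MultiTaskModel:
    """One shared encoder, one output head per task."""
    def __init__(self):
        self.w_shared = 1.0   # encoder weight, updated by both tasks
        self.w_tag = 1.0      # task-specific head (e.g. PoS tagging)
        self.w_chunk = 1.0    # task-specific head (e.g. chunking)

    def forward(self, x, task):
        h = self.w_shared * x              # shared representation
        head = self.w_tag if task == "tag" else self.w_chunk
        return head * h

def joint_loss(model, x, y_tag, y_chunk):
    """Each task has its own loss; their sum is optimized, so gradients
    from both tasks flow into the shared encoder weight."""
    loss_tag = (model.forward(x, "tag") - y_tag) ** 2
    loss_chunk = (model.forward(x, "chunk") - y_chunk) ** 2
    return loss_tag + loss_chunk
```

When examples for the tasks are distinct, training instead alternates: each batch comes from one task and only that task's loss (plus the shared weights) is updated.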

SLIDE 10

Standard single-model approach

SLIDE 11

Multi-task approach

◮ Often one task will be considered the main task, and the others so-called supporting or auxiliary tasks.

SLIDE 12

Hierarchical / cascading multi-task learning

◮ Observation: while relying on similar underlying information, tagging intuitively seems more low-level than chunking.
◮ Cascading architecture with selective sharing of parameters:
◮ Note that the units of classification for the main and aux. tasks can be different, e.g. sentence- vs. word-level.
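The cascade can be sketched structurally (plain Python; the rule-based "layers" are hypothetical stand-ins for trained network layers): the lower layer is supervised on the aux task, and its output feeds the layer solving the main task.

```python
def tagger_layer(tokens):
    """Low-level aux task: predict one tag per token (toy rule)."""
    return ["NOUN" if t.istitle() else "OTHER" for t in tokens]

def chunker_layer(tokens, tags):
    """Higher-level main task consumes the lower layer's predictions."""
    return ["B-NP" if tag == "NOUN" else "O" for tag in tags]

def cascade(tokens):
    tags = tagger_layer(tokens)            # supervised at a lower layer
    chunks = chunker_layer(tokens, tags)   # supervised at a higher layer
    return tags, chunks
```

In a neural cascade the tag supervision is attached to a lower hidden layer and the chunk supervision to a higher one, so only the lower layers are shared (selective sharing).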

SLIDE 13

Transfer learning

◮ Learn a model M1 for task A, and re-use (parts of) M1 in another model M2 to be (re-)trained for task B.
◮ Example: transfer learning with tagging as the source task and chunking as the target (destination) task.
◮ Can you think of any examples of transfer learning we’ve seen so far?
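Parameter re-use can be sketched as follows (plain Python; the layer lists and initialization scheme are illustrative assumptions): copy the lower layers of M1 into M2 and initialize the rest fresh for task B.

```python
import random

def transfer_init(source_layers, n_reused, n_target_layers, rng):
    """Copy the first n_reused layers from the source model (task A);
    initialize the remaining target-model layers (task B) randomly."""
    width = len(source_layers[0])
    reused = [list(w) for w in source_layers[:n_reused]]
    fresh = [[rng.gauss(0.0, 0.1) for _ in range(width)]
             for _ in range(n_target_layers - n_reused)]
    return reused + fresh

# Weights "trained" on the source task (illustrative values).
source = [[0.5, -0.2], [1.0, 0.3], [0.1, 0.1]]
target = transfer_init(source, n_reused=2, n_target_layers=3,
                       rng=random.Random(0))
```

The transferred layers can then be fine-tuned on task B or kept frozen, as discussed under regularization below.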

SLIDE 14

Related notions

◮ Self-supervised learning:
◮ Making use of unlabeled data while learning in a supervised manner.
◮ E.g. word embeddings, trained by predicting words in context.
◮ Pretrained LMs are the most widely used instance of transfer in NLP.
◮ Transfer is sometimes applied for domain adaptation:
◮ Same task but different domains or genres.
◮ Can also be used as part of distillation.

SLIDE 15

TL/MTL and regularization

◮ MTL can be seen as a regularizer in its own right; it keeps the weights from specializing too much to just one task.
◮ With transfer, on the other hand, there is often a risk of unlearning too much of the pre-trained information:
◮ ‘Catastrophic forgetting’ (McCloskey & Cohen, 1989; Ratcliff, 1990).
◮ May need to introduce regularization for the transferred layers.
◮ Extreme case: frozen weights (infinite regularization).
◮ Not unusual to only re-train selected parameters / higher layers.
◮ Other strategies: gradual unfreezing, reduced or layer-specific learning rates (in addition to early stopping, dropout, L2, etc.).
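One of the strategies above, gradual unfreezing, amounts to a simple schedule (plain Python sketch; the one-layer-per-epoch rate is an assumption, not a fixed rule):

```python
def unfrozen_layers(n_layers, epoch):
    """Gradual unfreezing: train only the top layer at epoch 0, then
    unfreeze one additional (lower) layer per epoch.
    Layer 0 = embeddings, layer n_layers - 1 = top layer."""
    n_unfrozen = min(epoch + 1, n_layers)
    return list(range(n_layers - n_unfrozen, n_layers))

# Frozen layers receive no updates -- the 'infinite regularization'
# extreme; unfrozen layers could also get layer-specific learning rates.
```

The intuition: the lower, more general transferred layers are exposed to task-B gradients last, limiting how much pre-trained information can be unlearned early in fine-tuning.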

SLIDE 16

When is TL/MTL most useful?

◮ When low-level features learned for task A could be helpful for learning task B.
◮ When you have limited labeled data for your main/target task and want to tap into a larger dataset for some other related aux/source task.

SLIDE 17

TL/MTL in NLP

◮ TL/MTL is particularly well-suited for neural models:
◮ Representation learners! With a modular design.
◮ Intuitively very well-suited for NLP too:
◮ Due to the complexity of the overall task of NLP (understanding language), it has been split up into innumerable sub-tasks.
◮ We typically have rather small labeled data sets, but closely related tasks.
◮ We’ve unfortunately not seen huge boosts (unlike, e.g., in computer vision).
◮ But TL/MTL is still a very active area of research.
◮ Most promising so far: transfer of pre-trained word or sentence embeddings as input representations.
◮ Lots of current research on the representational transferability of different encoding architectures and objectives.

SLIDE 18

Next:

◮ More about transfer as pre-training
◮ Contextual word embeddings
◮ Universal sentence embeddings
