Parameter-Efficient Transfer Learning for NLP (PowerPoint presentation)


SLIDE 1

Parameter-Efficient Transfer Learning for NLP

N. Houlsby, A. Giurgiu*, S. Jastrzębski*, B. Morrone, Q. de Laroussilhe, A. Gesmundo, M. Attariyan, S. Gelly
SLIDE 2

Imagine doing Transfer Learning for NLP

Ingredients:

  • A large pretrained model (BERT)
  • Fine-tuning

2/5

SLIDE 3

Imagine doing Transfer Learning for NLP

[Diagram: a separate fine-tuned copy of BERT for each task (BERT + Task 1, BERT + Task 2, ..., BERT + Task N). Problem for large N.]

2/5

SLIDE 4

Imagine doing Transfer Learning for NLP

[Diagram: a single shared BERT with a small adapter added per task (Task 1 + Adapter 1, Task 2 + Adapter 2, ..., Task N + Adapter N).]

2/5
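The storage saving behind the per-task adapter idea can be made concrete with back-of-the-envelope arithmetic. The sizes below are illustrative assumptions (roughly BERT-base scale), not the paper's exact numbers:

```python
# Illustrative parameter accounting for N tasks.
# Hypothetical sizes, not the paper's exact figures.
base = 110_000_000            # roughly BERT-base-sized pretrained model
adapter_per_task = 3_000_000  # small adapter modules added for one task

def stored_params(n_tasks, full_finetune):
    """Total parameters that must be stored to serve n_tasks tasks."""
    if full_finetune:
        # Naive fine-tuning: one full model copy per task.
        return n_tasks * base
    # Adapters: one shared frozen model plus tiny per-task additions.
    return base + n_tasks * adapter_per_task

n = 20
print(stored_params(n, full_finetune=True))   # 2_200_000_000
print(stored_params(n, full_finetune=False))  # 170_000_000
```

With 20 tasks the full-fine-tuning cost grows linearly in full model copies, while the adapter cost is dominated by the single shared model, which is the "Problem for large N" the previous slide points at.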

SLIDE 5

Solution

BERT + Adapters

  • Solution: Train tiny adapter modules at each layer

3/5
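"Train tiny adapter modules at each layer" implies that the pretrained weights stay frozen and only the small inserted modules (and typically the layer norms and the task head) receive gradient updates. A minimal sketch of that split, with a hypothetical 12-layer model:

```python
# Sketch of which parameter groups are updated when adapters are used.
# The layer structure here is hypothetical; in practice one would set
# requires_grad=False on the frozen groups in a deep-learning framework.
params = []
for layer in range(12):                       # e.g. 12 transformer layers
    params.append((f"layer{layer}.attention", "frozen"))     # pretrained
    params.append((f"layer{layer}.ffn", "frozen"))           # pretrained
    params.append((f"layer{layer}.adapter", "trainable"))    # tiny, new
params.append(("task_head", "trainable"))     # small per-task classifier

trainable = [name for name, state in params if state == "trainable"]
print(len(trainable))  # 13: one adapter per layer plus the task head
```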


SLIDE 8

BERT + Adapters

[Diagram label: "Bottleneck" (each adapter is a bottleneck module).]

3/5
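The bottleneck structure on the slide can be sketched in a few lines: the adapter projects the d-dimensional hidden state down to a small bottleneck, applies a nonlinearity, projects back up, and adds a skip connection. All sizes and the near-zero initialization below are illustrative assumptions; with near-zero weights the module starts out close to the identity, so inserting it does not disturb the pretrained network.

```python
# Sketch of a bottleneck adapter (hypothetical sizes, plain Python).
import math
import random

random.seed(0)

def linear(x, w, b):
    """y = x @ w + b for a single vector x."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) + b[j]
            for j in range(len(b))]

def adapter(x, w_down, b_down, w_up, b_up):
    """Bottleneck adapter: x + up(gelu(down(x)))."""
    h = linear(x, w_down, b_down)                 # d -> m (bottleneck)
    h = [hi * 0.5 * (1 + math.erf(hi / math.sqrt(2))) for hi in h]  # GELU
    u = linear(h, w_up, b_up)                     # m -> d
    return [xi + ui for xi, ui in zip(x, u)]      # skip connection

d, m = 8, 2  # hidden size d, bottleneck size m; m << d => few parameters
w_down = [[random.gauss(0, 0.01) for _ in range(m)] for _ in range(d)]
b_down = [0.0] * m
w_up = [[random.gauss(0, 0.01) for _ in range(d)] for _ in range(m)]
b_up = [0.0] * d

x = [random.gauss(0, 1) for _ in range(d)]
y = adapter(x, w_down, b_down, w_up, b_up)
# Near-zero initialization keeps the adapter close to the identity:
print(max(abs(a - b) for a, b in zip(x, y)) < 1e-2)  # True
```

The parameter count per adapter is roughly 2·d·m, which is why a small m keeps the per-task additions tiny relative to the d·d matrices of the host model.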

SLIDE 9

Results on GLUE Benchmark

4/5


SLIDE 13

Results on GLUE Benchmark

[Plot annotations: "Fewer parameters, degraded performance" vs. "Fewer parameters, similar performance" (adapters).]

4/5

SLIDE 14

Results on GLUE Benchmark

  • 0.4% accuracy drop for a 96.4% reduction in the # of parameters/task

4/5

SLIDE 15

Conclusions

  • 1. If we move towards a single-model future, we need to improve the parameter-efficiency of transfer learning
  • 2. We propose a module that drastically reduces the # of params/task for NLP, e.g. by 30x at only a 0.4% accuracy drop

Related work (@ ICML): “BERT and PALs: Projected Attention Layers for Efficient Adaptation in Multi-Task Learning”, A. Stickland & I. Murray

Please come to our poster today at 6:30 PM (#102)

5/5