IN5550: Neural Methods in Natural Language Processing – Transformers


SLIDE 1

IN5550 – Neural Methods in Natural Language Processing
Transformers

Jeremy Barnes

University of Oslo

March 31, 2020

SLIDE 2

Attention - tl;dr

Pay attention to a weighted combination of input states to generate the right output state

SLIDE 3

Attention

RNNs + attention work great, but are inefficient
◮ the computation cannot be parallelized across time steps
◮ this leads to long training times and smaller models
◮ To enjoy the benefits of deep learning, our models need to be truly deep!

SLIDE 4

Desiderata for a new kind of model

  • 1. Reduce the total computational complexity per layer
  • 2. Increase the amount of computation that can be parallelized
  • 3. Ensure that the model can efficiently learn long-range dependencies

SLIDE 5

Self-attention

John Lennon, 1967: love is all you need
Vaswani et al., 2017: attention is all you need

SLIDE 6

Self-attention

Main principle: instead of a target paying attention to different parts of the source, make the source pay attention to itself.

SLIDE 7

Self-attention

[Figure: the sentence "I am not so happy ." attending to itself, with one copy serving as Keys (K) and the other as Values (V)]

◮ By making parts of a sentence pay attention to other parts of itself, we get better representations
◮ This can be an RNN replacement
◮ Where an RNN carries long-term information down a chain, self-attention acts more like a tree

SLIDE 8

Transformer

Remember, for attention:
K = key vector
V = value vector

SLIDE 9

Transformer

For Transformers, we will add another:
Q = query vector
K = key vector
V = value vector
We will see what the difference is in a minute.

SLIDE 10

Transformer

SLIDE 11

Transformer attention

SLIDE 12

Scaled dot product attention

Remember, we have two ways of doing attention

  • 1. easy, fast way: dot product attention
  • 2. parameterized: a 1-layer feed-forward network to determine the attention weights

But can't we get the benefits of the 2nd without the extra parameters?

hypothesis: with large vector dimensions, the dot products grow large, and the gradients become small (squashing with sigmoid/tanh)

solution: Scaled dot-product attention, i.e. make sure the dot product doesn't get too big

SLIDE 13

Scaled dot product attention

The important bit (the maths):

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right) V$$
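A minimal sketch of this formula in PyTorch (illustrative only: the function name, tensor shapes, and the optional mask are assumptions, not the lecture's code):

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Q, K, V: (..., seq_len, d_k); shapes assumed for illustration
    d_k = Q.size(-1)
    # QK^T: relevance strength between every query and every key
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    alphas = F.softmax(scores, dim=-1)   # the attention weights
    return alphas @ V                    # weighted combination of the value vectors
```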

SLIDE 14

Transformer

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right) V$$

What's happening at a token level:
◮ Obtain three representations of the input, Q, K and V - query, key and value
◮ Obtain a set of relevance strengths: $QK^{T}$. For words i and j, $Q_i \cdot K_j$ represents the strength of the association - exactly like in seq2seq attention.
◮ Scale it (stabler gradients, boring maths) and softmax for the $\alpha$s.
◮ Unlike seq2seq, use different 'value' vectors to weight.

In a sense, this is exactly like seq2seq attention, except: a) non-recurrent representations, b) same source/target, c) different value vectors

SLIDE 15

Intuition behind Query, Key, and Value vectors

SLIDE 16

Multi-head attention

SLIDE 17

Adding heads

Revolutionary idea: if representations learn so much from attention, why not learn many attentions?
Multi-headed attention is many self-attentions in parallel (see the sketch below).
[Figure: a (simplified) transformer block]
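A minimal sketch of multi-head attention, reusing the scaled_dot_product_attention function sketched earlier (d_model, the number of heads, and the projection layout are assumptions):

```python
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_k = n_heads, d_model // n_heads
        # separate learned projections for queries, keys and values
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, x):                      # x: (batch, seq_len, d_model)
        B, T, _ = x.shape
        # project, then split the model dimension into n_heads smaller heads
        def split(t):
            return t.view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        Q, K, V = split(self.W_q(x)), split(self.W_k(x)), split(self.W_v(x))
        heads = scaled_dot_product_attention(Q, K, V)  # one attention per head
        heads = heads.transpose(1, 2).contiguous().view(B, T, -1)
        return self.W_o(heads)                 # concatenate heads and mix them
```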

SLIDE 18

Point-wise feed-forward layers

SLIDE 19

Point-wise feed-forward layers

◮ Add 2-layer feed-forward blocks after the attention layers (sketched below)
◮ Same across all positions, but different across layers
◮ Again a trade-off to increase model complexity while keeping computation costs down
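A minimal sketch of such a position-wise feed-forward block (the hidden size of 2048 follows the original paper; the rest is illustrative):

```python
import torch.nn as nn

class PositionwiseFeedForward(nn.Module):
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        # the same two linear layers are applied independently at every position
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):     # x: (batch, seq_len, d_model)
        return self.net(x)    # nn.Linear only acts on the last dimension
```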

SLIDE 20

Position embeddings

SLIDE 21

Position embeddings

But wait, now we have lost our sequence information :(
◮ Use an encoding that gives this information
◮ Mix of sine and cosine functions (sketched below)
◮ How would this work?
◮ Why do they need both?

In the end, learning positional embeddings is often better
◮ But it has a very large disadvantage
◮ No way to represent sequences longer than those seen in training
◮ You have to chop your data off at an arbitrary length
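A minimal sketch of the sine/cosine encoding from Vaswani et al. (2017); the function name and the assumption of an even d_model are mine:

```python
import math
import torch

def sinusoidal_position_encoding(max_len, d_model):
    # even dimensions use sin, odd dimensions use cos, with wavelengths
    # growing geometrically from 2*pi up to 10000 * 2*pi
    positions = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float)
                    * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(positions * div)
    pe[:, 1::2] = torch.cos(positions * div)
    return pe   # added to the token embeddings before the first layer
```

Because this encoding is a fixed function of the position, it can be evaluated for positions never seen during training, which is exactly what learned position embeddings cannot do.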

SLIDE 22

Depth

To get the benefits of deep learning, we need depth.
◮ Let's make it deep (illustrated below):
◮ encoder: 6 layers
◮ decoder: 6 layers
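As an illustration, the 6+6 stack can be built from PyTorch's built-in modules (hyperparameters follow the base model of Vaswani et al.; this is not the lecture's code):

```python
import torch.nn as nn

encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=2048)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)

decoder_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8, dim_feedforward=2048)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)
```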

SLIDE 23

Transformer

SLIDE 24

Transformer

◮ Can be complicated to train
◮ Has its own Adam setup (learning rate proportional to step^{-0.5}; sketched below)
◮ dropout added just before the residual connections
◮ label smoothing
◮ length penalties added during decoding
◮ checkpoint averaging
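A minimal sketch of that learning-rate schedule: it warms up linearly and then decays proportionally to step^{-0.5} (the function name is mine; the 4000-step warm-up and the d_model scaling follow the paper):

```python
def transformer_lr(step, d_model=512, warmup=4000):
    # linear warm-up for `warmup` steps, then decay proportional to step ** -0.5
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```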

SLIDE 25

Transformer

◮ Have a look at The Annotated Transformer
◮ http://nlp.seas.harvard.edu/2018/04/03/attention.html

SLIDE 26

Evolution of Transformer-based Models

SLIDE 27

Evolution of Transformer-based Models

SLIDE 28

Evolution of Transformer-based Models

SLIDE 29

Carbon footprint

*from Strubell et al. (2019), Energy and Policy Considerations for Deep Learning in NLP

SLIDE 30

What can we do to avoid wasting resources?

SLIDE 31

Sharing is caring

To avoid retraining lots of models:
◮ We can share the trained models
◮ The Nordic Language Processing Laboratory (NLPL) is a good example
◮ But it's important to get things right
◮ METADATA!!!
◮ same format for all models

SLIDE 32

Reduce model size?

What if we can reduce the size of these giant models?
◮ Often, overparameterized transformer models lead to better performance, even with less data
◮ Lottery-ticket hypothesis: for large enough models, there is a small chance that random initialization will lead to a submodel that already has good weights for the task
◮ But interestingly, you can often remove a large number of the parameters for only a small decrease in performance (sketched below)
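As an illustration of that last point, a minimal sketch of magnitude-based weight pruning, one simple pruning criterion among many (not necessarily the method the lecture has in mind):

```python
import torch

def magnitude_prune(model, fraction=0.5):
    # zero out the `fraction` of entries with the smallest absolute value
    # in every weight matrix; biases and other 1-D parameters are left untouched
    with torch.no_grad():
        for param in model.parameters():
            if param.dim() < 2:
                continue
            k = int(param.numel() * fraction)
            if k == 0:
                continue
            threshold = param.abs().flatten().kthvalue(k).values
            param[param.abs() <= threshold] = 0.0
    return model
```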

SLIDE 33

Reduce model size?

SLIDE 34

Model distillation

[Figure: a Pretrained Teacher Model and a smaller Student Model, both processing the input "This is an example."; the student learns to reproduce the teacher's predictions]
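A minimal sketch of the knowledge-distillation loss behind this picture: the student is trained to match the teacher's softened output distribution as well as the gold labels (the temperature, the weighting, and the names are assumptions):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # soft targets: make the student match the teacher's softened distribution
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    # hard targets: ordinary cross-entropy against the gold labels
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```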

SLIDE 35

Head pruning

[Figure: attention heads are pruned from a pretrained model; performance goes from 93.7 to 93.7 and then only drops to 93.0 as more heads are removed]

SLIDE 36

But how can we NLPers contribute to sustainability?

◮ When possible, use pre-trained models (example below)
◮ If you train a strong model, similarly make it available to the community
◮ Try to reduce the amount of hyperparameter tuning we do (for example, by working with models that are more robust to hyperparameters)
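For instance, loading a shared pre-trained model instead of training one from scratch (this uses the Hugging Face transformers library; the model name is just an example):

```python
from transformers import AutoModel, AutoTokenizer

# download (or load from cache) a model that someone else already paid to train
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")

inputs = tokenizer("Dette er et eksempel.", return_tensors="pt")
outputs = model(**inputs)   # contextual representations, ready for fine-tuning
```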
