IN5550: Neural Methods in Natural Language Processing – Transformers


SLIDE 1

IN5550 – Neural Methods in Natural Language Processing
Transformers

Jeremy Barnes

University of Oslo

March 31, 2020

SLIDE 2

Attention - tl;dr

Pay attention to a weighted combination of input states to generate the right output state

SLIDE 3

Attention

RNNs + attention work great, but are inefficient
◮ the computation cannot be parallelized across time steps
◮ this leads to long training times and smaller models
◮ To enjoy the benefits of deep learning, our models need to be truly deep!

SLIDE 4

Desiderata for a new kind of model

  • 1. Reduce the total computational complexity per layer
  • 2. Increase the amount of computation that can be parallelized
  • 3. Ensure that the model can efficiently learn long-range dependencies

SLIDE 5

Self-attention

John Lennon, 1967: love is all you need
Vaswani et al., 2017: attention is all you need

SLIDE 6

Self-attention

Main principle: instead of a target paying attention to different parts of the source, make the source pay attention to itself.

SLIDE 7

Self-attention

[Figure: the sentence "I am not so happy ." attending to itself, with one copy serving as Keys (K) and the other as Values (V)]

◮ By making parts of a sentence pay attention to other parts of itself, we get better representations
◮ This can be an RNN replacement
◮ Where an RNN carries long-term information down a chain, self-attention acts more like a tree

SLIDE 8

Transformer

Remember, for attention:
K = key vector
V = value vector

SLIDE 9

Transformer

For Transformers, we will add another:
Q = query vector
K = key vector
V = value vector
We will see what the difference is in a minute.

SLIDE 10

Transformer

SLIDE 11

Transformer attention

SLIDE 12

Scaled dot product attention

Remember, we have two ways of doing attention

  • 1. easy, fast way: dot product attention
  • 2. parameterized: a 1-layer feed-forward network to determine the attention weights

But can't we get the benefits of the 2nd without the extra parameters?

hypothesis: with large vector dimensions, the dot products grow large, and the gradients become small (squashing with sigmoid/tanh)

solution: Scaled dot-product attention, i.e. make sure the dot product doesn't get too big

SLIDE 13

Scaled dot product attention

The important bit (the maths):

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right) V$$
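A minimal sketch of this formula in PyTorch (illustrative only: the function name, tensor shapes, and the optional mask are assumptions, not the lecture's code):

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Q, K, V: (..., seq_len, d_k); shapes assumed for illustration
    d_k = Q.size(-1)
    # QK^T: relevance strength between every query and every key
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    alphas = F.softmax(scores, dim=-1)   # the attention weights
    return alphas @ V                    # weighted combination of the value vectors
```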

SLIDE 14

Transformer

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right) V$$

What's happening at a token level:
◮ Obtain three representations of the input, Q, K and V - query, key and value
◮ Obtain a set of relevance strengths: $QK^{T}$. For words i and j, $Q_i \cdot K_j$ represents the strength of the association - exactly like in seq2seq attention.
◮ Scale it (stabler gradients, boring maths) and softmax for the $\alpha$s.
◮ Unlike seq2seq, use different 'value' vectors to weight.

In a sense, this is exactly like seq2seq attention, except: a) non-recurrent representations, b) same source/target, c) different value vectors

SLIDE 15

Intuition behind Query, Key, and Value vectors

SLIDE 16

Multi-head attention

SLIDE 17

Adding heads

Revolutionary idea: if representations learn so much from attention, why not learn many attentions?
Multi-headed attention is many self-attentions in parallel (see the sketch below).
[Figure: a (simplified) transformer block]
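A minimal sketch of multi-head attention, reusing the scaled_dot_product_attention function sketched earlier (d_model, the number of heads, and the projection layout are assumptions):

```python
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_k = n_heads, d_model // n_heads
        # separate learned projections for queries, keys and values
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, x):                      # x: (batch, seq_len, d_model)
        B, T, _ = x.shape
        # project, then split the model dimension into n_heads smaller heads
        def split(t):
            return t.view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        Q, K, V = split(self.W_q(x)), split(self.W_k(x)), split(self.W_v(x))
        heads = scaled_dot_product_attention(Q, K, V)  # one attention per head
        heads = heads.transpose(1, 2).contiguous().view(B, T, -1)
        return self.W_o(heads)                 # concatenate heads and mix them
```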

SLIDE 18

Point-wise feed-forward layers

SLIDE 19

Point-wise feed-forward layers

◮ Add 2-layer feed-forward blocks after the attention layers (sketched below)
◮ Same across all positions, but different across layers
◮ Again a trade-off to increase model complexity while keeping computation costs down
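A minimal sketch of such a position-wise feed-forward block (the hidden size of 2048 follows the original paper; the rest is illustrative):

```python
import torch.nn as nn

class PositionwiseFeedForward(nn.Module):
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        # the same two linear layers are applied independently at every position
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):     # x: (batch, seq_len, d_model)
        return self.net(x)    # nn.Linear only acts on the last dimension
```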

SLIDE 20

Position embeddings

SLIDE 21

Position embeddings

But wait, now we have lost our sequence information :(
◮ Use an encoding that gives this information
◮ Mix of sine and cosine functions (sketched below)
◮ How would this work?
◮ Why do they need both?

In the end, learning positional embeddings is often better
◮ But it has a very large disadvantage
◮ No way to represent sequences longer than those seen in training
◮ You have to chop your data off at an arbitrary length
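A minimal sketch of the sine/cosine encoding from Vaswani et al. (2017); the function name and the assumption of an even d_model are mine:

```python
import math
import torch

def sinusoidal_position_encoding(max_len, d_model):
    # even dimensions use sin, odd dimensions use cos, with wavelengths
    # growing geometrically from 2*pi up to 10000 * 2*pi
    positions = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float)
                    * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(positions * div)
    pe[:, 1::2] = torch.cos(positions * div)
    return pe   # added to the token embeddings before the first layer
```

Because this encoding is a fixed function of the position, it can be evaluated for positions never seen during training, which is exactly what learned position embeddings cannot do.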

SLIDE 22

Depth

To get the benefits of deep learning, we need depth.
◮ Let's make it deep (illustrated below):
◮ encoder: 6 layers
◮ decoder: 6 layers
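As an illustration, the 6+6 stack can be built from PyTorch's built-in modules (hyperparameters follow the base model of Vaswani et al.; this is not the lecture's code):

```python
import torch.nn as nn

encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=2048)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)

decoder_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8, dim_feedforward=2048)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)
```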

SLIDE 23

Transformer

SLIDE 24

Transformer

◮ Can be complicated to train
◮ Has its own Adam setup (learning rate proportional to step^{-0.5}; sketched below)
◮ dropout added just before the residual connections
◮ label smoothing
◮ length penalties added during decoding
◮ checkpoint averaging
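A minimal sketch of that learning-rate schedule: it warms up linearly and then decays proportionally to step^{-0.5} (the function name is mine; the 4000-step warm-up and the d_model scaling follow the paper):

```python
def transformer_lr(step, d_model=512, warmup=4000):
    # linear warm-up for `warmup` steps, then decay proportional to step ** -0.5
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```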

SLIDE 25

Transformer

◮ Have a look at The Annotated Transformer
◮ http://nlp.seas.harvard.edu/2018/04/03/attention.html

SLIDE 26

Evolution of Transformer-based Models

SLIDE 27

Evolution of Transformer-based Models

SLIDE 28

Evolution of Transformer-based Models

SLIDE 29

Carbon footprint

*from Strubell et al. (2019), Energy and Policy Considerations for Deep Learning in NLP

SLIDE 30

What can we do to avoid wasting resources?

SLIDE 31

Sharing is caring

To avoid retraining lots of models:
◮ We can share the trained models
◮ The Nordic Language Processing Laboratory (NLPL) is a good example
◮ But it's important to get things right
◮ METADATA!!!
◮ same format for all models

SLIDE 32

Reduce model size?

What if we can reduce the size of these giant models?
◮ Often, overparameterized transformer models lead to better performance, even with less data
◮ Lottery-ticket hypothesis: for large enough models, there is a small chance that random initialization will lead to a submodel that already has good weights for the task
◮ But interestingly, you can often remove a large number of the parameters for only a small decrease in performance (sketched below)
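As an illustration of that last point, a minimal sketch of magnitude-based weight pruning, one simple pruning criterion among many (not necessarily the method the lecture has in mind):

```python
import torch

def magnitude_prune(model, fraction=0.5):
    # zero out the `fraction` of entries with the smallest absolute value
    # in every weight matrix; biases and other 1-D parameters are left untouched
    with torch.no_grad():
        for param in model.parameters():
            if param.dim() < 2:
                continue
            k = int(param.numel() * fraction)
            if k == 0:
                continue
            threshold = param.abs().flatten().kthvalue(k).values
            param[param.abs() <= threshold] = 0.0
    return model
```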

SLIDE 33

Reduce model size?

SLIDE 34

Model distillation

[Figure: a Pretrained Teacher Model and a smaller Student Model, both processing the input "This is an example."; the student learns to reproduce the teacher's predictions]
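A minimal sketch of the knowledge-distillation loss behind this picture: the student is trained to match the teacher's softened output distribution as well as the gold labels (the temperature, the weighting, and the names are assumptions):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # soft targets: make the student match the teacher's softened distribution
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    # hard targets: ordinary cross-entropy against the gold labels
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```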

SLIDE 35

Head pruning

[Figure: attention heads are pruned from a pretrained model; performance goes from 93.7 to 93.7 and then only drops to 93.0 as more heads are removed]

SLIDE 36

But how can we NLPers contribute to sustainability?

◮ When possible, use pre-trained models (example below)
◮ If you train a strong model, similarly make it available to the community
◮ Try to reduce the amount of hyperparameter tuning we do (for example, by working with models that are more robust to hyperparameters)
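For instance, loading a shared pre-trained model instead of training one from scratch (this uses the Hugging Face transformers library; the model name is just an example):

```python
from transformers import AutoModel, AutoTokenizer

# download (or load from cache) a model that someone else already paid to train
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")

inputs = tokenizer("Dette er et eksempel.", return_tensors="pt")
outputs = model(**inputs)   # contextual representations, ready for fine-tuning
```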
