SLIDE 1

Deep learning 13.3. Transformer Networks

François Fleuret
https://fleuret.org/ee559/
Oct 30, 2020

SLIDE 2

Vaswani et al. (2017) proposed to go one step further: instead of using attention mechanisms as a supplement to standard convolutional and recurrent operations, they designed a model combining only attention layers.

They designed this “transformer” for a sequence-to-sequence translation task, but it is currently key to state-of-the-art approaches across NLP tasks.

SLIDE 3

They first introduce a multi-head attention module.

Scaled Dot-Product Attention / Multi-Head Attention (Vaswani et al., 2017):

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(H_1, \dots, H_h)\, W^O, \qquad H_i = \mathrm{Attention}\!\left(Q W_i^Q,\, K W_i^K,\, V W_i^V\right), \quad i = 1, \dots, h,$$

with $W_i^Q \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^K \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^V \in \mathbb{R}^{d_{model} \times d_v}$, and $W^O \in \mathbb{R}^{h d_v \times d_{model}}$.
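As an illustration, here is a minimal PyTorch sketch of this multi-head attention. It is not the reference implementation; it assumes the common choice d_k = d_v = d_model / h, and all names are illustrative.

import math
import torch
from torch import nn

class MultiHeadAttention(nn.Module):
    # Minimal sketch of multi-head attention (Vaswani et al., 2017),
    # assuming d_k = d_v = d_model / h.
    def __init__(self, d_model, h):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_k = h, d_model // h
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)
        self.w_o = nn.Linear(d_model, d_model, bias=False)

    def forward(self, q, k, v):
        B, T, _ = q.shape
        S = k.shape[1]
        # project, then split into h heads: (B, h, T or S, d_k)
        q = self.w_q(q).view(B, T, self.h, self.d_k).transpose(1, 2)
        k = self.w_k(k).view(B, S, self.h, self.d_k).transpose(1, 2)
        v = self.w_v(v).view(B, S, self.h, self.d_k).transpose(1, 2)
        # scaled dot-product attention, computed per head
        att = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.d_k), dim=-1)
        y = att @ v                                        # (B, h, T, d_k)
        y = y.transpose(1, 2).reshape(B, T, self.h * self.d_k)
        return self.w_o(y)                                 # concatenate heads, apply W^O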

SLIDE 4

Their complete model is composed of:

  • An encoder that combines N = 6 modules, each composed of a multi-head attention sub-module and a [per-component] one hidden-layer MLP, with residual pass-through and layer normalization.

  • A decoder with a similar structure, but with causal attention layers to allow for auto-regressive training, and additional attention layers that attend to the layers of the encoder.
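A minimal PyTorch sketch of one such encoder module may help fix ideas. It uses torch.nn.MultiheadAttention and the post-norm arrangement described above; d_model = 512, h = 8 and d_ff = 2048 are the base-model values from Vaswani et al. (2017), the rest is illustrative, not the authors' code.

from torch import nn

class TransformerEncoderBlock(nn.Module):
    # One encoder module: multi-head self-attention and a per-position
    # one-hidden-layer MLP, each with residual pass-through and layer norm.
    def __init__(self, d_model=512, h=8, d_ff=2048):
        super().__init__()
        self.att = nn.MultiheadAttention(d_model, h, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        x = self.norm1(x + self.att(x, x, x)[0])  # self-attention + residual + LN
        x = self.norm2(x + self.mlp(x))           # per-position MLP + residual + LN
        return x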

SLIDE 5

Their complete model is composed of:

  • An encoder that combines N = 6 modules, each composed of a multi-head attention sub-module and a [per-component] one hidden-layer MLP, with residual pass-through and layer normalization.

  • A decoder with a similar structure, but with causal attention layers to allow for auto-regressive training, and additional attention layers that attend to the layers of the encoder.

Positional information is provided through an additive positional encoding of the same dimension $d_{model}$ as the internal representation, of the form

$$PE_{t,2i} = \sin\!\left(\frac{t}{10{,}000^{2i/d_{model}}}\right), \qquad PE_{t,2i+1} = \cos\!\left(\frac{t}{10{,}000^{(2i+1)/d_{model}}}\right).$$

SLIDE 6

(Vaswani et al., 2017)

SLIDE 7

The architecture is tested on English-to-German and English-to-French translation using the standard WMT2014 datasets.

  • English-to-German: 4.5M sentence pairs, 37k tokens vocabulary.
  • English-to-French: 36M sentence pairs, 32k tokens vocabulary.
  • 8 P100 GPUs (150 TFlops FP16), 0.5 day for the small model, 3.5 days for the large one.

SLIDE 8

Table 2: The Transformer achieves better BLEU scores than previous state-of-the-art models on the English-to-German and English-to-French newstest2014 tests at a fraction of the training cost.

                                       BLEU              Training Cost (FLOPs)
Model                             EN-DE    EN-FR         EN-DE         EN-FR
ByteNet [18]                      23.75
Deep-Att + PosUnk [39]                     39.2                        1.0 · 10^20
GNMT + RL [38]                    24.6     39.92         2.3 · 10^19   1.4 · 10^20
ConvS2S [9]                       25.16    40.46         9.6 · 10^18   1.5 · 10^20
MoE [32]                          26.03    40.56         2.0 · 10^19   1.2 · 10^20
Deep-Att + PosUnk Ensemble [39]            40.4                        8.0 · 10^20
GNMT + RL Ensemble [38]           26.30    41.16         1.8 · 10^20   1.1 · 10^21
ConvS2S Ensemble [9]              26.36    41.29         7.7 · 10^19   1.2 · 10^21
Transformer (base model)          27.3     38.1               3.3 · 10^18
Transformer (big)                 28.4     41.8               2.3 · 10^19

(Vaswani et al., 2017)

SLIDE 9

[Attention visualization over the sentence “The Law will never be perfect , but its application should be just - this is what we are missing , in my opinion . <EOS> <pad>”.]

(Vaswani et al., 2017)

SLIDE 10

[Attention visualization over the same sentence, “The Law will never be perfect , but its application should be just - this is what we are missing , in my opinion . <EOS> <pad>”.]

Figure 5: Many of the attention heads exhibit behaviour that seems related to the structure of the sentence.

(Vaswani et al., 2017)

SLIDE 11

The Universal Transformer (Dehghani et al., 2018) is a similar model where all the blocks are identical, resulting in a recurrent model that iterates over consecutive revisions of the representation instead of over positions. Additionally, the number of steps is modulated dynamically per position.

SLIDE 12

Transformer self-training and fine-tuning for NLP

SLIDE 13

The transformer networks were introduced for translation, and trained with a supervised procedure from pairs of sentences. However, as for word embeddings, they can be trained in an unsupervised manner, for auto-regression or as denoising auto-encoders, from very large data-sets, and fine-tuned on supervised tasks with small data-sets.

SLIDE 14

[Diagram: BERT (Ours), OpenAI GPT, and ELMo pre-training architectures, built from Transformer (Trm) or LSTM blocks over input embeddings E_1 ... E_N and output representations T_1 ... T_N.]

Figure 3: Differences in pre-training model architectures. BERT uses a bidirectional Transformer. OpenAI GPT uses a left-to-right Transformer. ELMo uses the concatenation of independently trained left-to-right and right-to- left LSTMs to generate features for downstream tasks. Among the three, only BERT representations are jointly conditioned on both left and right context in all layers. In addition to the architecture differences, BERT and OpenAI GPT are fine-tuning approaches, while ELMo is a feature-based approach.

(Devlin et al., 2018)

SLIDE 15

GPT (Generative Pre-Training, Radford, 2018) is a transformer trained for auto-regressive text generation. (Radford, 2018)

SLIDE 16

“GPT-2 is a large transformer-based language model with 1.5 billion parameters, trained on a dataset of 8 million web pages. GPT-2 is trained with a simple objective: predict the next word, given all of the previous words within some text. The diversity of the dataset causes this simple goal to contain naturally occurring demonstrations of many tasks across diverse domains. GPT-2 is a direct scale-up of GPT, with more than 10X the parameters and trained on more than 10X the amount of data.” (Radford et al., 2019)

SLIDE 17

We can install implementations of the various flavors of transformers from HuggingFace (https://huggingface.co/)

pip install transformers

and use pre-trained models as we did for vision.

SLIDE 18

import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.eval()

tokens = tokenizer.encode('Studying Deep-Learning is')

for k in range(11):
    # with transformers < 4.0 the model call returns a tuple (logits, past)
    outputs, _ = model(torch.tensor([tokens]))
    next_token = torch.argmax(outputs[0, -1]).item()  # greedy choice, as a plain int
    tokens.append(next_token)

print(tokenizer.decode(tokens))

prints

Studying Deep-Learning is a great way to learn about the world around you.
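The snippet above relies on the pre-4.0 transformers API, where the model call returns a tuple. A sketch of the same greedy generation assuming transformers ≥ 4.0, where the call returns an object with a .logits field:

import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.eval()

input_ids = tokenizer.encode('Studying Deep-Learning is', return_tensors='pt')

with torch.no_grad():
    for _ in range(11):
        logits = model(input_ids).logits                # (1, seq_len, vocab_size)
        next_token = logits[0, -1].argmax().view(1, 1)  # greedy choice
        input_ids = torch.cat([input_ids, next_token], dim=1)

print(tokenizer.decode(input_ids[0]))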

SLIDE 19

BERT (Bidirectional Encoder Representations from Transformers, Devlin et al., 2018) is a transformer pre-trained with:

  • Masked Language Model (MLM), which consists in predicting [15% of] words which have been replaced with a “MASK” token.
  • Next Sentence Prediction (NSP), which consists in predicting if a certain sentence follows the current one.

It is then fine-tuned on multiple NLP tasks.
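As a quick illustration of the MLM objective, one can query a pre-trained BERT through the HuggingFace fill-mask pipeline; the example sentence below is ours, not from the slides.

from transformers import pipeline

# fill-mask returns the most likely replacements for the [MASK] token
fill_mask = pipeline('fill-mask', model='bert-base-uncased')

for p in fill_mask('Transformers were first introduced for [MASK] tasks.'):
    print(p['token_str'], p['score'])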

SLIDE 20

[Diagram: BERT pre-training on unlabeled sentence pairs (Mask LM and NSP over Masked Sentence A / Masked Sentence B) and fine-tuning on downstream tasks such as SQuAD question/paragraph pairs (start/end span), NER, and MNLI.]

Figure 1: Overall pre-training and fine-tuning procedures for BERT. Apart from output layers, the same architec- tures are used in both pre-training and fine-tuning. The same pre-trained model parameters are used to initialize models for different down-stream tasks. During fine-tuning, all parameters are fine-tuned. [CLS] is a special symbol added in front of every input example, and [SEP] is a special separator token (e.g. separating ques- tions/answers).

(Devlin et al., 2018)

SLIDE 21

[Diagram: BERT fine-tuning configurations: sentence-pair classification (class label), single-sentence classification (class label), question answering over a question/paragraph pair (start/end span), and single-sentence tagging (e.g. B-PER, O).]

(Devlin et al., 2018)

SLIDE 22

Head 8-11
  • Noun modifiers (e.g., determiners) attend to their noun
  • 94.3% accuracy at the det relation

Head 8-10
  • Direct objects attend to their verbs
  • 86.8% accuracy at the dobj relation

(Clark et al., 2019)

SLIDE 23

Head 7-6
  • Possessive pronouns and apostrophes attend to the head of the corresponding NP
  • 80.5% accuracy at the poss relation

Head 4-10
  • Passive auxiliary verbs attend to the verb they modify
  • 82.5% accuracy at the auxpass relation

(Clark et al., 2019)

SLIDE 24

Head 9-6
  • Prepositions attend to their objects
  • 76.3% accuracy at the pobj relation

Head 5-4
  • Coreferent mentions attend to their antecedents
  • 65.1% accuracy at linking the head of a coreferent mention to the head of an antecedent

(Clark et al., 2019)

SLIDE 25

Attention in computer vision

SLIDE 26

Wang et al. (2018) proposed an attention mechanism for images, following the model from Vaswani et al. (2017):

$$y = \mathrm{softmax}\!\left((W_\theta x)^\top (W_\phi x)\right) W_g x.$$
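A minimal PyTorch sketch of this attention for 2D feature maps, in an embedded-Gaussian, residual "non-local block" form with 1×1 convolutions for W_θ, W_φ and W_g; an illustration under these assumptions, not the authors' code.

import torch
from torch import nn

class NonLocalBlock2d(nn.Module):
    # Embedded-Gaussian non-local block for 2D feature maps, with a
    # bottleneck of inter_channels and a residual connection.
    def __init__(self, in_channels, inter_channels=None):
        super().__init__()
        inter_channels = inter_channels or in_channels // 2
        self.theta = nn.Conv2d(in_channels, inter_channels, 1)
        self.phi = nn.Conv2d(in_channels, inter_channels, 1)
        self.g = nn.Conv2d(in_channels, inter_channels, 1)
        self.out = nn.Conv2d(inter_channels, in_channels, 1)

    def forward(self, x):
        B, C, H, W = x.shape
        theta = self.theta(x).flatten(2).transpose(1, 2)    # (B, HW, C')
        phi = self.phi(x).flatten(2)                        # (B, C', HW)
        g = self.g(x).flatten(2).transpose(1, 2)            # (B, HW, C')
        att = torch.softmax(theta @ phi, dim=-1)            # (B, HW, HW) attention map
        y = (att @ g).transpose(1, 2).reshape(B, -1, H, W)  # (B, C', H, W)
        return x + self.out(y)                              # residual pass-through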

SLIDE 27

They insert “non-local blocks” in residual architectures and get improvements on both video and image classification.


Figure 2. A spacetime non-local block. The feature maps are shown as the shape of their tensors, e.g., T×H×W×1024 for 1024 channels (proper reshaping is performed when noted). “⊗” denotes matrix multiplication, and “⊕” denotes element-wise sum. The softmax operation is performed on each row. The blue boxes de- note 1×1×1 convolutions. Here we show the embedded Gaussian version, with a bottleneck of 512 channels. The vanilla Gaussian version can be done by removing θ and φ, and the dot-product version can be done by replacing softmax with scaling by 1/N.

(Wang et al., 2018)

SLIDE 28

Figure 3. Examples of the behavior of a non-local block in res3 computed by a 5-block non-local model trained on Kinetics. These examples are from held-out validation videos. The starting point of arrows represents one xi, and the ending points represent xj. The 20 highest weighted arrows for each xi are visualized. The 4 frames are from a 32-frame input, shown with a stride of 8 frames. These visualizations show how the model finds related clues to support its prediction.

(Wang et al., 2018)

SLIDE 29

Ramachandran et al. (2019) replaced convolutions with local attention:

$$y_{i,j} = \sum_{(a,b) \in \mathcal{N}(i,j)} W_{i-a,\,j-b}\, x_{a,b} \qquad \text{(convolution)}$$

$$y_{i,j} = \sum_{(a,b) \in \mathcal{N}(i,j)} \mathrm{softmax}_{a,b}\!\left((W_Q x_{i,j})^\top W_K x_{a,b}\right) v_{a,b} \qquad \text{(local attention)}$$
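A minimal single-head PyTorch sketch of this local attention, gathering k × k neighborhoods with torch.nn.functional.unfold; illustrative only, since the actual model uses multiple heads and a relative positional term not shown here.

import torch
import torch.nn.functional as F
from torch import nn

class LocalAttention2d(nn.Module):
    # Single-head local attention over k x k neighborhoods, following the
    # simplified formula above (no positional term, zero padding at borders).
    def __init__(self, channels, k=3):
        super().__init__()
        self.k = k
        self.q = nn.Conv2d(channels, channels, 1)
        self.key = nn.Conv2d(channels, channels, 1)
        self.v = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        B, C, H, W = x.shape
        q = self.q(x).view(B, C, 1, H * W)                              # queries per pixel
        k_maps = F.unfold(self.key(x), self.k, padding=self.k // 2)     # (B, C*k*k, H*W)
        v_maps = F.unfold(self.v(x), self.k, padding=self.k // 2)       # (B, C*k*k, H*W)
        k_maps = k_maps.view(B, C, self.k * self.k, H * W)
        v_maps = v_maps.view(B, C, self.k * self.k, H * W)
        att = (q * k_maps).sum(dim=1, keepdim=True)                     # dot products over channels
        att = att.softmax(dim=2)                                        # softmax over the neighborhood
        y = (att * v_maps).sum(dim=2)                                   # (B, C, H*W)
        return y.view(B, C, H, W)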

Figure 2: An example of a 3 × 3 convolution. The output is the inner product between the local window and the learned weights. Figure 3: An example of a local attention layer over spatial extent of k = 3.

(Ramachandran et al., 2019)

SLIDE 30

Figure 5: Comparing parameters and FLOPS against accuracy on ImageNet classification across a range of network widths for ResNet-50. Attention models have fewer parameters and FLOPS while improving upon the accuracy of the baseline.

(Ramachandran et al., 2019)

SLIDE 31

“A fully attentional network based off of the proposed stand-alone local self-attention layer achieves competitive predictive performance on ImageNet classification and COCO object detection tasks while requiring fewer parameters and floating point operations than the corresponding convolution baselines.” (Ramachandran et al., 2019)

SLIDE 32

Cordonnier et al. (2020) showed that, provided with a proper positional encoding, multi-head multiplicative attention layers can encode convolutions with a filter support whose size is given by the number of heads:

“A multi-head self-attention layer with N_h heads of dimension D_h, output dimension D_out and a relative positional encoding of dimension D_p ≥ 3 can express any convolutional layer of kernel size √N_h × √N_h and min(D_h, D_out) output channels.”

(Cordonnier et al., 2020)

SLIDE 33


Figure 6: Attention probabilities for a model with 6 layers (rows) and 9 heads (columns) using learned relative positional encoding and content-content based attention. Attention maps are aver- aged over 100 test images to display head behavior and remove the dependence on the input content. The black square is the query pixel. More examples are presented in Appendix A.

(Cordonnier et al., 2020)

SLIDE 34

https://epfml.github.io/attention-cnn/

SLIDE 35

The end

SLIDE 36

References

  • K. Clark, U. Khandelwal, O. Levy, and C. Manning. What does BERT look at? An analysis of BERT’s attention. CoRR, abs/1906.04341, 2019.
  • J. Cordonnier, A. Loukas, and M. Jaggi. On the relationship between self-attention and convolutional layers. In International Conference on Learning Representations (ICLR), 2020.
  • M. Dehghani, S. Gouws, O. Vinyals, J. Uszkoreit, and L. Kaiser. Universal transformers. CoRR, abs/1807.03819, 2018.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805, 2018.
  • A. Radford. Improving language understanding with unsupervised learning. Web, June 2018. https://openai.com/blog/language-unsupervised/.
  • A. Radford, J. Wu, D. Amodei, D. Amodei, J. Clark, M. Brundage, and I. Sutskever. Better language models and their implications. Web, February 2019. https://blog.openai.com/better-language-models/.
  • P. Ramachandran, N. Parmar, A. Vaswani, I. Bello, A. Levskaya, and J. Shlens. Stand-alone self-attention in vision models. CoRR, abs/1906.05909, 2019.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. CoRR, abs/1706.03762, 2017.
  • X. Wang, R. Girshick, A. Gupta, and K. He. Non-local neural networks. In Conference on Computer Vision and Pattern Recognition (CVPR), 2018.