
CS480/680 Lecture 19: July 10, 2019, Attention and Transformer Networks

  1. CS480/680 Lecture 19: July 10, 2019, Attention and Transformer Networks [Vaswani et al., Attention Is All You Need, NeurIPS 2017]. University of Waterloo, CS480/680 Spring 2019, Pascal Poupart

  2. Attention
     • Attention in Computer Vision
       – 2014: attention used to highlight important parts of an image that contribute to a desired output
     • Attention in NLP
       – 2015: aligned machine translation
       – 2017: language modeling with Transformer networks

  3. Sequence Modeling
     Challenges with RNNs                    Transformer Networks
     • Long range dependencies               • Facilitate long range dependencies
     • Gradient vanishing and explosion      • No gradient vanishing or explosion
     • Large # of training steps             • Fewer training steps
     • Recurrence prevents parallel          • No recurrence, which facilitates
       computation                             parallel computation

  4. Attention Mechanism
     • Mimics the retrieval of a value v_i for a query q based on a key k_i in a database
     • attention(q, K, V) = Σ_i similarity(q, k_i) × v_i
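
A minimal numpy sketch of this soft retrieval, with cosine similarity standing in for the unspecified similarity(q, k_i) (later slides replace it with a scaled dot product followed by a softmax):

    import numpy as np

    def cosine(q, k):
        # one possible choice of similarity measure; the slide leaves it abstract
        return q @ k / (np.linalg.norm(q) * np.linalg.norm(k))

    def attention(q, keys, values, similarity=cosine):
        # attention(q, K, V) = sum_i similarity(q, k_i) * v_i
        return sum(similarity(q, k_i) * v_i for k_i, v_i in zip(keys, values))

    # toy "database" of 4 key/value pairs of dimension 3
    keys = np.random.randn(4, 3)
    values = np.random.randn(4, 3)
    q = np.random.randn(3)
    print(attention(q, keys, values))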

  5. Attention Mechanism
     • Neural architecture
     • Example: machine translation (see the sketch below)
       – Query: s_{t-1} (hidden vector for the (t-1)-th output word)
       – Key: h_j (hidden vector for the j-th input word)
       – Value: h_j (hidden vector for the j-th input word)
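
A minimal sketch of this machine translation example, where the query s_{t-1} attends over the encoder states h_j, which serve as both keys and values (the names s_prev and H are mine, and the dot-product score is one possible choice):

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def context_vector(s_prev, H):
        scores = H @ s_prev            # similarity(s_{t-1}, h_j) for every input word j
        alphas = softmax(scores)       # alignment weights over the input words
        return alphas @ H              # weighted sum of the values h_j

    H = np.random.randn(6, 8)          # 6 input words, hidden size 8
    s_prev = np.random.randn(8)        # decoder hidden state for output word t-1
    print(context_vector(s_prev, H).shape)   # (8,)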

  6. Transformer Network
     • Vaswani et al. (2017), Attention Is All You Need
     • Encoder-decoder architecture based on attention (no recurrence)

  7. Multihead Attention
     • Multihead attention: compute multiple attentions per query with different weights
       multihead(Q, K, V) = W^O concat(head_1, head_2, ..., head_h)
       head_i = attention(W_i^Q Q, W_i^K K, W_i^V V)
       attention(Q, K, V) = softmax(Q^T K / sqrt(d_k)) V
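
A minimal numpy sketch of these three formulas; vectors are stored as rows here, so the slide's Q^T K appears as Q @ K.T, and the toy sizes (4 heads, model dimension 16) are arbitrary:

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def attention(Q, K, V):
        d_k = K.shape[-1]
        return softmax(Q @ K.T / np.sqrt(d_k)) @ V     # scaled dot-product attention

    def multihead(Q, K, V, W_Q, W_K, W_V, W_O):
        # one attention per head, each with its own projections W_i^Q, W_i^K, W_i^V
        heads = [attention(Q @ Wq, K @ Wk, V @ Wv)
                 for Wq, Wk, Wv in zip(W_Q, W_K, W_V)]
        return np.concatenate(heads, axis=-1) @ W_O

    n, d, h = 5, 16, 4                 # sequence length, model dim, number of heads
    rng = np.random.default_rng(0)
    X = rng.normal(size=(n, d))
    W_Q = [rng.normal(size=(d, d // h)) for _ in range(h)]
    W_K = [rng.normal(size=(d, d // h)) for _ in range(h)]
    W_V = [rng.normal(size=(d, d // h)) for _ in range(h)]
    W_O = rng.normal(size=(d, d))
    print(multihead(X, X, X, W_Q, W_K, W_V, W_O).shape)   # (5, 16)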

  8. Masked Multi-head Attention
     • Masked multi-head attention: multi-head attention where some values are masked (i.e., the probabilities of masked values are nullified to prevent them from being selected).
     • When decoding, an output value should only depend on previous outputs (not future outputs). Hence we mask future outputs:
       attention(Q, K, V) = softmax(Q^T K / sqrt(d_k)) V
       maskedAttention(Q, K, V) = softmax((Q^T K + M) / sqrt(d_k)) V
       where M is a mask matrix of 0's and -∞'s
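
A minimal numpy sketch of the masked variant with a causal (lower-triangular) mask, again with row vectors so the slide's Q^T K becomes Q @ K.T:

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def masked_attention(Q, K, V, M):
        # M contains 0 where attention is allowed and -inf where it is masked
        d_k = K.shape[-1]
        return softmax((Q @ K.T + M) / np.sqrt(d_k)) @ V

    n, d = 4, 8
    rng = np.random.default_rng(0)
    Q = K = V = rng.normal(size=(n, d))

    # causal mask: output position i may only attend to input positions j <= i
    M = np.where(np.tril(np.ones((n, n))) == 1, 0.0, -np.inf)
    print(np.round(softmax((Q @ K.T + M) / np.sqrt(d)), 2))   # upper triangle is 0
    print(masked_attention(Q, K, V, M).shape)                  # (4, 8)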

  9. Other Layers
     • Layer normalization:
       – Normalize the values in each layer to have 0 mean and 1 variance
       – For each hidden unit h_i compute h_i ← (g/σ)(h_i − μ), where g is a gain variable,
         μ = (1/H) Σ_{i=1}^H h_i  and  σ = sqrt((1/H) Σ_{i=1}^H (h_i − μ)²)
       – This reduces "covariate shift" (i.e., gradient dependencies between layers), so fewer training iterations are needed
     • Positional embedding:
       – Embedding to distinguish each position
         PE_{position, 2i}   = sin(position / 10000^(2i/d))
         PE_{position, 2i+1} = cos(position / 10000^(2i/d))
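
A minimal numpy sketch of both formulas; the gain g is kept scalar, the bias term that usually accompanies layer normalization is omitted, and the embedding dimension d is assumed even:

    import numpy as np

    def layer_norm(h, g=1.0, eps=1e-5):
        # h_i <- (g / sigma) * (h_i - mu)
        mu = h.mean()
        sigma = np.sqrt(((h - mu) ** 2).mean())
        return (g / (sigma + eps)) * (h - mu)

    def positional_embedding(max_len, d):
        # PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d))
        pos = np.arange(max_len)[:, None]
        two_i = np.arange(0, d, 2)[None, :]
        angles = pos / np.power(10000.0, two_i / d)
        pe = np.zeros((max_len, d))
        pe[:, 0::2] = np.sin(angles)
        pe[:, 1::2] = np.cos(angles)
        return pe

    h = np.random.randn(16)
    print(layer_norm(h).mean().round(6), layer_norm(h).std().round(3))   # ≈ 0 and 1
    print(positional_embedding(max_len=10, d=8).shape)                    # (10, 8)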

  10. Comparison
     • Attention reduces sequential operations and maximum path length, which facilitates long range dependencies

  11. Results

  12. GPT and GPT-2
     • Radford et al. (2019), Language Models are Unsupervised Multitask Learners
       – Decoder transformer that predicts the next word based on the previous words by computing P(w_i | w_{1..i-1})
       – State of the art in the "zero-shot" setting for 7/8 language tasks (where zero-shot means no task-specific training, only unsupervised language modeling)
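
A toy sketch of the autoregressive factorization log P(w_1..w_L) = Σ_i log P(w_i | w_{1..i-1}); next_word_probs is a hypothetical stand-in (a uniform distribution) for the decoder transformer's output:

    import numpy as np

    VOCAB = ["<bos>", "the", "cat", "sat", "."]

    def next_word_probs(prefix):
        # hypothetical stand-in for the transformer decoder's softmax over the vocabulary
        return {w: 1.0 / len(VOCAB) for w in VOCAB}

    def log_prob(sentence):
        # log P(w_1..w_L) = sum_i log P(w_i | w_1..i-1)
        total, prefix = 0.0, ["<bos>"]
        for w in sentence:
            total += np.log(next_word_probs(prefix)[w])
            prefix.append(w)
        return total

    print(log_prob(["the", "cat", "sat", "."]))   # 4 * log(1/5)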

  13. BERT (Bidirectional Encoder Representations from Transformers)
     • Devlin et al. (2019), BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
       – Encoder transformer that predicts a missing word based on the surrounding words by computing P(w_i | w_{1..i-1, i+1..L})
       – The missing word is masked with masked multi-head attention
       – Improved the state of the art on 11 tasks
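
A simplified sketch of how a masked-language-model training example can be built; the 15% masking rate and the [MASK] token follow the BERT paper, but BERT's actual procedure also leaves some selected tokens unchanged or replaces them with random words:

    import random

    def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]"):
        # the model must predict each masked word from the words on both sides of it
        inputs, targets = [], []
        for w in tokens:
            if random.random() < mask_rate:
                inputs.append(mask_token)
                targets.append(w)          # prediction target at this position
            else:
                inputs.append(w)
                targets.append(None)       # no loss on unmasked positions
        return inputs, targets

    random.seed(0)
    print(mask_tokens("the cat sat on the mat".split()))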
