  1. Transformer Models CSE545 - Spring 2019

  2. Review: Feed Forward Network (fully connected) (skymind, AI Wiki)

  3. Review: Convolutional NN (Barter, 2018)

  4. Review: Recurrent Neural Network. Hidden layer: h^(t) = g(h^(t-1) U + x^(t) V); output: y^(t) = f(h^(t) W), where g and f are activation functions (Jurafsky, 2019)
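To make the recurrence concrete, here is a minimal NumPy sketch of one RNN step; choosing tanh for g and softmax for f, and the toy dimensions, are assumptions for illustration. Note that the loop over time steps must run sequentially, which is exactly the parallelization issue raised on the next slides.

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def rnn_step(x_t, h_prev, U, V, W):
        """One step of the reviewed RNN: h_t = g(h_{t-1} U + x_t V), y_t = f(h_t W)."""
        h_t = np.tanh(h_prev @ U + x_t @ V)   # g = tanh (assumed)
        y_t = softmax(h_t @ W)                # f = softmax (assumed)
        return h_t, y_t

    # toy dimensions (assumed): 5-dim input, 8-dim hidden state, 3-dim output
    rng = np.random.default_rng(0)
    U, V, W = rng.normal(size=(8, 8)), rng.normal(size=(5, 8)), rng.normal(size=(8, 3))
    h = np.zeros(8)
    for x in rng.normal(size=(4, 5)):         # a length-4 input sequence
        h, y = rnn_step(x, h, U, V, W)        # each step depends on the previous one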

  5-7. FFN / CNN / RNN: Can the model computation (e.g. matrix operations for a single input) be parallelized?

  8. FFN / CNN / RNN: Can the model computation (e.g. matrix operations for a single input) be parallelized? For the RNN it cannot, which ultimately limits how complex the model can be (i.e. its total number of parameters/weights) compared to a CNN.

  9. The Transformer: “Attention-only” models. Can handle sequences and long-distance dependencies, but: ● Don’t want the complexity of LSTM/GRU cells ● Constant number of edges between input steps ● Enables “interactions” (i.e. adaptations) between words ● Easy to parallelize -- no sequential processing needed.

  10-11. The Transformer: “Attention-only” models. Challenge: long-distance dependencies when translating, e.g. between “The ball was kicked by Kayla.” and “Kayla kicked the ball.” [figure: a sequence-to-sequence model emitting y(0) ... y(4), with the decoder fed <go>, y(0), y(1), y(2), ...]

  12. Attention [figure: values z_1 ... z_4 are weighted by attention weights α and combined into a context vector c]

  13. Attention [figure: adds a query h_i and a score function ω (with weights W) that produces the attention weights α over the values z_1 ... z_4]

  14. Attention [figure: the full setup: the query h_i is scored (ω) against keys s_1 ... s_4 to produce the weights α over the values z_1 ... z_4, giving the context c]
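Slides 12-14 build up the attention computation; here is a minimal NumPy sketch of it: the query is scored against each key, the scores are normalized with a softmax into weights α, and the context c is the α-weighted sum of the values. Using a plain dot product as the score function, and the toy dimensions, are assumptions.

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def attention(query, keys, values, score=np.dot):
        """c = sum_i alpha_i * z_i, where alpha = softmax(score(query, s_i))."""
        scores = np.array([score(query, s) for s in keys])   # one score per key s_i
        alpha = softmax(scores)                              # attention weights
        return alpha @ values, alpha                         # context vector c, weights

    # toy example (assumed shapes): 4 key/value pairs, all vectors of dimension 6
    rng = np.random.default_rng(0)
    keys, values = rng.normal(size=(4, 6)), rng.normal(size=(4, 6))
    c, alpha = attention(rng.normal(size=6), keys, values)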

  15. The Transformer: “Attention-only” models. Challenge: long-distance dependencies when translating. Attention came about for encoder-decoder models; then self-attention was introduced:

  16. Attention [figure: repeat of slide 14: query h_i, keys s_1 ... s_4, values z_1 ... z_4, score function ω]

  17. Self-Attention [figure: the same mechanism, but the query, keys s, and values z all come from the same sequence, so position i attends over the other positions]

  18. The Transformer: “Attention-only” models Attention as weighting a value based on a query and key: (Eisenstein, 2018)

  19. The Transformer: “Attention-only” models [figure: hidden states h_{i-1}, h_i, h_{i+1} are scored by ω, weighted by α, and combined (x) into the Output] (Eisenstein, 2018)

  20. The Transformer: “Attention-only” models [figure: the same computation as self-attention: each hidden state h_i attends over h_{i-1}, h_i, h_{i+1}] (Eisenstein, 2018)

  21. The Transformer: “Attention-only” models [figure: the self-attention layer over hidden states h_{i-1} ... h_{i+2}, producing the Output]

  22. The Transformer: “Attention-only” models [figure: word inputs w_{i-1} ... w_{i+2} pass through a FFN to give hidden states h_{i-1} ... h_{i+2}, which the self-attention layer combines into the Output]

  23. The Transformer: “Attention-only” models [figure: the full stack, now with outputs y_{i-1} ... y_{i+2} on top]

  24. The Transformer: “Attention-only” models [figure: the same stack, extended across the rest of the sequence (...)]

  25. The Transformer: “Attention-only” models. Attend to all hidden states in your “neighborhood”. [figure: same stack as the previous slides]

  26. The Transformer: “Attention-only” models [figure: the score function ω is a dot product (dp) between key and query, k^T q]

  27. The Transformer: “Attention-only” models [figure: the dot-product scores k^T q are scaled by a scaling parameter and passed through σ(k, q) before weighting the values]

  28. The Transformer: “Attention-only” models. Linear layer (W^T x): one set of weights each for K, Q, and V. [figure: the K, Q, V projections feed the scaled dot products σ(k^T q)]
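Slides 26-28 together describe scaled dot-product self-attention with learned projections for K, Q, and V; here is a sketch under those descriptions, with made-up dimensions and softmax assumed as the σ in the figure.

    import numpy as np

    def softmax(z, axis=-1):
        e = np.exp(z - z.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def scaled_dot_product_self_attention(H, W_q, W_k, W_v):
        """H: (seq_len, d_model). One set of weights each for Q, K, V (slide 28)."""
        Q, K, V = H @ W_q, H @ W_k, H @ W_v       # linear layers
        scores = Q @ K.T / np.sqrt(K.shape[-1])   # scaled dot products (slide 27)
        alpha = softmax(scores, axis=-1)          # one weight distribution per position
        return alpha @ V                          # every position attends to every position

    # toy shapes (assumed): sequence of 4 hidden states, d_model = d_k = d_v = 8
    rng = np.random.default_rng(0)
    H = rng.normal(size=(4, 8))
    W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
    out = scaled_dot_product_self_attention(H, W_q, W_k, W_v)   # shape (4, 8)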

  29. The Transformer Limitation (thus far): Can’t capture multiple types of dependencies between words.

  30. The Transformer Solution: Multi-head attention

  31. Multi-head Attention
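A sketch of multi-head attention as motivated on slides 29-31: each head runs the scaled dot-product attention from the previous sketch with its own Q/K/V weights, so different heads can capture different types of dependencies, and the heads' outputs are concatenated and projected back to the model dimension. The head count, head size, and output projection W_o are assumptions; the code reuses scaled_dot_product_self_attention, rng, and H from the sketch above.

    import numpy as np

    def multi_head_self_attention(H, heads, W_o):
        """heads: a list of (W_q, W_k, W_v) triples, one per attention head."""
        head_outputs = [scaled_dot_product_self_attention(H, W_q, W_k, W_v)
                        for (W_q, W_k, W_v) in heads]
        return np.concatenate(head_outputs, axis=-1) @ W_o   # concatenate heads, project

    # toy setup (assumed): d_model = 8, 2 heads of size 4, projection back to 8
    heads = [tuple(rng.normal(size=(8, 4)) for _ in range(3)) for _ in range(2)]
    W_o = rng.normal(size=(8, 8))
    out = multi_head_self_attention(H, heads, W_o)           # shape (4, 8)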

  32. Transformer for Encoder-Decoder

  33. Transformer for Encoder-Decoder (sequence index t)
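The sequence index (t) has to be injected explicitly, since self-attention by itself has no notion of order; the checklist on slide 43 calls this the positional encoding. The slides do not give a formula, so the sketch below uses the sinusoidal encoding of the original Transformer paper as an assumed concrete choice.

    import numpy as np

    def sinusoidal_positional_encoding(seq_len, d_model):
        """PE[t, 2i] = sin(t / 10000^(2i/d_model)), PE[t, 2i+1] = cos(same angle)."""
        t = np.arange(seq_len)[:, None]             # sequence index for each position
        i = np.arange(0, d_model, 2)[None, :]       # even embedding dimensions
        angles = t / np.power(10000.0, i / d_model)
        pe = np.zeros((seq_len, d_model))
        pe[:, 0::2] = np.sin(angles)
        pe[:, 1::2] = np.cos(angles)
        return pe

    # the encoding is simply added to the input embeddings (toy shapes assumed)
    embeddings = np.zeros((4, 8))
    inputs_with_position = embeddings + sinusoidal_positional_encoding(4, 8)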

  34. Transformer for Encoder-Decoder

  35. Transformer for Encoder-Decoder Residualized Connections

  36. Transformer for Encoder-Decoder. Residualized Connections: residuals enable positional information to be passed along.
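A residualized connection is just the sub-layer's input added back to its output, which is how positional information (and anything else carried by the input) gets passed along unchanged. A minimal sketch, reusing the multi-head attention function, H, heads, and W_o from the sketches above (the real model also applies layer normalization here, which the slides do not show):

    def residual(sublayer, x):
        """Wrap a sub-layer (e.g. self-attention or the FFN) with a skip connection."""
        return x + sublayer(x)   # the input, including positional information, passes through

    out = residual(lambda X: multi_head_self_attention(X, heads, W_o), H)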

  37. Transformer for Encoder-Decoder

  38. Transformer for Encoder-Decoder: essentially, a language model

  39. Transformer for Encoder-Decoder: essentially a language model; the decoder blocks out future inputs.
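The decoder behaves like a language model because each position is blocked from attending to future positions. The standard way to do this, assumed here rather than read off the slide, is to add -inf to the future entries of the score matrix before the softmax:

    import numpy as np

    def causal_mask(seq_len):
        """mask[i, j] = 0 where j <= i (visible past), -inf where j > i (blocked future)."""
        return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

    def masked_attention_weights(Q, K):
        """Scaled dot-product scores with future positions masked out before the softmax."""
        scores = Q @ K.T / np.sqrt(K.shape[-1]) + causal_mask(Q.shape[0])
        e = np.exp(scores - scores.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)   # each row gives weight 0 to the future

    # toy check (assumed shapes): weights above the diagonal come out exactly 0
    rng = np.random.default_rng(0)
    Q, K = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
    print(np.round(masked_attention_weights(Q, K), 2))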

  40. Transformer for Encoder-Decoder: essentially a language model, with conditioning of the LM on the encoder added.

  41. Transformer for Encoder-Decoder

  42. Transformer (as of 2017) “WMT-2014” Data Set. BLEU scores:

  43. Transformer ● Utilizes Self-Attention ● Simple attention scoring function (dot product, scaled) ● Added linear layers for Q, K, and V ● Multi-head attention ● Added positional encoding ● Added residual connections ● Simulates decoding by masking
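To tie the checklist together, here is a toy encoder layer assembled from the earlier sketches: multi-head self-attention and a position-wise feed-forward sub-layer, each wrapped in a residual connection. It reuses multi_head_self_attention, residual, rng, H, heads, and W_o from the sketches above; it omits layer normalization and dropout (not covered on the slides), and the ReLU and the FFN inner size are assumptions.

    import numpy as np

    def feed_forward(H, W1, b1, W2, b2):
        """Position-wise FFN, applied to each position independently."""
        return np.maximum(0, H @ W1 + b1) @ W2 + b2        # ReLU nonlinearity (assumed)

    def encoder_layer(H, heads, W_o, ffn_weights):
        """Self-attention sub-layer and FFN sub-layer, each with a residual connection."""
        H = residual(lambda X: multi_head_self_attention(X, heads, W_o), H)
        return residual(lambda X: feed_forward(X, *ffn_weights), H)

    # toy weights (assumed): d_model = 8, FFN inner size 16
    ffn_weights = (rng.normal(size=(8, 16)), np.zeros(16),
                   rng.normal(size=(16, 8)), np.zeros(8))
    out = encoder_layer(H, heads, W_o, ffn_weights)        # shape (4, 8)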

  44. Transformer. Why? ● Don’t need the complexity of LSTM/GRU cells ● Constant number of edges between words (or input steps) ● Enables “interactions” (i.e. adaptations) between words ● Easy to parallelize -- no sequential processing needed. Drawbacks: ● Only unidirectional by default ● Only a “single-hop” relationship per layer (multiple layers needed to capture multiple)

  45. BERT: Bidirectional Encoder Representations from Transformers. Produces contextualized embeddings (or a pre-trained contextualized encoder). Why? ● Don’t need the complexity of LSTM/GRU cells ● Constant number of edges between words (or input steps) ● Enables “interactions” (i.e. adaptations) between words ● Easy to parallelize -- no sequential processing needed. Drawbacks of vanilla Transformers: ● Only unidirectional by default ● Only a “single-hop” relationship per layer (multiple layers needed to capture multiple)

  46. BERT (continuing the previous slide): ● Bidirectional context by “masking” in the middle ● A lot of layers, hidden states, attention heads.

  47. BERT (tokenizes into “word pieces”). Differences from the previous state of the art: ● Bidirectional transformer (through masking) ● Directions jointly trained at once ● Captures sentence-level relations (Devlin et al., 2019)
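For a concrete look at the “word pieces” and the masked-token setup, here is a short sketch using the Hugging Face transformers library; the library choice, the bert-base-uncased checkpoint, and the example sentences are assumptions for illustration, not something the slides prescribe.

    # assumes: pip install transformers torch
    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

    # word-piece tokenization: out-of-vocabulary words are split into "##"-prefixed pieces
    print(tokenizer.tokenize("The ball was kicked by Kayla."))

    # BERT's pre-training replaces some tokens with [MASK] and predicts them from both
    # left and right context ("masking" in the middle gives bidirectional training)
    inputs = tokenizer("Kayla [MASK] the ball.", return_tensors="pt")
    print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist()))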

  48. BERT: Attention by Layers https://colab.research.google.com/drive/1vlOJ1lhdujVjfH857hvYKIdKPTD9Kid8 (Vig, 2019)

  49. BERT Performance: e.g. Question Answering https://rajpurkar.github.io/SQuAD-explorer/

  50-51. BERT: Pre-training; Fine-tuning (12 or 24 layers)

  52. BERT: Pre-training; Fine-tuning. A novel classifier (e.g. a sentiment classifier, stance detector, etc.) is added on top of the 12- or 24-layer pre-trained encoder.
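One way to picture this fine-tuning setup: the pre-trained 12- or 24-layer encoder produces contextualized representations, and a small, newly initialized head is trained on top of them (here, reading the [CLS] position). A PyTorch/Hugging Face sketch; the 3-label head and the bert-base-uncased checkpoint are assumptions for illustration.

    import torch
    from transformers import BertModel, BertTokenizer

    class BertClassifier(torch.nn.Module):
        """Pre-trained encoder plus a novel task head (e.g. sentiment or stance)."""
        def __init__(self, num_labels=3):
            super().__init__()
            self.bert = BertModel.from_pretrained("bert-base-uncased")   # 12 layers
            self.head = torch.nn.Linear(self.bert.config.hidden_size, num_labels)

        def forward(self, **inputs):
            hidden = self.bert(**inputs).last_hidden_state   # (batch, seq_len, hidden)
            return self.head(hidden[:, 0])                   # classify from the [CLS] position

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertClassifier()
    logits = model(**tokenizer("The ball was kicked by Kayla.", return_tensors="pt"))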

  53. The Transformer: “Attention-only” models. Can handle sequences and long-distance dependencies, but: ● Don’t want the complexity of LSTM/GRU cells ● Constant number of edges between input steps ● Enables “interactions” (i.e. adaptations) between words ● Easy to parallelize -- no sequential processing needed.
