
Transformer Sequence Models
CSE354 - Spring 2020, Natural Language Processing

Most NLP tasks are sequence tasks, e.g. language modeling, machine translation, speech. Covered here: transformer networks and BERT.


  1. The Transformer: “Attention-only” models — [diagram: inputs w_{i-1}…w_{i+2} give hidden states h_{i-1}…h_{i+2}; dot products between hidden states give weights α, and each output y_i is the α-weighted sum of the hidden states]

  2. The Transformer: “Attention-only” models — [diagram: the attention weights are now α = σ(kᵀq), the dot product of key and query divided by a scaling parameter and passed through a softmax σ]

  3. The Transformer: “Attention-only” models — [diagram: keys, queries, and values come from linear layers (WᵀX) over the hidden states, with one set of weights each for K, Q, and V; α = σ(kᵀq) as before]
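The diagram on slides 1–3 can be written out in a few lines. Below is a minimal sketch (my own, not the course's code), assuming PyTorch and illustrative dimension names: keys, queries, and values each get their own linear layer, and each output is the softmaxed, scaled dot-product weighted sum of the values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SingleHeadSelfAttention(nn.Module):
    """Scaled dot-product self-attention with one set of weights each for K, Q, V."""
    def __init__(self, d_model):
        super().__init__()
        self.d_model = d_model
        self.W_q = nn.Linear(d_model, d_model)   # query projection
        self.W_k = nn.Linear(d_model, d_model)   # key projection
        self.W_v = nn.Linear(d_model, d_model)   # value projection

    def forward(self, h):
        # h: (batch, seq_len, d_model) -- the hidden states h_{i-1}, h_i, ...
        q, k, v = self.W_q(h), self.W_k(h), self.W_v(h)
        # dot product of every query with every key, scaled by sqrt(d_model)
        scores = q @ k.transpose(-2, -1) / (self.d_model ** 0.5)
        alpha = F.softmax(scores, dim=-1)         # attention weights α
        return alpha @ v                          # y_i = Σ_j α_ij v_j

# usage: y = SingleHeadSelfAttention(64)(torch.randn(2, 5, 64))
```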

  4. The Transformer: “Attention-only” models — Why?
     ● Don’t need the complexity of LSTM/GRU cells
     ● Constant number of edges between words (or input steps)
     ● Enables “interactions” (i.e. adaptations) between words
     ● Easy to parallelize -- no sequential processing needed

  5. The Transformer Limitation (thus far): Can’t capture multiple types of dependencies between words.

  6. The Transformer Solution: Multi-head attention

  7. Multi-head Attention
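A hedged sketch of multi-head attention under the same assumptions (PyTorch, illustrative names): the single-head attention above is run over several lower-dimensional projections in parallel, so each head can capture a different type of dependency; the heads are then concatenated and mixed by an output layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)   # mixes the concatenated heads

    def split(self, x):
        # (batch, seq, d_model) -> (batch, heads, seq, d_head)
        b, t, _ = x.shape
        return x.view(b, t, self.n_heads, self.d_head).transpose(1, 2)

    def forward(self, h):
        q, k, v = self.split(self.W_q(h)), self.split(self.W_k(h)), self.split(self.W_v(h))
        scores = q @ k.transpose(-2, -1) / (self.d_head ** 0.5)
        alpha = F.softmax(scores, dim=-1)            # one α matrix per head
        out = alpha @ v                              # (batch, heads, seq, d_head)
        out = out.transpose(1, 2).reshape(h.shape)   # re-concatenate the heads
        return self.W_o(out)
```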

  8. Transformer for Encoder-Decoder

  9. Transformer for Encoder-Decoder — [diagram annotation: sequence index (t)]

  10. Transformer for Encoder-Decoder

  11. Transformer for Encoder-Decoder Residualized Connections

  12. Transformer for Encoder-Decoder — Residualized Connections: residuals enable positional information to be passed along
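A minimal sketch of what a residualized (add-&-norm) connection around a sublayer might look like, assuming the original Transformer's LayerNorm(x + Sublayer(x)) arrangement; class and argument names are illustrative.

```python
import torch.nn as nn

class ResidualSublayer(nn.Module):
    """Wraps any sublayer (attention or feed-forward) with add-&-norm:
    output = LayerNorm(x + sublayer(x))."""
    def __init__(self, sublayer, d_model):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        # The "+ x" is the residual: positional (and any other) information
        # in x is passed along even if the sublayer ignores it.
        return self.norm(x + self.sublayer(x))
```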

  13. Transformer for Encoder-Decoder

  14. Transformer for Encoder-Decoder — essentially, a language model

  15. Transformer for Encoder-Decoder — essentially, a language model; the decoder blocks out future inputs
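A sketch of how a decoder can block out future inputs: before the softmax, scores for positions to the right of the current one are set to −inf, so each position attends only to itself and earlier positions. Function and variable names are illustrative, not from the slides.

```python
import torch
import torch.nn.functional as F

def masked_self_attention(q, k, v):
    """q, k, v: (batch, seq_len, d). Causal (left-to-right) attention."""
    seq_len, d = q.shape[-2], q.shape[-1]
    scores = q @ k.transpose(-2, -1) / (d ** 0.5)
    # upper-triangular mask: position i may not look at positions j > i
    future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool,
                                   device=q.device), diagonal=1)
    scores = scores.masked_fill(future, float('-inf'))
    return F.softmax(scores, dim=-1) @ v
```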

  16. Transformer for Encoder-Decoder — essentially, a language model; add conditioning of the LM based on the encoder
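One common way to condition the decoder LM on the encoder is a cross-attention sublayer: queries come from the decoder states, while keys and values come from the encoder outputs. A sketch under the same PyTorch assumptions, with illustrative names:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttention(nn.Module):
    """Decoder-side attention over the encoder's outputs."""
    def __init__(self, d_model):
        super().__init__()
        self.d_model = d_model
        self.W_q = nn.Linear(d_model, d_model)  # queries from the decoder
        self.W_k = nn.Linear(d_model, d_model)  # keys from the encoder
        self.W_v = nn.Linear(d_model, d_model)  # values from the encoder

    def forward(self, dec_states, enc_outputs):
        q = self.W_q(dec_states)
        k, v = self.W_k(enc_outputs), self.W_v(enc_outputs)
        scores = q @ k.transpose(-2, -1) / (self.d_model ** 0.5)
        return F.softmax(scores, dim=-1) @ v
```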

  17. Transformer for Encoder-Decoder

  18. Transformer (as of 2017) — BLEU scores on the “WMT-2014” data set. [Results table not transcribed.]

  19. Transformer
     ● Utilizes self-attention
     ● Simple attention scoring function (dot product, scaled)
     ● Added linear layers for Q, K, and V
     ● Multi-head attention
     ● Added positional encoding
     ● Added residual connections
     ● Simulates decoding by masking

  20. Transformer — Why?
     ● Don’t need the complexity of LSTM/GRU cells
     ● Constant number of edges between words (or input steps)
     ● Enables “interactions” (i.e. adaptations) between words
     ● Easy to parallelize -- no sequential processing needed
     Drawbacks:
     ● Only unidirectional by default
     ● Only a “single-hop” relationship per layer (multiple layers needed to capture multiple)

  21. BERT: Bidirectional Encoder Representations from Transformers
     Produces contextualized embeddings (or a pre-trained contextualized encoder).
     Why transformers?
     ● Don’t need the complexity of LSTM/GRU cells
     ● Constant number of edges between words (or input steps)
     ● Enables “interactions” (i.e. adaptations) between words
     ● Easy to parallelize -- no sequential processing needed
     Drawbacks of vanilla Transformers:
     ● Only unidirectional by default
     ● Only a “single-hop” relationship per layer (multiple layers needed to capture multiple)

  22. BERT: Bidirectional Encoder Representations from Transformers
     Produces contextualized embeddings (or a pre-trained contextualized encoder).
     ● Bidirectional context by “masking” in the middle
     ● A lot of layers, hidden states, attention heads.
     Why transformers?
     ● Don’t need the complexity of LSTM/GRU cells
     ● Constant number of edges between words (or input steps)
     ● Enables “interactions” (i.e. adaptations) between words
     ● Easy to parallelize -- no sequential processing needed
     Drawbacks of vanilla Transformers:
     ● Only unidirectional by default
     ● Only a “single-hop” relationship per layer (multiple layers needed to capture multiple)

  23. BERT: Bidirectional Encoder Representations from Transformers
     Produces contextualized embeddings (or a pre-trained contextualized encoder).
     ● Bidirectional context by “masking” in the middle
     ● A lot of layers, hidden states, attention heads.
     She saw the man on the hill with the telescope.
     She [mask] the man on the hill [mask] the telescope.

  24. BERT: Bidirectional Encoder Representations from Transformers
     Produces contextualized embeddings (or a pre-trained contextualized encoder).
     ● Bidirectional context by “masking” in the middle
     ● A lot of layers, hidden states, attention heads.
     She saw the man on the hill with the telescope.
     She [mask] the man on the hill [mask] the telescope.
     Mask 1 in 7 words:
     ● Too few: expensive, less robust
     ● Too many: not enough context
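A toy sketch of the masking step described above: replace roughly 1 in 7 tokens with a [MASK] symbol and train the model to predict the originals. The helper name and the simplification of always substituting [MASK] are my assumptions; BERT's actual recipe sometimes keeps or randomly replaces the selected tokens.

```python
import random

MASK_RATE = 1 / 7   # the slide's rule of thumb: too few masks is expensive /
                    # less robust, too many leaves not enough context

def mask_tokens(tokens, mask_rate=MASK_RATE, mask_token="[MASK]"):
    """Return (masked_tokens, targets); targets is None wherever nothing was masked."""
    masked, targets = [], []
    for tok in tokens:
        if random.random() < mask_rate:
            masked.append(mask_token)
            targets.append(tok)        # the model is trained to recover this token
        else:
            masked.append(tok)
            targets.append(None)
    return masked, targets

# e.g. mask_tokens("She saw the man on the hill with the telescope .".split())
```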

  25. BERT: Bidirectional Encoder Representations from Transformers
     Produces contextualized embeddings (or a pre-trained contextualized encoder).
     ● Bidirectional context by “masking” in the middle
     ● A lot of layers, hidden states, attention heads.
     ● BERT-Base, Cased: 12-layer, 768-hidden, 12-heads, 110M parameters

  26. BERT: Bidirectional Encoder Representations from Transformers
     Produces contextualized embeddings (or a pre-trained contextualized encoder).
     ● Bidirectional context by “masking” in the middle
     ● A lot of layers, hidden states, attention heads.
     ● BERT-Base, Cased: 12-layer, 768-hidden, 12-heads, 110M parameters
     ● BERT-Large, Cased: 24-layer, 1024-hidden, 16-heads, 340M parameters
     ● BERT-Base, Multilingual Cased: 104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
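These pre-trained checkpoints are easiest to try through a library; a minimal sketch assuming the Hugging Face transformers package (a reasonably recent version, and not part of the slides):

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")   # 12-layer, 110M params
model = BertModel.from_pretrained("bert-base-cased")

inputs = tokenizer("She saw the man on the hill with the telescope.",
                   return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# one contextualized 768-dimensional vector per word piece
print(outputs.last_hidden_state.shape)   # (1, num_word_pieces, 768)
```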

  27. BERT (Devlin et al., 2019)

  28. BERT Differences from previous state of the art: ● Bidirectional transformer (through masking) ● Directions jointly trained at once. (Devlin et al., 2019)

  29. BERT Differences from previous state of the art: ● Bidirectional transformer (through masking) ● Directions jointly trained at once. ● Capture sentence-level relations (Devlin et al., 2019)

  30. BERT Differences from previous state of the art: ● Bidirectional transformer (through masking) ● Directions jointly trained at once. ● Capture sentence-level relations (Devlin et al., 2019)

  31. BERT Differences from previous state of the art: ● Bidirectional transformer (through masking) ● Directions jointly trained at once. ● Capture sentence-level relations (Devlin et al., 2019)

  32. BERT Differences from previous state of the art: ● Bidirectional transformer (through masking) ● Directions jointly trained at once. ● Capture sentence-level relations (Devlin et al., 2019)

  33. BERT — tokenize into “word pieces”. Differences from previous state of the art: ● Bidirectional transformer (through masking) ● Directions jointly trained at once. ● Capture sentence-level relations (Devlin et al., 2019)
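To see word-piece tokenization directly, again assuming the Hugging Face tokenizer for the cased base model:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

tokens = tokenizer.tokenize("She saw the man on the hill with the telescope.")
# Words missing from the WordPiece vocabulary are broken into sub-word pieces;
# continuation pieces are prefixed with "##" (the exact splits depend on the
# learned vocabulary).
print(tokens)
```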

  34. BERT Performance: e.g. Question Answering https://rajpurkar.github.io/SQuAD-explorer/

  35. BERT: Attention by Layers https://colab.research.google.com/drive/1vlOJ1lhdujVjfH857hvYKIdKPTD9Kid8 (Vig, 2019)

  36. BERT: Pre-training; Fine-tuning 12 or 24 layers

  37. BERT: Pre-training; Fine-tuning 12 or 24 layers

  38. BERT: Pre-training; Fine-tuning — 12 or 24 layers, topped with a novel classifier (e.g. sentiment classifier; stance detector; etc.)

  39. BERT: Pre-training; Fine-tuning — Novel classifier (e.g. sentiment classifier; stance detector; etc.); the [CLS] vector at the start is supposed to capture the meaning of the whole sequence.

  40. BERT: Pre-training; Fine-tuning — Novel classifier (e.g. sentiment classifier; stance detector; etc.); the [CLS] vector at the start is supposed to capture the meaning of the whole sequence. The average of the top layer (or second-to-top layer) is also often used.
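A sketch of the two pooling choices from slides 39–40, assuming the Hugging Face BertModel (the classifier head and names are illustrative): take the [CLS] vector from the top layer, or average the top (or second-to-top) layer over tokens, then feed the result to the novel classifier.

```python
import torch
import torch.nn as nn
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
bert = BertModel.from_pretrained("bert-base-cased", output_hidden_states=True)
classifier = nn.Linear(768, 2)   # the "novel classifier", e.g. 2-way sentiment

inputs = tokenizer("She saw the man on the hill with the telescope.",
                   return_tensors="pt")
outputs = bert(**inputs)

# Option 1: [CLS] vector at the start of the top layer
cls_vec = outputs.last_hidden_state[:, 0, :]        # (1, 768)

# Option 2: average of the top layer over tokens
# (outputs.hidden_states[-2] would give the second-to-top layer instead)
avg_vec = outputs.last_hidden_state.mean(dim=1)     # (1, 768)

logits = classifier(cls_vec)   # fine-tune bert + classifier end-to-end
```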

  41. Extra Material:
