Transformer Sequence Models and Sequence Applications (Machine Translation, Speech Recognition)


  1. Transformer Sequence Models and Sequence Applications (Machine Translation, Speech Recognition) CSE392 - Spring 2019 Special Topic in CS

  2. Outline: ● Transformer Networks ○ Transformers ○ BERT ● Sequence Tasks (most NLP tasks), e.g. ○ Language Modeling ○ Machine Translation ○ Speech Recognition

  3. Multi-level bidirectional RNN (LSTM or GRU) (Eisenstein, 2018)

  4. Multi-level bidirectional RNN (LSTM or GRU): each node has a forward (→) and a backward (←) hidden state, which can be represented as a concatenation of the two. (Eisenstein, 2018)

  5. Multi-level bidirectional RNN (LSTM or GRU): the average of the top layer is an embedding (the average of the concatenated vectors). (Eisenstein, 2018)

  6. Multi-level bidirectional RNN (LSTM or GRU): sometimes the left-most and right-most hidden states are used instead. (Eisenstein, 2018)
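
A minimal PyTorch sketch of slides 3-6 (not code from the lecture; all sizes and variable names are invented for illustration): a 2-layer bidirectional LSTM, the averaged top-layer embedding, and the left-most/right-most alternative.

    import torch
    import torch.nn as nn

    # Hypothetical sizes; chosen only for illustration.
    vocab_size, emb_dim, hid_dim = 1000, 64, 128

    embed = nn.Embedding(vocab_size, emb_dim)
    birnn = nn.LSTM(emb_dim, hid_dim, num_layers=2,
                    bidirectional=True, batch_first=True)

    tokens = torch.randint(0, vocab_size, (1, 7))      # (batch=1, seq_len=7)
    states, _ = birnn(embed(tokens))                   # (1, 7, 2*hid_dim): forward and
                                                       # backward states concatenated (slide 4)
    sent_avg = states.mean(dim=1)                      # slide 5: average of the top layer
    sent_ends = torch.cat([states[:, -1, :hid_dim],    # slide 6: right-most forward state
                           states[:, 0, hid_dim:]],    # and left-most backward state
                          dim=-1)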

  7. Encoder: a representation of the input. (Eisenstein, 2018)

  8. Encoder-Decoder: representing the input and converting it to output. (Eisenstein, 2018)

  9. Encoder-Decoder: the decoder generates outputs y(0), y(1), y(2), y(3), … through a softmax layer. (Eisenstein, 2018)

  10. Encoder-Decoder: decoding starts from a <go> token, and each predicted output y(0), y(1), y(2), … is fed back in as the next decoder input.

  11. Encoder-Decoder: the encoder produces a representation of the input; the decoder, starting from <go>, generates y(0), y(1), y(2), y(3), … through a softmax.

  12. Encoder-Decoder: the decoder is essentially a language model conditioned on the final state from the encoder.

  13. Encoder-Decoder, when applied to new data: the decoder again runs from <go>, as a language model conditioned on the final state from the encoder.

  14. Encoder-Decoder (recap of the full picture): an encoder representation of the input, and a decoder generating y(0)…y(3) from <go> through a softmax.

  15. Encoder-Decoder as a "seq2seq" model: the encoder reads Language 1 (e.g. Chinese) and the decoder generates Language 2 (e.g. English).
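
Slides 9-15 amount to a decoder that is a language model conditioned on the encoder's final state, started from <go> and fed its own predictions. A rough PyTorch sketch of that loop; the module names, sizes, and the greedy decoding choice are mine, not the lecture's:

    import torch
    import torch.nn as nn

    class Seq2Seq(nn.Module):
        # Minimal illustrative seq2seq; all names and sizes are hypothetical.
        def __init__(self, src_vocab, tgt_vocab, emb_dim=64, hid_dim=128):
            super().__init__()
            self.src_emb = nn.Embedding(src_vocab, emb_dim)
            self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
            self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
            self.decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
            self.out = nn.Linear(hid_dim, tgt_vocab)       # logits, then softmax

        def greedy_decode(self, src, go_id, max_len=20):
            # Encoder: compress the source into its final hidden state.
            _, h = self.encoder(self.src_emb(src))
            # Decoder: a language model conditioned on that state (slide 12),
            # starting from <go> and feeding each prediction back in (slide 10).
            y = torch.full((src.size(0), 1), go_id, dtype=torch.long)
            outputs = []
            for _ in range(max_len):
                dec_out, h = self.decoder(self.tgt_emb(y), h)
                y = self.out(dec_out).argmax(-1)
                outputs.append(y)
            return torch.cat(outputs, dim=1)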

  16. Encoder-Decoder challenge: long-distance dependencies when translating.

  17. Encoder-Decoder challenge: long-distance dependencies when translating (continued).

  18. Encoder-Decoder challenge, long-distance dependencies when translating: e.g. "Kayla kicked the ball." ↔ "The ball was kicked by Kayla."

  19. Encoder-Decoder challenge (e.g. "Kayla kicked the ball." ↔ "The ball was kicked by Kayla."): a lot of responsibility is placed on the fixed-size hidden state passed from the encoder to the decoder.

  20. Long-distance / out-of-order dependencies: a lot of responsibility is placed on the fixed-size hidden state passed from the encoder to the decoder.

  21. Long-distance / out-of-order dependencies (continued).

  22. Attention: the decoder attends over the encoder hidden states s_1, s_2, s_3, s_4 while generating y(0), y(1), y(2), y(3), …

  23. Attention, an analogy: random access memory over the encoder states s_1…s_4.

  24. Attention: an attention layer sits between the encoder states s_1…s_4 and the decoder.

  25. Attention: the attention layer computes a context vector c_{h_i} for the current decoder state h_i from the input states (s_n, with values z_n); i indexes the current output token, n indexes the input tokens.

  26. Attention: the context vector c_{h_i} is a weighted combination of s_1…s_4, with weights α_{h_i→s_1}, α_{h_i→s_2}, α_{h_i→s_3}, α_{h_i→s_4}.

  27. Attention: the same weights can be applied to value vectors z_1…z_4. Z is the vector to be attended to (the value in memory); it is typically the hidden states of the input (i.e. s_n) but can be anything.

  28. Attention (recap): c_{h_i} as a weighted combination of s_1…s_4 with weights α_{h_i→s_n}.

  29. Attention: the weights α_{h_i→s_n} come from a score function ω that compares the decoder state h_i to each s_n.

  30. Attention: the score function ω(h_i, s_n) has parameters v, W_h, W_s.

  31. Attention: a useful abstraction is to make the vector attended to (the "value vector", z_n) separate from the "key vector" (s_n).

  32. Attention as query, keys, and values: the decoder state h_i is the query, the s_n are the keys, and the z_n are the values; the score function ω (parameters v, W_h, W_s) compares the query to each key.
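
Putting slides 25-32 together, the computation is roughly the following: the weights are a softmax over scores, and the context vector is the weighted sum of the values. The tanh form of the score function is an assumption on my part, consistent with the parameters v, W_h, W_s and with additive attention (Bahdanau et al., 2015):

    \alpha_{h_i \rightarrow s_n} = \frac{\exp \omega(h_i, s_n)}{\sum_{n'} \exp \omega(h_i, s_{n'})}, \qquad
    c_{h_i} = \sum_{n} \alpha_{h_i \rightarrow s_n} \, z_n, \qquad
    \omega(h_i, s_n) = v^{\top} \tanh(W_h h_i + W_s s_n)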

  33. Attention: alternative scoring functions for ω(h_i, s_n).

  34. Attention, alternative scoring functions: if the variables are standardized, a matrix multiply (dot product) produces a similarity score.
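
A small runnable Python (numpy) sketch of the weighted-sum view in slides 26-34 with a dot-product score; the array names and sizes are illustrative only:

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    d = 4
    h_i = np.random.randn(d)     # query: current decoder state
    S = np.random.randn(5, d)    # keys:  encoder states s_1..s_5
    Z = S                        # values: typically the input hidden states (slide 27)

    scores = S @ h_i             # dot-product similarity (slide 34)
    alpha = softmax(scores)      # attention weights alpha_{h_i -> s_n}
    c_hi = alpha @ Z             # context vector: weighted sum of the values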

  35. Attention: figure showing h_i attending over the states s_1…s_4 ("synced", 2017).

  36. Attention: figure (continued) ("synced", 2017).

  37. Attention figure, from Bahdanau et al. (2015) ("synced", 2017).

  38. Attention figure, from Bahdanau et al. (2015) (continued) ("synced", 2017).

  39. Machine Translation. Why? ● A $40 billion/year industry ● A centerpiece of many genres of science fiction ● A fairly "universal" problem: ○ language understanding ○ language generation ● Societal benefits of intercultural communication

  40. Machine Translation. Why? ● A $40 billion/year industry ● A centerpiece of many genres of science fiction ● A fairly "universal" problem: ○ language understanding ○ language generation ● Societal benefits of intercultural communication (Douglas Adams)

  41. Machine Translation. Why does the neural network approach work? (Manning, 2018) ● Joint end-to-end training: learning all parameters at once ● Exploiting distributed representations (embeddings) ● Exploiting variable-length context ● High-quality generation from deep decoders, i.e. stronger language models (even when wrong, the outputs make sense)

  42. Machine Translation As an optimization problem (Eisenstein, 2018):
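
A common way to state machine translation as an optimization problem, which the slide's formula presumably resembles (this is my paraphrase of the usual formulation, following Eisenstein, 2018, not the slide's exact equation): pick the translation ŷ that maximizes a scoring function Ψ, which for neural models is the conditional log-likelihood:

    \hat{y} = \operatorname*{argmax}_{y} \Psi(x, y), \qquad
    \Psi(x, y) = \log p(y \mid x) = \sum_{t} \log p\big(y^{(t)} \mid y^{(1:t-1)}, x\big)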

  43. Attention (recap): figure showing h_i attending over the states s_1…s_4 ("synced", 2017).

  44. Attention (recap), analogy: random access memory over the encoder states s_1…s_4.

  45. Attention: do we even need all these RNNs? (Vaswani et al., 2017: "Attention Is All You Need")

  46. Attention as query, keys, and values (recap of slide 32).

  47. Attention as query, keys, and values, general form: query h_i, keys s_j, values z_j; again, a useful abstraction is to make the vector attended to (the "value vector", z) separate from the "key vector" (s). (Eisenstein, 2018)

  48. The Transformer, "attention-only" models: attention as weighting a value based on a query and a key. (Eisenstein, 2018)

  49. The Transformer, "attention-only" models: the output at each position is a weighted (α, via score function ω) combination over the hidden states h_{i-1}, h_i, h_{i+1}. (Eisenstein, 2018)

  50. The Transformer: self-attention, where each hidden state h_i attends over the neighboring hidden states h_{i-1}, h_i, h_{i+1}. (Eisenstein, 2018)

  51. The Transformer: self-attention over the hidden states h_{i-1}, h_i, h_{i+1}, h_{i+2}.

  52. The Transformer: word inputs w_{i-1}…w_{i+2} feed hidden states h_{i-1}…h_{i+2}; the attention output passes through a feed-forward network (FFN).

  53. The Transformer: the full stack maps word inputs w_{i-1}…w_{i+2} to outputs y_{i-1}…y_{i+2}.

  54. The Transformer: the same structure extends across the whole sequence (…).

  55. The Transformer: each position attends to all hidden states in its "neighborhood".

  56. The Transformer: attention weights come from the dot product kᵀq of keys and queries; the weighted values are summed (+) to form the output.

  57. The Transformer: the dot product is divided by a scaling parameter, ω(k, q) = (kᵀq)/σ.

  58. The Transformer: the keys, queries, and values come from linear layers (WᵀX), with one set of weights each for K, Q, and V; scores are the scaled dot products ω(k, q) = (kᵀq)/σ.
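
A minimal PyTorch sketch of the single-head scaled dot-product self-attention described in slides 56-58; the class and parameter names are mine, not the lecture's:

    import torch
    import torch.nn as nn

    class SelfAttention(nn.Module):
        # Single-head scaled dot-product self-attention (slides 56-58).
        def __init__(self, d_model=64):
            super().__init__()
            # One set of linear-layer weights each for K, Q, and V (slide 58).
            self.W_k = nn.Linear(d_model, d_model, bias=False)
            self.W_q = nn.Linear(d_model, d_model, bias=False)
            self.W_v = nn.Linear(d_model, d_model, bias=False)
            self.scale = d_model ** 0.5              # scaling parameter (slide 57)

        def forward(self, h):                        # h: (batch, seq_len, d_model)
            K, Q, V = self.W_k(h), self.W_q(h), self.W_v(h)
            scores = Q @ K.transpose(-2, -1) / self.scale   # k^T q / sigma
            alpha = scores.softmax(dim=-1)           # attention weights
            return alpha @ V                         # weighted sum of values

    attn = SelfAttention()
    out = attn(torch.randn(2, 5, 64))                # every position attends to its neighborhood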

  59. The Transformer, "attention-only" models. Why? ● No need for the complexity of LSTM/GRU cells ● A constant number of edges between words (or input steps) ● Enables "interactions" (i.e. adaptations) between words ● Easy to parallelize: no sequential processing needed

  60. The Transformer. Limitation (thus far): it can't capture multiple types of dependencies between words.
