Lecture 12: Attention and Transformers, Julia Hockenmaier (PowerPoint PPT Presentation)




SLIDE 1

CS447: Natural Language Processing

http://courses.engr.illinois.edu/cs447

Julia Hockenmaier

juliahmr@illinois.edu 3324 Siebel Center

Lecture 12:
 Attention and Transformers

SLIDE 2


Attention Mechanisms


Lecture 12: 
 Attention and Transformers

SLIDE 3



Encoder-Decoder (seq2seq) model

Task: Read an input sequence 
 and return an output sequence

– Machine translation: translate source into target language
– Dialog system/chatbot: generate a response

Reading the input sequence: RNN Encoder
Generating the output sequence: RNN Decoder


SLIDE 4


A more general view of seq2seq

Insight 1: In general, any function of the encoder’s output can be used as a representation of the context we want to condition the decoder on.

Insight 2: We can feed the context in at any time step during decoding (not just at the beginning).


SLIDE 5


Adding attention to the decoder

Basic idea: Feed a d-dimensional representation of the entire (arbitrary-length) input sequence into the decoder at each time step during decoding.

This representation of the input can be a weighted average of the encoder’s representation of the input (i.e. its output). The weights of each encoder output element tell us how much attention we should pay to different parts of the input sequence.

Since different parts of the input may be more or less important for different parts of the output, we want to vary the weights over the input during the decoding process.

(Cf. word alignments in machine translation)


SLIDE 6


Adding attention to the decoder

We want to condition the output generation of the decoder on a context-dependent representation of the input sequence.

Attention computes a probability distribution over the encoder’s hidden states that depends on the decoder’s current hidden state. (This distribution is computed anew for each output symbol.)

This attention distribution is used to compute a weighted average of the encoder’s hidden state vectors. This context-dependent embedding of the input sequence is fed into the output of the decoder RNN.


SLIDE 7


Attention, more formally

Define a probability distribution $\alpha^{(t)} = (\alpha^{(t)}_1, \ldots, \alpha^{(t)}_S)$ over the S elements of the input sequence that depends on the current output element t.

Use this distribution to compute a weighted average of the encoder’s outputs, $\sum_{s=1..S} \alpha^{(t)}_s o_s$, or hidden states, $\sum_{s=1..S} \alpha^{(t)}_s h_s$, and feed that into the decoder.


https://www.tensorflow.org/tutorials/text/nmt_with_attention
SLIDE 8


Attention, more formally

1. Compute a probability distribution $\alpha^{(t)} = (\alpha^{(t)}_1, \ldots, \alpha^{(t)}_S)$ over the encoder’s hidden states $h^{(s)}$ that depends on the decoder’s current hidden state $h^{(t)}$:
$\alpha^{(t)}_s = \frac{\exp(s(h^{(t)}, h^{(s)}))}{\sum_{s'} \exp(s(h^{(t)}, h^{(s')}))}$

2. Use $\alpha^{(t)}$ to compute a weighted avg. $c^{(t)}$ of the encoder’s $h^{(s)}$:
$c^{(t)} = \sum_{s=1..S} \alpha^{(t)}_s h^{(s)}$

3. Use both $c^{(t)}$ and $h^{(t)}$ to compute a new output $o^{(t)}$, e.g. as
$o^{(t)} = \tanh(W_1 h^{(t)} + W_2 c^{(t)})$

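A minimal NumPy sketch of these three steps for a single decoder time step (function and variable names are illustrative, and the dot-product score used here is only one of the scoring functions defined on the next slide):

import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_step(h_dec, H_enc, W1, W2):
    """One decoder time step of seq2seq attention.

    h_dec:  (d,)   current decoder hidden state h(t)
    H_enc:  (S, d) encoder hidden states h(1)..h(S)
    W1, W2: (d, d) learned combination matrices
    Returns the new output o(t) and the attention distribution alpha(t).
    """
    # 1. Score each encoder state against the decoder state, then normalize
    scores = H_enc @ h_dec            # s(h(t), h(s)) for s = 1..S (dot-product score)
    alpha = softmax(scores)           # attention distribution alpha(t)

    # 2. Weighted average of the encoder states: context vector c(t)
    c = alpha @ H_enc                 # (d,)

    # 3. Combine decoder state and context: o(t) = tanh(W1 h(t) + W2 c(t))
    o = np.tanh(W1 @ h_dec + W2 @ c)
    return o, alpha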

SLIDE 9


Defining Attention Weights

Hard attention (degenerate case, non-differentiable):
$\alpha^{(t)} = (\alpha^{(t)}_1, \ldots, \alpha^{(t)}_S)$ is a one-hot vector
(e.g. 1 = most similar element to decoder’s vector, 0 = all other elements)

Soft attention (general case):
$\alpha^{(t)} = (\alpha^{(t)}_1, \ldots, \alpha^{(t)}_S)$ is not a one-hot vector

— Use the dot product (no learned parameters): $s(h^{(t)}, h^{(s)}) = h^{(t)} \cdot h^{(s)}$
— Learn a bilinear matrix W: $s(h^{(t)}, h^{(s)}) = (h^{(t)})^T W h^{(s)}$
— Learn separate weights for the hidden states: $s(h^{(t)}, h^{(s)}) = v^T \tanh(W_1 h^{(t)} + W_2 h^{(s)})$

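The three soft-attention scoring functions written out as small NumPy functions (a sketch; parameter names and shapes are assumptions):

import numpy as np

def score_dot(h_t, h_s):
    """Dot product: s(h(t), h(s)) = h(t) . h(s). No learned parameters."""
    return h_t @ h_s

def score_bilinear(h_t, h_s, W):
    """Bilinear: s(h(t), h(s)) = h(t)^T W h(s), with a learned matrix W."""
    return h_t @ W @ h_s

def score_additive(h_t, h_s, v, W1, W2):
    """Separate weights: s(h(t), h(s)) = v^T tanh(W1 h(t) + W2 h(s))."""
    return v @ np.tanh(W1 @ h_t + W2 @ h_s)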

SLIDE 10


Transformers

Vaswani et al., Attention is all you need, NIPS 2017


Lecture 12: 
 Attention and Transformers

SLIDE 11


Transformers

Sequence transduction model based on attention (no convolutions or recurrence):
— easier to parallelize than recurrent nets
— faster to train than recurrent nets
— captures more long-range dependencies than CNNs with fewer parameters

Transformers use stacked self-attention and position-wise, fully-connected layers for the encoder and decoder.

Transformers form the basis of BERT, GPT(2-3), and other state-of-the-art neural sequence models.


SLIDE 12


Seq2seq attention mechanisms

Define a probability distribution $\alpha^{(t)} = (\alpha^{(t)}_1, \ldots, \alpha^{(t)}_S)$ over the S elements of the input sequence that depends on the current output element t.

Use this distribution to compute a weighted average of the encoder’s outputs, $\sum_{s=1..S} \alpha^{(t)}_s o_s$, or hidden states, $\sum_{s=1..S} \alpha^{(t)}_s h_s$, and feed that into the decoder.


https://www.tensorflow.org/tutorials/text/nmt_with_attention
SLIDE 13


Self-Attention

Attention so far (in seq2seq architectures): In the decoder (which has access to the complete input sequence), compute attention weights over encoder positions that depend on each decoder position.

Self-attention: If the encoder has access to the complete input sequence, we can also compute attention weights over encoder positions that depend on each encoder position.


For each decoder position t: compute an attention weight for each encoder position s, then renormalize these weights (which depend on t) with softmax to get a new weighted average of the input sequence vectors.

[Figure: self-attention in the encoder]

SLIDE 14


Self-attention: Simple variant

Given T k-dimensional input vectors $x^{(1)} \ldots x^{(i)} \ldots x^{(T)}$, compute T k-dimensional output vectors $y^{(1)} \ldots y^{(i)} \ldots y^{(T)}$, where each output $y^{(i)}$ is a weighted average of the input vectors, and where the weights $w_{ij}$ depend on $x^{(i)}$ and $x^{(j)}$:
$y^{(i)} = \sum_{j=1..T} w_{ij} x^{(j)}$

Computing the weights $w_{ij}$ naively (no learned parameters):
Dot product: $w'_{ij} = \sum_k x^{(i)}_k x^{(j)}_k$
Followed by softmax: $w_{ij} = \frac{\exp(w'_{ij})}{\sum_{j'} \exp(w'_{ij'})}$

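A minimal NumPy sketch of this simple, parameter-free self-attention variant (names are illustrative):

import numpy as np

def simple_self_attention(X):
    """Self-attention with naive dot-product weights (no learned parameters).

    X: (T, k) array of input vectors x(1)..x(T)
    Returns Y: (T, k), where y(i) is a weighted average of all x(j).
    """
    scores = X @ X.T                               # w'_ij = x(i) . x(j)
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    W = np.exp(scores)
    W = W / W.sum(axis=1, keepdims=True)           # softmax over j for each i
    return W @ X                                   # y(i) = sum_j w_ij x(j)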

SLIDE 15


Towards more flexible self-attention

To compute $y^{(i)} = \sum_{j=1..T} w_{ij} x^{(j)}$, we must…
… take the element $x^{(i)}$ …
… decide the weight $w_{ij}$ of each $x^{(j)}$ depending on $x^{(i)}$
… average all elements $x^{(j)}$ according to their weights

Observation 1: Dot product-based weights are large when $x^{(i)}$, $x^{(j)}$ are similar. But we may want a more flexible approach.
Idea 1: Learn attention weights $w_{ij}$ that depend on $x^{(i)}$ and $x^{(j)}$ in a manner that works best for the task.

Observation 2: This weighted average is still just a simple function of the original $x^{(j)}$s.
Idea 2: Learn weights that re-weight the elements of $x^{(j)}$ in a manner that works best for the task.


SLIDE 16


Self-attention with queries, keys, values

Let’s add learnable parameters (three $k \times k$ weight matrices $W_q$, $W_k$, $W_v$) that allow us to turn any input vector $x^{(i)}$ into three versions:
— Query vector $q^{(i)} = W_q x^{(i)}$: to compute the averaging weights at pos. i
— Key vector $k^{(i)} = W_k x^{(i)}$: to compute the averaging weights of pos. i
— Value vector $v^{(i)} = W_v x^{(i)}$: to compute the value of pos. i to be averaged

The attention weight of the j-th position used in the weighted average at the i-th position depends on the query of i and the key of j:
$w^{(i)}_j = \frac{\exp(q^{(i)} \cdot k^{(j)})}{\sum_{j'} \exp(q^{(i)} \cdot k^{(j')})} = \frac{\exp(\sum_l q^{(i)}_l k^{(j)}_l)}{\sum_{j'} \exp(\sum_l q^{(i)}_l k^{(j')}_l)}$

The new output vector for the i-th position depends on the attention weights and value vectors of all input positions j:
$y^{(i)} = \sum_{j=1..T} w^{(i)}_j v^{(j)}$

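A minimal NumPy sketch of single-head query/key/value self-attention as defined above (unscaled dot products, as on this slide; scaling by the dimension comes later; names are illustrative):

import numpy as np

def qkv_self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention with learned query/key/value projections.

    X:          (T, k) input vectors x(1)..x(T)
    Wq, Wk, Wv: (k, k) learned projection matrices
    Returns Y:  (T, k) output vectors y(1)..y(T)
    """
    Q = X @ Wq                        # queries q(i)
    K = X @ Wk                        # keys    k(j)
    V = X @ Wv                        # values  v(j)
    scores = Q @ K.T                  # q(i) . k(j) for all i, j
    scores = scores - scores.max(axis=1, keepdims=True)
    W = np.exp(scores)
    W = W / W.sum(axis=1, keepdims=True)   # attention weights w(i)_j (softmax over j)
    return W @ V                      # y(i) = sum_j w(i)_j v(j)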

SLIDE 17


Transformer Architecture

Non-Recurrent Encoder-Decoder architecture:
— No hidden states
— Context information captured via attention and positional encodings
— Consists of stacks of layers with various sublayers


SLIDE 18


Encoder

A stack of N=6 identical layers.
All layers and sublayers are 512-dimensional.

Each layer consists of two sublayers:
— one multi-head self-attention layer
— one position-wise feed-forward layer

Each sublayer is followed by an “Add & Norm” layer:
… a residual connection (the input $x$ is added to the output of the sublayer: $x + \mathrm{Sublayer}(x)$)
… followed by a normalization step (using the mean and standard deviation of its activations): $\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$

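A minimal NumPy sketch of the “Add & Norm” step (a simplification: this layer normalization omits the learned gain and bias parameters used in the full model):

import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize x using the mean and standard deviation of its activations."""
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

def add_and_norm(x, sublayer):
    """Residual connection followed by normalization: LayerNorm(x + Sublayer(x))."""
    return layer_norm(x + sublayer(x))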

SLIDE 19


Decoder

A stack of N=6 identical layers.
All layers and sublayers are 512-dimensional.

Each layer consists of three sublayers:
— one masked multi-head self-attention layer over the decoder output (masked, i.e. ignoring future tokens)
— one multi-head attention layer over the encoder output
— one position-wise feed-forward layer

Each sublayer has a residual connection and is normalized: $\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$

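A minimal NumPy sketch of the masked self-attention idea (single head, unscaled scores; names are illustrative): the mask simply prevents each position from attending to future tokens.

import numpy as np

def masked_self_attention(X, Wq, Wk, Wv):
    """Decoder-style self-attention: position i may only attend to positions j <= i."""
    T = X.shape[0]
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T                                    # raw scores, shape (T, T)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)  # True above the diagonal
    scores = np.where(future, -np.inf, scores)          # ignore future tokens
    scores = scores - scores.max(axis=1, keepdims=True)
    W = np.exp(scores)                                  # masked positions get weight 0
    W = W / W.sum(axis=1, keepdims=True)
    return W @ V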

SLIDE 20


Multi-head attention

Just like we use multiple filters (channels) in CNNs, 
 we can use multiple attention heads that each have their own sets of key/value/query matrices.


SLIDE 21


Multi-Head attention

— Learn h different linear projections of Q, K, V
— Compute attention separately on each of these h versions
— Concatenate the resultant vectors
— Project this concatenated vector back down to a lower dimensionality with a weight matrix W
— Each attention head can use relatively low dimensionality

$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h) W$
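A minimal NumPy sketch of multi-head self-attention (per-head dimensionalities, the head function, and the shape of the output projection W are illustrative assumptions):

import numpy as np

def single_head(X, Wq, Wk, Wv):
    """One attention head: softmax(Q K^T) V, as in the earlier single-head sketch."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T
    scores = scores - scores.max(axis=1, keepdims=True)
    A = np.exp(scores)
    A = A / A.sum(axis=1, keepdims=True)
    return A @ V

def multi_head_self_attention(X, heads, W):
    """Multi-head self-attention: run h heads, concatenate, project back down with W.

    X:     (T, k) input vectors
    heads: list of h tuples (Wq, Wk, Wv), one learned projection set per head
    W:     output projection applied to the concatenated head outputs
    """
    outputs = [single_head(X, Wq, Wk, Wv) for (Wq, Wk, Wv) in heads]
    return np.concatenate(outputs, axis=-1) @ W   # Concat(head_1, ..., head_h) W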

SLIDE 22


Scaling attention weights

The value of the dot product grows with the vector dimension k. To scale back the dot product, divide the unnormalized weights by $\sqrt{k}$ before normalization:

$w^{(i)}_j = \frac{\exp(q^{(i)} \cdot k^{(j)} / \sqrt{k})}{\sum_{j'} \exp(q^{(i)} \cdot k^{(j')} / \sqrt{k})}$

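The same single-head sketch with the scaling applied (only the division by sqrt(k) changes relative to the earlier unscaled version):

import numpy as np

def scaled_self_attention(X, Wq, Wk, Wv):
    """Query/key/value self-attention with dot products scaled by sqrt(k)."""
    k = Wk.shape[1]                           # key dimensionality
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = (Q @ K.T) / np.sqrt(k)           # scale back the dot products
    scores = scores - scores.max(axis=1, keepdims=True)
    W = np.exp(scores)
    W = W / W.sum(axis=1, keepdims=True)
    return W @ V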

SLIDE 23


Position-wise feedforward nets

Each layer in the encoder and decoder contains a feedforward sublayer FFN(x) that consists of…
… one fully connected layer with a ReLU activation (that projects the 512 elements to 2048 dimensions),
… followed by another fully connected layer (that projects these 2048 elements back down to 512 dimensions).

Here x is the vector representation of the current position. This is similar to 1x1 convolutions in a CNN.

$\mathrm{FFN}(x) = \max(0, xW_1 + b_1) W_2 + b_2$

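A minimal NumPy sketch of the position-wise feed-forward sublayer (dimensions follow the slide; weight names are assumptions):

import numpy as np

def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied to one position's vector x.

    W1: (512, 2048), b1: (2048,)  -- fully connected layer with ReLU
    W2: (2048, 512), b2: (512,)   -- projects back down to 512 dimensions
    """
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2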

SLIDE 24


Positional Encoding

How does this model capture sequence order?

Positional encodings have the same dimensionality as word embeddings (512) and are added in. Each dimension i is a sinusoid whose frequency depends on i, evaluated at position j (sinusoid = a sine or cosine function with a different frequency):

$PE_{(j,2i)} = \sin\!\left(\frac{j}{10000^{2i/d}}\right) \qquad PE_{(j,2i+1)} = \cos\!\left(\frac{j}{10000^{2i/d}}\right)$

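A minimal NumPy sketch of these sinusoidal positional encodings (assumes an even model dimension d); the resulting rows would be added to the word embeddings position by position:

import numpy as np

def positional_encoding(max_len, d=512):
    """PE(j, 2i) = sin(j / 10000^(2i/d)), PE(j, 2i+1) = cos(j / 10000^(2i/d))."""
    PE = np.zeros((max_len, d))
    j = np.arange(max_len)[:, None]                # positions j = 0..max_len-1
    denom = 10000.0 ** (np.arange(0, d, 2) / d)    # 10000^(2i/d) for each dimension pair
    PE[:, 0::2] = np.sin(j / denom)                # even dimensions: sine
    PE[:, 1::2] = np.cos(j / denom)                # odd dimensions: cosine
    return PE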