Transformer Models
CSE545 - Spring 2019
Review: Feed Forward Network (fully-connected)
(skymind, AI Wiki)
Review: Convolutional NN
(Barter, 2018)
Review: Recurrent Neural Network
(Jurafsky, 2019)
Hidden layer: h(t) = g(h(t-1) U + x(t) V)
Output: y(t) = f(h(t) W)
(g, f: activation functions)
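As a concrete reference for these update equations, here is a minimal numpy sketch of one RNN step. The tanh/softmax activations and the toy dimensions are illustrative assumptions (the slides only name the activations g and f).

```python
import numpy as np

def rnn_step(x_t, h_prev, U, V, W):
    """One step of a simple RNN: h(t) = g(h(t-1) U + x(t) V), y(t) = f(h(t) W)."""
    h_t = np.tanh(h_prev @ U + x_t @ V)        # g = tanh (illustrative choice)
    y_t = np.exp(h_t @ W)
    y_t = y_t / y_t.sum()                      # f = softmax (illustrative choice)
    return h_t, y_t

# Toy dimensions: 5-dim input, 4-dim hidden state, 3-dim output.
rng = np.random.default_rng(0)
U, V, W = rng.normal(size=(4, 4)), rng.normal(size=(5, 4)), rng.normal(size=(4, 3))
h = np.zeros(4)
for x in rng.normal(size=(6, 5)):              # a length-6 input sequence
    h, y = rnn_step(x, h, U, V, W)             # h must be computed step by step, in order
```

Note the loop: each h depends on the previous h, which is exactly the sequential bottleneck discussed next.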
FFN CNN RNN
Can model computation (e.g. matrix operations for a single input) be parallelized?
Ultimately, this limits how complex the model can be (i.e., its total number of parameters/weights) compared to a CNN.
The Transformer: “Attention-only” models
Can handle sequences and long-distance dependencies, but…
- Don’t want the complexity of LSTM/GRU cells
- Constant number of edges between input steps
- Enables “interactions” (i.e. adaptations) between words
- Easy to parallelize -- no sequential processing needed.
The Transformer: “Attention-only” models
Challenge:
- Long-distance dependencies when translating:
  “Kayla kicked the ball.” → “The ball was kicked by Kayla.”
  [Figure: decoder outputs <go>, y(0), y(1), y(2), … aligned against y(0) … y(4)]
Attention
[Figure, built up over three slides: a query h_i is compared to keys s_1 … s_4 by a score function ω (with weights W); the scores are normalized into attention weights α_{h_i→s_1} … α_{h_i→s_4}, which weight the values z_1 … z_4 to produce the context c_{h_i}.]
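A minimal numpy sketch of this mechanism: the query is scored against each key, the scores are softmax-normalized into the weights α, and the context is the α-weighted sum of the values. The dot-product score function and the toy dimensions are assumptions for illustration.

```python
import numpy as np

def attention(query, keys, values):
    """Weight each value by how well its key matches the query."""
    scores = keys @ query                     # score function ω: here a plain dot product
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()               # softmax -> attention weights α
    return alpha @ values                     # context: α-weighted sum of the values

rng = np.random.default_rng(0)
query = rng.normal(size=8)                    # e.g. a decoder state h_i
keys = rng.normal(size=(4, 8))                # s_1 ... s_4
values = rng.normal(size=(4, 8))              # z_1 ... z_4
context = attention(query, keys, values)      # c_{h_i}, an 8-dim vector
```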
The Transformer: “Attention-only” models
Challenge:
- Long-distance dependencies when translating.
Attention came about for encoder-decoder models. Then self-attention was introduced:
Attention
[Same figure as before: query h_i, keys s_1 … s_4, values z_1 … z_4, score function ω, weights α_{h_i→s_j}, context c_{h_i}.]
Self-Attention
[Figure: the same mechanism, but the query, keys (s_1, s_2, …, s_i, …), and values (z_1, z_2, …, z_i, …) all come from the same sequence, so position i attends over the positions of its own sequence to produce c_i.]
The Transformer: “Attention-only” models
Attention as weighting a value based on a query and key (Eisenstein, 2018).
The Transformer: “Attention-only” models
(Eisenstein, 2018)
[Figure: self-attention over hidden states h_{i-1}, h_i, h_{i+1}: a score function ω compares each pair of states, the scores are normalized into weights α, and each output is the α-weighted sum of the states.]
The Transformer: “Attention-only” models
[Figure, built up over several slides: inputs w_{i-1}, w_i, w_{i+1}, w_{i+2} are mapped to hidden states h_{i-1}, h_i, h_{i+1}, h_{i+2}; self-attention (weights α from score function ω) combines them, and an FFN on top produces the outputs y_{i-1}, y_i, y_{i+1}, y_{i+2}.]
- Attend to all hidden states in your “neighborhood”: each output is a weighted sum (×, +) of the hidden states.
- Score function: dot product, kᵀq.
- Scale the score: σ(k, q) is kᵀq divided by a scaling parameter.
- Linear layer WᵀX: one set of weights for each of K, Q, and V.
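Putting the last few slides together, here is a minimal numpy sketch of scaled dot-product self-attention with one set of linear weights each for Q, K, and V. The 1/√d_k scale is the choice from the original Transformer paper, and the matrix shapes are toy values; both are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model). One set of linear weights each for Q, K, and V."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                 # linear layers (biases omitted for brevity)
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # scaled dot-product scores
    alpha = softmax(scores, axis=-1)                 # each row: weights over all positions
    return alpha @ V                                 # every position attends to every position

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))                         # 5 tokens, d_model = 16
Wq, Wk, Wv = (rng.normal(size=(16, 8)) for _ in range(3))
H = self_attention(X, Wq, Wk, Wv)                    # (5, 8) attended representations
```

Unlike the RNN loop earlier, all positions are processed with a few matrix multiplications, so the computation parallelizes easily.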
The Transformer
Limitation (thus far): Can’t capture multiple types of dependencies between words.
The Transformer
Solution: Multi-head attention
Multi-head Attention
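A minimal numpy sketch of multi-head attention: each head has its own Q/K/V projections, so different heads can capture different types of dependencies between words, and the heads’ outputs are concatenated and mixed by a final linear layer. The head count and sizes are illustrative assumptions.

```python
import numpy as np

def attention_head(X, Wq, Wk, Wv):
    """One head: its own Q/K/V projections and scaled dot-product attention."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (e / e.sum(axis=-1, keepdims=True)) @ V

def multi_head_attention(X, heads, Wo):
    """Concatenate the heads' outputs and mix them with a final linear layer Wo."""
    return np.concatenate([attention_head(X, *h) for h in heads], axis=-1) @ Wo

rng = np.random.default_rng(0)
d_model, d_head, n_heads = 16, 8, 2
X = rng.normal(size=(5, d_model))                               # 5 tokens
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3)) for _ in range(n_heads)]
Wo = rng.normal(size=(n_heads * d_head, d_model))
H = multi_head_attention(X, heads, Wo)                          # (5, 16): one vector per token
```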
Transformer for Encoder-Decoder
- Positional encoding: the sequence index (t) is encoded and added to the input.
- Residual connections: residuals enable positional information to be passed along.
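Self-attention by itself has no notion of order, so the sequence index (t) has to be injected into the input. Below is a minimal sketch of the sinusoidal positional encoding used by the original Transformer, assumed here since the slides only name the idea; it is added to the token embeddings, and the residual connections then carry the positional information up through the layers.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding: even dimensions use sin, odd dimensions use cos."""
    pos = np.arange(seq_len)[:, None]                   # t = 0 ... seq_len-1
    i = np.arange(d_model // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

X = np.zeros((5, 16))                # 5 token embeddings, d_model = 16 (toy values)
X = X + positional_encoding(5, 16)   # position information is simply added to the embeddings
```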
Transformer for Encoder-Decoder
- The decoder is essentially a language model.
- The decoder blocks out future inputs (masking).
- Add conditioning of the LM based on the encoder.
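A minimal sketch of how the decoder blocks out future inputs: before the softmax, attention scores for positions j > i are set to -inf, so position i can only attend to itself and earlier positions. The upper-triangular (causal) mask is the standard way to do this, assumed here; shapes are toy values.

```python
import numpy as np

def causal_self_attention(Q, K, V):
    """Masked (causal) self-attention: position i cannot attend to positions j > i."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)   # True above the diagonal = future
    scores = np.where(mask, -np.inf, scores)                 # block out future inputs
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    alpha = e / e.sum(axis=-1, keepdims=True)
    return alpha @ V

rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(5, 8))
H = causal_self_attention(Q, K, V)   # row i of the attention weights is zero for all j > i
```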
Transformer (as of 2017)
“WMT-2014” Data Set. BLEU scores:
Transformer
- Utilizes self-attention
- Simple attention scoring function (dot product, scaled)
- Added linear layers for Q, K, and V
- Multi-head attention
- Added positional encoding
- Added residual connections
- Simulates decoding by masking
Transformer
Why?
- Don’t need the complexity of LSTM/GRU cells
- Constant number of edges between words (or input steps)
- Enables “interactions” (i.e. adaptations) between words
- Easy to parallelize -- no sequential processing needed.
Drawbacks of Vanilla Transformers:
- Only unidirectional by default
- Only a “single-hop” relationship per layer (multiple layers needed to capture multiple relationships)
BERT
Bidirectional Encoder Representations from Transformers: produces contextualized embeddings (or a pre-trained contextualized encoder).
- Bidirectional context by “masking” in the middle
- A lot of layers, hidden states, attention heads.
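A minimal sketch of the “masking in the middle” idea behind BERT’s masked-language-model pretraining: a fraction of input tokens are replaced with a [MASK] token and the model must predict the originals, which lets it use context from both directions. The 15% rate and the simplified replacement rule are assumptions (BERT’s actual recipe also sometimes keeps or randomly replaces the selected tokens).

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]"):
    """Return (masked input, {position: original token}) for MLM training."""
    masked, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if random.random() < mask_rate:
            targets[i] = tok               # the model must predict this original token
            masked[i] = mask_token         # it only sees [MASK] here, plus context on both sides
    return masked, targets

tokens = "the ball was kicked by kayla".split()
masked, targets = mask_tokens(tokens)
# one possible outcome: masked = ['the', 'ball', 'was', '[MASK]', 'by', 'kayla'], targets = {3: 'kicked'}
```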
BERT
Differences from previous state of the art:
- Bidirectional transformer (through masking)
- Both directions jointly trained at once
- Captures sentence-level relations
- Tokenizes input into “word pieces”
(Devlin et al., 2019)
BERT: Attention by Layers
https://colab.research.google.com/drive/1vlOJ1lhdujVjfH857hvYKIdKPTD9Kid8
(Vig, 2019)
BERT Performance: e.g. Question Answering
https://rajpurkar.github.io/SQuAD-explorer/
BERT: Pre-training; Fine-tuning
- 12 or 24 layers
- Novel classifier for the fine-tuning task (e.g. sentiment classifier, stance detector, etc.)
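A minimal sketch of what the “novel classifier” looks like at fine-tuning time: a small task-specific head (here a single linear layer plus softmax over the [CLS] representation) is placed on top of the pre-trained encoder, and the whole stack is trained on the downstream task. The use of the [CLS] vector and the 768-dim hidden size (12-layer BERT-base) are standard, but the head itself is just an illustrative assumption.

```python
import numpy as np

def classifier_head(cls_vector, W, b):
    """Task-specific head on top of BERT's [CLS] representation."""
    logits = cls_vector @ W + b                    # e.g. 3 sentiment / stance classes
    e = np.exp(logits - logits.max())
    return e / e.sum()                             # class probabilities

rng = np.random.default_rng(0)
cls_vector = rng.normal(size=768)                  # [CLS] output of a 12-layer BERT-base encoder
W, b = rng.normal(size=(768, 3)), np.zeros(3)      # only these new weights start randomly initialized
probs = classifier_head(cls_vector, W, b)          # fine-tuning updates W, b and the encoder itself
```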