
SLIDE 1

Transformer Sequence Models and Sequence Applications

(Machine Translation, Speech Recognition)

CSE392 - Spring 2019 Special Topic in CS

SLIDE 2

Most NLP tasks, e.g.:

  • Sequence Tasks
    ○ Language Modeling
    ○ Machine Translation
    ○ Speech Recognition
  • Transformer Networks
    ○ Transformers
    ○ BERT

SLIDE 3

Multi-level bidirectional RNN (LSTM or GRU)

(Eisenstein, 2018)

SLIDE 4

Multi-level bidirectional RNN (LSTM or GRU)

Each node has a forward (→) and a backward (←) hidden state; the two can be represented as a single concatenated vector. (Eisenstein, 2018)

SLIDE 5

Multi-level bidirectional RNN (LSTM or GRU)

The average of the top layer (the average of the concatenated vectors) serves as an embedding of the sequence. (Eisenstein, 2018)

SLIDE 6

Multi-level bidirectional RNN (LSTM or GRU)

Sometimes the left-most and right-most hidden states are used instead. (Eisenstein, 2018)
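The pooling choices on slides 4-6 can be made concrete with a short PyTorch sketch (dimensions and data are illustrative, not from the lecture):

```python
import torch
import torch.nn as nn

# Multi-level bidirectional RNN (2 layers of LSTM cells); sizes illustrative.
rnn = nn.LSTM(input_size=32, hidden_size=64, num_layers=2,
              bidirectional=True, batch_first=True)

x = torch.randn(1, 10, 32)       # (batch, seq_len, input_size)
out, _ = rnn(x)                  # (1, 10, 128): forward ++ backward per token

# Slide 4: each position concatenates forward and backward hidden states.
# Slide 5: average the top layer (the concatenated vectors) as an embedding.
embedding_avg = out.mean(dim=1)  # (1, 128)

# Slide 6: use the end-points instead -- the forward state at the last token
# and the backward state at the first token.
embedding_ends = torch.cat([out[:, -1, :64], out[:, 0, 64:]], dim=-1)
```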

SLIDE 7

Encoder

A representation of the input. (Eisenstein, 2018)

SLIDE 8

Encoder-Decoder

Representing the input and converting it to output. (Eisenstein, 2018)

SLIDE 9

Encoder-Decoder

(Eisenstein, 2018)

[Figure: the decoder emits y(0), y(1), y(2), y(3), … through a softmax layer]

SLIDE 10

Encoder-Decoder

[Figure: decoder inputs begin with <go> followed by the previous outputs y(0), y(1), y(2), …; the softmax at each step emits y(0), y(1), y(2), y(3), …]

SLIDE 11

Encoder-Decoder

A representation of the input.

[Figure: the encoder's final state initializes the decoder, which starts from <go> and emits y(0), y(1), y(2), y(3), … through a softmax]

SLIDE 12

Encoder-Decoder

A representation of the input.

[Figure: as in slide 11]

The decoder is essentially a language model conditioned on the final state from the encoder.
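A toy sketch of this conditioning (hypothetical sizes; a single GRU cell stands in for the decoder, with greedy decoding as on the next slide):

```python
import torch
import torch.nn as nn

vocab_size, hidden = 100, 64               # illustrative sizes
embed = nn.Embedding(vocab_size, hidden)
cell = nn.GRUCell(hidden, hidden)
out_proj = nn.Linear(hidden, vocab_size)   # the softmax layer's logits

GO = 0                                     # stand-in for the <go> token id
h = torch.randn(1, hidden)                 # stand-in for encoder final state
token = torch.tensor([GO])

generated = []
for _ in range(10):
    h = cell(embed(token), h)              # LM step, conditioned via h
    token = out_proj(h).argmax(dim=-1)     # greedy pick; fed back next step
    generated.append(token.item())
```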

SLIDE 13

Encoder-Decoder

When applied to new data, each predicted token is fed back in as the next decoder input (starting from <go>): essentially a language model conditioned on the final state from the encoder.

SLIDE 14

Encoder-Decoder

A representation of the input.

[Figure: as in slide 11]

SLIDE 15

Encoder-Decoder “seq2seq” model

[Figure: seq2seq: the encoder reads Language 1 (e.g. Chinese); the decoder, starting from <go>, emits Language 2 (e.g. English) as y(0), y(1), y(2), y(3), … through a softmax]

SLIDE 16

Encoder-Decoder

Challenge:

  • Long distance dependency when translating:

[Figure: unrolled encoder-decoder; decoder input <go>, y(0), y(1), y(2), …; outputs y(0)…y(4)]


SLIDE 18

Encoder-Decoder

Challenge:

  • Long distance dependency when translating:

[Figure: unrolled encoder-decoder; outputs y(0)…y(4)]

Kayla kicked the ball.
The ball was kicked by Kayla.

SLIDE 19

Encoder-Decoder

Challenge:

  • Long distance dependency when translating:

[Figure: unrolled encoder-decoder; outputs y(0)…y(4)]

A lot of responsibility is put on the fixed-size hidden state passed from the encoder to the decoder.

Kayla kicked the ball.
The ball was kicked by Kayla.

SLIDE 20

Long Distance / Out of order dependencies

[Figure: encoder-decoder with softmax outputs y(0), y(1), y(2), y(3), …]

A lot of responsibility is put on the fixed-size hidden state passed from the encoder to the decoder.

SLIDE 21

Long Distance / Out of order dependencies

[Figure: encoder-decoder with softmax outputs y(0), y(1), y(2), y(3), …]

SLIDE 22

Attention

[Figure: the decoder attends over encoder states s1…s4]

SLIDE 23

Attention

Analogy: random access memory.

[Figure: the decoder attends over encoder states s1…s4]

SLIDE 24

Attention

[Figure: an attention layer sits between the encoder states s1…s4 and the decoder]

SLIDE 25

Attention

attention layer -- i: current token of output; n: one of the N tokens of input

[Figure: decoder states h_{i-1}, h_i, h_{i+1} query the attention layer over encoder states s1…s4 (values z_{n-1}, z_n, z_{n+1}), producing the context vector c_{h_i}]

SLIDE 26

Attention

[Figure: attention weights α_{h_i→s_1}, α_{h_i→s_2}, α_{h_i→s_3}, α_{h_i→s_4} over states s1…s4 combine into the context vector c_{h_i}]

SLIDE 27

Attention

[Figure: the same weights α_{h_i→s_n} applied to value vectors z1…z4 to form c_{h_i}]

Z is the vector to be attended to (the value in memory). It is typically the hidden states of the input (i.e., s_n) but can be anything.

SLIDE 28

Attention

[Figure: attention weights α_{h_i→s_1}…α_{h_i→s_4} over states s1…s4 combine into c_{h_i}]

SLIDE 29

Attention

[Figure: a score function ω compares the query h_i with each s_n; the normalized weights α_{h_i→s_1}…α_{h_i→s_4} combine into c_{h_i}]

SLIDE 30

Attention

[Figure: as in slide 29]

Score function parameters: v, W_h, W_s

SLIDE 31

Attention

[Figure: as in slide 29, with value vectors z1…z4]

Score function parameters: v, W_h, W_s. A useful abstraction is to make the vector attended to (the “value vector”, Z) separate from the “key vector” (s).

SLIDE 32

Attention

[Figure: as in slide 31, labeling z1…z4 as values, s1…s4 as keys, and h_i as the query]

Score function parameters: v, W_h, W_s. A useful abstraction is to make the vector attended to (the “value vector”, Z) separate from the “key vector” (s).
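Putting slides 26-32 together, a minimal NumPy sketch of attention with separate keys, values, and a query (the additive score ω follows Bahdanau et al., 2015; all sizes illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d = 8
s = np.random.randn(4, d)   # keys s_1..s_4 (encoder states)
z = np.random.randn(4, d)   # values z_1..z_4 (often just z = s)
h_i = np.random.randn(d)    # query: current decoder state

# Additive score omega(h_i, s_n) = v . tanh(W_h h_i + W_s s_n)
v, W_h, W_s = np.random.randn(d), np.random.randn(d, d), np.random.randn(d, d)
scores = np.array([v @ np.tanh(W_h @ h_i + W_s @ s_n) for s_n in s])

alpha = softmax(scores)     # attention weights alpha_{h_i -> s_n}
c_hi = alpha @ z            # context vector: weighted sum of the values
```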

SLIDE 33

Attention

[Figure: as in slide 29]

Alternative Scoring Functions

SLIDE 34

Attention

Alternative Scoring Functions

[Figure: as in slide 29]

If the variables are standardized, a matrix multiply produces a similarity score.
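The slide's list is not preserved in this transcript; the standard alternatives (dot product, bilinear, additive/MLP, and the scaled dot product used later by Transformers) are:

```latex
\omega_{\text{dot}}(h, s) = h^\top s \qquad
\omega_{\text{bilinear}}(h, s) = h^\top W s \qquad
\omega_{\text{additive}}(h, s) = v^\top \tanh(W_h h + W_s s) \qquad
\omega_{\text{scaled}}(h, s) = h^\top s / \sqrt{d}
```

The dot product is the "matrix multiply" case above: with standardized vectors it behaves like a correlation, i.e. a similarity score.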

SLIDE 35

Attention

(“synced”, 2017)

[Figure: attention weights from decoder state h_i over encoder states s4, s3, s2, s1]


SLIDE 37

Attention

(“synced”, 2017)

[Figure: as in slide 35]

(Bahdanau et al., 2015)


SLIDE 39

Machine Translation

Why?

  • $40 billion/year industry
  • A centerpiece of many genres of science fiction
  • A fairly “universal” problem:
    ○ Language understanding
    ○ Language generation
  • Societal benefits of inter-cultural communication

SLIDE 40

Machine Translation

Why?

  • $40 billion/year industry
  • A centerpiece of many genres of science fiction
  • A fairly “universal” problem:
    ○ Language understanding
    ○ Language generation
  • Societal benefits of inter-cultural communication

(Douglas Adams)

SLIDE 41

Machine Translation

Why does the neural network approach work? (Manning, 2018)

  • Joint end-to-end training: learning all parameters at once
  • Exploiting distributed representations (embeddings)
  • Exploiting variable-length context
  • High-quality generation from deep decoders: stronger language models (even when wrong, outputs make sense)

SLIDE 42

Machine Translation

As an optimization problem (Eisenstein, 2018):
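The equation on the slide was an image; Eisenstein (2018) frames translation as a search for the highest-scoring target sentence, which (reconstructed here in his notation) is:

```latex
\hat{w}^{(t)} = \operatorname*{argmax}_{w^{(t)}} \; \Psi\!\left(w^{(s)}, w^{(t)}\right)
```

where w^{(s)} is the source sentence, w^{(t)} ranges over candidate translations, and Ψ is a scoring function (in neural MT, the conditional log-probability from the decoder).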

SLIDE 43

Attention

(“synced”, 2017)

[Figure: as in slide 35]

SLIDE 44

Attention

Analogy: random access memory.

[Figure: the decoder attends over encoder states s1…s4]

SLIDE 45

Attention

[Figure: the decoder attends over encoder states s1…s4]

Do we even need all these RNNs?

(Vaswani et al., 2017: “Attention Is All You Need”)

SLIDE 46

Attention

[Figure: as in slide 32 -- values z1…z4, keys s1…s4, query h_i]

Score function parameters: v, W_h, W_s. A useful abstraction is to make the vector attended to (the “value vector”, Z) separate from the “key vector” (s).

SLIDE 47

Attention

[Figure: as in slide 32, annotated with the notation value z_j, key s_j, query h_i]

Score function parameters: v, W_h, W_s. A useful abstraction is to make the vector attended to (the “value vector”, Z) separate from the “key vector” (s). (Eisenstein, 2018)

SLIDE 48

The Transformer: “Attention-only” models

Attention as weighting a value based on a query and a key (Eisenstein, 2018):

SLIDE 49

The Transformer: “Attention-only” models

(Eisenstein, 2018)

[Figure: self-attention: inputs x feed states h_{i-1}, h_i, h_{i+1}; scores ω become weights α that produce the output]

SLIDE 50

The Transformer: “Attention-only” models

(Eisenstein, 2018)

[Figure: self attention -- each h_i is computed by attending over the other states h_{i-1}, h_{i+1}, …]

SLIDE 51

The Transformer: “Attention-only” models

[Figure: self-attention over states h_{i-1}…h_{i+2}]

SLIDE 52

The Transformer: “Attention-only” models

[Figure: word inputs w_{i-1}…w_{i+2} feed self-attention states h_{i-1}…h_{i+2}, followed by a feed-forward network (FFN)]

SLIDE 53

The Transformer: “Attention-only” models

[Figure: words w_{i-1}…w_{i+2} → self-attention states h_{i-1}…h_{i+2} → outputs y_{i-1}…y_{i+2}]

SLIDE 54

The Transformer: “Attention-only” models

[Figure: as in slide 53, stacked over multiple layers]

SLIDE 55

The Transformer: “Attention-only” models

[Figure: as in slide 53]

Attend to all hidden states in your “neighborhood”.

SLIDE 56

The Transformer: “Attention-only” models

[Figure: dot-product attention: the query is dotted (kᵀq) with each key; the resulting weights multiply (X) the values, which are summed (+)]

SLIDE 57

The Transformer: “Attention-only” models

[Figure: as in slide 56; the scores kᵀq are divided by a scaling parameter and passed through a softmax σ(k, q)]

SLIDE 58

The Transformer: “Attention-only” models

[Figure: as in slide 57]

Linear layer: WᵀX, with one set of weights for each of K, Q, and V.
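A minimal NumPy sketch of slides 56-58 (a sketch, not the lecture's code; sizes illustrative): project the inputs X with a separate weight matrix for each of K, Q, and V, score with scaled dot products, then softmax:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

n, d = 4, 8                       # 4 tokens, hidden size 8 (illustrative)
X = np.random.randn(n, d)         # input vectors (e.g. word embeddings)

# Linear layer W^T X, one set of weights for each of K, Q, and V (slide 58).
W_k, W_q, W_v = (np.random.randn(d, d) for _ in range(3))
K, Q, V = X @ W_k, X @ W_q, X @ W_v

scores = Q @ K.T / np.sqrt(d)     # scaled dot products k^T q (slides 56-57)
alpha = softmax(scores, axis=-1)  # softmax: each row sums to 1
H = alpha @ V                     # each h_i is a weighted sum of the values
```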

SLIDE 59

The Transformer: “Attention-only” models

Why?

  • Don’t need the complexity of LSTM/GRU cells
  • Constant number of edges between words (or input steps)
  • Enables “interactions” (i.e. adaptations) between words
  • Easy to parallelize -- don’t need sequential processing
SLIDE 60

The Transformer

Limitation (thus far): Can’t capture multiple types of dependencies between words.

SLIDE 61

The Transformer

Solution: Multi-head attention

SLIDE 62

Multi-head Attention
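The slide's diagram is not in the transcript; a minimal NumPy sketch of the idea (each head attends in its own subspace, so different heads can capture different dependency types; sizes illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

n, d, heads = 4, 8, 2                        # d_head = d // heads
d_h = d // heads
X = np.random.randn(n, d)

W_k, W_q, W_v, W_o = (np.random.randn(d, d) for _ in range(4))
K, Q, V = X @ W_k, X @ W_q, X @ W_v

outs = []
for h in range(heads):                       # one attention per head
    sl = slice(h * d_h, (h + 1) * d_h)       # this head's subspace
    a = softmax(Q[:, sl] @ K[:, sl].T / np.sqrt(d_h), axis=-1)
    outs.append(a @ V[:, sl])

H = np.concatenate(outs, axis=-1) @ W_o      # concat heads; final linear map
```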

SLIDE 63

Transformer for Encoder-Decoder

SLIDE 64

Transformer for Encoder-Decoder

[Figure axis: sequence index (t)]
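The figure itself is lost; given the axis label it presumably showed positional encodings plotted against the sequence index. The sinusoidal form from Vaswani et al. (2017), added to each input embedding, is:

```latex
PE_{(t,\,2i)} = \sin\!\left(t / 10000^{2i/d}\right), \qquad
PE_{(t,\,2i+1)} = \cos\!\left(t / 10000^{2i/d}\right)
```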

SLIDE 65

Transformer for Encoder-Decoder

SLIDE 66

Transformer for Encoder-Decoder

Residualized Connections

SLIDE 67

Transformer for Encoder-Decoder

Residualized Connections

residuals enable positional information to be passed along
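A small sketch of the residual pattern (assuming the standard add-and-norm arrangement of Vaswani et al., 2017; the FFN and all sizes are illustrative):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    sd = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sd + eps)

def residual_sublayer(x, sublayer):
    # The sublayer output is *added* to its input, so information already
    # in x (e.g. positional encodings) is passed along unchanged.
    return layer_norm(x + sublayer(x))

W1, W2 = np.random.randn(8, 32), np.random.randn(32, 8)
ffn = lambda x: np.maximum(0, x @ W1) @ W2         # position-wise FFN (ReLU)
h = residual_sublayer(np.random.randn(4, 8), ffn)  # attention wraps the same way
```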

SLIDE 68

Transformer for Encoder-Decoder

SLIDE 69

Transformer for Encoder-Decoder

Essentially, a language model.

SLIDE 70

Transformer for Encoder-Decoder

Essentially, a language model. The decoder blocks out (masks) future inputs.
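A minimal sketch of that masking (illustrative; it extends the earlier self-attention sketch): scores for future positions are set to -inf before the softmax, so their attention weights become exactly zero:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

n = 4
scores = np.random.randn(n, n)    # query-key scores, as in earlier sketches

# Causal mask: position i may only attend to positions <= i.
future = np.triu(np.ones((n, n), dtype=bool), k=1)
scores = np.where(future, -np.inf, scores)

alpha = softmax(scores, axis=-1)  # upper triangle is exactly 0
```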

SLIDE 71

Transformer for Encoder-Decoder

Essentially, a language model. Add conditioning of the LM on the encoder.

SLIDE 72

Transformer for Encoder-Decoder

SLIDE 73

Transformer (as of 2017)

“WMT-2014” Data Set. BLEU scores:

SLIDE 74

Transformer

  • Utilize self-attention
  • Simple attention scoring function (dot product, scaled)
  • Added linear layers for Q, K, and V
  • Multi-head attention
  • Added positional encoding
  • Added residual connections
  • Simulate decoding by masking
SLIDE 75

Transformer

Why?

  • Don’t need the complexity of LSTM/GRU cells
  • Constant number of edges between words (or input steps)
  • Enables “interactions” (i.e. adaptations) between words
  • Easy to parallelize -- don’t need sequential processing

Drawbacks:

  • Only unidirectional by default
  • Only a “single-hop” relationship per layer (multiple layers to capture multiple)

SLIDE 76

Why?

  • Don’t need the complexity of LSTM/GRU cells
  • Constant number of edges between words (or input steps)
  • Enables “interactions” (i.e. adaptations) between words
  • Easy to parallelize -- don’t need sequential processing

Drawbacks of Vanilla Transformers:

  • Only unidirectional by default
  • Only a “single-hop” relationship per layer (multiple layers to capture multiple)

BERT

Bidirectional Encoder Representations from Transformers. Produces contextualized embeddings (or a pre-trained contextualized encoder).

SLIDE 77

Why?

  • Don’t need the complexity of LSTM/GRU cells
  • Constant number of edges between words (or input steps)
  • Enables “interactions” (i.e. adaptations) between words
  • Easy to parallelize -- don’t need sequential processing

Drawbacks of Vanilla Transformers:

  • Only unidirectional by default
  • Only a “single-hop” relationship per layer (multiple layers to capture multiple)

BERT

Bidirectional Encoder Representations from Transformers. Produces contextualized embeddings (or a pre-trained contextualized encoder).

  • Bidirectional context by “masking” in the middle
  • A lot of layers, hidden states, attention heads.
SLIDE 78

BERT

Bidirectional Encoder Representations from Transformers. Produces contextualized embeddings (or a pre-trained contextualized encoder).

  • Bidirectional context by “masking” in the middle
  • A lot of layers, hidden states, attention heads.

She saw the man on the hill with the telescope.
She [mask] the man on the hill [mask] the telescope.

SLIDE 79

BERT

Bidirectional Encoder Representations from Transformers. Produces contextualized embeddings (or a pre-trained contextualized encoder).

  • Bidirectional context by “masking” in the middle
  • A lot of layers, hidden states, attention heads.

She saw the man on the hill with the telescope.
She [mask] the man on the hill [mask] the telescope.

Mask 1 in 7 words:

  • Too few: expensive, less robust
  • Too many: not enough context
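A toy sketch of the masking step (illustrative code, not BERT's actual pipeline; BERT's published masking rate is 15%, roughly 1 in 7):

```python
import random

def mask_tokens(tokens, rate=0.15, mask="[mask]"):
    """Randomly mask tokens; the model must predict the originals
    from bidirectional context."""
    out, targets = [], {}
    for i, tok in enumerate(tokens):
        if random.random() < rate:
            targets[i] = tok        # what the model must reconstruct
            out.append(mask)
        else:
            out.append(tok)
    return out, targets

sent = "She saw the man on the hill with the telescope .".split()
masked, targets = mask_tokens(sent)
```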
SLIDE 80

BERT

Bidirectional Encoder Representations from Transformers. Produces contextualized embeddings (or a pre-trained contextualized encoder).

  • Bidirectional context by “masking” in the middle
  • A lot of layers, hidden states, attention heads.
  • BERT-Base, Cased: 12-layer, 768-hidden, 12-heads, 110M parameters

SLIDE 81

BERT

Bidirectional Encoder Representations from Transformers. Produces contextualized embeddings (or a pre-trained contextualized encoder).

  • Bidirectional context by “masking” in the middle
  • A lot of layers, hidden states, attention heads.
  • BERT-Base, Cased: 12-layer, 768-hidden, 12-heads, 110M parameters
  • BERT-Large, Cased: 24-layer, 1024-hidden, 16-heads, 340M parameters
  • BERT-Base, Multilingual Cased: 104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters

SLIDE 82

BERT

(Devlin et al., 2019)

SLIDE 83

BERT

Differences from previous state of the art:

  • Bidirectional transformer (through masking)
  • Directions jointly trained at once.

(Devlin et al., 2019)

SLIDE 84

BERT

Differences from previous state of the art:

  • Bidirectional transformer (through masking)
  • Directions jointly trained at once.
  • Capture sentence-level relations

(Devlin et al., 2019)


SLIDE 88

BERT

Differences from previous state of the art:

  • Bidirectional transformer (through masking)
  • Directions jointly trained at once.
  • Capture sentence-level relations

(Devlin et al., 2019)

tokenize into “word pieces”

SLIDE 89

BERT Performance: e.g. Question Answering

https://rajpurkar.github.io/SQuAD-explorer/

SLIDE 90

BERT: Attention by Layers

https://colab.research.google.com/drive/1vlOJ1lhdujVjfH857hvYKIdKPTD9Kid8

(Vig, 2019)

SLIDE 91

BERT: Pre-training; Fine-tuning

12 or 24 layers


SLIDE 93

BERT: Pre-training; Fine-tuning

12 or 24 layers
Novel classifier (e.g. sentiment classifier, stance detector, etc.)

SLIDE 94

BERT: Pre-training; Fine-tuning

The [CLS] vector at the start is supposed to capture the meaning of the whole sequence.
Novel classifier (e.g. sentiment classifier, stance detector, etc.)

SLIDE 95

BERT: Pre-training; Fine-tuning

The [CLS] vector at the start is supposed to capture the meaning of the whole sequence. The average of the top layer (or second-to-top) is also often used.
Novel classifier (e.g. sentiment classifier, stance detector, etc.)
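Both pooling options are easy to extract with the Hugging Face transformers library (a convenience API that postdates these slides; a hedged sketch using standard calls, not the lecture's code):

```python
import torch
from transformers import BertModel, BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-cased")
model = BertModel.from_pretrained("bert-base-cased")

inputs = tok("Kayla kicked the ball.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

hidden = out.last_hidden_state    # (1, seq_len, 768): top-layer states
cls_vec = hidden[:, 0]            # [CLS] vector: whole-sequence summary
avg_vec = hidden.mean(dim=1)      # average of the top layer
# Either vector can feed a novel classifier (sentiment, stance, etc.).
```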

SLIDE 96

BERT for Machine Translation:

(Lample & Conneau, Facebook, 2019)


SLIDE 98

BERT for Machine Translation:

(Lample & Conneau, Facebook, 2019)

Use as a pre-trained model for feeding into a machine translation system.


SLIDE 100

Neural Machine Translation

Where does the neural approach fall short? (Manning, 2018)

  • The translation process is mostly a black box -- can’t answer “why” for reordering or word-choice decisions
  • No direct use of semantic or syntactic structures
  • Not modeling discourse structure -- only a rough sense of how sentences relate to each other; doesn’t model long-distance anaphora