  1. Eric Mintun HEP-AI Journal Club May 15th, 2018

  2. Outline
  • Motivating example and definition
    Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. “Neural machine translation by jointly learning to align and translate.” In International Conference on Learning Representations, 2015. arXiv:1409.0473 [cs.CL]
  • Generalizations and a little theory
    Yoon Kim, Carl Denton, Luong Hoang, and Alexander M. Rush. “Structured attention networks.” In International Conference on Learning Representations, 2017. arXiv:1702.00887 [cs.CL]
  • Why attention might be better than RNNs and CNNs
    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. “Attention is all you need.” In 31st Conference on Neural Information Processing Systems (NIPS 2017). arXiv:1706.03762 [cs.CL]

  3. Translation
  [Figure: an encoder-decoder RNN pair. The encoder reads the English sentence “The agreement on the European Economic Area was signed in 1992 . <end>”, compresses it into a single fixed-size context vector c, and the decoder emits the French translation “L’ accord sur la zone économique européenne a été signé en août 1992 . <end>”.]

  4. Translation
  • A fixed-size context vector struggles with long sentences and tends to fail later in the sentence.
  • In the paper’s long-sentence example, the underlined portion is mistranslated as ‘based on his state of health’.

  5. Translation w/ Attention
  [Figure: the same encoder-decoder pair, but each decoder step attends over all encoder annotations h_1, ..., h_13 instead of a single fixed context vector.]
  The decoder state s_i−1 is compared with every annotation to form weights α_ji = α_ji(h_j, s_i−1), with Σ_j α_ji = 1 and 0 ≤ α_ji ≤ 1, and the context vector for step i is c_i = Σ_j α_ji h_j.

  6. Translation w/ Attention

  7. Translation w/ Attention

  8. Attention
  • Attention consists of learned key-value pairs (K_i, V_i).
  • An input query Q is compared with each key; a better match lets more of that key’s value through:
    out = Σ_i w_i V_i, with w_i = compare(Q, K_i) and Σ_i w_i = 1
  • Additive compare: Q and K are fed into a neural net.
  • Multiplicative compare: dot product of Q and K.
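  A minimal NumPy sketch of this lookup, assuming a dot-product (multiplicative) compare followed by a softmax so the weights are non-negative and sum to 1; the shapes and function names are illustrative, not from the slides:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention(query, keys, values):
    """Soft dictionary lookup: compare the query with every key, normalize the
    scores into weights w_i with sum_i w_i = 1, and return sum_i w_i * V_i."""
    scores = keys @ query              # multiplicative compare: one score per key
    weights = softmax(scores)          # 0 <= w_i <= 1, weights sum to 1
    return weights @ values, weights

# toy example: 4 key/value pairs of dimension 3
keys, values = np.random.randn(4, 3), np.random.randn(4, 3)
query = np.random.randn(3)
out, w = attention(query, keys, values)
```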

  9. Keys/Values for Example
  • Query: s_i−1, the previous decoder state
  • Keys: h_j, the encoder annotations
  • Values: h_j (the same annotations)
  • Compare: α_ji(h_j, s_i−1), computed by a neural net, i.e. additive attention
  • Context vector: c_i = Σ_j α_ji h_j
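  A sketch of the additive compare used in this example, assuming the usual Bahdanau-style parametrization score_j = v · tanh(W_s s_i−1 + W_h h_j); the weight names W_s, W_h, v and the dimensions below are illustrative assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def additive_attention(s_prev, H, W_s, W_h, v):
    """Additive compare: alpha_ji = softmax_j( v . tanh(W_s s_{i-1} + W_h h_j) ),
    context c_i = sum_j alpha_ji h_j (the annotations serve as keys and values)."""
    scores = np.array([v @ np.tanh(W_s @ s_prev + W_h @ h_j) for h_j in H])
    alpha = softmax(scores)
    return alpha @ H, alpha

d_s, d_h, d_a, n = 5, 4, 6, 7            # decoder, annotation, alignment dims; source length
s_prev = np.random.randn(d_s)            # decoder state s_{i-1} (the query)
H = np.random.randn(n, d_h)              # encoder annotations h_1..h_n (keys and values)
W_s, W_h, v = np.random.randn(d_a, d_s), np.random.randn(d_a, d_h), np.random.randn(d_a)
c_i, alpha = additive_attention(s_prev, H, W_s, W_h, v)
```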

  10. Outline
  • Motivating example and definition
    Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. “Neural machine translation by jointly learning to align and translate.” In International Conference on Learning Representations, 2015. arXiv:1409.0473 [cs.CL]
  • Generalizations and a little theory
    Yoon Kim, Carl Denton, Luong Hoang, and Alexander M. Rush. “Structured attention networks.” In International Conference on Learning Representations, 2017. arXiv:1702.00887 [cs.CL]
  • Why attention might be better than RNNs and CNNs
    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. “Attention is all you need.” In 31st Conference on Neural Information Processing Systems (NIPS 2017). arXiv:1706.03762 [cs.CL]

  11. Structured Attention
  • What if we know the learned attention should have a particular structure? E.g.:
  • Each output of the decoder should attend to a connected subsequence of the encoder input (character-to-word conversion).
  • The output sequence is organized as a tree (sentence parsing, equation input and output).

  12. Structured Attention
  • The attention weights α_i define a probability distribution. Write the context vector as an expectation over a latent variable z ∈ {1, ..., n}:
    c = E_{z∼p(z|x,q)}[f(x, z)] = Σ_{i=1}^n p(z = i | x, q) x_i, with f(x, z) = x_z and p(z = i | x, q) = α_i(k, q)
  • Generalize this by adding more latent variables and changing the annotation function. Add structure by dividing z into cliques C:
    c = E_{z∼p(z|x,q)}[f(x, z)] = Σ_C E_{z_C∼p(z_C|x,q)}[f_C(x, z_C)]
    p(z | x, q; θ) = softmax(Σ_C θ_C(z_C))
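  A small NumPy check of the first identity: with a single categorical latent z and f(x, z) = x_z, the expectation E_z[f(x, z)] is exactly the familiar attention-weighted sum. The scores below are random placeholders for α_i(k, q):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

n, d = 6, 4
x = np.random.randn(n, d)                 # values x_1..x_n
theta = np.random.randn(n)                # stand-in for the scores alpha_i(k, q)
p_z = softmax(theta)                      # p(z = i | x, q), a categorical distribution

c = sum(p_z[i] * x[i] for i in range(n))  # E_z[f(x, z)] with f(x, z) = x_z
assert np.allclose(c, p_z @ x)            # identical to the usual weighted sum
```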

  13. Subsequence Attention
  • (a) The original unstructured attention network.
  • (b) One independent binary latent variable per input, z_i ∈ {0, 1}, with f(x, z) = Σ_i 1{z_i = 1} x_i and p(z_i = 1 | x, q) = sigmoid(θ_i):
    c = E_{z_1,...,z_n}[f(x, z)] = Σ_{i=1}^n p(z_i = 1 | x, q) x_i
  • (c) The probability of each z_i depends on its neighbors through pairwise potentials:
    p(z_1, ..., z_n) = softmax(Σ_{i=1}^{n−1} θ_{i,i+1}(z_i, z_{i+1}))
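  A sketch of case (b), where the marginals factorize and the expectation is just a sigmoid-weighted sum; case (c) would additionally need forward-backward over the pairwise potentials θ_{i,i+1} to compute the marginals, which is not shown. The scores θ_i are random placeholders (in practice they are computed from x and the query q):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# case (b): one independent binary z_i per input
n, d = 6, 4
x = np.random.randn(n, d)
theta = np.random.randn(n)       # placeholder scores; really a function of x and q
p_on = sigmoid(theta)            # p(z_i = 1 | x, q)

# c = E_{z_1..z_n}[ sum_i 1{z_i = 1} x_i ] = sum_i p(z_i = 1 | x, q) x_i;
# unlike softmax attention, these weights need not sum to 1.
c = p_on @ x
```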

  14. Subsequence Attention
  [Figure: attention maps from models (a), (b), and (c), compared with the ground truth.]

  15. Tree Attention
  • Task: [figure omitted]
  • Latent variables z_ij = 1 if symbol i is the parent of symbol j:
    p(z | x, q) = softmax(1{z is a valid tree} Σ_{i≠j} 1{z_ij = 1} θ_ij)
  • Each symbol gets a context vector that attends to its parent in the tree:
    c_j = Σ_{i=1}^n p(z_ij = 1 | x, q) x_i
  • There is no input query in this case, since a symbol’s parent doesn’t depend on the decoder’s location.
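  A sketch of the readout step only, assuming the marginals p(z_ij = 1 | x, q) are already available (in the paper they come from a distribution over valid trees; the random numbers below are placeholders for that inference step):

```python
import numpy as np

n, d = 5, 4
x = np.random.randn(n, d)                        # one annotation per symbol

# placeholder marginals: p_parent[i, j] = p(z_ij = 1 | x, q),
# the probability that symbol i is the parent of symbol j
p_parent = np.random.rand(n, n)
np.fill_diagonal(p_parent, 0.0)                  # no symbol is its own parent
p_parent /= p_parent.sum(axis=0, keepdims=True)  # each symbol has exactly one parent

# c_j = sum_i p(z_ij = 1 | x, q) x_i : a soft lookup of symbol j's parent
c = p_parent.T @ x                               # shape (n, d), one context per symbol
```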

  16. Tree Attention
  [Figure: attention maps from simple attention vs. structured (tree) attention.]

  17. Outline
  • Motivating example and definition
    Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. “Neural machine translation by jointly learning to align and translate.” In International Conference on Learning Representations, 2015. arXiv:1409.0473 [cs.CL]
  • Generalizations and a little theory
    Yoon Kim, Carl Denton, Luong Hoang, and Alexander M. Rush. “Structured attention networks.” In International Conference on Learning Representations, 2017. arXiv:1702.00887 [cs.CL]
  • Why attention might be better than RNNs and CNNs
    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. “Attention is all you need.” In 31st Conference on Neural Information Processing Systems (NIPS 2017). arXiv:1706.03762 [cs.CL]

  18. Attention Is All You Need
  • Can we replace CNNs and RNNs with attention for sequential tasks?
  • Self-attention: the sequence itself supplies the query, the keys, and the values.
  • Stack attention layers: the output of an attention layer is a sequence, which is fed into the next layer.
  • Attention loses positional information, so it must be injected as an additional input.
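  A minimal single-head self-attention sketch of the idea on this slide: the same sequence supplies queries, keys, and values through learned projections, and stacking just means feeding the output sequence into the next layer. The projection names and sizes are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Self-attention: queries, keys, and values are all projections of the
    same sequence X (shape n x d); every position attends to every position."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])     # scaled dot-product compare
    weights = softmax(scores, axis=-1)          # each row sums to 1
    return weights @ V                          # new sequence of the same length

n, d = 7, 16
X = np.random.randn(n, d)
W_q, W_k, W_v = (np.random.randn(d, d) for _ in range(3))
Y = self_attention(X, W_q, W_k, W_v)            # stack layers by feeding Y onward
```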

  19. Attention Is All You Need
  [Figure: the Transformer encoder-decoder architecture, with callouts:]
  • The decoder outputs probabilities for just the next word.
  • Regular attention: keys and values come from the encoder, the query from the decoder.
  • All linear layers are applied per position, with weight sharing.
  • The encoder and decoder blocks are each stacked a fixed number N of times.
  • Decoder self-attention is masked to prevent attending to words that were written later.
  • Self-attention: keys, values, and queries all come from the previous layer.
  • Sinusoids of different frequencies are added pointwise to the input features.
  • The encoder input is the entire source sequence, of size n × d_model.
  • The decoder input is the sequence generated so far.
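  Two of the callouts above in code form: the sinusoidal positional signal added pointwise to the inputs, and the decoder mask that blocks attention to later positions. This follows the formulas in the paper, but the function names are my own:

```python
import numpy as np

def positional_encoding(n, d_model):
    """Sinusoids of different frequencies, added pointwise to the n x d_model
    input features so the model can see token positions."""
    pos = np.arange(n)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / np.power(10000.0, 2.0 * i / d_model)
    pe = np.zeros((n, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def causal_mask(n):
    """Added to the decoder's self-attention scores before the softmax so that
    position i cannot attend to positions j > i (words written later)."""
    mask = np.zeros((n, n))
    mask[np.triu_indices(n, k=1)] = -np.inf
    return mask

X = np.random.randn(10, 32)
X = X + positional_encoding(10, 32)    # inject positional information
M = causal_mask(10)                    # use as: softmax(Q @ K.T / sqrt(d_k) + M)
```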

  20. Multi-Head Attention
  [Figure: the multi-head attention block, with callouts:]
  • Learn linear projections of the queries, keys, and values into h separate vectors of size d_model/h.
  • Run h separate multiplicative attention steps, scaling each dot product by 1/(d_model/h)^{1/2}.
  • After concatenating the h outputs, the dimension is d_model again.
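  A sketch of the block above: h heads, each with its own d_model/h-sized projections and its own scaled dot-product attention, concatenated back to d_model (the paper's final output projection is omitted here). Names and sizes are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, heads):
    """Run h independent scaled dot-product attention steps in d_model/h-sized
    subspaces and concatenate the results back to d_model columns."""
    outputs = []
    for W_q, W_k, W_v in heads:
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)
        outputs.append(weights @ V)
    return np.concatenate(outputs, axis=-1)

n, d_model, h = 6, 32, 4
X = np.random.randn(n, d_model)
heads = [tuple(np.random.randn(d_model, d_model // h) for _ in range(3)) for _ in range(h)]
Y = multi_head_attention(X, heads)             # shape (n, d_model)
```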

  21. Self Attention
  • Why? Self-attention improves long-range correlations and parallelization, and sometimes reduces complexity.
  [Table from the paper comparing complexity per layer, sequential operations, and maximum path length for self-attention, recurrent, convolutional, and restricted self-attention layers. Notation: n = sequence length, d = representation length, k = kernel size, r = restriction size.]
  • RNNs and CNNs apply a d × d weight matrix at each position; attention only needs length-d dot products.
  • Self-attention lets the whole sequence attend to every position, so long-range interactions take a single step.
  • Convolutions only achieve short paths when using dilated convolutions; otherwise the maximum path length is O(n/k).

  22. Attention is All You Need

  23. Other Cool Things
  • Image captioning: like translation, but replace the encoder with a CNN. One can see where the network is ‘looking’.
    Kelvin Xu, Jimmy Lei Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. “Show, attend and tell: neural image caption generation with visual attention.” In International Conference on Machine Learning, 2015. arXiv:1502.03044 [cs.LG]
  • Hard attention: sample from the probability distribution instead of taking the expectation value. This is no longer differentiable, so train it as an RL algorithm where choosing the attention target is an action.
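  A toy contrast between the soft (expectation) and hard (sampled) versions of attention; the distribution below is random, standing in for whatever the attention network would produce:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

n, d = 6, 4
x = np.random.randn(n, d)               # candidate locations/features to attend to
p = softmax(np.random.randn(n))         # attention distribution over locations

c_soft = p @ x                          # soft attention: differentiable expectation
i = np.random.choice(n, p=p)            # hard attention: sample one location
c_hard = x[i]                           # not differentiable w.r.t. p, so treat the
                                        # choice as an action and train with RL
```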

  24. Summary
  • Attention is an architecture-level construct for sequence analysis.
  • It is essentially a learned, differentiable dictionary look-up.
  • More generally, it is an input-dependent, learned probability distribution over latent variables that annotate the output values.
  • It gives better long-range correlations and parallelization than RNNs, often at lower complexity.
  • It produces human-interpretable intermediate data.
