
Soft Attention Models in Deep Networks

Praveen Krishnan

CVIT, IIIT Hyderabad

June 21, 2017

Everyone knows what attention is. It is the taking possession of the mind, in clear and vivid form, of one out of what seem several simultaneously possible objects or trains of thought. Focalization, concentration of consciousness are of its essence... — William James, 1890


Outline

◮ Motivation
◮ Primer on prediction using RNNs
◮ Handwriting prediction [Read]
◮ Handwriting synthesis [Write]
◮ Deep Recurrent Attentive Writer [DRAW]


Motivation

Few questions to begin

◮ How do we perceive an image and start interpreting it?
◮ Given knowledge of two languages, how do we manually translate between their sentences?
◮ . . .

Figure 1: Left: Source Wikipedia, Right: Bahdanau et al. ICLR’15


Motivation

Why Attention?

◮ You don’t see every pixel!
◮ You remove the clutter and process the salient parts.
◮ You process one step at a time and aggregate the information in your memory.


Attention Mechanism

Definition

In cognitive neuroscience, attention is viewed as a neural system for the selection of information, similar in many ways to the visual, auditory, or motor systems [Posner 1994].

Visual Attention Components [Tsotsos, et al. 1995]

◮ Selection of a region of interest in the visual field.
◮ Selection of the feature dimensions and values of interest.
◮ Control of information flow through the visual cortex.
◮ Shifting from one selected region to the next in time.


Attention Mechanism

Attention in Neural Networks

An architecture-level feature of neural networks that allows the model to attend to different parts of an image sequentially and aggregate information over time.

Types of attention

◮ Hard: pick a discrete location to attend to. However, the resulting model is non-differentiable.
◮ Soft: spread the attention weights over the entire image.

In this talk, we limit ourselves to soft attention models, which are differentiable and can be trained with standard back-propagation. Before we dig deeper, let’s brush up on RNNs.


Recurrent Neural Network

Figure 2: An unrolled recurrent neural network¹

RNNs

◮ A neural network with loops.
◮ Captures temporal information.
◮ The issue of long-term dependencies is addressed by gated units (LSTMs, GRUs, . . .).
◮ A wide range of applications: image captioning, speech processing, language modelling, . . .

¹colah’s blog, Understanding LSTM Networks.


LSTM

Long Short-term Memory Cell

$$\begin{aligned}
f_t &= \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f) \\
i_t &= \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i) \\
c_t &= f_t c_{t-1} + i_t \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c) \\
o_t &= \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o) \\
h_t &= o_t \tanh(c_t)
\end{aligned}$$

◮ Uses memory cells to store information.
◮ The above version uses peephole connections.
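
To make the update rule concrete, here is a minimal NumPy sketch of a single peephole-LSTM step following the equations above; the dictionary keys mirror the weight subscripts, and the peephole weights $W_{cf}, W_{ci}, W_{co}$ are assumed diagonal (applied elementwise), as is standard for this variant.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One peephole-LSTM step; W['xf'] plays the role of W_xf, etc.
    Peephole weights W['cf'], W['ci'], W['co'] are vectors (diagonal)."""
    f_t = sigmoid(W['xf'] @ x_t + W['hf'] @ h_prev + W['cf'] * c_prev + b['f'])
    i_t = sigmoid(W['xi'] @ x_t + W['hi'] @ h_prev + W['ci'] * c_prev + b['i'])
    c_t = f_t * c_prev + i_t * np.tanh(W['xc'] @ x_t + W['hc'] @ h_prev + b['c'])
    o_t = sigmoid(W['xo'] @ x_t + W['ho'] @ h_prev + W['co'] * c_t + b['o'])
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t
```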

Let us now see how we can use them for prediction.


LSTMs as Prediction Network

Given an input sample $x_t$, predict the next sample $x_{t+1}$.

Prediction Problem

Learn a distribution $P(x_{t+1} \mid y_t)$, where $x = (x_1, \ldots, x_T)$ is the input sequence, passed through $N$ hidden layers $h^n = (h^n_1, \ldots, h^n_T)$ to produce an output sequence $y = (y_1, \ldots, y_T)$.


LSTMs as Prediction Network

Choice of predictive distribution (Density Modeling)

The probability the network assigns to the input sequence $x$ is

$$P(x) = \prod_{t=1}^{T} P(x_{t+1} \mid y_t)$$

and the sequence loss is

$$\mathcal{L}(x) = -\sum_{t=1}^{T} \log P(x_{t+1} \mid y_t)$$

Training is done through back-propagation through time. For example, in text prediction one can parameterize the output distribution using a softmax function.
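
As a concrete illustration of the softmax case, here is a minimal NumPy sketch of the sequence loss for next-symbol prediction; the names and shapes are my own, not from the slides.

```python
import numpy as np

def sequence_loss(logits, targets):
    """logits: (T, V) raw network outputs y_1..y_T over a vocabulary of size V;
    targets: (T,) integer indices of the observed next symbols x_2..x_{T+1}.
    Returns L(x) = -sum_t log P(x_{t+1} | y_t) under a softmax output layer."""
    z = logits - logits.max(axis=1, keepdims=True)   # numerically stable log-softmax
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].sum()
```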


Handwriting Prediction

Problem

Given online handwriting data (recorded pen-tip locations $x_1, x_2$) at time step $t$, predict the location of the pen at time step $t + 1$, along with an end-of-stroke variable.

Figure 3: Left: Samples of online handwriting data from multiple authors. Right: Demo²

²Carter et al., Experiments in Handwriting with a Neural Network, Distill, 2016.


Handwriting Prediction

Mixture Density Outputs [Graves arxiv’13]

A mixture of bivariate Gaussians is used to predict $(x_1, x_2)$, while a Bernoulli distribution is used for $x_3$:

$$x_t \in \mathbb{R} \times \mathbb{R} \times \{0, 1\}$$

$$y_t = \left(e_t, \{\pi^j_t, \mu^j_t, \sigma^j_t, \rho^j_t\}_{j=1}^{M}\right)$$

◮ $e_t \in (0, 1)$ is the end-of-stroke probability,
◮ $\pi^j \in (0, 1)$ are the mixture weights,
◮ $\mu^j \in \mathbb{R}^2$ are the mean vectors,
◮ $\sigma^j > 0$ are the standard deviations, and
◮ $\rho^j \in (-1, 1)$ are the correlations.

Note that $x_1, x_2$ are now offsets from the previous pen location, and the parameters above are obtained by normalizing the raw network outputs.
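
A sketch of this normalization step; the layout of the raw output vector (one end-of-stroke logit followed by six values per component) is an assumption for illustration, not Graves’ exact ordering.

```python
import numpy as np

def mdn_params(y_hat, M):
    """Map a raw output vector of size 1 + 6*M to valid mixture parameters."""
    e = 1.0 / (1.0 + np.exp(y_hat[0]))      # end-of-stroke probability in (0, 1)
    p = y_hat[1:].reshape(M, 6)
    pi = np.exp(p[:, 0] - p[:, 0].max())
    pi /= pi.sum()                          # softmax: mixture weights sum to 1
    mu = p[:, 1:3]                          # means, unconstrained in R^2
    sigma = np.exp(p[:, 3:5])               # standard deviations > 0
    rho = np.tanh(p[:, 5])                  # correlations in (-1, 1)
    return e, pi, mu, sigma, rho
```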


Handwriting Prediction

Mixture Density Outputs [Graves arxiv’13]

The probability of the next input is

$$P(x_{t+1} \mid y_t) = \left( \sum_{j=1}^{M} \pi^j_t \, \mathcal{N}(x_{t+1} \mid \mu^j_t, \sigma^j_t, \rho^j_t) \right) \times \begin{cases} e_t & \text{if } (x_{t+1})_3 = 1 \\ 1 - e_t & \text{otherwise} \end{cases}$$

As shown earlier, the sequence loss is

$$\mathcal{L}(x) = -\sum_{t=1}^{T} \left[ \log \left( \sum_{j=1}^{M} \pi^j_t \, \mathcal{N}(x_{t+1} \mid \mu^j_t, \sigma^j_t, \rho^j_t) \right) + \begin{cases} \log e_t & \text{if } (x_{t+1})_3 = 1 \\ \log(1 - e_t) & \text{otherwise} \end{cases} \right]$$
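
For completeness, the bivariate Gaussian density $\mathcal{N}(x \mid \mu, \sigma, \rho)$ appearing in the mixture has the standard closed form, sketched below.

```python
import numpy as np

def bivariate_normal(x, mu, sigma, rho):
    """Density of a 2D Gaussian with means mu (2,), std devs sigma (2,),
    and correlation rho (scalar), in the standard closed form."""
    z1 = (x[0] - mu[0]) / sigma[0]
    z2 = (x[1] - mu[1]) / sigma[1]
    z = z1**2 + z2**2 - 2.0 * rho * z1 * z2
    norm = 2.0 * np.pi * sigma[0] * sigma[1] * np.sqrt(1.0 - rho**2)
    return np.exp(-z / (2.0 * (1.0 - rho**2))) / norm
```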


Handwriting Prediction

Visualization

Figure 4: Heat map showing the mixture density outputs for handwriting prediction.


Handwriting Prediction

Demo

Available at: Link

Figure 5: Carter et al., Experiments in Handwriting with a Neural Network, Distill, 2016.


Handwriting synthesis [Graves arxiv’13]

HW Synthesis

Generation of handwriting conditioned on an input text.

Key Question

Ques: How do we resolve the alignment problem between two sequences of varying length?

Sol: Add “attention” as a soft window that is convolved with the input text and given as input to the prediction network, so the model learns to decide which character to write next.


Handwriting synthesis

The soft window $w_t$ into the text $c$ at timestep $t$ is defined as

$$\phi(t, u) = \sum_{k=1}^{K} \alpha^k_t \exp\left(-\beta^k_t (\kappa^k_t - u)^2\right)$$

$$w_t = \sum_{u=1}^{U} \phi(t, u) \, c_u$$

$\phi(t, u)$ acts as the window weight on $c_u$ (the one-hot encoding of the $u$-th character) at time $t$. The soft attention is modelled by a mixture of $K$ Gaussians, where $\kappa_t$ gives the location, $\beta_t$ the width, and $\alpha_t$ the importance of each Gaussian.
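
A NumPy sketch of the window computation, assuming the mixture parameters for step $t$ have already been produced by the network:

```python
import numpy as np

def soft_window(alpha, beta, kappa, c):
    """alpha, beta, kappa: (K,) mixture parameters at step t;
    c: (U, V) matrix of one-hot character encodings. Returns w_t of shape (V,)."""
    u = np.arange(1, c.shape[0] + 1)                 # character positions 1..U
    phi = (alpha[:, None]
           * np.exp(-beta[:, None] * (kappa[:, None] - u) ** 2)).sum(axis=0)
    return phi @ c                                   # w_t = sum_u phi(t, u) c_u
```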


Handwriting synthesis

Window Parameters

$$(\hat{\alpha}_t, \hat{\beta}_t, \hat{\kappa}_t) = W_{h^1 p} h^1_t + b_p$$

$$\alpha_t = \exp(\hat{\alpha}_t), \quad \beta_t = \exp(\hat{\beta}_t), \quad \kappa_t = \kappa_{t-1} + \exp(\hat{\kappa}_t)$$

Figure 6: Alignment between the text sequence and handwriting.
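
A short sketch of the parameter mapping: the exponentials keep $\alpha_t, \beta_t$ positive, and $\kappa_t$ accumulates, so the window can only move forward along the text.

```python
import numpy as np

def window_params(raw, kappa_prev):
    """raw: (3, K) rows (alpha_hat, beta_hat, kappa_hat) from the linear layer;
    kappa_prev: (K,) previous window locations."""
    alpha = np.exp(raw[0])                  # importance of each Gaussian
    beta = np.exp(raw[1])                   # width of each Gaussian
    kappa = kappa_prev + np.exp(raw[2])     # monotonically advancing location
    return alpha, beta, kappa
```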


Handwriting synthesis

Qualitative Results: Questions

◮ How is stochasticity induced in the generation of different samples?
◮ How do we decide when the network has finished writing the text?
◮ How do we control the quality of the writing?
◮ How do we generate handwriting in a particular style?


Handwriting synthesis

Biased Sampling vs. Primed Sampling


Deep Recurrent Attentive Writer (DRAW)

DRAW

Combines a spatial attention mechanism with a sequential variational auto-encoding framework for the iterative construction of complex images.

Figure 7: MNIST digits drawn using recurrent attention model.


Deep Recurrent Attentive Writer (DRAW)

Major contribution

◮ Progressive Refinement (Temporal): Suppose $C$ is the canvas on which the image is drawn. The joint distribution $P(C)$ can be split over latent intermediate canvases $C_1, C_2, \ldots, C_{T-1}$, with $C_T$ the observed variable:

$$P(C) = P(C_T \mid C_{T-1}) \, P(C_{T-1} \mid C_{T-2}) \cdots P(C_1 \mid C_0) \, P(C_0)$$

◮ Spatial Attention (Spatial): Drawing one part of the canvas at a time, which simplifies the drawing process by deciding “where to look” and “where to write”.

Figure 8: Recurrence relation


Deep Recurrent Attentive Writer (DRAW)

Figure 9: Left: Traditional VAE network, Right: DRAW network

◮ Encoder and decoder are both recurrent networks.
◮ The encoder sees the previous output of the decoder so it can tailor its current output, while the decoder’s outputs are successively added to the distribution from which the image is generated.
◮ A dynamic attention mechanism decides “where to read” and “where to write”.


Deep Recurrent Attentive Writer (DRAW)

Training

$$\begin{aligned}
\hat{x}_t &= x - \sigma(c_{t-1}) \\
r_t &= \mathrm{read}(x_t, \hat{x}_t, h^{dec}_{t-1}) \\
h^{enc}_t &= \mathrm{RNN}^{enc}(h^{enc}_{t-1}, [r_t, h^{dec}_{t-1}]) \\
z_t &\sim Q(Z_t \mid h^{enc}_t) \\
h^{dec}_t &= \mathrm{RNN}^{dec}(h^{dec}_{t-1}, z_t) \\
c_t &= c_{t-1} + \mathrm{write}(h^{dec}_t)
\end{aligned}$$

Here $\sigma(x) = \frac{1}{1 + e^{-x}}$ is the logistic sigmoid function, and the latent distribution is taken to be a diagonal Gaussian $\mathcal{N}(Z_t \mid \mu_t, \sigma_t)$, where

$$\mu_t = W(h^{enc}_t), \quad \sigma_t = \exp\left(W(h^{enc}_t)\right)$$
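
A schematic sketch of one training step; read, write, the two RNNs, and the linear maps are stand-ins for learned modules, and the latent sample uses the usual reparameterization trick.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def draw_step(x, c_prev, h_enc_prev, h_dec_prev, m, rng):
    """m is a dict of callables/matrices standing in for learned modules."""
    x_hat = x - sigmoid(c_prev)                          # error image
    r = m['read'](x, x_hat, h_dec_prev)                  # attended glimpse
    h_enc = m['rnn_enc'](h_enc_prev, np.concatenate([r, h_dec_prev]))
    mu = m['W_mu'] @ h_enc
    sigma = np.exp(m['W_sigma'] @ h_enc)                 # diagonal Gaussian Q
    z = mu + sigma * rng.standard_normal(mu.shape)       # z_t ~ Q(Z_t | h_enc_t)
    h_dec = m['rnn_dec'](h_dec_prev, z)
    c = c_prev + m['write'](h_dec)                       # additive canvas update
    return c, h_enc, h_dec, mu, sigma
```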


Deep Recurrent Attentive Writer (DRAW)

Loss Function

The target generative model is $D(X \mid c_T)$, where $c_T$ is the final canvas matrix. The reconstruction loss $\mathcal{L}^x$ is

$$\mathcal{L}^x = -\log D(x \mid c_T)$$

and the latent loss $\mathcal{L}^z$ is defined over the sequence of latent distributions:

$$\mathcal{L}^z = \sum_{t=1}^{T} \mathrm{KL}\left( Q(Z_t \mid h^{enc}_t) \,\|\, P(Z_t) \right)$$

If $P(Z_t)$ is assumed to be $\mathcal{N}(0, 1)$, there is a closed-form solution:

$$\mathcal{L}^z = \frac{1}{2} \sum_{t=1}^{T} \left( \mu_t^2 + \sigma_t^2 - \log \sigma_t^2 \right) - \frac{T}{2}$$
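
The closed form translates directly into code; the sketch below sums over latent dimensions and time steps (for a $d$-dimensional latent the constant becomes $Td/2$, which carries no gradient in any case).

```python
import numpy as np

def latent_loss(mus, sigmas):
    """mus, sigmas: lists of (d,) arrays, one pair per time step t = 1..T."""
    T = len(mus)
    kl = sum((mu**2 + s**2 - np.log(s**2)).sum() for mu, s in zip(mus, sigmas))
    return 0.5 * kl - T / 2.0    # constant term as written on the slide
```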


Deep Recurrent Attentive Writer (DRAW)

Generation and Testing

$$\begin{aligned}
\tilde{z}_t &\sim P(Z_t) \\
\tilde{h}^{dec}_t &= \mathrm{RNN}^{dec}(\tilde{h}^{dec}_{t-1}, \tilde{z}_t) \\
\tilde{c}_t &= \tilde{c}_{t-1} + \mathrm{write}(\tilde{h}^{dec}_t) \\
\tilde{x} &\sim D(X \mid \tilde{c}_T)
\end{aligned}$$
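
Generation only needs the decoder: sample latents from the prior and accumulate the canvas. A sketch, assuming $D$ is Bernoulli with means $\sigma(\tilde{c}_T)$ (a common choice for binary images):

```python
import numpy as np

def generate(T, rnn_dec, write, h0, c0, latent_dim, rng):
    """Run the decoder for T steps with prior samples; no encoder needed."""
    h_dec, c = h0, c0
    for _ in range(T):
        z = rng.standard_normal(latent_dim)   # z_t ~ P(Z_t) = N(0, I)
        h_dec = rnn_dec(h_dec, z)
        c = c + write(h_dec)                  # additive canvas update
    return 1.0 / (1.0 + np.exp(-c))           # Bernoulli means sigmoid(c_T)
```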


Deep Recurrent Attentive Writer (DRAW)

Selective Attention Model

◮ Similar to the differentiable attention mechanisms used in handwriting synthesis (Graves arxiv’13), the Neural Turing Machine (Graves et al. arxiv’14), and Neural Machine Translation (Bahdanau et al. ICLR’15).
◮ A 2D form of attention using an array of 2D Gaussian filters.


Deep Recurrent Attentive Writer (DRAW)

Defining Attention Parameters

Assume an image of size $A \times B$, on which we place an $N \times N$ grid of Gaussian filters, positioned at grid centre $(g_X, g_Y)$ with stride $\delta$, which controls the “zoom” of the patch. The mean location of filter $(i, j)$ is

$$\mu^i_X = g_X + (i - N/2 - 0.5)\,\delta$$

$$\mu^j_Y = g_Y + (j - N/2 - 0.5)\,\delta$$

In addition, we have $\sigma^2$, the variance of the filters, and a scalar intensity $\gamma$ that multiplies each filter response. The attention parameters are produced by a linear transformation:

$$(\tilde{g}_X, \tilde{g}_Y, \log \sigma^2, \log \tilde{\delta}, \log \gamma) = W(h^{dec})$$

The scaling of the parameters is chosen to ensure that the initial patch roughly covers the entire image.


Deep Recurrent Attentive Writer (DRAW)

Defining Filterbank Matrices

$$F_X[i, a] = \frac{1}{Z_X} \exp\left(-\frac{(a - \mu^i_X)^2}{2\sigma^2}\right), \quad F_Y[j, b] = \frac{1}{Z_Y} \exp\left(-\frac{(b - \mu^j_Y)^2}{2\sigma^2}\right)$$

Here $Z_X$ and $Z_Y$ are normalization constants that ensure $\sum_a F_X[i, a] = 1$ and $\sum_b F_Y[j, b] = 1$.

Read

$$\mathrm{read}(x, \hat{x}, h^{dec}_{t-1}) = \gamma \left[ F_Y x F_X^T, \; F_Y \hat{x} F_X^T \right]$$

Write

$$\mathrm{write}(h^{dec}_t) = \frac{1}{\hat{\gamma}} \, \hat{F}_Y^T w_t \hat{F}_X$$
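
A NumPy sketch of the filterbank construction and the read operation for a single-channel $B \times A$ image; the indexing conventions (rows indexed by $b$, columns by $a$) are my own.

```python
import numpy as np

def filterbank(g_x, g_y, sigma2, delta, N, A, B):
    """Build F_X of shape (N, A) and F_Y of shape (N, B)."""
    i = np.arange(1, N + 1)
    mu_x = g_x + (i - N / 2 - 0.5) * delta            # filter centres along x
    mu_y = g_y + (i - N / 2 - 0.5) * delta            # filter centres along y
    a = np.arange(1, A + 1)
    b = np.arange(1, B + 1)
    Fx = np.exp(-(a[None, :] - mu_x[:, None]) ** 2 / (2 * sigma2))
    Fy = np.exp(-(b[None, :] - mu_y[:, None]) ** 2 / (2 * sigma2))
    Fx /= np.maximum(Fx.sum(axis=1, keepdims=True), 1e-8)   # rows sum to 1
    Fy /= np.maximum(Fy.sum(axis=1, keepdims=True), 1e-8)
    return Fx, Fy

def read_attn(x, x_hat, Fx, Fy, gamma):
    """x, x_hat: (B, A) image and error image -> two gamma-scaled N x N glimpses."""
    return gamma * np.concatenate([Fy @ x @ Fx.T, Fy @ x_hat @ Fx.T])
```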


Deep Recurrent Attentive Writer (DRAW)

Some Qualitative Results

Figure 10: Cluttered MNIST classification.

Figure 11: SVHN digit generation.


References

Bob Fisher, CVOnline, http://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/SUN1/attn.htm

Eric Jang, Understanding and Implementing Deepmind’s DRAW Model, http://blog.evjang.com/2016/06/understanding-and-implementing.html


Thanks for your attention :)
