Structured Attention Networks

Yoon Kim, Carl Denton, Luong Hoang, Alexander M. Rush
HarvardNLP

1 Deep Neural Networks for Text Processing and Generation
2 Attention Networks
3 Structured Attention Networks
  Computational Challenges


  1-4. Structured Attention Networks for Neural Machine Translation

  5. Motivation: Structured Output Prediction
     Modeling the structured output (i.e. graphical model on top of a neural net) has improved performance (LeCun et al., 1998; Lafferty et al., 2001; Collobert et al., 2011).
     Given a sequence $x = x_1, \ldots, x_T$ and factored potentials $\theta_{i,i+1}(z_i, z_{i+1}; x)$:
     $$p(z_1, \ldots, z_T \mid x; \theta) = \mathrm{softmax}\left(\sum_{i=1}^{T-1} \theta_{i,i+1}(z_i, z_{i+1}; x)\right) = \frac{1}{Z} \exp\left(\sum_{i=1}^{T-1} \theta_{i,i+1}(z_i, z_{i+1}; x)\right)$$
     $$Z = \sum_{z' \in \mathcal{C}} \exp\left(\sum_{i=1}^{T-1} \theta_{i,i+1}(z'_i, z'_{i+1}; x)\right)$$
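As a minimal sketch (not from the slides; the sizes and random potentials are illustrative assumptions), the distribution above can be computed by brute-force enumeration for a tiny chain, which makes the softmax-over-all-label-sequences definition concrete:

```python
import itertools
import numpy as np

T, C = 4, 3                                   # chain length and label-set size (illustrative)
rng = np.random.default_rng(0)
theta = rng.normal(size=(T - 1, C, C))        # theta[i, a, b] = theta_{i,i+1}(a, b; x)

def score(z):
    """Sum of pairwise potentials along one label sequence z."""
    return sum(theta[i, z[i], z[i + 1]] for i in range(T - 1))

seqs = list(itertools.product(range(C), repeat=T))   # all C^T label sequences
Z = sum(np.exp(score(z)) for z in seqs)              # partition function
p = {z: np.exp(score(z)) / Z for z in seqs}          # p(z | x; theta)
assert abs(sum(p.values()) - 1.0) < 1e-8             # it is a proper distribution
```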

  6-10. Example: Part-of-Speech Tagging

  11. Neural CRF for Sequence Tagging (Collobert et al., 2011)

  12. Neural CRF for Sequence Tagging (Collobert et al., 2011)
      Unary potentials $\theta_i(c) = w_c^\top x_i$ come from a neural network.

  13. Inference in Linear-Chain CRF
      Pairwise potentials are simple parameters $b$, so altogether
      $$\theta_{i,i+1}(c, d) = \theta_i(c) + \theta_{i+1}(d) + b_{c,d}$$
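A small sketch of how these potentials might be assembled in practice (shapes and parameter names are assumptions, not the authors' code): neural unary scores plus a learned transition matrix give the full pairwise potential tensor.

```python
import numpy as np

n, C, H = 5, 10, 64                      # sentence length, tag-set size, hidden size
rng = np.random.default_rng(0)
x = rng.normal(size=(n, H))              # token representations from the neural net
W = rng.normal(size=(C, H)) * 0.1        # per-tag weight vectors w_c
b = rng.normal(size=(C, C)) * 0.1        # transition parameters b_{c,d}

unary = x @ W.T                          # unary[i, c] = theta_i(c) = w_c^T x_i
# theta[i, c, d] = theta_i(c) + theta_{i+1}(d) + b[c, d]
theta = unary[:-1, :, None] + unary[1:, None, :] + b[None, :, :]
print(theta.shape)                       # (n - 1, C, C)
```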

  14. 1 Deep Neural Networks for Text Processing and Generation
      2 Attention Networks
      3 Structured Attention Networks
        Computational Challenges
        Structured Attention In Practice
      4 Conclusion and Future Work

  15. Structured Attention Networks: Notation
      $x_1, \ldots, x_T$ : memory bank
      $q$ : query
      $z = z_1, \ldots, z_T$ : memory selection over structures
      $p(z \mid x, q; \theta)$ : attention distribution over structures
      $f(x, z)$ : annotation function (neural representation)
      $c = \mathbb{E}_{z \sim p(z \mid x, q)}[f(x, z)]$ : context vector
      Need to calculate
      $$c = \sum_{i=1}^{T} p(z_i = 1 \mid x, q)\, x_i$$
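The context vector is just an expectation of the memory under the attention marginals. A minimal sketch, assuming the marginals have already been computed (random values stand in for them here):

```python
import numpy as np

T, H = 6, 128
rng = np.random.default_rng(0)
memory = rng.normal(size=(T, H))        # memory bank x_1, ..., x_T
marginals = rng.random(size=T)          # stand-in for p(z_i = 1 | x, q); unlike softmax
                                        # weights these need not sum to 1

context = marginals @ memory            # c = sum_i p(z_i = 1 | x, q) * x_i, shape (H,)
```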

  16-18. Challenge: End-to-End Training
      Requirements:
      1 Compute the attention distribution (marginals) $p(z_i \mid x, q; \theta)$ ⇒ forward-backward algorithm
      2 Gradients with respect to the attention distribution parameters $\theta$ ⇒ backpropagation through the forward-backward algorithm

  19. Review: Forward-Backward Algorithm
      θ : input potentials (e.g. from a neural network)
      α, β : dynamic programming tables

      procedure ForwardBackward(θ)
        Forward:   for i = 1, ..., n; z_i do
          α[i, z_i] ← Σ_{z_{i-1}} α[i-1, z_{i-1}] × exp(θ_{i-1,i}(z_{i-1}, z_i))
        Backward:  for i = n, ..., 1; z_i do
          β[i, z_i] ← Σ_{z_{i+1}} β[i+1, z_{i+1}] × exp(θ_{i,i+1}(z_i, z_{i+1}))
        Marginals: for i = 1, ..., n; c ∈ C do
          p(z_i = c | x) ← α[i, c] × β[i, c] / Z
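A sketch of these recursions in plain NumPy, kept in ordinary (non-log) space for readability; a practical implementation would use the log-space version on the next slide to avoid numerical under/overflow.

```python
import numpy as np

def forward_backward(theta):
    """theta: (n-1, C, C) pairwise potentials. Returns marginals p(z_i = c | x)."""
    n, C = theta.shape[0] + 1, theta.shape[1]
    alpha = np.zeros((n, C)); alpha[0] = 1.0       # forward table
    beta = np.zeros((n, C)); beta[-1] = 1.0        # backward table
    for i in range(1, n):                          # forward recursion
        alpha[i] = alpha[i - 1] @ np.exp(theta[i - 1])
    for i in range(n - 2, -1, -1):                 # backward recursion
        beta[i] = np.exp(theta[i]) @ beta[i + 1]
    Z = alpha[-1].sum()                            # partition function
    return alpha * beta / Z                        # (n, C) marginals
```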

  20. Structured Attention Networks for Neural Machine Translation

  21. Forward-Backward Algorithm in Practice (Log-Space Semiring Trick)
      x ⊕ y = log(exp(x) + exp(y)),  x ⊗ y = x + y

      procedure ForwardBackward(θ)
        Forward:   for i = 1, ..., n; z_i do
          α[i, z_i] ← ⊕_{z_{i-1}} α[i-1, z_{i-1}] ⊗ θ_{i-1,i}(z_{i-1}, z_i)
        Backward:  for i = n, ..., 1; z_i do
          β[i, z_i] ← ⊕_{z_{i+1}} β[i+1, z_{i+1}] ⊗ θ_{i,i+1}(z_i, z_{i+1})
        Marginals: for i = 1, ..., n; c ∈ C do
          p(z_i = c | x) ← exp(α[i, c] ⊗ β[i, c] ⊗ (−log Z))
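The same recursions in the log-space semiring, sketched with SciPy's logsumexp (an illustration, not the released implementation):

```python
import numpy as np
from scipy.special import logsumexp

def forward_backward_log(theta):
    """theta: (n-1, C, C) log-potentials. Returns marginals p(z_i = c | x)."""
    n, C = theta.shape[0] + 1, theta.shape[1]
    alpha = np.full((n, C), -np.inf); alpha[0] = 0.0     # log-space forward table
    beta = np.full((n, C), -np.inf); beta[-1] = 0.0      # log-space backward table
    for i in range(1, n):
        # ⊕ over z_{i-1} of alpha[i-1, z_{i-1}] ⊗ theta_{i-1,i}(z_{i-1}, z_i)
        alpha[i] = logsumexp(alpha[i - 1][:, None] + theta[i - 1], axis=0)
    for i in range(n - 2, -1, -1):
        beta[i] = logsumexp(theta[i] + beta[i + 1][None, :], axis=1)
    log_Z = logsumexp(alpha[-1])
    return np.exp(alpha + beta - log_Z)
```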

  22. Backpropagating through Forward-Backward
      ∇L_p : gradient of an arbitrary loss L with respect to the marginals p

      procedure BackpropForwardBackward(θ, p, ∇L_p)
        Backprop Backward: for i = n, ..., 1; z_i do
          β̂[i, z_i] ← ∇L_α[i, z_i] ⊕ ⊕_{z_{i+1}} θ_{i,i+1}(z_i, z_{i+1}) ⊗ β̂[i+1, z_{i+1}]
        Backprop Forward:  for i = 1, ..., n; z_i do
          α̂[i, z_i] ← ∇L_β[i, z_i] ⊕ ⊕_{z_{i-1}} θ_{i-1,i}(z_{i-1}, z_i) ⊗ α̂[i-1, z_{i-1}]
        Potential Gradients: for i = 1, ..., n; z_i, z_{i+1} do
          ∇L_{θ_{i,i+1}(z_i, z_{i+1})} ← exp(α̂[i, z_i] ⊗ β[i+1, z_{i+1}] ⊕ α[i, z_i] ⊗ β̂[i+1, z_{i+1}] ⊕ α[i, z_i] ⊗ β[i+1, z_{i+1}] ⊗ (−log Z))
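The slide derives these gradients by hand so they can be computed with the same dynamic program. As a hedged illustration (PyTorch here, not the authors' Torch code), an autodiff framework recovers the same ∇L_θ by backpropagating directly through a differentiable log-space forward-backward:

```python
import torch

def crf_marginals(theta):
    """theta: (n-1, C, C) log-potentials. Returns differentiable marginals p(z_i = c | x)."""
    n, C = theta.shape[0] + 1, theta.shape[1]
    alpha = [torch.zeros(C)]                               # forward table in log space
    for i in range(1, n):
        alpha.append(torch.logsumexp(alpha[-1][:, None] + theta[i - 1], dim=0))
    beta = [torch.zeros(C)]                                # backward table in log space
    for i in range(n - 2, -1, -1):
        beta.insert(0, torch.logsumexp(theta[i] + beta[0][None, :], dim=1))
    log_Z = torch.logsumexp(alpha[-1], dim=0)
    return torch.exp(torch.stack(alpha) + torch.stack(beta) - log_Z)

theta = torch.randn(4, 3, 3, requires_grad=True)
loss = crf_marginals(theta).pow(2).sum()                   # any loss of the marginals
loss.backward()                                            # theta.grad now holds dL/dtheta
```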

  23. Interesting Issue: Negative Gradients Through Attention
      ∇L_p could be negative, but we are working in log space!
      Signed log-space semifield trick (Li and Eisner, 2009):
      use tuples (l_a, s_a) where l_a = log|a| and s_a = sign(a).

      ⊕ rules, with d = exp(l_b − l_a) and l_a ≥ l_b:

        s_a   s_b   l_{a+b}            s_{a+b}
        +     +     l_a + log(1 + d)   +
        +     −     l_a + log(1 − d)   +
        −     +     l_a + log(1 − d)   −
        −     −     l_a + log(1 + d)   −

      (Similar rules for ⊗)
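A small sketch of the ⊕ and ⊗ operations on (log-magnitude, sign) pairs, following the table above (the exact-cancellation guard is an added assumption):

```python
import math

def signed_log_add(a, b):
    """⊕ on (log|a|, sign) pairs: returns (log|a + b|, sign(a + b))."""
    (la, sa), (lb, sb) = a, b
    if la < lb:                          # order so that |a| >= |b|
        (la, sa), (lb, sb) = (lb, sb), (la, sa)
    d = math.exp(lb - la)                # |b| / |a|, always <= 1
    if sa == sb:
        return (la + math.log1p(d), sa)  # same sign: magnitudes add
    if d == 1.0:                         # exact cancellation: a + b = 0 (added guard)
        return (float("-inf"), sa)
    return (la + math.log1p(-d), sa)     # magnitudes subtract; larger term's sign wins

def signed_log_mul(a, b):
    """⊗: log-magnitudes add, signs multiply."""
    (la, sa), (lb, sb) = a, b
    return (la + lb, sa * sb)
```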

  24. Structured Attention Networks for Neural Machine Translation

  25. 1 Deep Neural Networks for Text Processing and Generation
      2 Attention Networks
      3 Structured Attention Networks
        Computational Challenges
        Structured Attention In Practice
      4 Conclusion and Future Work

  26. Implementation
      http://github.com/harvardnlp/struct-attn
      General-purpose structured attention unit
      “Plug-and-play” neural network layers
      Dynamic programming is GPU-optimized for speed

  27. NLP Experiments
      Replace existing attention layers for:
      Machine Translation: Segmental Attention (2-state linear-chain CRF)
      Question Answering: Sequential Attention (N-state linear-chain CRF)
      Natural Language Inference: Syntactic Attention (graph-based dependency parser)

  28. Segmental Attention for Neural Machine Translation
      Use a segmentation CRF for attention, i.e. binary vectors $z$ of length $T$:
      $p(z_1, \ldots, z_T \mid x, q)$ parameterized with a linear-chain CRF.
      Unary potentials (encoder RNN):
      $$\theta_i(k) = \begin{cases} x_i W q & k = 1 \\ 0 & k = 0 \end{cases}$$
      Pairwise potentials (simple parameters): 4 additional binary parameters $b_{0,0}, b_{0,1}, b_{1,0}, b_{1,1}$.
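A sketch of how these potentials could be assembled (shapes are assumptions; the combination with the unaries follows the θ_{i,i+1}(c, d) = θ_i(c) + θ_{i+1}(d) + b_{c,d} form from slide 13):

```python
import numpy as np

T, H = 7, 256
rng = np.random.default_rng(0)
x = rng.normal(size=(T, H))             # encoder RNN states x_1, ..., x_T
q = rng.normal(size=H)                  # decoder query
W = rng.normal(size=(H, H)) * 0.01      # bilinear attention parameters
b = rng.normal(size=(2, 2)) * 0.1       # the four pairwise parameters b_{k,k'}

unary = np.stack([np.zeros(T), x @ W @ q], axis=1)   # theta_i(0) = 0, theta_i(1) = x_i W q
# Pairwise potentials theta_{i,i+1}(k, k') = theta_i(k) + theta_{i+1}(k') + b_{k,k'}
theta = unary[:-1, :, None] + unary[1:, None, :] + b[None, :, :]
# Running forward-backward on theta yields the attention weights p(z_i = 1 | x, q).
```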

  29. Segmental Attention for Neural Machine Translation
      Data: Japanese → English (from WAT 2015)
      Traditionally, word segmentation is done as a preprocessing step.
      Instead, use structured attention to learn an implicit segmentation model.
      Experiments:
        Japanese characters → English words
        Japanese words → English words

  30. Segmental Attention for Neural Machine Translation

                     Simple   Sigmoid   Structured
      Char → Word     12.6     13.1       14.6
      Word → Word     14.1     13.8       14.3

      BLEU scores on the test set (higher is better).
      Models:
        Simple (softmax) attention: softmax(θ_i)
        Sigmoid attention: sigmoid(θ_i)
        Structured attention: ForwardBackward(θ)
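For contrast, a toy sketch of the first two variants on the same unary scores (values are made up); the structured variant would instead run forward-backward over the full CRF potentials, as in the earlier sketches:

```python
import numpy as np

theta_unary = np.array([0.2, 2.0, 1.5, -1.0])        # made-up unary scores theta_i

simple_attn = np.exp(theta_unary - theta_unary.max())
simple_attn /= simple_attn.sum()                     # softmax(theta_i): weights sum to 1

sigmoid_attn = 1.0 / (1.0 + np.exp(-theta_unary))    # sigmoid(theta_i): independent weights
# Structured attention would pass full CRF potentials theta through forward-backward
# and use the marginals p(z_i = 1 | x, q) as the attention weights.
```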

  31. Attention Visualization: Ground Truth

  32. Attention Visualization: Simple Attention

  33. Attention Visualization: Sigmoid Attention

  34. Attention Visualization: Structured Attention

  35. Sequential Attention over Facts for Question Answering
      Simple attention: greedy soft-selection of K supporting facts
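A rough sketch of what greedy soft-selection over K hops can look like (the query-update rule and all names here are illustrative assumptions, not the paper's model):

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

n_facts, H, K = 8, 64, 2
rng = np.random.default_rng(0)
facts = rng.normal(size=(n_facts, H))   # encoded supporting facts
q = rng.normal(size=H)                  # encoded question

for _ in range(K):                      # one independent soft-selection per hop
    attn = softmax(facts @ q)           # attention over facts for this hop
    q = q + attn @ facts                # fold the soft-selected fact into the query
```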
