
SLIDE 1

Bilinear Attention Networks

2018 VQA Challenge runner-up (1st single model)

Jin-Hwa Kim, Jaehyun Jun, Byoung-Tak Zhang

Biointelligence Lab., Seoul National University

SLIDE 2

Visual Question Answering

  • Learning joint representation of multiple modalities
  • Linguistic and visual information

Credit: visualqa.org

SLIDE 3

VQA 2.0 Dataset

  • 204K MS COCO images
  • 760K questions (10 answers for each question from unique AMT workers)
  • Splits: train, val, test-dev (remote validation), test-standard (publications), test-challenge (competitions), and test-reserve (checking overfitting)


  • The VQA score is the average of the 10-choose-9 accuracies (see the scoring table and the sketch below)

Goyal et al., 2017

Split   Images   Questions   Answers
Train   80K      443K        4.4M
Val     40K      214K        2.1M
Test    80K      447K        unknown

# annotations   VQA Score
0               0.0
1               0.3
2               0.6
3               0.9
>3              1.0

https://github.com/GT-Vision-Lab/VQA/issues/1 https://github.com/hengyuan-hu/bottom-up-attention-vqa/pull/18
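
A minimal sketch of this scoring rule, written for illustration (the function name and inputs are assumptions, not the official evaluation code):

    # Average accuracy over all 10-choose-9 subsets of the human answers,
    # where a subset's accuracy is min(#matches / 3, 1).
    from itertools import combinations

    def vqa_score(candidate, human_answers):
        assert len(human_answers) == 10
        scores = [min(sum(a == candidate for a in subset) / 3.0, 1.0)
                  for subset in combinations(human_answers, 9)]
        return sum(scores) / len(scores)

    # 3 matching annotations out of 10 -> 0.9, as in the table above
    print(vqa_score("red", ["red"] * 3 + ["blue"] * 7))  # 0.9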

SLIDE 4

Objective

  • Introducing bilinear attention
  • Interactions between words and visual concepts are meaningful
  • Proposing an efficient method (with the same time complexity) on top of low-rank bilinear pooling
  • Residual learning of attention
  • Residual learning with an attention mechanism for incremental inference
  • Integration of a counting module (Zhang et al., 2018)
SLIDE 5

Preliminary

  • Question embedding (fine-tuned)
  • Use all outputs of the GRU (every time step)
  • X ∈ R^{N×ρ}, where N is the hidden dim. and ρ = |{x_i}| is the # of tokens
  • Image embedding (fixed bottom-up attention)
  • Select 10-100 detected objects (rectangles) using a pre-trained Faster R-CNN to extract rich features for each object (1,600 classes, 400 attributes)
  • Y ∈ R^{M×φ}, where M is the feature dim. and φ = |{y_j}| is the # of objects

https://github.com/peteanderson80/bottom-up-attention

SLIDE 6

Low-rank Bilinear Pooling

  • Bilinear model and its approximation (Wolf et al., 2007, Pirsiavash et al., 2009)
  • Low-rank bilinear pooling (Kim et al., 2017)



 
For a vector output, instead of using three-dimensional tensors for U and V, replace the vector of ones with a pooling matrix P (so only three two-dimensional matrices are used).

f_i = x^T W_i y ≈ x^T U_i V_i^T y = 1^T (U_i^T x ∘ V_i^T y)

f = P^T (U^T x ∘ V^T y)

(∘ denotes element-wise multiplication)
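
A hedged PyTorch sketch of this pooling; the dimension sizes below are assumptions for illustration:

    import torch

    N, M, K, C = 512, 1024, 256, 32          # x dim, y dim, rank, output dim (assumed)
    x, y = torch.randn(N), torch.randn(M)
    U, V = torch.randn(N, K), torch.randn(M, K)
    P = torch.randn(K, C)

    # scalar variant: f_i = x^T U_i V_i^T y = 1^T (U_i^T x * V_i^T y)
    f_scalar = torch.ones(K) @ ((U.T @ x) * (V.T @ y))

    # vector variant: the vector of ones is replaced by the pooling matrix P
    f = P.T @ ((U.T @ x) * (V.T @ y))        # shape (C,)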

SLIDE 7

Unitary Attention

  • This pooling is used to get attention weights, with a question embedding vector (single-channel) and visual feature vectors (multi-channel) as the two inputs.
  • We call it unitary attention since the question embedding vector queries the feature vectors unidirectionally.

[Figure: unitary attention diagram (Kim et al., 2017): Q and V pass through Linear/Conv and Tanh layers, the question embedding is replicated across the visual channels, and a Softmax produces the attention A]
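
A hedged PyTorch sketch of unitary attention in this style; the layer choices and sizes are assumptions for illustration, not the authors' implementation:

    import torch

    N, M, K, phi = 512, 1024, 256, 36     # q dim, v dim, rank, #objects (assumed)
    q = torch.randn(N)                    # single-channel question embedding
    Y = torch.randn(M, phi)               # multi-channel visual features
    U, V = torch.randn(N, K), torch.randn(M, K)
    p = torch.randn(K)

    # the question embedding is replicated (broadcast) across the phi objects
    joint = torch.tanh(U.T @ q).unsqueeze(1) * torch.tanh(V.T @ Y)   # K x phi
    alpha = torch.softmax(p @ joint, dim=0)                          # phi weights
    v_att = Y @ alpha                                                # attended visual feature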

SLIDE 8

Bilinear Attention Maps

  • U and V are linear embeddings
  • p is a learnable projection vector

A := softmax(((1 · p^T) ∘ X^T U) V^T Y)

where X^T U is ρ×K, V^T Y is K×φ, ∘ is element-wise multiplication (with p^T broadcast over the ρ rows), and the resulting attention map A is ρ×φ.

SLIDE 9

Bilinear Attention Maps

  • Exactly the same approach as low-rank bilinear pooling

A := softmax(((1 · p^T) ∘ X^T U) V^T Y)

Each entry of the ρ×φ map is a low-rank bilinear form of one word and one object:

A_{i,j} = p^T ((U^T X_i) ∘ (V^T Y_j))
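
A hedged PyTorch sketch of one bilinear attention map; the concrete sizes and the normalization axis (we take the softmax over all ρ×φ entries) are assumptions for illustration:

    import torch

    N, M, K = 512, 1024, 256       # hidden dims and rank (assumed)
    rho, phi = 14, 36              # #tokens, #objects (assumed)
    X, Y = torch.randn(N, rho), torch.randn(M, phi)
    U, V = torch.randn(N, K), torch.randn(M, K)
    p = torch.randn(K)

    # ((1 p^T) * X^T U) V^T Y: p is broadcast over the rho rows
    logits = ((X.T @ U) * p) @ (V.T @ Y)                       # rho x phi
    A = torch.softmax(logits.flatten(), dim=0).view(rho, phi)  # normalize over all entries

    # entry-wise check: logits[i, j] = p^T ((U^T X_i) * (V^T Y_j))
    i, j = 0, 0
    assert torch.allclose(logits[i, j], p @ ((U.T @ X[:, i]) * (V.T @ Y[:, j])), atol=1e-3)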

SLIDE 10

Bilinear Attention Maps

  • Multiple bilinear attention maps are acquired by different projection vectors p_g as:

A_g := softmax(((1 · p_g^T) ∘ X^T U) V^T Y)
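
The same sketch extended to multiple glimpses; stacking the projection vectors p_g into a matrix P gives every map in one einsum (the glimpse count G and other sizes are assumptions):

    import torch

    N, M, K, G = 512, 1024, 256, 8         # dims, rank, #glimpses (assumed)
    rho, phi = 14, 36
    X, Y = torch.randn(N, rho), torch.randn(M, phi)
    U, V = torch.randn(N, K), torch.randn(M, K)
    P = torch.randn(K, G)                  # one column p_g per glimpse

    logits = torch.einsum('rk,kg,kf->grf', X.T @ U, P, V.T @ Y)         # G x rho x phi
    A = torch.softmax(logits.reshape(G, -1), dim=1).reshape(G, rho, phi)  # one map per glimpse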

SLIDE 11

Bilinear Attention Networks

  • Each multimodal joint feature is filled in with the following equation (k is the index over K; broadcasting in PyTorch lets you avoid a for-loop here, as in the sketch below):

f'_k = (X^T U')_k^T A (Y^T V')_k

where (X^T U')_k^T is 1×ρ, A is ρ×φ, and (Y^T V')_k is φ×1.

※ broadcasting: tensor operations repeated automatically at the API level, supported by NumPy, TensorFlow, and PyTorch

SLIDE 12

Bilinear Attention Networks

  • One can show that this is equivalent to a bilinear attention model where each feature is pooled by low-rank bilinear approximation

f'_k = (X^T U')_k^T A (Y^T V')_k
     = Σ_{i=1}^{|{x_i}|} Σ_{j=1}^{|{y_j}|} A_{i,j} (X_i^T U'_k)(V'_k^T Y_j)
     = Σ_{i=1}^{|{x_i}|} Σ_{j=1}^{|{y_j}|} A_{i,j} X_i^T (U'_k V'_k^T) Y_j

where each rank-1 matrix U'_k V'_k^T is a low-rank bilinear pooling of X_i and Y_j.
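
A small numeric check of this equivalence (tiny assumed sizes so the triple loop stays fast):

    import torch

    N, M, K, rho, phi = 8, 9, 5, 3, 4
    X, Y = torch.randn(N, rho), torch.randn(M, phi)
    U2, V2 = torch.randn(N, K), torch.randn(M, K)   # U', V'
    A = torch.rand(rho, phi)

    f = torch.einsum('ik,ij,jk->k', X.T @ U2, A, Y.T @ V2)

    f_sum = torch.zeros(K)
    for k in range(K):
        for i in range(rho):
            for j in range(phi):
                # A_ij * X_i^T (U'_k V'_k^T) Y_j: a rank-1 bilinear form
                f_sum[k] += A[i, j] * (X[:, i] @ torch.outer(U2[:, k], V2[:, k]) @ Y[:, j])
    assert torch.allclose(f, f_sum, atol=1e-4)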

SLIDE 13

Bilinear Attention Networks

  • One can show that this is equivalent to a bilinear attention model where each feature is pooled by low-rank bilinear approximation
  • Low-rank bilinear feature learning inside bilinear attention

(equation as on Slide 12)

SLIDE 14

Bilinear Attention Networks

  • One can show that this is equivalent to a bilinear attention model where each feature is pooled by low-rank bilinear approximation
  • Low-rank bilinear feature learning inside bilinear attention
  • Similarly to MLB (Kim et al., ICLR 2017), activation functions can be applied

SLIDE 15

Bilinear Attention Networks

[Figure: BAN architecture: X and Y enter Linear (+ReLU) embeddings with transposes, a Softmax produces the bilinear attention map, and stacked Linear + ReLU layers pool the joint feature into the classifier]

SLIDE 16

Time Complexity

  • Assuming that M ≥ N > K > φ ≥ ρ, the time complexity of bilinear attention networks is O(KMφ), where K denotes the hidden size, since BAN consists of matrix chain multiplications
  • Empirically, BAN takes 284s/epoch while the unitary attention control takes 190s/epoch
  • This is largely due to the increased input size of the softmax function, from φ to φ × ρ
SLIDE 17

Residual Learning of Attention

  • Residual learning exploits multiple attention maps (f_0 is X and each α_i is fixed to one):

f_{i+1} = BAN_i(f_i, Y; A_i) · 1^T + α_i f_i

where A_i is the i-th bilinear attention map, BAN_i is the i-th bilinear attention network, multiplying by 1^T repeats the joint feature {# of tokens} times, and α_i f_i is the shortcut.
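
A hedged sketch of this residual loop; ban_block is a stand-in for BAN_i (it returns a random joint feature here), and all sizes are assumptions:

    import torch

    N, M, rho, phi, n_glimpse = 512, 1024, 14, 36, 4
    X, Y = torch.randn(N, rho), torch.randn(M, phi)

    def ban_block(f, Y, A):
        """Placeholder for BAN_i(f_i, Y; A_i): pools f and Y under A into an N-dim vector."""
        return torch.randn(N)   # stand-in; a real block does the bilinear pooling above

    f = X  # f_0 := X, and each alpha_i is fixed to one
    for i in range(n_glimpse):
        A_i = torch.softmax(torch.randn(rho, phi).flatten(), 0).view(rho, phi)
        # BAN_i(f_i, Y; A_i) · 1^T repeats the joint feature over the rho tokens;
        # adding f_i is the shortcut connection
        f = torch.outer(ban_block(f, Y, A_i), torch.ones(rho)) + f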

SLIDE 18

Overview

  • After getting bilinear attention maps, we can stack multiple BANs.

[Figure: two-step overview for the example question "What is the mustache made of?": Step 1 builds the bilinear attention maps from X^T U, p, and V^T Y; Step 2 runs the bilinear attention networks (U'^T X, Y^T V', sum pooling, softmax) with residual learning over multiple maps (Att_1, Att_2), on top of all GRU hidden states and object detection features, feeding an MLP classifier]

SLIDE 19

Multiple Attention Maps

Model                            Validation VQA 2.0 score   +%
Bottom-Up (Teney et al., 2017)   63.37
BAN-1                            65.36                      +1.99
BAN-2                            65.61                      +0.25
BAN-4                            65.81                      +0.20
BAN-8                            66.00                      +0.19
BAN-12                           66.04                      +0.04

  • Single-model validation scores
SLIDE 20

Residual Learning

Model              Validation VQA 2.0 score   +/-
BAN-4 (Residual)   65.81 ±0.09
BAN-4 (Sum)        64.78 ±0.08                -1.03
BAN-4 (Concat)     64.71 ±0.21                -0.07

(each +/- is relative to the row above; Sum uses Σ_i BAN_i(X, Y; A_i), and Concat concatenates the BAN_i(X, Y; A_i) over i)
SLIDE 21

Comparison with Co-attention

Model (BAN-1 setting)   Validation VQA 2.0 score   +/-
Bilinear Attention      65.36 ±0.14
Co-Attention            64.79 ±0.06                -0.57
Attention (unitary)     64.59 ±0.04                -0.20

※ The number of parameters is controlled (all comparison models have 32M parameters).
SLIDE 22

Comparison with Co-attention

※ The number of parameters is controlled (all comparison models have 32M parameters).

[Figure: (a) training and (b) validation curves (validation score vs. epochs 1-19) for Bi-Att, Co-Att, and Uni-Att; (c) validation score (58.0-67.0) vs. number of parameters (0M-60M) for Uni-Att, Co-Att, BAN-1, BAN-4, and BAN-1+MFB]

SLIDE 23

Visualization

SLIDE 24

Integration of Counting module with BAN

  • The counter (Zhang et al., 2018) takes a multinoulli distribution and detected-box info
  • Our method: maxout from the bilinear attention distribution (ρ×φ) to a unitary attention distribution (φ×1), which is easy to adapt to multitask learning

[Figure: the attention logits feed a Softmax (bilinear attention maps) and, in parallel, a Maxout + Sigmoid path into the counting module]

f_{i+1} = (BAN_i(f_i, Y; A_i) + g_i(c_i)) · 1^T + f_i

where g_i is a linear embedding of the i-th output c_i of the counter module, added to the residual learning of attention at each inference step of the multi-glimpse attention.
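
A hedged sketch of the maxout step as we read the slide (the pooling axis and the use of pre-softmax logits are our assumptions, not the authors' exact code):

    import torch

    rho, phi = 14, 36                      # assumed sizes
    logits = torch.randn(rho, phi)         # bilinear attention logits

    A = torch.softmax(logits.flatten(), 0).view(rho, phi)   # bilinear attention map
    a_count = torch.sigmoid(logits.max(dim=0).values)       # phi scores for the counting module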
SLIDE 25

Comparison with State-of-the-arts

Model                              Test-dev VQA 2.0 score   +%
Prior                              25.70
Language-Only                      44.22                    +18.52%
MCB (ResNet; 2016 winner)          61.96                    +17.74%
Bottom-Up (FRCNN; 2017 winner)     65.32                    +3.36%
MFH (ResNet; 2017 runner-up)       65.80                    +0.48%
MFH (FRCNN)                        68.76                    +2.96%
BAN (Ours; FRCNN)                  69.52                    +0.76%
BAN-Glove (Ours; FRCNN)            69.66                    +0.14%
BAN-Glove-Counter (Ours; FRCNN)    70.04                    +0.38%

  • Single model on test-dev score; ResNet/FRCNN denote the image feature
  • Counting feature, "Numbers" category on test-dev: Zhang et al. (2018) 51.62 vs. Ours 54.04

SLIDE 26

Flickr30k Entities

  • Visual grounding task: mapping entity phrases to regions in an image

[/EN#40120/people A girl] in [/EN#40122/clothing a yellow tennis suit], [/EN#40125/other green visor] and [/EN#40128/clothing white tennis shoes] holding [/EN#40124/other a tennis racket] in a position where she is going to hit [/EN#40121/other the tennis ball].

[Figure: predicted grounding boxes for the caption above; one entity scores 0.48 < 0.5]

SLIDE 27

Flickr30k Entities

  • Visual grounding task: mapping entity phrases to regions in an image

[/EN#38656/people A male conductor] wearing [/EN#38657/clothing all black] leading [/EN#38653/people a orchestra] and [/EN#38658/people choir] on [/EN#38659/scene a brown stage] playing and singing [/EN#38664/other a musical number].

slide-28
SLIDE 28

Flickr30k Entities Recall@1,5,10

R@1 R@5 R@10 Zhang et al. (2016) 27.0 49.9 57.7 SCRC (2016) 27.8

  • 62.9

DSPE (2016) 43.89 64.46 68.66 GroundeR (2016) 47.81

  • MCB (2016)

48.69

  • CCA (2017)

50.89 71.09 75.73 Yeh et al. (2017) 53.97

  • Hinami & Satoh (2017)

65.21 BAN (ours) 66.69 84.22 86.35

SLIDE 29

Flickr30k Entities Recall@1 by Entity Type

Type         people   clothing   body parts   animals   vehicles   instruments   scene   other
#Instances   5,656    2,306      523          518       400        162           1,619   3,374

GroundeR (2016)         61.00   38.12   10.33   62.55   68.75   36.42   58.18   29.08
CCA (2017)              64.73   46.88   17.21   65.83   68.75   37.65   51.39   31.77
Yeh et al. (2017)       68.71   46.83   19.50   70.07   73.75   39.50   60.38   32.45
Hinami & Satoh (2017)   78.17   61.99   35.25   74.41   76.16   56.69   68.07   47.42
BAN (ours)              79.90   74.95   47.23   81.85   76.92   43.00   68.69   51.33
SLIDE 30

Challenge Runner-Ups

Joint Runner-Up Team 1

SNU-BI (Challenge Accuracy: 71.69)
Jin-Hwa Kim (Seoul National University), Jaehyun Jun (Seoul National University), Byoung-Tak Zhang (Seoul National University & Surromind Robotics)

Conclusions

  • Bilinear attention networks gracefully extend unitary attention networks, using low-rank bilinear pooling inside bilinear attention.
  • Furthermore, residual learning of attention efficiently exploits multiple attention maps.
  • 2018 VQA Challenge runners-up (shared 2nd place); 1st single model (70.35 on test-standard)

SLIDE 31

Thank you!

Any questions?

We would like to thank Kyoung-Woon On, Bohyung Han, and Hyeonwoo Noh for helpful comments and discussion.
This work was supported by a 2017 Google Ph.D. Fellowship in Machine Learning, a Ph.D. Completion Scholarship from the College of Humanities, Seoul National University, and the Korea government (IITP-2017-0-01772-VTT, IITP-R0126-16-1072-SW.StarLab, 2018-0-00622-RMI, KEIT-10044009-HRI.MESSI, KEIT-10060086-RISF).
Part of the computing resources used in this study was generously shared by Standigm Inc.

arXiv:1805.07932
Code & pretrained model in PyTorch: github.com/jnhwkim/ban-vqa