

  1. Bilinear Attention Networks: 2018 VQA Challenge runner-up (1st single model)
     Jin-Hwa Kim, Jaehyun Jun, Byoung-Tak Zhang
     Biointelligence Lab., Seoul National University

  2. Visual Question Answering • Learning joint representation of multiple modalities • Linguistic and visual information Credit: visualqa.org

  3. VQA 2.0 Dataset
               Images   Questions   Answers
     Train     80K      443K        4.4M
     Val       40K      214K        2.1M
     Test      80K      447K        unknown
   • 204K MS COCO images
   • 760K questions (10 answers for each question from unique AMT workers)
   • Splits: train, val, test-dev (remote validation), test-standard (publications), test-challenge (competitions), test-reserve (check overfitting)

     # annotations matching   VQA Score
     0                        0.0
     1                        0.3
     2                        0.6
     3                        0.9
     >3                       1.0
   • VQA Score is the avg. of the 10-choose-9 accuracies
     https://github.com/GT-Vision-Lab/VQA/issues/1
     https://github.com/hengyuan-hu/bottom-up-attention-vqa/pull/18
     Goyal et al., 2017
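To make the scoring rule concrete, here is a minimal sketch of the 10-choose-9 averaging; the helper name `vqa_score` is hypothetical and this is not the official evaluation code:

```python
from itertools import combinations

def vqa_score(answer, human_answers):
    """VQA score as the average of the 10-choose-9 accuracies (Goyal et al., 2017)."""
    scores = []
    for subset in combinations(human_answers, 9):      # all 10 subsets of 9 annotations
        matches = sum(a == answer for a in subset)
        scores.append(min(matches / 3.0, 1.0))          # min(#matches / 3, 1) per subset
    return sum(scores) / len(scores)

# 0 / 1 / 2 / 3 / >3 matching annotations give 0.0 / 0.3 / 0.6 / 0.9 / 1.0
print(vqa_score("red", ["red"] * 2 + ["blue"] * 8))     # 0.6
```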

  4. Objective • Introducing bilinear attention - Interactions between words and visual concepts are meaningful - Proposing an efficient method (with the same time complexity) on top of low-rank bilinear pooling • Residual learning of attention - Residual learning with attention mechanism for incremental inference • Integration of counting module (Zhang et al., 2018)

  5. Preliminary (https://github.com/peteanderson80/bottom-up-attention)
   • Question embedding (fine-tuned): use all outputs of the GRU (every time step); X ∈ R^{N×ρ}, where N is the hidden dim. and ρ = |{x_i}| is the number of tokens
   • Image embedding (fixed bottom-up-attention): select 10-100 detected objects (rectangles) using a pre-trained Faster R-CNN (1600 classes, 400 attributes) to extract rich features for each object; Y ∈ R^{M×φ}, where M is the feature dim. and φ = |{y_j}| is the number of objects
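A minimal sketch of the two embeddings described above, assuming GloVe-sized word vectors and precomputed bottom-up-attention features; the module names and sizes are illustrative, not the authors' code:

```python
import torch
import torch.nn as nn

# Question side (fine-tuned): keep all GRU outputs, so X has shape rho x N per example.
word_emb = nn.Embedding(num_embeddings=20000, embedding_dim=300)
gru = nn.GRU(input_size=300, hidden_size=1024, batch_first=True)   # N = 1024

tokens = torch.randint(0, 20000, (1, 14))       # rho = 14 question tokens
X, _ = gru(word_emb(tokens))                    # 1 x rho x N, all time steps kept

# Image side (fixed): 10-100 detected boxes from a pre-trained Faster R-CNN,
# here a random stand-in with phi = 36 objects and M = 2048 features per object.
Y = torch.randn(1, 36, 2048)
```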

  6. Low-rank Bilinear Pooling
   • Bilinear model and its approximation (Wolf et al., 2007; Pirsiavash et al., 2009):
     f_i = x^T W_i y = x^T U_i V_i^T y ≈ 1^T (U_i^T x ∘ V_i^T y)
   • Low-rank bilinear pooling (Kim et al., 2017):
     f = P^T (U^T x ∘ V^T y)
     For vector output, instead of making U and V three-dimensional tensors, replace the vector of ones with a pooling matrix P (so only three two-dimensional matrices are used).
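A minimal sketch of low-rank bilinear pooling f = P^T (U^T x ∘ V^T y), with illustrative dimensions (N, M input sizes, K joint rank, C output size) and without the nonlinearities used in the full MLB model:

```python
import torch

N, M, K, C = 1024, 2048, 512, 256
U, V, P = torch.randn(N, K), torch.randn(M, K), torch.randn(K, C)

x = torch.randn(N)                       # e.g. question vector
y = torch.randn(M)                       # e.g. visual feature vector

f = P.T @ ((U.T @ x) * (V.T @ y))        # Hadamard product in the K-dim joint space
print(f.shape)                           # torch.Size([256])
```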

  7. Unitary Attention
   • This pooling is used to get attention weights, with a question embedding vector (single-channel) and the visual feature vectors (multi-channel) as the two inputs.
   • We call it unitary attention since the question embedding vector queries the feature vectors unidirectionally.
   [Diagram: Q and V pass through Linear/Conv + Tanh layers (the question vector is replicated), are merged, and a Softmax produces the attention over the visual features; the attended features and the question are then combined through Linear + Tanh layers to predict A. Kim et al., 2017]

  8. Bilinear Attention Maps
   • U and V are linear embeddings; p is a learnable projection vector
   • A := softmax( ((1 · p^T) ∘ (X^T U)) (V^T Y) ) ∈ R^{ρ×φ}
     where X^T U ∈ R^{ρ×K}, V^T Y ∈ R^{K×φ}, 1 ∈ R^{ρ}, and ∘ is element-wise multiplication

  9. Bilinear Attention Maps
   • Exactly the same approach as low-rank bilinear pooling, applied to every pair of a word and an object:
     A_{i,j} = p^T ( (U^T X_i) ∘ (V^T Y_j) )
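A minimal sketch of the bilinear attention map above; sizes are illustrative, nonlinearities and dropout are omitted, and the softmax is taken over the flattened ρ × φ logits (one distribution over all word-object pairs), which is one reasonable reading of the slide:

```python
import torch
import torch.nn.functional as F

N, M, K, rho, phi = 1024, 2048, 512, 14, 36
X = torch.randn(N, rho)                      # question: N x rho
Y = torch.randn(M, phi)                      # image:    M x phi
U, V = torch.randn(N, K), torch.randn(M, K)
p = torch.randn(K, 1)

logits = ((torch.ones(rho, 1) @ p.T) * (X.T @ U)) @ (V.T @ Y)       # rho x phi
A = F.softmax(logits.view(-1), dim=0).view(rho, phi)

# element-wise form: the (i, j) logit equals p^T ((U^T X_i) ∘ (V^T Y_j))
i, j = 3, 5
expected = (p.T @ ((U.T @ X[:, i]) * (V.T @ Y[:, j]))).squeeze()
assert torch.allclose(logits[i, j], expected, rtol=1e-4, atol=1e-2)
```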

  10. Bilinear Attention Maps
   • Multiple bilinear attention maps (glimpses) are acquired by using different projection vectors p_g:
     A_g := softmax( ((1 · p_g^T) ∘ (X^T U)) (V^T Y) )
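The same sketch extends to multiple glimpses by stacking the projection vectors p_g as columns of a matrix; sizes are illustrative and U, V are shared across glimpses, as in the slide:

```python
import torch
import torch.nn.functional as F

N, M, K, rho, phi, G = 1024, 2048, 512, 14, 36, 8
X, Y = torch.randn(N, rho), torch.randn(M, phi)
U, V = torch.randn(N, K), torch.randn(M, K)
P = torch.randn(K, G)                                   # columns are p_1, ..., p_G

XU, YV = X.T @ U, V.T @ Y                               # rho x K, K x phi
A = [F.softmax((((torch.ones(rho, 1) @ P[:, g:g + 1].T) * XU) @ YV).view(-1), dim=0).view(rho, phi)
     for g in range(G)]                                 # G attention maps A_1, ..., A_G
```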

  11. Bilinear Attention Networks
   • Each element of the multimodal joint feature f′ is computed as (k indexes the K channels; broadcasting in PyTorch lets you avoid a for-loop here, see the sketch below):
     f′_k = (X^T U′)_k^T A (Y^T V′)_k,   with A ∈ R^{ρ×φ}
   ※ broadcasting: tensor operations are automatically repeated at the API level, as supported by NumPy, TensorFlow, and PyTorch
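A minimal sketch of computing all K entries of f′ at once, with a random stand-in attention map and illustrative sizes; the per-channel loop and the vectorized form agree:

```python
import torch

N, M, K, rho, phi = 1024, 2048, 512, 14, 36
X, Y = torch.randn(N, rho), torch.randn(M, phi)
U_, V_ = torch.randn(N, K), torch.randn(M, K)                       # U', V'
A = torch.softmax(torch.randn(rho * phi), dim=0).view(rho, phi)     # stand-in attention map

XU = X.T @ U_                                                       # rho x K
YV = Y.T @ V_                                                       # phi x K

# per-channel form: f'_k = (X^T U')_k^T A (Y^T V')_k
f_loop = torch.stack([XU[:, k] @ A @ YV[:, k] for k in range(K)])

# vectorized form: contract rho with A once, then phi, for all K channels together
f_vec = ((A.T @ XU) * YV).sum(dim=0)

assert torch.allclose(f_loop, f_vec, rtol=1e-4, atol=1e-2)
```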

  12. Bilinear Attention Networks
   • One can show that this is equivalent to a bilinear attention model where each feature is pooled by the low-rank bilinear approximation:
     f′_k = (X^T U′)_k^T A (Y^T V′)_k
          = Σ_{i=1}^{ρ} Σ_{j=1}^{φ} A_{i,j} (X_i^T U′_k)(V′_k^T Y_j)
          = Σ_{i=1}^{ρ} Σ_{j=1}^{φ} A_{i,j} X_i^T (U′_k V′_k^T) Y_j   ← low-rank bilinear pooling

  13. Bilinear Attention Networks
   • One can show that this is equivalent to a bilinear attention model where each feature is pooled by the low-rank bilinear approximation
   • Low-rank bilinear feature learning inside bilinear attention:
     f′_k = Σ_{i=1}^{ρ} Σ_{j=1}^{φ} A_{i,j} (X_i^T U′_k)(V′_k^T Y_j) = Σ_{i=1}^{ρ} Σ_{j=1}^{φ} A_{i,j} X_i^T (U′_k V′_k^T) Y_j

  14. Bilinear Attention Networks • One can show that this is equivalent to a bilinear attention model where each feature is pooled by low-rank bilinear approximation • Low-rank bilinear feature learning inside bilinear attention • Similarly to MLB (Kim et al., ICLR 2017), activation functions can be applied

  15. Bilinear Attention Networks
   [Architecture diagram: X and Y each pass through Linear + ReLU layers; a Linear + Softmax branch produces the attention map, which is combined (via transposes) with the two embedded streams through further Linear + ReLU layers and a final Linear layer before the classifier]

  16. Time Complexity
   • Assuming that M ≥ N > K > φ ≥ ρ, the time complexity of bilinear attention networks is O(KMφ), where K denotes the hidden size, since BAN consists of matrix chain multiplication
   • Empirically, BAN takes 284 s/epoch while the unitary attention control takes 190 s/epoch
   • The gap is largely due to the increase of the Softmax input size, from φ to φ × ρ

  17. Residual Learning of Attention
   • Residual learning exploits multiple bilinear attention maps (f_0 is X and {α_i} is fixed to ones):
     f_{i+1} = BAN_i(f_i, Y; A_i) · 1^T + α_i f_i
     where A_i is the i-th bilinear attention map, BAN_i is the i-th bilinear attention network, · 1^T repeats the joint vector over the ρ tokens, and α_i f_i is the shortcut
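A minimal sketch of the residual update f_{i+1} = BAN_i(f_i, Y; A_i) · 1^T + f_i with α_i fixed to 1 and K = N; ban_step is a stand-in for one bilinear attention network (no nonlinearities), and all names and sizes are illustrative:

```python
import torch

K, M, rho, phi, G = 512, 2048, 14, 36, 4

def ban_step(f, Y, A, U_, V_):
    """One glimpse: attention-weighted low-rank bilinear pooling -> K-dim joint vector."""
    XU, YV = f.T @ U_, Y.T @ V_            # rho x K, phi x K
    return ((A.T @ XU) * YV).sum(dim=0)    # K

f = torch.randn(K, rho)                    # f_0 = X (question representation, N = K)
Y = torch.randn(M, phi)
for i in range(G):
    U_, V_ = torch.randn(K, K), torch.randn(M, K)
    A = torch.softmax(torch.randn(rho * phi), dim=0).view(rho, phi)       # i-th attention map
    f = ban_step(f, Y, A, U_, V_).unsqueeze(1) @ torch.ones(1, rho) + f   # · 1^T + shortcut
```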

  18. Overview
   • After getting the bilinear attention maps, we can stack multiple BANs with residual learning.
   [Pipeline diagram: the question "What is the mustache made of?" is encoded by a GRU keeping all hidden states (X), and object detection provides the visual features Y. Step 1 (Bilinear Attention Maps): X^T U and V^T Y are combined with the projection vector p to produce att_1, att_2, .... Step 2 (Bilinear Attention Networks): U′^T X and Y^T V′ are sum-pooled with att_1 and added back to X via residual learning (K = N), then repeated with U′′^T X′, Y^T V′′, and att_2; the final joint feature feeds an MLP classifier.]

  19. Multiple Attention Maps
   • Single model on validation score
     Model                            VQA 2.0 Validation Score   +%
     Bottom-Up (Teney et al., 2017)   63.37
     BAN-1                            65.36                      +1.99
     BAN-2                            65.61                      +0.25
     BAN-4                            65.81                      +0.20
     BAN-8                            66.00                      +0.19
     BAN-12                           66.04                      +0.04

  20. Residual Learning
     Model                                        VQA 2.0 Validation Score   +/-
     BAN-4 (Residual)                             65.81 ±0.09
     BAN-4 (Sum)      Σ_i BAN_i(X, Y; A_i)        64.78 ±0.08                -1.03
     BAN-4 (Concat)   ‖_i BAN_i(X, Y; A_i)        64.71 ±0.21                -0.07

  21. Comparison with Co-attention
     Model (BAN-1)        VQA 2.0 Validation Score   +/-
     Bilinear Attention   65.36 ±0.14
     Co-Attention         64.79 ±0.06                -0.57
     Unitary Attention    64.59 ±0.04                -0.20
   ※ The number of parameters is controlled (all comparison models have 32M parameters).

  22. Comparison with Co-attention
   [Figure: (left) training and validation curves over epochs 1-19 for Uni-Att, Co-Att, and Bi-Att; (right) validation score vs. number of parameters (0M-60M) for Uni-Att, Co-Att, BAN-1, BAN-4, and BAN-1+MFB]
   ※ The number of parameters is controlled (all comparison models have 32M parameters).

  23. Visualization

  24. Integration of Counting Module with BAN
   • The counter (Zhang et al., 2018) takes a multinoulli distribution and the detected box info
   • Our method: maxout from the bilinear attention distribution (ρ × φ) to a unitary attention distribution over objects (φ), followed by a sigmoid, which makes multitask learning easy to adapt
     f_{i+1} = ( BAN_i(f_i, Y; A_i) + g_i(c_i) ) · 1^T + f_i
     where i indexes the inference steps of the residual learning of multi-glimpse attention, c_i is the i-th output of the counter module, and g_i is its linear embedding
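A minimal sketch of the maxout step described above: the ρ × φ bilinear attention logits are reduced to a per-object (unitary) attention by taking the max over the question tokens, then passed through a sigmoid as input for the counting module; sizes are illustrative and the counter itself is omitted:

```python
import torch

rho, phi = 14, 36
logits = torch.randn(rho, phi)                             # bilinear attention logits
A = torch.softmax(logits.view(-1), dim=0).view(rho, phi)   # bilinear attention map for BAN_i
unitary = torch.sigmoid(logits.max(dim=0).values)          # phi-dim input for the counter
```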

  25. Comparison with State-of-the-arts
   • Number questions on test-dev: Zhang et al. (2018) 51.62, Ours 54.04
   • Single model on test-dev score
     Model                              VQA 2.0 Test-dev Score   +%
     Prior                              25.70
     Language-Only                      44.22                    +18.52%
     MCB (ResNet), 2016 winner          61.96                    +17.74%
     Bottom-Up (FRCNN), 2017 winner     65.32                    +3.36%
     MFH (ResNet), 2017 runner-up       65.80                    +0.48%
     MFH (FRCNN)                        68.76                    +2.96%   ← image feature
     BAN (Ours; FRCNN)                  69.52                    +0.76%   ← attention model
     BAN-Glove (Ours; FRCNN)            69.66                    +0.14%
     BAN-Glove-Counter (Ours; FRCNN)    70.04                    +0.38%   ← counting feature

  26. Flickr30k Entities
   • Visual grounding task: mapping entity phrases to regions in an image
   [Example sentence (0.48 < 0.5): [/EN#40120/people A girl] in [/EN#40122/clothing a yellow tennis suit], [/EN#40125/other green visor] and [/EN#40128/clothing white tennis shoes] holding [/EN#40124/other a tennis racket] in a position where she is going to hit [/EN#40121/other the tennis ball].]

  27. Flickr30k Entities
   • Visual grounding task: mapping entity phrases to regions in an image
   [Example sentence: [/EN#38656/people A male conductor] wearing [/EN#38657/clothing all black] leading [/EN#38653/people a orchestra] and [/EN#38658/people choir] on [/EN#38659/scene a brown stage] playing and singing [/EN#38664/other a musical number].]
