Explainable Neural Computation via Stack Neural Module Networks (July, 2018)
Ronghang Hu, Jacob Andreas, Trevor Darrell, and Kate Saenko, UC Berkeley
Outline
❏ The Problem
❏ Motivation and Importance
❏ The Approach
❏ Module layout
❏ A single heat map highlighting important spatial regions may not tell the full story
❏ Existing modular nets: analyse the question → predict a sequence of predefined modules → predict the answer
❏ But training the layout policy needs supervised module layouts (expert layouts)
❏ Goal: an explicit modular reasoning process with low supervision
Question: There is a small gray block. Are there any spheres to the left of it?
❏ Replace the layout graph with a stack-based data structure
❏ [Instead of making discrete layout choices, this makes the layout soft and continuous → the model can be optimised in a fully differentiable way with SGD]
❏ Steps:
❖ Module layout controller
❖ Neural modules with a memory stack
❖ Soft program execution
Module layout controller

Parameters: W1^(t) ∈ ℝ^(d×d) (time-dependent), W2 ∈ ℝ^(d×2d), W3 ∈ ℝ^(1×d); ct ∈ ℝ^d, w^(t) ∈ ℝ^|M|

❏ The input question Q (S words) is encoded into a sequence [h1, …, hS] (each d-dim) with a BiLSTM; hs is the concatenation of the forward and backward LSTM outputs at the s-th word
❏ The controller runs recurrently from t = 0 to T−1
❏ At each t, it applies a time-dependent linear transform to the question encoding q and linearly combines it with the previous ct−1:
   u = W2 [W1^(t) q + b1 ; ct−1] + b2
❏ A small MLP is applied to u to predict the module weights:
   w^(t) = softmax(MLP(u; θMLP)), with Σ_{m=1}^{|M|} wm^(t) = 1
❏ At each t, the controller also predicts the textual parameter ct via attention over the words:
   cvt,s = softmax_s(W3 (u ⊙ hs)),  ct = Σ_{s=1}^{S} cvt,s · hs
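A minimal NumPy sketch of one controller step, following the equations above. This is an assumption-laden illustration, not the authors' code: the MLP is stood in for by a single linear layer `Wm`, and all variable names are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def controller_step(q, c_prev, h, W1_t, b1, W2, b2, W3, mlp):
    """One timestep of the layout controller (sketch).

    q      : (d,)    question encoding
    c_prev : (d,)    previous textual parameter c_{t-1}
    h      : (S, d)  BiLSTM word encodings [h_1, ..., h_S]
    W1_t   : (d, d)  time-dependent transform for step t
    W2     : (d, 2d), W3 : (d,)  shared weights
    mlp    : callable mapping (d,) -> (|M|,) module logits
    """
    # u = W2 [W1^(t) q + b1 ; c_{t-1}] + b2
    u = W2 @ np.concatenate([W1_t @ q + b1, c_prev]) + b2
    # module weights over the |M| modules: w^(t) = softmax(MLP(u))
    w_t = softmax(mlp(u))
    # textual attention over words: cv_{t,s} = softmax_s(W3 (u * h_s))
    cv = softmax(h @ (W3 * u))
    # c_t = sum_s cv_{t,s} h_s
    c_t = cv @ h
    return w_t, c_t

# toy example with random weights, just to illustrate shapes
d, S, M = 4, 3, 5
rng = np.random.default_rng(0)
q, c_prev = rng.normal(size=d), np.zeros(d)
h = rng.normal(size=(S, d))
W1_t, b1 = rng.normal(size=(d, d)), np.zeros(d)
W2, b2 = rng.normal(size=(d, 2 * d)), np.zeros(d)
W3 = rng.normal(size=d)
Wm = rng.normal(size=(M, d))   # single linear layer standing in for the MLP
w_t, c_t = controller_step(q, c_prev, h, W1_t, b1, W2, b2, W3, lambda u: Wm @ u)
```

The module weights come out as a distribution over the |M| modules, which is what makes the layout choice soft.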
How many objects are right of the blue object? → Answer[how many](Transform[right](Find[blue]))
Memory stack

❏ The stack stores fixed-dimension vectors: a length-L memory array A = {Ai}, i = 1, …, L, plus a stack-top pointer p, an L-dim one-hot vector
❏ Push function (pointer increment + value writing):
   p := 1d_conv(p, [0, 0, 1])
   Ai := Ai · (1 − pi) + z · pi,  i = 1, …, L
❏ Pop function (value reading + pointer decrement):
   z := Σ_{i=1}^{L} Ai · pi
   p := 1d_conv(p, [1, 0, 0])
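The push/pop mechanics above can be sketched in NumPy. Assumptions: `1d_conv` is taken to be a cross-correlation with same-padding, and the bottom of the stack sits at the last index (matching the initialization p = [0, …, 0, 1] described later), so pushing shifts the pointer toward index 0. Edge handling at a full or empty stack is simplified.

```python
import numpy as np

def conv1d(p, kernel):
    # cross-correlation with 'same' padding, standing in for the paper's 1d_conv
    return np.correlate(p, kernel, mode="same")

def push(A, p, z):
    """Soft push: shift the pointer one slot up, then write z at the new top.
    A: (L, D) memory, p: (L,) soft pointer, z: (D,) value."""
    p = conv1d(p, np.array([0., 0., 1.]))                # pointer increment
    A = A * (1 - p)[:, None] + z[None, :] * p[:, None]   # value writing
    return A, p

def pop(A, p):
    """Soft pop: read the value at the top, then shift the pointer one slot down."""
    z = (A * p[:, None]).sum(axis=0)                     # value reading
    p = conv1d(p, np.array([1., 0., 0.]))                # pointer decrement
    return A, p, z

# LIFO check: push two values, pop them back in reverse order
A, p = np.zeros((4, 2)), np.array([0., 0., 0., 1.])      # pointer at the bottom
A, p = push(A, p, np.array([1., 1.]))
A, p = push(A, p, np.array([2., 2.]))
A, p, z2 = pop(A, p)                                     # -> [2., 2.]
A, p, z1 = pop(A, p)                                     # -> [1., 1.]
```

Because the pointer stays a (soft) distribution, both operations are differentiable, which is what lets the whole layout be trained with SGD.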
❏ The stack stores H×W image attention maps
❏ Each module first pops image attention map(s) from the stack → then pushes its result back
❏ E.g. Compare(Find(), Transform(Find())):
   ❖ Find pushes its localization result onto the stack
   ❖ Transform pops one attention map, then pushes the transformed attention
   ❖ Compare pops two image attention maps and uses them to predict the answer
Soft program execution

❏ Thus the model performs a continuous selection of the layout through the module weights wm^(t)
❏ At t = 0: initialize (A, p) with uniform image attention and p = [0, …, 0, 1], i.e. pointing at the bottom of the stack
❏ At every t: execute every module on the current (A^(t), p^(t)); during execution each module m may pop/push to get (Am^(t), pm^(t))
❏ Then use wm^(t) to weight the results, and sharpen the stack pointer with softmax:
   A^(t+1) = Σ_{m=1}^{M} Am^(t) · wm^(t)
   p^(t+1) = softmax(Σ_{m=1}^{M} pm^(t) · wm^(t))
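One soft-execution step can be sketched as follows: every module has already produced its own candidate stack, and the controller's weights blend them. A minimal NumPy illustration (names are mine, not the paper's):

```python
import numpy as np

def soft_step(stacks, pointers, w_t):
    """Blend per-module stack results into the next stack state (sketch).

    stacks   : (M, L, D) A_m^(t), the memory after each module m ran
    pointers : (M, L)    p_m^(t), the resulting soft pointers
    w_t      : (M,)      module weights from the controller (sum to 1)
    """
    # A^(t+1) = sum_m A_m^(t) * w_m^(t)
    A_next = np.einsum('mld,m->ld', stacks, w_t)
    # p^(t+1) = softmax(sum_m p_m^(t) * w_m^(t)): sharpen the averaged pointer
    p_mix = pointers.T @ w_t
    e = np.exp(p_mix - p_mix.max())
    p_next = e / e.sum()
    return A_next, p_next

# toy example: two modules, with all weight on the first
stacks = np.stack([np.ones((3, 2)), np.zeros((3, 2))])
pointers = np.array([[0., 1., 0.], [1., 0., 0.]])
A_next, p_next = soft_step(stacks, pointers, np.array([1., 0.]))
```

With a one-hot weight vector the blend reduces to simply selecting that module's stack, which is the discrete-layout special case.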
❏ VQA: collect outputs from all modules that have answer outputs, across all timesteps:
   y = Σ_{t=0}^{T−1} Σ_{m ∈ M(ans)} ym^(t) · wm^(t),  where M(ans) = the Answer and Compare modules
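The VQA aggregation above is just a double weighted sum, sketched here in NumPy (shapes and names are illustrative assumptions):

```python
import numpy as np

def aggregate_answers(y_logits, w_ans):
    """Weighted sum of answer-module outputs over all timesteps (sketch).

    y_logits : (T, A, K) y_m^(t): K-way answer logits from each of the A
               answer/compare modules in M(ans) at each timestep
    w_ans    : (T, A)    the controller weights w_m^(t) restricted to M(ans)
    """
    # y = sum_{t=0}^{T-1} sum_{m in M(ans)} y_m^(t) * w_m^(t)
    return np.einsum('tak,ta->k', y_logits, w_ans)

# toy check: only (t=0, module 0) carries weight, so its logits pass through
y_logits = np.arange(2 * 2 * 3, dtype=float).reshape(2, 2, 3)
w_ans = np.array([[1., 0.], [0., 0.]])
y = aggregate_answers(y_logits, w_ans)
```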
REF: Take the image-attention map at the top of the final stack at t=T and extract attended image features from this attention map. Then, a linear layer is applied on the attended image feature to predict the bounding box offsets from the feature grid location.
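A simplified sketch of that REF head, assuming a normalized final attention map and hypothetical parameter names `W_box`/`b_box`; the offset parameterization is illustrative:

```python
import numpy as np

def ref_head(img_feat, attn, W_box, b_box):
    """REF output head (sketch): attend the image features with the attention
    map from the top of the final stack, then regress box offsets linearly.

    img_feat : (H, W, D) convolutional image features
    attn     : (H, W)    final top-of-stack attention map (assumed to sum to 1)
    W_box    : (4, D), b_box : (4,)  linear layer for the 4 box offsets
    """
    # attended image feature: sum_{h,w} attn[h,w] * img_feat[h,w]
    attended = np.einsum('hwd,hw->d', img_feat, attn)
    # linear layer predicting bounding-box offsets
    return W_box @ attended + b_box

# toy example
img_feat = np.ones((2, 2, 3))
attn = np.full((2, 2), 0.25)
offsets = ref_head(img_feat, attn, np.eye(4, 3), np.zeros(4))
```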
Joint training can lead to higher performance on both tasks (especially when not using expert layouts)
Best models perform better with supervision but fail to converge without it.
❏ MAC: also performs multi-step sequential reasoning and has image and textual attention at each step
❏ Subjective understanding: can a person tell what each step is doing from its attention?
❏ Forward prediction: can a person tell what the model will predict? [This tells us whether a person can anticipate where the model will go wrong]
[Chart: percentage of each choice in the human studies]
○ Novel idea that increases the applicability of modular neural networks, which are more interpretable.
○ Novel end-to-end differentiable training approach to modular networks.
○ Additional advantage of a reduction in model parameters [PG+EE: 40.4M, TbD-net: 115M, Stack-NMN: 7.32M]
○ Performed ablation study of all the important model components giving reasoning behind model design decisions.
○ Synthetic datasets are known to suffer from biases. An analysis of the created CLEVR-Ref dataset would have been good.
○ How many modules are sufficient? [PG+EE, TbD-net: 39 modules | Stack-NMN: 9 modules]
○ Can the modules themselves be made reusable to decrease parameters?
○ Perhaps learnable generic modules?
○ Could have given a breakdown of accuracy over Count, Compare Numbers, Exist, Query Attribute, and Compare Attribute.
○ Performance on the CLEVR-CoGenT dataset would provide an excellent test of generalization.
○ Could have shown instances where the model goes wrong.
○ A learnable generic module could span all the required operations for a visual reasoning task, e.g. by performing a weighted sum of the outputs of different arithmetic operations applied to the input feature maps x1 and x2.
Reference: R. Hu, J. Andreas, T. Darrell, K. Saenko. Explainable Neural Computation via Stack Neural Module Networks. ECCV, 2018.