Slide Credits: Agrawal
Kolmogorov-Smirnov Test: p(Captions vs. (Q+A)) < 0.001
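For reference, a two-sample Kolmogorov-Smirnov test can be sketched in plain Python as below. The slide does not say which quantity was compared between captions and questions+answers, so the function takes any two numeric samples, and the sample values in the usage lines are made up purely for illustration. The p-value uses the large-sample (asymptotic) Kolmogorov approximation, which is an assumption of this sketch and not necessarily what the authors computed.

```python
import math

def ks_two_sample(xs, ys):
    """Return (D, p): the two-sample KS statistic and an asymptotic p-value."""
    xs, ys = sorted(xs), sorted(ys)
    n, m = len(xs), len(ys)
    i = j = 0
    d = 0.0
    # Walk both sorted samples and track the largest gap between the two ECDFs.
    while i < n and j < m:
        v = min(xs[i], ys[j])
        while i < n and xs[i] == v:
            i += 1
        while j < m and ys[j] == v:
            j += 1
        d = max(d, abs(i / n - j / m))
    # Asymptotic p-value: Q(lam) = 2 * sum_{k>=1} (-1)^(k-1) * exp(-2 k^2 lam^2).
    lam = math.sqrt(n * m / (n + m)) * d
    if lam < 0.3:
        # Q(lam) is numerically 1 here, and the series converges poorly.
        return d, 1.0
    p = 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                  for k in range(1, 101))
    return d, min(max(p, 0.0), 1.0)

# Made-up samples: two clearly separated groups of values.
d_stat, p_value = ks_two_sample(range(0, 50), range(100, 150))
print(d_stat, p_value < 0.001)
```

A small p-value, as on the slide, means the two samples are unlikely to come from the same distribution.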
LSTM: one hidden layer
- output size 1024
- each word size 300
Deeper LSTM: two hidden layers
- output: 2048 > fc+tanh > 1024
Input Vocabulary: all question words
MLP: 2-hidden-layer fc network
- 1000 units
- dropout(0.5)
- tanh
end-to-end learning, cross-entropy
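The MLP settings above (two hidden layers, tanh, dropout 0.5, cross-entropy) can be illustrated with a minimal forward-pass sketch in plain Python. The layer sizes in the usage lines are deliberately tiny so it runs quickly; the slide's sizes would be 1000 units per hidden layer. The weight initialisation, the order of tanh and dropout, and all function names are assumptions of this sketch, not details taken from the slides.

```python
import math
import random

def init_layer(n_in, n_out, rng):
    """Random weights (n_out x n_in) and zero biases; the scale is an assumption."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def linear(x, weights, bias):
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def dropout(x, p, rng, train):
    """Inverted dropout: zero each unit with probability p, rescale the rest."""
    if not train:
        return list(x)
    keep = 1.0 - p
    return [xi / keep if rng.random() < keep else 0.0 for xi in x]

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def mlp_forward(x, layers, rng, p_drop=0.5, train=False):
    """Hidden layers use tanh then dropout; the last layer feeds a softmax."""
    h = x
    for weights, bias in layers[:-1]:
        h = [math.tanh(v) for v in linear(h, weights, bias)]
        h = dropout(h, p_drop, rng, train)
    weights, bias = layers[-1]
    return softmax(linear(h, weights, bias))

def cross_entropy(probs, target_index):
    return -math.log(probs[target_index])

# Tiny illustrative sizes: 8-dim input, two hidden layers of 6 units, 4 answers.
rng = random.Random(0)
layers = [init_layer(8, 6, rng), init_layer(6, 6, rng), init_layer(6, 4, rng)]
probs = mlp_forward([0.1] * 8, layers, rng, train=True)
print(len(probs), cross_entropy(probs, target_index=2))
```

End-to-end learning would backpropagate this loss through the MLP and the question encoder together; the gradient step is omitted here.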
2-Channel VQA Model
[Diagram]
Image > Convolution Layer + Non-Linearity > Pooling Layer > Convolution Layer + Non-Linearity > Pooling Layer > Fully-Connected MLP > Embedding (4096-dim)
Question: “How many horses are in this image?” > Embedding (1024-dim)
Image Embedding + Question Embedding > Neural Network > Softmax over top K answers
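The "softmax over top K answers" output treats answering as classification over the K most frequent training answers. A minimal sketch of building that answer vocabulary is below; the value of K, the lowercasing and whitespace normalisation, and the function names are assumptions for illustration, and the example answers are made up.

```python
from collections import Counter

def build_answer_vocab(train_answers, k):
    """Map each of the k most frequent answers to a class index."""
    counts = Counter(a.strip().lower() for a in train_answers)
    return {answer: idx for idx, (answer, _) in enumerate(counts.most_common(k))}

def encode_answer(answer, vocab):
    """Return the class index, or None if the answer is outside the top K."""
    return vocab.get(answer.strip().lower())

# Made-up training answers for illustration.
answers = ["2", "2", "2", "yes", "Yes", "white"]
vocab = build_answer_vocab(answers, k=2)
print(vocab, encode_answer("white", vocab))
```

Answers outside the top K have no class, so a model of this kind can never produce them; choosing K trades answer coverage against output-layer size.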
Ablation #1: Language-alone
[Diagram]
Question: “How many horses are in this image?” > Question Embedding (1024-dim) > Neural Network > Softmax over top K answers
Image > Convolution Layer + Non-Linearity > Pooling Layer > Convolution Layer + Non-Linearity > Pooling Layer > Fully-Connected MLP (1k output units) > Embedding: image channel, not used in this ablation
Ablation #2: Vision-alone
[Diagram]
Image > Convolution Layer + Non-Linearity > Pooling Layer > Convolution Layer + Non-Linearity > Pooling Layer > Fully-Connected MLP > Embedding (4096-dim) > Neural Network > Softmax over top K answers
Question: “How many horses are in this image?” > Question Embedding: question channel, not used in this ablation
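The full model and the two ablations differ only in which embeddings reach the classifier. The sketch below makes that explicit. Combining the two channels by concatenation is an assumption of this sketch (the slides do not state the fusion method), and the function and mode names are hypothetical.

```python
IMAGE_DIM = 4096     # image embedding size from the slides
QUESTION_DIM = 1024  # question embedding size from the slides

def classifier_input(image_emb, question_emb, mode="full"):
    """Select the embeddings fed to the classifier for each configuration."""
    if mode == "full":
        # Assumed fusion: simple concatenation of the two channels.
        return list(image_emb) + list(question_emb)
    if mode == "language-alone":
        return list(question_emb)  # Ablation #1: drop the image channel
    if mode == "vision-alone":
        return list(image_emb)     # Ablation #2: drop the question channel
    raise ValueError(f"unknown mode: {mode!r}")

# Placeholder embeddings (all zeros) just to show the resulting input sizes.
image_emb = [0.0] * IMAGE_DIM
question_emb = [0.0] * QUESTION_DIM
for mode in ("full", "language-alone", "vision-alone"):
    print(mode, len(classifier_input(image_emb, question_emb, mode)))
```

Comparing accuracy across the three modes shows how much each channel contributes on its own.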