Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering - PowerPoint PPT Presentation

Jiyang Zhang, Tong Gao

February 2020


SLIDE 1

Jiyang Zhang, Tong Gao

February 2020

Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering

SLIDE 2

Background

  • Image captioning and visual question answering are problems combining image and language understanding.
  • To solve these problems, it is often necessary to perform visual processing, or even reasoning, to generate high-quality outputs.
  • Most conventional visual attention mechanisms are of the top-down variety: given some task context, the model attends to the output of one or more layers of a CNN.

SLIDE 3

Problem

  • A CNN processes input regions on a uniform grid, regardless of the content of the image.
  • Attention over this grid may therefore fall on only part of an object.
SLIDE 4

Our Model

  • Top-down mechanism: use task-specific context to predict an attention distribution over the image.
  • Bottom-up mechanism: use Faster R-CNN to propose a set of salient image regions (a sketch of how the two mechanisms compose follows below).
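A minimal sketch of how the two mechanisms compose, assuming PyTorch; the dimensions and module names are illustrative, with `V` standing in for Faster R-CNN region features and `h` for a task-specific context vector such as an LSTM hidden state:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownAttention(nn.Module):
    """Weights bottom-up region features by a task-specific context vector."""
    def __init__(self, feat_dim, ctx_dim, hidden_dim):
        super().__init__()
        self.proj_v = nn.Linear(feat_dim, hidden_dim)
        self.proj_h = nn.Linear(ctx_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, V, h):
        # V: (B, k, feat_dim) region features; h: (B, ctx_dim) context
        a = self.score(torch.tanh(self.proj_v(V) + self.proj_h(h).unsqueeze(1)))
        alpha = F.softmax(a, dim=1)       # attention distribution over k regions
        return (alpha * V).sum(dim=1)     # attended image feature, (B, feat_dim)

# Usage with made-up sizes: 36 regions of 2048 dims, a 1000-dim context.
att = TopDownAttention(feat_dim=2048, ctx_dim=1000, hidden_dim=512)
v_hat = att(torch.randn(1, 36, 2048), torch.randn(1, 1000))
```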

SLIDE 5

Advantages

  • With Faster R-CNN, the model now attends to whole objects.
  • We are able to pre-train it on object detection datasets, leveraging cross-domain knowledge.

SLIDE 6

Overview

  • Bottom-up Attention Model
  • Top-down Attention Model
  • Captioning Model
  • VQA Model
  • Datasets
  • Results
  • Conclusion
  • Critique
  • Discussion
SLIDE 7

Bottom-up Attention Model

SLIDE 8

Bottom-up Attention Model

[Architecture diagram: each region's feature vector is the mean-pooled convolutional feature of its proposal.]

SLIDE 9

Bottom-up Attention Model

  • 5. Final classification score (attributes): the mean-pooled region feature is concatenated with a learned object embedding and passed through a linear layer plus softmax over attribute classes (see the sketch below).

[Diagram: Object Embeddings + region feature -> Linear + Softmax -> Attribute]
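A rough sketch of this attribute head, assuming PyTorch; the class counts follow the paper's cleaned Visual Genome vocabulary (1600 objects, 400 attributes), while the embedding size is an assumption:

```python
import torch
import torch.nn as nn

class AttributeHead(nn.Module):
    """Scores attribute classes for a detected region."""
    def __init__(self, feat_dim=2048, num_objects=1600, embed_dim=300, num_attributes=400):
        super().__init__()
        self.obj_embed = nn.Embedding(num_objects, embed_dim)  # learned object embeddings
        self.classifier = nn.Linear(feat_dim + embed_dim, num_attributes)

    def forward(self, region_feat, obj_class):
        # region_feat: (B, feat_dim) mean-pooled proposal feature
        # obj_class:   (B,) object class index for each region
        x = torch.cat([region_feat, self.obj_embed(obj_class)], dim=-1)
        return self.classifier(x).softmax(dim=-1)  # final attribute scores

head = AttributeHead()
scores = head(torch.randn(4, 2048), torch.randint(0, 1600, (4,)))
```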

SLIDE 10

Captioning Model (Attention LSTM)

[Diagram: at each step, the attention LSTM's input concatenates the language LSTM's output from the last timestep, the mean-pooled image feature, and a learned word embedding of the previously generated word.]

SLIDE 11

Captioning Model (Attention LSTM) (cont.)

SLIDE 12

Captioning Model (Attention LSTM) (cont.)
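A simplified single decoding step, assuming PyTorch and reusing the hypothetical `TopDownAttention` module sketched on the Our Model slide; sizes are illustrative:

```python
import torch
import torch.nn as nn

class CaptionDecoderStep(nn.Module):
    """One step of the two-layer attention/language LSTM decoder."""
    def __init__(self, vocab_size, embed_dim, feat_dim, hidden_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)  # learned word embedding
        self.att_lstm = nn.LSTMCell(hidden_dim + feat_dim + embed_dim, hidden_dim)
        self.lang_lstm = nn.LSTMCell(feat_dim + hidden_dim, hidden_dim)
        self.attention = TopDownAttention(feat_dim, hidden_dim, 512)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, V, prev_word, att_state, lang_state):
        # V: (B, k, feat_dim) region features; prev_word: (B,) word indices
        v_mean = V.mean(dim=1)  # mean-pooled image feature
        # Attention LSTM input: last language-LSTM output, image feature, word embedding
        x1 = torch.cat([lang_state[0], v_mean, self.embed(prev_word)], dim=1)
        h1, c1 = self.att_lstm(x1, att_state)
        v_hat = self.attention(V, h1)  # attend over regions given h1
        h2, c2 = self.lang_lstm(torch.cat([v_hat, h1], dim=1), lang_state)
        return self.out(h2).log_softmax(dim=-1), (h1, c1), (h2, c2)

B, H = 2, 1000
dec = CaptionDecoderStep(vocab_size=10000, embed_dim=300, feat_dim=2048, hidden_dim=H)
state = (torch.zeros(B, H), torch.zeros(B, H))
logp, att_state, lang_state = dec(
    torch.randn(B, 36, 2048), torch.zeros(B, dtype=torch.long), state, state)
```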

SLIDE 13

Objective

SLIDE 14

Objective
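The captioning model is trained in two stages: first with cross-entropy against the ground-truth caption, then by directly optimizing the CIDEr score with self-critical sequence training (SCST). Schematically:

```latex
% Stage 1: cross-entropy on the ground-truth caption y^*_{1:T}
L_{XE}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(y^*_t \mid y^*_{1:t-1}\right)

% Stage 2: self-critical sequence training, a policy gradient on the CIDEr
% reward r, using the greedy decode \hat{y} as the baseline
\nabla_\theta L_{RL}(\theta) \approx
    -\left(r(y^s) - r(\hat{y})\right) \nabla_\theta \log p_\theta(y^s),
    \qquad y^s \sim p_\theta
```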

SLIDE 15

VQA Model

SLIDE 16

VQA Model

[Diagram: the question is truncated to a fixed maximum length before being encoded.]

SLIDE 17

VQA Model

A confidence score is produced for every candidate answer, trained with a binary cross-entropy loss (a minimal sketch follows).
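A minimal sketch of this output layer, assuming PyTorch; the candidate-answer count and fusion dimension are illustrative, and the soft targets stand in for agreement scores derived from the human answers:

```python
import torch
import torch.nn as nn

num_candidates = 3129  # illustrative candidate answer vocabulary size
fusion_dim = 2048      # illustrative joint question+image embedding size

scorer = nn.Linear(fusion_dim, num_candidates)
loss_fn = nn.BCEWithLogitsLoss()

z = torch.randn(8, fusion_dim)           # fused question+image representation
targets = torch.rand(8, num_candidates)  # soft target score in [0, 1] per answer

logits = scorer(z)               # one confidence score per candidate answer
loss = loss_fn(logits, targets)  # binary cross-entropy over all candidates
probs = torch.sigmoid(logits)    # sigmoid: several answers may score highly
```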

SLIDE 18

Dataset

  • Visual Genome dataset
  • used to pretrain the bottom-up attention model
  • contains 108K densely annotated images with objects, attributes, relationships, and visual question answers
  • any images found in both datasets are kept in the same split
  • also used to augment the VQA v2.0 training data
  • Microsoft COCO dataset
  • used for the image captioning task
  • VQA v2.0 dataset
  • used for the visual question answering task
  • attempts to minimize the effectiveness of learning dataset priors by balancing the answers to each question
SLIDE 19

ResNet Baseline

  • To quantify the impact of bottom-up attention.
  • Uses a ResNet CNN pretrained on ImageNet to encode each image in place of the bottom-up attention.
  • Image captioning: use the final convolutional layer of ResNet-101 and resize the output to a fixed 10×10 spatial representation (sketched below).
  • VQA: vary the size of the output representation: 14×14, 7×7, and 1×1.
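A sketch of the baseline's grid features, assuming torchvision; using adaptive average pooling to resize the final convolutional map is an assumption about the resizing step:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet101

# ImageNet-pretrained ResNet-101, keeping only the convolutional trunk
backbone = resnet101(weights="IMAGENET1K_V1")
encoder = nn.Sequential(*list(backbone.children())[:-2])

image = torch.randn(1, 3, 448, 448)
fmap = encoder(image)                        # (1, 2048, H, W) final conv features
grid = nn.AdaptiveAvgPool2d((10, 10))(fmap)  # fixed 10x10 spatial representation
features = grid.flatten(2).transpose(1, 2)   # 100 grid "regions" x 2048 dims
```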

SLIDE 20

Image Caption Results

SLIDE 21

SPICE: Semantic Propositional Image Caption Evaluation

SLIDE 22

[Diagram: captions are parsed into dependency parse trees, which are then mapped to semantic scene graphs.]

SLIDE 23
SLIDE 24

VQA Results

SLIDE 25

VQA Results

SLIDE 26

Qualitative Analysis

SLIDE 27

Errors

SLIDE 28

Critique

  • Word embeddings are randomly initialized in the image captioning task, but GloVe vectors are used in the VQA model. Why the inconsistency?
  • Why not merge overlapping classes when processing the Visual Genome dataset?
  • Perform stemming to reduce the class count (e.g. trees -> tree)
  • Use WordNet to merge synonyms
  • The model submitted to the VQA challenge is trained with additional Q&A pairs from Visual Genome. Cheating?
  • Also, they use an ensemble of 30 models on the test evaluation server?
  • Their image captioning model forces the decoder not to generate the same word twice in a row, but some prepositions can legitimately appear two or more times in a row.
  • Suggestion: only filter repeated nouns.
SLIDE 29

Critique

  • Curious how the number of image features relates to performance: will it be harder to generate captions for more complicated images?
  • Evaluation includes only automatic metrics; the image caption generation task needs more human evaluation of qualities like relevance, expressiveness, concreteness, and creativity.
  • The results need an analysis by question type, e.g. “Is the ...” versus “What is ...” questions. It would also be interesting to show the distribution of question age across the accuracy levels achieved by the system, to estimate at what human age level the model can perform.
  • Other things to try:
  • Is it possible to also apply attention to the words in the question for VQA?
SLIDE 30

Thank you!

SLIDE 31

Non-maximum Suppression
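A minimal NumPy sketch of non-maximum suppression as used to filter overlapping Faster R-CNN proposals; boxes are [x1, y1, x2, y2] rows, and the IoU threshold value is illustrative:

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.7):
    """Greedily keep the highest-scoring boxes, dropping heavy overlaps."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]  # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Intersection of box i with every remaining box
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_threshold]  # discard high-overlap boxes
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 10, 10], [20, 20, 30, 30]], dtype=float)
print(nms(boxes, np.array([0.9, 0.8, 0.7])))  # keeps boxes 0 and 2
```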

SLIDE 32

Why Sigmoid?

  • Each VQA question comes with multiple human answers, so target scores are soft and more than one candidate answer can be correct; a sigmoid scores each candidate independently, which a single softmax over answers cannot.

SLIDE 33

What is SPICE?

  • (a) A young girl standing on top of a tennis court.
  • (b) A giraffe standing on top of a green field.

High n-gram similarity, despite describing different scenes.

  • (c) A shiny metal pot filled with some diced veggies.
  • (d) The pan on the stove has chopped vegetables in it.

Low n-gram similarity, despite describing the same scene.
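SPICE then scores a candidate caption by matching the propositional tuples of its scene graph against those of the references and reporting an F-score. A toy sketch of that final step (real SPICE also matches WordNet synonyms, and the parsing that produces the tuples is omitted):

```python
def spice_f1(candidate_tuples, reference_tuples):
    """F1 over scene-graph tuples such as ("girl",), ("girl", "young"),
    or ("girl", "stand-on", "court")."""
    cand, ref = set(candidate_tuples), set(reference_tuples)
    matched = len(cand & ref)
    if matched == 0:
        return 0.0
    precision = matched / len(cand)
    recall = matched / len(ref)
    return 2 * precision * recall / (precision + recall)

print(spice_f1({("girl",), ("girl", "young")},
               {("girl",), ("girl", "young"), ("court",)}))  # 0.8
```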