Self-Critical Reasoning for Robust Visual Question Answering
Jialin Wu and Raymond J. Mooney
Visual Question Answering (VQA)
- Common VQA system
What utensil is pictured?
[Figure: original image; visual feature set $\mathcal{V}$; answer prediction: Knife (0.72), Fork (0.66)]
Capture superficial statistical correlations between QA pairs
VQA system
Knife
I won't bother to look at the image; I can answer your question just by looking at the question.
What utensil is pictured?
Original image
[Figure: training answer distribution for "knife" and "fork"]
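The shortcut described on this slide can be sketched as a question-only baseline that never reads the image. This is a minimal illustration, not the system from the talk: the function names and the 80/20 answer counts are hypothetical.

```python
from collections import Counter, defaultdict


def fit_question_prior(train_pairs):
    """Count training answers per question string; the image is never used."""
    counts = defaultdict(Counter)
    for question, answer in train_pairs:
        counts[question][answer] += 1
    return counts


def answer_from_prior(counts, question):
    """Return the most frequent training answer for the question, or None."""
    top = counts.get(question, Counter()).most_common(1)
    return top[0][0] if top else None


# Hypothetical toy training set, skewed toward "knife" (counts are made up).
question = "What utensil is pictured?"
train_pairs = [(question, "knife")] * 80 + [(question, "fork")] * 20

prior = fit_question_prior(train_pairs)
prediction = answer_from_prior(prior, question)
```

With a skew like this, the baseline answers "knife" for every image, including one that shows a fork.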
Force VQA to focus on what humans focus on
- Extract a proposal set of objects that humans focus on.
[Figure: proposal object set, obtained from a human visual explanation OR a human textual explanation ("There is a fork near the cake.")]
Force VQA to focus on what humans focus on
- Enforce the gradients for the correct answer to have the largest value
for at least one of the extracted objects.
[Figure: $\nabla_{v} P(\text{fork} \mid Q, \mathcal{V})$ evaluated for the proposal object set]
Influence Strengthen Loss
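The slides do not give the formula for the loss, so the following is only one plausible hinge-style reading of the bullet above. It assumes a per-object influence score for the correct answer has already been computed from gradients. The function name, the hinge form, and the scores are assumptions.

```python
def influence_strengthen_loss(scores, proposal):
    """Hinge-style sketch (assumed form, not taken from the slides).

    scores[i]: gradient-based influence of object i on the correct answer.
    proposal:  indices of the objects humans focus on (must be non-empty).
    The loss is zero when at least one proposal object out-scores every
    non-proposal object.
    """
    others = [j for j in range(len(scores)) if j not in proposal]
    return min(
        sum(max(0.0, scores[j] - scores[i]) for j in others)
        for i in proposal
    )


# Hypothetical influence scores for the answer "fork" over four objects.
scores = [0.30, 0.05, 0.10, -0.02]

# Object 0 is outside the proposal set yet has the largest score: penalized.
loss_violated = influence_strengthen_loss(scores, proposal={1, 2})

# Object 0 is inside the proposal set and has the largest score: no penalty.
loss_satisfied = influence_strengthen_loss(scores, proposal={0, 2})
```

Taking the minimum over proposal objects mirrors the "at least one" wording: only the best-placed proposal object has to win.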
Results
- Compared with the baseline model on the VQA-CP dataset
- The VQA-CP dataset is deliberately split so that the train and test sets have very different answer distributions
[Figure: VQA scores on the "All" category: Baseline vs. Ours (infl)]
Oversensitivity to the most common objects
VQA system
I can focus on the fork, but I still think it is a knife.
What utensil is pictured? Knife
[Figure: focused objects for answer "fork"; focused objects for answer "knife"]
Criticizing the false influential object
- Find the most influential object for the correct answer using gradients
[Figure: for the question "What utensil is pictured?", the answer prediction is Knife (0.72), Fork (0.66). $\nabla_{v} P(\text{fork} \mid Q, \mathcal{V})$ over the visual feature set $\mathcal{V}$ explains the prediction "fork" and identifies the most influential object in the proposal object set, which comes from a human visual explanation OR a human textual explanation ("There is a fork near the cake.").]
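A gradient-based influence score of this kind can be sketched on a toy model. Everything here is an assumption made for illustration: the two-answer scorer, its weights, the three object feature vectors, and the choice to sum the gradient entries per object. Central finite differences stand in for automatic differentiation so the sketch needs only the standard library.

```python
import math

# Hypothetical toy scorer (not the model from the talk): each answer has a
# weight vector, and object i adds (w_a . v_i)^2 / 2 to that answer's logit.
WEIGHTS = {"knife": [1.0, 0.0], "fork": [0.0, 1.0]}


def answer_probs(objects):
    """Softmax over answer logits computed from object feature vectors."""
    logits = {
        answer: sum(
            0.5 * sum(wd * vd for wd, vd in zip(w, v)) ** 2 for v in objects
        )
        for answer, w in WEIGHTS.items()
    }
    top = max(logits.values())
    exps = {a: math.exp(l - top) for a, l in logits.items()}
    total = sum(exps.values())
    return {a: e / total for a, e in exps.items()}


def influence(objects, answer, eps=1e-4):
    """Per-object influence: sum over feature dims of dP(answer)/dv_i[d],
    estimated with central finite differences."""
    scores = []
    for i, v in enumerate(objects):
        total = 0.0
        for d in range(len(v)):
            plus = [list(o) for o in objects]
            minus = [list(o) for o in objects]
            plus[i][d] += eps
            minus[i][d] -= eps
            diff = answer_probs(plus)[answer] - answer_probs(minus)[answer]
            total += diff / (2 * eps)
        scores.append(total)
    return scores


# Hypothetical objects: knife-like, fork-like, and background.
objects = [[2.0, 0.0], [0.0, 1.0], [0.1, 0.1]]
probs = answer_probs(objects)
fork_influence = influence(objects, "fork")
most_influential = max(range(len(objects)), key=lambda i: fork_influence[i])
```

In this toy setup the model still prefers "knife", while the fork-like object (index 1) has the highest influence on "fork".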
Criticizing the false influential object
- Force that object to contribute more to the correct answer than to the incorrect prediction.
[Figure: for the question "What utensil is pictured?", the answer prediction is Knife (0.72), Fork (0.66). For the most influential object in the proposal object set, $\nabla_{v} P(\text{fork} \mid Q, \mathcal{V})$ (explaining prediction "fork") is shown alongside $\nabla_{v} P(\text{knife} \mid Q, \mathcal{V})$ (explaining prediction "knife"). The proposal object set comes from a human visual explanation OR a human textual explanation ("There is a fork near the cake.").]
Self Critical Loss
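The slides name the loss but do not spell it out. One plausible reading, sketched below, picks the proposal object with the highest influence on the correct answer and penalizes any wrong answer whose influence from that object is larger. The function name, the hinge form, the unweighted sum over wrong answers, and the scores are all assumptions.

```python
def self_critical_loss(influence_by_answer, correct, proposal):
    """Sketch of a self-critical penalty (assumed form, not from the slides).

    influence_by_answer[a][i]: gradient-based influence of object i on answer a.
    correct:  the ground-truth answer.
    proposal: indices of the objects humans focus on (must be non-empty).
    """
    correct_scores = influence_by_answer[correct]
    # Most influential proposal object for the correct answer.
    star = max(sorted(proposal), key=lambda i: correct_scores[i])
    return sum(
        max(0.0, scores[star] - correct_scores[star])
        for answer, scores in influence_by_answer.items()
        if answer != correct
    )


# Hypothetical influence scores over three objects; objects 1 and 2 are proposals.
before = {"fork": [0.02, 0.10, 0.01], "knife": [0.25, 0.18, 0.00]}
after = {"fork": [0.02, 0.30, 0.01], "knife": [0.25, 0.18, 0.00]}

# Object 1 supports "knife" more than "fork": penalized.
loss_before = self_critical_loss(before, "fork", proposal={1, 2})

# Object 1 now supports "fork" more than "knife": no penalty.
loss_after = self_critical_loss(after, "fork", proposal={1, 2})
```

The penalty reaches zero once the selected object supports the correct answer at least as strongly as every competing answer.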
Our self-critical approach
VQA system
Fork
Oh, yes, the utensil should be a fork.
What utensil is pictured?
Results
- Compared with the baseline model on the VQA-CP dataset
[Figure: VQA scores on the "All" category: Baseline, Ours (infl), Ours (infl + crit)]