Aishwarya Agrawal, Ph.D. Student, Machine Learning and Perception Lab
2
Identify objects in scene
3
sky, bus, car, stop light, person, building, sidewalk
Identify attributes of objects
4
blue sky, red bus, many cars, green stop light, one bicycle, tall building
Identify activities in scene
5
person wearing a helmet riding bicycle
man walking on sidewalk
Identify the scene
6
street scene
Describe the scene
8
A person on bike going through green light with bus nearby
A giraffe standing in the grass next to a tree.
11
- Answer questions about the scene
– Q: How many buses are there?
– Q: What is the name of the street?
– Q: Is the man on bicycle wearing a helmet?
13
14
Visual Question Answering (VQA)
Task: Given an image and a natural language open-ended question, generate a natural language answer.
15
VQA Task
16
VQA CloudCV Demo
cloudcv.org/vqa/?useVoice=1&listenAnswer=1
17
Applications of VQA
- An aid to visually-impaired
Is it safe to cross the street now?
18
Applications of VQA
- Surveillance
What kind of car did the man in red shirt leave in?
19
Applications of VQA
- Interacting with robot
Is my laptop in my bedroom upstairs?
20
VQA Dataset
21
Real images (from MSCOCO)
Tsung-Yi Lin et al. “Microsoft COCO: Common Objects in COntext.” ECCV 2014. http://mscoco.org/
22
Questions
Stump a smart robot! Ask a question that a human can answer, but a smart robot probably can’t!
23
Two modalities of answering
- Open Ended
- Multiple Choice
24
Open Ended Task
What is the girl holding in her hand? How many mirrors? Why is the girl holding an umbrella?
25
Multiple Choice Task
What is the bus number?
a) 3 b) 1 c) green d) 4 e) window trim f) blue g) m5 h) corn, carrots, onions, rice i) red j) 125 k) san antonio l) sign pen m) 478 n) no o) 25 p) 2 q) yes r) white
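One simple way to answer a multiple-choice question with a model that scores candidate answers is to restrict attention to the listed choices and return the best-scoring one. The sketch below assumes this approach; the slides do not say how the multiple-choice answer is selected, and the function name and scores are hypothetical.

```python
def pick_multiple_choice(scores, choices, default_score=float("-inf")):
    """Return the listed choice with the highest score.

    scores: dict mapping answer string -> model score (hypothetical values here).
    choices: the candidate answers shown for this question.
    Choices the model has no score for receive default_score.
    """
    return max(choices, key=lambda c: scores.get(c, default_score))


# Hypothetical scores, invented for illustration; they are not model output.
scores = {"yes": 0.10, "2": 0.25, "25": 0.20, "red": 0.05}
choices = ["3", "1", "green", "25", "2", "yes", "white"]

print(pick_multiple_choice(scores, choices))  # prints: 2
```

Because the search space shrinks from all possible answers to 18 candidates, the same scores can only do as well or better on the multiple-choice task than on the open-ended one.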
26
Dataset Stats
- >250K images (MSCOCO + 50K Abstract Scenes)
- >750K questions (3 per image)
- ~10M answers (10 w/ image + 3 w/o image)
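As a rough sanity check, the totals above follow from the per-image and per-question counts. The sketch below uses the rounded lower bounds from the slide, so the results are approximate; the variable names are illustrative.

```python
# Rounded figures taken from the slide; names are illustrative only.
images = 250_000                 # >250K images (MSCOCO + Abstract Scenes)
questions_per_image = 3
answers_with_image = 10          # answers given while looking at the image
answers_without_image = 3        # answers given without seeing the image

questions = images * questions_per_image
answers = questions * (answers_with_image + answers_without_image)

print(f"questions: {questions:,}")   # questions: 750,000
print(f"answers:   {answers:,}")     # answers:   9,750,000 (roughly 10M)
```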
27
Please visit www.visualqa.org for more details.
28
Browse the Dataset
http://visualqa.org/browser/
29
Questions
30
Dataset Visualization
http://visualqa.org/visualize/
32
Answers
- 38.4% of questions are binary yes/no
- 98.97% of questions have answers <= 3 words
– 23k unique 1 word answers
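Statistics like these can be computed directly from a list of answer strings. The sketch below is a minimal illustration that assumes one answer per question and simple whitespace tokenization; the function name and the sample answers are made up for the example and are not taken from the dataset.

```python
def answer_stats(answers):
    """Return (yes/no share, share with <= 3 words, number of unique 1-word answers)."""
    normalized = [a.strip().lower() for a in answers]
    n = len(normalized)
    binary = sum(a in {"yes", "no"} for a in normalized)
    short = sum(len(a.split()) <= 3 for a in normalized)
    unique_one_word = {a for a in normalized if len(a.split()) == 1}
    return binary / n, short / n, len(unique_one_word)


# Invented sample answers, for illustration only.
sample = ["yes", "No", "2", "red", "on the table", "a man riding a bicycle"]
binary_share, short_share, one_word_count = answer_stats(sample)
print(binary_share, short_share, one_word_count)
```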
33
Answers
34
2-Channel VQA Model
[Figure: two-channel model. Image channel: Convolution Layer + Non-Linearity, Pooling Layer, Convolution Layer + Non-Linearity, Pooling Layer, Fully-Connected MLP, giving a 4096-dim image embedding. Question channel: “How many horses are in this image?”, giving a 1024-dim question embedding. Both embeddings feed a Neural Network with a Softmax over top K answers.]
36
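The two-channel idea can be sketched in a few lines: an image embedding and a question embedding are combined, passed through a small neural network, and turned into a distribution over K candidate answers with a softmax. The sketch below is only an illustration under stated assumptions. The fusion step (project the image embedding to the question dimension, then multiply element-wise) is an assumption, since the slide only shows both embeddings feeding a neural network. The weights are random, the inputs stand in for the outputs of a CNN and a question encoder, and the sizes are shrunk from the slide's 4096 and 1024 so the demo runs quickly.

```python
import math
import random


def linear(x, weights, bias):
    """Dense layer: weights holds one row of input weights per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(rng, n_in, n_out):
    """Random weights and zero biases for a dense layer (untrained)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def two_channel_forward(image_feat, question_feat, params):
    # Assumption: project the image embedding to the question dimension,
    # then fuse the two channels by element-wise multiplication.
    img = [math.tanh(v) for v in linear(image_feat, *params["img_proj"])]
    fused = [a * b for a, b in zip(img, question_feat)]
    hidden = [math.tanh(v) for v in linear(fused, *params["hidden"])]
    return softmax(linear(hidden, *params["out"]))  # distribution over K answers


# Small sizes for a quick demo; the slide uses 4096 (image) and 1024 (question).
IMG_DIM, Q_DIM, HIDDEN, K = 64, 16, 32, 10
rng = random.Random(0)
params = {
    "img_proj": init_layer(rng, IMG_DIM, Q_DIM),
    "hidden": init_layer(rng, Q_DIM, HIDDEN),
    "out": init_layer(rng, HIDDEN, K),
}
image_feat = [rng.random() for _ in range(IMG_DIM)]     # stand-in for CNN output
question_feat = [rng.random() for _ in range(Q_DIM)]    # stand-in for question encoder output
probs = two_channel_forward(image_feat, question_feat, params)
best_answer_index = max(range(K), key=lambda i: probs[i])
```

Dropping one of the two inputs from `two_channel_forward` gives the flavor of a single-channel ablation: the classifier then sees only the question or only the image.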
Ablation #1: Language-alone
[Figure: model diagram. Image: Convolution Layer + Non-Linearity, Pooling Layer, Convolution Layer + Non-Linearity, Pooling Layer, Fully-Connected MLP, 1k output units. Question: “How many horses are in this image?”, giving a 1024-dim question embedding. Neural Network with a Softmax over top K answers.]
37
Ablation #2: Vision-alone
[Figure: model diagram. Image: Convolution Layer + Non-Linearity, Pooling Layer, Convolution Layer + Non-Linearity, Pooling Layer, Fully-Connected MLP, giving a 4096-dim image embedding. Question: “How many horses are in this image?”, Question Embedding. Neural Network with a Softmax over top K answers.]
38
Accuracy Metric
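The VQA paper scores an open-ended answer as min(# humans that provided that answer / 3, 1), so an answer counts as fully correct once at least 3 of the 10 human annotators gave it. A minimal sketch of that formula follows; the function name and the human answers are invented for illustration, and the string normalization here (strip and lowercase) is a simplification of whatever an official evaluation script applies.

```python
def vqa_accuracy(predicted, human_answers):
    """Accuracy of one answer: min(number of humans who gave it / 3, 1)."""
    target = predicted.strip().lower()
    matches = sum(a.strip().lower() == target for a in human_answers)
    return min(matches / 3.0, 1.0)


# Ten invented human answers to a counting question, for illustration only.
humans = ["2", "2", "two", "2", "3", "2", "2", "two", "2", "2"]

print(vqa_accuracy("2", humans))    # 1.0   (7 humans agree, capped at 1)
print(vqa_accuracy("two", humans))  # 0.666... (2 humans agree)
print(vqa_accuracy("4", humans))    # 0.0   (no human agrees)
```

The cap at 1 means a prediction is not penalized when annotators disagree, as long as a few of them gave the same answer.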
39
Open-Ended Task Accuracies
40
Human vs. Machine performance
[Chart: Human performance vs. Machine performance; annotation: “25.14 room for improvement”]
Results
41
Code available!
- Multiple-Choice > Open-Ended
- Question alone does quite well
- Image helps
Commonsense
- Does this person have 20/20 vision?
42
Does this question need commonsense?
43
Q: How many calories are in this pizza?
How old does a person need to be?
44
Q: How many calories are in this pizza?
Most “commonsense” questions
45
Least “commonsense” questions
46
Spectrum
- 3-4 (15.3%)
– Is that a bird in the sky?
– What color is the shoe?
– How many zebras are there?
– Is there food on the table?
– Is this man wearing shoes?
- 5-8 (39.7%)
– How many pizzas are shown?
– What are the sheep eating?
– What color is his hair?
– What sport is being played?
– Name one ingredient in the skillet.
- 9-12 (28.4%)
– Where was this picture taken?
– What ceremony does the cake commemorate?
– Are these boats too tall to fit under the bridge?
– What is the name of the white shape under the batter?
– Is this at the stadium?
- 13-17 (11.2%)
– Is he likely to get mugged if he walked down a dark alleyway like this?
– Is this a vegetarian meal?
– What type of beverage is in the glass?
– Can you name the performer in the purple costume?
– Besides these humans, what other animals eat here?
- 18+ (5.5%)
– What type of architecture is this?
– Is this a Flemish bricklaying pattern?
– How many calories are in this pizza?
– What government document is needed to partake in this activity?
– What is the make and model of this vehicle?
47
Question       Average Age
what brand     12.5
why            11.18
what type      11.04
what kind      10.55
is this        10.13
what does      10.06
what time      9.81
who            9.58
where          9.54
which          9.32
does           9.29
do             9.23
what is        9.11
what are       9.04
are            8.65
is the         8.52
is there       8.24
what sport     8.06
how many       7.67
what animal    6.74
what color     6.6
48
VQA Age
- Average “age of questions” = 8.98 years.
- Our model =* 4.74 years old!
* age as estimated by untrained crowd-sourced workers
49
VQA Common sense
- Average common sense required = 31%.
- Our best algorithm has* 17% common sense!
* as estimated by untrained crowd-sourced workers
50
VQA Challenges on
www.codalab.org
51
VQA Challenge @ CVPR16
52
VQA Challenge @ CVPR16
53
code available!
VQA Workshop @ CVPR16
54
Papers using VQA
… and many more
55
Dataset: >1k downloads
Code: >1.5k views
Academia, industry, start-ups
56
Conclusions
- VQA: Visual Question Answering
– The next “grand challenge” in vision, language, AI
- Spectrum: Easy to Difficult
– “What room is this?” Scene Recognition
– “How many …” Object Recognition
– …
– “Does this person have 20/20 vision” Common sense
- Exciting times ahead!
57
VQA Team
Aishwarya Agrawal, Virginia Tech
Meg Mitchell, Microsoft Research
Dhruv Batra, Virginia Tech
Larry Zitnick, Facebook AI Research
Jiasen Lu, Virginia Tech
Devi Parikh, Virginia Tech
Stanislaw Antol, Virginia Tech
Akrit Mohapatra, Virginia Tech (Webmaster)
58
Closing Remarks
- CloudCV VQA Exhibition: Booth 101
- Contact email: aish@vt.edu
- Please complete the Presenter Evaluation sent to you by email or through the GTC Mobile App. Your feedback is important!
59
Thanks! Questions?
60
Visual Question Answering (VQA)
61