Ph.D. Student Machine Learning and Perception Lab 2 sky stop - - PowerPoint PPT Presentation

SLIDE 1

Aishwarya Agrawal Ph.D. Student Machine Learning and Perception Lab

SLIDE 2

SLIDE 3

Identify objects in scene

sky, bus, car, stop light, person, building, sidewalk

SLIDE 4

Identify attributes of objects

blue sky, red bus, many cars, green stop light, one bicycle, tall building

SLIDE 5

Identify activities in scene

person wearing a helmet riding bicycle, man walking on sidewalk
SLIDE 6

Identify the scene

street scene

SLIDE 7

Describe the scene

A person on bike going through green light with bus nearby

SLIDE 8

A giraffe standing in the grass next to a tree.

SLIDE 9
  • Answer questions about the scene

    – Q: How many buses are there?
    – Q: What is the name of the street?
    – Q: Is the man on bicycle wearing a helmet?

SLIDE 10

SLIDE 11

Visual Question Answering (VQA)

Task: Given an image and a natural language open-ended question, generate a natural language answer.

SLIDE 12

VQA Task

SLIDE 13

VQA CloudCV Demo

cloudcv.org/vqa/?useVoice=1&listenAnswer=1

SLIDE 14

Applications of VQA

  • An aid to visually-impaired

Is it safe to cross the street now?

SLIDE 15

Applications of VQA

  • Surveillance

What kind of car did the man in red shirt leave in?

SLIDE 16

Applications of VQA

  • Interacting with robot

Is my laptop in my bedroom upstairs?

SLIDE 17

VQA Dataset

SLIDE 18

Real images (from MSCOCO)

Tsung-Yi Lin et al. “Microsoft COCO: Common Objects in COntext.” ECCV 2014. http://mscoco.org/

SLIDE 19

Questions

Stump a smart robot! Ask a question that a human can answer, but a smart robot probably can’t!

SLIDE 20

Two modalities of answering

  • Open Ended
  • Multiple Choice

SLIDE 21

Open Ended Task

What is the girl holding in her hand? How many mirrors? Why is the girl holding an umbrella?

SLIDE 22

Multiple Choice Task

What is the bus number?

a) 3    b) 1    c) green    d) 4    e) window trim    f) blue
g) m5    h) corn, carrots, onions, rice    i) red    j) 125    k) san antonio    l) sign pen
m) 478    n) no    o) 25    p) 2    q) yes    r) white
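
One simple way to answer in the multiple-choice setting is to score every candidate and return the highest-scoring one. The sketch below illustrates that idea only; `pick_choice` is a hypothetical helper, and the scores are made up for illustration rather than taken from any model or from the slides.

```python
def pick_choice(scores, choices):
    """Return the candidate answer with the highest score.

    `scores` maps answer strings to model scores (hypothetical values here).
    Candidates the model has no score for are treated as 0.0.
    """
    return max(choices, key=lambda c: scores.get(c, 0.0))


# The 18 candidates shown on the slide for "What is the bus number?"
choices = ["3", "1", "green", "4", "window trim", "blue", "m5",
           "corn, carrots, onions, rice", "red", "125", "san antonio",
           "sign pen", "478", "no", "25", "2", "yes", "white"]

# Made-up scores, for illustration only.
scores = {"yes": 0.05, "no": 0.04, "25": 0.30, "125": 0.22, "478": 0.12}

answer = pick_choice(scores, choices)
print(answer)
```

Restricting the choice to the listed candidates means an open-ended model can be reused unchanged; only the final selection step differs.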

SLIDE 23

Dataset Stats

  • >250K images (MSCOCO + 50K Abstract Scenes)

  • >750K questions (3 per image)
  • ~10M answers (10 w/ image + 3 w/o image)
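
The three figures above are consistent with each other, which a few lines of arithmetic confirm. The sketch uses the rounded lower bounds from the slide (250K images, 3 questions per image, 10 + 3 answers per question), so the totals are approximate.

```python
# Rounded figures from the slide; the actual counts are somewhat larger.
images = 250_000
questions_per_image = 3
answers_with_image = 10     # answers collected while showing the image
answers_without_image = 3   # answers collected without showing the image

questions = images * questions_per_image
answers = questions * (answers_with_image + answers_without_image)

print(f"questions: {questions:,}")  # questions: 750,000
print(f"answers:   {answers:,}")    # answers:   9,750,000 (about 10M)
```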

SLIDE 24

Please visit www.visualqa.org for more details.

SLIDE 25

Browse the Dataset

http://visualqa.org/browser/

SLIDE 26

Questions

SLIDE 27

Dataset Visualization

http://visualqa.org/visualize/

SLIDE 28

Answers

  • 38.4% of questions are binary yes/no
  • 98.97% questions have answers <= 3 words

– 23k unique 1 word answers
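
Statistics of this kind can be computed directly from a list of answer strings. The sketch below runs on a tiny made-up sample (not VQA data) and assumes whitespace tokenization and lowercase normalization, which may differ from how the reported numbers were produced.

```python
# Tiny made-up sample of answers, for illustration only.
answers = ["yes", "no", "2", "red", "yes", "on the table", "tennis", "no",
           "stop sign", "because it is raining outside"]


def normalize(answer):
    """Lowercase and trim an answer (an assumed, minimal normalization)."""
    return answer.strip().lower()


normalized = [normalize(a) for a in answers]

# Fraction of answers that are binary yes/no.
yes_no_fraction = sum(a in {"yes", "no"} for a in normalized) / len(normalized)

# Fraction of answers with at most 3 whitespace-separated words.
short_fraction = sum(len(a.split()) <= 3 for a in normalized) / len(normalized)

# Distinct one-word answers.
unique_one_word = {a for a in normalized if len(a.split()) == 1}

print(yes_no_fraction, short_fraction, len(unique_one_word))
```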

SLIDE 29

Answers

SLIDE 30

2-Channel VQA Model

Image → Convolution Layer + Non-Linearity → Pooling Layer → Convolution Layer + Non-Linearity → Pooling Layer → Fully-Connected MLP (4096-dim) → Embedding

Question (“How many horses are in this image?”) → Embedding (1024-dim)

Embeddings → Neural Network → Softmax over top K answers
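
The data flow in the diagram can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the released VQA code: the 4096-dim image feature and 1024-dim question embedding come from the slide, while the element-wise product used to fuse the two channels, the tanh non-linearities, the single hidden layer, K = 1000 answers, and all function and variable names are assumptions. Random weights and random inputs stand in for a trained network, a real CNN, and a real question encoder.

```python
import numpy as np

rng = np.random.default_rng(0)

IMG_FEAT_DIM = 4096  # image feature size (from the slide)
EMBED_DIM = 1024     # question embedding size (from the slide); also used
                     # for the image embedding so the two can be fused (assumed)
HIDDEN_DIM = 1000    # hidden layer width (assumed)
TOP_K = 1000         # number of candidate answers (assumed)


def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(z - z.max())
    return e / e.sum()


# Random weights stand in for trained parameters.
W_img = rng.normal(scale=0.01, size=(EMBED_DIM, IMG_FEAT_DIM))
W_hid = rng.normal(scale=0.01, size=(HIDDEN_DIM, EMBED_DIM))
W_out = rng.normal(scale=0.01, size=(TOP_K, HIDDEN_DIM))


def answer_distribution(img_feat, q_embed):
    """Map image features and a question embedding to answer probabilities."""
    img_embed = np.tanh(W_img @ img_feat)  # image channel: 4096 -> 1024
    fused = img_embed * q_embed            # assumed fusion: element-wise product
    hidden = np.tanh(W_hid @ fused)        # the "Neural Network" box (one layer here)
    return softmax(W_out @ hidden)         # softmax over the top K answers


img_feat = rng.normal(size=IMG_FEAT_DIM)  # placeholder for CNN features
q_embed = rng.normal(size=EMBED_DIM)      # placeholder for the encoded question
probs = answer_distribution(img_feat, q_embed)
print(probs.shape, int(probs.argmax()))
```

Treating answering as classification over a fixed set of K frequent answers keeps the output layer small, at the cost of never producing an answer outside that set.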

SLIDE 31

Ablation #1: Language-alone

Image → Convolution Layer + Non-Linearity → Pooling Layer → Convolution Layer + Non-Linearity → Pooling Layer → Fully-Connected MLP (1k output units) → Embedding

Question (“How many horses are in this image?”) → Question Embedding (1024-dim)

Embeddings → Neural Network → Softmax over top K answers

SLIDE 32

Ablation #2: Vision-alone

Image → Convolution Layer + Non-Linearity → Pooling Layer → Convolution Layer + Non-Linearity → Pooling Layer → Fully-Connected MLP (4096-dim) → Embedding

Question (“How many horses are in this image?”) → Question Embedding

Embeddings → Neural Network → Softmax over top K answers

SLIDE 33

Accuracy Metric
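
The slide presents the metric only as a figure. As commonly stated for VQA, an answer counts as fully correct when at least 3 of the 10 human annotators gave it, that is, accuracy = min(number of humans who gave that answer / 3, 1). The sketch below computes that quantity. The function name, the minimal lowercase/strip normalization, and the sample answers are assumptions for illustration; the official evaluation script differs in details such as answer normalization.

```python
def vqa_accuracy(predicted, human_answers):
    """Consensus accuracy: min(number of matching human answers / 3, 1)."""
    target = predicted.strip().lower()
    matches = sum(1 for a in human_answers if a.strip().lower() == target)
    return min(matches / 3.0, 1.0)


# Ten made-up human answers to a "How many ...?" question.
human_answers = ["2", "2", "two", "2", "3", "2", "2", "2", "2", "2"]

print(vqa_accuracy("2", human_answers))  # 8 matches, capped at 1.0
print(vqa_accuracy("3", human_answers))  # 1 match, 1/3 credit
print(vqa_accuracy("4", human_answers))  # 0 matches, 0.0
```

The cap at 1 gives full credit for any answer with reasonable human agreement, and partial credit for answers that only one or two annotators gave.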

SLIDE 34

Open-Ended Task Accuracies

[Chart: Human vs. Machine performance. Bars: Human, Machine. Annotation: 25.14, room for improvement.]

SLIDE 35

Results

Code available!

  • Multiple-Choice > Open-Ended
  • Question alone does quite well
  • Image helps
SLIDE 36

Commonsense

  • Does this person have 20/20 vision?

SLIDE 37

Does this question need commonsense?

Q: How many calories are in this pizza?

SLIDE 38

How old does a person need to be?

Q: How many calories are in this pizza?

SLIDE 39

Most “commonsense” questions

SLIDE 40

Least “commonsense” questions

SLIDE 41

Spectrum

Age (years): 3-4 (15.3%), 5-8 (39.7%), 9-12 (28.4%), 13-17 (11.2%), 18+ (5.5%)

  • Is that a bird in the sky?
  • How many pizzas are shown?
  • Where was this picture taken?
  • Is he likely to get mugged if he walked down a dark alleyway like this?
  • What type of architecture is this?
  • What color is the shoe?
  • What are the sheep eating?
  • What ceremony does the cake commemorate?
  • Is this a vegetarian meal?
  • Is this a Flemish bricklaying pattern?
  • How many zebras are there?
  • What color is his hair?
  • Are these boats too tall to fit under the bridge?
  • What type of beverage is in the glass?
  • How many calories are in this pizza?
  • Is there food on the table?
  • What sport is being played?
  • What is the name of the white shape under the batter?
  • Can you name the performer in the purple costume?
  • What government document is needed to partake in this activity?
  • Is this man wearing shoes?
  • Name one ingredient in the skillet.
  • Is this at the stadium?
  • Besides these humans, what other animals eat here?
  • What is the make and model of this vehicle?

SLIDE 42

Question       Average Age
what brand     12.5
why            11.18
what type      11.04
what kind      10.55
is this        10.13
what does      10.06
what time      9.81
who            9.58
where          9.54
which          9.32
does           9.29
do             9.23
what is        9.11
what are       9.04
are            8.65
is the         8.52
is there       8.24
what sport     8.06
how many       7.67
what animal    6.74
what color     6.6

SLIDE 43

VQA Age

  • Average “age of questions” = 8.98 years.
  • Our model =* 4.74 years old!

* age as estimated by untrained crowd-sourced workers

SLIDE 44

VQA Common sense

  • Average common sense required = 31%.
  • Our best algorithm has* 17% common sense!

* as estimated by untrained crowd-sourced workers

SLIDE 45

VQA Challenges on

www.codalab.org

SLIDE 46

VQA Challenge @ CVPR16

SLIDE 47

VQA Challenge @ CVPR16

code available!

SLIDE 48

VQA Workshop @ CVPR16

SLIDE 49

Papers using VQA

… and many more

SLIDE 50

  • Dataset: >1k downloads
  • Code: >1.5k views
  • Academia, industry, start ups

SLIDE 51

Conclusions

  • VQA: Visual Question Answering

– The next “grand challenge” in vision, language, AI

  • Spectrum: Easy to Difficult

    – “What room is this?” → Scene Recognition
    – “How many …” → Object Recognition
    – …
    – “Does this person have 20/20 vision” → Common sense

  • Exciting times ahead!

SLIDE 52

VQA Team

Aishwarya Agrawal, Virginia Tech
Meg Mitchell, Microsoft Research
Dhruv Batra, Virginia Tech
Larry Zitnick, Facebook AI Research
Jiasen Lu, Virginia Tech
Devi Parikh, Virginia Tech
Stanislaw Antol, Virginia Tech
Akrit Mohapatra, Virginia Tech (Webmaster)

SLIDE 53

Closing Remarks

  • CloudCV VQA Exhibition: Booth 101
  • Contact email: aish@vt.edu
  • Please complete the Presenter Evaluation sent to you by email or through the GTC Mobile App. Your feedback is important!

SLIDE 54

Thanks! Questions?

SLIDE 55

Visual Question Answering (VQA)
