Inferring the Why in Images
[Pirsiavash et al]
CSC2523 Winter 2015: Paper Presentation Micha Livne
Goals
(a) Sitting because he wants to watch television
(b) Sitting because she intends to see the doctor
Related Work
Visual Persuasion: Inferring Communicative Intents of Images [Joo et al 2014]
[Figure (d): Example images and predicted intents. Panels: Most/Least FAVORABLE, Most/Least COMFORTING, Most/Least COMPETENT, Most/Least DOMINANT. Predicted intents shown per image: Favorable, Happy, Energetic, Dominant, Trustworthy, Comforting, Competent, Fearful, Angry.]
[Krizhevsky et al 2012]
\[
\arg\max_{y \in \{1,\dots,M\}} w_y^T \phi(x)
\]
Visual classifier
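The visual classifier above picks the label whose linear score w_y^T φ(x) is largest. A minimal sketch in plain Python, assuming φ(x) is already available as a list of floats (the slide places [Krizhevsky et al 2012] next to the formula, so φ(x) is presumably a CNN feature vector; that reading is an assumption). The name `predict_label` and all numbers are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length lists of floats."""
    return sum(ai * bi for ai, bi in zip(a, b))

def predict_label(weights, features):
    """Return argmax_y w_y^T phi(x), with labels indexed from 0.

    weights[y] is the weight vector w_y for label y (illustrative values only).
    """
    scores = [dot(w_y, features) for w_y in weights]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy example: M = 3 labels, 2-dimensional features.
toy_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
toy_features = [0.2, 0.9]
best = predict_label(toy_weights, toy_features)
```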
Relationship                   Query to Language Model
action + object + motivation   action the object in order to motivation
                               action the object to motivation
                               action the object because pronoun wants to motivation
action + object + scene        action the object in a scene
                               in a scene, action the object
action + scene + motivation    action in a scene in order to motivation
                               action in order to motivation in a scene
                               action because pronoun wants to motivation in a scene
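The query templates above can be filled in mechanically for each candidate combination of concepts. A minimal sketch, assuming the templates are used as plain strings with named slots; `TEMPLATES` and `build_queries` are hypothetical names, and how the resulting queries are scored by the language model is not shown on the slide and is left out here.

```python
# Query templates copied from the slide, keyed by the relationship they probe.
TEMPLATES = {
    ("action", "object", "motivation"): [
        "{action} the {object} in order to {motivation}",
        "{action} the {object} to {motivation}",
        "{action} the {object} because {pronoun} wants to {motivation}",
    ],
    ("action", "object", "scene"): [
        "{action} the {object} in a {scene}",
        "in a {scene}, {action} the {object}",
    ],
    ("action", "scene", "motivation"): [
        "{action} in a {scene} in order to {motivation}",
        "{action} in order to {motivation} in a {scene}",
        "{action} because {pronoun} wants to {motivation} in a {scene}",
    ],
}

def build_queries(relationship, **slots):
    """Fill every template of one relationship with the given slot values."""
    return [template.format(**slots) for template in TEMPLATES[relationship]]

# Illustrative slot values, not taken from the dataset.
queries = build_queries(
    ("action", "object", "motivation"),
    action="hold", object="cup", motivation="drink", pronoun="he",
)
```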
Language Potentials
[Figure: Statistics of Motivations. Bar chart of Count (axis ticks 20 to 120) for each motivation: take, eat, look, ride, pose, drink, play, drive, read, walk, pet, talk, wait, listen, sail, win, go, perform, race, sing, sleep, have, rest, catch, relax, show, cross, dance, give, hold, jump, kiss, prepare, serve, travel, cut, enjoy, fix, fly, pour, protest, write, admire, blow, board, build, celebrate, clean, climb, compete, cook, count, crawl, enter, float, hang, help, hit, inspect, laugh, lead, marry, paddle, practice, remove, rock, row, sell, skate, smash, smell, smile, throw, toast, transport, visit, wave, work.]
\[
\Omega(y; w, u, x, L) = \sum_{i}^{N} w_{y_i}^T \phi_i(x) + \sum_{i}^{N} u_i L_i(y_i) + \sum_{i<j}^{N} u_{ij} L_{ij}(y_i, y_j) + \sum_{i<j<k}^{N} u_{ijk} L_{ijk}(y_i, y_j, y_k)
\]
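The scoring function Ω adds per-concept visual scores to unary, pairwise and ternary language potentials. A minimal sketch of how such a score could be evaluated, assuming the visual terms w_{y_i}^T φ_i(x) are precomputed in a table and the potentials are given as Python callables. The names `omega` and `infer`, the toy numbers, and the brute-force search over all labellings are assumptions for illustration, not the presenter's implementation.

```python
from itertools import combinations, product

def omega(y, visual_scores, u, L):
    """Score a joint labelling y (one label index per concept).

    visual_scores[i][label] stands in for w_{y_i}^T phi_i(x).
    u and L map index tuples (i,), (i, j), (i, j, k) with i < j < k
    to a scalar weight and a potential function respectively.
    """
    n = len(y)
    total = sum(visual_scores[i][y[i]] for i in range(n))
    for order in (1, 2, 3):
        # combinations() yields index tuples in increasing order: i < j < k.
        for idx in combinations(range(n), order):
            if idx in L:
                labels = tuple(y[i] for i in idx)
                total += u[idx] * L[idx](*labels)
    return total

def infer(label_counts, visual_scores, u, L):
    """Exhaustively search all labellings for the highest score (toy-sized only)."""
    candidates = product(*(range(m) for m in label_counts))
    return max(candidates, key=lambda y: omega(y, visual_scores, u, L))

# Toy problem: 2 concepts with 2 labels each; all values are made up.
toy_visual = [[0.5, 0.4], [0.1, 0.2]]
toy_L = {
    (0,): lambda a: [0.0, 0.3][a],                 # unary language prior
    (0, 1): lambda a, b: 1.0 if a == b else 0.0,   # pairwise compatibility
}
toy_u = {(0,): 1.0, (0, 1): 0.5}
best_y = infer([2, 2], toy_visual, toy_u, toy_L)
```

In this toy setup the visual scores alone would prefer the labelling (0, 1), while the language potentials shift the best joint labelling to (1, 1).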
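Learning
\[
\arg\min_{\theta,\, \xi_n \ge 0} \; \frac{1}{2} \|\theta\|^2 + C \sum_n \xi_n
\quad \text{s.t.} \quad \theta^T \psi(y_n, x_n) - \theta^T \psi(h, x_n) \ge \Delta(y_n, h) - \xi_n \quad \forall n, \forall h
\]
Inference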
\[
\arg\max_{y} \Omega(y; w, u, x, L)
\]
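The learning problem above is a margin-based objective: for each training example n, the slack ξ_n must cover the largest violation of the margin constraint over all alternative labellings h. A minimal sketch that evaluates this objective for a fixed θ, assuming a small candidate set that can be enumerated and a Hamming distance for Δ (the slide does not define Δ, so that choice is an assumption). The names `slack`, `objective` and `toy_psi` are hypothetical, and no optimizer is included.

```python
def dot(a, b):
    """Inner product of two equal-length lists of floats."""
    return sum(ai * bi for ai, bi in zip(a, b))

def hamming(y_true, h):
    """Assumed loss Delta(y_n, h): number of positions where the labels differ."""
    return sum(a != b for a, b in zip(y_true, h))

def slack(theta, psi, x, y_true, candidates, delta=hamming):
    """Smallest xi_n >= 0 satisfying the margin constraint for every h."""
    true_score = dot(theta, psi(y_true, x))
    worst = max(
        delta(y_true, h) - true_score + dot(theta, psi(h, x))
        for h in candidates
    )
    return max(0.0, worst)

def objective(theta, psi, data, candidates, C):
    """0.5 * ||theta||^2 + C * sum_n xi_n, with each xi_n at its minimum."""
    regularizer = 0.5 * dot(theta, theta)
    return regularizer + C * sum(
        slack(theta, psi, x, y, candidates) for x, y in data
    )

# Toy joint feature map: one concept, two labels, scalar input (illustrative).
def toy_psi(y, x):
    return [x[0] if y[0] == 0 else 0.0, x[0] if y[0] == 1 else 0.0]

toy_data = [([1.0], (0,))]
toy_candidates = [(0,), (1,)]
separated = objective([2.0, 0.0], toy_psi, toy_data, toy_candidates, C=10.0)
unseparated = objective([0.0, 0.0], toy_psi, toy_data, toy_candidates, C=10.0)
```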
Success
Human Label: sitting on bench in a train station because he is waiting
Top Predictions: 1. sitting on bench in a park because he is waiting
Human Label: sitting on chair in a dining room because she wants to eat
Top Predictions: 1. sitting near table in dining room because she wants to eat
Failure
Human Label: holding a person in a living room because she wants to show
Top Predictions: 1. sitting on sofa in living room because she wants to pet
Human Label: standing next to table because she wants to prepare
Top Predictions: 1. talking to person in dining because she wants to eat
Failure: Vision Only
Human Label: sitting on a bus in a parking lot because he wants to drive
Top Predictions: 1. because he wants to look
Human Label: sitting on chair in living room because she wants to read
Top Predictions: 1. because she wants to eat
                              Baseline        Our Method
                              (Vision Only)   (With Language)
Given Ideal Detectors for:
  Action+Object+Scene         13              10
  Action+Object               12              11
  Object+Scene                15              12
  Action+Scene                19              13
  Object                      19              13
  Action                      18              15
  Scene1                      37              18
Fully Automatic               232             15
Chance has rank of 39
[Figure: Accuracy (0.1 to 1) versus Number of Top Retrievals (10 to 80). Curves: Our Model (automatic), Our Model (given ideal detectors), Baseline (automatic), Baseline (given ideal detectors), Chance.]
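The results above are stated as ranks and as accuracy within the top retrievals. A minimal sketch of how such numbers could be computed, assuming the table reports the median rank of the human-labelled motivation among all scored candidates and the plot reports the fraction of images whose label falls within the top k (both readings are assumptions, as are the names `rank_of_truth`, `median_rank` and `top_k_accuracy` and the toy scores).

```python
from statistics import median

def rank_of_truth(scores, truth):
    """1-based rank of `truth` when candidates are sorted by descending score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered.index(truth) + 1

def median_rank(examples):
    """Median over examples of the rank of the ground-truth label."""
    return median(rank_of_truth(scores, truth) for scores, truth in examples)

def top_k_accuracy(examples, k):
    """Fraction of examples whose ground-truth label is ranked within the top k."""
    hits = sum(rank_of_truth(scores, truth) <= k for scores, truth in examples)
    return hits / len(examples)

# Toy predictions over three candidate motivations (made-up scores).
toy_examples = [
    ({"eat": 0.9, "read": 0.5, "wait": 0.1}, "eat"),   # truth ranked 1st
    ({"eat": 0.2, "read": 0.7, "wait": 0.4}, "wait"),  # truth ranked 2nd
    ({"eat": 0.6, "read": 0.1, "wait": 0.3}, "read"),  # truth ranked 3rd
]
toy_median = median_rank(toy_examples)
```

A lower median rank is better, which matches the direction of the comparison in the table: the chance rank is the largest value quoted.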
was proven to be effective
behind people’s actions to the computer vision community.
web to improve computer vision systems.
language model