Learning semantic relationships for better action retrieval in images
Inkyu An
2
Content
- 1. Background
- 2. Motivation
- 3. Related Work
- 4. Approach
- 5. Result
3
Background | Semantic?
What comes to mind when you see the picture below?
There are many parked vehicles on either side of the road.
4
Background | Semantic labeling
http://rodrigob.github.io/are_we_there_yet/build/semantic_labeling_datasets_results.html#4d5352432d3231
5
Background | Semantic labeling
More complex - A wide variety of classes
6
Background | Semantic labeling
More complex - A wide variety of classes
Example classes: Poodle, Yorkshire Terrier, Collie, Samoyed, Great Dane, Labrador Retriever, Pomeranian, Retriever, Vizsla, Bull Terrier
7
Background | More and more complex
She is stretching her right leg while listening to music.
8
Motivation | Action retrieval in images
Query image: Person interacting with panda
Image search: ???
9
Motivation | Action retrieval in images
Query image: Person interacting with panda
Image search: result of prior work
False positive
10
Motivation | Action retrieval in images
Query image: Person interacting with panda
Result images:
- Person feeding panda (implied-by)
- Person holding animals (type-of)
- Person feeding calf (mutually exclusive)
11
Motivation | Action retrieval in images
Three kinds of relations
- 1. Implied-by
- 2. Type-of
- 3. Mutually exclusive
HEX-graph
Large-scale object classification using label relation graphs [ECCV 2014]
12
Motivation | Action retrieval in images
"Person interacting with panda" is represented by a weight vector $w_A$
Skip-grams
Distributed Representations of Words and Phrases and their Compositionality [NIPS 2013]
13
Motivation | Action retrieval in images
They needed a score for the relationship between a pair of sentences (actions).
Neural Tensor Network
Reasoning With Neural Tensor Networks for Knowledge Base Completion [NIPS 2013]
14
Related Work |
- 1. HEX-graph
- Three kinds of relations
- 2. Skip-grams
- Weight vectors of actions (sentences)
- 3. Neural Tensor Network
- Relationship scores for pairs of actions
15
Related Work | HEX-graph _ Motivation
Bengal cat, Russian Blue, Siberian Husky, Poodle, Bulldog
Classifier
Dog, Cat
16
Related Work | HEX-graph _ Motivation
Classifier
Labels: Siberian Husky, Puppy, Dog, Cat
HEX-graph relations: subsumption, exclusion
17
Related Work | HEX-graph _ Problem Definition
<HEX-graph>
- Nodes $V$: Dog, Cat, Puppy, Husky
- Hierarchy edges $E_h$: subsumption
- Exclusion edges $E_e$: exclusion
Relations:
- Dog, Puppy: subsumption
- Dog, Cat: exclusion
- Husky, Puppy: overlap
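The node and edge sets above can be sketched as a small data structure. This is a minimal illustration, not the HEX-graph paper's implementation: the Dog-Husky subsumption edge is an assumption inferred from the slide's figure labels, and `is_legal` is a hypothetical helper that only checks direct edges.

```python
# Toy HEX graph for the slide's example (edge sets partly assumed, see lead-in).
NODES = ["Dog", "Cat", "Puppy", "Husky"]

# Hierarchy (subsumption) edges: (parent, child) means parent subsumes child.
HIERARCHY_EDGES = [("Dog", "Puppy"), ("Dog", "Husky")]  # Dog-Husky is assumed

# Exclusion edges: the two labels cannot both be active.
EXCLUSION_EDGES = [("Dog", "Cat")]


def is_legal(active_labels):
    """Return True if the set of active labels violates no direct edge."""
    active = set(active_labels)
    # A child label may be active only if its parent is active too.
    for parent, child in HIERARCHY_EDGES:
        if child in active and parent not in active:
            return False
    # Two mutually exclusive labels may not be active together.
    for a, b in EXCLUSION_EDGES:
        if a in active and b in active:
            return False
    return True
```

Husky and Puppy share no edge, so both may be active at once, which corresponds to the "overlap" relation.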
21
Related Work | skip-grams
- The training objective is to learn word vector representations that are good at predicting the nearby words.
- Training: for each word of the input sentence, maximize the average log probability of its nearby words.
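To make "nearby words" concrete, here is a minimal sketch of how (center, context) training pairs could be enumerated for a skip-gram model. The function name `skipgram_pairs` and the `window` parameter are illustrative assumptions, not part of any library.

```python
def skipgram_pairs(tokens, window=2):
    """List (center, context) pairs for every token and its neighbours
    within `window` positions on either side."""
    pairs = []
    for t, center in enumerate(tokens):
        lo = max(0, t - window)
        hi = min(len(tokens), t + window + 1)
        for j in range(lo, hi):
            if j != t:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


pairs = skipgram_pairs(["person", "riding", "bike"], window=1)
```

A skip-gram model would then be trained to assign high probability to the context word of each pair given its center word.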
23
Related Work | Neural Tensor Networks (NTN)
- The model returns a high score if two entities are in a given relationship and a low one otherwise
24
Approach | Problem setup
Action: Person riding bike
A set of actions $\mathcal{A}$:
- Person riding bike
- Person riding horse
- Person preparing food
- Chef cooking pasta
Two SVO structures:
- 1. <subject, verb, object>
- 2. <subject, verb, prepositional object>, e.g. Person walking with a horse
Related images
25
Approach | Problem setup _ three kinds of relations
- 1. Implied-by: Person preparing food, Chef cooking pasta
- 2. Type-of: Person doing football, Man playing soccer
- 3. Mutually exclusive: Person riding horse, Man riding camel
26
Approach | Full model
Full model: $C = C_{ar} + \alpha_r C_{rec} + \alpha_n C_{nlp} + \alpha_c C_{cons} + \lambda \lVert W \rVert_2^2$
- $C_{ar}$: basic action retrieval model [Image + Action]
- $C_{nlp}$: language prior [only Action]
- $C_{rec}$: visual objective [Image + Action]
- $C_{cons}$: consistency objective [only Action]
- $\lambda \lVert W \rVert_2^2$: the weights in the model
27
Approach | Full model
The weights in the model: $W = \{W_{im}, \{w_A\}_{A \in \mathcal{A}}, W_{rec}\}$
$\lambda \lVert W \rVert_2^2$: L2 regularization with a regularization coefficient $\lambda$
29
Approach | Basic action retrieval model
- Image $I$ → CNN → image feature $f_I = W_{im}\,\mathrm{CNN}(I) + b_{im}$
- Action $A$ (e.g. Person riding bike) → Skip-grams → weight vector $w_A$
- Action prediction loss:
$$C_{ar} = \sum_{A} \sum_{I^{+} \in \mathcal{I}_A} \sum_{I^{-} \in \bar{\mathcal{I}}_A} \max\left(0,\ 1 + w_A^{T}\left(f_{I^{-}} - f_{I^{+}}\right)\right)$$
- $\mathcal{I}_A$: a set of positive images of $A$
- $\bar{\mathcal{I}}_A$: a set of negative images of $A$
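The action prediction loss is a pairwise hinge ranking loss: every positive image of an action should score at least a margin of 1 higher than every negative image. A minimal sketch for a single action follows, assuming image features and the action weight vector are plain Python lists; `dot` and `ranking_loss` are illustrative names, and the CNN and skip-gram stages are not modeled.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def ranking_loss(w_a, pos_feats, neg_feats):
    """Sum of max(0, 1 + w_a . f_neg - w_a . f_pos) over all
    (positive, negative) image feature pairs of one action."""
    loss = 0.0
    for f_pos in pos_feats:
        for f_neg in neg_feats:
            loss += max(0.0, 1.0 + dot(w_a, f_neg) - dot(w_a, f_pos))
    return loss


# One positive image and two negatives for a toy 2-D feature space.
example = ranking_loss([1.0, 0.0], [[2.0, 0.0]], [[0.0, 0.0], [1.5, 0.0]])
```

In this toy case only the second negative falls inside the margin, so it is the only pair that contributes to the loss.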
31
Approach | Relationship prediction
Goal: denote the relationship by a vector $r_{AB} = [r_{AB}^{i}, r_{AB}^{t}, r_{AB}^{m}] \in [0,1]^3$ (implied-by, type-of and mutually exclusive)
- Action $A$ (Person riding bike) and action $B$ (Person riding camel) → Skip-grams → $w_A, w_B$
- $w_A, w_B$ → Neural Tensor Network ($W_{rec}^{[1:3]}$) → Softmax → $r_{AB}$
$$r_{AB} = \mathrm{softmax}_{\beta}\left(w_A \otimes W_{rec}^{[1:3]} \otimes w_B + b_{rec}\right)$$
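The relationship prediction can be sketched as three bilinear scores, one per relation, passed through a softmax. This is a simplified illustration rather than the authors' code: it assumes each slice of the tensor is a plain matrix, treats $\beta$ as a scale factor inside the softmax, and uses the hypothetical names `bilinear`, `softmax` and `predict_relationship`.

```python
import math


def bilinear(u, M, v):
    """Compute u^T M v for vectors u, v and matrix M (list of rows)."""
    return sum(u[i] * M[i][j] * v[j]
               for i in range(len(u)) for j in range(len(v)))


def softmax(scores, beta=1.0):
    """Softmax with scale beta (assumed meaning of the subscript beta)."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(beta * (s - m)) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def predict_relationship(w_a, w_b, W_rec, b_rec, beta=1.0):
    """Return [r_i, r_t, r_m]: one bilinear score per tensor slice,
    normalized so the three entries sum to 1."""
    scores = [bilinear(w_a, W_rec[k], w_b) + b_rec[k] for k in range(3)]
    return softmax(scores, beta)


identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
r = predict_relationship([1.0, 0.0], [1.0, 0.0],
                         [identity, zeros, zeros], [0.0, 0.0, 0.0])
```

Because only the first slice responds to this toy input, the first entry (implied-by) receives the largest share of the probability mass.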
32
Approach | Language prior for relationship
- NLP prior
- 1. Implied-by: Person preparing food, Chef cooking pasta
- 2. Type-of: Man eating fish, Person feeding a fish (wrong)
- 3. Mutually exclusive: Person riding horse, Man riding camel
33
Approach | Language prior for relationship
The loss function of the language-based relationship:
$$C_{nlp} = \sum_{A} \sum_{B \neq A} \lVert \tilde{r}_{AB} - r_{AB} \rVert$$
- $\tilde{r}_{AB}$: NLP prior
- $r_{AB}$: relationship prediction
- NLP priors are not always accurate
- They treated NLP priors as a noisy prior
35
Approach | Action retrieval with relationship
- Visual objective
A is implied-by B: rank the positive images of B higher than the negatives of A.
$$C_{AB}^{i} = \sum_{I^{+} \in \mathcal{I}_B} \sum_{I^{-} \in \bar{\mathcal{I}}_A} \max\left(0,\ 1 + w_A^{T}\left(f_{I^{-}} - f_{I^{+}}\right)\right)$$
- $\mathcal{I}_B$: a set of positive images of B
- $\bar{\mathcal{I}}_A$: a set of negative images of A
A is type-of B: rank the positive images of A higher than the negatives of B.
$$C_{AB}^{t} = \sum_{I^{+} \in \mathcal{I}_A} \sum_{I^{-} \in \bar{\mathcal{I}}_B} \max\left(0,\ 1 + w_B^{T}\left(f_{I^{-}} - f_{I^{+}}\right)\right)$$
- $\mathcal{I}_A$: a set of positive images of A
- $\bar{\mathcal{I}}_B$: a set of negative images of B
A is mutually exclusive of B: rank the positive images of A higher than the positives of B.
$$C_{AB}^{m} = \sum_{I^{+} \in \mathcal{I}_A} \sum_{I' \in \mathcal{I}_B} \max\left(0,\ 1 + w_A^{T}\left(f_{I'} - f_{I^{+}}\right)\right)$$
- $\mathcal{I}_A$: a set of positive images of A
- $\mathcal{I}_B$: a set of positive images of B
36
Approach | Action retrieval with relationship
- Visual objective
Objective:
$$C_{rec} = \sum_{A \in \mathcal{A}} \sum_{B \neq A} \left( r_{AB}^{i} \cdot C_{AB}^{i} + r_{AB}^{t} \cdot C_{AB}^{t} + r_{AB}^{m} \cdot C_{AB}^{m} \right)$$
Relationship prediction: $r_{AB} = \{r_{AB}^{i}, r_{AB}^{t}, r_{AB}^{m}\}$
→ Sums the costs ($C_{AB}^{i}, C_{AB}^{t}, C_{AB}^{m}$) of each relation; a cost counts in full when its relationship prediction ($r_{AB}^{i}$, $r_{AB}^{t}$ or $r_{AB}^{m}$) is "1".
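The visual objective for one pair of actions can be sketched by computing the three ranking costs (implied-by, type-of, mutually exclusive) and weighting each by the predicted relationship. This is a minimal sketch under assumptions: features are plain lists, and `hinge_rank`, `relation_costs` and `visual_objective_pair` are hypothetical names.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def hinge_rank(w, higher_feats, lower_feats):
    """Penalize every pair where a 'lower' image is not scored at least
    a margin of 1 below a 'higher' image under weight vector w."""
    return sum(max(0.0, 1.0 + dot(w, lo) - dot(w, hi))
               for hi in higher_feats for lo in lower_feats)


def relation_costs(w_a, w_b, pos_a, neg_a, pos_b, neg_b):
    c_i = hinge_rank(w_a, pos_b, neg_a)  # implied-by: positives of B above negatives of A
    c_t = hinge_rank(w_b, pos_a, neg_b)  # type-of: positives of A above negatives of B
    c_m = hinge_rank(w_a, pos_a, pos_b)  # exclusive: positives of A above positives of B
    return [c_i, c_t, c_m]


def visual_objective_pair(r_ab, costs):
    """Weight each relation cost by its predicted relationship score."""
    return sum(r * c for r, c in zip(r_ab, costs))


costs = relation_costs([1.0, 0.0], [0.0, 1.0],
                       pos_a=[[2.0, 0.0]], neg_a=[[0.0, 0.0]],
                       pos_b=[[0.0, 2.0]], neg_b=[[0.0, 0.0]])
```

With a one-hot prediction, only the cost of the predicted relation contributes, matching the note that a cost counts when its relationship prediction is 1.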
38
Approach | Action retrieval with relationship
- Consistency
$$C_{cons} = \sum_{A} \sum_{B \neq A} \sum_{C \neq B} \sum_{a \in \{i,t,m\}^{3}} r_{AB}^{a_1} \cdot r_{BC}^{a_2} \cdot r_{CA}^{a_3}$$
Constrains the relationship assignment between actions.
Example of inconsistent relationships:
- A is implied-by B
- B is implied-by C
- A is mutually exclusive of C
They wanted to avoid this kind of problem; the consistent conclusion is that A is implied-by C.
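A minimal sketch of the consistency penalty for one triple of actions follows. It multiplies the predicted scores of a relationship assignment that cannot hold together, so the penalty is large only when all three are predicted confidently. The set `INCONSISTENT` is an assumption: it holds only the single example from the slide, whereas a full list of invalid assignments would have to come from the original paper.

```python
# Each entry is (relation of A,B; relation of B,C; relation of C,A).
# Assumption: only the slide's example is listed here.
INCONSISTENT = {("i", "i", "m")}  # A implied-by B, B implied-by C, A excl. C


def consistency_penalty(r_ab, r_bc, r_ca, inconsistent=INCONSISTENT):
    """Sum, over the listed inconsistent assignments, of the product of
    the three predicted relationship scores (dicts keyed by 'i','t','m')."""
    return sum(r_ab[a1] * r_bc[a2] * r_ca[a3]
               for a1, a2, a3 in inconsistent)


implied = {"i": 1.0, "t": 0.0, "m": 0.0}
exclusive = {"i": 0.0, "t": 0.0, "m": 1.0}
type_of = {"i": 0.0, "t": 1.0, "m": 0.0}

bad = consistency_penalty(implied, implied, exclusive)
ok = consistency_penalty(implied, implied, type_of)
```

Minimizing this term pushes at least one of the three conflicting predictions toward zero.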
39
Approach | Full model
- The full objective is minimized through downpour stochastic gradient descent.
- Hyper-parameters of the model: $\beta, \lambda, \alpha_r, \alpha_n, \alpha_c$
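How the hyper-parameters enter the full objective can be sketched as a weighted sum of the four cost terms plus an L2 penalty on the weights. The function name `full_objective`, the flat `weights` list and the numeric values are illustrative assumptions; the optimizer (downpour SGD) is not modeled.

```python
def full_objective(c_ar, c_rec, c_nlp, c_cons, weights,
                   alpha_r, alpha_n, alpha_c, lam):
    """Retrieval cost plus weighted visual, language-prior and
    consistency costs, plus lam times the squared L2 norm of the weights."""
    l2 = sum(w * w for w in weights)
    return (c_ar + alpha_r * c_rec + alpha_n * c_nlp
            + alpha_c * c_cons + lam * l2)


total = full_objective(1.0, 2.0, 3.0, 4.0, weights=[1.0, 2.0],
                       alpha_r=0.5, alpha_n=0.5, alpha_c=0.25, lam=0.5)
```

Setting a coefficient such as `alpha_c` to zero removes the corresponding term, which is one way to compare variants of the model.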
40
Result |
41
- Thank you.