better action retrieval in images Inkyu An Content 1. Background - - PowerPoint PPT Presentation



SLIDE 1

Learning semantic relationships for better action retrieval in images

Inkyu An

SLIDE 2

Content

  • 1. Background
  • 2. Motivation
  • 3. Related Work
  • 4. Approach
  • 5. Result

SLIDE 3

Background | Semantic?

What comes to mind when you see the picture below? There are many parked vehicles on either side of the road.

SLIDE 4

Background | Semantic labeling

http://rodrigob.github.io/are_we_there_yet/build/semantic_labeling_datasets_results.html#4d5352432d3231

SLIDE 5

Background | Semantic labeling

More complex - A wide variety of classes

SLIDE 6

Background | Semantic labeling

Poodle Yorkshire Terrier Collie Samoyed Great Dane Labrador Retriever Pomeranian Retriever Vizsla Bull Terrier

More complex - A wide variety of classes

SLIDE 7

Background | More and more complex

She is stretching her right leg while listening to music

SLIDE 8

Motivation | Action retrieval in images

Person interacting with panda (query image) → Image search → ???

SLIDE 9

Motivation | Action retrieval in images

Person interacting with panda (query image) → Image search → Result of prior work

False positive

SLIDE 10

Motivation | Action retrieval in images

Person interacting with panda (query image)

Result images: Person feeding panda, Person holding animals, Person feeding calf

Relations: Implied-by, Type-of, Mutual-exclusive

SLIDE 11

Motivation | Action retrieval in images

Three kinds of relations

  • 1. Implied-by
  • 2. Type-of
  • 3. Mutual-exclusive

HEX-graph

Large-scale object classification using label relation graphs [ECCV 2014]

SLIDE 12

Motivation | Action retrieval in images

"Person interacting with panda" is represented by a weight vector $w_A$

Skip-grams

Distributed Representations of Words and Phrases and their Compositionality [NIPS 2013]

SLIDE 13

Motivation | Action retrieval in images

They needed a relationship score for each pair of sentences.

Neural Tensor Network

Reasoning With Neural Tensor Networks for Knowledge Base Completion [NIPS 2013]

SLIDE 14

Related Work |

  • 1. HEX-graph
    • Three kinds of relations
  • 2. Skip-grams
    • Weight vectors of actions (sentences)
  • 3. Neural Tensor Network
    • Relationship scores for pairs of actions
SLIDE 15

Related Work | HEX-graph _ Motivation

Bengal cat, Russian Blue, Siberian Husky, Poodle, Bulldog

Classifier

Dog, Cat

SLIDE 16

Related Work | HEX-graph _ Motivation

Classifier

Siberian Husky, Puppy, Dog, Cat (edges: Subsumption, Exclusion)

HEX-graph

SLIDE 17

Related Work | HEX-graph _ Problem Definition

<HEX-graph>: Dog, Puppy, Husky, Cat connected by subsumption and exclusion edges

Nodes $V$: Dog, Cat, Puppy, Husky

Hierarchy edges $E_h$: subsumption
Exclusion edges $E_e$: exclusion

Relations:

  • Dog, Puppy: subsumption
  • Dog, Cat: exclusion
  • Husky, Puppy: overlap
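The three pairwise relations can be read off the two edge sets. Below is a minimal Python sketch, assuming a simplified rule (not necessarily the exact definition in the ECCV 2014 paper): a pair is in subsumption if one node is an ancestor of the other through hierarchy edges, in exclusion if some ancestor-or-self of one node has an exclusion edge to some ancestor-or-self of the other, and in overlap otherwise. The toy graph and the names `hierarchy`, `exclusion`, `ancestors_or_self`, and `relation` are hypothetical.

```python
from itertools import product

# Hypothetical toy HEX-graph: directed hierarchy edges (parent, child)
# and undirected exclusion edges.
hierarchy = {("Dog", "Puppy"), ("Dog", "Husky")}
exclusion = {frozenset({"Dog", "Cat"})}


def ancestors_or_self(node):
    """Return the node plus every node reachable upward via hierarchy edges."""
    result = {node}
    frontier = [node]
    while frontier:
        current = frontier.pop()
        for parent, child in hierarchy:
            if child == current and parent not in result:
                result.add(parent)
                frontier.append(parent)
    return result


def relation(a, b):
    """Classify a label pair as subsumption, exclusion, or overlap (simplified rule)."""
    anc_a, anc_b = ancestors_or_self(a), ancestors_or_self(b)
    if a in anc_b or b in anc_a:
        return "subsumption"
    # Exclusion is inherited: an exclusion edge between ancestors excludes descendants too.
    if any(frozenset({x, y}) in exclusion for x, y in product(anc_a, anc_b)):
        return "exclusion"
    return "overlap"


for pair in [("Dog", "Puppy"), ("Dog", "Cat"), ("Husky", "Puppy"), ("Husky", "Cat")]:
    print(pair, relation(*pair))
```

On the slide's example this gives Dog/Puppy subsumption, Dog/Cat exclusion, and Husky/Puppy overlap; Husky/Cat comes out as exclusion because the exclusion edge is inherited from Dog.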

SLIDE 18

Related Work | skip-grams

Nearby words

  • The training objective is to learn word vector representations that are good at predicting the nearby words, by maximizing the average log probability

Input sentence → Training
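For reference, the average log probability that skip-grams maximize (Mikolov et al., NIPS 2013) over a training sentence $w_1, \dots, w_T$ with context size $c$ is $\frac{1}{T}\sum_{t=1}^{T}\sum_{-c \le j \le c,\ j \ne 0} \log p(w_{t+j} \mid w_t)$, where $p(w_O \mid w_I)$ is a softmax over the scores $v'^{\top}_{w} v_{w_I}$. The sketch below computes it for a made-up three-word vocabulary with a full softmax; the vectors, `c=1`, and the function names are assumptions for illustration, and word2vec itself trains with hierarchical softmax or negative sampling instead of the full softmax.

```python
import math

# Hypothetical toy vocabulary: each word has an "input" vector v and an "output" vector v'.
v_in = {"person": [0.5, 0.1], "riding": [0.2, 0.4], "bike": [0.3, -0.2]}
v_out = {"person": [0.1, 0.3], "riding": [0.4, 0.0], "bike": [-0.1, 0.5]}


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def log_prob(w_o, w_i):
    """log p(w_o | w_i) using a full softmax over the toy vocabulary."""
    scores = {w: dot(v_out[w], v_in[w_i]) for w in v_out}
    log_z = math.log(sum(math.exp(s) for s in scores.values()))
    return scores[w_o] - log_z


def avg_log_prob(sentence, c=1):
    """(1/T) * sum_t sum_{-c <= j <= c, j != 0} log p(w_{t+j} | w_t)."""
    T = len(sentence)
    total = 0.0
    for t, w_t in enumerate(sentence):
        for j in range(-c, c + 1):
            # Skip the center word itself and positions outside the sentence.
            if j != 0 and 0 <= t + j < T:
                total += log_prob(sentence[t + j], w_t)
    return total / T


print(avg_log_prob(["person", "riding", "bike"]))
```

Training would adjust `v_in` and `v_out` to push this value up; the learned input vectors are the word representations.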

SLIDE 19

Related Work | Neural Tensor Networks (NTN)

  • The model returns a high score if two entities are in that relationship and a low score otherwise
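In the NTN paper (Socher et al., NIPS 2013) a triple is scored as $g(e_1, R, e_2) = u_R^\top \tanh\left(e_1^\top W_R^{[1:k]} e_2 + V_R [e_1; e_2] + b_R\right)$, where each of the $k$ tensor slices contributes one bilinear term. A plain-Python sketch of that formula; the toy parameters and the name `ntn_score` are made up for illustration.

```python
import math


def ntn_score(e1, e2, W, V, b, u):
    """NTN score g(e1, R, e2) = u^T tanh(e1^T W^[1:k] e2 + V [e1; e2] + b).

    e1, e2: entity vectors (length d); W: k slices, each d x d;
    V: k x 2d matrix; b: length-k bias; u: length-k output weights.
    """
    concat = list(e1) + list(e2)  # [e1; e2]
    d = len(e1)
    hidden = []
    for k in range(len(W)):
        bilinear = sum(e1[i] * W[k][i][j] * e2[j] for i in range(d) for j in range(d))
        linear = sum(v * x for v, x in zip(V[k], concat))
        hidden.append(math.tanh(bilinear + linear + b[k]))
    return sum(u_k * h_k for u_k, h_k in zip(u, hidden))


# Toy relation with one slice (k = 1): high score only when both first coordinates are large.
W = [[[1.0, 0.0], [0.0, 0.0]]]
V = [[0.0, 0.0, 0.0, 0.0]]
b = [0.0]
u = [1.0]
print(ntn_score([1.0, 0.0], [1.0, 0.0], W, V, b, u))  # tanh(1) ≈ 0.7616
print(ntn_score([1.0, 0.0], [0.0, 1.0], W, V, b, u))  # 0.0
```

A pair that fits the relation gets a high score and an unrelated pair a low one, which is the behavior described above.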

SLIDE 20

Approach | Problem setup

Action: Person riding bike

A set of actions $\mathcal{A}$:

  • Person riding bike
  • Person riding horse
  • Person preparing food
  • Chef cooking pasta

Two SVO structures:

  • 1. <subject, verb, object>
  • 2. <subject, verb, prepositional object>, e.g., Person walking with a horse

Related images

SLIDE 21

Approach | Problem setup _ three kinds of relations

  • 1. Implied-by: Person preparing food, Chef cooking pasta
  • 2. Type-of: Person doing football, Man playing soccer
  • 3. Mutually exclusive: Person riding horse, Man riding camel

SLIDE 22

Approach | Full model

$$C = C_{ac} + \alpha_r C_{rec} + \alpha_n C_{nlp} + \alpha_c C_{cons} + \lambda \lVert W \rVert_2^2$$

Full model $C$:

  • $C_{ac}$: basic action retrieval model [Image + Action]
  • $C_{rec}$: visual objective [Image + Action]
  • $C_{nlp}$: language prior [only Action]
  • $C_{cons}$: consistency objective [only Action]
  • $W$: the weights in the model
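Once the four costs are computed, the full objective is their weighted sum plus the $\ell_2$ penalty on the weights. A minimal sketch with made-up numbers; `full_objective` is a hypothetical helper, and it assumes all weights have been flattened into one list.

```python
def full_objective(c_ac, c_rec, c_nlp, c_cons, weights, alpha_r, alpha_n, alpha_c, lam):
    """C = C_ac + alpha_r*C_rec + alpha_n*C_nlp + alpha_c*C_cons + lam*||W||_2^2.

    weights: all model weights (W_im, every w_A, W_rel) flattened into one list of floats.
    """
    sq_norm = sum(w * w for w in weights)  # squared l2 norm
    return c_ac + alpha_r * c_rec + alpha_n * c_nlp + alpha_c * c_cons + lam * sq_norm


# Made-up term values and hyper-parameters, only to show how the terms combine.
print(full_objective(1.0, 2.0, 0.5, 0.25, [1.0, 2.0], alpha_r=0.5, alpha_n=1.0, alpha_c=2.0, lam=0.1))  # 3.5
```

The $\alpha$ weights set how much the relationship terms matter relative to the basic retrieval loss $C_{ac}$.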

SLIDE 23

Approach | Full model

$$C = C_{ac} + \alpha_r C_{rec} + \alpha_n C_{nlp} + \alpha_c C_{cons} + \lambda \lVert W \rVert_2^2$$

Full model $C$:

  • $C_{ac}$: basic action retrieval model [Image + Action]
  • $C_{rec}$: visual objective [Image + Action]
  • $C_{nlp}$: language prior [only Action]
  • $C_{cons}$: consistency objective [only Action]
  • $W$: the weights in the model

$W = \{W_{im}, \{w_A\}_{A \in \mathcal{A}}, W_{rel}\}$ are $\ell_2$-regularized with a regularization coefficient $\lambda$.

SLIDE 24

Approach | Full model

$$C = C_{ac} + \alpha_r C_{rec} + \alpha_n C_{nlp} + \alpha_c C_{cons} + \lambda \lVert W \rVert_2^2$$

Full model $C$:

  • $C_{ac}$: basic action retrieval model [Image + Action]
  • $C_{rec}$: visual objective [Image + Action]
  • $C_{nlp}$: language prior [only Action]
  • $C_{cons}$: consistency objective [only Action]
  • $W$: the weights in the model

SLIDE 25

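Approach | Basic action retrieval model

$$f_I = W_{im}\,\mathrm{CNN}(I) + b_{im}$$

$$C_{ac} = \sum_{A \in \mathcal{A}} \sum_{I^+ \in \mathcal{T}_A^+} \sum_{I^- \in \mathcal{T}_A^-} \max\left(0,\ 1 + w_A^\top \left(f_{I^-} - f_{I^+}\right)\right)$$

$\mathcal{T}_A^+$: a set of positive images of A
$\mathcal{T}_A^-$: a set of negative images of A

[Figure: a positive image $I_A^+$ and a negative image $I^-$ pass through a CNN ($W_{im}$, $b_{im}$) to give $f_A^+$ and $f^-$; the action "Person riding bike" (action A) passes through skip-grams to give $w_A$; the scores $w_A f_A^+$ and $w_A f^-$ enter the action prediction loss.]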
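The basic model is a pairwise ranking (hinge) loss: under $w_A$, each positive image of action A should score at least 1 higher than each negative image. A plain-Python sketch, assuming the CNN features are already extracted and $w_A$ is given; `image_embedding`, `action_retrieval_loss`, and all numbers are illustrative.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def image_embedding(cnn_feature, W_im, b_im):
    """f_I = W_im CNN(I) + b_im, with CNN(I) given as a precomputed feature vector."""
    return [dot(row, cnn_feature) + bias for row, bias in zip(W_im, b_im)]


def action_retrieval_loss(w, pos, neg):
    """C_ac = sum_A sum_{I+} sum_{I-} max(0, 1 + w_A^T (f_{I-} - f_{I+})).

    w: action -> w_A; pos / neg: action -> embeddings f_I of its positive / negative images.
    """
    loss = 0.0
    for action, w_a in w.items():
        for f_pos in pos[action]:
            for f_neg in neg[action]:
                margin_violation = 1.0 + dot(w_a, [n - p for n, p in zip(f_neg, f_pos)])
                loss += max(0.0, margin_violation)
    return loss


# Toy example: identity projection, one positive and two negative images for one action.
W_im, b_im = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
w = {"Person riding bike": [1.0, 0.0]}
pos = {"Person riding bike": [image_embedding([2.0, 0.0], W_im, b_im)]}
neg = {"Person riding bike": [image_embedding([0.0, 1.0], W_im, b_im),
                              image_embedding([1.5, 0.0], W_im, b_im)]}
print(action_retrieval_loss(w, pos, neg))  # 0.5
```

The first negative already sits more than the margin below the positive, so only the second negative contributes, giving a loss of 0.5.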

SLIDE 26

Approach | Full model

$$C = C_{ac} + \alpha_r C_{rec} + \alpha_n C_{nlp} + \alpha_c C_{cons} + \lambda \lVert W \rVert_2^2$$

Full model $C$:

  • $C_{ac}$: basic action retrieval model [Image + Action]
  • $C_{rec}$: visual objective [Image + Action]
  • $C_{nlp}$: language prior [only Action]
  • $C_{cons}$: consistency objective [only Action]
  • $W$: the weights in the model

SLIDE 27

Approach | Relationship prediction

Goal: denote the relationship by a vector

$$r_{AB} = \left[r_{AB}^i, r_{AB}^t, r_{AB}^m\right] \in [0, 1]^3$$

(implied-by, type-of, and mutually exclusive)

[Figure: "Person riding bike" (action A) and "Person riding camel" (action B) are embedded by skip-grams into $w_A, w_B$; a Neural Tensor Network with $W_{rel}^{[1:3]}$ followed by a softmax outputs $r_{AB}$.]

$$r_{AB} = \mathrm{softmax}_\beta\left(w_A \otimes W_{rel}^{[1:3]} \otimes w_B + b_{rel}\right)$$
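A sketch of the relationship predictor, assuming that $w_A \otimes W_{rel}^{[1:3]} \otimes w_B$ means one bilinear form $w_A^\top W_{rel}^{[k]} w_B$ per relation $k$, and that $\mathrm{softmax}_\beta$ multiplies the logits by $\beta$ before normalizing (the slide does not define $\beta$). Compared with the original NTN score, this formula has no $V_R[e_1; e_2]$ term and uses a softmax over the three relations instead of $u_R^\top \tanh(\cdot)$. `predict_relationship` and the toy values are hypothetical.

```python
import math


def bilinear(a, M, b):
    """a^T M b for a square matrix M given as nested lists."""
    return sum(a[i] * M[i][j] * b[j] for i in range(len(a)) for j in range(len(b)))


def predict_relationship(w_a, w_b, W_rel, b_rel, beta=1.0):
    """r_AB = softmax_beta(w_A (x) W_rel^[1:3] (x) w_B + b_rel).

    W_rel: three d x d slices (implied-by, type-of, mutually exclusive).
    Assumption: softmax_beta(x)_k = exp(beta * x_k) / sum_j exp(beta * x_j).
    """
    logits = [beta * (bilinear(w_a, W_rel[k], w_b) + b_rel[k]) for k in range(3)]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


zero = [[0.0, 0.0], [0.0, 0.0]]
W_rel = [[[2.0, 0.0], [0.0, 0.0]], zero, zero]  # only the implied-by slice is active
r_ab = predict_relationship([1.0, 0.0], [1.0, 0.0], W_rel, b_rel=[0.0, 0.0, 0.0])
print([round(x, 3) for x in r_ab])  # [0.787, 0.107, 0.107]
```

The output is a probability vector in $[0,1]^3$ that sums to 1, matching the goal stated on the slide.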

SLIDE 28

Approach | Language prior for relationship

  • 1. Implied-by: Person preparing food, Chef cooking pasta
  • 2. Type-of: Man eating fish, Person feeding a fish (Wrong)
  • 3. Mutually exclusive: Person riding horse, Man riding camel
  • NLP prior
SLIDE 29

Approach | Language prior for relationship

$$C_{nlp} = \sum_{A \in \mathcal{A}} \sum_{B \in \mathcal{R}_A} \left\lVert r_{AB} - \hat{r}_{AB} \right\rVert$$

  • $C_{nlp}$: the loss function of the language-based relationship
  • $\hat{r}_{AB}$: NLP prior
  • $r_{AB}$: relationship prediction
  • NLP priors are not always accurate
  • They treated NLP priors as a noisy prior
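Because the NLP prior only enters as a penalty term (weighted by $\alpha_n$ in the full model), a wrong prior pulls the prediction toward it without forcing it, which is one way to treat it as noisy. A minimal sketch, assuming a Euclidean norm (the slide does not show which norm is used); `nlp_prior_loss` and the numbers are hypothetical.

```python
import math


def nlp_prior_loss(pred, prior):
    """C_nlp = sum over related pairs (A, B) of ||r_AB - r_hat_AB||.

    pred, prior: dict (A, B) -> [r^i, r^t, r^m]; assumes a Euclidean norm.
    """
    return sum(math.dist(pred[pair], prior[pair]) for pair in pred)


pair = ("Person preparing food", "Chef cooking pasta")
pred = {pair: [0.7, 0.2, 0.1]}
prior = {pair: [1.0, 0.0, 0.0]}  # hypothetical prior: fully implied-by
print(nlp_prior_loss(pred, prior))  # sqrt(0.14) ≈ 0.374
```

`math.dist` requires Python 3.8 or later.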

SLIDE 30

Approach | Full model

$$C = C_{ac} + \alpha_r C_{rec} + \alpha_n C_{nlp} + \alpha_c C_{cons} + \lambda \lVert W \rVert_2^2$$

Full model $C$:

  • $C_{ac}$: basic action retrieval model [Image + Action]
  • $C_{rec}$: visual objective [Image + Action]
  • $C_{nlp}$: language prior [only Action]
  • $C_{cons}$: consistency objective [only Action]
  • $W$: the weights in the model

SLIDE 31

Approach | Action retrieval with relationship

  • Visual objective

A is implied-by B: rank the positive images of B higher than the negatives of A.

$$C_{AB}^i = \sum_{I_b \in \mathcal{T}_B^+} \sum_{I^- \in \mathcal{T}_A^-} \max\left(0,\ 1 + w_A^\top \left(f_{I^-} - f_{I_b}\right)\right)$$

$\mathcal{T}_B^+$: a set of positive images of B; $\mathcal{T}_A^-$: a set of negative images of A

A is type-of B: rank the positive images of A higher than the negatives of B.

$$C_{AB}^t = \sum_{I_a \in \mathcal{T}_A^+} \sum_{I^- \in \mathcal{T}_B^-} \max\left(0,\ 1 + w_B^\top \left(f_{I^-} - f_{I_a}\right)\right)$$

$\mathcal{T}_A^+$: a set of positive images of A; $\mathcal{T}_B^-$: a set of negative images of B

A is mutually exclusive of B: rank the positive images of A higher than the positives of B.

$$C_{AB}^m = \sum_{I_a \in \mathcal{T}_A^+} \sum_{I_b \in \mathcal{T}_B^+} \max\left(0,\ 1 + w_A^\top \left(f_{I_b} - f_{I_a}\right)\right)$$

$\mathcal{T}_A^+$: a set of positive images of A; $\mathcal{T}_B^+$: a set of positive images of B
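The three costs share one ranking hinge; they differ only in which image sets are compared and whose weight vector scores them. A plain-Python sketch of the reconstruction above; the helpers `ranking_cost` and `relation_costs` and the toy vectors are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def ranking_cost(w, higher, lower):
    """Sum over f_h in higher, f_l in lower of max(0, 1 + w^T (f_l - f_h))."""
    return sum(
        max(0.0, 1.0 + dot(w, [lo - hi for lo, hi in zip(f_l, f_h)]))
        for f_h in higher
        for f_l in lower
    )


def relation_costs(w_a, w_b, pos_a, neg_a, pos_b, neg_b):
    """Return (C_AB^i, C_AB^t, C_AB^m) for one pair of actions A, B."""
    c_implied = ranking_cost(w_a, higher=pos_b, lower=neg_a)  # B's positives above A's negatives, scored by w_A
    c_type = ranking_cost(w_b, higher=pos_a, lower=neg_b)     # A's positives above B's negatives, scored by w_B
    c_excl = ranking_cost(w_a, higher=pos_a, lower=pos_b)     # A's positives above B's positives, scored by w_A
    return c_implied, c_type, c_excl


print(relation_costs(
    w_a=[1.0, 0.0], w_b=[0.0, 1.0],
    pos_a=[[2.0, 0.0]], neg_a=[[0.0, 0.0]],
    pos_b=[[0.0, 2.0]], neg_b=[[0.0, 0.0]],
))  # (1.0, 1.0, 0.0)
```

In this toy call $w_A$ does not yet score B's positive image above A's negative, so the implied-by cost is 1; the mutually exclusive cost is 0 because A's positive already outscores B's positive by more than the margin.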
SLIDE 32

Approach | Action retrieval with relationship

  • Visual objective

Objective:

$$C_{rec} = \sum_{A \in \mathcal{A}} \sum_{B \in \mathcal{R}_A} \left( r_{AB}^i \cdot C_{AB}^i + r_{AB}^t \cdot C_{AB}^t + r_{AB}^m \cdot C_{AB}^m \right)$$

Relationship prediction: $r_{AB} = \{r_{AB}^i, r_{AB}^t, r_{AB}^m\}$

→ Sum the costs of each relation ($C_{AB}^i, C_{AB}^t, C_{AB}^m$), each weighted by its relationship prediction ($r_{AB}^i, r_{AB}^t, r_{AB}^m$); a relation's cost is applied in full when its prediction is 1.

SLIDE 33

Approach | Full model

$$C = C_{ac} + \alpha_r C_{rec} + \alpha_n C_{nlp} + \alpha_c C_{cons} + \lambda \lVert W \rVert_2^2$$

Full model $C$:

  • $C_{ac}$: basic action retrieval model [Image + Action]
  • $C_{rec}$: visual objective [Image + Action]
  • $C_{nlp}$: language prior [only Action]
  • $C_{cons}$: consistency objective [only Action]
  • $W$: the weights in the model

SLIDE 34

Approach | Action retrieval with relationship

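  • Consistency

$$C_{cons} = \sum_{A \in \mathcal{A}} \sum_{B \in \mathcal{R}_A} \sum_{C \in \mathcal{R}_B} \sum_{\substack{(t_1, t_2, t_3) \in \{i, t, m\}^3 \\ \text{inconsistent}}} r_{AB}^{t_1} \cdot r_{BC}^{t_2} \cdot r_{CA}^{t_3}$$

Constrain the relationship assignment between actions.

Ex) A is implied-by B, B is implied-by C, and A is mutually exclusive of C: these relationships are inconsistent, because the first two already imply that A is implied-by C. They wanted to avoid this kind of problem.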
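A sketch of the consistency term. The slides do not list the full set of inconsistent triplets, so the set below contains only the slide's own example (implied-by, implied-by, mutually exclusive); that set, the helper name `consistency_cost`, and the toy probabilities are all assumptions.

```python
# Hypothetical set of inconsistent triplets (t1, t2, t3) for (r_AB, r_BC, r_CA).
# Only the slide's example is encoded: A implied-by B, B implied-by C, C and A mutually exclusive.
# "i" = implied-by, "t" = type-of, "m" = mutually exclusive.
INCONSISTENT = {("i", "i", "m")}


def consistency_cost(r, related):
    """Sum over A, B in R_A, C in R_B of r_AB^t1 * r_BC^t2 * r_CA^t3 over inconsistent triplets.

    r: dict (X, Y) -> dict relation -> predicted probability.
    related: dict X -> list of related actions R_X.
    """
    total = 0.0
    for a, related_a in related.items():
        for b in related_a:
            for c in related.get(b, []):
                if (c, a) not in r:  # r_CA must be predicted to close the triangle
                    continue
                for t1, t2, t3 in INCONSISTENT:
                    total += r[(a, b)][t1] * r[(b, c)][t2] * r[(c, a)][t3]
    return total


r = {
    ("A", "B"): {"i": 0.9, "t": 0.05, "m": 0.05},
    ("B", "C"): {"i": 0.8, "t": 0.1, "m": 0.1},
    ("C", "A"): {"i": 0.1, "t": 0.1, "m": 0.8},
}
related = {"A": ["B"], "B": ["C"]}
print(consistency_cost(r, related))  # about 0.576 = 0.9 * 0.8 * 0.8
```

Confident predictions for an inconsistent chain are penalized heavily; lowering any one of the three probabilities lowers the penalty.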

SLIDE 35

Approach | Full model

$$C = C_{ac} + \alpha_r C_{rec} + \alpha_n C_{nlp} + \alpha_c C_{cons} + \lambda \lVert W \rVert_2^2$$

Full model $C$:

  • $C_{ac}$: basic action retrieval model [Image + Action]
  • $C_{rec}$: visual objective [Image + Action]
  • $C_{nlp}$: language prior [only Action]
  • $C_{cons}$: consistency objective [only Action]
  • $W$: the weights in the model

  • The full objective is minimized with Downpour stochastic gradient descent.
  • Hyper-parameters of the model: $\beta, \lambda, \alpha_r, \alpha_n, \alpha_c$
SLIDE 36

Result |

SLIDE 37

  • Thank you.

Q & A |