SPICE: Semantic Propositional Image Caption Evaluation


SLIDE 1

SPICE: Semantic Propositional Image Caption Evaluation

Presented to the COCO Consortium, Sept 2016

Peter Anderson¹, Basura Fernando¹, Mark Johnson² and Stephen Gould¹

¹ Australian National University   ² Macquarie University

SLIDE 2

Image captioning

Source: MS COCO Captions dataset
Source: http://aipoly.com/

SLIDE 3

Automatic caption evaluation

  • Benchmark datasets require fast-to-compute, accurate and inexpensive evaluation metrics
  • Good metrics can be used to help construct better models

The Evaluation Task: Given a candidate caption c_i and a set of m reference captions R_i = {r_i1, …, r_im}, compute a score S_i that represents the similarity between c_i and R_i.
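Not from the slides: a minimal sketch of this task as a programming interface. The names evaluate_system and CaptionMetric are illustrative assumptions; any concrete metric (BLEU, CIDEr, SPICE, …) would fill the metric slot.

```python
from typing import Callable, Dict, List

# A caption metric maps a candidate caption c_i and its reference set R_i to a score S_i.
CaptionMetric = Callable[[str, List[str]], float]

def evaluate_system(candidates: Dict[str, str],
                    references: Dict[str, List[str]],
                    metric: CaptionMetric) -> float:
    """Score every candidate against its references and average over the dataset."""
    scores = [metric(caption, references[image_id])
              for image_id, caption in candidates.items()]
    return sum(scores) / len(scores)
```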

SLIDE 4

Existing state of the art

Source: Lin Cui, Large-scale Scene UNderstanding Workshop, CVPR 2015

  • Nearest neighbour captions often ranked higher than human captions
SLIDE 5

Existing metrics

  • BLEU: Precision with brevity penalty, geometric mean over n-grams
  • ROUGE-L: F-score based on Longest Common Subsequence
  • METEOR: Align fragments, take harmonic mean of precision & recall
  • CIDEr: Cosine similarity with TF-IDF weighting
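Not from the slides: a minimal sketch of the clipped n-gram precision that underlies BLEU, to make the n-gram overlap these metrics share concrete. Full BLEU additionally combines several n-gram orders with a geometric mean and applies a brevity penalty.

```python
from collections import Counter
from typing import List, Sequence

def clipped_ngram_precision(candidate: Sequence[str],
                            references: List[Sequence[str]],
                            n: int = 2) -> float:
    """Modified (clipped) n-gram precision, the core statistic behind BLEU."""
    cand_ngrams = Counter(tuple(candidate[i:i + n])
                          for i in range(len(candidate) - n + 1))
    if not cand_ngrams:
        return 0.0
    # A candidate n-gram only counts up to the maximum number of times
    # it appears in any single reference.
    max_ref_counts: Counter = Counter()
    for ref in references:
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        for gram, count in ref_ngrams.items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_ngrams.items())
    return clipped / sum(cand_ngrams.values())
```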

SLIDE 6

Motivation

‘False positive’ (high n-gram similarity):

  • A young girl standing on top of a tennis court.
  • A giraffe standing on top of a green field.

‘False negative’ (low n-gram similarity):

  • A shiny metal pot filled with some diced veggies.
  • The pan on the stove has chopped vegetables in it.

…n-gram overlap is neither necessary nor sufficient for two sentences to mean the same thing

Source: MS COCO Captions dataset

…SPICE primarily addresses false positives

SLIDE 7

Is this a good caption?

“A young girl standing on top of a basketball court”

SLIDE 8

Is this a good caption?

“A young girl standing on top of a basketball court”

Semantic propositions:

1. There is a girl
2. The girl is young
3. The girl is standing
4. There is a court
5. The court is used for basketball
6. The girl is on the court

SLIDE 9

Key Idea – scene graphs¹

1. Input
2. Parse²
3. Scene graph³
4. Tuples: (girl), (court), (girl, young), (girl, standing), (court, tennis), (girl, on-top-of, court)

¹ Johnson et al.: Image Retrieval Using Scene Graphs, CVPR 2015
² Klein & Manning: Accurate Unlexicalized Parsing, ACL 2003
³ Schuster et al.: Generating semantically precise scene graphs from textual descriptions for improved image retrieval, EMNLP 2015
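Not from the slides: a minimal sketch of the final graph-to-tuples step for the example above. The scene_graph structure and graph_to_tuples helper are illustrative assumptions; steps 1–3 are performed by the cited dependency and scene-graph parsers.

```python
# Hypothetical in-memory scene graph for "A young girl standing on top of a tennis court",
# matching the tuples listed on the slide.
scene_graph = {
    "objects": ["girl", "court"],
    "attributes": [("girl", "young"), ("girl", "standing"), ("court", "tennis")],
    "relations": [("girl", "on-top-of", "court")],
}

def graph_to_tuples(graph: dict) -> set:
    """Flatten a scene graph into the set of tuples used for SPICE matching (step 4)."""
    tuples = {(obj,) for obj in graph["objects"]}
    tuples |= {tuple(attr) for attr in graph["attributes"]}
    tuples |= {tuple(rel) for rel in graph["relations"]}
    return tuples
```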
SLIDE 10

SPICE Calculation

SPICE is calculated as an F-score over tuples, with:

  • Merging of synonymous nodes, and
  • WordNet synsets used for tuple matching and merging.

Given candidate caption c, a set of reference captions S, and the mapping T from captions to tuples:
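The formula itself appeared as an image on the original slide; restating the definition from the SPICE paper, with T(c) the candidate's tuple set, T(S) the combined tuple set of the references, and ⊗ denoting the WordNet-synonym-aware matching of tuples:

P(c, S)     = |T(c) ⊗ T(S)| / |T(c)|
R(c, S)     = |T(c) ⊗ T(S)| / |T(S)|
SPICE(c, S) = F1(c, S) = 2 · P(c, S) · R(c, S) / (P(c, S) + R(c, S))

Not from the slides: a minimal Python sketch of this F-score. It uses exact tuple matching only; real SPICE also merges synonymous nodes and matches via WordNet synsets.

```python
def spice_f1(cand_tuples: set, ref_tuples: set) -> float:
    """F1 over tuple sets (exact matching; SPICE itself matches via WordNet synsets)."""
    matched = len(cand_tuples & ref_tuples)
    if not cand_tuples or not ref_tuples or matched == 0:
        return 0.0
    precision = matched / len(cand_tuples)
    recall = matched / len(ref_tuples)
    return 2 * precision * recall / (precision + recall)
```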

SLIDE 11

Example – good caption

SLIDE 12

Example – good caption

SLIDE 13

Example – weak caption

SLIDE 14

Example – weak caption

SLIDE 15

Evaluation – MS COCO (C40)

Pearson ρ correlation between evaluation metrics and human judgments for the 15 competition entries plus human captions in the 2015 COCO Captioning Challenge, using 40 reference captions.

Source: Our thanks to the COCO Consortium for performing this evaluation using MS COCO Captions C40.

SLIDE 16

Evaluation – MS COCO (C40)

SPICE picks the same top-5 as human evaluators.

Absolute scores are lower with 40 reference captions than with 5 reference captions.

Source: Our thanks to the COCO Consortium for performing this evaluation using MS COCO Captions C40.

SLIDE 17

Gameability

  • SPICE measures how well caption models recover objects, attributes and relations
  • Fluency is neglected (as with n-gram metrics)
  • If fluency is a concern, include a fluency metric such as surprisal*

* Hale, J.: A Probabilistic Earley Parser as a Psycholinguistic Model, 2001; Levy, R.: Expectation-based Syntactic Comprehension, 2008
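Not from the slides: surprisal in this sense is the negative log-probability a language model assigns to each word given its prefix, so disfluent captions accumulate high surprisal. A minimal sketch, with next_word_prob standing in for any language model's conditional probability (a hypothetical interface, not a real library call):

```python
import math
from typing import Callable, Sequence

def sentence_surprisal(tokens: Sequence[str],
                       next_word_prob: Callable[[Sequence[str], str], float]) -> float:
    """Total surprisal -sum(log2 P(w_t | w_1..t-1)); lower values indicate more fluent text."""
    total = 0.0
    for t, word in enumerate(tokens):
        p = next_word_prob(tokens[:t], word)
        total += -math.log2(max(p, 1e-12))  # guard against zero probabilities
    return total
```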

SLIDE 18

SPICE for error analysis

Breakdown of SPICE F-score over objects, attributes and relations
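Not from the slides: a minimal sketch of how such a breakdown can be computed. Tuples split by arity into the three categories, and the same F-score is computed per category (reusing the illustrative spice_f1 helper from the earlier sketch).

```python
def tuples_by_category(tuples: set) -> dict:
    """Partition a tuple set into objects, attributes and relations by tuple length."""
    return {
        "objects":    {t for t in tuples if len(t) == 1},
        "attributes": {t for t in tuples if len(t) == 2},
        "relations":  {t for t in tuples if len(t) == 3},
    }

# Per-category score, e.g. for attributes:
# spice_f1(tuples_by_category(cand_tuples)["attributes"],
#          tuples_by_category(ref_tuples)["attributes"])
```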

SLIDE 19

Can caption models count?

Breakdown of attribute F-score over color, number and size attributes

SLIDE 20

Summary

  • SPICE measures how well caption models recover objects, attributes and relations
  • SPICE captures human judgment better than CIDEr, BLEU, METEOR and ROUGE
  • Tuples can be categorized for detailed error analysis
  • Scope for further improvement as better semantic parsers are developed
  • Next steps: using SPICE to build better caption models!

SLIDE 21

Thank you

Link: SPICE Project Page (http://panderson.me/spice)

Acknowledgement: We are grateful to the COCO Consortium for re-evaluating the 2015 Captioning Challenge entries using SPICE.