SPICE: Semantic Propositional Image Caption Evaluation


SLIDE 1

SPICE: Semantic Propositional Image Caption Evaluation

Presented to the COCO Consortium, Sept 2016

Peter Anderson¹, Basura Fernando¹, Mark Johnson² and Stephen Gould¹

¹ Australian National University   ² Macquarie University

SLIDE 2

Image captioning

Source: MS COCO Captions dataset
Source: http://aipoly.com/

SLIDE 3

Automatic caption evaluation

  • Benchmark datasets require fast-to-compute, accurate and inexpensive evaluation metrics
  • Good metrics can be used to help construct better models

The Evaluation Task: Given a candidate caption c_i and a set of m reference captions R_i = {r_i1, …, r_im}, compute a score S_i that represents the similarity between c_i and R_i.
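Not from the slides: a minimal sketch of this task as a programming interface. The names evaluate_system and CaptionMetric are illustrative assumptions; any concrete metric (BLEU, CIDEr, SPICE, …) would fill the metric slot.

```python
from typing import Callable, Dict, List

# A caption metric maps a candidate caption c_i and its reference set R_i to a score S_i.
CaptionMetric = Callable[[str, List[str]], float]

def evaluate_system(candidates: Dict[str, str],
                    references: Dict[str, List[str]],
                    metric: CaptionMetric) -> float:
    """Score every candidate against its references and average over the dataset."""
    scores = [metric(caption, references[image_id])
              for image_id, caption in candidates.items()]
    return sum(scores) / len(scores)
```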

SLIDE 4

Existing state of the art

Source: Lin Cui, Large-scale Scene UNderstanding Workshop, CVPR 2015

  • Nearest neighbour captions often ranked higher than human captions
SLIDE 5

Existing metrics

  • BLEU: Precision with brevity penalty, geometric mean over n-grams
  • ROUGE-L: F-score based on Longest Common Subsequence
  • METEOR: Align fragments, take harmonic mean of precision & recall
  • CIDEr: Cosine similarity with TF-IDF weighting
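Not from the slides: a minimal sketch of the clipped n-gram precision that underlies BLEU, to make the n-gram overlap these metrics share concrete. Full BLEU additionally combines several n-gram orders with a geometric mean and applies a brevity penalty.

```python
from collections import Counter
from typing import List, Sequence

def clipped_ngram_precision(candidate: Sequence[str],
                            references: List[Sequence[str]],
                            n: int = 2) -> float:
    """Modified (clipped) n-gram precision, the core statistic behind BLEU."""
    cand_ngrams = Counter(tuple(candidate[i:i + n])
                          for i in range(len(candidate) - n + 1))
    if not cand_ngrams:
        return 0.0
    # A candidate n-gram only counts up to the maximum number of times
    # it appears in any single reference.
    max_ref_counts: Counter = Counter()
    for ref in references:
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        for gram, count in ref_ngrams.items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_ngrams.items())
    return clipped / sum(cand_ngrams.values())
```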

SLIDE 6

Motivation

‘False positive’ (high n-gram similarity):

  • A young girl standing on top of a tennis court.
  • A giraffe standing on top of a green field.

‘False negative’ (low n-gram similarity):

  • A shiny metal pot filled with some diced veggies.
  • The pan on the stove has chopped vegetables in it.

…n-gram overlap is neither necessary nor sufficient for two sentences to mean the same thing

Source: MS COCO Captions dataset

…SPICE primarily addresses false positives

SLIDE 7

Is this a good caption?

“A young girl standing on top of a basketball court”

SLIDE 8

Is this a good caption?

“A young girl standing on top of a basketball court”

Semantic propositions:

1. There is a girl
2. The girl is young
3. The girl is standing
4. There is a court
5. The court is used for basketball
6. The girl is on the court

SLIDE 9

Key Idea – scene graphs¹

1. Input
2. Parse²
3. Scene graph³
4. Tuples: (girl), (court), (girl, young), (girl, standing), (court, tennis), (girl, on-top-of, court)

¹ Johnson et al.: Image Retrieval Using Scene Graphs, CVPR 2015
² Klein & Manning: Accurate Unlexicalized Parsing, ACL 2003
³ Schuster et al.: Generating semantically precise scene graphs from textual descriptions for improved image retrieval, EMNLP 2015
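Not from the slides: a minimal sketch of the final graph-to-tuples step for the example above. The scene_graph structure and graph_to_tuples helper are illustrative assumptions; steps 1–3 are performed by the cited dependency and scene-graph parsers.

```python
# Hypothetical in-memory scene graph for "A young girl standing on top of a tennis court",
# matching the tuples listed on the slide.
scene_graph = {
    "objects": ["girl", "court"],
    "attributes": [("girl", "young"), ("girl", "standing"), ("court", "tennis")],
    "relations": [("girl", "on-top-of", "court")],
}

def graph_to_tuples(graph: dict) -> set:
    """Flatten a scene graph into the set of tuples used for SPICE matching (step 4)."""
    tuples = {(obj,) for obj in graph["objects"]}
    tuples |= {tuple(attr) for attr in graph["attributes"]}
    tuples |= {tuple(rel) for rel in graph["relations"]}
    return tuples
```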
SLIDE 10

SPICE Calculation

SPICE is calculated as an F-score over tuples, with:

  • Merging of synonymous nodes, and
  • WordNet synsets used for tuple matching and merging.

Given candidate caption c, a set of reference captions S, and the mapping T from captions to tuples:
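The formula itself appeared as an image on the original slide; restating the definition from the SPICE paper, with T(c) the candidate's tuple set, T(S) the combined tuple set of the references, and ⊗ denoting the WordNet-synonym-aware matching of tuples:

P(c, S)     = |T(c) ⊗ T(S)| / |T(c)|
R(c, S)     = |T(c) ⊗ T(S)| / |T(S)|
SPICE(c, S) = F1(c, S) = 2 · P(c, S) · R(c, S) / (P(c, S) + R(c, S))

Not from the slides: a minimal Python sketch of this F-score. It uses exact tuple matching only; real SPICE also merges synonymous nodes and matches via WordNet synsets.

```python
def spice_f1(cand_tuples: set, ref_tuples: set) -> float:
    """F1 over tuple sets (exact matching; SPICE itself matches via WordNet synsets)."""
    matched = len(cand_tuples & ref_tuples)
    if not cand_tuples or not ref_tuples or matched == 0:
        return 0.0
    precision = matched / len(cand_tuples)
    recall = matched / len(ref_tuples)
    return 2 * precision * recall / (precision + recall)
```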

SLIDE 11

Example – good caption

SLIDE 12

Example – good caption

SLIDE 13

Example – weak caption

SLIDE 14

Example – weak caption

SLIDE 15

Evaluation – MS COCO (C40)

Pearson ρ correlation between evaluation metrics and human judgments for the 15 competition entries plus human captions in the 2015 COCO Captioning Challenge, using 40 reference captions.

Source: Our thanks to the COCO Consortium for performing this evaluation using MS COCO Captions C40.

SLIDE 16

Evaluation – MS COCO (C40)

SPICE picks the same top-5 as human evaluators.

Absolute scores are lower with 40 reference captions than with 5 reference captions.

Source: Our thanks to the COCO Consortium for performing this evaluation using MS COCO Captions C40.

SLIDE 17

Gameability

  • SPICE measures how well caption models recover objects, attributes and relations
  • Fluency is neglected (as with n-gram metrics)
  • If fluency is a concern, include a fluency metric such as surprisal*

* Hale, J.: A Probabilistic Earley Parser as a Psycholinguistic Model, 2001; Levy, R.: Expectation-based Syntactic Comprehension, 2008
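Not from the slides: surprisal in this sense is the negative log-probability a language model assigns to each word given its prefix, so disfluent captions accumulate high surprisal. A minimal sketch, with next_word_prob standing in for any language model's conditional probability (a hypothetical interface, not a real library call):

```python
import math
from typing import Callable, Sequence

def sentence_surprisal(tokens: Sequence[str],
                       next_word_prob: Callable[[Sequence[str], str], float]) -> float:
    """Total surprisal -sum(log2 P(w_t | w_1..t-1)); lower values indicate more fluent text."""
    total = 0.0
    for t, word in enumerate(tokens):
        p = next_word_prob(tokens[:t], word)
        total += -math.log2(max(p, 1e-12))  # guard against zero probabilities
    return total
```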

SLIDE 18

SPICE for error analysis

Breakdown of SPICE F-score over objects, attributes and relations
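Not from the slides: a minimal sketch of how such a breakdown can be computed. Tuples split by arity into the three categories, and the same F-score is computed per category (reusing the illustrative spice_f1 helper from the earlier sketch).

```python
def tuples_by_category(tuples: set) -> dict:
    """Partition a tuple set into objects, attributes and relations by tuple length."""
    return {
        "objects":    {t for t in tuples if len(t) == 1},
        "attributes": {t for t in tuples if len(t) == 2},
        "relations":  {t for t in tuples if len(t) == 3},
    }

# Per-category score, e.g. for attributes:
# spice_f1(tuples_by_category(cand_tuples)["attributes"],
#          tuples_by_category(ref_tuples)["attributes"])
```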

SLIDE 19

Can caption models count?

Breakdown of attribute F-score over color, number and size attributes

SLIDE 20

Summary

  • SPICE measures how well caption models recover objects, attributes and relations
  • SPICE captures human judgment better than CIDEr, BLEU, METEOR and ROUGE
  • Tuples can be categorized for detailed error analysis
  • Scope for further improvement as better semantic parsers are developed
  • Next steps: using SPICE to build better caption models!

SLIDE 21

Thank you

Link: SPICE Project Page (http://panderson.me/spice)

Acknowledgement: We are grateful to the COCO Consortium for re-evaluating the 2015 Captioning Challenge entries using SPICE.