ReferItGame: Referring to Objects in Photographs of Natural Scenes
Motivation
- First large-scale referring expression dataset
- Reference expressions are the natural way people talk
– Of psychological interest in the ‘70s; Grice, Rosch, Winograd
- Application to human-computer interaction, robots
- Contributions
– A large-scale dataset of referring expressions
– A benchmark model for generating referring expressions
Motivation
- Natural referring expressions are free-form
– ‘smiling boy’; subject only
– ‘man on left’; subject and preposition
- Other work requires expressions of the form (subj, prep, obj)
– ‘cat on the chair’
Dataset
- Build on SAIAPR TC-12 dataset with 238 object categories
- Visual features include segmentations with
– absolute properties: area, boundary, width, height…
– relative properties: adjacent, disjoint, beside, X-aligned, above…
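The properties above can be computed directly from segmentation masks. This is a hedged sketch, not the paper's feature extractor: it computes a few absolute properties (area, bounding-box width/height) and one relative property (above, by centroid row) from toy binary masks given as lists of 0/1 rows.

```python
# Sketch (not the paper's code): a few absolute and relative
# segmentation properties from binary masks given as lists of 0/1 rows.

def absolute_properties(mask):
    """Absolute properties of one region: area and bounding-box size."""
    cells = [(y, x) for y, row in enumerate(mask)
                    for x, v in enumerate(row) if v]
    ys = [y for y, _ in cells]
    xs = [x for _, x in cells]
    return {"area": len(cells),
            "width": max(xs) - min(xs) + 1,
            "height": max(ys) - min(ys) + 1}

def centroid_row(mask):
    """Mean row index of the region's pixels."""
    ys = [y for y, row in enumerate(mask) for v in row if v]
    return sum(ys) / len(ys)

def is_above(mask_a, mask_b):
    """Relative property: A is above B if its centroid row is smaller
    (image rows grow downward)."""
    return centroid_row(mask_a) < centroid_row(mask_b)

# Toy 4x4 masks: a blob at the top-left, another at the bottom-right.
a = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
b = [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
print(absolute_properties(a))  # {'area': 4, 'width': 2, 'height': 2}
print(is_above(a, b))          # True
```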
Dataset
- Player 1 writes an expression referring to the segmented object
- Player 2 clicks on where that object should be
– This verifies the expression is reasonable
Dataset
- Collected through Turkers and volunteers
– ~130,000 expressions
– ~100,000 distinct objects
– ~20,000 photographs
- Unfortunately, www.referitgame.com is no longer online
Dataset
- Parse each expression into a 7-tuple of attributes, R
– entry-level category; ‘bird’
– color; ‘blue’
– size; ‘tiny’
– absolute location; ‘top of the image’
– relative-location relation; ‘the car to the left of the tree’
– relative-location object; ‘the car to the left of the tree’
– generic; ‘wooden’, ‘round’
- The big old white cabin beside the tree
– R = {cabin, white, big, Ø, beside, tree, old}
- Parsed with the Stanford CoreNLP parser and attribute templates
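The paper extracts these attributes with the Stanford CoreNLP parser and attribute templates; the sketch below replaces that with simple keyword lookup over toy lexicons (the word lists are my own illustrative assumptions) just to show the target 7-tuple structure.

```python
# Simplified attribute extraction into the 7-tuple R.
# Toy lexicons below are illustrative assumptions, not the paper's templates.
CATEGORIES = {"cabin", "bird", "car", "tree"}
COLORS = {"white", "blue", "red"}
SIZES = {"big", "tiny", "small"}
ABS_LOC = {"top", "bottom", "left", "right"}
REL_REL = {"beside", "above", "below", "on"}
GENERIC = {"old", "wooden", "round"}

def parse_expression(expr):
    tokens = expr.lower().split()
    R = {"category": None, "color": None, "size": None, "abs_loc": None,
         "rel_rel": None, "rel_obj": None, "generic": None}
    for i, t in enumerate(tokens):
        if t in COLORS:
            R["color"] = t
        elif t in SIZES:
            R["size"] = t
        elif t in ABS_LOC:
            R["abs_loc"] = t
        elif t in GENERIC:
            R["generic"] = t
        elif t in REL_REL:
            R["rel_rel"] = t
            # the first category word after the relation is the relative object
            rest = [w for w in tokens[i + 1:] if w in CATEGORIES]
            if rest:
                R["rel_obj"] = rest[0]
        elif t in CATEGORIES and R["category"] is None:
            R["category"] = t
    return R

print(parse_expression("The big old white cabin beside the tree"))
# {'category': 'cabin', 'color': 'white', 'size': 'big', 'abs_loc': None,
#  'rel_rel': 'beside', 'rel_obj': 'tree', 'generic': 'old'}
```

This reproduces the slide's example: R = {cabin, white, big, Ø, beside, tree, old}.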
Dataset
- Psychological analysis
– ‘woman’ often replaced with ‘person’
Dataset
- Attribute use
– Roughly half of parsed descriptions are just category
Model
- Optimize R given P and S using integer linear programming (ILP)
– R is the 7-tuple of attributes
– P is visual features of the object being referred to
– S is visual features of the scene
- Different hand-engineered distributions for different attributes
- Unary priors between attribute and object
- Pairwise priors between pairs of attributes
Evaluation
- Three test sets of 500 images each
– A contains interesting objects
– B contains the most frequently occurring interesting objects
– C contains interesting objects when multiple are present
- Baseline model
– Incorporates only the priors, so no S or attributes
- Humans achieve ~72% accuracy
Critique
- How important is the scene for the attributes?
– S is only used for the relative-location {relation, object} attributes
– Absolute location is the most commonly used attribute
– Over half of parsed descriptions include only the object category
- Why don’t the authors include more information on the visual features?
– Which visual features are most important?
- Better metric than precision and recall?
– Just ask AMT workers if description is reasonable?
Critique
- Why don’t the authors analyze the training referring expressions more?
– Turk workers were paid per 10 images
– Some human expressions are just the object category
Future Work
- Scale up the dataset and train end-to-end with the best neural networks
- Identify referred object instead of generating expression
– Done in the later MAttNet paper
- Make the images and expressions more challenging