ReferItGame: Referring to Objects in Photographs of Natural Scenes



SLIDE 1

ReferItGame: Referring to Objects in Photographs of Natural Scenes

SLIDE 2

Motivation

  • First large-scale referring expression dataset
  • Referring expressions are the natural way people talk

– Of psychological interest in the ’70s: Grice, Rosch, Winograd

  • Application to human-computer interaction, robots
  • Introduce

– A large-scale dataset of referring expressions
– A benchmark model for generating referring expressions

SLIDE 3

Motivation

  • Natural referring expressions are free-form

– ‘smiling boy’: subject only
– ‘man on left’: subject and preposition

  • Other work requires each expression to be a (subj, prep, obj) triple

– ‘cat on the chair’

SLIDE 4

Dataset

  • Build on SAIAPR TC-12 dataset with 238 object categories
  • Visual features include segmentations with

– absolute properties: area, boundary, width, height…
– relative properties: adjacent, disjoint, beside, X-aligned, above…
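
A minimal sketch of how a few of these properties could be computed from a segmentation. It assumes each segment is stored as a set of (row, col) pixels; the function names and the exact definitions of "above" and "disjoint" are illustrative assumptions, not the dataset's own feature code.

```python
# Hypothetical helpers: a segment is a non-empty set of (row, col) pixels.

def absolute_properties(pixels):
    """Area, width and height (in pixels) of one segment."""
    rows = [r for r, _ in pixels]
    cols = [c for _, c in pixels]
    return {
        "area": len(pixels),
        "width": max(cols) - min(cols) + 1,
        "height": max(rows) - min(rows) + 1,
    }

def is_disjoint(a, b):
    """Relative property: the two segments share no pixel."""
    return a.isdisjoint(b)

def is_above(a, b):
    """Relative property (assumed definition): every pixel of a lies in a
    row above every pixel of b. Row 0 is the top of the image."""
    return max(r for r, _ in a) < min(r for r, _ in b)

sky = {(0, 0), (0, 1), (0, 2)}
tree = {(2, 1), (3, 1)}
print(absolute_properties(sky))
print(is_above(sky, tree), is_disjoint(sky, tree))
```

Sets of pixels keep the sketch dependency-free; a real pipeline would more likely use array masks.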

SLIDE 5

Dataset

  • Player 1 writes an expression referencing the segmented object
  • Player 2 clicks on where that object should be

– This verifies the expression is reasonable
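
The verification step can be sketched as a point-in-segment test. This assumes the target object is stored as a set of (row, col) pixels and that a click is reported in the same coordinates; both function names are hypothetical and the game's actual acceptance rule may differ.

```python
# Hypothetical check: does Player 2's click land on the target object?

def click_verifies(click, object_pixels):
    """True if the clicked (row, col) position is part of the object."""
    return tuple(click) in object_pixels

def verified_fraction(rounds):
    """Share of (click, object_pixels) rounds where the click hit the object."""
    if not rounds:
        return 0.0
    hits = sum(click_verifies(click, pixels) for click, pixels in rounds)
    return hits / len(rounds)

# A 3x2 block of pixels standing in for the 'smiling boy' segment.
boy = {(r, c) for r in range(2, 5) for c in range(1, 3)}
print(click_verifies((3, 2), boy))
print(verified_fraction([((3, 2), boy), ((0, 0), boy)]))
```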

SLIDE 6

Dataset

  • Collected through Turkers and volunteers

– ~130,000 expressions
– ~100,000 distinct objects
– ~20,000 photographs

  • www.referitgame.com is down unfortunately
SLIDE 7

Dataset

  • Parse each expression into a 7-tuple of attributes, R

– entry-level category: ‘bird’
– color: ‘blue’
– size: ‘tiny’
– absolute location: ‘top of the image’
– relative location relation: ‘to the left of’ in ‘the car to the left of the tree’
– relative location object: ‘tree’ in ‘the car to the left of the tree’
– generic: ‘wooden’, ‘round’

  • The big old white cabin beside the tree

– R = {cabin, white, big, Ø, beside, tree, old}

  • StanfordCoreNLP parser and attribute template
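
One way to hold the 7-tuple R in code, shown as a sketch. The class and field names are assumptions chosen to mirror the list above, with None standing in for Ø; the parsing step itself (StanfordCoreNLP plus the attribute template) is not reproduced here.

```python
from typing import NamedTuple, Optional

class ReferringAttributes(NamedTuple):
    """Hypothetical container for R; None plays the role of the empty slot."""
    category: Optional[str] = None
    color: Optional[str] = None
    size: Optional[str] = None
    absolute_location: Optional[str] = None
    relative_relation: Optional[str] = None
    relative_object: Optional[str] = None
    generic: Optional[str] = None

def only_category(r):
    """True if the expression names the category and nothing else."""
    return r.category is not None and all(v is None for v in r[1:])

# 'The big old white cabin beside the tree'
cabin = ReferringAttributes(
    category="cabin", color="white", size="big",
    relative_relation="beside", relative_object="tree", generic="old",
)
print(tuple(cabin))
print(only_category(cabin), only_category(ReferringAttributes(category="bird")))
```

A fixed-order tuple keeps the slot positions aligned with the R = {…} notation used on the slide.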
SLIDE 8

Dataset

  • Psychology analysis

– ‘woman’ often replaced with ‘person’

SLIDE 9

Dataset

  • Attribute use

– Roughly half of parsed descriptions are just category

SLIDE 10

Model

  • Optimize R given P and S using integer linear programming (ILP)

– R is the 7-tuple of attributes
– P is the visual features of the object being referred to
– S is the visual features of the scene

  • Different hand-engineered distributions for different attributes
  • Unary priors between attribute and object
  • Pairwise priors between pairs of attributes
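
A toy sketch of the kind of objective described above: choose which attributes to mention so that the sum of unary and pairwise scores is largest. It swaps the ILP solver for exhaustive search over subsets, covers only four attributes, and uses made-up scores; in the actual model the scores would come from the priors and from P and S.

```python
from itertools import combinations, product

# Hypothetical attribute slots and scores, for illustration only.
ATTRIBUTES = ["category", "color", "size", "absolute_location"]

def best_subset(unary, pairwise):
    """Exhaustively pick the attribute subset with the highest total score.

    unary maps attribute -> score for mentioning it.
    pairwise maps frozenset({a, b}) -> score for mentioning both.
    """
    best, best_score = (), float("-inf")
    for bits in product([0, 1], repeat=len(ATTRIBUTES)):
        chosen = tuple(a for a, b in zip(ATTRIBUTES, bits) if b)
        score = sum(unary[a] for a in chosen)
        score += sum(pairwise.get(frozenset(p), 0.0)
                     for p in combinations(chosen, 2))
        if score > best_score:
            best, best_score = chosen, score
    return best, best_score

unary = {"category": 2.0, "color": 0.5, "size": -0.25, "absolute_location": 1.0}
pairwise = {
    frozenset({"color", "size"}): 0.5,
    frozenset({"category", "absolute_location"}): -1.5,
}
print(best_subset(unary, pairwise))
```

Exhaustive search is only workable because there are so few slots; an ILP expresses the same selection with binary variables and scales to richer constraints.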
SLIDE 11

Evaluation

  • Three test sets of 500 images each

– A contains interesting objects
– B contains most frequently occurring interesting objects
– C contains interesting objects when multiple are present

  • Baseline model

– Incorporates only the priors, so no S or attributes

  • Humans ~72% accuracy
SLIDE 12

Critique

  • How important is the scene for the attributes?

– S is only used for relative location {relation, object} attributes
– Absolute location is most commonly used attribute
– Over half of parsed descriptions only include object category

  • Why don’t the authors include more information on the visual features?

– Which visual features are most important?

  • Better metric than precision and recall?

– Just ask AMT workers if description is reasonable?
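
For reference, precision and recall over attributes can be sketched as set overlap between a generated expression and a human one. Treating each expression as a flat set of attribute values is an assumption made here for illustration; the authors' exact matching procedure is not given in these slides.

```python
def precision_recall(predicted, reference):
    """Precision and recall of predicted attribute values against a reference set."""
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall

# Hypothetical generated vs. human attribute values for one object.
generated = {"cabin", "white", "left"}
human = {"cabin", "white", "big", "old"}
print(precision_recall(generated, human))
```

Both numbers only reward exact overlap, which is the limitation the critique points at: a description can score poorly here and still pick out the object unambiguously.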

SLIDE 13

Critique

  • Why don’t the authors analyze the training referring expressions more?

– Turk workers were paid per 10 images
– Some human expressions are just the object

SLIDE 14

Future Work

  • Scale up the dataset and train end-to-end with the best neural networks

  • Identify referred object instead of generating expression

– Done in upcoming MAttNet paper

  • Make the images and expressions more challenging