ReferItGame: Referring to Objects in Photographs of Natural Scenes
Motivation
- First large-scale referring expression dataset
- Reference expressions are the natural way people talk
– Of psychological interest in the ‘70s; Grice, Rosch, Winograd
- Application to human-computer interaction, robots
- Contributions
– A large-scale dataset of referring expressions
– A benchmark model for generating referring expressions
Motivation
- Natural referring expressions are free-form
– ‘smiling boy’; subject only
– ‘man on left’; subject and preposition
- Other work requires expressions of the form (subj, prep, obj)
– ‘cat on the chair’
Dataset
- Build on SAIAPR TC-12 dataset with 238 object categories
- Visual features include segmentations with
– absolute properties: area, boundary, width, height…
– relative properties: adjacent, disjoint, beside, X-aligned, above…
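The properties above can be computed directly from segmentation masks. This is a hedged sketch, not the paper's feature extractor: it computes a few absolute properties (area, bounding-box width/height) and one relative property (above, by centroid row) from toy binary masks given as lists of 0/1 rows.

```python
# Sketch (not the paper's code): a few absolute and relative
# segmentation properties from binary masks given as lists of 0/1 rows.

def absolute_properties(mask):
    """Absolute properties of one region: area and bounding-box size."""
    cells = [(y, x) for y, row in enumerate(mask)
                    for x, v in enumerate(row) if v]
    ys = [y for y, _ in cells]
    xs = [x for _, x in cells]
    return {"area": len(cells),
            "width": max(xs) - min(xs) + 1,
            "height": max(ys) - min(ys) + 1}

def centroid_row(mask):
    """Mean row index of the region's pixels."""
    ys = [y for y, row in enumerate(mask) for v in row if v]
    return sum(ys) / len(ys)

def is_above(mask_a, mask_b):
    """Relative property: A is above B if its centroid row is smaller
    (image rows grow downward)."""
    return centroid_row(mask_a) < centroid_row(mask_b)

# Toy 4x4 masks: a blob at the top-left, another at the bottom-right.
a = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
b = [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
print(absolute_properties(a))  # {'area': 4, 'width': 2, 'height': 2}
print(is_above(a, b))          # True
```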
Dataset
- Player 1 writes an expression referring to the segmented object
- Player 2 clicks on where that object should be
– This verifies the expression is reasonable
Dataset
- Collected through Turkers and volunteers
– ~130,000 expressions
– ~100,000 distinct objects
– ~20,000 photographs
- Unfortunately, www.referitgame.com is no longer online
Dataset
- Parse each expression into a 7-tuple of attributes, R
– entry-level category; ‘bird’
– color; ‘blue’
– size; ‘tiny’
– absolute location; ‘top of the image’
– relative-location relation; ‘the car to the left of the tree’
– relative-location object; ‘the car to the left of the tree’
– generic; ‘wooden’, ‘round’
- The big old white cabin beside the tree
– R = {cabin, white, big, Ø, beside, tree, old}
- Parsed with the Stanford CoreNLP parser and attribute templates
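The paper extracts these attributes with the Stanford CoreNLP parser and attribute templates; the sketch below replaces that with simple keyword lookup over toy lexicons (the word lists are my own illustrative assumptions) just to show the target 7-tuple structure.

```python
# Simplified attribute extraction into the 7-tuple R.
# Toy lexicons below are illustrative assumptions, not the paper's templates.
CATEGORIES = {"cabin", "bird", "car", "tree"}
COLORS = {"white", "blue", "red"}
SIZES = {"big", "tiny", "small"}
ABS_LOC = {"top", "bottom", "left", "right"}
REL_REL = {"beside", "above", "below", "on"}
GENERIC = {"old", "wooden", "round"}

def parse_expression(expr):
    tokens = expr.lower().split()
    R = {"category": None, "color": None, "size": None, "abs_loc": None,
         "rel_rel": None, "rel_obj": None, "generic": None}
    for i, t in enumerate(tokens):
        if t in COLORS:
            R["color"] = t
        elif t in SIZES:
            R["size"] = t
        elif t in ABS_LOC:
            R["abs_loc"] = t
        elif t in GENERIC:
            R["generic"] = t
        elif t in REL_REL:
            R["rel_rel"] = t
            # the first category word after the relation is the relative object
            rest = [w for w in tokens[i + 1:] if w in CATEGORIES]
            if rest:
                R["rel_obj"] = rest[0]
        elif t in CATEGORIES and R["category"] is None:
            R["category"] = t
    return R

print(parse_expression("The big old white cabin beside the tree"))
# {'category': 'cabin', 'color': 'white', 'size': 'big', 'abs_loc': None,
#  'rel_rel': 'beside', 'rel_obj': 'tree', 'generic': 'old'}
```

This reproduces the slide's example: R = {cabin, white, big, Ø, beside, tree, old}.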
Dataset
- Psychological analysis
– ‘woman’ often replaced with ‘person’
Dataset
- Attribute use
– Roughly half of parsed descriptions are just category
Model
- Optimize R given P and S using integer linear programming (ILP)
– R is the 7-tuple of attributes
– P is visual features of the object being referred to
– S is visual features of the scene
- Different hand-engineered distributions for different attributes
- Unary priors between attribute and object
- Pairwise priors between pairs of attributes
Evaluation
- Three test sets of 500 images each
– A contains interesting objects
– B contains the most frequently occurring interesting objects
– C contains interesting objects when multiple are present
- Baseline model
– Incorporates only the priors, so no S or attributes
- Humans achieve ~72% accuracy
Critique
- How important is the scene for the attributes?
– S is only used for the relative-location {relation, object} attributes
– Absolute location is the most commonly used attribute
– Over half of parsed descriptions include only the object category
- Why don’t the authors include more information on the visual features?
– Which visual features are most important?
- Better metric than precision and recall?
– Just ask AMT workers if description is reasonable?
Critique
- Why don’t the authors analyze the training referring expressions more?
– Turk workers were paid per 10 images
– Some human expressions are just the object category
Future Work
- Scale up the dataset and train end-to-end with the best neural networks
- Identify referred object instead of generating expression
– Done in the later MAttNet paper
- Make the images and expressions more challenging