  1. Describing objects in visual scenes: Is visual salience like conversational salience?
     Micha Elsner (Department of Linguistics, The Ohio State University)
     Hannah Rohde, Alasdair Clarke (University of Edinburgh)

  2. “Describe the person in the box so that someone could find them”

  3. ◮ To the right of the men smoking a woman wearing a yellow top and red skirt.
     ◮ woman in yellow shirt, red skirt in the queue leaving the building
     ◮ the woman in a yellow short just behind the spray of the hose
     ◮ Between the yellow and white airplanes there is a red vehicle spraying people with a hose. The people getting sprayed have a small line behind them. In the line there is a woman with brownish red hair, a yellow shirt and a red skirt holding a purse. She is standing behind a man dressed in green.

  4. Relational descriptions: “The woman standing near the jetway”
     ◮ Overall target: “the woman”
     ◮ Landmark: “the jetway”, located relative to “woman”

  5. Motivation:
     ◮ Information structure via discourse salience: familiar / important / in common ground
     ◮ Image understanding via visual salience: perceptually apparent / attracts attention
     ◮ What do they have in common?
     This study:
     ◮ Complex information structure of relational descriptions
     ◮ Visual features matter...
     ◮ Visual salience is like discourse salience

  6. Overview
     ◮ Ordering strategies in the corpus
     ◮ “Where’s Wally”: the dataset
     ◮ Learning to use visual features
     ◮ Experiments: predicting the order

  7. Ordering strategies: direction
     ◮ right: “The woman standing near the jetway”
     ◮ left: “Near the hut that is burning, there is a man ...”
     ◮ inter: “Man ... next to railroad tracks wearing a white coat”
     ◮ Orders defined with respect to first mention
     ◮ Information structure, not syntax

  8. Basic ordering
     ◮ RIGHT default for landmarks (40%)
     ◮ LEFT default for image regions (57%)
       “On the left is a woman”...
     ◮ Other orders are marked:
       LEFT landmarks (33%), INTER landmarks (27%)

  9. Non-relational mentions: “Look at the plane. This man is holding a box that he is putting on the plane.”
     ◮ First mention isn’t relational: “There is”, “look at”, “find the”...
     ◮ Annotated as the ESTABLISH construction
     ◮ Usually occurs with LEFT ordering

  10. Where’s Wally: the dataset (by Martin Handford; Walker Books, London)
      ◮ Published in the US as “Where’s Waldo”
      ◮ Series of children’s books: a game based on visual search
      ◮ Referring expressions gathered through Mechanical Turk
      ◮ Each subject saw a single target in each image

  11. 28 images × 16 targets × 10 subjects per image

  12. Why Wally?
      ◮ Wide range of objects with varied visual salience
      ◮ Deliberately difficult visual search
      ◮ Relational descriptions are a must
        Not: “Wally is wearing a red striped shirt and a bobble hat”
      ◮ Previous studies used fewer objects
      ◮ Got fewer relational descriptions (Viethen+Dale ‘08)

  13. Annotation: 11 images complete so far
      The <targ>man</targ> just to the left of the <lmark rel="targ" obj="(id)">burning hut</lmark> <targ>holding a torch and a sword</targ>
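A quick way to read this markup back out of the annotated strings is a pair of regular expressions. The snippet below is only an illustration built from the tags shown on the slide (targ, lmark); it is not the project's actual annotation tooling.

```python
import re

# Annotated description from the slide; <targ> marks target spans,
# <lmark rel=... obj=...> marks landmark spans (attribute names from the slide).
annotated = ('The <targ>man</targ> just to the left of the '
             '<lmark rel="targ" obj="(id)">burning hut</lmark> '
             '<targ>holding a torch and a sword</targ>')

# Pull out the text inside each kind of span.
targets = re.findall(r'<targ>(.*?)</targ>', annotated)
landmarks = re.findall(r'<lmark[^>]*>(.*?)</lmark>', annotated)

print(targets)    # ['man', 'holding a torch and a sword']
print(landmarks)  # ['burning hut']
```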

  14. Individual variation
      For head/landmark pairs mentioned by multiple subjects:
      ◮ 65% agreement on mention direction
      ◮ 40% agreement on ESTABLISH constructions
      Strategies are predictable but vary:
      ◮ Based on other landmarks selected?
      ◮ Different cognitive strategies?

  15. Effects of visual perception

  16. Visual information:
      ◮ Root area of the object
      ◮ (Low-level) visual salience of the object
      ◮ Distance between objects
      Visual salience:
      ◮ Psychological models of low-level vision (Toet ‘11, Itti+Koch ‘00, others)
      ◮ Where will people look in an image?
      ◮ Which objects are easy to find?
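As a rough illustration of the three visual features listed above, the sketch below assumes each object comes as a binary pixel mask and the salience map as a 2-D array; the exact feature definitions used in the study may differ.

```python
import numpy as np

def object_features(mask_a, mask_b, salience_map):
    """Sketch of the three visual features the slides mention, computed from
    binary object masks and a per-pixel salience map (2-D numpy arrays).
    Illustrative only; the study's exact definitions may differ."""
    # Root area: square root of the object's pixel count.
    root_area_a = np.sqrt(mask_a.sum())

    # Low-level salience: mean salience-map value over the object's pixels.
    salience_a = salience_map[mask_a > 0].mean()

    # Distance between objects: Euclidean distance between mask centroids.
    centroid_a = np.array(np.nonzero(mask_a)).mean(axis=1)
    centroid_b = np.array(np.nonzero(mask_b)).mean(axis=1)
    distance = np.linalg.norm(centroid_a - centroid_b)

    return root_area_a, salience_a, distance
```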

  17. Salience map
      ◮ Based on responses from a filter bank
      ◮ Bottom-up part of (Torralba+al ‘06)
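The sketch below is a deliberately simplified stand-in for a filter-bank salience map: multi-scale center-surround (difference-of-Gaussians) contrast, summed and normalized. It is not a reimplementation of the Torralba et al. (2006) model the slide refers to.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def simple_salience_map(gray_image, scales=(2, 4, 8)):
    """Very simplified bottom-up salience map: difference-of-Gaussians
    contrast at several scales, summed and normalized to [0, 1]."""
    img = gray_image.astype(float)
    salience = np.zeros_like(img)
    for sigma in scales:
        center = gaussian_filter(img, sigma)
        surround = gaussian_filter(img, 2 * sigma)
        salience += np.abs(center - surround)
    salience -= salience.min()
    if salience.max() > 0:
        salience /= salience.max()
    return salience
```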

  18. Modeling: tag induction
      ◮ Information structure as a tagging problem
      ◮ Each object has a (hidden) type, analogous to a part of speech
      ◮ Order is controlled by the types
      Example: direction right, with target1 and landmark2 — “The woman standing near the jetway”
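A toy version of the latent-type idea: each object carries a distribution over a few hidden types, and the direction depends only on the type pair. All numbers below are invented for illustration.

```python
import numpy as np

# P(direction | target type, landmark type): shape (types, types, directions).
DIRECTIONS = ['right', 'left', 'inter']
p_dir_given_types = np.array(
    [[[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]],
     [[0.3, 0.5, 0.2], [0.2, 0.3, 0.5]]])

# Per-object type distributions, e.g. predicted from visual features.
p_type_target = np.array([0.8, 0.2])
p_type_landmark = np.array([0.4, 0.6])

# Marginalize out the hidden types to get P(direction) for this object pair.
p_direction = np.einsum('i,j,ijd->d', p_type_target, p_type_landmark,
                        p_dir_given_types)
print(dict(zip(DIRECTIONS, p_direction.round(3))))
```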

  19. Begin with a simple discriminative system
      ◮ Features: discretized area, salience, distance
      ◮ Thresholds set at training-set quartiles
      ◮ Number of landmarks used for each object
      Example: direction (right) predicted from distance (dst) plus each object’s area, salience, and dependents (ar, sal, deps) — “The woman standing near the jetway”
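A minimal sketch of the discretization step: thresholds at training-set quartiles, with a logistic-regression classifier standing in for the discriminative model (the slides do not say which classifier was used; all data here is synthetic).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def quartile_bins(train_values):
    """Thresholds at the training-set quartiles, as on the slide."""
    return np.quantile(train_values, [0.25, 0.5, 0.75])

def discretize(values, bins):
    """Map each continuous value to one of four quartile bins (0-3)."""
    return np.digitize(values, bins)

# Toy data: rows = (area, salience, distance); labels = direction class.
rng = np.random.default_rng(0)
X_raw = rng.random((100, 3))
y = rng.integers(0, 3, size=100)   # 0=right, 1=left, 2=inter (toy labels)

bins = [quartile_bins(X_raw[:, j]) for j in range(3)]
X_disc = np.column_stack([discretize(X_raw[:, j], bins[j]) for j in range(3)])

clf = LogisticRegression(max_iter=1000).fit(X_disc, y)
print(clf.predict(X_disc[:5]))
```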

  20. Multilayer system
      ◮ No longer reliant on hand-tuned discretization
      ◮ CRF / neural net with latent type variables
      ◮ Area, salience, deps predict the type...
      ◮ ...which predicts the direction
      Example: features (ar, sal, deps) for each object → latent types (target1, landmark2) → direction (right), with distance (dst) — “The woman standing near the jetway”
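The forward pass of such a model might look like the sketch below: continuous features are mapped through a softmax to a distribution over latent types, and direction scores are obtained by marginalizing over type pairs, with a distance term added directly. Parameter shapes and values are illustrative only, not the model from the talk.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

N_FEATS, N_TYPES, N_DIRS = 3, 2, 3
rng = np.random.default_rng(1)
W_type = rng.normal(size=(N_FEATS, N_TYPES))          # features -> type scores
W_dir = rng.normal(size=(N_TYPES, N_TYPES, N_DIRS))   # type pair -> direction
w_dist = rng.normal(size=N_DIRS)                      # distance -> direction

def predict_direction(feats_target, feats_landmark, distance):
    p_t = softmax(feats_target @ W_type)     # P(type | target features)
    p_l = softmax(feats_landmark @ W_type)   # P(type | landmark features)
    # Marginalize over latent type pairs, then add the distance term.
    scores = np.einsum('i,j,ijd->d', p_t, p_l, W_dir) + distance * w_dist
    return softmax(scores)                   # P(right / left / inter)

print(predict_direction(rng.random(3), rng.random(3), 0.5))
```

In practice the parameters would be trained by gradient descent, which is where the automatic differentiation on the next slide comes in.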

  21. System design
      ◮ Tag induction: almost grammar induction (not hierarchical yet, though)
      ◮ Based on Berkeley-style latent-variable grammars (Matsuzaki+al ‘05, Petrov+al ‘06, ‘08)
      ◮ Implemented with the Theano package: automatic computation of gradients
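For readers unfamiliar with Theano, the standard idiom for getting gradients automatically looks like this (tutorial-style usage, not the authors' code):

```python
# Theano compiles a symbolic expression and differentiates it for you,
# which is what the slide means by "automatic computation of gradients".
import theano
import theano.tensor as T

w = T.dvector('w')                       # parameter vector
x = T.dvector('x')                       # feature vector
loss = T.sum((T.dot(w, x) - 1.0) ** 2)   # toy squared-error loss
grad_w = T.grad(loss, w)                 # symbolic gradient d(loss)/dw

compute_grad = theano.function([w, x], grad_w)
print(compute_grad([0.1, 0.2], [1.0, 2.0]))
```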

  22. Visualization of types for objects

  23. Linguistic analysis of the induced types
      ◮ Red: smallest and hardest to see; right > inter > left
      ◮ Blue: small; right > inter > left; a few ESTABLISH
      ◮ Green: midsized; left > inter = right; common as ESTABLISH
      ◮ Purple: largest; inter > left = right

  24. Information is ordered by givenness/familiarity (Prince ‘81, Birner+Ward ‘98, etc.):
      ◮ Subject position: more familiar entities
      ◮ New information (outside the common ground) comes later in the sentence
        “Obama (given) has a dog named Bo (new)”
      ◮ The ESTABLISH construction introduces a hearer-new entity (Ward+Birner ‘95)
        “Hey, look! There’s a huge raccoon asleep under my car (new)!” (WB95 ex. 9)

  25. Visual salience is similar:
      ◮ Highly visible landmarks appear left/inter
        Treated as familiar entities, assumed in common ground
      ◮ Harder-to-see landmarks appear on the right
        Assumed discourse-new
      ◮ ESTABLISH construction used for mid-sized entities
        Used to place them on the left
        They might not normally be on the left (not in common ground)
        But they are visually salient enough to motivate the leftward order

  26-28. Predicting the order
      ◮ Input: unordered abstract structure

      System             Acc (direction)   F (ESTABLISH)
      All RIGHT          36                0
      Regs LEFT          43                0
      Basic discr        50                43
      Multilevel         52                50
      Majority oracle    75                65

  29. Predictions II

      System             Left (F1)   Inter (F1)   Right (F1)
      All RIGHT          0           0            53
      Regs LEFT          40          0            55
      Basic discr        57          34           53
      Multilevel         60          29           56
      Majority oracle    65          60           70
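For reference, metrics of this kind can be computed as in the sketch below (illustrative only, not the authors' evaluation script); F (ESTABLISH) would be the same F1 computation over a binary ESTABLISH / non-ESTABLISH label.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Toy gold and predicted direction labels.
gold = np.array(['right', 'left', 'inter', 'right', 'left', 'right'])
pred = np.array(['right', 'right', 'inter', 'right', 'left', 'left'])

print('Acc (direction):', accuracy_score(gold, pred))
# Per-class F1, e.g. the Left / Inter / Right columns of the second table.
for label in ['left', 'inter', 'right']:
    print(f'F1 ({label}):',
          f1_score(gold, pred, labels=[label], average='macro'))
```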

  30. Conclusions:
      ◮ Complex information structure of relational descriptions
      ◮ Predictable from visual information...
      ◮ More visible objects act like familiar entities
      Future work:
      ◮ Surface realization of these structures
      ◮ More sophisticated visual models
