Teaching computers about visual categories - Kristen Grauman - PowerPoint PPT Presentation


SLIDE 1

Teaching computers about visual categories

Kristen Grauman Department of Computer Science University of Texas at Austin

SLIDE 2

Visual category recognition

Goal: recognize and detect categories of visually and semantically related… objects, scenes, activities.

SLIDE 3

The need for visual recognition

Scientific data analysis, augmented reality, robotics, indexing by content, surveillance

SLIDE 4

Difficulty of category recognition

Illumination, object pose, clutter, viewpoint, intra-class appearance, occlusions

~30,000 possible categories to distinguish! [Biederman 1987]

SLIDE 5

Progress charted by datasets

[Timeline: 1963, Roberts; 1996, COIL]

SLIDE 6

Progress charted by datasets

[Timeline adds ca. 2000: MIT-CMU Faces, UIUC Cars, INRIA Pedestrians]

SLIDE 7

Progress charted by datasets

[Timeline adds ca. 2005: Caltech-101, Caltech-256, MSRC 21 Objects]

SLIDE 8

Progress charted by datasets

[Timeline adds 2007: PASCAL VOC detection challenge]

SLIDE 9

Progress charted by datasets

[Timeline adds 2008-2013: Faces in the Wild, 80M Tiny Images, Birds-200, PASCAL VOC, ImageNet]

SLIDE 10

Learning-based methods

Last ~10 years: impressive strides by learning appearance models (usually discriminative).

[Pipeline: annotator labels training images as car / non-car; the learned model classifies windows of a novel image]

[Papageorgiou & Poggio 1998, Schneiderman & Kanade 2000, Viola & Jones 2001, Dalal & Triggs 2005, Grauman & Darrell 2005, Lazebnik et al. 2006, Felzenszwalb et al. 2008, …]
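To make the pipeline concrete, here is a minimal sketch of the train-then-detect loop described above, assuming precomputed window features (e.g., HOG-style descriptors); the data, feature dimensions, and use of scikit-learn's LinearSVC are illustrative assumptions, not the cited systems.

  import numpy as np
  from sklearn.svm import LinearSVC

  rng = np.random.default_rng(0)
  # stand-in feature vectors for annotated training windows
  X_car = rng.normal(loc=1.0, size=(100, 64))   # "car" windows
  X_bg = rng.normal(loc=-1.0, size=(100, 64))   # "non-car" windows
  X = np.vstack([X_car, X_bg])
  y = np.array([1] * 100 + [0] * 100)

  model = LinearSVC().fit(X, y)                 # discriminative appearance model

  novel = rng.normal(loc=1.0, size=(1, 64))     # window from a novel image
  print("car" if model.predict(novel)[0] == 1 else "non-car")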

SLIDE 11

Exuberance for image data (and their category labels)

  • ImageNet: 14M images, 1K+ labeled object categories [Deng et al. 2009-2012]
  • 80M Tiny Images: 80M images, 53K noisily labeled object categories [Torralba et al. 2008]
  • SUN Database: 131K images, 902 labeled scene categories, 4K labeled object categories [Xiao et al. 2010]

SLIDE 12

Problem

[Plots, 1998-2013: difficulty and scale of data (log scale) vs. complexity of supervision]

While the complexity and scale of the recognition task have escalated dramatically, our means of "teaching" visual categories remains shallow.

SLIDE 13

Envisioning a broader channel

Human annotator

“This image has a cow in it.”

More labeled images ↔ more accurate models?

SLIDE 14

Envisioning a broader channel

Need richer means to teach system about visual world

Human annotator

SLIDE 15

Envisioning a broader channel

[Diagram. Today: human ↔ system connected by vision + learning. Next 10 years: human ↔ system connected by vision, learning, human computation, language, robotics, multi-agent systems, and knowledge representation.]

SLIDE 16

Our goal

Teaching computers about visual categories must be an ongoing, interactive process, with communication that goes beyond labels. This talk:

  • 1. Active visual learning
  • 2. Learning from visual comparisons

SLIDE 17

Active learning for visual recognition

[Loop diagram: the current classifiers issue an active request ("?") for an example from the unlabeled data; the annotator's answer joins the labeled data]

[Mackay 1992, Cohn et al. 1996, Freund et al. 1997, Lindenbaum et al. 1999, Tong & Koller 2000, Schohn and Cohn 2000, Campbell et al. 2000, Roy & McCallum 2001, Kapoor et al. 2007,…]
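A minimal sketch of this loop, with a synthetic oracle standing in for the annotator and uncertainty sampling as the selection criterion (all names and data here are illustrative):

  import numpy as np
  from sklearn.linear_model import LogisticRegression

  rng = np.random.default_rng(0)
  X_pool = rng.normal(size=(500, 10))           # unlabeled data
  true_w = rng.normal(size=10)
  oracle = lambda x: int(x @ true_w > 0)        # stand-in for the human annotator

  labels, i = {}, 0
  while len(set(labels.values())) < 2:          # seed with one example per class
      labels[i] = oracle(X_pool[i])
      i += 1

  for _ in range(20):                           # active learning iterations
      idx = sorted(labels)
      clf = LogisticRegression().fit(X_pool[idx], [labels[j] for j in idx])
      p = clf.predict_proba(X_pool)[:, 1]
      order = np.argsort(np.abs(p - 0.5))       # nearest to 0.5 = most uncertain
      query = next(j for j in order if j not in labels)
      labels[query] = oracle(X_pool[query])     # the active request ("?") answered

  print("labels collected:", len(labels))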

SLIDE 18

Active learning for visual recognition

[Loop diagram: annotator, unlabeled data, labeled data, current classifiers]

[Plot: accuracy vs. number of labels added, active vs. passive]

Intent: better models, faster/cheaper.

SLIDE 19

Problem: Active selection and recognition

[Figure: annotation types arranged from less expensive to more expensive to obtain]

  • Multiple levels of annotation are possible
  • Variable cost depending on level and example

SLIDE 20
Our idea: Cost-sensitive multi-question active learning

  • Compute a decision-theoretic active selection criterion that weighs both:
    – which example to annotate, and
    – what kind of annotation to request for it, as compared to
    – the predicted effort the request would require

[Vijayanarasimhan & Grauman, NIPS 2008, CVPR 2009]

SLIDE 21

Decision-theoretic multi-question criterion

Value of asking a given question about a given data object = (current misclassification risk) - (estimated risk if the candidate request were answered) - (cost of getting the answer)

Three "levels" of requests to choose from:

  • 1. Label a region
  • 2. Tag an object in the image
  • 3. Segment the image, name all objects
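A minimal sketch of how this criterion ranks candidate (example, request) pairs; the risk and cost estimators below are illustrative placeholders, not the paper's learned models:

  import itertools

  REQUESTS = ["label_region", "tag_object", "segment_all"]   # the three levels

  def predicted_effort(ex, req):
      # stand-in cost model; the talk predicts this from visual content
      base = {"label_region": 1.0, "tag_object": 3.0, "segment_all": 10.0}[req]
      return base * ex["difficulty"]

  def risk_if_answered(risk_now, ex, req):
      # stand-in: richer annotations on uncertain examples reduce risk more
      gain = {"label_region": 0.02, "tag_object": 0.05, "segment_all": 0.15}[req]
      return risk_now - gain * ex["uncertainty"]

  def value_of_info(risk_now, ex, req, cost_weight=0.01):
      return (risk_now - risk_if_answered(risk_now, ex, req)
              - cost_weight * predicted_effort(ex, req))

  pool = [{"id": 7, "difficulty": 0.5, "uncertainty": 0.9},
          {"id": 32, "difficulty": 2.0, "uncertainty": 0.4}]
  ex, req = max(itertools.product(pool, REQUESTS),
                key=lambda pair: value_of_info(0.30, *pair))
  print(f"ask '{req}' for image #{ex['id']}")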

SLIDE 22

Predicting effort

  • What manual effort cost would we expect to pay for an unlabeled image?

Which image would you rather annotate?

SLIDE 24

Predicting effort

We estimate labeling difficulty from visual content.


Other forms of effort cost: expertise required, resolution of data, how far the robot must move, length of video clip,…
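A minimal sketch of one way to realize this: regress annotation time from image features, assuming timed annotations have been collected for a training set (features and timings below are synthetic):

  import numpy as np
  from sklearn.linear_model import Ridge

  rng = np.random.default_rng(0)
  feats = rng.normal(size=(200, 32))    # e.g., edge density, segment count, color entropy
  secs = 20 + 5 * feats[:, 0] - 3 * feats[:, 1] + rng.normal(0, 2, size=200)

  cost_model = Ridge().fit(feats, secs)           # learned effort predictor
  new_img = rng.normal(size=(1, 32))              # features of an unlabeled image
  print("predicted annotation time (s):", round(float(cost_model.predict(new_img)[0]), 1))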

SLIDE 26

Multi-question active learning

Annotator Unlabeled data Labeled data Current classifiers

“Completely segment image #32.” “Does image #7 contain a cow?”

[Vijayanarasimhan & Grauman, NIPS 2008, CVPR 2009]

SLIDE 28

Multi-question active learning curves

[Plot: accuracy vs. annotation effort]

SLIDE 29

Multi-question active learning with objects and attributes

[Loop diagram: annotator, unlabeled data, labeled data, current model]

"What is this object?" "Does this object have spots?" [Kovashka et al., ICCV 2011]

Weigh the relative impact of an object label or an attribute label at each iteration.

SLIDE 30

[Vijayanarasimhan et al., CVPR 2010]

Annotator Unlabeled data Labeled data Current model

Budgeted batch active learning

Select batch of examples that together improves classifier objective and meets annotation budget.

[Figure: batch of unlabeled examples, each with a predicted cost ($)]
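A minimal sketch of one budgeted strategy: greedily take examples by utility-per-cost until the budget is spent; this greedy rule is a simple stand-in for the paper's joint batch objective, and the utilities and costs are synthetic:

  import numpy as np

  rng = np.random.default_rng(0)
  utility = rng.uniform(size=100)                 # e.g., estimated uncertainty reduction
  cost = rng.uniform(0.5, 3.0, size=100)          # predicted annotation effort ($)
  budget = 10.0

  batch, spent = [], 0.0
  for i in np.argsort(-(utility / cost)):         # best value-for-money first
      if spent + cost[i] <= budget:
          batch.append(int(i))
          spent += float(cost[i])
  print("batch:", batch, "total cost:", round(spent, 2))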

SLIDE 31

Problem: “Sandbox” active learning

Thus far, tested only in artificial settings:

[Plot: accuracy vs. actual time, active vs. passive, on ~10^3 prepared images]

  • Unlabeled data already fixed, small scale, biased
  • Computational cost ignored

SLIDE 32

Our idea: Live active learning

Large-scale active learning of object detectors with crawled data and crowdsourced labels. How to scale active learning to massive unlabeled pools of data?

SLIDE 33

Pool-based active learning

e.g., select the point nearest to the hyperplane decision boundary for labeling.

[Diagram: hyperplane w separating labeled points, with "?" marking the nearest unlabeled point]

[Tong & Koller, 2000; Schohn & Cohn, 2000; Campbell et al. 2000]
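A minimal sketch of this "simple margin" rule (synthetic data; the bias term is omitted for brevity):

  import numpy as np

  rng = np.random.default_rng(0)
  w = rng.normal(size=10)                         # current hyperplane
  X_unlabeled = rng.normal(size=(1000, 10))

  dist = np.abs(X_unlabeled @ w) / np.linalg.norm(w)   # distance to the boundary
  print("query index:", int(np.argmin(dist)))          # most uncertain point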

SLIDE 34

Sub-linear time active selection

[Jain, Vijayanarasimhan, Grauman, NIPS 2010]

[Diagram: the current classifier is hashed into a hash table over the unlabeled data (buckets 110, 111, 101), returning the actively selected examples]

We propose a novel hashing approach to identify the most uncertain examples in sub-linear time.

SLIDE 35

Hashing a hyperplane query

[Figure: the hyperplane w(t) and its nearest unlabeled points x1(t), x2(t) at iteration t, then w(t+1) with x1(t+1), x2(t+1), x3(t+1) at iteration t+1; the hash maps the hyperplane and its near points to the same bucket, h(w) → {x1, …, xk}]

At each iteration of the learning loop, our hash functions map the current hyperplane directly to its nearest unlabeled points.

SLIDE 36

Hashing a hyperplane query

[Figure and caption repeated from Slide 35]

Guarantee high probability of collision for points near the decision boundary: the closer x is to perpendicular to w (i.e., the smaller |wᵀx|), the likelier x lands in the hyperplane query's bucket.
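A minimal sketch of the two-bit hyperplane hash behind this guarantee, following Jain et al. (NIPS 2010): a database point x is hashed on the pair (x, x) and a hyperplane query w on (w, -w), so collisions are most likely when x is nearly perpendicular to w, i.e., near the boundary. The parameters here (two hash units, one table) are illustrative; practical systems tune the number of bits and use multiple tables.

  import numpy as np

  rng = np.random.default_rng(0)
  d, n_units = 10, 2
  U = rng.normal(size=(n_units, d))               # random vectors u_k
  V = rng.normal(size=(n_units, d))               # random vectors v_k

  def hash_point(x):                              # database side: hash (x, x)
      return tuple(np.concatenate([U @ x > 0, V @ x > 0]))

  def hash_query(w):                              # query side: hash (w, -w)
      return tuple(np.concatenate([U @ w > 0, V @ -w > 0]))

  X = rng.normal(size=(10000, d))                 # unlabeled pool
  w = rng.normal(size=d)                          # current hyperplane

  table = {}
  for i, x in enumerate(X):                       # index the pool once
      table.setdefault(hash_point(x), []).append(i)

  bucket = table.get(hash_query(w), [])           # sub-linear candidate lookup
  print(len(bucket), "candidates near the boundary")
  if bucket:
      print("smallest |w.x| in bucket:", round(float(np.abs(X[bucket] @ w).min()), 3))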

SLIDE 37

H-Hash result on 1M Tiny Images

Sub-linear time active selection: accounting for all costs

[Charts: (1) time spent searching for selection, H-Hash Active vs. Exhaustive Active; (2) accuracy improvements as more data is labeled, H-Hash Active vs. Exhaustive Active vs. Passive; (3) improvement in AUROC vs. selection + labeling time (hrs)]

By minimizing both selection and labeling time, we obtain the best accuracy per unit time.

SLIDE 38

PASCAL Visual Object Categorization

  • Closely studied object detection benchmark
  • Original image data from Flickr

http://pascallin.ecs.soton.ac.uk/challenges/VOC/

SLIDE 39

[Figure: unlabeled images from the web yield jumping-window candidates; unlabeled windows are indexed in a hash table of image windows via h(Oi); the current "bicycle" hyperplane is hashed via h(w) to retrieve actively selected examples; annotated data, with mean-shift consensus, updates the detector]

Live active learning

[Vijayanarasimhan & Grauman CVPR 2011]

SLIDE 40

Live active learning

[Figure repeated from Slide 39]

[Vijayanarasimhan & Grauman CVPR 2011]

For 4.5 million unlabeled instances: 10 minutes of machine time per iteration, vs. 60 hours for a linear scan.
SLIDE 41

Live active learning results

PASCAL VOC objects - Flickr test set

Outperforms status quo data collection approach

SLIDE 42

Live active learning results

What does the live learning system ask first?

[First selections made when learning "boat": live active learning (ours) vs. keyword+image baseline]

SLIDE 43

Ongoing challenges in active visual learning

  • Exploration vs. exploitation
  • Utility tied to specific classifier or model
  • Joint batch selection ("non-myopic") is expensive, remains challenging
  • Crowdsourcing: reliability, expertise, economics
  • Active annotations for objects/activity in video

SLIDE 44

Our goal

Teaching computers about visual categories must be an ongoing, interactive process, with communication that goes beyond labels. This talk:

  • 1. Active visual learning
  • 2. Learning from visual comparisons

SLIDE 45

Visual attributes

  • High-level semantic properties shared by objects
  • Human-understandable and machine-detectable

Examples: brown, indoors, outdoors, flat, four-legged, high heel, red, has-ornaments, metallic

[Oliva et al. 2001, Ferrari & Zisserman 2007, Kumar et al. 2008, Farhadi et al. 2009, Lampert et al. 2009, Endres et al. 2010, Wang & Mori 2010, Berg et al. 2010, Branson et al. 2010, Parikh & Grauman 2011, …]

SLIDE 46

[Images: horse, horse, horse, donkey, donkey, mule]

SLIDE 47

Attributes

A mule… is furry, has four legs, has a tail.

SLIDE 48

Binary attributes

A mule… is furry, has four legs, has a tail.

[Ferrari & Zisserman 2007, Kumar et al. 2008, Farhadi et al. 2009, Lampert et al. 2009, Endres et al. 2010, Wang & Mori 2010, Berg et al. 2010, Branson et al. 2010, …]

SLIDE 49

Relative attributes

A mule… is furry, has four legs, has a tail; its tail is longer than donkeys' and its legs are shorter than horses'.

SLIDE 50

Relative attributes

Idea: represent visual comparisons between classes, images, and their properties.

[Diagram: concepts linked to properties such as "bright" and comparisons such as "brighter than"]

[Parikh & Grauman, ICCV 2011]

SLIDE 51

How to teach relative visual concepts?

[Figure: four face images, each rated on a 1-4 scale]

How much is the person smiling?

SLIDE 54

How to teach relative visual concepts?

[Figure: a Less ↔ More scale, with a query image placed at "?"]

SLIDE 55

Learning relative attributes

For each attribute m, use ordered image pairs to train a ranking function mapping image features x to an attribute strength: r_m(x) = w_mᵀ x.

[Parikh & Grauman, ICCV 2011; Joachims 2002]
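A minimal sketch of this training step via the pairwise reduction of Joachims (2002): each ordered pair "image i shows more of the attribute than image j" becomes a difference vector x_i - x_j, and a linear classifier on these differences yields the ranking weights (features are synthetic; a latent score stands in for human ordering judgments):

  import numpy as np
  from sklearn.svm import LinearSVC

  rng = np.random.default_rng(0)
  X = rng.normal(size=(100, 20))                  # image features
  latent = X @ rng.normal(size=20)                # hidden attribute strength

  pairs = [(i, j) for i, j in rng.integers(0, 100, size=(200, 2)) if latent[i] > latent[j]]
  diffs = np.array([X[i] - X[j] for i, j in pairs])
  X_train = np.vstack([diffs, -diffs])            # both orientations, two classes
  y_train = np.array([1] * len(diffs) + [-1] * len(diffs))

  w = LinearSVC(fit_intercept=False).fit(X_train, y_train).coef_.ravel()
  print("pairs ranked correctly:", float(np.mean([(X[i] - X[j]) @ w > 0 for i, j in pairs])))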

SLIDE 56

Relating images

Rather than simply label images with their properties ("not bright", "smiling", "not natural")…

SLIDE 57

Relating images

Now we can compare images by an attribute's "strength": bright, smiling, natural.

SLIDE 58

Learning with visual comparisons

Enable new modes of human-system communication:

  • Training category models through descriptions
  • Rationales to explain image labels
  • Semantic relative feedback for image search
  • Analogical constraints on feature learning

SLIDE 59

Relative zero-shot learning

Training: images from S seen categories, and descriptions of U unseen categories (need not use all attributes, nor all seen categories). Testing: categorize an image into one of the S + U classes.

[Example descriptions ordering classes by attribute: Age: Scarlett, Clive, Hugh, Jared, Miley; Smiling: Jared, Miley]

SLIDE 60

Predict new classes based on their relationships to existing classes, even without training images.

[Plot: classes placed along Smiling and Age attribute axes (Scarlett, Jared, Hugh, Miley, Clive), using the orderings Age: Scarlett, Clive, Hugh, Jared, Miley; Smiling: Jared, Miley]
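A minimal sketch of this zero-shot step, under the simplifying assumption (in the spirit of the paper) that each class is summarized by its mean in relative-attribute score space, and an unseen class described as lying between two seen classes is placed midway between their means; all scores here are made up:

  import numpy as np

  # mean (age, smiling) ranker scores of seen classes -- synthetic values
  seen = {"Scarlett": np.array([0.2, 0.6]), "Clive": np.array([0.8, 0.3])}
  # unseen class described only relative to seen ones: between the two
  unseen = {"Hugh": (seen["Scarlett"] + seen["Clive"]) / 2}

  classes = {**seen, **unseen}
  test = np.array([0.5, 0.45])                    # ranker outputs for a novel image
  pred = min(classes, key=lambda c: np.linalg.norm(test - classes[c]))
  print("predicted class:", pred)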

Relative zero-shot learning

SLIDE 61

Relative zero-shot learning

[Bar chart: accuracy on Outdoor Scenes and Public Figures, binary attributes vs. relative attributes (ranker)]

Comparative descriptions are more discriminative than categorical descriptions.

SLIDE 62

Learning with visual comparisons

Enable new modes of human-system communication:

  • Training category models through descriptions
  • Rationales to explain image labels
  • Semantic relative feedback for image search
  • Analogical constraints on feature learning

SLIDE 63

Soliciting visual rationales

Main idea:

  • Ask the annotator not just what, but also why.

[Example images: "Is the team winning?" "Is her form good?" "Is it a safe route?", each paired with "How can you tell?"]

[Donahue and Grauman, ICCV 2011; Zaidan et al., NAACL HLT 2007]

SLIDE 64

Soliciting visual rationales

["Hot or Not?" example, with spatial rationales and attribute rationales answering "How can you tell?"]
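A minimal sketch in the spirit of Zaidan et al. (2007), simplified: for each positive example, a "contrast" copy with the rationale features masked out is added as a negative, nudging the classifier toward the evidence the annotator pointed at (the actual method uses margin constraints; the data here is synthetic):

  import numpy as np
  from sklearn.svm import LinearSVC

  rng = np.random.default_rng(0)
  X = np.abs(rng.normal(size=(100, 30)))
  y = (X[:, :5].sum(axis=1) > 5).astype(int)      # label truly depends on features 0-4

  mask = np.zeros(30)
  mask[:5] = 1                                    # annotator: "these features matter"
  contrast = X[y == 1] * (1 - mask)               # positives minus their rationale
  X_aug = np.vstack([X, contrast])
  y_aug = np.concatenate([y, np.zeros(len(contrast), dtype=int)])

  clf = LinearSVC().fit(X_aug, y_aug)
  frac = np.abs(clf.coef_[0][:5]).sum() / np.abs(clf.coef_[0]).sum()
  print("weight mass on rationale features:", round(float(frac), 2))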

SLIDE 65

Soliciting visual rationales

[Bar charts: accuracy on Hot or Not, Attractiveness, and Scene categories, comparing "Original labels only" vs. "+ Rationales". Donahue & Grauman, ICCV 2011]

SLIDE 66

Learning with visual comparisons

Enable new modes of human-system communication:

  • Training category models through descriptions
  • Rationales to explain image labels
  • Semantic relative feedback for image search
  • Analogical constraints on feature learning

SLIDE 67

Interactive visual search

Traditional binary relevance feedback offers only coarse communication between the user and the system.

[Example: query "white high heels", with results marked relevant / irrelevant] [Rui et al. 1998, Zhou et al. 2003, …]

SLIDE 68

WhittleSearch: Relative attribute feedback

Whittle away irrelevant images via precise semantic feedback.

Query: "white high-heeled shoes"

[Example: initial top search results refined after feedback "shinier than these" and "less formal than these"]

[Kovashka, Parikh, and Grauman, CVPR 2012]
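A minimal sketch of the feedback rule: each comparison keeps only the database images whose predicted attribute score beats (or trails) the reference image's score, and the constraints are intersected (scores and indices here are synthetic):

  import numpy as np

  rng = np.random.default_rng(0)
  shiny = rng.uniform(size=1000)                  # precomputed attribute ranker scores
  formal = rng.uniform(size=1000)

  ref_a, ref_b = 10, 42                           # images the user compared against
  keep = (shiny > shiny[ref_a]) & (formal < formal[ref_b])
  survivors = np.flatnonzero(keep)                # whittled candidate set
  print(len(survivors), "images satisfy both comparisons; first 5:", survivors[:5].tolist())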

SLIDE 69

Beyond pairwise comparisons …

Visual analogies

[Hwang, Grauman, & Sha, ICML 2013]

[Diagram: concepts linked by shared properties, extended across multiple concept pairs]

SLIDE 70

Learning with visual analogies

Regularize object models with analogies.

[Hwang, Grauman, & Sha, ICML 2013]

Example: planet : sun = electron : ? (answer: nucleus)

SLIDE 71

Learning with visual analogies

Regularize object models with analogies.

[Hwang, Grauman, & Sha, ICML 2013]

[Diagram: an analogy p : q = r : s among images in the input space is mapped into a semantic embedding where the offset from p to q matches the offset from r to s]
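A minimal sketch of analogy completion in such an embedding: if p : q = r : s, then f(q) - f(p) should approximately equal f(s) - f(r), so the missing s is the item nearest f(r) + (f(q) - f(p)) (random vectors stand in for a learned embedding):

  import numpy as np

  rng = np.random.default_rng(0)
  emb = rng.normal(size=(50, 16))                 # learned embeddings of 50 classes
  p, q, r = 3, 7, 21                              # the given triplet of the analogy

  target = emb[r] + (emb[q] - emb[p])             # where s should land
  dist = np.linalg.norm(emb - target, axis=1)
  dist[[p, q, r]] = np.inf                        # s must be a fourth item
  print("predicted s:", int(np.argmin(dist)))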

SLIDE 72

Visual analogies

[Hwang, Grauman, & Sha, ICML 2013]

GRE-like visual analogy tests: given images A : B = C : ?, complete the analogy.

[Bar chart: analogy completion accuracy for Chance, Semantic embedding [Weinberger, 2009], and Analogy-preserving embedding (Ours)]

SLIDE 73

Teaching visual recognition systems

[Diagram. Today: human ↔ system connected by vision + learning. Next 10 years: human ↔ system connected by vision, learning, human computation, language, robotics, multi-agent systems, and knowledge representation.]

SLIDE 74

Important next directions

  • Recognition in action: embodied, egocentric
  • Activity understanding: objects & actions
  • Scale: many classes, fine-grained recognition

SLIDE 75

Acknowledgements

NSF, ONR, DARPA, Luce Foundation, Google, Microsoft

Sudheendra Vijayanarasimhan, Yong Jae Lee, Jaechul Kim, Sung Ju Hwang, Adriana Kovashka, Chao-Yeh Chen, Suyog Jain, Aron Yu, Dinesh Jayaraman, Jeff Donahue

Devi Parikh (Virginia Tech), Fei Sha (USC), Prateek Jain (MSR), Trevor Darrell (UC Berkeley)

J. K. Aggarwal, Ray Mooney, Peter Stone, Bruce Porter, and all my UT colleagues

SLIDE 76

Summary

  • Humans are not simply "label machines"
  • More data need not mean better learning
  • Widen access to visual knowledge through:
    – Large-scale interactive/active learning systems
    – Representing relative visual comparisons
  • Visual recognition offers new AI challenges, and progress demands that more AI ideas convene

SLIDE 78

References

  • WhittleSearch: Image Search with Relative Attribute Feedback. A. Kovashka, D. Parikh, and K. Grauman. CVPR 2012.
  • Hashing Hyperplane Queries to Near Points with Applications to Large-Scale Active Learning. P. Jain, S. Vijayanarasimhan, and K. Grauman. NIPS 2010.
  • Annotator Rationales for Visual Recognition. J. Donahue and K. Grauman. ICCV 2011.
  • Actively Selecting Annotations Among Objects and Attributes. A. Kovashka, S. Vijayanarasimhan, and K. Grauman. ICCV 2011.
  • Large-Scale Live Active Learning: Training Object Detectors with Crawled Data and Crowds. S. Vijayanarasimhan and K. Grauman. CVPR 2011.
  • Cost-Sensitive Active Visual Category Learning. S. Vijayanarasimhan and K. Grauman. International Journal of Computer Vision (IJCV), Vol. 91, Issue 1 (2011), p. 24.
  • What's It Going to Cost You?: Predicting Effort vs. Informativeness for Multi-Label Image Annotations. S. Vijayanarasimhan and K. Grauman. CVPR 2009.
  • Multi-Level Active Prediction of Useful Image Annotations for Recognition. S. Vijayanarasimhan and K. Grauman. NIPS 2008.
  • Far-Sighted Active Learning on a Budget for Image and Video Recognition. S. Vijayanarasimhan, P. Jain, and K. Grauman. CVPR 2010.
  • Relative Attributes. D. Parikh and K. Grauman. ICCV 2011.
  • Analogy-Preserving Semantic Embedding for Visual Object Categorization. S. J. Hwang, K. Grauman, and F. Sha. ICML 2013.