Where have we been? Where are we going? Li Fei-Fei - PowerPoint PPT Presentation



slide-1
SLIDE 1

Where have we been? Where are we going?

Li Fei-Fei

slide-2
SLIDE 2

The Beginning: CVPR 2009

  • J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li and L. Fei-Fei, ImageNet: A Large-Scale Hierarchical Image Database. IEEE Computer Vision and Pattern Recognition (CVPR), 2009.

slide-3
SLIDE 3

The Impact of ImageNet

slide-4
SLIDE 4

4,386

Citations

2,847

Citations

on Google Scholar

…and many more.

slide-5
SLIDE 5

From Challenge Contestants to Startups

slide-6
SLIDE 6

A Revolution in Deep Learning

Why Deep Learning is Suddenly Changing Your Life

By Roger Parloff

The Great Artificial Intelligence Awakening

By Gideon Lewis-Kraus

The data that transformed AI research - and possibly the world

By Dave Gershgorn

slide-7
SLIDE 7

“The ImageNet of x”

SpaceNet

DigitalGlobe, CosmiQ Works, NVIDIA

ShapeNet

A. Chang et al, 2015

MusicNet

  • J. Thickstun et al, 2017

EventNet

  • G. Ye et al, 2015

Medical ImageNet

Stanford Radiology, 2017

ActivityNet

  • F. Heilbron et al, 2015
slide-8
SLIDE 8

An Explosion of Datasets

1627

Hosted Datasets

276

Commercial Competitions

1MM

Data Scientists

4MM

ML Models Submitted

1919

Student Competitions

slide-9
SLIDE 9

“Datasets - not algorithms - might be the key limiting factor to development of human-level artificial intelligence.”

ALEXANDER WISSNER-GROSS, Edge.org, 2016

slide-10
SLIDE 10

The Untold History of ImageNet

slide-11
SLIDE 11

Hardly the First Image Dataset

Lotus Hill (2007)

Yao et al, 2007

ESP (2006)

Ahn et al, 2006

LabelMe (2005)

Russell et al, 2005

MSRC (2006)

Shotton et al. 2006

CalTech 101/256 (2005)

Fei-Fei et al, 2004; Griffin et al, 2007

TinyImage (2008)

Torralba et al. 2008

PASCAL (2007)

Everingham et al, 2009

CAVIAR Tracking (2005)

  • R. Fisher, J. Santos-Victor, J. Crowley

Middlebury Stereo (2002)

  • D. Scharstein, R. Szeliski

UIUC Cars (2004)

  • S. Agarwal, A. Awan, D. Roth

FERET Faces (1998)

  • P. Phillips, H. Wechsler, J. Huang, P. Rauss

CMU/VASC Faces (1998)

  • H. Rowley, S. Baluja, T. Kanade

MNIST digits (1998-10)

  • Y. LeCun & C. Cortes

COIL Objects (1996)

  • S. Nene, S. Nayar, H. Murase

3D Textures (2005)

  • S. Lazebnik, C. Schmid, J. Ponce

CUReT Textures (1999)

  • K. Dana, B. Van Ginneken, S. Nayar, J. Koenderink

KTH human action (2004)

  • I. Laptev & B. Caputo

Sign Language (2008)

  • P. Buehler, M. Everingham, A. Zisserman

Segmentation (2001)

  • D. Martin, C. Fowlkes, D. Tal, J. Malik.
slide-12
SLIDE 12

A Profound Machine Learning Problem Within Visual Learning

slide-13
SLIDE 13

Machine Learning 101: Complexity, Generalization, Overfitting

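[Figure: error plotted against model capacity. The training error curve and the generalization error curve are separated by the generalization gap. The underfitting zone lies below the optimal capacity and the overfitting zone above it.]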
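The capacity trade-off on this slide can be reproduced on a toy problem. The sketch below is only an illustration under stated assumptions: it uses a 1-D k-nearest-neighbour classifier (not anything from the talk), synthetic data with 20% label noise, and treats small k as high capacity. The function names and all parameter values are made up for the example.

```python
import random

def make_data(n, rng, noise=0.2):
    """Draw (x, y) pairs: y = 1 when x > 0.5, with a fraction of labels flipped."""
    data = []
    for _ in range(n):
        x = rng.random()
        y = int(x > 0.5)
        if rng.random() < noise:
            y = 1 - y  # label noise that a high-capacity model can memorize
        data.append((x, y))
    return data

def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x (ties go to 0)."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(y for _, y in nearest)
    return int(2 * votes > k)

def error_rate(train, points, k):
    """Fraction of points whose k-NN prediction disagrees with their label."""
    wrong = sum(knn_predict(train, x, k) != y for x, y in points)
    return wrong / len(points)

def capacity_sweep(n_train=60, n_test=500, ks=(1, 5, 15, 60), seed=0):
    """Return (k, training error, test error, gap) rows; small k = high capacity."""
    rng = random.Random(seed)
    train = make_data(n_train, rng)
    test = make_data(n_test, rng)
    rows = []
    for k in ks:
        train_err = error_rate(train, train, k)
        test_err = error_rate(train, test, k)
        rows.append((k, train_err, test_err, test_err - train_err))
    return rows

if __name__ == "__main__":
    for k, tr, te, gap in capacity_sweep():
        print(f"k={k:3d}  train={tr:.3f}  test={te:.3f}  gap={gap:.3f}")
```

With k = 1 every training point is its own nearest neighbour, so the training error is zero while the held-out error is not. That difference is the generalization gap the figure refers to.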

slide-14
SLIDE 14

Fei-Fei et al, 2003, 2004

One-Shot Learning

slide-15
SLIDE 15

Fei-Fei et al, 2003, 2004

slide-16
SLIDE 16

How Children Learn to See

slide-17
SLIDE 17

[Figure: error plotted against model capacity, showing training error, generalization error, the generalization gap, the optimal capacity, and the underfitting and overfitting zones.]

slide-18
SLIDE 18

A new way of thinking…

To shift the focus of Machine Learning for visual recognition from modeling… to data. Lots of data.

slide-19
SLIDE 19

[Figure: Internet Data Growth 1990-2010. Global data traffic in PB/month, with axis ticks at 3,750, 7,500, 11,250 and 15,000. Source: Cisco]

slide-20
SLIDE 20

What is WordNet?

Original paper by [George Miller et al, 1990] cited over 5,000 times.

Organizes over 150,000 words into 117,000 categories called synsets.

Establishes ontological and lexical relationships in NLP and related tasks.

slide-21
SLIDE 21

Christiane Fellbaum

Senior Research Scholar, Computer Science Department, Princeton

President, Global WordNet Consortium

slide-22
SLIDE 22

German shepherd: breed of large shepherd dogs used in police work and as a guide for the blind.

microwave: kitchen appliance that cooks food by passing an electromagnetic wave through it.

mountain: a land mass that projects well above its surroundings; higher than a hill.

jacket: a short coat

A massive ontology of images to transform computer vision

Individually Illustrated WordNet Nodes

slide-23
SLIDE 23

Comrades

  • Prof. Kai Li, Princeton
  • Jia Deng, 1st Ph.D. student, Princeton

slide-24
SLIDE 24

Entity → Mammal → Dog → German Shepherd

Step 1: Ontological structure based on WordNet
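As a rough illustration of what an "is a" ontology looks like as a data structure, the sketch below stores a parent link per synset and walks it up to the root. The `PARENT` table is a hypothetical hand-written fragment built from the four nodes on this slide plus one extra sibling; it is not the WordNet API or its real contents.

```python
# Hypothetical hand-written fragment of an is-a hierarchy; real WordNet is far larger.
PARENT = {
    "german shepherd": "dog",
    "husky": "dog",          # extra sibling added for illustration
    "dog": "mammal",
    "mammal": "entity",
}

def hypernym_path(synset):
    """Walk is-a links from a synset up to the root and return the chain."""
    path = [synset]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def is_a(synset, ancestor):
    """True when `ancestor` appears on the path from `synset` to the root."""
    return ancestor in hypernym_path(synset)

if __name__ == "__main__":
    print(" -> ".join(hypernym_path("german shepherd")))
    print(is_a("husky", "mammal"))
```

One consequence of this structure is that an image filed under a leaf such as "german shepherd" is also a valid example for every ancestor on its path.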

slide-25
SLIDE 25

Dog → German Shepherd

Step 2: Populate categories with thousands of images from the Internet

slide-26
SLIDE 26

Step 3: Clean results by hand

Dog → German Shepherd

slide-27
SLIDE 27

Three Attempts at Launching

slide-28
SLIDE 28

1st Attempt: The Psychophysics Experiment

ImageNet, PhD Students, Miserable Undergrads

slide-29
SLIDE 29

1st Attempt: The Psychophysics Experiment

  • # of synsets: 40,000 (subject to: imageability analysis)
  • # of candidate images to label per synset: 10,000
  • # of people needed to verify: 2-5
  • Speed of human labeling: 2 images/sec (one fixation: ~200 msec)
  • Massive parallelism (N ~ 10^2-3)

40,000 × 10,000 × 3 / 2 = 600,000,000 sec ≈ 19 years of total labeling time, to be split across N parallel labelers
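The back-of-the-envelope estimate above can be checked in a few lines. This is a sketch that takes 3 verifiers per image (the value used in the slide's formula, within the stated 2-5 range) and a 365-day year; the function names are invented for the example.

```python
SECONDS_PER_DAY = 24 * 3600
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY

def total_labeling_seconds(synsets=40_000, images_per_synset=10_000,
                           verifiers_per_image=3, images_per_second=2):
    """Total human time needed when every candidate image is verified by several people."""
    return synsets * images_per_synset * verifiers_per_image / images_per_second

def wall_clock_days(total_seconds, n_parallel):
    """Elapsed days if the work is split evenly across n_parallel labelers."""
    return total_seconds / n_parallel / SECONDS_PER_DAY

if __name__ == "__main__":
    total = total_labeling_seconds()
    print(f"total: {total:,.0f} sec = {total / SECONDS_PER_YEAR:.1f} years")
    for n in (100, 1000):
        print(f"N={n}: about {wall_clock_days(total, n):.1f} days of nonstop labeling")
```

Even at N = 100 this assumes labelers who work around the clock at two images per second, which is why the slide frames it as a psychophysics experiment rather than a realistic plan.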

slide-30
SLIDE 30

2nd Attempt: Human-in-the-Loop Solutions

slide-31
SLIDE 31

2nd Attempt: Human-in-the-Loop Solutions

Human-generated datasets transcend algorithmic limitations, leading to better machine perception. Machine-generated datasets can only match the best algorithms of the time.

slide-32
SLIDE 32

3rd Attempt: A Godsend Emerges

ImageNet, PhD Students, Crowdsourced Labor

49k Workers from 167 Countries, 2007-2010

slide-33
SLIDE 33

The Result: ImageNet Goes Live in 2009

slide-34
SLIDE 34
slide-35
SLIDE 35

What We Did Right

slide-36
SLIDE 36

While Others Targeted Detail…

LabelMe

Per-Object Regions and Labels (Russell et al, 2005)

Lotus Hill

Hand-Traced Parse Trees (Yao et al, 2007)

slide-37
SLIDE 37

… We Targeted Scale

ImageNet, 15M [Deng et al. ’09]

SUN, 131K [Xiao et al. ’10]

LabelMe, 37K [Russell et al. ’07]

PASCAL VOC, 30K [Everingham et al. ’06-’12]

Caltech101, 9K [Fei-Fei, Fergus, Perona, ’03]

slide-38
SLIDE 38

Additional Goals

High Resolution

To better replicate human visual acuity

Free of Charge

To ensure immediate application and a sense of community

High-Quality Annotation

To create a benchmarking dataset and advance the state of machine perception, not merely reflect it

Carnivore

  • Canine
  • Dog
  • Working Dog
  • Husky
slide-39
SLIDE 39

An Emphasis on Community and Achievement

Large Scale Visual Recognition Challenge (ILSVRC 2010-2017)

slide-40
SLIDE 40

ILSVRC Contributors

  • Olga Russakovsky, Stanford
  • Fei-Fei Li, Stanford
  • Alex Berg, UNC Chapel Hill
  • Wei Liu, UNC Chapel Hill
  • Eunbyung Park, UNC Chapel Hill
  • Sean Ma, Stanford
  • Jonathan Krause, Stanford
  • Sanjeev Satheesh, Stanford
  • Hao Su, Stanford
  • Aditya Khosla, Stanford
  • Zhiheng Huang, Stanford
  • Jia Deng, Univ. of Michigan
slide-41
SLIDE 41

Our Inspiration: PASCAL VOC

2005-2012

slide-42
SLIDE 42

Our Inspiration: PASCAL VOC

Mark Everingham

1973-2012

Mark Everingham Prize @ ECCV 2016

Alex Berg, Jia Deng, Fei-Fei Li, Wei Liu, Olga Russakovsky

slide-43
SLIDE 43

Participation and Performance

[Figure: bar chart of the number of entries per year, 2010-2016. Extracted bar labels: 35, 29, 81, 123, 157, 172.]

slide-44
SLIDE 44

Participation and Performance

[Figure: number of entries per year, 2010-2016 (extracted bar labels: 35, 29, 81, 123, 157, 172), overlaid with top-5 classification error, which falls from 0.28 to 0.03.]
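Top-5 classification error counts a prediction as wrong only when the true class is missing from the five highest-scoring classes. The sketch below shows that computation on plain Python lists; the function name and the tiny score matrix are invented for illustration and are not taken from the challenge's evaluation code.

```python
def top_k_error(scores, labels, k=5):
    """Fraction of examples whose true label is not among the k highest-scoring classes.

    scores: one list of per-class scores per example.
    labels: the true class index for each example.
    """
    misses = 0
    for row, label in zip(scores, labels):
        # Class indices ordered from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label not in ranked[:k]:
            misses += 1
    return misses / len(labels)

if __name__ == "__main__":
    # Two made-up examples over six classes.
    scores = [
        [0.10, 0.20, 0.30, 0.40, 0.50, 0.00],
        [0.60, 0.10, 0.10, 0.10, 0.05, 0.05],
    ]
    labels = [3, 0]
    print("top-1 error:", top_k_error(scores, labels, k=1))
    print("top-5 error:", top_k_error(scores, labels, k=5))
```

The looser criterion matters for a 1,000-class benchmark in which many images contain several plausible objects and only one of them is the annotated label.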

slide-45
SLIDE 45

Participation and Performance

[Figure: number of entries per year, 2010-2016 (extracted bar labels: 35, 29, 81, 123, 157, 172), overlaid with top-5 classification error, which falls from 0.28 to 0.03, and average precision for object detection, which rises from 0.23 to 0.66.]

slide-46
SLIDE 46

What we did to make ImageNet better

slide-47
SLIDE 47

Lack of Details

slide-48
SLIDE 48

Lack of Details… ILSVRC Detection Challenge

Statistics         PASCAL VOC 2012    ILSVRC 2013    Ratio
Object classes     20                 200            10x
Training images    5.7K               395K           70x
Objects            13.6K              345K           25x

slide-49
SLIDE 49

Evaluation of ILSVRC Detection

Need to annotate the presence of all classes (to penalize false detections)

[Figure: grid marking each image with + or - for the presence of each class: Table, Chair, Horse, Dog, Cat, Bird.]

  • # images: 400K
  • # classes: 200
  • # annotations = 80M!

slide-50
SLIDE 50

Evaluation of ILSVRC Detection

Hierarchical annotation

  • J. Deng, O. Russakovsky, J. Krause, M. Bernstein, A. Berg, & L. Fei-Fei. CHI, 2014
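To see why a hierarchy helps, compare asking one yes/no question per image and class with asking a group-level question first and skipping the whole group on a "no". The sketch below is a simplified illustration of that idea only, not the algorithm from the CHI 2014 paper. The `HIERARCHY` grouping is hypothetical, and the `present` set stands in for a human annotator's answers.

```python
# Hypothetical two-level grouping of detection classes (not the ILSVRC hierarchy).
HIERARCHY = {
    "furniture": ["table", "chair"],
    "animal": ["horse", "dog", "cat", "bird"],
    "vehicle": ["car", "bus", "bicycle", "boat"],
}

def exhaustive_questions(n_images, n_classes):
    """One presence question per (image, class) pair."""
    return n_images * n_classes

def hierarchical_labels(present, hierarchy=HIERARCHY):
    """Label one image, returning (labels, number of questions asked).

    `present` is the set of classes actually in the image; it plays the role
    of the human answering each question.
    """
    labels = {}
    questions = 0
    for group, classes in hierarchy.items():
        questions += 1  # "Is there any <group> in the image?"
        group_present = any(c in present for c in classes)
        for c in classes:
            if group_present:
                questions += 1  # "Is there a <c>?"
                labels[c] = c in present
            else:
                labels[c] = False  # follows from the group-level "no"
    return labels, questions

if __name__ == "__main__":
    print("exhaustive:", exhaustive_questions(400_000, 200))
    labels, asked = hierarchical_labels({"dog"})
    print("questions for a dog-only image:", asked, "of", len(labels), "classes")
```

The saving grows with sparsity: most images contain only a few of the 200 classes, so most group-level questions come back "no".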
slide-51
SLIDE 51
  • J. Deng, A. Berg & L. Fei-Fei, ECCV, 2010

What does classifying 10K+ classes tell us?

slide-52
SLIDE 52

Fine-Grained Recognition

“Cardigan Welsh Corgi” “Pembroke Welsh Corgi”

slide-53
SLIDE 53

[Gebru, Krause, Deng, Fei-Fei, CHI 2017]

2,567 classes, 700k images

Fine-Grained Recognition

cars

slide-54
SLIDE 54

Expected Outcomes

  • ImageNet becomes a benchmark
  • Machine learning advances and changes dramatically
  • Breakthroughs in object recognition
slide-55
SLIDE 55

Unexpected Outcomes

slide-56
SLIDE 56

Neural Nets are Cool Again!

Krizhevsky, Sutskever & Hinton, NIPS 2012

13,259

Citations

slide-57
SLIDE 57

… And Cooler and Cooler 

[Krizhevsky et al. NIPS 2012]

“AlexNet”

[Szegedy et al. CVPR 2015]

“GoogLeNet”

[Simonyan & Zisserman, ICLR 2015]

“VGG Net”

[He et al. CVPR 2016]

“ResNet”

slide-58
SLIDE 58

A Deep Learning Revolution

Neural Nets GPUs

slide-59
SLIDE 59

Ontological Structure Not Used as Much

slide-60
SLIDE 60

[Figure: "is a" taxonomy rooted at Thing. Animalia splits into Chordate (Mammal, with Carnivora, Primate and Marsupial below it) and Arthropoda (Insect, Diptera). The leaves include House Cat, Lion, Chimpanzee, Human, Wombat and Housefly. The example image is a wombat.]

Deng, Krause, Berg & Fei-Fei, CVPR 2012

slide-61
SLIDE 61

[Figure: "is a" taxonomy from Thing down to leaves such as House Cat, Lion, Chimpanzee, Human, Wombat and Housefly. The example image is a wombat.]

Possible labels, from least to most specific: Thing, Animal, Mammal, Marsupial, Wombat

Deng, Krause, Berg & Fei-Fei, CVPR 2012

slide-62
SLIDE 62

[Figure: "is a" taxonomy from Thing down to leaves such as House Cat, Lion, Chimpanzee, Human, Wombat and Housefly. The example image is a wombat.]

Maximize Specificity(f) subject to Accuracy(f) ≥ 1 - ε

Deng, Krause, Berg & Fei-Fei, CVPR 2012
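One way to read the constrained objective on this slide is: answer with the most specific node of the hierarchy that the classifier still believes with probability at least 1 - ε. The sketch below implements that reading as a greedy descent. It is a deliberately simplified stand-in, not the method of the CVPR 2012 paper: specificity is approximated by depth, the `CHILDREN` tree is a small hypothetical fragment, and the leaf probabilities are made up.

```python
# Hypothetical fragment of an is-a tree: each node maps to its children.
CHILDREN = {
    "thing": ["animal"],
    "animal": ["mammal", "insect"],
    "mammal": ["marsupial", "carnivore"],
    "marsupial": ["wombat", "kangaroo"],
    "carnivore": ["lion", "house cat"],
    "insect": ["housefly"],
}

def node_probability(node, leaf_probs):
    """Probability of an internal node = sum of the leaf probabilities below it."""
    kids = CHILDREN.get(node)
    if not kids:
        return leaf_probs.get(node, 0.0)
    return sum(node_probability(k, leaf_probs) for k in kids)

def most_specific_label(leaf_probs, epsilon, root="thing"):
    """Descend from the root while some child keeps probability >= 1 - epsilon."""
    node = root
    while True:
        confident = [k for k in CHILDREN.get(node, [])
                     if node_probability(k, leaf_probs) >= 1 - epsilon]
        if not confident:
            return node  # no child is safe enough, so stop hedging here
        node = max(confident, key=lambda k: node_probability(k, leaf_probs))

if __name__ == "__main__":
    # Made-up classifier output for an image of a wombat.
    probs = {"wombat": 0.55, "kangaroo": 0.30, "lion": 0.10,
             "house cat": 0.03, "housefly": 0.02}
    for eps in (0.01, 0.2, 0.5):
        print(f"epsilon={eps}: {most_specific_label(probs, eps)}")
```

Tightening ε pushes the answer up the tree toward safe but uninformative labels, and loosening it lets the classifier commit to a leaf.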

slide-63
SLIDE 63

Our Model

Optimizing with a Knowledge Ontology Results in Big Gains in Information at Arbitrary Accuracy

Deng, Krause, Berg & Fei-Fei, CVPR 2012

slide-64
SLIDE 64

Kuettel, Guillaumin, Ferrari. Segmentation Propagation in ImageNet. ECCV 2012. ECCV 2012 Best Paper Award

Relatively Few Works Have Used Ontology

slide-65
SLIDE 65

Most works still use 1M images to do pre-training

15M Images Total

1M Images

slide-66
SLIDE 66

“First, we find that the performance on vision tasks still increases linearly with orders of magnitude of training data size.”

  • C. Sun et al, 2017
slide-67
SLIDE 67

How Humans Compare

Andrej Karpathy. http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/

slide-68
SLIDE 68

How Humans Compare

GoogLeNet

6.8% top-5 error rate

Susceptible to:

  • Small, thin objects
  • Image filters
  • Abstract representations
  • Miscellaneous sources

Human

5.1% top-5 error rate

Susceptible to:

  • Fine-grained recognition
  • Class unawareness
  • Insufficient training data

Andrej Karpathy. http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/

slide-69
SLIDE 69

What Lies Ahead

slide-70
SLIDE 70

Moving from object recognition…

[Figure: photo annotated with object labels: person, person, person, person, person, scale, room]

slide-71
SLIDE 71

… to human-level understanding.

[Figure: the same photo annotated with actions and intentions: person (Standing on), person (Stepping on), person (Watching and laughing), room, scale]

Wants to weigh himself

Wants to play a prank

Stepping on a scale adds weight and ups the reading.

slide-72
SLIDE 72

Inverse Graphics

Image credit: https://www.youtube.com/watch?v=ip-KIzQmcBo (Oliver Villar)

slide-73
SLIDE 73

ImageNet: Deng et al. 2009; COCO: Lin et al. 2014

lady

slide-74
SLIDE 74

[Figure: photo annotated with object labels: tree, ski, jacket, boots, snow, sunglasses, vest, pole, coat, glove, head, building, leaves, equipment, bag, hat, sky, lady]

COCO: Lin et al. 2014

“A lady in pink dress is skiing.”

slide-75
SLIDE 75

Q: What is the man in the center doing? A: Standing on a ski.
Q: What is the color of the sky? A: Blue
Q: Where are the pine trees? A: Behind the hill.

<woman, wear, coat>
<trees, be, green>
<trees, behind, group (of people)>
<man, has, jacket>
<boots, be, yellow>
<lady, hold, skis>

“A man standing.”
“A clear blue sky at a ski resort.”
“A snowy hill is in front of pine trees.”
“There are several pine trees.”
“A group of people getting ready to ski.”
“A lady in pink dress is skiing.”

[Figure: photo annotated with object labels: tree, ski, jacket, boots, snow, sunglasses, vest, pole, coat, glove, head, building, leaves, equipment, bag, hat, sky, lady]
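The angle-bracket triples on this slide are the edges of a scene graph: a subject, a predicate and an object. As a minimal sketch of how such annotations could be stored and queried, the code below keeps them as named tuples and filters by any field. The `Relation` type and `query` helper are illustrative names, not part of any dataset's tooling.

```python
from collections import namedtuple

# One edge of a scene graph: <subject, predicate, object>.
Relation = namedtuple("Relation", ["subject", "predicate", "object"])

# The six triples shown on the slide.
GRAPH = [
    Relation("woman", "wear", "coat"),
    Relation("trees", "be", "green"),
    Relation("trees", "behind", "group (of people)"),
    Relation("man", "has", "jacket"),
    Relation("boots", "be", "yellow"),
    Relation("lady", "hold", "skis"),
]

def query(graph, subject=None, predicate=None, obj=None):
    """Return the relations matching every field that is not None."""
    return [
        r for r in graph
        if (subject is None or r.subject == subject)
        and (predicate is None or r.predicate == predicate)
        and (obj is None or r.object == obj)
    ]

if __name__ == "__main__":
    # "Where are the trees?" becomes a lookup on subject and predicate.
    for r in query(GRAPH, subject="trees", predicate="behind"):
        print(f"{r.subject} {r.predicate} {r.object}")
    # Attribute-style facts share the predicate "be".
    print([(r.subject, r.object) for r in query(GRAPH, predicate="be")])
```

Questions like the ones on the slide can then be phrased as lookups over these edges instead of over a single image-level label.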

slide-76
SLIDE 76
slide-77
SLIDE 77

entire universe of images

[Johnson et al., CVPR 2015]

slide-78
SLIDE 78

Visual Genome Dataset

Goals

  • Beyond nouns

– Objects, verbs, attributes

  • Beyond object classification

– Relationships and contexts

  • Sentences and QAs
  • From Perception to Cognition

Specs

  • 108,249 images (COCO images)
  • 4.2M image descriptions
  • 1.8M Visual QA (7W)
  • 1.4M objects, 75.7K obj. classes
  • 1.5M relationships, 40.5K rel. classes
  • 1.7M attributes, 40.5K attr. classes
  • Vision and language correspondences
  • Everything mapped to WordNet Synset

Krishna et al. IJCV 2016

A dataset, a knowledge base, an ongoing effort to connect structural image concepts to language.

slide-79
SLIDE 79

Visual Genome Dataset

A dataset, a knowledge base, an ongoing effort to connect structural image concepts to language.

Krishna et al. IJCV 2016

Q: What is the person sitting on the right of the elephant wearing? A: A blue shirt.

DenseCap & Paragraph Generation: Karpathy et al. CVPR’16; Krause et al. CVPR’17

Relationship Prediction: Krishna et al. ECCV’16

Image Retrieval w/ Scene Graphs: Johnson et al. CVPR’15; Xu et al. CVPR’17

Visual Q&A: Zhu et al. CVPR’16

slide-80
SLIDE 80

Visual Genome Dataset

A dataset, a knowledge base, an ongoing effort to connect structural image concepts to language.

Krishna et al. IJCV 2016

Q: What is the person sitting on the right of the elephant wearing? A: A blue shirt.

DenseCap & Paragraph Generation: Karpathy et al. CVPR’16; Krause et al. CVPR’17

Relationship Prediction: Krishna et al. ECCV’16

Image Retrieval w/ Scene Graphs: Johnson et al. CVPR’15; Xu et al. CVPR’17

Visual Q&A: Zhu et al. CVPR’16

Workshop on Visual Understanding by Learning from Web Data 2017

26 July 2017 | Honolulu, Hawaii, in conjunction with CVPR 2017

http://www.vision.ee.ethz.ch/webvision/workshop.html

slide-81
SLIDE 81


The Future of Vision and Intelligence

Agency: The integration of perception, understanding and action

Vision Language Understanding Action

slide-82
SLIDE 82

Eight Years of Competitions 2010-2017

10× reduction of image classification error

improvement of detection precision

slide-83
SLIDE 83

What Happens Now?

We’re passing the baton to Kaggle: a community of more than 1M data scientists.

Why? Democratizing data is vital to democratizing AI.

image-net.org remains live at Stanford.

slide-84
SLIDE 84

What Happens Now?

  • ImageNet Object Localization Challenge
  • ImageNet Object Detection Challenge
  • ImageNet Object Detection from Video Challenge

slide-85
SLIDE 85

Alex Berg, Michael Bernstein, Edward Chang, Brendan Collins, Jia Deng, Minh Do, Wei Dong, Alexei Efros, Mark Everingham, Christiane Fellbaum, Adam Finkelstein, Thomas Funkhouser, Timnit Gebru, Derek Hoiem, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Jonathan Krause, Fei-Fei Li, Kai Li, Li-Jia Li, Wei Liu, Sean Ma, Xiaojuan Ma, Jitendra Malik, Dan Osherson, Eunbyung Park, Chuck Rosenberg, Olga Russakovsky, Sanjeev Satheesh, Richard Socher, Hao Su, Zhe Wang, Andrew Zisserman

Contributors/ Friends/ Advisors

49k Amazon Mechanical Turk Workers

slide-86
SLIDE 86

“This is not the end. It is not even the beginning of the end. But it is, perhaps, the end of the beginning.”

WINSTON CHURCHILL