SLIDE 1

KEMAL ÇİZMECİLER (08.03.2016)

SLIDE 2: PRESENTATION TOPIC

OBJECT DETECTORS EMERGE IN DEEP SCENE CNNS

Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, Antonio Torralba. International Conference on Learning Representations (ICLR), 2015.

SLIDE 3: OUTLINE

§ Problem statement and motivation
§ Simplifying the input images
§ Visualizing the receptive fields
§ Identifying the semantics of internal units
§ Connections with other work
§ Future directions

SLIDE 4: PROBLEM DEFINITION

§ Study the internal representation learned by Places-CNN on a task other than object recognition (i.e., scene recognition)
§ Visualize those representations through the inner layers

SLIDE 5: IMAGENET CNN AND PLACES CNN

ImageNet CNN for object classification and Places CNN for scene classification share the same architecture: AlexNet.

Slide credit: Bolei Zhou

SLIDE 6: COMPARISON OF IMAGENET CNN AND PLACES CNN

§ The ImageNet-CNN from Jia (2013) is trained on 1.3 million images from 1000 object categories of ImageNet (ILSVRC 2012) and achieves a top-1 accuracy of 57.4%.
§ Places-CNN is trained on 2.4 million images from 205 scene categories of the Places Database (Zhou et al., 2014) and achieves a top-1 accuracy of 50.0%.

SLIDE 7: OBJECT REPRESENTATIONS IN COMPUTER VISION

Part-based models are used to represent objects and visual patterns:
• Object as a set of parts
• Relative locations between parts

Figure from Fischler & Elschlager (1973)
Slide credit: Bolei Zhou

SLIDE 8: HOW ARE OBJECTS REPRESENTED IN A CNN?

Three ways to inspect what a unit responds to:
• Deconvolution: Zeiler, M. et al. Visualizing and Understanding Convolutional Networks. ECCV 2014.
• Back-propagation: Simonyan, K. et al. Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps. ICLR workshop, 2014.
• Strongly activating images: Girshick, R., Donahue, J., Darrell, T., Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. CVPR 2014.

Slide credit: Bolei Zhou

SLIDE 9: DECONVNET

Matthew D. Zeiler, Rob Fergus. Visualizing and Understanding Convolutional Networks.

SLIDE 10: IMAGES HAVING HIGHEST ACTIVATIONS

The earlier layers such as pool1 and pool2 prefer similar images for both networks, while the later layers tend to be more specialized to the specific task of scene or object categorization.
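The ranking behind this comparison is straightforward; below is a minimal sketch, assuming a pre-computed activations matrix (the network forward passes themselves are omitted):

```python
import numpy as np

def top_activating_images(activations, k=10):
    """activations: (num_images, num_units) array holding each unit's
    maximum response per image. Returns, for every unit, the indices
    of the k images that activate it most strongly."""
    return np.argsort(-activations, axis=0)[:k].T  # shape (num_units, k)
```

Comparing these top-k image sets between ImageNet-CNN and Places-CNN, layer by layer, yields the observation above.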

SLIDE 11: HOW ARE OBJECTS REPRESENTED IN A CNN?

A CNN uses a distributed code to represent objects (units across conv1, conv2, conv3, conv4, pool5).

Slide credit: Bolei Zhou
Interactive visualization: http://people.csail.mit.edu/torralba/research/drawCNN/drawNet.html

SLIDE 12: DIFFERENCE FROM OTHER WORK

§ Girshick et al. (2014) proposed pre-training on an auxiliary task and then fine-tuning for the target task.
§ Here, the same network can do both object localization and scene recognition in a single forward pass.
§ Another set of recent works (Oquab et al., 2014; Bergamo et al., 2014) demonstrates that deep networks trained on object classification can localize without bounding-box supervision. However, they still require object-level supervision.

SLIDE 13: UNCOVERING THE CNN

§ Simplifying the input images
§ Visualizing the receptive fields
§ Identifying the semantics of internal units

SLIDE 14: SIMPLIFYING THE INPUT IMAGES

§ First approach: simplify the image so that it keeps as little visual information as possible while still having a high classification score for the same category.
§ Second approach: generate the minimal image representations using the fully annotated image set of the SUN Database (Xiao et al., 2014) instead of performing automatic segmentation.

SLIDE 15: SUN DATABASE

J. Xiao et al., 2010

SLIDE 16: SUN DATABASE

J. Xiao et al., 2010

SLIDE 17: HOW IS SIMPLIFICATION HANDLED?

§ At each iteration, remove the segment that produces the smallest decrease of the correct classification score, and repeat until the image is incorrectly classified (see the sketch below).
§ What remains is the minimal image representation.
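A minimal sketch of this greedy loop, assuming a hypothetical classify(image) that returns the class-score vector of Places-CNN and a list of precomputed boolean segment masks (both stand-ins for the paper's actual pipeline):

```python
import numpy as np

def minimal_representation(image, segments, classify, true_label):
    """Greedily remove the segment whose deletion hurts the correct
    class score least, until the image is misclassified; return the
    last correctly classified image."""
    current = image.copy()
    remaining = list(segments)  # boolean masks over the image
    while remaining:
        # score of the true class after removing each candidate segment
        scores = []
        for seg in remaining:
            cand = current.copy()
            cand[seg] = 0
            scores.append(classify(cand)[true_label])
        i = int(np.argmax(scores))  # removal with the smallest score drop
        cand = current.copy()
        cand[remaining[i]] = 0
        if np.argmax(classify(cand)) != true_label:
            break  # this removal would flip the prediction; stop here
        current = cand
        remaining.pop(i)
    return current
```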

SLIDE 18: RELATED IDEA: POISSON BLENDING

• A good blend should preserve the gradients of the source region without changing the background.

Slide by Derek Hoiem
Pérez, Patrick, Gangnet, Michel, and Blake, Andrew. Poisson Image Editing. ACM Trans. Graph., 2003.
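For reference, the Poisson editing objective from Pérez et al. (2003): solve for pixel values f over the pasted region Ω that reproduce the source gradients ∇g while matching the target image f* on the boundary:

```latex
\min_{f}\; \iint_{\Omega} \left\lVert \nabla f - \nabla g \right\rVert^{2}\, dx\, dy
\qquad \text{subject to} \qquad f\big|_{\partial \Omega} = f^{*}\big|_{\partial \Omega}
```

The minimizer satisfies the Poisson equation Δf = Δg over Ω with these Dirichlet boundary conditions, hence the name.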

SLIDE 19: MINIMAL IMAGE REPRESENTATION

SLIDE 20: INFERENCE FROM SIMPLIFICATION

§ For art gallery, the minimal image representations contained paintings (81%) and pictures (58%); in amusement park, carousel (75%), ride (64%), and roller coaster (50%); in bookstore, bookcase (96%), books (68%), and shelves (67%).
§ These results suggest that object detection is an important part of the representation built by the network to obtain discriminative information for scene classification.

SLIDE 21: VISUALIZING THE RECEPTIVE FIELDS

§ Replicate each image many times with small random occluders (image patches of size 11×11) at different locations in the image.
§ For each of the K images, center the discrepancy map on the spatial location of the unit that caused the maximum activation for the given image, then average the re-centered discrepancy maps (a sketch follows below).
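A minimal sketch of the per-image discrepancy map, using a dense sliding occluder rather than the random placement described above; activation_fn is a hypothetical hook returning the scalar response of the unit under study:

```python
import numpy as np

def discrepancy_map(image, activation_fn, patch=11, stride=3):
    """For each occluder position, record how much the unit's response
    drops relative to the unoccluded image; accumulate per pixel."""
    h, w = image.shape[:2]
    base = activation_fn(image)
    disc = np.zeros((h, w))
    counts = np.zeros((h, w))
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            occluded = image.copy()
            occluded[y:y + patch, x:x + patch] = 0  # zeroed-out occluder
            drop = base - activation_fn(occluded)
            disc[y:y + patch, x:x + patch] += drop
            counts[y:y + patch, x:x + patch] += 1
    return disc / np.maximum(counts, 1)
```

Averaging the K re-centered maps (one per high-activation image) gives the empirical receptive field of the unit.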

SLIDE 22: VISUALIZING THE RECEPTIVE FIELDS

200K images from the scene-centric SUN database + the object-centric ImageNet.

SLIDE 23: ESTIMATING THE RECEPTIVE FIELDS

Estimated receptive fields at pool1, conv3, and pool5: the actual size of the RF is much smaller than the theoretical size.

Segmentation using the RF of units: highlight the regions within the RF that have the highest value in the feature map. The result is more semantically meaningful (see the sketch below).
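The segmentation step can be sketched as thresholding the upsampled feature map; the 0.5 threshold ratio here is an assumption for illustration, not the paper's calibrated value:

```python
import numpy as np
from scipy.ndimage import zoom

def segment_with_rf(feature_map, image_hw, thresh_ratio=0.5):
    """Upsample a unit's 2-D feature map to image resolution and keep
    the regions responding above a fraction of the peak activation."""
    h, w = image_hw
    fh, fw = feature_map.shape
    up = zoom(feature_map, (h / fh, w / fw), order=1)  # bilinear resize
    return up > thresh_ratio * up.max()  # boolean segmentation mask
```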

SLIDE 24: IDENTIFYING THE SEMANTICS

Workers provide tags without being constrained to a dictionary of terms or to pre-calculated segments. Three tasks:
(1) identify the concept;
(2) mark the set of images that do not fall into this theme;
(3) categorize the concept provided in (1) into one of six semantic groups ranging from low-level to high-level: from simple elements and colors to object parts and even scenes.
For each unit, precision is measured as the percentage of images that were selected as fitting the labeled concept.
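In formula form, the precision of a unit u over the top-ranked images shown to the workers is simply:

```latex
\mathrm{precision}(u) \;=\; \frac{\#\{\text{top images marked as fitting the labeled concept}\}}{\#\{\text{top images shown for unit } u\}}
```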

SLIDE 25: ANNOTATING THE SEMANTICS OF UNITS

Top-ranked segmented images are cropped and sent to Amazon Mechanical Turk (AMT) for annotation.

SLIDE 26: ANNOTATING THE SEMANTICS OF UNITS

Pool5, unit 76; label: ocean; type: scene; precision: 93%

SLIDE 27: ANNOTATING THE SEMANTICS OF UNITS

Pool5, unit 13; label: lamps; type: object; precision: 84%

SLIDE 28: ANNOTATING THE SEMANTICS OF UNITS

Pool5, unit 77; label: legs; type: object part; precision: 96%

SLIDE 29: ANNOTATING THE SEMANTICS OF UNITS

Pool5, unit 112; label: pool table; type: object; precision: 70%

SLIDE 30: ANNOTATING THE SEMANTICS OF UNITS

Pool5, unit 22; label: dinner table; type: scene; precision: 60%

SLIDE 31: IDENTIFYING THE SEMANTICS

Only units with a precision above 75%, as judged by the AMT workers, are considered; around 60% of the units on each layer were above that threshold. For both networks, units at the early layers (pool1, pool2) are more responsive to simple elements and colors, while those at later layers (conv4, pool5) carry more high-level semantics, responding more to objects and scenes.

SLIDE 32: DISTRIBUTION OF SEMANTIC TYPES AT EACH LAYER

Object detectors emerge within a CNN trained to classify scenes, without any object supervision!

Slide credit: Bolei Zhou

SLIDE 33: WHAT OBJECT CLASSES EMERGE?

The SUN database is used because dense object annotations are needed to study which object classes are most informative for scene categorization and what the natural object frequencies in scene images are. The segmentation shows the regions of the image for which the unit is above a certain threshold. Each unit appears to be selective to a particular appearance of the object.

SLIDE 34
SLIDE 35: WHAT OBJECT CLASSES EMERGE?

§ The categories found in pool5 tend to follow the target categories in ImageNet.
§ There are 6 units that detect lamps, each unit detecting a particular type of lamp, providing finer-grained discrimination.
§ There are 9 units selective to people, each one tuned to different scales or to people doing different tasks.

SLIDE 36: AND WHY?

§ The correlation between object frequency in the database and object frequency discovered by the units in pool5 is 0.54.
§ The correlation between the number of scene categories for which a particular object class is most useful and the discovered object classes is 0.84.
§ Also, there are 115 units not detecting objects, so we cannot rule out other representations being used in combination with objects.

SLIDE 37: EVALUATION ON SUN DATABASE

Figure annotations: correlation 0.53 and correlation 0.84.

SLIDE 38: EVALUATION ON SUN DATABASE

Evaluate the performance of the emerged object detectors.

SLIDE 39: CONCLUSION

§ Object detection emerges inside a CNN trained to recognize scenes.
§ Many more objects can be naturally discovered in a supervised setting tuned to scene classification rather than object classification.
§ Object localization is obtained in a single forward pass.
§ Besides taking the output of the last layer as a feature, the inner layers can show different levels of abstraction.

SLIDE 40: FUTURE DIRECTION

§ In what other tasks can we learn object classes or discriminative object detectors?
§ Can we pop up all object classes?

SLIDE 41: QUESTIONS

Thank you.