Unsupervised Visual Representation Learning by Context Prediction
SLIDE 1

Unsupervised Visual Representation Learning by Context Prediction

Most slides in this presentation are adapted from the authors' original presentation at ICCV 2015.

Berkan Demirel

SLIDE 2

ImageNet + Deep Learning

Beagle

  • Image Retrieval
  • Detection (RCNN)
  • Segmentation (FCN)
  • Depth Estimation
SLIDE 3

ImageNet + Deep Learning

Beagle

Do we need semantic labels?

Pose? Boundaries? Geometry? Parts? Materials?

SLIDE 4

Context as Supervision

[Collobert & Weston 2008; Mikolov et al. 2013]

[Diagram: a deep net trained to predict a word from its surrounding context.]

SLIDE 5

Context Prediction for Images

[Diagram: given patch A, predict where patch B was taken from among the 8 locations surrounding it.]

SLIDE 6

Semantics from a non-semantic task

SLIDE 7

Relative Position Task: randomly sample a patch, then sample a second patch from one of the 8 possible locations around it. Both patches pass through a CNN, and a classifier predicts the relative position (8-way classification).
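A minimal sketch of the pretext task (assumed, not the authors' released code): cut an image into a grid of patches, pick an interior patch and one of its 8 neighbors, and use the neighbor's index as the classification target.

```python
import random

# 8 neighbor offsets as (d_row, d_col); the label is the list index.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1),
           (0, 1), (1, 1), (1, 0),
           (1, -1), (0, -1)]

def sample_pair(grid, row, col):
    """grid: 2D list of patches; (row, col) must be an interior cell."""
    label = random.randrange(8)
    dr, dc = OFFSETS[label]
    center = grid[row][col]
    neighbor = grid[row + dr][col + dc]
    return center, neighbor, label
```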

SLIDE 8

The CNN trained on this task yields a patch embedding.

[Figure: input patches and their nearest neighbors in the learned embedding space.]

Note: the embedding connects across instances!

SLIDE 9

Architecture

Patch 1 and Patch 2 each pass through an AlexNet-style stack with tied weights: Convolution, Max Pooling, LRN, Convolution, Max Pooling, LRN, Convolution, Convolution, Convolution, Max Pooling, Fully connected. The two stacks are then fused by two fully connected layers and trained with a softmax loss over the 8 relative positions.
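A simplified PyTorch-style sketch of this late-fusion pair architecture; layer sizes here are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class PairNet(nn.Module):
    def __init__(self, embed_dim=512):
        super().__init__()
        self.trunk = nn.Sequential(            # shared (tied) weights
            nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(),
            nn.MaxPool2d(3, stride=2), nn.LocalResponseNorm(5),
            nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool2d(3, stride=2), nn.LocalResponseNorm(5),
            nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(384, embed_dim), nn.ReLU(),  # per-patch "fc6"
        )
        self.head = nn.Sequential(             # fusion + classification
            nn.Linear(2 * embed_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, 8),           # 8 relative positions
        )

    def forward(self, patch1, patch2):
        e1, e2 = self.trunk(patch1), self.trunk(patch2)
        return self.head(torch.cat([e1, e2], dim=1))
```

The net is trained with cross-entropy over the 8 labels; because the trunks share weights, a single stack can later be reused as a stand-alone patch feature extractor.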

SLIDE 10

Avoiding Trivial Shortcuts

Include a gap between the patches, and jitter the patch locations (see the sketch below).
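A sketch of these two fixes (the exact numbers appear on the implementation slide): leave a gap of about half a patch width between the two patches, and jitter each offset by up to 7 pixels per axis, so low-level cues such as lines continuing across patch boundaries no longer give the answer away.

```python
import random

def neighbor_position(cx, cy, label, patch=96, gap=48, jitter=7):
    """(cx, cy): top-left corner of the center patch; label in [0, 8)."""
    step = patch + gap
    offsets = [(-step, -step), (0, -step), (step, -step),
               (step, 0), (step, step), (0, step),
               (-step, step), (-step, 0)]
    dx, dy = offsets[label]
    dx += random.randint(-jitter, jitter)   # jitter each axis independently
    dy += random.randint(-jitter, jitter)
    return cx + dx, cy + dy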

SLIDE 11

A Not-So “Trivial” Shortcut: Position in Image

SLIDE 12

Chromatic Aberration

SLIDE 13

Solutions

Color Dropping

Randomly drop 2 of the 3 color channels from each patch, then replace the dropped channels with Gaussian noise (standard deviation ~1/100 of the standard deviation of the remaining channel).

Projection

Shift green and magenta (red+blue) towards gray
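A sketch (assumed implementation) of the color-dropping solution: keep one randomly chosen channel and replace the other two with low-amplitude Gaussian noise, removing the color cues that chromatic aberration provides.

```python
import numpy as np

def drop_colors(patch):
    """patch: float array of shape (H, W, 3); returns a new array."""
    keep = np.random.randint(3)                  # channel to keep
    sigma = patch[..., keep].std() / 100.0       # ~1/100 of its std
    out = np.empty_like(patch)
    for c in range(3):
        if c == keep:
            out[..., c] = patch[..., c]
        else:
            out[..., c] = np.random.normal(0.0, sigma, patch.shape[:2])
    return out
```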

SLIDE 14

Implementation Details

  • Train on the ImageNet 2012 training set (1.3M images), using only the images and discarding the labels.
  • Resize each image to between 150K and 450K total pixels, preserving the aspect ratio.
  • Sample patches at resolution 96x96.
  • Sample the patches from a grid-like pattern; each sampled patch can participate in as many as 8 separate pairings.
  • Allow a gap of 48 pixels between the sampled patches in the grid, but also jitter the location of each patch in the grid by -7 to 7 pixels in each direction.
  • Preprocess patches by (1) mean subtraction, (2) projecting or dropping colors, and (3) randomly downsampling some patches to as little as 100 total pixels and then upsampling them, to build robustness to pixelation (see the sketch after this list).
  • Use batch normalization, without the scale and shift.
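A sketch of the pixelation-robustness preprocessing described above (assumed details beyond the slide text): with some probability, downsample a 96x96 patch to as few as 100 total pixels and upsample it back, then subtract the per-channel mean.

```python
import numpy as np
from PIL import Image

def preprocess(patch, p_pixelate=0.5, min_pixels=100):
    """patch: uint8 array of shape (96, 96, 3)."""
    if np.random.rand() < p_pixelate:
        n = np.random.randint(min_pixels, 96 * 96 + 1)
        side = max(int(round(np.sqrt(n))), 1)   # keep the patch square
        small = Image.fromarray(patch).resize((side, side))
        patch = np.asarray(small.resize((96, 96)), dtype=np.uint8)
    x = patch.astype(np.float32)
    return x - x.mean(axis=(0, 1))              # per-channel mean subtraction
```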
SLIDE 15

Experiments

  • Chromatic Aberration
  • Nearest-Neighbor Matching
  • Object Detection
  • Geometry Estimation
  • Visual Data Mining
  • Layout Prediction
SLIDE 16

Chromatic Aberration

SLIDE 17

Chromatic Aberration

SLIDE 18

Nearest-Neighbor Matching

  • fc6-layer features are used, from only one of the two stacks.
  • fc7 and higher layers are removed.
  • Normalized cross-correlation is used to find similar patches (a sketch follows below).
  • Randomly selected 96x96 patches are used in the comparison.
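A sketch of the matching step (assumed implementation): embed each 96x96 patch with the fc6 features of one stack, then rank candidate patches by normalized cross-correlation of the feature vectors.

```python
import numpy as np

def normalized_correlation(a, b, eps=1e-8):
    a = a - a.mean()                             # mean-center both vectors
    b = b - b.mean()
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)

def nearest_neighbors(query, candidates, k=5):
    """query: (D,) fc6 feature; candidates: (N, D) array."""
    scores = np.array([normalized_correlation(query, c) for c in candidates])
    return np.argsort(scores)[::-1][:k]          # indices of top-k matches
```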
SLIDE 19

What is learned?

[Figure: nearest neighbors for each input, comparing Ours, Random Initialization, and ImageNet AlexNet.]

SLIDE 20

Still don’t capture everything

[Figure: inputs with nearest neighbors from Ours, Random Initialization, and ImageNet AlexNet.]

You don’t always need to learn!

[Figure: inputs with nearest neighbors from Ours, Random Initialization, and ImageNet AlexNet.]

SLIDE 21

Object Detection

Pre-train on relative-position task, w/o labels

[Girshick et al. 2014]

SLIDE 22

Object Detection

[Girshick et al. 2014]

SLIDE 23

Object Detection

[Girshick et al. 2014]

SLIDE 24

Multi-Task Training?

SLIDE 25

Surface-normal Estimation

                   Error (Lower Better)    % Good Pixels (Higher Better)
                   Mean     Median         11.25°    22.5°    30°
  No Pretraining   38.6     26.5           33.1      46.8     52.5
  Unsup. Tracking  34.2     21.9           35.7      50.6     57.0
  Ours             33.2     21.3           36.0      51.2     57.8
  ImageNet Labels  33.3     20.8           36.7      51.7     58.1

SLIDE 26

Visual Data Mining

  • Sample a constellation of four adjacent patches from an image (we use four to reduce the likelihood of a matching spatial arrangement happening by chance).
  • Find the top 100 images with the strongest matches for all four patches, ignoring spatial layout.
  • Use a form of geometric verification to filter away images where the four matches are not geometrically consistent (see the sketch after this list).
  • Apply the described mining algorithm to Pascal VOC 2011.
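A sketch of the geometric-verification filter (a simplification, not the authors' exact procedure): a candidate image passes if its four matched patch centers reproduce the constellation's layout, i.e. the offsets from the centroid agree within a tolerance.

```python
import numpy as np

def geometrically_consistent(src_centers, dst_centers, tol=32.0):
    """src_centers, dst_centers: (4, 2) arrays of (x, y) patch centers."""
    src = np.asarray(src_centers, dtype=float)
    dst = np.asarray(dst_centers, dtype=float)
    src_off = src - src.mean(axis=0)             # layout relative to centroid
    dst_off = dst - dst.mean(axis=0)
    return bool(np.all(np.abs(src_off - dst_off) < tol))
```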
SLIDE 27

Visual Data Mining

Via Geometric Verification

Simplified from [Chum et al. 2007]

SLIDE 28

Mined from Pascal VOC2011

SLIDE 29

Layout Prediction

Results of the visual data mining algorithm on 15,000 Street View images from Paris.

SLIDE 30

Purity Test

SLIDE 31

So, do we need semantic labels?

SLIDE 32

Source Code & Supplementary Materials

  • Magic Init
  • Unsupervised Visual Representation Learning by Context Prediction
  • Visual Data Mining Results on unlabeled PASCAL VOC 2011 Images
  • Nearest Neighbors on PASCAL VOC 2007
  • More
SLIDE 33

THANK YOU!