Unsupervised Visual Representation Learning by Context Prediction
Most slides in this representation are adopted from authors' original presentation at ICCV 2015
Berkan Demirel
ImageNet + Deep Learning: Beagle - Image Retrieval - Detection
Pose? Boundaries? Geometry? Parts? Materials?
[Collobert & Weston 2008; Mikolov et al. 2013]
Randomly sample a patch, then sample a second patch from one of its neighboring locations
Two CNNs (one per patch) feed a classifier that predicts which of the 8 possible locations the second patch came from
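The pretext-task setup above can be sketched in a few lines. This is our own illustration, not the authors' code; the patch size (96) and gap (48) are assumed values for the sketch, and the label is simply the index of the sampled neighbor on a 3x3 grid.

```python
import random

# The 8 neighbors of the center cell on a 3x3 grid, in reading order;
# the neighbor's index is the classification label (0-7).
NEIGHBOR_OFFSETS = [(-1, -1), (-1, 0), (-1, 1),
                    ( 0, -1),          ( 0, 1),
                    ( 1, -1), ( 1, 0), ( 1, 1)]

def sample_patch_pair(image_h, image_w, patch=96, gap=48, rng=random):
    """Return top-left corners of (center, neighbor) patches and the label 0-7."""
    step = patch + gap                        # spacing between grid cells
    # keep the whole 3x3 grid of patches inside the image
    y = rng.randint(step, image_h - step - patch)
    x = rng.randint(step, image_w - step - patch)
    label = rng.randrange(8)
    dy, dx = NEIGHBOR_OFFSETS[label]
    return (y, x), (y + dy * step, x + dx * step), label
```

Training then amounts to cropping the two patches and asking the classifier to recover `label`.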
Patch Embedding
Input patch vs. its nearest neighbors in the CNN feature space
Note: connects across instances!
Architecture: Patch 1 and Patch 2 each pass through an identical AlexNet-style tower (stacks of Convolution, Max Pooling, and LRN layers, followed by fully connected layers) with tied weights between the two towers; the two embeddings are fused by further fully connected layers and trained with a softmax loss over the 8 relative positions.
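The late-fusion, tied-weight idea can be shown with a minimal NumPy sketch. Here a single linear map stands in for the AlexNet-style conv towers (a deliberate simplification); the key point is that both patches go through the *same* weights before the concatenated features reach the classifier head.

```python
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_FEAT, N_CLASSES = 96 * 96 * 3, 128, 8   # assumed toy dimensions

W_feat = rng.normal(0, 0.01, (D_IN, D_FEAT))    # tied weights: used for BOTH patches
W_cls = rng.normal(0, 0.01, (2 * D_FEAT, N_CLASSES))

def forward(patch1, patch2):
    f1 = np.maximum(patch1.reshape(-1) @ W_feat, 0)   # shared tower, ReLU
    f2 = np.maximum(patch2.reshape(-1) @ W_feat, 0)   # identical weights
    logits = np.concatenate([f1, f2]) @ W_cls         # late fusion + classifier
    e = np.exp(logits - logits.max())
    return e / e.sum()                                # softmax over 8 positions

probs = forward(rng.random((96, 96, 3)), rng.random((96, 96, 3)))
```

Because the weights are tied, the tower learns one patch embedding that works regardless of which side of the pair the patch lands on.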
Avoiding trivial shortcuts: include a gap between patches, and jitter the patch locations
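The jitter part is trivial to sketch (our own illustration; the maximum shift of 7 pixels is an assumed value): each patch corner is perturbed independently so that low-level cues such as edge continuation across the gap cannot solve the task.

```python
import random

def jitter(y, x, max_shift=7, rng=random):
    """Perturb a patch's top-left corner by up to max_shift pixels per axis."""
    return (y + rng.randint(-max_shift, max_shift),
            x + rng.randint(-max_shift, max_shift))
```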
Position in Image
Color Dropping
Randomly drop 2 of the 3 color channels from each patch, then replace the dropped channels with Gaussian noise (standard deviation ~1/100 the standard deviation of the remaining channel).
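A minimal sketch of that color-dropping step, assuming a float image with channels last (our own illustration, not the authors' code):

```python
import numpy as np

def drop_colors(patch, rng=np.random.default_rng()):
    """Keep one random channel; replace the other two with weak Gaussian noise."""
    out = patch.copy()
    keep = rng.integers(3)                     # the channel that survives
    sigma = out[..., keep].std() / 100.0       # ~1/100 of the kept channel's std
    for c in range(3):
        if c != keep:
            out[..., c] = rng.normal(0.0, sigma, out.shape[:2])
    return out
```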
Projection
Shift green and magenta (red+blue) towards gray
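One way to realize this shift (a sketch under our own assumptions): remove each pixel's component along the green-magenta color axis, so the chromatic-aberration signal carries no positional information. With `strength=1.0` the component is removed entirely; smaller values only attenuate it.

```python
import numpy as np

# Unit vector along the green-vs-(red+blue) direction in RGB space.
AXIS = np.array([-1.0, 2.0, -1.0]) / np.sqrt(6.0)

def project_toward_gray(patch, strength=1.0):
    """Attenuate each pixel's green-magenta component, pushing colors toward gray."""
    coeff = patch @ AXIS                          # per-pixel component along AXIS
    return patch - strength * coeff[..., None] * AXIS
```

Since the axis sums to zero, overall brightness per pixel is unchanged; only the green-magenta balance moves toward gray.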
8 separate pairings.
Downsample some patches to as little as 100 total pixels, then upsample them, to build robustness to pixelation.
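This augmentation can be sketched with nearest-neighbor resizing in plain NumPy (our own illustration; the real pipeline may use a different interpolation):

```python
import numpy as np

def pixelate(patch, target_pixels=100):
    """Downsample to ~target_pixels total pixels, then upsample back (nearest-neighbor)."""
    h, w = patch.shape[:2]
    scale = (target_pixels / (h * w)) ** 0.5
    sh, sw = max(1, round(h * scale)), max(1, round(w * scale))
    ys = np.arange(sh) * h // sh          # downsample row/column indices
    xs = np.arange(sw) * w // sw
    small = patch[ys][:, xs]
    yy = np.arange(h) * sh // h           # upsample back to the original size
    xx = np.arange(w) * sw // w
    return small[yy][:, xx]
```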
Nearest-neighbor comparison: Input vs. Ours vs. Random Initialization vs. ImageNet AlexNet
Pre-train on relative-position task, w/o labels
[Girshick et al. 2014]
                 Error (Lower Better)   % Good Pixels (Higher Better)
No Pretraining   38.6   26.5            33.1   46.8   52.5
-                34.2   21.9            35.7   50.6   57.0
Ours             33.2   21.3            36.0   51.2   57.8
ImageNet Labels  33.3   20.8            36.7   51.7   58.1
Sample a constellation of four patches per image (we use four to reduce the likelihood of a matching spatial arrangement happening by chance). Retrieve candidate images by matching all four patches, ignoring spatial layout. Then filter out images where the four matches are not geometrically consistent.
Via Geometric Verification
Simplified from [Chum et al. 2007]
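A minimal sketch of such a verification check (our own simplification, with a hypothetical residual threshold): the four matched patch centers in the candidate image should be related to the query constellation by a single translation plus uniform scale, fit here by least squares.

```python
import numpy as np

def geometrically_consistent(query_pts, match_pts, max_residual=20.0):
    """True if the four matches fit one translation + uniform scale of the query."""
    q = np.asarray(query_pts, float)
    m = np.asarray(match_pts, float)
    qc, mc = q - q.mean(0), m - m.mean(0)            # centering removes translation
    scale = (qc * mc).sum() / max((qc ** 2).sum(), 1e-9)   # least-squares scale
    residual = np.linalg.norm(mc - scale * qc, axis=1).mean()
    return residual <= max_residual
```

Matches that survive this filter form the mined visual clusters.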
Visual Data Mining Algorithm results for 15,000 Street View images from Paris
Source Code & Supplementary Materials