Learning Visual Semantics: Models, Massive Computation, and Innovative Applications
SLIDE 1

Learning Visual Semantics: Models, Massive Computation, and Innovative Applications

Part II: Visual Features and Representations Liangliang Cao, IBM Watson Research Center

SLIDE 2

Evolution of Visual Features

  • Low level features and histogram
  • SIFT and bag-of-words models
  • Sparse coding
  • Super vector and Fisher vector
  • Deep CNN
SLIDE 3

Evolution of Visual Features

  • Low level features and histogram
  • SIFT and bag-of-words models
  • Sparse coding
  • Super vector and Fisher vector
  • Deep CNN

Fewer parameters -> More parameters

SLIDE 4

Evolution of Visual Features

  • Low level features and spatial histogram
  • SIFT and bag-of-words models
  • Sparse coding
  • Super vector and Fisher vector
  • Deep CNN

Three fundamental techniques have been used extensively:

  • 1. histogram
  • 2. spatial gridding
  • 3. filtering

SLIDE 5

Low Level Features and Spatial Pyramid

SLIDE 6

Raw Pixels as Features

Concatenate the raw pixels into a 1-D vector.

  • Application 1: Face recognition
  • Application 2: Handwritten digits
  • Tiny Image [Torralba et al 2007]: resize an image to a 32x32 color thumbnail, which corresponds to a 3072-dimensional vector

Pictures courtesy of Face Research Lab, Antonio Torralba, and Sam Roweis
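As a minimal sketch (assuming NumPy; the random array below stands in for a resized photo), the Tiny Image representation is just the flattening of a 32x32 color thumbnail into one long vector:

```python
import numpy as np

# A hypothetical 32x32 RGB thumbnail; random values stand in for a
# resized image, as in the Tiny Image representation.
thumb = np.random.rand(32, 32, 3)

# Concatenate the raw pixels into a single 1-D feature vector.
feature = thumb.reshape(-1)   # 32 * 32 * 3 = 3072 dimensions
```

Every pixel becomes one coordinate, so two images must be well aligned for this feature to be comparable.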

SLIDE 7

From Pixels to Histograms

The color histogram [Swain and Ballard 91] was proposed to model the distribution of colors in an image (one histogram per r, g, b channel).

We can extend the color histogram to:

  • Edge histogram
  • Shape context histogram
  • Local binary patterns (LBP)
  • Histogram of gradients

(Two visually different images can still have very similar color histograms.)

Unlike raw-pixel vectors, histograms are not sensitive to:

  • misalignment
  • scale transform
  • global rotation
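A per-channel color histogram of this kind can be sketched as follows (a hedged NumPy example, not any particular system's implementation; normalization makes it comparable across image sizes, and the check at the end illustrates insensitivity to global rotation):

```python
import numpy as np

def color_histogram(img, bins=8):
    """Per-channel color histogram of an RGB image with values in [0, 1]."""
    hists = [np.histogram(img[..., c], bins=bins, range=(0.0, 1.0))[0]
             for c in range(3)]
    h = np.concatenate(hists).astype(float)
    return h / h.sum()   # normalize so image size does not matter

img = np.random.rand(64, 48, 3)
h = color_histogram(img)

# A globally rotated image contains the same pixel values,
# so it yields exactly the same histogram.
h_rot = color_histogram(np.rot90(img))
```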
SLIDE 8

From Histogram to Spatialized Histogram

Problem of histograms: no spatial information! Two very different images can produce exactly the same histogram (example thanks to Erik Learned-Miller).

Remedies:

  • Histograms of spatial cells (e.g., [Ojala et al, PAMI’02])
  • Spatial pyramid matching [Lazebnik et al, CVPR’06]
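The spatialized histogram can be sketched as follows (an illustrative NumPy example; the function name `grid_histograms` and the pyramid levels 1x1, 2x2, 4x4 are assumptions, not from the slide):

```python
import numpy as np

def grid_histograms(gray, levels=(1, 2, 4), bins=8):
    """Concatenate intensity histograms over spatial cells at each pyramid level."""
    H, W = gray.shape
    feats = []
    for g in levels:                       # g x g grid at this level
        for i in range(g):
            for j in range(g):
                cell = gray[i * H // g:(i + 1) * H // g,
                            j * W // g:(j + 1) * W // g]
                hist, _ = np.histogram(cell, bins=bins, range=(0.0, 1.0))
                feats.append(hist / max(hist.sum(), 1))
    return np.concatenate(feats)

f = grid_histograms(np.random.rand(64, 64))
# (1 + 4 + 16) cells x 8 bins = 168 dimensions
```

Because each cell gets its own histogram, swapping image regions now changes the feature, which the global histogram could not detect.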

SLIDE 9

IBM IMARS Spatial Gridding

First place in the 1st and 2nd ImageCLEF Medical Image Classification challenges

Task: Determine which modality a medical image belongs to.

  • Images from PubMed articles
  • 31 categories (x-ray, CT, MRI, ultrasound, etc.)
SLIDE 10

IBM IMARS Spatial Gridding

First place in the 1st and 2nd ImageCLEF Medical Image Classification challenges: http://www.imageclef.org/2012/medical

SLIDE 11

Image Filters

  • In addition to histograms, another group of features can be represented as “filters”. For example:
  • 1. Haar-like filters (Viola-Jones face detection)
  • 2. Gabor filters (simple cells in the visual cortex can be modeled by Gabor functions); widely used in fingerprint, iris, OCR, texture, and face recognition
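A Haar-like filter response can be computed in constant time from an integral image, as in Viola-Jones. A minimal NumPy sketch (the two-rectangle "left minus right" filter here is illustrative, not a specific Viola-Jones feature):

```python
import numpy as np

def integral_image(img):
    """Cumulative sums so any rectangle sum costs O(1)."""
    return img.cumsum(0).cumsum(1)

def box_sum(ii, r0, c0, r1, c1):
    """Sum of img[r0:r1, c0:c1] using the integral image ii."""
    s = ii[r1 - 1, c1 - 1]
    if r0 > 0:
        s -= ii[r0 - 1, c1 - 1]
    if c0 > 0:
        s -= ii[r1 - 1, c0 - 1]
    if r0 > 0 and c0 > 0:
        s += ii[r0 - 1, c0 - 1]
    return s

# A two-rectangle Haar-like response: left half minus right half.
img = np.random.rand(24, 24)
ii = integral_image(img)
resp = box_sum(ii, 0, 0, 24, 12) - box_sum(ii, 0, 12, 24, 24)
```

The O(1) rectangle sums are what make evaluating thousands of such filters per window affordable in a cascaded detector.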

SLIDE 12

SIFT Feature and Bag-of-Words Model

Classical features:

  • Raw pixels
  • Histogram features
    – Color histogram
    – Edge histogram
  • Frequency analysis
  • Image filters
  • Texture features
    – LBP
  • Scene features
    – GIST
  • Shape descriptors
  • Edge detection
  • Corner detection

1999: SIFT features and beyond

  • DoG
  • Hessian detector
  • Harris-Laplace
  • FAST
  • ORB
  • SIFT
  • HOG
  • SURF
  • DAISY
  • BRIEF

SLIDE 13

Scale-Invariant Feature Transform (SIFT)

David G. Lowe:

  • Object recognition from local scale-invariant features, ICCV 1999
  • Distinctive image features from scale-invariant keypoints, IJCV 2004

SIFT descriptor: histograms of gradient orientations, concatenated over spatial cells.

  • Histograms are more robust to position than raw pixels
  • Edge gradients are more distinctive than color for local patches

David Lowe’s excellent performance tuning:

  • Good parameters: 4 orientations, 4 x 4 grid
  • Soft assignment to spatial bins
  • Gaussian weighting over spatial location
  • Reduce the influence of large gradient magnitudes: thresholding + normalization
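A simplified SIFT-style descriptor following this recipe (4 orientations, 4x4 grid, clipping plus renormalization) might look like the NumPy sketch below; it is illustrative only, omitting Lowe's soft assignment and Gaussian spatial weighting:

```python
import numpy as np

def sift_like_descriptor(patch, grid=4, n_ori=4):
    """Orientation histograms over a grid x grid spatial layout (SIFT-style)."""
    gy, gx = np.gradient(patch)
    mag = np.hypot(gx, gy)
    ori = np.mod(np.arctan2(gy, gx), 2 * np.pi)      # orientation in [0, 2*pi)
    obin = np.minimum((ori / (2 * np.pi) * n_ori).astype(int), n_ori - 1)
    H, W = patch.shape
    desc = np.zeros((grid, grid, n_ori))
    for i in range(H):                               # hard-assign each pixel's
        for j in range(W):                           # gradient to a cell/bin
            desc[i * grid // H, j * grid // W, obin[i, j]] += mag[i, j]
    desc = desc.ravel()
    # Lowe-style robustness: normalize, clip large values, renormalize.
    desc /= max(np.linalg.norm(desc), 1e-12)
    desc = np.minimum(desc, 0.2)
    return desc / max(np.linalg.norm(desc), 1e-12)

d = sift_like_descriptor(np.random.rand(16, 16))
# 4 x 4 grid x 4 orientations = 64 dimensions
```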

SLIDE 14

Scale-Invariant Feature Transform (SIFT)

David G. Lowe:

  • Object recognition from local scale-invariant features, ICCV 1999
  • Distinctive image features from scale-invariant keypoints, IJCV 2004

SIFT detector: detect maxima and minima of the difference-of-Gaussian (DoG) in scale space. Post-processing: keep corner points but reject low-contrast and edge points.

  • In general object recognition, we may combine multiple detectors (e.g., Harris, Hessian) or use dense sampling for good performance.
  • Following SIFT, many works, including SURF, BRIEF, ORB, and BRISK, have been proposed for faster local feature extraction.
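The DoG detector can be sketched as follows (assuming SciPy; the sigma values and contrast threshold are illustrative, and the edge-point rejection step is omitted):

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter, minimum_filter

def dog_keypoints(img, sigmas=(1.0, 1.6, 2.6, 4.2), thresh=0.02):
    """Scale-space maxima/minima of difference-of-Gaussian, with a contrast test."""
    blurred = np.stack([gaussian_filter(img, s) for s in sigmas])
    dog = blurred[1:] - blurred[:-1]                   # DoG stack: (S-1, H, W)
    is_max = maximum_filter(dog, size=3) == dog        # 3x3x3 local maxima
    is_min = minimum_filter(dog, size=3) == dog        # ... and minima
    keep = (is_max | is_min) & (np.abs(dog) > thresh)  # reject low contrast
    keep[0], keep[-1] = False, False                   # need neighbors in scale
    return np.argwhere(keep)                           # rows: (scale, row, col)

# A single bright blob should yield a keypoint near its center.
yy, xx = np.mgrid[0:64, 0:64]
blob = np.exp(-((yy - 32) ** 2 + (xx - 32) ** 2) / (2 * 2.0 ** 2))
kps = dog_keypoints(blob)
```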

SLIDE 15

Histogram of Local Features and Bag-of-Words Models

SLIDE 16

Histogram of Local Features

Each image is represented as a histogram over codewords: the frequency of each codeword among the image's local features.

dim = #codewords
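The codeword histogram can be sketched as follows (an illustrative NumPy example; in practice the codebook comes from k-means on training descriptors, while here random centers stand in):

```python
import numpy as np

def bow_histogram(descriptors, codebook):
    """Hard-assign each local descriptor to its nearest codeword, then count."""
    # Pairwise squared distances, shape (n_descriptors, n_codewords)
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    assign = d2.argmin(1)                        # index of nearest codeword
    hist = np.bincount(assign, minlength=len(codebook)).astype(float)
    return hist / hist.sum()                     # dim = #codewords

rng = np.random.default_rng(0)
codebook = rng.normal(size=(50, 128))            # stand-in for k-means centers
descs = rng.normal(size=(200, 128))              # local descriptors of one image
h = bow_histogram(descs, codebook)
```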

SLIDE 17

Histogram of Local Features + Spatial Gridding

Compute one codeword histogram per spatial grid cell and concatenate them.

dim = #codewords x #grids

SLIDE 18

Bag of Words Models

SLIDE 19

Bag-of-Words Representation

An object is modeled as a bag of visual “words”, just as a document is modeled as a bag of words in text and NLP. (Slide credit: Fei-Fei Li)

SLIDE 20

Topic Models for Bag-of-Words Representations

  • Unsupervised classification: Sivic et al., ICCV 2005
  • Supervised classification: Fei-Fei et al., CVPR 2005
  • Classification + segmentation: Cao and Fei-Fei, ICCV 2007

SLIDE 21

Pros and Cons of Bag-of-Words Models

Bag-of-words models are good at:

  • Modeling prior knowledge
  • Providing intuitive interpretation

But these models suffer from:

  • Loss of spatial information
  • Loss of information in the quantization into “visual words”

Images differ from texts! We need better coding approaches.
SLIDE 22

Sparse Coding

SLIDE 23

Sparse Coding

  • The naïve histogram uses vector quantization (VQ) as a hard assignment, while sparse coding provides a soft assignment.
  • Sparse coding: an l1 approximation of the l0 norm (sparse solution), i.e., min_a 0.5 ||x - Da||^2 + lambda ||a||_1
  • SC works better with max pooling (while traditional VQ uses average pooling).
  • References: [M. Ranzato et al, CVPR’07], [J. Yang et al, CVPR’09], [J. Wang et al, CVPR’10], [Y. Boureau et al, CVPR’10]
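A minimal sparse-coding sketch using ISTA (iterative soft-thresholding) to solve the l1-relaxed problem; this is illustrative NumPy code, not the solver used in the cited papers, and the dictionary here is random rather than learned:

```python
import numpy as np

def sparse_code(x, D, lam=0.1, n_iter=200):
    """ISTA for min_a 0.5*||x - D a||^2 + lam*||a||_1."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        g = D.T @ (D @ a - x)              # gradient of the smooth term
        a = a - g / L                      # gradient step
        a = np.sign(a) * np.maximum(np.abs(a) - lam / L, 0.0)  # soft threshold
    return a

rng = np.random.default_rng(0)
D = rng.normal(size=(64, 256))
D /= np.linalg.norm(D, axis=0)             # unit-norm codewords
x = rng.normal(size=64)
a = sparse_code(x, D)
# Soft assignment: a handful of active codewords rather than a single one.
```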

SLIDE 24

Sparse Coding + Spatial Pyramid

Yang et al, Linear Spatial Pyramid Matching using Sparse Coding for Image Classification, CVPR 2009

Sparse coding + spatial pyramid + linear SVM

SLIDE 25

Efficient Approach

Locality-constrained linear coding (LLC):

  • 1. Find the k nearest codewords to the query descriptor
  • 2. Compute the sparse code using only those k neighbors

Significantly faster than naïve SC, e.g., O(1000a) -> O(5a). For further speedup, we can replace SC with least-squares regression, and the top-k search can also be accelerated.

[J. Wang et al, CVPR’10]; Matlab implementation: http://www.ifp.illinois.edu/~jyang29/LLC.htm
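The two steps above can be sketched with the analytical local least-squares solution (a hedged NumPy sketch in the spirit of LLC, not the authors' Matlab code; the regularization constant is illustrative):

```python
import numpy as np

def llc_code(x, codebook, k=5):
    """Locality-constrained coding: least squares over the k nearest codewords."""
    d2 = ((codebook - x) ** 2).sum(1)
    nn = np.argsort(d2)[:k]                 # step 1: k nearest codewords
    B = codebook[nn] - x                    # shift the local base to the query
    C = B @ B.T + 1e-6 * np.eye(k)          # local covariance (regularized)
    w = np.linalg.solve(C, np.ones(k))      # step 2: solve the small LS system
    w /= w.sum()                            # codes sum to one (shift invariance)
    code = np.zeros(len(codebook))
    code[nn] = w                            # all other coefficients stay zero
    return code

rng = np.random.default_rng(0)
codebook = rng.normal(size=(1000, 128))
x = rng.normal(size=128)
c = llc_code(x, codebook, k=5)
```

Only a k x k system is solved per descriptor, which is where the O(1000a) -> O(5a) speedup comes from.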

SLIDE 26

Sparse Codes Are Not Necessarily Sparse

  • Hard quantization (VQ) gives the sparsest solution: a single nonzero coefficient per descriptor.
  • Sparse coding is less sparse: several coefficients are active.
  • After pooling, the image-level representation is not sparse at all.

Is the success of SC really due to sparsity?

SLIDE 27

Fisher Vector and Super Vector

SLIDE 28

Information Loss

  • Coding with information loss: VQ and sparse coding summarize each descriptor by a scalar coefficient per codeword.
  • Lossless coding: keep, for each codeword, a function of the descriptor rather than a scalar.
  • The significant difference: SC or VQ attaches a scalar to each codeword; lossless coding attaches a function!

SLIDE 29

Lossless Coding as a Mixture of Experts

  • Let’s look at each codeword as a “local expert”: Expert 1, Expert 2, Expert 3, ..., combined by a gating function (e.g., GMM, sparse GMM, harmonic k-means, etc.)
SLIDE 30

Pooling Towards an Image-Level Representation

For each mixture component, pool the gated contributions of all local descriptors, then normalize and concatenate the per-component results.

Both the Fisher vector and the super vector can be written in this form (with different subtraction, normalization, and scaling factors).

Related references:

  • Fisher vector [Perronnin et al, ECCV’10]
  • Super vector [X. Zhou, K. Yu, T. Zhang et al, ECCV’10]
  • HG [X. Zhou et al, ECCV’09]
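A super-vector-style pooling step can be sketched as follows (illustrative NumPy code assuming an isotropic GMM as the gating function; the actual Fisher vector and super vector add normalization terms not shown here):

```python
import numpy as np

def super_vector(X, means, weights):
    """Pool descriptors into per-component aggregated differences (SV-style)."""
    # Soft gating: posterior of each descriptor under each component
    # (isotropic Gaussians; shift by the row minimum for numerical stability).
    d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(-1)
    post = np.exp(-0.5 * (d2 - d2.min(1, keepdims=True))) * weights
    post /= post.sum(1, keepdims=True)
    # For each component c: sum_n post[n, c] * (x_n - mu_c), then concatenate.
    sv = np.einsum('nc,ncd->cd', post, X[:, None, :] - means[None, :, :])
    return sv.ravel()                        # dim = C x d

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))                # local descriptors of one image
means = rng.normal(size=(16, 8))             # GMM component means (stand-ins)
sv = super_vector(X, means, np.full(16, 1 / 16))
```

Each component contributes a d-dimensional function of the descriptors (the aggregated difference), not a scalar, which is the "lossless" idea above.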

SLIDE 31

Pooling Towards an Image-Level Representation

Big model: the dimension becomes C (#components) x d (#feature dims). For example, with C = 1000 and d = 128, the final dimension is 128K, which is 100+ times longer than that from SC or VQ!

SLIDE 32

Very Long Vector as Feature Representation

We can generate a very long image feature vector as discussed before. The strong feature we used for ImageNet LSVRC 2010:

  • Dense sampling: LBP + HOG, feature dim = 100 (after PCA)
  • GMM with 1024 components
  • 4 spatial grids (1 + 3x1)
  • Dimension of the image feature: 100 x 1024 x 4 = 0.41M

SLIDE 33

How do we train such big models?

SLIDE 34

For Small Datasets: Use the Kernel Trick!

Kernel trick:

  • 10K images => kernel matrix: 10K x 10K ~ 100M entries
  • Computational complexity depends on the size of the kernel matrix, which can be smaller than the feature dimension

We tried nonlinear kernels for face verification and got good results on the LFW dataset: Learning Locally-Adaptive Decision Functions for Person Verification, CVPR’13 (with Z. Li, S. Chang, F. Liang, T. Huang, and J. Smith)

SLIDE 35

For Large Datasets: Use Stochastic Gradient Descent

  • Suppose we are working on ImageNet data using 0.4M-dimensional feature vectors.
  • Total training data: 1.2M x 0.4M ~ 0.5T real values!
    – Too big to load into memory
    – Too many samples to use kernel tricks
  • Solution: Stochastic Gradient Descent (SGD)
    – Idea: estimate the gradient on a randomly picked sample
    – Compared with batch gradient descent, each update uses one sample instead of a full pass over the data
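The comparison with gradient descent (the equation image is missing from this transcript) can be reconstructed as the standard update rules; this is a textbook reconstruction, not necessarily the exact formula on the slide:

```latex
% Batch gradient descent: average the gradient over all N samples per update
w_{t+1} = w_t - \eta_t \, \frac{1}{N} \sum_{i=1}^{N} \nabla_w \ell(w_t; x_i, y_i)

% Stochastic gradient descent: use one randomly picked sample i_t per update
w_{t+1} = w_t - \eta_t \, \nabla_w \ell(w_t; x_{i_t}, y_{i_t})
```

Each SGD step is N times cheaper, and the sampled gradient is an unbiased estimate of the full gradient.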

SLIDE 36

SGD Can Be Very Simple to Implement

A 10-line binary SVM solver by Shai Shalev-Shwartz, using a decreasing learning rate.
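The solver shown on the slide is not preserved in this transcript; below is a hedged Python sketch in the spirit of Pegasos (Shalev-Shwartz et al.), with a 1/(lambda*t) decreasing learning rate; the toy data and hyperparameters are illustrative:

```python
import numpy as np

def sgd_svm(X, y, lam=0.01, epochs=20, seed=0):
    """Pegasos-style SGD for a linear binary SVM (labels in {-1, +1})."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            t += 1
            eta = 1.0 / (lam * t)          # decreasing learning rate
            w *= (1 - eta * lam)           # gradient of the l2 regularizer
            if y[i] * (w @ X[i]) < 1:      # hinge loss is active
                w += eta * y[i] * X[i]
    return w

# Linearly separable toy data: the label is the sign of the first feature.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = np.sign(X[:, 0])
w = sgd_svm(X, y)
acc = np.mean(np.sign(X @ w) == y)
```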

SLIDE 37

Deep CNN and Related Tech

SLIDE 38

Deep CNN: A Bigger Model

Motivated by the studies of [Krizhevsky et al, NIPS’12] and [Y. LeCun et al, PIEEE’98], the deep convolutional neural network (CNN) became the newest winner of the ImageNet competition. The most popular CNN has:

  • 5 convolutional layers to learn filters
  • 2 fully connected layers
  • 60 million parameters
  • Stochastic gradient descent for training (again)

Why can we train such a big model now (and not in the 1990s)?

  • The rise of big datasets (ImageNet)
  • The blessing of GPU computing
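One convolution -> ReLU -> max-pooling stage of such a network can be sketched in NumPy (illustrative, with random untrained filters; real CNNs use GPU-optimized libraries and learn the filters by SGD):

```python
import numpy as np

def conv2d(x, filters, stride=1):
    """Valid convolution of an (H, W) input with (n, k, k) filters."""
    n, k, _ = filters.shape
    H, W = x.shape
    out_h, out_w = (H - k) // stride + 1, (W - k) // stride + 1
    out = np.empty((n, out_h, out_w))
    for f in range(n):
        for i in range(out_h):
            for j in range(out_w):
                patch = x[i * stride:i * stride + k, j * stride:j * stride + k]
                out[f, i, j] = (patch * filters[f]).sum()
    return out

def relu(x):
    return np.maximum(x, 0.0)

def max_pool(x, p=2):
    """Non-overlapping p x p max pooling per channel."""
    n, H, W = x.shape
    x = x[:, :H - H % p, :W - W % p]
    return x.reshape(n, H // p, p, W // p, p).max((2, 4))

# One stage: 8 random 5x5 filters on a 32x32 image -> (8, 14, 14) feature maps.
img = np.random.rand(32, 32)
feat = max_pool(relu(conv2d(img, np.random.randn(8, 5, 5))))
```

Stacking several such stages, followed by fully connected layers, gives the 60M-parameter architecture described above.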

SLIDE 39

Deep Learning Demo

http://smith-gpu.pok.ibm.com:8080/

SLIDE 40

Learning Representation From Big Data

Computer vision researchers have seen a big performance jump on large-scale datasets like ImageNet. Even earlier, researchers in speech/acoustics saw similar success in LVCSR and related tasks. In another field, text/NLP researchers are also moving quickly to large-scale learning: for example, the IBM Watson system used thousands of sub-systems to beat the human players in the Jeopardy! game.

Watson is hiring: www.ibm.com/watsonjobs

In particular, we are looking for winter interns to work on vision + NLP problems. Contact zhou@us.ibm.com

SLIDE 41

Conclusion

SLIDE 42

Conclusion

The mutual evolution of big data and big models:

  • Bigger and bigger models: Histogram -> Sparse coding (10K parameters) -> Super vector / Fisher vector (0.4M parameters) -> Deep CNN (60M parameters)
  • Bigger and bigger datasets: Small (e.g., Caltech101, 8K images) -> Medium (e.g., PASCAL, 10+K) -> Large (e.g., ImageNet, 1.2M)

Motivating questions:

  • How to develop scalable solutions for big data?
  • How to deal with situations with limited labeled data?

Please see the following talks for the answers!