SLIDE 1

Bag of Words Model

SLIDE 2

Overview of today’s lecture

  • Bag-of-words.
  • K-means clustering.
  • Classification.
  • K nearest neighbors.
  • Support vector machine.

SLIDE 3

Image Classification

SLIDE 4

Image Classification: Problem

SLIDE 5

Data-driven approach

  • Collect a database of images with labels
  • Use ML to train an image classifier
  • Evaluate the classifier on test images
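To make the three steps concrete, here is a minimal sketch using scikit-learn; `load_labeled_images` and `extract_features` are hypothetical placeholders standing in for your own data loading and feature extraction.

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

images, labels = load_labeled_images()           # 1. collect a labeled database
X = [extract_features(im) for im in images]      #    turn each image into a vector

X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2)

clf = KNeighborsClassifier(n_neighbors=1)        # 2. train an image classifier
clf.fit(X_train, y_train)

print(accuracy_score(y_test, clf.predict(X_test)))  # 3. evaluate on test images
```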
SLIDE 6

Bag of words

SLIDE 7

What object do these parts belong to?

SLIDE 8

An object as a collection of local features (bag-of-features)

  • deals well with occlusion
  • scale invariant
  • rotation invariant

Some local features are very informative.
SLIDE 9

(not so) crazy assumption

spatial information of local features can be ignored for object recognition (i.e., verification)

SLIDE 10

Works pretty well for image-level classification (Caltech-6 dataset).

Csurka et al. (2004), Willamowski et al. (2005), Grauman & Darrell (2005), Sivic et al. (2003, 2005)

SLIDE 11

Bag-of-features is an old idea (e.g., texture recognition and information retrieval): represent a data item (document, texture, image) as a histogram over features.

SLIDE 12

Texture recognition

[Figure: textures represented as histograms over a universal texton dictionary]

SLIDE 13

Vector Space Model

  • G. Salton, “Mathematics and Information Retrieval,” Journal of Documentation, 1979

[Figure: two newspaper snippets represented as word-count vectors over the terms Tartan, robot, CHIMP, CMU, bio, soft ankle sensor]

http://www.fodey.com/generators/newspaper/snippet.asp
SLIDE 14

A document (datapoint) is a vector of counts over each word (feature): it counts the number of occurrences, so it is just a histogram over words.

What is the similarity between two documents?

SLIDE 15

Use any distance you want, but the cosine distance is fast.
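A quick sketch of cosine similarity between two word-count vectors (the counts here are made up for illustration):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two count vectors; 1 = same direction."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

d1 = np.array([1.0, 6.0, 2.0, 1.0, 1.0])   # histogram of doc 1 over the vocabulary
d2 = np.array([4.0, 1.0, 4.0, 5.0, 3.0])   # histogram of doc 2
print(cosine_similarity(d1, d2))
```

The cosine distance is 1 minus this similarity; it is fast because for sparse count vectors only the shared nonzero terms contribute to the dot product.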

SLIDE 16

but not all words are created equal

SLIDE 17

TF-IDF

Weigh each word by a heuristic: term frequency times inverse document frequency.

$$\text{tf-idf}(w_i, d) = \underbrace{\text{tf}(w_i, d)}_{\text{term frequency}} \times \underbrace{\log \frac{\#\text{ of documents}}{\#\text{ of documents containing } w_i}}_{\text{inverse document frequency}}$$

(the inverse document frequency down-weights common terms)
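A minimal sketch of the weighting on a toy term-count matrix (rows are documents, columns are words; the numbers are illustrative):

```python
import numpy as np

counts = np.array([[1, 6, 0, 1],
                   [4, 0, 4, 5],
                   [2, 1, 0, 0]], dtype=float)

tf = counts / counts.sum(axis=1, keepdims=True)   # term frequency per document
df = (counts > 0).sum(axis=0)                     # # of documents containing w_i
idf = np.log(counts.shape[0] / df)                # inverse document frequency
tfidf = tf * idf                                  # common words get down-weighted
```

A word that appears in every document gets idf = log(1) = 0, so it contributes nothing to the similarity, which is exactly the down-weighting of common terms.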

SLIDE 18

Standard BOW pipeline

(for image classification)

SLIDE 19

  1. Dictionary learning: learn visual words using clustering
  2. Encode: build bag-of-words (BOW) vectors for each image
  3. Classify: train and test data using BOW vectors

SLIDE 20

Dictionary Learning: Learn Visual Words using clustering

  • 1. extract features (e.g., SIFT) from images
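A sketch of this step with OpenCV (assumes a build where `cv2.SIFT_create` is available, i.e., OpenCV 4.4+; the file name is illustrative):

```python
import cv2

sift = cv2.SIFT_create()
img = cv2.imread("image.jpg", cv2.IMREAD_GRAYSCALE)
keypoints, descriptors = sift.detectAndCompute(img, None)
# descriptors: (n_keypoints, 128) array, one SIFT vector per detected patch
```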
SLIDE 21

Dictionary Learning: Learn Visual Words using clustering

  • 2. Learn visual dictionary (e.g., K-means clustering)
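A sketch of this step with scikit-learn; `per_image_descriptors` (the output of step 1) and the dictionary size of 200 are assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

# Stack the SIFT descriptors of all training images into one big array,
# then cluster them: each centroid becomes one visual word.
all_descriptors = np.vstack(per_image_descriptors)   # list of (n_i, 128) arrays
kmeans = KMeans(n_clusters=200, n_init=10).fit(all_descriptors)
dictionary = kmeans.cluster_centers_                 # (200, 128) visual words
```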
SLIDE 22

What kinds of features can we extract?

SLIDE 23

  • Regular grid
    – Vogel & Schiele, 2003
    – Fei-Fei & Perona, 2005
  • Interest point detector
    – Csurka et al., 2004
    – Fei-Fei & Perona, 2005
    – Sivic et al., 2005
  • Other methods
    – Random sampling (Vidal-Naquet & Ullman, 2002)
    – Segmentation-based patches (Barnard et al., 2003)

SLIDE 24

Detect patches [Mikolajczyk & Schmid ’02] [Matas, Chum, Urban & Pajdla ’02] [Sivic & Zisserman ’03]

Normalize patch

Compute SIFT descriptor [Lowe ’99]

SLIDE 25

SLIDE 26

How do we learn the dictionary?

SLIDE 27

SLIDE 28

Clustering

SLIDE 29

Clustering

Visual vocabulary

SLIDE 30

K-means clustering

SLIDE 31

[ Stanford CS221 ]

SLIDE 32

K-means Clustering

Given k:
  1. Select initial centroids at random.
  2. Assign each object to the cluster with the nearest centroid.
  3. Compute each centroid as the mean of the objects assigned to it.
  4. Repeat the previous two steps until no assignment changes.
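A direct translation of those four steps into numpy (a bare-bones sketch; it assumes no cluster ever ends up empty):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # 1. random init
    for _ in range(n_iters):
        # 2. assign each object to the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. recompute each centroid as the mean of its assigned objects
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # 4. repeat until no change
            break
        centroids = new_centroids
    return centroids, labels
```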

SLIDE 33

From what data should I learn the dictionary?

  • Codebook can be learned on a separate training set
  • Provided the training set is sufficiently representative, the codebook will be “universal”

SLIDE 34

From what data should I learn the dictionary?

  • Dictionary can be learned on a separate training set
  • Provided the training set is sufficiently representative, the dictionary will be “universal”

SLIDE 35

Example visual dictionary

SLIDE 36

Example dictionary

Appearance codebook

Source: B. Leibe

SLIDE 37

Another dictionary

Appearance codebook

Source: B. Leibe

SLIDE 38

  1. Dictionary learning: learn visual words using clustering
  2. Encode: build bag-of-words (BOW) vectors for each image
  3. Classify: train and test data using BOW vectors

SLIDE 39

Encode: build bag-of-words (BOW) vectors for each image

  • 1. Quantization: each image feature gets associated to a visual word (nearest cluster center)

SLIDE 40

Encode: build bag-of-words (BOW) vectors for each image

  • 2. Histogram: count the number of visual word occurrences (see the sketch below)
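Both encoding steps in one short sketch (assumes the k-means `dictionary` from the previous stage):

```python
import numpy as np

def encode_bow(descriptors, dictionary):
    # 1. Quantization: nearest cluster center for each feature descriptor
    dists = np.linalg.norm(descriptors[:, None, :] - dictionary[None, :, :], axis=2)
    words = dists.argmin(axis=1)
    # 2. Histogram: count occurrences of each visual word, then normalize
    hist = np.bincount(words, minlength=len(dictionary)).astype(float)
    return hist / hist.sum()
```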
SLIDE 41

[Figure: the resulting BOW histogram, frequency vs. codewords]

SLIDE 42

  1. Dictionary learning: learn visual words using clustering
  2. Encode: build bag-of-words (BOW) vectors for each image
  3. Classify: train and test data using BOW vectors

SLIDE 43

  • K nearest neighbors
  • Naïve Bayes
  • Support vector machine

SLIDE 44

K nearest neighbors

SLIDE 45

Distribution of data from two classes

SLIDE 46

Distribution of data from two classes

Which class does q belong to?

SLIDE 47

Distribution of data from two classes

Look at the neighbors

SLIDE 48

K-Nearest Neighbor (KNN) Classifier

Non-parametric pattern classification approach. Consider a two-class problem where each sample consists of two measurements (x, y).

  • k = 1: for a given query point q, assign the class of the nearest neighbor.
  • k = 3: compute the k nearest neighbors and assign the class by majority vote.
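A sketch of the classifier on BOW vectors with scikit-learn (`X_train`, `y_train`, `X_query` are assumed from the encoding stage):

```python
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=3)  # k = 3: majority vote of 3 neighbors
knn.fit(X_train, y_train)                  # "training" just stores the data
predicted = knn.predict(X_query)           # label(s) for the query BOW vector(s)
```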

SLIDE 49

Nearest Neighbor is competitive

MNIST Digit Recognition
  – Handwritten digits
  – 28x28 pixel images: d = 784
  – 60,000 training samples
  – 10,000 test samples

Classifier                                  Test Error Rate (%)
Linear classifier (1-layer NN)              12.0
K-nearest-neighbors, Euclidean              5.0
K-nearest-neighbors, Euclidean, deskewed    2.4
K-NN, Tangent Distance, 16x16               1.1
K-NN, shape context matching                0.67
1000 RBF + linear classifier                3.6
SVM, degree-4 polynomial                    1.1
2-layer NN, 300 hidden units                4.7
2-layer NN, 300 HU, [deskewing]             1.6
LeNet-5, [distortions]                      0.8
Boosted LeNet-4, [distortions]              0.7

Source: Yann LeCun

SLIDE 50

What is the best distance metric between data points?

  • Typically Euclidean distance
  • Important to normalize: dimensions have different scales

How many k?

  • Typically k = 1 is good
  • Cross-validation (try different k!)
SLIDE 51

Distance metrics

  • Euclidean
  • Cosine
  • Chi-squared
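The slide lists the metrics without formulas; the standard definitions for two histograms $u$ and $v$ are:

$$d_{\mathrm{Euclidean}}(u,v) = \sqrt{\sum_i (u_i - v_i)^2}, \qquad d_{\mathrm{cosine}}(u,v) = 1 - \frac{u \cdot v}{\|u\|\,\|v\|}, \qquad d_{\chi^2}(u,v) = \frac{1}{2} \sum_i \frac{(u_i - v_i)^2}{u_i + v_i}$$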

SLIDE 52

Choice of distance metric

  • Hyperparameter
SLIDE 53

CIFAR-10 and NN results

SLIDE 54

SLIDE 55

SLIDE 56

Validation

SLIDE 57

Cross-validation

SLIDE 58

SLIDE 59

How to pick hyperparameters?

  • Methodology
    – Train and test
    – Train, validate, test
  • Train for the original model
  • Validate to find hyperparameters (see the sketch below)
  • Test to understand generalizability
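A sketch of the validation step, using 5-fold cross-validation on the training set to pick k (scikit-learn; the candidate values are arbitrary):

```python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# The test set stays untouched until the end; only training data is split here.
for k in [1, 3, 5, 7, 9]:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_train, y_train, cv=5)
    print(k, scores.mean())   # pick the k with the best validation accuracy
```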
SLIDE 60

kNN

Pros:
  • simple yet effective

Cons:
  • search is expensive (can be sped up)
  • storage requirements
  • difficulties with high-dimensional data

SLIDE 61

kNN: Complexity and Storage

  • N training images, M test images
  • Training: O(1)
  • Testing: O(MN)
  • Hmm… normally we need the opposite:
    – Slow training (ok), fast testing (necessary)

SLIDE 62

Support Vector Machine

SLIDE 63

Distribution of data from two classes

Which class does q belong to?

SLIDE 64

Distribution of data from two classes

Learn the decision boundary

SLIDE 65

First we need to understand hyperplanes…

SLIDE 66

Hyperplanes (lines) in 2D

A line can be written as a dot product plus a bias: $w^\top x + b = 0$. Another version: append a 1 to $x$ and push the bias inside $w$, giving $\tilde w^\top \tilde x = 0$ with $\tilde x = [x;\,1]$ and $\tilde w = [w;\,b]$.

SLIDE 67

Hyperplanes (lines) in 2D

The line $w^\top x + b = 0$ (offset/bias outside) and the line $\tilde w^\top \tilde x = 0$ (offset/bias inside) define the same line.

Important property: free to choose any normalization of $w$; the line $w^\top x + b = 0$ and the scaled line $\lambda(w^\top x + b) = 0$ define the same line.

SLIDE 68

What is the distance to origin? (hint: use normal form)

SLIDE 69

Scale by $\frac{1}{\|w\|}$ and you get the normal form $\frac{w^\top x + b}{\|w\|} = 0$; the distance to the origin is $\frac{|b|}{\|w\|}$.

SLIDE 70

What is the distance between two parallel lines? (hint: use distance to origin)

SLIDE 71

Distance between two parallel lines: the difference of their distances to the origin.
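Worked out: each line $w^\top x + b_j = 0$ lies at distance $|b_j| / \|w\|$ from the origin, so

$$d = \frac{|b_1 - b_2|}{\|w\|},$$

and for the two margin hyperplanes $w^\top x + b = \pm 1$ (i.e., $b_{1,2} = b \mp 1$) this gives $d = \frac{2}{\|w\|}$, the margin maximized later in the lecture.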

SLIDE 72

Hyperplanes (planes) in 3D

Now we can go to 3D: $w^\top x + b = 0$. What are the dimensions of this vector? What happens if you change b?

SLIDE 73

Hyperplanes (planes) in 3D

SLIDE 74

Hyperplanes (planes) in 3D

What’s the distance between these parallel planes?

SLIDE 75

Hyperplanes (planes) in 3D

SLIDE 76

What’s the best w?

SLIDE 77

SLIDE 78

SLIDE 79

SLIDE 80
SLIDE 81

What’s the best w? Intuitively, the line that is the farthest from all interior points

SLIDE 82

What’s the best w? Maximum Margin solution: most stable to perturbations of data

SLIDE 83

What’s the best w? Want a hyperplane that is far away from ‘inner points’

support vectors

SLIDE 84

Find hyperplane w such that the margin (the gap between the parallel hyperplanes) is maximized.

SLIDE 85

Can be formulated as a maximization problem:

$$\max_{w,b} \frac{2}{\|w\|} \quad \text{subject to} \quad y_i\,(w^\top x_i + b) \ge 1 \;\;\forall i$$

where $y_i \in \{+1, -1\}$ is the label of the data point. Why is it +1 and -1? What does this constraint mean?

SLIDE 86

Equivalently, a minimization problem:

$$\min_{w,b} \|w\|^2 \quad \text{subject to} \quad y_i\,(w^\top x_i + b) \ge 1 \;\;\forall i$$

Where did the 2 go? What happened to the labels?

SLIDE 87

‘Primal formulation’ of a linear SVM:

Objective function: $\min_{w,b} \frac{1}{2}\|w\|^2$
Constraints: $y_i\,(w^\top x_i + b) \ge 1 \;\;\forall i$

This is a convex quadratic programming (QP) problem (a unique solution exists).

SLIDE 88

‘soft’ margin

SLIDE 89

What’s the best w?

SLIDE 90

What’s the best w?

Very narrow margin

SLIDE 91

Separating cats and dogs

Very narrow margin

SLIDE 92

‘Primal formulation’ of a linear SVM:

Objective function: $\min_{w,b} \frac{1}{2}\|w\|^2$
Hard constraints! $y_i\,(w^\top x_i + b) \ge 1 \;\;\forall i$

SLIDE 93

What’s the best w?

Very narrow margin

Intuitively, we should allow for some misclassification if we can get more robust classification

SLIDE 94

What’s the best w?

Trade-off between the MARGIN and the MISTAKES (might be a better solution)

SLIDE 95

Adding slack variables $\xi_i \ge 0$: the slack measures how far a point falls on the wrong side of its margin (e.g., a misclassified point).

SLIDE 96

‘soft’ margin

Objective: $\min_{w,b,\xi} \frac{1}{2}\|w\|^2 + C \sum_i \xi_i$

subject to $y_i\,(w^\top x_i + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$ for all $i$

SLIDE 97

‘soft’ margin

Objective: $\min_{w,b,\xi} \frac{1}{2}\|w\|^2 + C \sum_i \xi_i$

subject to $y_i\,(w^\top x_i + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$ for all $i$

The slack variable allows for mistakes, as long as the inverse margin is minimized.

SLIDE 98

‘soft’ margin

Objective: $\min_{w,b,\xi} \frac{1}{2}\|w\|^2 + C \sum_i \xi_i$ subject to $y_i\,(w^\top x_i + b) \ge 1 - \xi_i$, $\xi_i \ge 0$ for all $i$

  • Every constraint can be satisfied if the slack is large
  • C is a regularization parameter
    – Small C: constraints easy to ignore (large margin)
    – Big C: constraints hard to ignore (small margin)
  • Still a QP problem (unique solution)
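A sketch of the effect of C with scikit-learn's SVC (linear kernel; the data arrays are assumed):

```python
from sklearn.svm import SVC

# Small C tolerates slack (wider margin, more mistakes allowed);
# big C makes slack expensive (narrow margin, fewer mistakes allowed).
for C in [0.01, 1.0, 100.0]:
    svm = SVC(kernel="linear", C=C).fit(X_train, y_train)
    print(C, svm.score(X_test, y_test))
```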
SLIDE 99

SLIDE 100

SLIDE 101

Multi-class case

Cat Dog Airplane Chair

SLIDE 102

Multi-class case: train one SVM per class (one vs. the rest)

  • Cat SVM: Cat vs. {Dog, Airplane, Chair}
  • Dog SVM: Dog vs. {Cat, Airplane, Chair}
  • Airplane SVM: Airplane vs. {Cat, Dog, Chair}
  • Chair SVM: Chair vs. {Cat, Dog, Airplane}

SLIDE 103

Multi-class case: each per-class SVM scores the query; pick the class with the highest score.

  • Cat SVM: 0.5
  • Dog SVM: 0.9
  • Airplane SVM: 0.1
  • Chair SVM: 0.2
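A sketch of this one-vs-rest scheme with scikit-learn (`LinearSVC` trains one binary SVM per class by default; the data arrays are assumed):

```python
import numpy as np
from sklearn.svm import LinearSVC

clf = LinearSVC().fit(X_train, y_train)        # y_train: cat/dog/airplane/chair
scores = clf.decision_function(X_query)        # one score per class per query
predicted = clf.classes_[np.argmax(scores, axis=1)]  # pick the highest score
```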

SLIDE 104

References

Basic reading:

  • Szeliski, Chapter 14.