SLIDE 1

CS395T: Visual Recognition and Search Leveraging Internet Data

Birgi Tamersoy March 27, 2009

SLIDE 2

Theme I

  • L. Lazebnik
SLIDE 3

Theme II

  • L. Lazebnik
SLIDE 4

Theme III

  • K. Grauman
SLIDE 5

Outline

◮ Scene Segmentation Using the Wisdom of Crowds by I. Simon and S. M. Seitz
◮ World-scale Mining of Objects and Events from Community Photo Collections by T. Quack, B. Leibe and L. Van Gool
◮ 80 Million Tiny Images: a Large Dataset for Non-parametric Object and Scene Recognition by A. Torralba, R. Fergus and W. T. Freeman

SLIDE 6

Introduction [Wisdom of Crowds]

Goal

Given a set of images of a static scene, identify and segment the interesting objects in the scene.

Observations

◮ The distribution of photos in a collection holds valuable semantic information.
◮ Interesting objects will be frequently photographed.
◮ Detecting interesting features is straightforward, but identifying interesting objects is more challenging.
◮ Features on the same object will appear together in many photos.

Field-of-view cue

Co-occurrence information is used to group features into objects.

SLIDE 7

Big Picture

SLIDE 8

Spatial Cues I

Algorithm

1. Find feature points in each image using the SIFT keypoint detector.
2. For each pair of images, match the detected feature points.
3. Robustly estimate a fundamental matrix for the pair using RANSAC (RAndom SAmple Consensus) and remove the outliers (see the sketch below).
4. Organize the matches into tracks.
   ◮ A track is a connected set of matching keypoints across multiple images.
5. Recover camera parameters and a 3D location for each track.
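A minimal OpenCV sketch of steps 1-3 (an illustration, not the authors' implementation; the filenames are placeholders):

```python
import cv2
import numpy as np

img1 = cv2.imread("view1.jpg", cv2.IMREAD_GRAYSCALE)  # placeholder inputs
img2 = cv2.imread("view2.jpg", cv2.IMREAD_GRAYSCALE)

# 1. Detect SIFT keypoints and compute descriptors.
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# 2. Match descriptors, keeping confident matches (Lowe's ratio test).
matches = cv2.BFMatcher().knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]

# 3. Robustly estimate the fundamental matrix with RANSAC; the mask
#    flags the inlier matches, so outliers are simply dropped.
pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
pts2 = np.float32([kp2[m.trainIdx].pt for m in good])
F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 3.0, 0.99)
inliers = [m for m, keep in zip(good, mask.ravel()) if keep]
```

Steps 4-5 (track building and recovering cameras and 3D points) are the structure-from-motion stage of Photo Tourism (Snavely et al.).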
SLIDE 9

Spatial Cues II

Snavely et al.

◮ A single 3D Gaussian distribution per object to enforce spatial cues.
◮ A mixture of Gaussians to model the spatial cues from multiple objects:

P(C, X | π, µ, Σ) = ∏_j P(c_j | π) P(x_j | c_j, µ, Σ)

◮ A class variable c_j, drawn from a multinomial distribution, is associated with each point x_j.
◮ Point locations are drawn from 3D Gaussians, where the point class determines which Gaussian to use.
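As a rough illustration (not the paper's code), such a mixture can be fit to a reconstructed point cloud with scikit-learn; the points below are a random stand-in:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

points = np.random.rand(500, 3)  # stand-in for reconstructed (x, y, z) points

# One full-covariance 3D Gaussian per object class; the mixing weights
# play the role of the multinomial parameters π.
gmm = GaussianMixture(n_components=5, covariance_type="full").fit(points)
labels = gmm.predict(points)      # hard class assignment c_j per point
resp = gmm.predict_proba(points)  # soft assignments P(c_j | x_j)
```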

SLIDE 10

Field-of-view Cues

pLSA

Co-occurrence information is modeled by Probabilistic Latent Semantic Analysis (pLSA):

P(C, X | θ, φ) = ∏_i ∏_{j : x_j ∈ V_i} P(c_ij | θ_i) P(x_ij | c_ij, φ)

◮ A class variable c_ij is introduced for each point-image incidence (each point x_j visible in image i, i.e. x_j ∈ V_i).
◮ In the original pLSA, x_ij would be the number of times word j appears in document i.
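For intuition, a compact EM loop for the original word-document pLSA mentioned in the last bullet (toy random counts; the paper's field-of-view variant replaces documents with images and words with scene points):

```python
import numpy as np

rng = np.random.default_rng(0)
n_docs, n_words, n_topics = 20, 50, 3
X = rng.integers(0, 5, size=(n_docs, n_words)).astype(float)  # counts x_ij

p_z_d = rng.random((n_docs, n_topics))   # P(z | d), the θ_i
p_z_d /= p_z_d.sum(1, keepdims=True)
p_w_z = rng.random((n_topics, n_words))  # P(w | z), the φ
p_w_z /= p_w_z.sum(1, keepdims=True)

for _ in range(50):
    # E-step: responsibilities P(z | d, w) for every doc-word pair.
    joint = p_z_d[:, :, None] * p_w_z[None, :, :]        # (d, z, w)
    resp = joint / (joint.sum(1, keepdims=True) + 1e-12)
    # M-step: re-estimate both distributions from expected counts.
    counts = X[:, None, :] * resp                        # (d, z, w)
    p_z_d = counts.sum(2)
    p_z_d /= p_z_d.sum(1, keepdims=True) + 1e-12
    p_w_z = counts.sum(0)
    p_w_z /= p_w_z.sum(1, keepdims=True) + 1e-12
```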

SLIDE 11

Combined Model

Simon and Seitz

P(C, X | θ, π, µ, Σ) = ( ∏_i ∏_{j : x_j ∈ V_i} P(c_ij | θ_i) ) × ( ∏_j P(c_j | π) P(x_j | c_j, µ, Σ) )

◮ This joint density is locally maximized using the EM algorithm.

SLIDE 12

Evaluation I

◮ For each test scene, the ground-truth clusterings C* are manually created.
◮ Three different models (mixture of Gaussians, pLSA and the combined model) are all tested.
◮ Computed clusterings are evaluated using Meila's Variation of Information (VI) metric (see the sketch below):

VI(C, C*) = H(C | C*) + H(C* | C)

◮ The two terms are conditional entropies: the information lost and the information gained between the two clusterings.

Simon and Seitz
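A direct sketch of the metric (clusterings given as label sequences over the same items; the inputs are toy data):

```python
import math
from collections import Counter

def variation_of_information(a, b):
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    vi = 0.0
    for (ca, cb), nab in pab.items():
        p = nab / n
        # Accumulates H(C|C*) + H(C*|C) over the joint distribution.
        vi -= p * (math.log(p * n / pa[ca]) + math.log(p * n / pb[cb]))
    return vi

print(variation_of_information([0, 0, 1, 1], [0, 1, 1, 1]))  # ~0.82
```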

SLIDE 13

Evaluation II

Simon and Seitz

SLIDE 14

Importance Viewer

◮ Interesting objects appear in many photos.
◮ Objects are penalized for their size, so that large background objects are not rewarded:

imp(c) = (1 / |Σ_c|^α) ∑_i θ_i(c)

Simon and Seitz

SLIDE 15

Region Labeling

◮ Image tags on the Internet are very noisy.
◮ Accurate tags can be computed by examining tag-cluster co-occurrence statistics.
◮ The score of each cluster c and tag t pair is given by (see the sketch below):

score(c, t) = P(c, t) (log P(c, t) − log P(c)P(t))

Simon and Seitz
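An illustrative computation of this score from raw (cluster, tag) co-occurrences; the observations below are hypothetical:

```python
import math
from collections import Counter

# One (cluster_id, tag) pair per tagged photo assigned to a cluster.
obs = [(0, "duomo"), (0, "florence"), (1, "florence"), (1, "arno"), (0, "duomo")]

n = len(obs)
p_ct = Counter(obs)               # joint counts for P(c, t)
p_c = Counter(c for c, _ in obs)  # marginal counts for P(c)
p_t = Counter(t for _, t in obs)  # marginal counts for P(t)

def score(c, t):
    pct = p_ct[(c, t)] / n        # assumes the pair was observed
    return pct * (math.log(pct) - math.log((p_c[c] / n) * (p_t[t] / n)))

print(score(0, "duomo"))
```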

SLIDE 16

Interactive Map Viewer

◮ After the scene is segmented, the scene points are manually aligned with an overhead view.

Simon and Seitz

SLIDE 17

Summary

◮ The field-of-view cue is introduced.
◮ Field-of-view cues are combined with spatial cues to identify the interesting objects of a scene.
◮ Source of the information: the distribution of photos, i.e. the wisdom of crowds.
SLIDE 18

Introduction [World-scale Object Mining]

Goal

Automated collection of a high-quality image database with correct annotations.

Observations

◮ Large databases of visual data are available from community photo collections.
◮ More and more images are geotagged.
◮ Geotags and textual tags are sparse and noisy.

SLIDE 19

Big Picture

SLIDE 20

Gathering the Data

Quack et al.

◮ Earth's surface is divided into tiles.
◮ High overlap between tiles.
◮ 70,000 tiles are processed (52,000 containing no images at all).

SLIDE 21

Photo Clustering

1. Dissimilarity matrices are computed for several modalities:
   ◮ Visual cues.
   ◮ Textual cues.
   ◮ (User/timestamp cues.)
2. A hierarchical clustering step is used to create clusters of photos for the same object or event.

SLIDE 22

Visual Features and Similarity I

1. Extract SURF features from each photo of the tile.
2. For each pair of images, find the matching features.
3. Estimate a homography H between the two images using RANSAC (see the sketch below).
4. Create the distance matrix using the number of "inlier" feature matches I_ij for each image pair:

d_ij = I_ij / I_max   if I_ij ≥ 10
d_ij = ∞              if I_ij < 10
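A sketch of steps 1-3 for one image pair (illustrative only: SURF requires an opencv-contrib build, the filenames are placeholders, and I_max would come from scanning all pairs in the tile):

```python
import cv2
import numpy as np

img1 = cv2.imread("photo1.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("photo2.jpg", cv2.IMREAD_GRAYSCALE)

# 1. SURF keypoints and descriptors (needs opencv-contrib).
surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)
kp1, des1 = surf.detectAndCompute(img1, None)
kp2, des2 = surf.detectAndCompute(img2, None)

# 2. Match features with a ratio test.
matches = cv2.BFMatcher().knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]

# 3. Estimate homography H with RANSAC and count inlier matches I_ij.
src = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
dst = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
H, mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
I_ij = int(mask.sum())

# 4. One entry of the distance matrix (I_max assumed known).
I_max = 200
d_ij = I_ij / I_max if I_ij >= 10 else float("inf")
```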

SLIDE 23

Visual Features and Similarity II

Speeded-Up Robust Features by Bay et al.

◮ Scale- and rotation-invariant detector and descriptor.
◮ At each step, integral images are used to get very fast detections and descriptions.
◮ A box-filter approximation of the Hessian matrix is used as the underlying filter.
◮ The 64-dimensional SURF descriptor describes the distribution of the intensity content within the interest point neighborhood.

Bay et al.

SLIDE 24

Visual Features and Similarity III

Homography

  • K. Grauman

p′ = Hp, i.e.

    [w·x′]   [ a b c ] [x]
    [w·y′] = [ d e f ] [y]
    [ w  ]   [ g h 1 ] [1]

RANdom SAmple Consensus

  • K. Grauman
SLIDE 25

Text Features and Similarity

1. Three pieces of meta-data (tags, title and description) are combined to form a single text per image.
2. Image-specific stop lists are applied.
3. Pairwise text similarities are computed to create the distance matrix.

Term weighting (see the sketch below)

w_i,j = L_i,j · G_i · N_j

L_i,j = log(tf_i,j + 1) / ∑_j log(tf_i,j + 1)   (local term-frequency weight)

G_i = log((D − d_i) / d_i)   (global, IDF-like weight)

N_j = U_j / (1 + 0.0115 · U_j)

where U_j is the number of unique terms in image j.
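An illustrative implementation under assumed symbol meanings (tf = per-image term frequencies, df = number of images containing a term, D = images in the tile, U = unique-term counts; the normalizing sum in L is taken here over the terms of one image):

```python
import math

def weight(term, img, tf, df, U, D):
    # Local log term-frequency weight, normalized over the image's terms.
    L = math.log(tf[(term, img)] + 1) / sum(
        math.log(f + 1) for (t, j), f in tf.items() if j == img)
    # Global IDF-like weight.
    G = math.log((D - df[term]) / df[term])
    # Normalization by the number of unique terms U_j.
    N = U[img] / (1 + 0.0115 * U[img])
    return L * G * N

# Hypothetical toy data: two images' texts inside a tile of D = 5 images.
tf = {("sunset", 0): 2, ("beach", 0): 1, ("sunset", 1): 1}
df = {"sunset": 2, "beach": 1}
U = {0: 2, 1: 1}
print(weight("sunset", 0, tf, df, U, D=5))
```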

SLIDE 26

Clustering

◮ Hierarchical agglomerative clustering is applied to the computed distance matrices with the following cut-off distances:

Quack et al.

◮ Three different linkage methods are employed in order to capture different visual properties (see the SciPy sketch below):

single-link:   d_AB = min_{i∈A, j∈B} d_ij
complete-link: d_AB = max_{i∈A, j∈B} d_ij
average-link:  d_AB = (1 / (n_A n_B)) ∑_{i∈A, j∈B} d_ij
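A SciPy sketch of the three linkages on a toy distance matrix, cut at an assumed threshold:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

D = np.array([[0., 1., 4.],
              [1., 0., 5.],
              [4., 5., 0.]])   # toy pairwise distance matrix
condensed = squareform(D)      # SciPy expects the condensed form

for method in ("single", "complete", "average"):
    Z = linkage(condensed, method=method)
    labels = fcluster(Z, t=2.0, criterion="distance")  # cut-off distance
    print(method, labels)
```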

SLIDE 27

Classification into Objects and Events

◮ Two features are extracted using only the meta-data of the images in a tile:
   ◮ The number of unique days on which the photos in a cluster were taken.
   ◮ The number of different users who “contributed” photos to this cluster, divided by the cluster size.
◮ An individual ID3 decision tree is trained for each class (see the sketch below).
◮ 88% precision for objects and 94% precision for events.

Quack et al.
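A rough sketch of the setup (scikit-learn trees implement CART with an entropy criterion rather than textbook ID3, and a single multiclass tree is used here for brevity; the features and labels are invented):

```python
from sklearn.tree import DecisionTreeClassifier

# Per-cluster features: [unique_days, n_users / cluster_size] (toy values;
# events concentrate on few days, objects accumulate photos over many days).
X = [[1, 0.9], [2, 0.8], [40, 0.3], [55, 0.2], [1, 0.95], [60, 0.25]]
y = ["event", "event", "object", "object", "event", "object"]

clf = DecisionTreeClassifier(criterion="entropy").fit(X, y)
print(clf.predict([[3, 0.7]]))  # few unique days -> likely an event
```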

SLIDE 28

Labeling the Objects

◮ “Correct” labels of a cluster are found using frequent itemset mining (see the sketch below).
◮ The top 15 itemsets are kept per cluster.

Frequent Itemset Mining

Let I = {i_1, ..., i_p} be a set of p words. The text of each image in the tile is a subset of I, T ⊆ I. The texts of all images in a tile form the database D. The goal is to find itemsets A ⊆ T with relatively high support:

supp(A) = |{T ∈ D : A ⊆ T}| / |D| ∈ [0, 1]
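A brute-force illustration of the support definition on tiny tag sets (real mining uses Apriori or FP-growth):

```python
from itertools import combinations

# Database D: each image's text as a set of words.
D = [{"eiffel", "tower", "paris"}, {"eiffel", "paris"}, {"paris", "night"}]

def support(A, D):
    return sum(1 for T in D if A <= T) / len(D)

items = set().union(*D)
frequent = [(set(A), support(set(A), D))
            for r in (1, 2)
            for A in combinations(sorted(items), r)
            if support(set(A), D) >= 0.5]
print(sorted(frequent, key=lambda x: -x[1])[:15])  # keep top itemsets
```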

SLIDE 29

Linking to Wikipedia

1. Each itemset is used as a query to Google (the search is limited to Wikipedia articles).
2. Images in the article are compared with the images in the cluster.
3. If there is a match, the query is kept as a label; otherwise it is rejected.

SLIDE 30

Experiments

◮ 70,000 tiles covering approximately 700 square kilometers.
◮ Over 220,000 images.
◮ Over 20,000,000 similarities (only 1 million of which are greater than 0).
◮ In the end, 73,000 images could be assigned to a cluster.

SLIDE 31

Object Clusters

Quack et al.

SLIDE 32

Event Clusters

Quack et al.

SLIDE 33

Linkage Methods

Single-link vs. complete-link clusters (figure: Quack et al.)

SLIDE 34

Summary

◮ The world's surface is divided into tiles.
◮ Images belonging to a tile are identified using geotags.
◮ These images are clustered.
◮ Clusters are classified as objects or events.
◮ Object labels are determined, and additional information from the Internet is linked to these objects.
◮ FULLY UNSUPERVISED!

SLIDE 35

Introduction [80 Million Tiny Images]

Goal

Creating an image database that densely populates the manifold of natural images, allowing the use of non-parametric approaches.

Observations

◮ Billions of images are available on the Internet.
◮ The human visual system has a remarkable tolerance to degradations in image resolution.
◮ The visual world is very regular, which significantly limits the space of possible images.

SLIDE 36

Big Picture

Torralba et al.

SLIDE 37

Low Dimensional Image Representation

Torralba et al.

◮ 32 × 32 color images contain enough information for scene recognition, object detection and segmentation (for humans).
◮ Two advantages of the low-resolution representation:
   ◮ The intrinsic dimensionality of the manifold gets much smaller.
   ◮ Storing and efficiently indexing vast amounts of data points becomes feasible.
◮ It is important that information is not lost while the dimensionality is reduced.

SLIDE 38

A Large Dataset of 32 × 32 Images I

1. 75,062 non-abstract nouns are extracted from Wordnet.
2. Seven independent search engines are queried for images belonging to each of these categories.
3. In 8 months, 97,245,098 images are collected.
4. Duplicates and uniform images are eliminated to form the final dataset of 79,302,017 images.

SLIDE 39

A Large Dataset of 32 × 32 Images II

Torralba et al.

Keywords

Around 10% of the keywords have very few images. Mean number of images per word: 1,056.

Labeling Noise

The dataset is not cleaned up; the visual content is often unrelated to the query word.

SLIDE 40

Dataset Statistics

Dataset Size

The distance between two images can be approximated using a few principal components.

Similarity Measures (see the sketch below)

D²_ssd = ∑_{x,y,c} (I_1(x, y, c) − I_2(x, y, c))²

D²_warp = ∑_{x,y,c} (I_1(x, y, c) − T_θ[I_2(x, y, c)])²

D²_shift = ∑_{x,y,c} (I_1(x, y, c) − Î_2(x + D_x, y + D_y, c))²
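A sketch of the simplest measure, D²_ssd, on random stand-in images (D²_warp and D²_shift additionally minimize over small warps T_θ and per-pixel shifts, omitted here):

```python
import numpy as np

rng = np.random.default_rng(0)
I1 = rng.random((32, 32, 3))  # two stand-in 32 x 32 color images
I2 = rng.random((32, 32, 3))

d_ssd = np.sum((I1 - I2) ** 2)  # sums over x, y and color channel c
print(d_ssd)
```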

SLIDE 41

Wordnet Voting Scheme in Recognition I

Recognition

Rather than relying on complex matching schemes, let the data do the work.

Wordnet Voting Scheme for Labeling Noise

◮ Given a query image, find the nearest neighbors using some similarity measure.
◮ Each neighbor votes for its branch in the Wordnet tree.
◮ Classification at a specific level is done with respect to the votes (see the sketch below).
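A toy sketch of the voting idea (the neighbor labels and their hand-written Wordnet-style paths below are hypothetical; a real system would look the paths up in Wordnet):

```python
from collections import Counter

# Each retrieved neighbor contributes its branch of the tree.
neighbor_paths = [
    ("entity", "organism", "person", "athlete"),
    ("entity", "organism", "person", "worker"),
    ("entity", "organism", "animal", "dog"),
]

# Classify at a chosen depth: take the most-voted node at that level.
level = 2
votes = Counter(p[level] for p in neighbor_paths if len(p) > level)
print(votes.most_common(1))  # [('person', 2)]
```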

SLIDE 42

Wordnet Voting Scheme in Recognition II

Torralba et al.

SLIDE 43

Person Detection I

◮ 23% of the images contain people.
◮ Hence, the corresponding region of the manifold is covered very densely.

Torralba et al.

SLIDE 44

Person Detection II

Torralba et al.

SLIDE 45

Person Localization

◮ Segment the input image using normalized cuts (10 segments).
◮ Query the dataset using cropped continuous segments.

Torralba et al.

SLIDE 46

Scene Recognition

◮ Scene images will have a high count accumulated at the “location” node of the Wordnet tree.

Torralba et al.

SLIDE 47

Automatic Image Annotation

Torralba et al.

SLIDE 48

Image Colorization

Torralba et al.

SLIDE 49

Detecting Image Orientation

Torralba et al.

SLIDE 50

Summary

◮ The data should do the work, not us!
◮ 32 × 32 color images are enough for most computer vision tasks.
◮ Cover the manifold of natural images densely, so that for every query image there will be a semantically very similar image in the database.

SLIDE 51

Final Word

◮ Wisdom of Crowds: importance viewer, region labeling, interactive map viewer.
◮ World-scale Mining of Objects: recognition, automatic annotation.
◮ 80 Million Tiny Images: detection, recognition, localization, automatic annotation, etc.

Dataset Size

1. Wisdom of Crowds
2. World-scale Mining of Objects
3. 80 Million Tiny Images

Complexity

1. 80 Million Tiny Images
2. World-scale Mining of Objects
3. Wisdom of Crowds
SLIDE 52

References

◮ Scene Segmentation Using the Wisdom of Crowds by I. Simon and S. M. Seitz
◮ Photo Tourism: Exploring Photo Collections in 3D by N. Snavely, S. M. Seitz and R. Szeliski
◮ Comparing Clusterings: an Information Based Distance by M. Meila
◮ World-scale Mining of Objects and Events from Community Photo Collections by T. Quack, B. Leibe and L. Van Gool
◮ Speeded-Up Robust Features (SURF) by H. Bay, A. Ess, T. Tuytelaars and L. Van Gool
◮ 80 Million Tiny Images: a Large Dataset for Non-parametric Object and Scene Recognition by A. Torralba, R. Fergus and W. T. Freeman
◮ Dr. Kristen Grauman's CS378 (Fall 2008) and CS395T (Spring 2009) lecture slides