SLIDE 1

CS395T: Visual Recognition and Search Leveraging Internet Data

Birgi Tamersoy March 27, 2009

SLIDE 2

Theme I

  • L. Lazebnik
SLIDE 3

Theme II

  • L. Lazebnik
SLIDE 4

Theme III

  • K. Grauman
SLIDE 5

Outline

◮ Scene Segmentation Using the Wisdom of Crowds by I. Simon and S. M. Seitz
◮ World-scale Mining of Objects and Events from Community Photo Collections by T. Quack, B. Leibe and L. Van Gool
◮ 80 Million Tiny Images: a Large Dataset for Non-parametric Object and Scene Recognition by A. Torralba, R. Fergus and W. T. Freeman

SLIDE 6

Introduction [Wisdom of Crowds]

Goal

Given a set of images of a static scene, identify and segment the interesting objects in the scene.

Observations

◮ The distribution of photos in a collection holds valuable semantic information.
◮ Interesting objects will be frequently photographed.
◮ Detecting interesting features is straightforward, but identifying interesting objects is more challenging.
◮ Features on the same object will appear together in many photos.

Field-of-view cue

Co-occurrence information is used to group features into objects.

SLIDE 7

Big Picture

SLIDE 8

Spatial Cues I

Algorithm

1. Find feature points in each image using the SIFT keypoint detector.
2. For each pair of images, match the detected feature points.
3. Robustly estimate a fundamental matrix for the pair using RANSAC (RAndom SAmple Consensus) and remove the outliers (see the sketch below).
4. Organize the matches into tracks.
   ◮ A track is a connected set of matching keypoints across multiple images.
5. Recover camera parameters and a 3D location for each track.
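A minimal OpenCV sketch of steps 1-3 (an illustration, not the authors' implementation; the filenames are placeholders):

```python
import cv2
import numpy as np

img1 = cv2.imread("view1.jpg", cv2.IMREAD_GRAYSCALE)  # placeholder inputs
img2 = cv2.imread("view2.jpg", cv2.IMREAD_GRAYSCALE)

# 1. Detect SIFT keypoints and compute descriptors.
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# 2. Match descriptors, keeping confident matches (Lowe's ratio test).
matches = cv2.BFMatcher().knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]

# 3. Robustly estimate the fundamental matrix with RANSAC; the mask
#    flags the inlier matches, so outliers are simply dropped.
pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
pts2 = np.float32([kp2[m.trainIdx].pt for m in good])
F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 3.0, 0.99)
inliers = [m for m, keep in zip(good, mask.ravel()) if keep]
```

Steps 4-5 (track building and recovering cameras and 3D points) are the structure-from-motion stage of Photo Tourism (Snavely et al.).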
SLIDE 9

Spatial Cues II

Snavely et al.

◮ A single 3D Gaussian distribution per object to enforce spatial cues.
◮ A mixture of Gaussians to model the spatial cues from multiple objects:

P(C, X | π, µ, Σ) = ∏_j P(c_j | π) P(x_j | c_j, µ, Σ)

◮ A class variable c_j, drawn from a multinomial distribution, is associated with each point x_j.
◮ Point locations are drawn from 3D Gaussians, where the point class determines which Gaussian to use.
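As a rough illustration (not the paper's code), such a mixture can be fit to a reconstructed point cloud with scikit-learn; the points below are a random stand-in:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

points = np.random.rand(500, 3)  # stand-in for reconstructed (x, y, z) points

# One full-covariance 3D Gaussian per object class; the mixing weights
# play the role of the multinomial parameters π.
gmm = GaussianMixture(n_components=5, covariance_type="full").fit(points)
labels = gmm.predict(points)      # hard class assignment c_j per point
resp = gmm.predict_proba(points)  # soft assignments P(c_j | x_j)
```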

SLIDE 10

Field-of-view Cues

pLSA

Co-occurrence information is modeled by Probabilistic Latent Semantic Analysis (pLSA):

P(C, X | θ, φ) = ∏_i ∏_{j : x_j ∈ V_i} P(c_ij | θ_i) P(x_ij | c_ij, φ)

◮ A class variable c_ij is introduced for each point-image incidence (each point x_j visible in image i, i.e. x_j ∈ V_i).
◮ In the original pLSA, x_ij would be the number of times word j appears in document i.
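For intuition, a compact EM loop for the original word-document pLSA mentioned in the last bullet (toy random counts; the paper's field-of-view variant replaces documents with images and words with scene points):

```python
import numpy as np

rng = np.random.default_rng(0)
n_docs, n_words, n_topics = 20, 50, 3
X = rng.integers(0, 5, size=(n_docs, n_words)).astype(float)  # counts x_ij

p_z_d = rng.random((n_docs, n_topics))   # P(z | d), the θ_i
p_z_d /= p_z_d.sum(1, keepdims=True)
p_w_z = rng.random((n_topics, n_words))  # P(w | z), the φ
p_w_z /= p_w_z.sum(1, keepdims=True)

for _ in range(50):
    # E-step: responsibilities P(z | d, w) for every doc-word pair.
    joint = p_z_d[:, :, None] * p_w_z[None, :, :]        # (d, z, w)
    resp = joint / (joint.sum(1, keepdims=True) + 1e-12)
    # M-step: re-estimate both distributions from expected counts.
    counts = X[:, None, :] * resp                        # (d, z, w)
    p_z_d = counts.sum(2)
    p_z_d /= p_z_d.sum(1, keepdims=True) + 1e-12
    p_w_z = counts.sum(0)
    p_w_z /= p_w_z.sum(1, keepdims=True) + 1e-12
```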

SLIDE 11

Combined Model

Simon and Seitz

P(C, X | θ, π, µ, Σ) = ( ∏_i ∏_{j : x_j ∈ V_i} P(c_ij | θ_i) ) × ( ∏_j P(c_j | π) P(x_j | c_j, µ, Σ) )

◮ This joint density is locally maximized using the EM algorithm.

SLIDE 12

Evaluation I

◮ For each test scene, the ground-truth clusterings C* are manually created.
◮ Three different models (mixture of Gaussians, pLSA and the combined model) are all tested.
◮ Computed clusterings are evaluated using Meila's Variation of Information (VI) metric (see the sketch below):

VI(C, C*) = H(C | C*) + H(C* | C)

◮ The two terms are conditional entropies: the information lost and the information gained between the two clusterings.

Simon and Seitz
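A direct sketch of the metric (clusterings given as label sequences over the same items; the inputs are toy data):

```python
import math
from collections import Counter

def variation_of_information(a, b):
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    vi = 0.0
    for (ca, cb), nab in pab.items():
        p = nab / n
        # Accumulates H(C|C*) + H(C*|C) over the joint distribution.
        vi -= p * (math.log(p * n / pa[ca]) + math.log(p * n / pb[cb]))
    return vi

print(variation_of_information([0, 0, 1, 1], [0, 1, 1, 1]))  # ~0.82
```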

SLIDE 13

Evaluation II

Simon and Seitz

SLIDE 14

Importance Viewer

◮ Interesting objects appear in many photos.
◮ Objects are penalized for their size, so that large background objects are not rewarded:

imp(c) = (1 / |Σ_c|^α) ∑_i θ_i(c)

Simon and Seitz

SLIDE 15

Region Labeling

◮ Image tags on the Internet are very noisy.
◮ Accurate tags can be computed by examining tag-cluster co-occurrence statistics.
◮ The score of each cluster c and tag t pair is given by (see the sketch below):

score(c, t) = P(c, t) (log P(c, t) − log P(c)P(t))

Simon and Seitz
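An illustrative computation of this score from raw (cluster, tag) co-occurrences; the observations below are hypothetical:

```python
import math
from collections import Counter

# One (cluster_id, tag) pair per tagged photo assigned to a cluster.
obs = [(0, "duomo"), (0, "florence"), (1, "florence"), (1, "arno"), (0, "duomo")]

n = len(obs)
p_ct = Counter(obs)               # joint counts for P(c, t)
p_c = Counter(c for c, _ in obs)  # marginal counts for P(c)
p_t = Counter(t for _, t in obs)  # marginal counts for P(t)

def score(c, t):
    pct = p_ct[(c, t)] / n        # assumes the pair was observed
    return pct * (math.log(pct) - math.log((p_c[c] / n) * (p_t[t] / n)))

print(score(0, "duomo"))
```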

SLIDE 16

Interactive Map Viewer

◮ After the scene is segmented, the scene points are manually aligned with an overhead view.

Simon and Seitz

SLIDE 17

Summary

◮ The field-of-view cue is introduced.
◮ Field-of-view cues are combined with spatial cues to identify the interesting objects of a scene.
◮ Source of the information: the distribution of photos, i.e. the wisdom of crowds.
SLIDE 18

Introduction [World-scale Object Mining]

Goal

Automated collection of a high-quality image database with correct annotations.

Observations

◮ Large databases of visual data are available from community photo collections.
◮ More and more images are geotagged.
◮ Geotags and textual tags are sparse and noisy.

SLIDE 19

Big Picture

SLIDE 20

Gathering the Data

Quack et al.

◮ Earth's surface is divided into tiles.
◮ High overlap between tiles.
◮ 70,000 tiles are processed (52,000 containing no images at all).

SLIDE 21

Photo Clustering

1. Dissimilarity matrices are computed for several modalities:
   ◮ Visual cues.
   ◮ Textual cues.
   ◮ (User/timestamp cues.)
2. A hierarchical clustering step is used to create clusters of photos for the same object or event.

SLIDE 22

Visual Features and Similarity I

1. Extract SURF features from each photo of the tile.
2. For each pair of images, find the matching features.
3. Estimate a homography H between the two images using RANSAC (see the sketch below).
4. Create the distance matrix using the number of "inlier" feature matches I_ij for each image pair:

d_ij = I_ij / I_max   if I_ij ≥ 10
d_ij = ∞              if I_ij < 10
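A sketch of steps 1-3 for one image pair (illustrative only: SURF requires an opencv-contrib build, the filenames are placeholders, and I_max would come from scanning all pairs in the tile):

```python
import cv2
import numpy as np

img1 = cv2.imread("photo1.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("photo2.jpg", cv2.IMREAD_GRAYSCALE)

# 1. SURF keypoints and descriptors (needs opencv-contrib).
surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)
kp1, des1 = surf.detectAndCompute(img1, None)
kp2, des2 = surf.detectAndCompute(img2, None)

# 2. Match features with a ratio test.
matches = cv2.BFMatcher().knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]

# 3. Estimate homography H with RANSAC and count inlier matches I_ij.
src = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
dst = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
H, mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
I_ij = int(mask.sum())

# 4. One entry of the distance matrix (I_max assumed known).
I_max = 200
d_ij = I_ij / I_max if I_ij >= 10 else float("inf")
```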

SLIDE 23

Visual Features and Similarity II

Speeded-Up Robust Features by Bay et al.

◮ Scale- and rotation-invariant detector and descriptor.
◮ At each step, integral images are used to get very fast detections and descriptions.
◮ A box-filter approximation of the Hessian matrix is used as the underlying filter.
◮ The 64-dimensional SURF descriptor describes the distribution of the intensity content within the interest point neighborhood.

Bay et al.

SLIDE 24

Visual Features and Similarity III

Homography

  • K. Grauman

p′ = Hp, i.e.

    [w·x′]   [ a b c ] [x]
    [w·y′] = [ d e f ] [y]
    [ w  ]   [ g h 1 ] [1]

RANdom SAmple Consensus

  • K. Grauman
SLIDE 25

Text Features and Similarity

1. Three pieces of meta-data (tags, title and description) are combined to form a single text per image.
2. Image-specific stop lists are applied.
3. Pairwise text similarities are computed to create the distance matrix.

Term weighting (see the sketch below)

w_i,j = L_i,j · G_i · N_j

L_i,j = log(tf_i,j + 1) / ∑_j log(tf_i,j + 1)   (local term-frequency weight)

G_i = log((D − d_i) / d_i)   (global, IDF-like weight)

N_j = U_j / (1 + 0.0115 · U_j)

where U_j is the number of unique terms in image j.
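An illustrative implementation under assumed symbol meanings (tf = per-image term frequencies, df = number of images containing a term, D = images in the tile, U = unique-term counts; the normalizing sum in L is taken here over the terms of one image):

```python
import math

def weight(term, img, tf, df, U, D):
    # Local log term-frequency weight, normalized over the image's terms.
    L = math.log(tf[(term, img)] + 1) / sum(
        math.log(f + 1) for (t, j), f in tf.items() if j == img)
    # Global IDF-like weight.
    G = math.log((D - df[term]) / df[term])
    # Normalization by the number of unique terms U_j.
    N = U[img] / (1 + 0.0115 * U[img])
    return L * G * N

# Hypothetical toy data: two images' texts inside a tile of D = 5 images.
tf = {("sunset", 0): 2, ("beach", 0): 1, ("sunset", 1): 1}
df = {"sunset": 2, "beach": 1}
U = {0: 2, 1: 1}
print(weight("sunset", 0, tf, df, U, D=5))
```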

SLIDE 26

Clustering

◮ Hierarchical agglomerative clustering is applied to the computed distance matrices with the following cut-off distances:

Quack et al.

◮ Three different linkage methods are employed in order to capture different visual properties (see the SciPy sketch below):

single-link:   d_AB = min_{i∈A, j∈B} d_ij
complete-link: d_AB = max_{i∈A, j∈B} d_ij
average-link:  d_AB = (1 / (n_A n_B)) ∑_{i∈A, j∈B} d_ij
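A SciPy sketch of the three linkages on a toy distance matrix, cut at an assumed threshold:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

D = np.array([[0., 1., 4.],
              [1., 0., 5.],
              [4., 5., 0.]])   # toy pairwise distance matrix
condensed = squareform(D)      # SciPy expects the condensed form

for method in ("single", "complete", "average"):
    Z = linkage(condensed, method=method)
    labels = fcluster(Z, t=2.0, criterion="distance")  # cut-off distance
    print(method, labels)
```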

SLIDE 27

Classification into Objects and Events

◮ Two features are extracted using only the meta-data of the images in a tile:
   ◮ The number of unique days on which the photos in a cluster were taken.
   ◮ The number of different users who “contributed” photos to this cluster, divided by the cluster size.
◮ An individual ID3 decision tree is trained for each class (see the sketch below).
◮ 88% precision for objects and 94% precision for events.

Quack et al.
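A rough sketch of the setup (scikit-learn trees implement CART with an entropy criterion rather than textbook ID3, and a single multiclass tree is used here for brevity; the features and labels are invented):

```python
from sklearn.tree import DecisionTreeClassifier

# Per-cluster features: [unique_days, n_users / cluster_size] (toy values;
# events concentrate on few days, objects accumulate photos over many days).
X = [[1, 0.9], [2, 0.8], [40, 0.3], [55, 0.2], [1, 0.95], [60, 0.25]]
y = ["event", "event", "object", "object", "event", "object"]

clf = DecisionTreeClassifier(criterion="entropy").fit(X, y)
print(clf.predict([[3, 0.7]]))  # few unique days -> likely an event
```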

SLIDE 28

Labeling the Objects

◮ “Correct” labels of a cluster are found using frequent itemset mining (see the sketch below).
◮ The top 15 itemsets are kept per cluster.

Frequent Itemset Mining

Let I = {i_1, ..., i_p} be a set of p words. The text of each image in the tile is a subset of I, T ⊆ I. The texts of all images in a tile form the database D. The goal is to find itemsets A ⊆ T with relatively high support:

supp(A) = |{T ∈ D : A ⊆ T}| / |D| ∈ [0, 1]
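A brute-force illustration of the support definition on tiny tag sets (real mining uses Apriori or FP-growth):

```python
from itertools import combinations

# Database D: each image's text as a set of words.
D = [{"eiffel", "tower", "paris"}, {"eiffel", "paris"}, {"paris", "night"}]

def support(A, D):
    return sum(1 for T in D if A <= T) / len(D)

items = set().union(*D)
frequent = [(set(A), support(set(A), D))
            for r in (1, 2)
            for A in combinations(sorted(items), r)
            if support(set(A), D) >= 0.5]
print(sorted(frequent, key=lambda x: -x[1])[:15])  # keep top itemsets
```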

SLIDE 29

Linking to Wikipedia

1. Each itemset is used as a query to Google (the search is limited to Wikipedia articles).
2. Images in the article are compared with the images in the cluster.
3. If there is a match, the query is kept as a label; otherwise it is rejected.

SLIDE 30

Experiments

◮ 70,000 tiles covering approximately 700 square kilometers.
◮ Over 220,000 images.
◮ Over 20,000,000 similarities (only 1 million of which are greater than 0).
◮ In the end, 73,000 images could be assigned to a cluster.

SLIDE 31

Object Clusters

Quack et al.

SLIDE 32

Event Clusters

Quack et al.

SLIDE 33

Linkage Methods

Single-link vs. complete-link clusters (figure: Quack et al.)

SLIDE 34

Summary

◮ The world's surface is divided into tiles.
◮ Images belonging to a tile are identified using geotags.
◮ These images are clustered.
◮ Clusters are classified as objects or events.
◮ Object labels are determined, and additional information from the Internet is linked to these objects.
◮ FULLY UNSUPERVISED!

SLIDE 35

Introduction [80 Million Tiny Images]

Goal

Creating an image database that densely populates the manifold of natural images, allowing the use of non-parametric approaches.

Observations

◮ Billions of images are available on the Internet.
◮ The human visual system has a remarkable tolerance to degradations in image resolution.
◮ The visual world is very regular, which significantly limits the space of possible images.

SLIDE 36

Big Picture

Torralba et al.

SLIDE 37

Low Dimensional Image Representation

Torralba et al.

◮ 32 × 32 color images contain enough information for scene recognition, object detection and segmentation (for humans).
◮ Two advantages of the low-resolution representation:
   ◮ The intrinsic dimensionality of the manifold gets much smaller.
   ◮ Storing and efficiently indexing vast amounts of data points becomes feasible.
◮ It is important that information is not lost while the dimensionality is reduced.

SLIDE 38

A Large Dataset of 32 × 32 Images I

1. 75,062 non-abstract nouns are extracted from Wordnet.
2. Seven independent search engines are queried for images belonging to each of these categories.
3. In 8 months, 97,245,098 images are collected.
4. Duplicates and uniform images are eliminated to form the final dataset of 79,302,017 images.

SLIDE 39

A Large Dataset of 32 × 32 Images II

Torralba et al.

Keywords

Around 10% of the keywords have very few images. Mean number of images per word: 1,056.

Labeling Noise

The dataset is not cleaned up; the visual content is often unrelated to the query word.

SLIDE 40

Dataset Statistics

Dataset Size

The distance between two images can be approximated using a few principal components.

Similarity Measures (see the sketch below)

D²_ssd = ∑_{x,y,c} (I_1(x, y, c) − I_2(x, y, c))²

D²_warp = ∑_{x,y,c} (I_1(x, y, c) − T_θ[I_2(x, y, c)])²

D²_shift = ∑_{x,y,c} (I_1(x, y, c) − Î_2(x + D_x, y + D_y, c))²
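A sketch of the simplest measure, D²_ssd, on random stand-in images (D²_warp and D²_shift additionally minimize over small warps T_θ and per-pixel shifts, omitted here):

```python
import numpy as np

rng = np.random.default_rng(0)
I1 = rng.random((32, 32, 3))  # two stand-in 32 x 32 color images
I2 = rng.random((32, 32, 3))

d_ssd = np.sum((I1 - I2) ** 2)  # sums over x, y and color channel c
print(d_ssd)
```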

SLIDE 41

Wordnet Voting Scheme in Recognition I

Recognition

Rather than relying on complex matching schemes, let the data do the work.

Wordnet Voting Scheme for Labeling Noise

◮ Given a query image, find the nearest neighbors using some similarity measure.
◮ Each neighbor votes for its branch in the Wordnet tree.
◮ Classification at a specific level is done with respect to the votes (see the sketch below).
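A toy sketch of the voting idea (the neighbor labels and their hand-written Wordnet-style paths below are hypothetical; a real system would look the paths up in Wordnet):

```python
from collections import Counter

# Each retrieved neighbor contributes its branch of the tree.
neighbor_paths = [
    ("entity", "organism", "person", "athlete"),
    ("entity", "organism", "person", "worker"),
    ("entity", "organism", "animal", "dog"),
]

# Classify at a chosen depth: take the most-voted node at that level.
level = 2
votes = Counter(p[level] for p in neighbor_paths if len(p) > level)
print(votes.most_common(1))  # [('person', 2)]
```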

SLIDE 42

Wordnet Voting Scheme in Recognition II

Torralba et al.

SLIDE 43

Person Detection I

◮ 23% of the images contain people.
◮ Hence, the corresponding region of the manifold is covered very densely.

Torralba et al.

SLIDE 44

Person Detection II

Torralba et al.

SLIDE 45

Person Localization

◮ Segment the input image using normalized cuts (10 segments).
◮ Query the dataset using cropped continuous segments.

Torralba et al.

SLIDE 46

Scene Recognition

◮ Scene images will have a high count accumulated at the “location” node of the Wordnet tree.

Torralba et al.

SLIDE 47

Automatic Image Annotation

Torralba et al.

SLIDE 48

Image Colorization

Torralba et al.

SLIDE 49

Detecting Image Orientation

Torralba et al.

SLIDE 50

Summary

◮ The data should do the work, not us!
◮ 32 × 32 color images are enough for most computer vision tasks.
◮ Cover the manifold of natural images densely, so that for every query image there will be a semantically very similar image in the database.

SLIDE 51

Final Word

◮ Wisdom of Crowds: importance viewer, region labeling, interactive map viewer.
◮ World-scale Mining of Objects: recognition, automatic annotation.
◮ 80 Million Tiny Images: detection, recognition, localization, automatic annotation, etc.

Dataset Size

1. Wisdom of Crowds
2. World-scale Mining of Objects
3. 80 Million Tiny Images

Complexity

1. 80 Million Tiny Images
2. World-scale Mining of Objects
3. Wisdom of Crowds
SLIDE 52

References

◮ Scene Segmentation Using the Wisdom of Crowds by I. Simon and S. M. Seitz
◮ Photo Tourism: Exploring Photo Collections in 3D by N. Snavely, S. M. Seitz and R. Szeliski
◮ Comparing Clusterings: an Information Based Distance by M. Meila
◮ World-scale Mining of Objects and Events from Community Photo Collections by T. Quack, B. Leibe and L. Van Gool
◮ Speeded-Up Robust Features (SURF) by H. Bay, A. Ess, T. Tuytelaars and L. Van Gool
◮ 80 Million Tiny Images: a Large Dataset for Non-parametric Object and Scene Recognition by A. Torralba, R. Fergus and W. T. Freeman
◮ Dr. Kristen Grauman's CS378 (Fall 2008) and CS395T (Spring 2009) lecture slides