CS395T: Visual Recognition and Search Leveraging Internet Data


  1. CS395T: Visual Recognition and Search Leveraging Internet Data Birgi Tamersoy March 27, 2009

  2. Theme I L. Lazebnik

  3. Theme II L. Lazebnik

  4. Theme III K. Grauman

  5. Outline ◮ Scene Segmentation Using the Wisdom of Crowds, by I. Simon and S.M. Seitz ◮ World-scale Mining of Objects and Events from Community Photo Collections, by T. Quack, B. Leibe and L. Van Gool ◮ 80 Million Tiny Images: a Large Dataset for Non-parametric Object and Scene Recognition, by A. Torralba, R. Fergus and W.T. Freeman

  6. Introduction [Wisdom of Crowds] Goal Given a set of images of a static scene, identify and segment the interesting objects in the scene. Observations ◮ The distribution of photos in a collection holds valuable semantic information. ◮ Interesting objects will be frequently photographed. ◮ Detecting interesting features is straightforward, but identifying interesting objects is more challenging. ◮ Features on the same object will appear together in many photos. Field-of-view cue: co-occurrence information is used to group features into objects.

  7. Big Picture

  8. Spatial Cues I Algorithm 1. Find feature points in each image using the SIFT keypoint detector. 2. For each pair of images, match the detected feature points. 3. Robustly estimate a fundamental matrix for the pair using RANSAC (RAndom SAmple Consensus) and remove the outliers. 4. Organize the matches into tracks. ◮ A track is a connected set of matching keypoints across multiple images. 5. Recover camera parameters and a 3D location for each track.
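
A minimal OpenCV sketch of steps 1-3 (not the authors' implementation; the ratio-test threshold and RANSAC parameters are illustrative choices, not the paper's settings):

```python
# A minimal sketch of steps 1-3 with OpenCV (not the authors' code).
import cv2
import numpy as np

def match_pair(path1, path2):
    img1 = cv2.imread(path1, cv2.IMREAD_GRAYSCALE)
    img2 = cv2.imread(path2, cv2.IMREAD_GRAYSCALE)

    # Step 1: SIFT keypoints and descriptors.
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1, None)
    kp2, des2 = sift.detectAndCompute(img2, None)

    # Step 2: match descriptors, keeping matches that pass Lowe's ratio test.
    matches = [m for m, n in cv2.BFMatcher().knnMatch(des1, des2, k=2)
               if m.distance < 0.75 * n.distance]

    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

    # Step 3: robust fundamental matrix; the mask marks RANSAC inliers.
    F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 3.0, 0.99)
    if F is None:
        return None, []
    return F, [m for m, keep in zip(matches, mask.ravel()) if keep]
```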

  9. Spatial Cues II ◮ A single 3D Gaussian distribution per object to enforce spatial cues. ◮ A mixture of Gaussians to model the spatial cues from multiple objects: $P(C, X \mid \pi, \mu, \Sigma) = \prod_j P(c_j \mid \pi)\, P(x_j \mid c_j, \mu, \Sigma)$ ◮ A class variable c_j is associated with each point x_j, drawn from a multinomial distribution. ◮ Point locations are drawn from 3D Gaussians, where the point class determines which Gaussian to use. Snavely et al.
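
A quick way to see the spatial model in action is to fit a 3D mixture of Gaussians with scikit-learn's generic EM, a stand-in for the paper's own optimization; the `points` array and `K` are placeholders:

```python
# Sketch: fit K 3D Gaussians to reconstructed scene points with EM.
import numpy as np
from sklearn.mixture import GaussianMixture

points = np.random.rand(1000, 3)  # stand-in for recovered 3D track locations
K = 5                             # assumed number of object clusters

gmm = GaussianMixture(n_components=K, covariance_type='full').fit(points)
classes = gmm.predict(points)     # c_j: most likely Gaussian per point
pi, mu, Sigma = gmm.weights_, gmm.means_, gmm.covariances_
```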

  10. Field-of-view Cues pLSA Co-occurrence information is modeled by Probabilistic Latent Semantic Analysis (pLSA): $P(C, X \mid \theta, \phi) = \prod_i \prod_{j \,:\, x_j \in V_i} P(c_{ij} \mid \theta_i)\, P(x_{ij} \mid c_{ij}, \phi)$ ◮ A class variable c_{ij} for each point-image incidence. ◮ In the original pLSA, x_{ij} would be the number of times word j appears in document i.
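
For readers unfamiliar with pLSA, a bare-bones NumPy EM sketch for the original word-document formulation (illustrative only; the paper instead attaches a class variable to each point-image incidence):

```python
# pLSA EM for the word-document setting: n[d, w] = count of word w in doc d.
import numpy as np

def plsa(n, Z, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    D, W = n.shape
    theta = rng.random((D, Z)); theta /= theta.sum(1, keepdims=True)  # P(z|d)
    phi = rng.random((Z, W)); phi /= phi.sum(1, keepdims=True)        # P(w|z)
    for _ in range(iters):
        # E-step: responsibilities P(z|d,w), shape (D, Z, W).
        p = theta[:, :, None] * phi[None, :, :]
        p /= p.sum(axis=1, keepdims=True) + 1e-12
        # M-step: re-estimate parameters from expected counts n(d,w)*P(z|d,w).
        nz = n[:, None, :] * p
        phi = nz.sum(axis=0); phi /= phi.sum(axis=1, keepdims=True)
        theta = nz.sum(axis=2); theta /= theta.sum(axis=1, keepdims=True)
    return theta, phi
```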

  11. Combined Model Simon and Seitz $P(C, X \mid \theta, \pi, \mu, \Sigma) = \left( \prod_i \prod_{j \,:\, x_j \in V_i} P(c_{ij} \mid \theta_i) \right) \times \left( \prod_j P(c_j \mid \pi)\, P(x_j \mid c_j, \mu, \Sigma) \right)$ ◮ This joint density is locally maximized using the EM algorithm.

  12. Evaluation I ◮ For each test scene, ground-truth clusterings C* are manually created. ◮ Three different models, mixture of Gaussians, pLSA, and the combined model, are all tested. ◮ Computed clusterings are evaluated using Meila's Variation of Information (VI) metric: $VI(C, C^*) = H(C \mid C^*) + H(C^* \mid C)$ ◮ The two terms are conditional entropies: the information lost and the information gained in going from one clustering to the other. Simon and Seitz
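
A small Python sketch of the VI metric from two label vectors, using the identity VI = 2H(C, C*) − H(C) − H(C*) (the inputs are hypothetical):

```python
# Meila's Variation of Information between two clusterings of the same points.
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def variation_of_information(c, c_star):
    c, c_star = np.asarray(c), np.asarray(c_star)
    # Joint distribution P(a, b) over pairs of cluster labels.
    joint = np.array([[np.mean((c == a) & (c_star == b))
                       for b in np.unique(c_star)]
                      for a in np.unique(c)])
    # VI = H(C|C*) + H(C*|C) = 2*H(C, C*) - H(C) - H(C*)
    return (2 * entropy(joint.ravel())
            - entropy(joint.sum(axis=1)) - entropy(joint.sum(axis=0)))
```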

  13. Evaluation II Simon and Seitz

  14. Importance Viewer ◮ Interesting objects appear in many photos. ◮ Objects are penalized for their size, so that large background objects are not rewarded: $\mathrm{imp}(c) = \frac{1}{|\Sigma_c|^{\alpha}} \sum_i \theta_i(c)$ Simon and Seitz

  15. Region Labeling ◮ Image tags on the Internet are very noisy. ◮ Accurate tags can be computed by examining tag-cluster co-occurrence statistics. ◮ The score of each cluster c and tag t pair is given by: $\mathrm{score}(c, t) = P(c, t)\left(\log P(c, t) - \log P(c)P(t)\right)$ Simon and Seitz
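
A possible NumPy sketch of this scoring, starting from a hypothetical cluster-tag co-occurrence count matrix:

```python
# Score every (cluster, tag) pair; counts[c, t] = number of photos in
# cluster c carrying tag t (a placeholder input, not the paper's data).
import numpy as np

def tag_scores(counts):
    P = counts / counts.sum()               # joint P(c, t)
    Pc = P.sum(axis=1, keepdims=True)       # marginal P(c)
    Pt = P.sum(axis=0, keepdims=True)       # marginal P(t)
    with np.errstate(divide='ignore', invalid='ignore'):
        score = P * (np.log(P) - np.log(Pc * Pt))
    return np.nan_to_num(score)             # zero-count pairs score 0
```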

  16. Interactive Map Viewer ◮ After the scene is segmented, the scene points are manually aligned with an overhead view. Simon and Seitz

  17. Summary ◮ The field-of-view cue is introduced. ◮ Field-of-view cues are combined with spatial cues to identify the interesting objects of a scene. ◮ Source of the information: the distribution of photos, i.e., the wisdom of crowds.

  18. Introduction [World-scale Object Mining] Goal Automated collection of a high-quality image database with correct annotations. Observations ◮ Large databases of visual data are available from community photo collections. ◮ More and more images are being geotagged. ◮ Geotags and textual tags are sparse and noisy.

  19. Big Picture

  20. Gathering the Data ◮ The Earth's surface is divided into tiles. ◮ High overlap between tiles. ◮ 70,000 tiles are processed (52,000 containing no images at all). Quack et al.

  21. Photo Clustering 1. Dissimilarity matrices are computed for several modalities: ◮ Visual cues. ◮ Textual cues. ◮ (User/timestamp cues.) 2. A hierarchical clustering step is used to create clusters of photos of the same object or event.

  22. Visual Features and Similarity I 1. Extract SURF features from each photo of the tile. 2. For each pair of images, find the matching features. 3. Estimate a homography H between the two images using RANSAC. 4. Create the distance matrix using the number of "inlier" feature matches I_{ij} for each image pair: $d_{ij} = \begin{cases} I_{\max}/I_{ij} & \text{if } I_{ij} \ge 10 \\ \infty & \text{if } I_{ij} < 10 \end{cases}$
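
A sketch of this visual distance for one image pair; SURF itself lives in opencv-contrib (`cv2.xfeatures2d.SURF_create()`), so the keypoints and descriptors are assumed precomputed here, and the ratio test is my own addition:

```python
# Visual distance from homography inliers, following the slide's thresholds.
import cv2
import numpy as np

def visual_distance(kp1, des1, kp2, des2, I_max):
    matches = [m for m, n in cv2.BFMatcher().knnMatch(des1, des2, k=2)
               if m.distance < 0.75 * n.distance]
    if len(matches) < 4:                 # a homography needs 4 correspondences
        return np.inf
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])
    H, mask = cv2.findHomography(pts1, pts2, cv2.RANSAC, 5.0)
    I_ij = int(mask.sum()) if mask is not None else 0   # inlier count
    return I_max / I_ij if I_ij >= 10 else np.inf
```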

  23. Visual Features and Similarity II Speeded-Up Robust Features by Bay et al. ◮ Scale- and rotation-invariant detector and descriptor. ◮ At each step, integral images are used to get very fast detections and descriptions. ◮ A box-filter approximation of the Hessian matrix is used as the underlying filter. ◮ The 64-dimensional SURF descriptor describes the distribution of the intensity content within the interest point neighborhood. Bay et al.

  24. Visual Features and Similarity III RANdom SAmple Consensus Homography (K. Grauman): $p' = Hp, \qquad \begin{pmatrix} wx' \\ wy' \\ w \end{pmatrix} = \begin{pmatrix} a & b & c \\ d & e & f \\ g & h & 1 \end{pmatrix} \begin{pmatrix} x \\ y \\ 1 \end{pmatrix}$

  25. Text Features and Similarity 1. Three meta-data fields (tags, title, and description) are combined to form a single text per image. 2. Image-specific stop lists are applied. 3. Pairwise text similarities are computed to create the distance matrix. Term weighting: $L_{i,j} = \log(tf_{i,j} + 1), \qquad G_i = \log\frac{D - d_i}{d_i}, \qquad N_j = \frac{1}{1 + 0.0115\, U_j}, \qquad w_{i,j} = L_{i,j} \cdot G_i \cdot N_j$ where tf_{i,j} is the frequency of term i in image j, D is the total number of images, d_i is the number of images containing term i, and U_j is the number of unique terms in image j.
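
A small Python sketch implementing the weighting as reconstructed above (`texts`, a list of per-image token lists, is a hypothetical input):

```python
# Term weighting per image: local log-tf, global IDF-style weight, and
# a length normalization based on the number of unique terms.
import math
from collections import Counter

def term_weights(texts):
    D = len(texts)
    df = Counter(t for text in texts for t in set(text))   # d_i per term
    weights = []
    for text in texts:
        tf = Counter(text)
        U = len(tf)                                        # unique terms U_j
        N = 1.0 / (1.0 + 0.0115 * U)                       # normalization N_j
        w = {}
        for term, f in tf.items():
            L = math.log(f + 1)                            # local weight L_ij
            G = math.log((D - df[term]) / df[term]) if df[term] < D else 0.0
            w[term] = L * G * N
        weights.append(w)
    return weights
```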

  26. Clustering ◮ Hierarchical agglomerative clustering is applied to the computed distance matrices with the following cut-off distances (table: Quack et al.). ◮ Three different linkage methods are employed in order to capture different visual properties: single-link: $d_{AB} = \min_{i \in A, j \in B} d_{ij}$; complete-link: $d_{AB} = \max_{i \in A, j \in B} d_{ij}$; average-link: $d_{AB} = \frac{1}{n_A n_B} \sum_{i \in A, j \in B} d_{ij}$
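
A sketch with SciPy over a precomputed distance matrix; the cut-off value here is a placeholder for the slide's table:

```python
# Agglomerative clustering of photos from a pairwise distance matrix.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_photos(dist_matrix, method='single', cutoff=0.5):
    # SciPy expects a condensed (upper-triangle) distance vector; infinite
    # distances must first be replaced by a large finite value.
    d = np.where(np.isinf(dist_matrix), 1e6, dist_matrix)
    np.fill_diagonal(d, 0.0)
    condensed = squareform(d, checks=False)
    Z = linkage(condensed, method=method)  # 'single' | 'complete' | 'average'
    return fcluster(Z, t=cutoff, criterion='distance')
```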

  27. Classification into Objects and Events ◮ Two features are extracted using only the meta-data of the images in a tile: the number of unique days the photos in a cluster were taken on, and the number of different users who "contributed" photos to the cluster divided by the cluster size. ◮ An individual ID3 decision tree is trained for each class. ◮ 88% precision for objects and 94% precision for events. Quack et al.
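
A rough scikit-learn analogue (scikit-learn grows CART trees rather than ID3, though `criterion='entropy'` uses the same information-gain idea; the training rows below are invented):

```python
# Classify clusters as objects or events from the two meta-data features.
from sklearn.tree import DecisionTreeClassifier

# One row per cluster: [number of unique capture days,
#                       number of distinct users / cluster size]
X = [[1, 0.9], [45, 0.1], [2, 0.8], [60, 0.05]]
y = ['event', 'object', 'event', 'object']

clf = DecisionTreeClassifier(criterion='entropy').fit(X, y)
print(clf.predict([[3, 0.7]]))  # e.g. -> ['event']
```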

  28. Labeling the Objects ◮ "Correct" labels of a cluster are found using frequent itemset mining. ◮ The top 15 itemsets are kept per cluster. Frequent Itemset Mining: Let $I = \{i_1, \dots, i_p\}$ be a set of p words. The text of each image in the tile is a subset of I, $T \subseteq I$. The texts of all images in a tile form the database D. The goal is to find itemsets $A \subseteq T$ with relatively high support: $\mathrm{supp}(A) = \frac{|\{T \in D \mid A \subseteq T\}|}{|D|} \in [0, 1]$
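
A brute-force sketch of support counting over a toy database (real systems use Apriori or FP-growth; the example documents are invented):

```python
# Enumerate small itemsets and keep those with support above a threshold.
from itertools import combinations

def frequent_itemsets(docs, max_size=3, min_supp=0.1):
    docs = [frozenset(d) for d in docs]
    vocab = sorted(set().union(*docs))
    results = {}
    for k in range(1, max_size + 1):
        for items in combinations(vocab, k):
            A = frozenset(items)
            supp = sum(A <= T for T in docs) / len(docs)  # subset test
            if supp >= min_supp:
                results[A] = supp
    return results

docs = [{'eiffel', 'tower', 'paris'}, {'eiffel', 'night'}, {'paris', 'louvre'}]
print(frequent_itemsets(docs))
```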

  29. Linking to Wikipedia 1. Each itemset is used as a query to Google (the search is limited to Wikipedia articles). 2. Images in the article are compared with the images in the cluster. 3. If there is a match, the query is kept as a label; otherwise it is rejected.

  30. Experiments ◮ 70,000 tiles covering approximately 700 square kilometers. ◮ Over 220,000 images. ◮ Over 20,000,000 similarities (only 1 million being greater than 0). ◮ In the end, 73,000 images could be assigned to a cluster.

  31. Object Clusters Quack et al.

  32. Event Clusters Quack et al.

  33. Linkage Methods Single-link Complete-link Quack et al.

  34. Summary ◮ The Earth's surface is divided into tiles. ◮ Images belonging to a tile are identified using geotags. ◮ These images are clustered. ◮ Clusters are classified as objects or events. ◮ Object labels are determined, and additional information from the Internet is linked to these objects. ◮ Fully unsupervised!

  35. Introduction [80 Million Tiny Images] Goal Create an image database that densely populates the manifold of natural images, allowing the use of non-parametric approaches. Observations ◮ Billions of images are available on the Internet. ◮ The human visual system has a remarkable tolerance to degradation in image resolution. ◮ The visual world is very regular, which significantly limits the space of possible images.

  36. Big Picture Torralba et al.

  37. Low Dimensional Image Representation ◮ 32 × 32 color images contain enough information for scene recognition, object detection, and segmentation (for humans). ◮ Two advantages of the low-resolution representation: ◮ The intrinsic dimensionality of the manifold gets much smaller. ◮ Storing and efficiently indexing vast amounts of data points becomes feasible. ◮ It is important that information is not lost while the dimensionality is reduced. Torralba et al.
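
A minimal sketch of the 32 × 32 representation and a sum-of-squared-differences nearest-neighbor lookup (Pillow/NumPy; the paper's exact descriptors and search structures differ):

```python
# Build tiny-image vectors and find the closest database entries by SSD.
import numpy as np
from PIL import Image

def tiny(path):
    img = Image.open(path).convert('RGB').resize((32, 32), Image.BILINEAR)
    return np.asarray(img, dtype=np.float32).ravel() / 255.0  # 3072-d vector

def nearest_neighbors(query_vec, database, k=5):
    # database: (N, 3072) array of tiny-image vectors.
    d = np.sum((database - query_vec) ** 2, axis=1)  # SSD to every image
    return np.argsort(d)[:k]
```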
