

  1. Extracting Visual Knowledge from the Web with Multimodal Learning. Dihong Gong, Daisy Zhe Wang. Data Science Research Lab, Department of Computer and Information Science and Engineering, University of Florida. Presented at the International Joint Conference on Artificial Intelligence (Melbourne, 2017)

  2. Problem Description. Goal: extract visual knowledge from a large collection of web pages.

  3. Related Work: web visual knowledge extraction. Chen, Xinlei, Abhinav Shrivastava, and Abhinav Gupta. "NEIL: Extracting Visual Knowledge from Web Data", ICCV 2013.

  4. Motivation: Traditional Approaches vs. Our Approach. [Figure: for an input image captioned "A girl is playing with a sleeping dog in a room", traditional pipelines discard the text and run visual detection alone; our approach runs text extraction and visual processing in parallel and fuses them in a multimodal prediction, utilizing text for enhanced prediction. Outputs: G-BBOX: Chair, R-BBOX: Dog, Y-BBOX: Person.]

  5. Step 1: Candidate Object Extraction ● Fast RCNN with the CaffeNet model ● Geodesic Object Proposals for candidate box generation ● Trained on 46 categories from ImageNet (~250 samples per category) ● Applied the trained object detector to ~10 million web images to extract objects
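The filtering at the heart of this step, keeping high-confidence, non-overlapping candidate boxes per category, can be sketched in pure Python. This is a minimal illustration of thresholding plus greedy non-maximum suppression, not the authors' Fast RCNN pipeline; the box format, score threshold, and 0.5 IoU cutoff are assumptions for the sketch.

```python
# Minimal sketch: keep confident detections, suppress overlapping duplicates
# of the same category (greedy NMS). Thresholds are illustrative assumptions.

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def filter_detections(dets, score_thr=0.8, iou_thr=0.5):
    """dets: list of (box, label, score); returns kept detections."""
    kept = []
    for box, label, score in sorted(dets, key=lambda d: -d[2]):
        if score < score_thr:
            continue
        # Keep only if it does not overlap a kept box of the same category.
        if all(l != label or iou(box, b) < iou_thr for b, l, _ in kept):
            kept.append((box, label, score))
    return kept

dets = [((10, 10, 50, 50), "dog", 0.95),
        ((12, 12, 52, 52), "dog", 0.90),     # overlaps first box: suppressed
        ((100, 100, 150, 150), "chair", 0.85)]
print(filter_detections(dets))  # keeps the first dog box and the chair
```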

  6. Step 2: Learning Common Representations. [Figure: example 2D embeddings where "wheel", "car", "vehicle" cluster together; example text "Gainesville vehicle dealer for 20 years!" with detected objects wheel: 0.8, car: 0.9.] ● Convert each image into a multimodal document: m[k] = {"Vehicle", "car", "wheel"} ● Given training documents {m[1], m[2], …, m[N]}, apply a Skip-Gram-based word2vec program to learn vector embeddings, where each multimodal document is treated as a single sentence in the text domain. ● Multimodal embeddings: objects with similar semantic meanings are closer.

  7. Step 3: Multimodal Prediction ● Features: each multimodal document is converted into a bag of vectors. ● Model: sparse logistic regression. ● Training: train one model per image category, where positive samples are multimodal documents whose images contain detected objects of that category, and negative samples are the rest. ● Scoring: apply the prediction model to multimodal documents; the final confidence score combines p, a [0, 1] confidence score given by the logistic regression, with q, the confidence score given by the object detector. Thus p mostly carries text information, while q carries visual information.
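A sketch of the per-category model with scikit-learn: the L1 penalty makes the logistic regression sparse, as in the slides. Two details are assumptions on my part, since the transcript omits them: the bag of vectors is pooled by averaging into one feature vector, and p and q are combined by a simple product (the slide's actual combination formula was in an image and is not reproduced here).

```python
# Sketch of Step 3: one sparse logistic regression per category.
# Pooling (mean of embedding vectors) and the p*q combination are assumed.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
dim = 50
# Toy training set: documents whose images yielded a detection of the target
# category are positives; the rest are negatives.
X = np.vstack([rng.normal(1.0, 1.0, (40, dim)),    # positives
               rng.normal(-1.0, 1.0, (40, dim))])  # negatives
y = np.array([1] * 40 + [0] * 40)

clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)  # sparse
clf.fit(X, y)

def score(doc_vec, q):
    """Combine text-side probability p with detector confidence q."""
    p = clf.predict_proba(doc_vec.reshape(1, -1))[0, 1]
    return p * q  # assumed combination, for illustration only

print(score(np.ones(dim), q=0.9))
```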

  8. Dataset. We prepared our own dataset in these steps: 1. Parsed hundreds of millions of HTML web pages. 2. Extracted image URLs and meta text. 3. Calculated the text tags of each image with an image tagging program (using TF-IDF to measure the importance of a tag). 4. Downloaded the images using Amazon S3 and EC2 in a distributed manner. This resulted in a collection of around 10 million tagged images for our study.
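The TF-IDF tag ranking in step 3 can be sketched in a few lines. The three-document corpus is a toy stand-in for the meta text of the parsed pages, and whitespace tokenization is an assumption; the real pipeline ran over hundreds of millions of pages.

```python
# Sketch of the tagging step: rank a page's candidate tags by TF-IDF,
# so terms frequent on this page but rare elsewhere score highest.
import math
from collections import Counter

corpus = [
    "car dealer gainesville car sale car",   # meta text of page 0
    "dog park gainesville pet",              # meta text of page 1
    "used car engine wheel",                 # meta text of page 2
]

def tfidf_tags(doc_idx, top_k=3):
    docs = [d.split() for d in corpus]
    n = len(docs)
    tf = Counter(docs[doc_idx])
    scores = {}
    for term, count in tf.items():
        df = sum(term in d for d in docs)            # document frequency
        scores[term] = (count / len(docs[doc_idx])) * math.log(n / df)
    return [t for t, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]]

print(tfidf_tags(0))  # → ['car', 'dealer', 'sale']
```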

  9. Evaluation Methods ● We compare our approach to a baseline system that uses only image information. ● Extract all objects (of the 46 categories) from the 10 million images, using both the baseline and the proposed approach. ● Calculate the precision of the retrieved documents over the top 1,000 results of each category, where the correctness of a retrieval is judged by human workers.
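The metric itself is precision over the top-k ranked retrievals. A minimal sketch, with synthetic boolean judgments standing in for the human workers' labels:

```python
# Sketch of the evaluation metric: precision@k over a ranked retrieval list.
def precision_at_k(judgments, k=1000):
    """judgments: booleans (human-judged correctness), ranked by confidence."""
    top = judgments[:k]
    return sum(top) / len(top) if top else 0.0

# e.g. 8 of the top 10 retrievals for a category judged correct -> 0.8
print(precision_at_k([True] * 8 + [False] * 2, k=10))  # → 0.8
```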

  10. Quantitative Results. [Chart: number of extracted objects per category, baseline vs. proposed.]

  11. Illustrative Examples

  12. Conclusions and Future Work ● Proposed a multimodal-embedding-based retrieval method to enhance visual knowledge extraction from Web images. ● Shows a significant improvement over the unimodal approach (precision 72.95% → 81.43%). In the future, we will continue to explore other algorithms, such as graphical models or deep learning models, to make more efficient use of multimodal information.

  13. Thank You! About Me (Dihong Gong). Education: 5th-year Ph.D. student, CS department, University of Florida (United States). Research Interests: Multimodal Content Analysis, Information Extraction, Computer Vision. Email: gongd@ufl.edu More Info: www.cise.ufl.edu/~dihong Looking for full-time research scientist positions!
