SLIDE 1

Extracting Visual Knowledge from the Web with Multimodal Learning

Dihong Gong, Daisy Zhe Wang

Data Science Research Lab, Department of Computer and Information Science and Engineering, University of Florida

Presented at the International Joint Conference on Artificial Intelligence (Melbourne, 2017)

SLIDE 2

Problem Description

Goal: extract visual knowledge from a large collection of web pages.


SLIDE 3

Related Work: Web Visual Knowledge Extraction

Chen, Xinlei, Abhinav Shrivastava, and Abhinav Gupta. "NEIL: Extracting Visual Knowledge from Web Data". ICCV 2013.

SLIDE 4

Motivation: Traditional Approaches vs. Our Approach

[Example web image, captioned: “A girl is playing with a sleeping dog in a room.”]

Traditional approach (Inputs → Visual Detection → Extraction Outputs): the detector produces bounding boxes (G-BBOX: Chair, R-BBOX: Dog, Y-BBOX: Person), while the accompanying text is discarded.

Our approach: the same Visual Detection is combined with Text Processing in a Multimodal Prediction step, utilizing the text for enhanced prediction.

SLIDE 5

Step 1: Candidate Objects Extraction

  • Fast R-CNN with the CaffeNet model
  • Geodesic Object Proposals for candidate box generation
  • Trained on 46 categories from ImageNet (~250 samples per category)
  • Apply the trained object detector to ~10 million web images to extract objects
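The extraction step above amounts to a post-processing loop over detector outputs: keep per-category boxes above a confidence threshold and suppress overlapping duplicates. A minimal sketch in plain Python (the `detections` input format and both thresholds are illustrative assumptions, not the paper's actual parameters):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def extract_objects(detections, score_thresh=0.5, iou_thresh=0.3):
    """detections: list of (category, score, box) from a detector.
    Returns confident, non-overlapping detections per category
    (greedy non-maximum suppression)."""
    kept = []
    for cat in {c for c, _, _ in detections}:
        cand = sorted((d for d in detections
                       if d[0] == cat and d[1] >= score_thresh),
                      key=lambda d: -d[1])
        for d in cand:
            # keep a box only if it does not heavily overlap a kept box
            if all(iou(d[2], k[2]) < iou_thresh for k in kept if k[0] == cat):
                kept.append(d)
    return kept
```

Applied over ~10 million images, each surviving detection becomes one candidate object for the later multimodal steps.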
SLIDE 6

Step 2: Learning Common Representations

  • Convert each image into a multimodal document: m[k] = {“Vehicle”, “car”, “wheel”}
  • Given training documents {m[1], m[2], …, m[N]}, apply a Skip-Gram based word2vec program to learn vector embeddings, where each multimodal document is treated as a single sentence in the text domain.
  • Multimodal embeddings: objects with similar semantic meanings are closer together.

[Figure: example 2D embeddings. Text side: “Gainesville vehicle dealer for 20 years!”; image side: detections wheel (0.8), car (0.9). In the embedding space, wheel, car, vehicle, dealer, and Gainesville appear, with semantically related terms close together.]
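A rough sketch of this step: the code below trains Skip-Gram embeddings with negative sampling over toy multimodal documents, each treated as one sentence, so every token in a document is context for every other token. A real pipeline would use an off-the-shelf word2vec implementation; the corpus, dimensionality, and hyperparameters here are illustrative assumptions:

```python
import numpy as np

def train_skipgram(docs, dim=16, epochs=200, lr=0.05, neg=3, seed=0):
    """Minimal skip-gram with negative sampling. Each multimodal document
    is one 'sentence': every ordered pair of distinct tokens in a document
    is a (center, context) training pair."""
    rng = np.random.default_rng(seed)
    vocab = sorted({w for d in docs for w in d})
    idx = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    W_in = rng.normal(scale=0.1, size=(V, dim))   # embedding vectors
    W_out = rng.normal(scale=0.1, size=(V, dim))  # context vectors
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    for _ in range(epochs):
        for doc in docs:
            ids = [idx[w] for w in doc]
            for c in ids:
                for o in ids:
                    if o == c:
                        continue
                    v = W_in[c].copy()
                    # positive pair: push center and context together
                    g = lr * (1.0 - sigmoid(v @ W_out[o]))
                    W_in[c] += g * W_out[o]
                    W_out[o] += g * v
                    # negative samples: push random tokens away
                    for n in rng.integers(0, V, size=neg):
                        g = lr * sigmoid(v @ W_out[n])
                        W_in[c] -= g * W_out[n]
                        W_out[n] -= g * v
    return vocab, W_in

# Toy multimodal documents mixing detector labels and page-text tags
docs = [["vehicle", "car", "wheel"],
        ["gainesville", "dealer", "car", "vehicle"],
        ["dog", "person", "room"]]
vocab, emb = train_skipgram(docs)
```

With realistic data, frequently co-occurring tokens such as car and wheel end up with nearby vectors, which is what the embedding figure illustrates.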

SLIDE 7

Step 3: Multimodal Prediction

  • Features: each multimodal document is converted into a bag of vectors.
  • Model: sparse logistic regression.
  • Training: train one model per image category, where positive samples are multimodal documents whose images contain detected objects of that category, and negative samples are the rest.
  • Scoring: apply the prediction model to multimodal documents; the final confidence score combines p and q, where p is a [0, 1] confidence score given by the logistic regression and q is the confidence score given by the object detector. Thus, p mostly carries text information, while q carries visual information.

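A compact sketch of this prediction step, assuming the bag of vectors is summarized by its mean and using proximal gradient descent for the L1 (sparsity) penalty. The slide does not show how p and q are combined, so the product used at the end is purely an illustrative assumption:

```python
import numpy as np

def doc_feature(doc, embeddings, dim):
    """Summarize a multimodal document's bag of vectors by their mean."""
    vecs = [embeddings[w] for w in doc if w in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def train_sparse_logreg(X, y, lam=0.001, lr=0.1, epochs=500):
    """Logistic regression with an L1 penalty via proximal gradient
    descent; the soft-threshold step keeps the weight vector sparse."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * (X.T @ (p - y) / n)
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)  # prox step
        b -= lr * float(np.mean(p - y))
    return w, b

def text_score(w, b, x):
    """p: the [0, 1] confidence from the text-side model."""
    return float(1.0 / (1.0 + np.exp(-(x @ w + b))))

def combined_score(p, q):
    # ASSUMPTION: the slide's combination formula is not shown;
    # a simple product p * q is used here only for illustration.
    return p * q
```

One model of this form is trained per image category, with documents containing that category's detections as positives and the rest as negatives.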

SLIDE 8

Dataset

We prepared our own dataset by these steps:

1. Parsed hundreds of millions of HTML web pages.
2. Extracted image URLs and meta text.
3. Calculated the text tags of each image with an image-tagging program (using TF-IDF to measure the importance of each tag).
4. Downloaded the images using Amazon S3 and EC2 in a distributed manner.

This results in a collection of around 10 million tagged images for our study.
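The tag weighting in step 3 can be illustrated with a small TF-IDF computation; the smoothing variant below is an assumption, since the slide does not specify one:

```python
import math
from collections import Counter

def tfidf_scores(doc_tags, corpus):
    """Score one image's candidate tags by TF-IDF over a corpus of
    tag lists; higher scores mark more distinctive tags."""
    n_docs = len(corpus)
    tf = Counter(doc_tags)
    scores = {}
    for tag, freq in tf.items():
        df = sum(1 for tags in corpus if tag in tags)
        idf = math.log(n_docs / (1 + df)) + 1.0  # smoothed IDF (one common variant)
        scores[tag] = (freq / len(doc_tags)) * idf
    return scores
```

Boilerplate tags that appear on nearly every page (e.g. "image") receive low scores, while distinctive tags like "car" score high and survive as the image's text tags.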


SLIDE 9

Evaluation Methods

  • We compare our approach to a baseline system that uses only image information.
  • Extract all objects (of the 46 categories) from the 10 million images, using both the baseline and the proposed approach.
  • Calculate the precision of the retrieved documents based on the top 1,000 results of each category, where the correctness of a retrieval is judged by human workers.
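This evaluation reduces to precision at k over the human judgments; a minimal helper (the judgment lists in the example are hypothetical):

```python
def precision_at_k(judgments, k=1000):
    """judgments: correctness labels (True/False, from human workers) for
    results ranked by confidence score, best first. Returns the fraction
    of the top-k results judged correct."""
    top = judgments[:k]
    return sum(top) / len(top) if top else 0.0
```

Per-category precision values computed this way are what the quantitative results on the next slide compare between the baseline and the proposed system.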


SLIDE 10

Quantitative Results


[Chart: per-category extraction precision for Baseline vs. Proposed, together with the number of extracted objects (#objects) per category.]

SLIDE 11

Illustrative Examples


SLIDE 12

Conclusions and Future Work

  • Proposed a multimodal embedding based retrieval method to enhance visual knowledge extraction from Web images.
  • Shows a significant improvement over the unimodal approach (72.95% -> 81.43%).

In the future, we will continue to explore other algorithms, such as graphical models and deep learning models, to make more efficient use of multimodal information.


SLIDE 13

Thank You!

About Me (Dihong Gong)

Education: 5th-year Ph.D. student, CS Department, University of Florida (United States).
Research Interests: Multimodal Content Analysis, Information Extraction, Computer Vision.
Email: gongd@ufl.edu
More Info: www.cise.ufl.edu/~dihong
Looking for full-time research scientist positions!
