SLIDE 1

Extracting Visual Knowledge from the Web with Multimodal Learning

Dihong Gong, Daisy Zhe Wang

Data Science Research Lab, Department of Computer and Information Science and Engineering, University of Florida

Presented at the International Joint Conference on Artificial Intelligence (Melbourne, 2017)

SLIDE 2

Problem Description

Goal: extract visual knowledge from a large collection of web pages.


SLIDE 3

Related Work: Web Visual Knowledge Extraction

Chen, Xinlei, Abhinav Shrivastava, and Abhinav Gupta. "NEIL: Extracting Visual Knowledge from Web Data". ICCV 2013.

SLIDE 4

Motivation: Traditional Approaches vs. Our Approach

[Example web image, captioned: “A girl is playing with a sleeping dog in a room.”]

Traditional approach (Inputs → Visual Detection → Extraction Outputs): the detector produces bounding boxes (G-BBOX: Chair, R-BBOX: Dog, Y-BBOX: Person), while the accompanying text is discarded.

Our approach: the same Visual Detection is combined with Text Processing in a Multimodal Prediction step, utilizing the text for enhanced prediction.

SLIDE 5

Step 1: Candidate Objects Extraction

  • Fast R-CNN with the CaffeNet model
  • Geodesic Object Proposals for candidate box generation
  • Trained on 46 categories from ImageNet (~250 samples per category)
  • Apply the trained object detector to ~10 million web images to extract objects
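The extraction step above amounts to a post-processing loop over detector outputs: keep per-category boxes above a confidence threshold and suppress overlapping duplicates. A minimal sketch in plain Python (the `detections` input format and both thresholds are illustrative assumptions, not the paper's actual parameters):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def extract_objects(detections, score_thresh=0.5, iou_thresh=0.3):
    """detections: list of (category, score, box) from a detector.
    Returns confident, non-overlapping detections per category
    (greedy non-maximum suppression)."""
    kept = []
    for cat in {c for c, _, _ in detections}:
        cand = sorted((d for d in detections
                       if d[0] == cat and d[1] >= score_thresh),
                      key=lambda d: -d[1])
        for d in cand:
            # keep a box only if it does not heavily overlap a kept box
            if all(iou(d[2], k[2]) < iou_thresh for k in kept if k[0] == cat):
                kept.append(d)
    return kept
```

Applied over ~10 million images, each surviving detection becomes one candidate object for the later multimodal steps.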
SLIDE 6

Step 2: Learning Common Representations

  • Convert each image into a multimodal document: m[k] = {“Vehicle”, “car”, “wheel”}
  • Given training documents {m[1], m[2], …, m[N]}, apply a Skip-Gram based word2vec program to learn vector embeddings, where each multimodal document is treated as a single sentence in the text domain.
  • Multimodal embeddings: objects with similar semantic meanings are closer together.

[Figure: example 2D embeddings. Text side: “Gainesville vehicle dealer for 20 years!”; image side: detections wheel (0.8), car (0.9). In the embedding space, wheel, car, vehicle, dealer, and Gainesville appear, with semantically related terms close together.]
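A rough sketch of this step: the code below trains Skip-Gram embeddings with negative sampling over toy multimodal documents, each treated as one sentence, so every token in a document is context for every other token. A real pipeline would use an off-the-shelf word2vec implementation; the corpus, dimensionality, and hyperparameters here are illustrative assumptions:

```python
import numpy as np

def train_skipgram(docs, dim=16, epochs=200, lr=0.05, neg=3, seed=0):
    """Minimal skip-gram with negative sampling. Each multimodal document
    is one 'sentence': every ordered pair of distinct tokens in a document
    is a (center, context) training pair."""
    rng = np.random.default_rng(seed)
    vocab = sorted({w for d in docs for w in d})
    idx = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    W_in = rng.normal(scale=0.1, size=(V, dim))   # embedding vectors
    W_out = rng.normal(scale=0.1, size=(V, dim))  # context vectors
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    for _ in range(epochs):
        for doc in docs:
            ids = [idx[w] for w in doc]
            for c in ids:
                for o in ids:
                    if o == c:
                        continue
                    v = W_in[c].copy()
                    # positive pair: push center and context together
                    g = lr * (1.0 - sigmoid(v @ W_out[o]))
                    W_in[c] += g * W_out[o]
                    W_out[o] += g * v
                    # negative samples: push random tokens away
                    for n in rng.integers(0, V, size=neg):
                        g = lr * sigmoid(v @ W_out[n])
                        W_in[c] -= g * W_out[n]
                        W_out[n] -= g * v
    return vocab, W_in

# Toy multimodal documents mixing detector labels and page-text tags
docs = [["vehicle", "car", "wheel"],
        ["gainesville", "dealer", "car", "vehicle"],
        ["dog", "person", "room"]]
vocab, emb = train_skipgram(docs)
```

With realistic data, frequently co-occurring tokens such as car and wheel end up with nearby vectors, which is what the embedding figure illustrates.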

SLIDE 7

Step 3: Multimodal Prediction

  • Features: each multimodal document is converted into a bag of vectors.
  • Model: sparse logistic regression.
  • Training: train one model per image category, where positive samples are multimodal documents whose images contain detected objects of that category, and negative samples are the rest.
  • Scoring: apply the prediction model to multimodal documents; the final confidence score combines p and q, where p is a [0, 1] confidence score given by the logistic regression and q is the confidence score given by the object detector. Thus, p mostly carries text information, while q carries visual information.

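A compact sketch of this prediction step, assuming the bag of vectors is summarized by its mean and using proximal gradient descent for the L1 (sparsity) penalty. The slide does not show how p and q are combined, so the product used at the end is purely an illustrative assumption:

```python
import numpy as np

def doc_feature(doc, embeddings, dim):
    """Summarize a multimodal document's bag of vectors by their mean."""
    vecs = [embeddings[w] for w in doc if w in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def train_sparse_logreg(X, y, lam=0.001, lr=0.1, epochs=500):
    """Logistic regression with an L1 penalty via proximal gradient
    descent; the soft-threshold step keeps the weight vector sparse."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * (X.T @ (p - y) / n)
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)  # prox step
        b -= lr * float(np.mean(p - y))
    return w, b

def text_score(w, b, x):
    """p: the [0, 1] confidence from the text-side model."""
    return float(1.0 / (1.0 + np.exp(-(x @ w + b))))

def combined_score(p, q):
    # ASSUMPTION: the slide's combination formula is not shown;
    # a simple product p * q is used here only for illustration.
    return p * q
```

One model of this form is trained per image category, with documents containing that category's detections as positives and the rest as negatives.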

SLIDE 8

Dataset

We prepared our own dataset by these steps:

1. Parsed hundreds of millions of HTML web pages.
2. Extracted image URLs and meta text.
3. Calculated the text tags of each image with an image-tagging program (using TF-IDF to measure the importance of each tag).
4. Downloaded the images using Amazon S3 and EC2 in a distributed manner.

This results in a collection of around 10 million tagged images for our study.
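The tag weighting in step 3 can be illustrated with a small TF-IDF computation; the smoothing variant below is an assumption, since the slide does not specify one:

```python
import math
from collections import Counter

def tfidf_scores(doc_tags, corpus):
    """Score one image's candidate tags by TF-IDF over a corpus of
    tag lists; higher scores mark more distinctive tags."""
    n_docs = len(corpus)
    tf = Counter(doc_tags)
    scores = {}
    for tag, freq in tf.items():
        df = sum(1 for tags in corpus if tag in tags)
        idf = math.log(n_docs / (1 + df)) + 1.0  # smoothed IDF (one common variant)
        scores[tag] = (freq / len(doc_tags)) * idf
    return scores
```

Boilerplate tags that appear on nearly every page (e.g. "image") receive low scores, while distinctive tags like "car" score high and survive as the image's text tags.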


SLIDE 9

Evaluation Methods

  • We compare our approach to a baseline system that uses only image information.
  • Extract all objects (of the 46 categories) from the 10 million images, using both the baseline and the proposed approach.
  • Calculate the precision of the retrieved documents based on the top 1,000 results of each category, where the correctness of a retrieval is judged by human workers.
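This evaluation reduces to precision at k over the human judgments; a minimal helper (the judgment lists in the example are hypothetical):

```python
def precision_at_k(judgments, k=1000):
    """judgments: correctness labels (True/False, from human workers) for
    results ranked by confidence score, best first. Returns the fraction
    of the top-k results judged correct."""
    top = judgments[:k]
    return sum(top) / len(top) if top else 0.0
```

Per-category precision values computed this way are what the quantitative results on the next slide compare between the baseline and the proposed system.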


SLIDE 10

Quantitative Results


[Chart: per-category extraction precision for Baseline vs. Proposed, together with the number of extracted objects (#objects) per category.]

SLIDE 11

Illustrative Examples


SLIDE 12

Conclusions and Future Work

  • Proposed a multimodal embedding based retrieval method to enhance visual knowledge extraction from Web images.
  • Shows a significant improvement over the unimodal approach (72.95% -> 81.43%).

In the future, we will continue to explore other algorithms, such as graphical models and deep learning models, to make more efficient use of multimodal information.


SLIDE 13

Thank You!

About Me (Dihong Gong)

Education: 5th-year Ph.D. student, CS Department, University of Florida (United States).
Research Interests: Multimodal Content Analysis, Information Extraction, Computer Vision.
Email: gongd@ufl.edu
More Info: www.cise.ufl.edu/~dihong
Looking for full-time research scientist positions!
