
SLIDE 1

Multimodal Learning for Image Captioning and Visual Question Answering

Xiaodong He

Deep Learning Technology Center, Microsoft Research

UC Berkeley, April 7th, 2016

SLIDE 2

SLIDE 3

SLIDE 4

Knowledge Vision Text

Barack Obama is an American politician serving as the 44th President of the United States. Born in Honolulu, Hawaii, … in 2008, he defeated the Republican nominee and was inaugurated as president on January 20, 2009.

(Wikipedia.org)

http://s122.photobucket.com/user/bmeuppls/media/stampede.jpg.html

Freebase

SLIDE 5

a man holding a tennis racquet on a tennis court

the man is on the tennis court playing a game

Image Captioning (one step from perception to cognition)

Describe objects, attributes, and relationships in an image in natural-language form

SLIDE 6

Two entries tied at the 1st place at COCO 2015 Caption Challenge

Adopted encoder-decoder framework from machine translation. Popular: Google, Montreal, Stanford, Berkeley

Visual concept detection => caption candidates generation => deep semantic ranking

Compositional framework can potentially exploit non-paired image-caption data more effectively

[Fang, Gupta, Iandola, Srivastava, Deng, Dollar, Gao, He, Mitchell, Platt, Zitnick, Zweig, “From Captions to Visual Concepts and Back,” CVPR, June 2015]

[Vinyals, Toshev, Bengio, Erhan, “Show and Tell: A Neural Image Caption Generator,” CVPR, June 2015]

SLIDE 7

sitting

SLIDE 8

SLIDE 9

SLIDE 10

cabinets

wooden kitchen sink cabinets

Repeat to generate 500 candidates

floor room stove

[Fang, et al., CVPR 2015]

SLIDE 11

Huang, He, Gao, Deng, Acero, Heck, “Learning Deep Structured Semantic Model for Web Search,” CIKM, 2013

SLIDE 12

SLIDE 13

[Figure: network architecture, bottom to top]

Word sequence: x_t (<s> w1 w2 ... wT <s>), e.g. "a man … bench"
Word hashing layer: f_t (15K per word), word hashing matrix: W_f
Convolutional layer: h_t (500), convolution matrix: W_c
Max pooling layer: v (500), max pooling operation
Semantic layer: y (500), semantic projection matrix: W_s
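The slide only names a "word hashing layer". As a rough sketch, assuming the letter-trigram scheme described in the cited DSSM paper (Huang et al., CIKM 2013), a word could be turned into trigram counts as follows. The tiny `trigram_index` vocabulary is hypothetical and stands in for the 15K-dimensional layer in the figure.

```python
from collections import Counter


def letter_trigrams(word):
    """Return the letter trigrams of a word, with '#' marking the word boundaries."""
    marked = "#" + word.lower() + "#"
    return [marked[i:i + 3] for i in range(len(marked) - 2)]


def word_hash(word, trigram_index):
    """Map a word to a sparse {position: count} vector over a trigram vocabulary.

    trigram_index is a hypothetical dict {trigram: position}; trigrams that
    are missing from it are simply dropped.
    """
    counts = Counter(letter_trigrams(word))
    return {trigram_index[t]: c for t, c in counts.items() if t in trigram_index}


# Toy vocabulary for illustration only (a real one would hold thousands of trigrams).
trigram_index = {t: i for i, t in enumerate(
    ["#go", "goo", "ood", "od#", "#ma", "man", "an#"])}

print(letter_trigrams("good"))
print(word_hash("man", trigram_index))
```

Because the vocabulary is over letter trigrams rather than whole words, unseen or misspelled words still map to a mostly overlapping vector.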

SLIDE 14

– What does the model learn at the convolutional layer?

Capture the local context-dependent word sense

Learn one embedding vector for each local context-dependent word

Cosine similarity to "car body shop":

Phrase                  Cosine similarity
car body kits           0.698
auto body repair        0.578
auto body parts         0.555
wave body language      0.301
calculate body fat      0.220
forcefield body armour  0.165

The similarity between different senses of “body” across contexts

[Figure: the phrases wave body language, car body kits, auto body part, auto body repair, car body shop, forcefield body armour, and calculate body fat placed in the semantic space, with a scale from high similarity to low similarity]

auto body repair …

$h_t = W_c \times [f_{t-1}, f_t, f_{t+1}]$
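The sliding-window product above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: it pads the sequence with zero vectors where the figure uses `<s>` boundary tokens, it applies no nonlinearity (the slide's formula shows none), and the tiny vectors and `W_c` values are made up.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def conv_layer(f, W_c):
    """Compute h_t = W_c x [f_{t-1}, f_t, f_{t+1}] for every position t.

    f   : list of T word vectors, each of dimension d
    W_c : list of k rows, each of length 3 * d
    Assumption: sequence boundaries are padded with zero vectors.
    """
    d = len(f[0])
    pad = [0.0] * d
    padded = [pad] + f + [pad]
    h = []
    for t in range(1, len(padded) - 1):
        # Concatenate the previous, current, and next word vectors.
        window = padded[t - 1] + padded[t] + padded[t + 1]
        h.append(matvec(W_c, window))
    return h


# Three words with d = 2, and k = 2 output features (illustrative values).
f = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_c = [
    [1, 0, 0, 0, 0, 0],  # copies the first component of the previous word
    [0, 0, 1, 1, 0, 0],  # sums the components of the current word
]
h = conv_layer(f, W_c)
print(h)
```

Each output vector depends on a word and its two neighbours, which is what lets the same word ("body") receive different vectors in different contexts.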

SLIDE 15

Global intent

$v(i) = \max_{t=1,\dots,T} h_t(i), \quad i = 1, \dots, 300$

Example: auto body repair cost calculator software

Words that win the most active neurons at the max-pooling layer: usually, those are salient words containing clear intents/topics

[Figure: local feature vectors $h_1, h_2, \dots, h_T$ max-pooled into $v$]
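The max-pooling step, and the idea of counting which word "wins" each neuron, can be sketched as below. The feature values are invented for illustration, and breaking ties toward the earliest position is an assumption.

```python
from collections import Counter


def max_pool(h):
    """Return (v, winners): v[i] is the max over t of h[t][i],
    and winners[i] is the position t that attained it (first one on ties)."""
    k = len(h[0])
    v, winners = [], []
    for i in range(k):
        column = [h_t[i] for h_t in h]
        best_t = max(range(len(column)), key=column.__getitem__)
        v.append(column[best_t])
        winners.append(best_t)
    return v, winners


# Made-up local features for three words and four neurons.
words = ["auto", "body", "repair"]
h = [
    [0.9, 0.1, 0.2, 0.4],  # auto
    [0.3, 0.2, 0.1, 0.0],  # body
    [0.5, 0.7, 0.6, 0.5],  # repair
]
v, winners = max_pool(h)
wins = Counter(words[t] for t in winners)

print(v)        # pooled vector
print(winners)  # winning position per neuron
print(wins)     # how many neurons each word wins
```

Words with the highest win counts are the ones the slide describes as carrying the clearest intent or topic.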

SLIDE 16

[Plot: Mean Reciprocal Rank % (ranking among 5000 candidates on the 5K validation set). Curves: CDSSM d=300, CDSSM d=1000, DSSM d=300. Vertical axis from 0.25 to 0.33, horizontal axis from 1 to 193.]

[Plot: Harmonic Mean Rank (ranking among 5000 candidates on the 5K val set). Curves: CDSSM d=300, CDSSM d=1000, DSSM d=300. Vertical axis from 3 to 4, horizontal axis from 1 to 197.]
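For reference, both metrics can be computed from the rank that each ground-truth caption receives among its candidates. This sketch uses the standard definitions, which is an assumption about how the plots were produced; the example ranks are made up.

```python
def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all test items (rank 1 = best)."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def harmonic_mean_rank(ranks):
    """Harmonic mean of the ranks: N divided by the sum of reciprocal ranks."""
    return len(ranks) / sum(1.0 / r for r in ranks)


# Hypothetical ranks of the true caption for three images.
ranks = [1, 2, 4]
mrr = mean_reciprocal_rank(ranks)
hmr = harmonic_mean_rank(ranks)
print(mrr, hmr)
```

Under these definitions the harmonic mean rank is the reciprocal of the mean reciprocal rank, which is consistent with the two axis ranges (roughly 0.25 to 0.33 and 3 to 4).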

SLIDE 17

Turing Test Results at the MS COCO Captioning Challenge 2015

System             % of captions that pass the Turing Test   Official Rank
MSR                32.2%                                     1st
Google             31.7%                                     1st
MSR Captivator     30.1%                                     3rd
Montreal/Toronto   27.2%                                     3rd
Berkeley LRCN      26.8%                                     5th
Human              67.5%

Other groups: Baidu/UCLA, Stanford, Tsinghua, etc.

Still a big gap!

SLIDE 18

SLIDE 19

System                 BLEU   % Better or Equal to Human
Model 1: MELM + DMSM   25.7   34.0%
Model 2: MRNN          25.7   29.0%

Devlin, Cheng, Fang, Gupta, Deng, He, Zweig, and Mitchell, “Language Models for Image Captioning: The Quirks and What Works,” ACL 2015

Human judges are shown the generated caption and a human caption, and choose which is “better”, or mark them as equal.

SLIDE 20

SLIDE 21

Example:
MELM+DMSM: “A plate with a sandwich and a cup of coffee”
MRNN: “A close up of a plate of food” (more generic)

SLIDE 22

SLIDE 23

[Figure: captioning pipeline]

Components: ConvNets, features vector, visual concepts, celebrity, landmark, language model, DMSM, confidence model
Example outputs: "A small boat in Ha Long Bay"; "This image contains: water, boat, lake, mountain, etc."
Confidence scale: low to high

[Kenneth Tran, Xiaodong He, Lei Zhang, Jian Sun, Cornelia Carapcea, Chris Thrasher, Chris Buehler, Chris Sienkiewicz, submitted to CVPR Deep Vision 2016]
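One plausible reading of the figure is that the confidence model selects between a full caption and a plain list of detected concepts. The sketch below illustrates that reading only: the mapping of low confidence to the concept list, the function name `render_output`, and the threshold value are all assumptions, not details taken from the slide.

```python
def render_output(caption, confidence, concepts, threshold=0.4):
    """Return the caption when confidence is high enough, else a concept list.

    threshold is a hypothetical cut-off; the slide does not give one.
    """
    if confidence >= threshold:
        return caption
    return "This image contains: " + ", ".join(concepts) + ", etc."


concepts = ["water", "boat", "lake", "mountain"]
print(render_output("A small boat in Ha Long Bay", 0.8, concepts))
print(render_output("A small boat in Ha Long Bay", 0.1, concepts))
```

Falling back to a concept list trades descriptive detail for a lower risk of producing a confidently wrong sentence.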

SLIDE 24

[He, Zhang, Ren, Sun, 2015]

SLIDE 25

The deep multimodal semantic model

Semantic space:

The overall semantics of a caption will also be represented by a vector in this space. If these two vectors are close to each other, then the caption is a good match for the image. Otherwise, not a matching caption.
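The matching rule described above can be sketched as a cosine-similarity ranking. This assumes the image and each caption have already been mapped to vectors in the shared space; the three-dimensional vectors here are made up, and cosine similarity is assumed as the closeness measure, as in the DSSM slides earlier in the deck.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_captions(image_vec, caption_vecs):
    """Return (score, caption) pairs sorted from best match to worst."""
    scored = [(cosine(image_vec, vec), caption)
              for caption, vec in caption_vecs.items()]
    return sorted(scored, reverse=True)


# Hypothetical embeddings in a 3-dimensional semantic space.
image_vec = [1.0, 0.0, 1.0]
caption_vecs = {
    "a man holding a tennis racquet on a tennis court": [0.9, 0.1, 0.8],
    "a close up of a plate of food": [0.0, 1.0, 0.1],
}
ranked = rank_captions(image_vec, caption_vecs)
for score, caption in ranked:
    print(round(score, 3), caption)
```

The top-ranked caption is the one whose vector points in nearly the same direction as the image vector.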

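[Figure: two branches mapping into the semantic space]

Image branch: raw image pixels, convolution/pooling, fully connected, image feature (input s), hidden layers H1, H2, H3 with weights W1, W2, W3, W4
Text branch: "a man holding a tennis racquet on a tennis court" (input t1), hidden layers H1, H2, H3 with weights W1, W2, W3, W4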

[Fang, et al., CVPR 2015] [Huang, He, Gao, Deng et al., 2013] [He, Zhang, Ren, Sun, 2015]

SLIDE 26

[Guo, Zhang, Hu, He, Gao, 2016]

SLIDE 27

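[Figure: two branches mapping into the semantic space]

Image branch: image (input s), hidden layers H1, H2, H3 with weights W1, W2, W3, W4
Caption branch: "a man holding a tennis racquet on a tennis court" (input t1), hidden layers H1, H2, H3 with weights W1, W2, W3, W4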

SLIDE 28

System              Excellent   Good    Bad     Embarrassing
Fang et al., 2015   40.6%       26.8%   28.8%   3.8%
New system          51.8%       23.4%   22.5%   2.4%

Human evaluation on 1000 random samples of the COCO test set.

SLIDE 29

System              Excellent   Good    Bad     Embarrassing
Fang et al., 2015   12.0%       13.4%   63.0%   11.6%
New system          25.4%       24.1%   45.3%   5.2%

Human evaluation on Instagram test set, which contains 1380 random images that we scraped from Instagram.

SLIDE 30

Conf. score   Excellent   Good   Bad    Embarrassing
Mean          0.59        0.51   0.26   0.20
Std dev       0.21        0.23   0.21   0.19

SLIDE 31

Above: Fang2015
Below: Ours

a man wearing a suit and tie
Ian Somerhalder wearing a suit and tie

a man taking a picture in front of a mirror
an picture about person

a woman standing in front of a christmas tree
a woman standing next to a window

a black and white photo of a man wearing a hat
a man posing for a picture

a man on a skateboard
this picture is about photo

a man holding a stop sign
a man holding a stop sign

a colorful kite flying in the air
a table topped with a kite

a couple of people at night
a fire hydrant that is lit up at night

a black and white photo of a man wearing a hat
a man wearing a bow tie looking at the camera

a view of a sunset over water
a view of a sunset in the ocean

a dog sitting on top of a grass covered field
a dog sitting in the grass

a man holding a baseball bat at a ball
a man swinging a baseball bat in front of a crowd

SLIDE 32

a woman sitting on a couch
this picture is about person

a woman holding a red umbrella
the image is about person

two women standing in front of a cake
a woman posing for a picture

a man holding a baseball bat on a field
a boy standing in front of a building

a person holding a cell phone
a hand holding a cell phone

a man holding a teddy bear
a picture about table

a pair of scissors sitting on top of a table
a bunch of different items

a woman sitting on a bench
a woman sitting on a bench

a black and white photo of a woman brushing her hair
a woman standing in front of a mirror

a man and a woman wearing a tie
a couple posing for a photo

a pair of scissors
the image is about clothing

a group of pictures on the wall
this picture seems contain text

SLIDE 33

http://CaptionBot.ai Cognitive Services

SLIDE 34

SLIDE 35

When Jen-Hsun Huang was giving a keynote showing off a GPU-powered VR visit to Mt. Everest, here is what our CaptionBot had to say.

SLIDE 36

SLIDE 37

SLIDE 38

Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Smola, "Stacked Attention Networks for Image Question Answering," CVPR 2016 (oral)

SLIDE 39

SLIDE 40

SLIDE 41

SLIDE 42

SLIDE 43

SLIDE 44

Big improvement

SLIDE 45

umbrella

SLIDE 46

SLIDE 47

SLIDE 48

a herd of elephants standing next to a man
a herd of elephants standing next to Obama
Obama is chased by his Republican competitors

Image credit: http://s122.photobucket.com/user/bmeuppls/media/stampede.jpg.html

Republican Party

Obama: the president, from the Democratic Party, whose competitor is the Republican Party, whose mascot is the elephant

Who is that person? What is behind that man? Why are these elephants chasing him?

SLIDE 49

Character-Level Question Answering with Attention
Reasoning in Vector Space: An Exploratory Study of Question Answering
Deep Reinforcement Learning with an Action Space Defined by Natural Language
Semantic Parsing via Staged Query Graph Generation: Question Answering with Knowledge Base
Semantic Parsing for Single-Relation Question Answering
Embedding Entities and Relations for Learning and Inference in Knowledge Bases
Learning Deep Structured Semantic Models for Web Search using Clickthrough Data
Representation Learning Using Multi-Task Deep Neural Networks for Semantic Classification and Information Retrieval

SLIDE 50