Multimodal Knowledge Graphs: Generation Methods, Applications, and Challenges

SLIDE 1

Multimodal Knowledge Graphs

Generation Methods, Applications, and Challenges

Shih-Fu Chang, Alireza Zareian, Hassan Akbari, Brian Chen, Svebor Karaman, Zhecan James Wang, and Haoxuan You
Columbia University

Prof. Heng Ji, Manling Li, Di Lu, and Spencer Whitehead
University of Illinois, Urbana-Champaign

SLIDE 2

Knowledge Graphs

- Entities, events, relations, etc.

[Figure: Text IE output with nodes "Visit", "Israel", "Prince William"]

The first-ever official visit by a British royal to Israel is underway. Prince William, the 36-year-old Duke of Cambridge and second in line to the throne, will meet with both Israeli and Palestinian leaders over the next three days.

SLIDE 3

Knowledge Graphs

- Entities, events, relations, etc.
- Events describe what happens
- Entities are characterized by the argument role they play in events

[Figure: Text IE output with event "Visit" and arguments Agent = "Prince William", Destination = "Israel"]

The first-ever official visit by a British royal to Israel is underway. Prince William, the 36-year-old Duke of Cambridge and second in line to the throne, will meet with both Israeli and Palestinian leaders over the next three days.
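As a rough illustration of the structure on this slide, the "Visit" event and its argument roles could be stored as follows. This is only a sketch: the class names `Entity` and `Event`, the `arguments` field, and the helper `roles_of` are hypothetical names chosen here, not something taken from the slides.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    """A node in the knowledge graph, e.g. a person or a place."""
    name: str


@dataclass
class Event:
    """An event node; `arguments` maps a role name to the entity filling it."""
    event_type: str
    arguments: dict = field(default_factory=dict)


# Hypothetical encoding of the example extracted by text IE.
prince_william = Entity("Prince William")
israel = Entity("Israel")
visit = Event("Visit", {"Agent": prince_william, "Destination": israel})


def roles_of(entity, events):
    """Return the (event type, role) pairs that characterize an entity."""
    return [(ev.event_type, role)
            for ev in events
            for role, arg in ev.arguments.items()
            if arg == entity]


print(roles_of(israel, [visit]))
```

Looking an entity up by the roles it plays mirrors the statement that entities are characterized by their argument roles in events.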

SLIDE 4

Knowledge Graphs

- Application: Question Answering, Reasoning, Hypothesis Verification and Discovery

Query: Find recent visits of politicians to Israel.
Answers:

[Figure: Text IE output with event "Visit" and arguments Agent = "Prince William", Destination = "Israel"]

The first-ever official visit by a British royal to Israel is underway. Prince William, the 36-year-old Duke of Cambridge and second in line to the throne, will meet with both Israeli and Palestinian leaders over the next three days.
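A query like the one on this slide can be answered by filtering events on their type and argument roles. The sketch below is a toy under stated assumptions: the second event, the `entity_types` lookup (which treats Prince William as a politician, following the slide's query), and the function name `find_visits` are all made up for illustration, and the "recent" constraint is ignored because the example carries no timestamps.

```python
# Toy knowledge graph: each event is (event type, {role: entity name}).
# The second event and the entity-type lookup are made-up illustrations.
events = [
    ("Visit", {"Agent": "Prince William", "Destination": "Israel"}),
    ("Visit", {"Agent": "Tourist group", "Destination": "Israel"}),
]
entity_types = {"Prince William": "politician", "Tourist group": "group"}


def find_visits(events, agent_type, destination):
    """Return agents of Visit events matching the agent type and destination."""
    answers = []
    for event_type, args in events:
        if event_type != "Visit":
            continue
        if args.get("Destination") != destination:
            continue
        if entity_types.get(args.get("Agent")) == agent_type:
            answers.append(args["Agent"])
    return answers


print(find_visits(events, "politician", "Israel"))
```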

SLIDE 5

Knowledge Beyond Text

- We communicate through multimedia
- Our experiment shows 34% of news images contain event arguments that are not mentioned in text

Example: TransportPerson_Instrument = stretcher

[Figure: news image with labels "Stretcher" and "Fire"]

SLIDE 6

Why Multimodal?

- Visual data contains complementary data used for:
  - Visual Illustration
  - Disambiguation
  - Additional Details

[Figure: multimodal event graph. Node labels: Attack, Protesters, Bus, Stone, Transport, Wounded protester, Supporters, Rally. Role labels: Agent, Target, Instrument, Person, Destination]

SLIDE 7

Challenges & Applications

- Challenges:
  - Parsing images/videos to structures
  - Grounding event/entities across modalities
  - Extracting complementary multimodal arguments

[Figure: pipeline diagram. Labels: Text IE, Visual IE, Text graph, Scene graph, "?", Multi-Modal Knowledge Graph, Application]

SLIDE 8

Challenge 1: Parsing Images to Scene Graphs

- Extract structured representation of a scene
- Entities and their semantic relationships

[Figure label: Object Detection]

SLIDE 9

Parsing Images to Scene Graphs

- Existing method
  - Extract object proposals
  - Contextualize features by RNN (or message passing)
  - Classify all nodes and pairs of nodes
- Limitations
  - Computationally exhaustive
    - O(n^2) for n = 100 proposals
  - Difficult to model higher-order relationships, e.g. "girl eating cake with fork"
  - Requires full supervision

(Xu et al., CVPR 2017)

Neural Motifs (Zellers, Yatskar, Thomson, Choi, CVPR 2018): one of the SOTA methods for scene graph generation
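To make the cost of classifying all pairs of nodes concrete, the arithmetic for 100 proposals can be worked out as below. Counting ordered pairs is an assumption made here, on the grounds that a relation such as "girl eating cake" is directional; the function name is illustrative.

```python
def num_ordered_pairs(n):
    """Number of ordered (subject, object) pairs among n proposals."""
    return n * (n - 1)


n = 100
print(num_ordered_pairs(n))  # pair classifications needed for 100 proposals
```

Doubling the number of proposals roughly quadruples the number of pairs, which is the quadratic growth listed under the limitations.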

SLIDE 10

Reformulate as an Event-Centric Problem

- Our work: Visual Semantic Parsing Network (Zareian et al., CVPR20)
- Generalized formulation of scene graph generation
  - Entity-centric → bipartite representation of predicates & entities
  - Reduce computational complexity from O(n^2) to sub-quadratic
  - Model argument role relations beyond (subject, object), (agent, patient) relations

[Figure: bipartite graph with predicate nodes "eating", "holding", "belongs" and entity nodes "Girl", "Cake", "Hand", "Fork", linked by role edges "agent", "patient", "instrument"]
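The bipartite representation can be sketched as two node sets plus role-labeled edges between them. The snippet below is a minimal illustration, not the authors' implementation: it encodes only "girl eating cake with fork" (the example given earlier in the deck), since the slide text does not spell out the arguments of the other predicates, and the helper names are hypothetical.

```python
# Entity and predicate nodes form the two sides of a bipartite graph.
entities = ["Girl", "Cake", "Hand", "Fork"]
predicates = ["eating"]

# Each edge links one predicate to one entity and carries an argument role.
# Only "girl eating cake with fork" is encoded here.
edges = [
    ("eating", "agent", "Girl"),
    ("eating", "patient", "Cake"),
    ("eating", "instrument", "Fork"),
]


def arguments_of(predicate, edges):
    """Collect the role -> entity mapping for one predicate node."""
    return {role: entity for pred, role, entity in edges if pred == predicate}


def as_triplet(predicate, edges):
    """Recover the (subject, predicate, object) view from agent and patient."""
    args = arguments_of(predicate, edges)
    return (args.get("agent"), predicate, args.get("patient"))


print(arguments_of("eating", edges))
print(as_triplet("eating", edges))
```

A plain (subject, predicate, object) triplet has no slot for the fork, whereas the role-labeled edges attach it as an instrument of the same predicate node.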

SLIDE 11

Reformulate as an Event-Centric Problem

- Our work: Visual Semantic Parsing Network (Zareian et al., CVPR20)
- Generalized formulation of scene graph generation
  - Entity-centric → bipartite representation of predicates & entities
  - Reduce computational complexity from O(n^2) to sub-quadratic
  - Model argument role relations beyond (subject, object), (agent, patient) relations

[Figure: bipartite graph with predicate nodes "eating", "holding", "belong" and entity nodes "Girl", "Cake", "Hand", "Fork", linked by role edges "agent", "patient", "instrument"]

SLIDE 12

Bipartite Embeddings for Entity & Predicate

[Figure: two sets of indexed node embeddings (entity nodes and predicate nodes); labels: RPN, RoIAlign, Trainable Predicate Embedding Bank]

SLIDE 13

Argument Role Prediction

- Initialize entity and predicate nodes
- Compute role-specific attention scores
  - Input: entity-predicate feature pairs
  - Output: scalar for each thematic role

[Figure: entity and predicate node embeddings pass through FC layers to produce scores for the roles "agent", "patient", "instrument"]
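One plausible way to turn an entity-predicate feature pair into one scalar per role is sketched below. The slide only states that FC layers map the pair to a scalar for each thematic role; the dot product between the two projections, the sigmoid, and all weights and feature values here are assumptions made for illustration.

```python
import math


def linear(weights, x):
    """Apply a bias-free fully connected layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def role_scores(entity_feat, predicate_feat, role_layers):
    """Return one scalar in (0, 1) per role for an entity-predicate pair."""
    scores = {}
    for role, (w_entity, w_predicate) in role_layers.items():
        e = linear(w_entity, entity_feat)
        p = linear(w_predicate, predicate_feat)
        logit = sum(a * b for a, b in zip(e, p))
        scores[role] = 1.0 / (1.0 + math.exp(-logit))
    return scores


# Toy 2-d features and hand-picked weights, for illustration only.
identity = [[1.0, 0.0], [0.0, 1.0]]
negate = [[-1.0, 0.0], [0.0, -1.0]]
role_layers = {
    "agent": (identity, identity),
    "patient": (identity, negate),
}
girl = [1.0, 2.0]
eating = [2.0, 1.0]
scores = role_scores(girl, eating, role_layers)
print(scores)
```

With these toy weights the pair scores high for "agent" and low for "patient"; in a trained model the per-role layers would be learned.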

SLIDE 14

Role-Dependent Message Passing

- Bi-directional message passing
  - Entities ↔ Roles ↔ Predicates

[Figure: entity and predicate nodes exchange messages along the role edges "agent", "patient", "instrument", with a separate FC layer for each role and direction; block label: Message Passing]
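A single bi-directional step of role-dependent message passing might look like the sketch below. It is a simplification under stated assumptions: the role- and direction-specific FC layers from the figure are replaced by scalar gains (`role_gain`), the update is a plain sum, and all numbers are toy values.

```python
def scale(vec, factor):
    """Multiply every component of a vector by a scalar."""
    return [factor * v for v in vec]


def add(a, b):
    """Element-wise sum of two vectors."""
    return [x + y for x, y in zip(a, b)]


def pass_messages(entity_states, predicate_states, attention, role_gain):
    """One bi-directional step between entity and predicate nodes.

    attention[role][i][j] weights the edge between entity i and predicate j.
    role_gain[role] = (gain entity->predicate, gain predicate->entity) is a
    scalar stand-in for the role- and direction-specific FC layers.
    """
    new_entities = [list(h) for h in entity_states]
    new_predicates = [list(h) for h in predicate_states]
    for role, weights in attention.items():
        to_pred, to_ent = role_gain[role]
        for i, h_e in enumerate(entity_states):
            for j, h_p in enumerate(predicate_states):
                a = weights[i][j]
                new_predicates[j] = add(new_predicates[j], scale(h_e, a * to_pred))
                new_entities[i] = add(new_entities[i], scale(h_p, a * to_ent))
    return new_entities, new_predicates


entity_states = [[1.0, 0.0], [0.0, 1.0]]      # e.g. Girl, Cake
predicate_states = [[1.0, 1.0]]               # e.g. eating
attention = {"agent": [[1.0], [0.0]], "patient": [[0.0], [1.0]]}
role_gain = {"agent": (1.0, 0.5), "patient": (2.0, 0.5)}
new_entities, new_predicates = pass_messages(
    entity_states, predicate_states, attention, role_gain)
print(new_entities, new_predicates)
```

Messages are computed from the states before the step, so the entity-to-predicate and predicate-to-entity directions do not interfere within one iteration.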

SLIDE 15

Role-Dependent Message Passing

- Bi-directional message passing
  - Entities ↔ Roles ↔ Predicates

[Figure: entity and predicate nodes exchange messages along the role edges "agent", "patient", "instrument", with a separate FC layer for each role and direction; block label: Message Passing]

SLIDE 16

Visual Semantic Parsing Network

- Bi-directional message passing
- Repeat for u iterations
- Classify nodes and edges

[Figure: after message passing, FC layers classify the predicate nodes ("eating", "holding", "belong") and entity nodes ("Girl", "Cake", "Hand", "Fork"), and the role edges ("agent", "patient", "instrument") are binarized]

SLIDE 17

Visual Semantic Parsing Network

- Weakly supervised training
  - Unknown alignment between output and ground truth graphs

[Figure: the output graph (predicates "eating", "holding", "belong"; entities "Girl", "Cake", "Hand", "Fork"; roles "agent", "patient", "instrument") is aligned with the ground-truth graph (entities Girl, Cake, Hand, Fork; predicates eating, belong, holding) and compared through three loss terms]
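Because the correspondence between output nodes and ground-truth nodes is unknown, some matching step is needed before a loss can be computed. The sketch below uses a brute-force minimum-cost assignment purely as a stand-in: the slides do not state which alignment algorithm is used, and the cost values and function name are invented for illustration.

```python
from itertools import permutations


def best_alignment(cost):
    """Find the one-to-one assignment of ground-truth nodes to output nodes
    with the lowest total cost, where cost[g][o] is the cost of matching
    ground-truth node g to output node o. Brute force: small graphs only,
    and it needs at least as many output nodes as ground-truth nodes."""
    num_gt, num_out = len(cost), len(cost[0])
    best_total, best_match = float("inf"), None
    for outputs in permutations(range(num_out), num_gt):
        total = sum(cost[g][o] for g, o in enumerate(outputs))
        if total < best_total:
            best_total, best_match = total, outputs
    return best_match, best_total


# Toy costs, e.g. 1 - predicted probability of the ground-truth class.
cost = [
    [0.9, 0.1, 0.8],   # ground-truth "Girl" vs. output nodes 0..2
    [0.2, 0.7, 0.9],   # ground-truth "Cake" vs. output nodes 0..2
]
match, total = best_alignment(cost)
print(match, total)
```

Here "Girl" is matched to output node 1 and "Cake" to output node 0; a classification loss would then be applied to the matched pairs.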

SLIDE 18

Visual Semantic Parsing Network

SLIDE 19

Incorporate External KB (Zareian et al., ECCV20)

- Link concepts in scene graphs to external knowledge bases such as ConceptNet
- Pass messages over bridges between scene graphs and external graphs
- Refine bridges between graphs

SLIDE 20

Scene Graph Examples of GB-Net

[Figure: two example pairs comparing Ours (GB-Net) with Baseline (KERN)]

SLIDE 21

Challenge 2: Text-Visual Grounding (Akbari et al., CVPR19)

- Localize text query in image
  - Bridge visual and text knowledge graphs
  - Without using predefined classifiers
- Challenges
  - Sensitive to domain variations
  - Abstract concept not groundable

SLIDE 22

Challenge 3: Multimodal Event & Argument Extraction

- Challenges:
  - Parsing images/videos to structures
  - Grounding entities across modalities
  - Joint extraction of multimodal arguments

[Figure: pipeline diagram. Labels: Text IE, Visual IE, Text graph, Scene graph, "?", Multi-Modal Knowledge Graph, Application]

SLIDE 23

Multimodal KG Example

[Figure: multimodal event graph. Node labels: Attack, Protesters, Bus, Stone, Transport, Wounded protester, Supporters, Rally. Role labels: Agent, Target, Instrument, Person, Destination]

SLIDE 24

A New Task: Multimedia Event Extraction (M2E2)

Input: News article text and image

Last week, U.S. Secretary of State Rex Tillerson visited Ankara, the first senior administration official to visit Turkey, to try to seal a deal about the battle for Raqqa and to overcome President Recep Tayyip Erdogan's strong objections to Washington's backing of the Kurdish Democratic Union Party (PYD) militias. Turkish forces have attacked SDF forces in the past around Manbij, west of Raqqa, forcing the United States to deploy dozens of soldiers on the outskirts of the town in a mission to prevent a repeat of clashes, which risk derailing an assault on Raqqa.

[Image regions labeled: land vehicle, land vehicle]

Output: Image-related Events & Visual Argument Roles

Event       Movement.TransportPerson   deploy
Arguments   Transporter                United States
            Destination                outskirts
            Passenger                  soldiers
            Vehicle                    land vehicle
            Vehicle                    land vehicle

SLIDE 25

A New Task: Multimedia Event Extraction (M2E2)

Input: News article text and image

In March, Turkish forces escalated attacks on the YPG in northern Syria, forcing U.S. to deploy a small number of forces in and around the town of Manbij to the northwest of Raqqa to "deter" Turkish-SDF clashes and ensure the focus remains on Islamic State. Meanwhile, Raqqa is being pummeled by airstrikes mounted by U.S.-led coalition forces and Syrian warplanes. Local anti-IS activists say the air raids fail to distinguish between military and non-military targets …

[Image regions labeled: airplane, vehicle]

Output: Image-related Events & Visual Argument Roles

Event       Conflict.Attack   airstrikes
Arguments   Attacker          U.S.-led coalition forces
            Target            airplane
            Target            vehicle

SLIDE 26

Cross-media Structured Common Space

- Treat image as another language
- Represent it with a structure that is similar to AMR in text
- Can we find a common representation?

[Figure: Linguistic Structure (Abstract Meaning Representation (AMR) / Dependency Tree) alongside Visual Semantic Graph [Zareian et al. CVPR20]; edge labels include "place" and "means"]

SLIDE 27

Image to Event Graph

- imSitu dataset: situation recognition (Yatskar et al., 2016)
- Classify an image as one of 500+ FrameNet verbs (sharing part of ACE)
- Identify 192 generic semantic roles

SLIDE 28

Weakly Aligned Structured Embedding (WASE)

Cross-media shared representation and classifiers

(Li, Zareian, et al., ACL20)

SLIDE 29

- Prior work aligns image-caption vectors by triplet loss.
- We want to align two graphs, not just single vectors.

Use image-caption data for graph alignment

[Figure labels: "Cross-Attention", "X – Loss"]
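For reference, the triplet loss used to align single image and caption vectors can be written as a hinge on similarity scores. The sketch below is generic rather than taken from the slides: the dot-product similarity, the margin of 0.2, and the toy vectors are all assumptions.

```python
def dot(a, b):
    """Dot-product similarity between two vectors."""
    return sum(x * y for x, y in zip(a, b))


def triplet_loss(image, caption, negative_caption, margin=0.2):
    """Hinge loss that is zero once the paired caption scores at least
    `margin` higher than the mismatched caption."""
    positive = dot(image, caption)
    negative = dot(image, negative_caption)
    return max(0.0, margin - positive + negative)


image = [1.0, 0.0]
caption = [0.5, 0.5]     # caption paired with the image
other = [0.25, 1.0]      # caption of a different image
print(triplet_loss(image, caption, other))
print(triplet_loss(image, other, caption))
```

This loss compares one vector per image with one vector per caption; aligning two graphs instead requires matching nodes across modalities, which is the gap the slide points to.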

SLIDE 30

- Prior work aligns image-caption vectors by triplet loss.
- We want to align two graphs, not just single vectors.

Use image-caption data for graph alignment

[Figure labels: "Cross-Attention", "X – Loss"]

SLIDE 31

A New Multimodal Dataset for M2E2 Evaluation

- Ontology: shared between ACE and imSitu
- Event Types: cover 52% of ACE event types
- Argument Roles: based on ACE argument roles, with additional detectable visual roles (marked in red)

Event Type                  Argument Roles
Life.Die                    Agent, Victim, Instrument, Place, Time
Transaction.TransferMoney   Giver, Recipient, Beneficiary, Money, Instrument, Place, Time
Conflict.Attack             Attacker, Instrument, Place, Target, Time
Conflict.Demonstrate        Demonstrator, Instrument, Police, Place, Time
Contact.Phone-Write         Participant, Instrument, Place, Time
Contact.Meet                Participant, Place, Time
Justice.ArrestJail          Agent, Person, Instrument, Place, Time
Movement.Transport          Agent, Artifact/Person, Instrument, Destination, Origin, Time

(Li, Zareian, et al., ACL20)

SLIDE 32

Experiment Results

[Figure: results table with annotations "Training with MM" and "Multimodal Task"]

SLIDE 33

Compare to Single Modality Extraction

- Image helps textual event extraction, and surrounding sentence helps visual event extraction

[Figure: examples annotated "Missed by text-only model" and "Misclassified by image-only model as 'Demonstration'"]

SLIDE 34

Application 1: Visual Commonsense Reasoning (VCR)

- Understand semantics in images and language, explore commonsense
- Provide to-the-point answer

(Zellers et al., CVPR 2019)

SLIDE 35

Combine Visual Scene Graphs with VCR

- Expand input to include objects and predicate relations in graph
- Attention transformers limited to sparse connections in scene graphs

[Figure: input sequence "[CLS] Why … ? [SEP] …" followed by scene graph nodes (entity and predicate nodes such as "person1", with subject, object, and coreference links), fed to Graph-based Global-Local Attention Transformers (GLAT, ECCV'20); task labels: masking, image-text matching, object/relation recognition, QA]
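Limiting attention to the sparse connections of a scene graph can be sketched as a softmax that only runs over each node's graph neighbors. This is a minimal illustration, not the GLAT implementation: the function name, the uniform toy scores, and the choice to include self-loops so that every row has at least one allowed position are assumptions.

```python
import math


def masked_attention(scores, adjacency):
    """Row-wise softmax over attention scores, keeping only positions that
    are connected in the graph (adjacency[i][j] is True)."""
    weights = []
    for row_scores, row_mask in zip(scores, adjacency):
        exps = [math.exp(s) if keep else 0.0
                for s, keep in zip(row_scores, row_mask)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Toy chain graph: person (0) - riding (1) - horse (2), with self-loops.
adjacency = [
    [True, True, False],
    [True, True, True],
    [False, True, True],
]
scores = [[0.0, 0.0, 0.0] for _ in range(3)]
weights = masked_attention(scores, adjacency)
print(weights)
```

The "person" node puts no weight on "horse" because the two are not directly connected; information between them has to flow through the "riding" predicate node.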

SLIDE 36

Graph-based Global-Local Attention Transformers (Zareian et al., ECCV20)

[Figure: GLAT architecture. A graph of five numbered nodes (entity and predicate nodes, connected by subject and object edges) passes through layers 1 to L, each combining local heads and global heads by concat + linear, followed by a decoder with a Node Classifier and an Edge Classifier and a Node & Edge Loss against the ground truth. Node label sequences shown: "person riding behind mountain ?", "person riding behind mountain bike", and the ground truth "person riding behind mountain horse"]

SLIDE 37

Scene Graph + Query-Adaptive Concept Selection

For each question, select most relevant nodes on the scene graph

Model           Type             (Entity #, Predicate #)   Q -> A
LXMERT          Initial Graph    (36, 18)                  65.09 (baseline)
                Relevance Sel.   (8, x)                    74.04 (+8.95)
GLAT (LXMERT)   Initial Graph    (36, 18)                  65.24 (baseline)
                Relevance Sel.   (26, x)                   69.57 (+4.33)
                Relevance Sel.   (18, x)                   72.33 (+7.09)
                Relevance Sel.   (8, x)                    74.45 (+9.21)
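Selecting the most relevant nodes for a question amounts to ranking scene-graph nodes by a relevance score and keeping the top k. The sketch below assumes such scores are already available; how relevance is computed is not described on the slide, and the score values and function name here are made up.

```python
def select_relevant(nodes, relevance, k):
    """Keep the k scene-graph nodes with the highest relevance score."""
    ranked = sorted(nodes, key=lambda n: relevance[n], reverse=True)
    return ranked[:k]


# Made-up relevance scores against a question, for illustration only.
relevance = {"man": 0.4, "vest": 0.1, "building": 0.9, "window": 0.3, "rock": 0.2}
nodes = list(relevance)
print(select_relevant(nodes, relevance, 3))
```

In the results on this slide, the settings that keep fewer entity nodes (26, 18, 8) score progressively higher on Q -> A than the initial 36-node graph.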

SLIDE 38

Q: Why is sheep near the construction?
A: Sheep is near its natural habitat as well.

Initial Graph (sorted by confidence score from SG):
man, vest, pants, building, rock, sky, window, shirt

Relevance, Question (sorted by relevance score against question):
building, door, man, men, window, rock, ground, animal

Relevance, Question + Answer Candidate (sorted by relevance score against question + answer candidate):
man, building, animal, dirt, rock, gate, ground, plant

SLIDE 39

Application 2: Multimodal KG Extraction from COVID‐19 Medical Papers

Figure 1. FDA approved drugs of most interest for repurposing as potential Ebola virus treatments.

[Figure: Multimedia Knowledge Graph Construction. PDF images go through extraction, segmentation, and recognition; the KG from caption text has labels "FDA", "Drugs", "Ebola", "Treatment", "approve", "repurpose"]

SLIDE 40

Conclusions

- Multimodal Knowledge Graphs
  - Understanding semantic structures in both language and vision
  - Joint representation and models
- Applications
  - Reasoning (VCR)
  - Discovery (COVID-19)
- Challenges
  - Open-vocabulary and Self-Supervised models
  - Knowledge graphs for video
  - Commonsense Extraction from MM KG: physics, behavior, causal/temporal

[Figure: pipeline diagram. Labels: Text IE, Visual IE, Text graph, Scene graph, "?", Multi-Modal Knowledge Graph, Application]

SLIDE 41

References

- Zareian, Alireza, Svebor Karaman, and Shih-Fu Chang. "Weakly Supervised Visual Semantic Parsing." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. CVPR 2020.
- Zareian, Alireza, Svebor Karaman, and Shih-Fu Chang. "Bridging knowledge graphs to generate scene graphs." arXiv preprint arXiv:2001.02314 (2020). ECCV 2020.
- Akbari, Hassan, Svebor Karaman, Surabhi Bhargava, Brian Chen, Carl Vondrick, and Shih-Fu Chang. "Multi-level multimodal common semantic space for image-phrase grounding." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019.
- Li, Manling, Alireza Zareian, Qi Zeng, Spencer Whitehead, Di Lu, Heng Ji, and Shih-Fu Chang. "Cross-media Structured Common Space for Multimedia Event Extraction." arXiv preprint arXiv:2005.02472 (2020). ACL 2020.
- Zareian, Alireza, Haoxuan You, Zhecan Wang, and Shih-Fu Chang. "Learning Visual Commonsense for Robust Scene Graph Generation." arXiv preprint arXiv:2006.09623 (2020). ECCV 2020.