SLIDE 1

Multi-Hop RC, HotpotQA & GNNs

Select, Answer and Explain: Interpretable Multi-hop Reading Comprehension over Multiple Documents – Tu et al., AAAI 2020 Presented By: Lovish Madaan

SLIDE 2

References

  • HotpotQA - Peng Qi (Stanford)
  • GNNs - Jure Leskovec (Stanford); AAAI 2019 Tutorial by William Hamilton (McGill)
  • Some elements and images borrowed from Tu et al. (AAAI 2020), Yang et al. (EMNLP 2018), and Jay Alammar
SLIDE 3

Topics

  • Introduction and HotpotQA
  • Select, Answer and Explain
  • GNNs
  • Answer and Explain
  • Results and Ablation Study
  • Reviews
SLIDE 4

The Promise of Question Answering

Q: In which city was Facebook first launched?
A: Cambridge, Massachusetts. This is because Mark Zuckerberg and his business partners launched it from his Harvard dormitory [1], and Harvard is located in Cambridge, Massachusetts [2].

[1] https://en.wikipedia.org/wiki/Mark_Zuckerberg [2] https://en.wikipedia.org/wiki/Harvard_University

SLIDE 5

The Promise of Question Answering


Reality: Sorry, folks from Google!

SLIDE 6

The Promise of Question Answering


Multi-hop reasoning

SLIDE 7

The Promise of Question Answering


Multi-hop reasoning; text-based, diverse

SLIDE 8

The Promise of Question Answering


Multi-hop reasoning; explainability; text-based, diverse

SLIDE 9

HotpotQA: multi-hop reasoning, explainability, text-based & diverse, comparison questions

SLIDE 10

Multi-hop Reasoning across Multiple Documents

  • Previous work (SQuAD, TriviaQA, etc.): "When was Chris Martin born?"
  • HotpotQA: "When was the lead singer of Coldplay born?"

(Rajpurkar et al., 2016; Joshi et al., 2017; Dunn et al., 2017)

SLIDE 11

Explainability

  • Previous work: predict the answer only
  • HotpotQA: predict the answer plus the supporting facts (Sup fact 1, Sup fact 2)

SLIDE 12

Evaluation Settings

  • Distractor Setting: 2 gold paragraphs + 8 distractor paragraphs retrieved by an IR system
  • Fullwiki Setting: the entire Wikipedia as context
SLIDE 13
Types of Instances

  • Bridge Entity Questions
  • Comparison Questions
SLIDE 14

Topics

  • Introduction and HotpotQA
  • Select, Answer and Explain
  • GNNs
  • Answer and Explain
  • Results and Ablation Study
  • Reviews
SLIDE 15

Multi-hop RC – Previous Works

  • Adapt techniques from single-hop QA
  • Use Graph Neural Networks (GNNs)
  • Cao et al., 2018 – build an entity graph and realize multi-hop reasoning over it

SLIDE 16

Shortcomings – Previous Works

  • Concatenate multiple documents, or process documents separately
  • No document filters
  • Current application of GNNs:
  • Entities as nodes – either pre-specified or extracted with NER
  • Further processing needed if the answer is not an entity
SLIDE 17

Select, Answer and Explain (SAE)

SLIDE 18

Preprocessing & Inputs

  • Question and set of documents
  • Answer text
  • Set of labelled supporting sentences from each document
  • Label for each document: $E_j \in \{0, 1\}$
  • Answer type: "Span" / "Yes" / "No"
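
For concreteness, a minimal sketch of how one might bundle these inputs in Python; the class and field names are illustrative, not from the paper:

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class SAEInstance:
        # One training instance; field names are illustrative
        question: str
        documents: List[str]             # candidate documents
        answer_text: str
        support_labels: List[List[int]]  # 0/1 label per sentence, per document
        doc_labels: List[int]            # E_j: 0/1 relevance label per document
        answer_type: str                 # "Span" / "Yes" / "No"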
SLIDE 19

Select Module

  • Input: [CLS] + Q + [SEP] + D + [SEP]
  • One approach: binary cross-entropy (BCE) with the [CLS] embeddings as features
  • Drawback: neglects inter-document interactions
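
A minimal sketch of this baseline, assuming a HuggingFace BERT encoder (the model name and classifier head are illustrative):

    import torch
    import torch.nn as nn
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    bert = AutoModel.from_pretrained("bert-base-uncased")
    clf = nn.Linear(bert.config.hidden_size, 1)   # per-document relevance logit
    bce = nn.BCEWithLogitsLoss()

    def select_loss(question, documents, labels):
        # Encode [CLS] + Q + [SEP] + D + [SEP] for every question/document pair
        enc = tokenizer([question] * len(documents), documents,
                        padding=True, truncation=True, return_tensors="pt")
        cls = bert(**enc).last_hidden_state[:, 0]  # [CLS] embedding per pair
        logits = clf(cls).squeeze(-1)
        return bce(logits, torch.tensor(labels, dtype=torch.float))

Each pair is scored independently here, which is exactly the inter-document limitation the MHSA step below addresses.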

SLIDE 20

MHSA – Single Attention Head

X – matrix of [CLS] embeddings of question/document pairs

SLIDE 21

MHSA – Multiple Attention Heads

Output is the matrix of modified [CLS] embeddings having contextual information
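
As a concrete sketch, the same computation with PyTorch's built-in multi-head attention (head count and dimensions are illustrative):

    import torch
    import torch.nn as nn

    d = 768                                 # hidden size of the [CLS] embeddings
    mhsa = nn.MultiheadAttention(embed_dim=d, num_heads=8, batch_first=True)

    X = torch.randn(1, 10, d)               # [CLS] embeddings of 10 question/document pairs
    X_ctx, attn = mhsa(X, X, X)              # self-attention mixes information across documents;
                                             # X_ctx: modified [CLS] embeddings with contextual info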

SLIDE 22

Pairwise Bi-Linear Layer

  • $T(E_j)$ – relevance score label for each document (0 / 1 / 2)

$$m_{j,k} = \begin{cases} 1 & \text{if } T(E_j) > T(E_k) \\ 0 & \text{if } T(E_j) \le T(E_k) \end{cases}$$

  • Pairwise binary cross-entropy loss over all document pairs:

$$\mathcal{M} = -\sum_{j}\sum_{k \neq j} \big[ m_{j,k} \log Q(E_j, E_k) + (1 - m_{j,k}) \log\big(1 - Q(E_j, E_k)\big) \big]$$

  • Relevance score for each document: $S_j = \sum_{k} \mathbb{I}\big(Q(E_j, E_k) > 0.5\big)$
  • Take the top-k documents according to this relevance score (see the sketch below)
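
A sketch of the pairwise comparison, assuming a bilinear scoring layer for $Q(E_j, E_k)$ over document embeddings (dimensions illustrative):

    import torch
    import torch.nn as nn

    class PairwiseBilinear(nn.Module):
        # Q(E_j, E_k): probability that document j is more relevant than document k
        def __init__(self, d):
            super().__init__()
            self.bilinear = nn.Bilinear(d, d, 1)

        def forward(self, E):                       # E: (N, d) document embeddings
            N, d = E.shape
            Ej = E.unsqueeze(1).expand(N, N, d).reshape(N * N, d)
            Ek = E.unsqueeze(0).expand(N, N, d).reshape(N * N, d)
            return torch.sigmoid(self.bilinear(Ej, Ek)).view(N, N)

    def ranking_loss_and_scores(Q, T):
        # T: (N,) relevance labels in {0, 1, 2}
        m = (T.unsqueeze(1) > T.unsqueeze(0)).float()    # m_{j,k} = 1 iff T(E_j) > T(E_k)
        off_diag = ~torch.eye(len(T), dtype=torch.bool)  # exclude k == j
        M = -(m * torch.log(Q) + (1 - m) * torch.log(1 - Q))[off_diag].sum()
        S = (Q > 0.5).sum(dim=1)                         # S_j, used to pick the top-k documents
        return M, S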
SLIDE 23

Answer Prediction

  • Gold documents extracted by the Select module
  • Input: [CLS] + Q + [SEP] + Context + [SEP]
  • BERT output: $H \in \mathbb{R}^{M \times d}$ ($M$ tokens, hidden size $d$)
  • 2-layer MLP ($g_{span}$) maps $H$ to span logits $Z \in \mathbb{R}^{M \times 2}$ (start / end)
SLIDE 24

Contextual Sentence Embeddings

  • Sentence representation: attention-weighted sum of the token embeddings in each sentence
  • Self-attention weights: produced by a 2-layer MLP ($g_{att}$)
  • The weighted representation is used to predict the 0/1 supporting-sentence label
SLIDE 25

Contextual Sentence Embeddings - 2

  • Motivation for adding start and end span probabilities:
  • Answer span → supporting sentence: the sentence containing the predicted answer span is very likely a supporting fact
  • Final sentence embeddings:
SLIDE 26

Sentence Graph

  • Construct a graph with the following properties:
  • Nodes represent the sentences
  • Each node has a 0/1 label (supporting sentence or not)
  • 3 types of edges (see the sketch after this list):
  • Type 1: between nodes in the same document
  • Type 2: between nodes of different documents if they contain named entities / noun phrases (possibly different ones) that appear in the question
  • Type 3: between nodes of different documents if they contain the same named entity / noun phrase
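
A sketch of the edge construction, assuming named entities / noun phrases have already been extracted per sentence (the function is ours, simplified):

    from itertools import combinations

    def build_sentence_graph(doc_of, ents, q_ents):
        # doc_of[i]: document id of sentence i
        # ents[i]:   set of entity / noun-phrase strings in sentence i
        # q_ents:    set of entities / noun phrases appearing in the question
        edges = []
        for i, j in combinations(range(len(doc_of)), 2):
            if doc_of[i] == doc_of[j]:
                edges.append((i, j, 1))        # Type 1: same document
            else:
                if ents[i] & q_ents and ents[j] & q_ents:
                    edges.append((i, j, 2))    # Type 2: both mention (possibly different) question entities
                if ents[i] & ents[j]:
                    edges.append((i, j, 3))    # Type 3: shared entity / noun phrase
        return edges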

SLIDE 27

Sentence Graph

SLIDE 28

Topics

  • Introduction and HotpotQA
  • Select, Answer and Explain
  • GNNs
  • Answer and Explain
  • Results and Ablation Study
  • Reviews
SLIDE 29

(Jure Leskovec, Stanford University)

The modern deep learning toolbox is designed for simple sequences and grids (images, text/speech).

SLIDES 30–36 (figure-only slides from the GNN tutorial)
SLIDE 37

The Math

(Tutorial on Graph Representation Learning, AAAI 2019, slide 19)

§ Average neighbor messages and apply a neural network:

$$h_v^0 = x_v$$

$$h_v^k = \sigma\left( W_k \sum_{u \in N(v)} \frac{h_u^{k-1}}{|N(v)|} + B_k h_v^{k-1} \right), \quad \forall k > 0$$

Here the initial "layer 0" embeddings are equal to the node features $x_v$; $h_v^k$ is the $k$-th layer embedding of $v$; $\sigma$ is a non-linearity (e.g., ReLU or tanh); the sum averages the neighbors' previous-layer embeddings; and $B_k h_v^{k-1}$ carries over the previous-layer embedding of $v$ itself.
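
A direct PyTorch sketch of this update rule over a dense adjacency matrix (dimensions and the choice of ReLU are illustrative):

    import torch
    import torch.nn as nn

    class MeanAggLayer(nn.Module):
        # One layer of the update above: average neighbor messages, apply W_k and B_k
        def __init__(self, d_in, d_out):
            super().__init__()
            self.W = nn.Linear(d_in, d_out, bias=False)  # neighbor transform W_k
            self.B = nn.Linear(d_in, d_out, bias=False)  # self transform B_k

        def forward(self, H, A):
            # H: (N, d_in) node embeddings h^{k-1}; A: (N, N) adjacency matrix
            deg = A.sum(dim=1, keepdim=True).clamp(min=1)  # |N(v)|
            neigh = (A @ H) / deg                          # mean of neighbors' embeddings
            return torch.relu(self.W(neigh) + self.B(H))   # sigma = ReLU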

SLIDES 38–39 (figure-only slides)
SLIDE 40

Graph Attention Networks

(Tutorial on Graph Representation Learning, AAAI 2019, slide 60)

§ Augment the basic graph neural network model with attention:

$$h_v^k = \sigma\left( \sum_{u \in N(v) \cup \{v\}} \alpha_{v,u} W_k h_u^{k-1} \right)$$

The sum runs over all neighbors of $v$ (and $v$ itself), weighted by learned attention coefficients $\alpha_{v,u}$, followed by the non-linearity $\sigma$.
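
A sketch of one such attention layer; the tutorial leaves the scoring function unspecified, so this borrows the additive score from Velickovic et al. (2018) as an assumption:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SimpleGATLayer(nn.Module):
        def __init__(self, d_in, d_out):
            super().__init__()
            self.W = nn.Linear(d_in, d_out, bias=False)
            self.a = nn.Linear(2 * d_out, 1, bias=False)  # attention scoring (an assumption)

        def forward(self, H, A):
            # A: (N, N) adjacency with self-loops, i.e. N(v) ∪ {v}
            Wh = self.W(H)                                 # (N, d_out)
            N = Wh.size(0)
            pairs = torch.cat([Wh.unsqueeze(1).expand(N, N, -1),
                               Wh.unsqueeze(0).expand(N, N, -1)], dim=-1)
            e = F.leaky_relu(self.a(pairs)).squeeze(-1)    # raw scores e_{v,u}
            e = e.masked_fill(A == 0, float("-inf"))       # restrict to neighbors
            alpha = torch.softmax(e, dim=1)                # alpha_{v,u} over N(v) ∪ {v}
            return torch.relu(alpha @ Wh)                  # h_v^k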

SLIDE 41

Training the Model

(Tutorial on Graph Representation Learning, AAAI 2019, slide 20; figure only)

SLIDE 42

Training the Model

(Tutorial on Graph Representation Learning, AAAI 2019, slide 21)

§ After K layers of neighborhood aggregation, we get output embeddings for each node.
§ Feed these embeddings into any loss function and run stochastic gradient descent to train the aggregation parameters $W_k$ and $B_k$ (the trainable matrices, i.e., what we learn).

$$h_v^0 = x_v$$

$$h_v^k = \sigma\left( W_k \sum_{u \in N(v)} \frac{h_u^{k-1}}{|N(v)|} + B_k h_v^{k-1} \right), \quad \forall k \in \{1, \dots, K\}$$

$$z_v = h_v^K$$

SLIDE 43

Training the Model

(Tutorial on Graph Representation Learning, AAAI 2019, slide 24)

§ Directly train the model for a supervised task (e.g., binary node classification):

$$\mathcal{L} = -\sum_{v \in V} \Big[ y_v \log\big(\sigma(z_v^\top \theta)\big) + (1 - y_v) \log\big(1 - \sigma(z_v^\top \theta)\big) \Big]$$

where $z_v$ is the output node embedding, $\theta$ the classification weights, and $y_v$ the node class label.
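
Putting the pieces together, a sketch of supervised training with this loss, reusing the MeanAggLayer sketch from above (K = 2 layers; sizes illustrative):

    import torch
    import torch.nn as nn

    layers = nn.ModuleList([MeanAggLayer(16, 32), MeanAggLayer(32, 32)])
    theta = nn.Linear(32, 1)                 # classification weights θ
    params = list(layers.parameters()) + list(theta.parameters())
    opt = torch.optim.SGD(params, lr=0.01)
    bce = nn.BCEWithLogitsLoss()             # the binary cross-entropy L above

    def train_step(X, A, y):
        # X: (N, 16) node features, A: (N, N) adjacency, y: (N,) 0/1 node labels
        Z = X
        for layer in layers:                 # K layers of neighborhood aggregation
            Z = layer(Z, A)                  # z_v = h_v^K after the last layer
        loss = bce(theta(Z).squeeze(-1), y)
        opt.zero_grad(); loss.backward(); opt.step()
        return loss.item()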

SLIDE 44

Overview of Model

(Tutorial on Graph Representation Learning, AAAI 2019, slide 25; figure only)

SLIDE 45

Overview of Model

(Tutorial on Graph Representation Learning, AAAI 2019, slide 26; figure only)

SLIDE 46

Overview of Model

(Tutorial on Graph Representation Learning, AAAI 2019, slide 27; figure only)

SLIDE 47

Topics

  • Introduction and HotpotQA
  • Select, Answer and Explain
  • GNNs
  • Answer and Explain
  • Results and Ablation Study
  • Reviews
SLIDE 48

Aggregation mechanism in SAE

SLIDE 49

Graph Representation

  • Weighted sum of the embeddings of the nodes of the graph
  • The weights are given by
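
The exact weighting is given on the slide; as a generic stand-in, an attention-pooling sketch (the scoring layer is our assumption, not the paper's exact form):

    import torch
    import torch.nn as nn

    class AttentionPool(nn.Module):
        # Graph representation = weighted sum of node embeddings
        def __init__(self, d):
            super().__init__()
            self.score = nn.Linear(d, 1)   # node scoring (an assumption)

        def forward(self, H):              # H: (N, d) node embeddings
            w = torch.softmax(self.score(H).squeeze(-1), dim=0)  # node weights sum to 1
            return w @ H                   # (d,) graph embedding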
SLIDE 50

Answer and Explain Pipeline

SLIDE 51

Topics

  • Introduction and HotpotQA
  • Select, Answer and Explain
  • GNNs
  • Answer and Explain
  • Results and Ablation Study
  • Reviews
SLIDE 52

Dataset Details

  • Train – 90K
  • Validation/Dev – 7.4K
  • Test – 7.4K
SLIDE 53

Results

SLIDE 54

Ablation Study – Document Selection Module

SLIDE 55

Ablation Study – Answer & Explain Module

SLIDE 56

Ablation Study – Bridge / Comp. Questions

SLIDE 57

Attention Heatmap Example

Question - “Were Scott Derrickson and Ed Wood of the same nationality?”

SLIDE 58

HotpotQA Leaderboard

SLIDE 59

Topics

  • Introduction and HotpotQA
  • Select, Answer and Explain
  • GNNs
  • Answer and Explain
  • Results and Ablation Study
  • Reviews
SLIDE 60

Reviews (Pros)

  • Detailed Ablation Study [Atishya, Pratyush, Rajas, Saransh]
  • Usage of contextualized sentence embeddings [Atishya, Jigyasa]
  • MHSA in Document Selection [Pratyush, Shubham, Rajas, Siddhant]
  • “Learning to Rank” framework is general [Keshav]
  • Top 3 position on the leaderboard [Pratyush, Keshav, Rajas, …]
  • Simple Idea [Soumya]
  • Single Model gives good performance [Keshav]
  • Careful modelling of the loss function [Vipul]
  • “Explainability” of the model [Various people]
SLIDE 61

Reviews (Cons)

  • Motivation for Type 2 edges not present [Pratyush, Rajas]
  • No clear flow [Atishya]
  • Entire context fed to BERT [Pratyush, Jigyasa]
  • Pairwise ranking costly [Siddhant, Saransh, Jigyasa]
  • Do not evaluate on Fullwiki setting, simple method for edges [Keshav]
  • Post-facto explanation [Rajas, Soumya]
  • Layers for GCN not mentioned [Vipul]
  • GNN not explained clearly, performance gain is low [Pratyush]
SLIDE 62

Reviews (Extensions)

  • Extract relevant spans instead of documents [Pratyush]
  • Modify the above extension as span-prediction [Keshav]
  • Replace pairwise ranking [Shubham, Saransh]
  • End-to-end training (RL/integrate REALM) [Siddhant, Jigyasa]
  • OpenIE for graph generation [Keshav]
  • Enforce constraints in pairwise prediction models [Atishya]
  • Handle exposure bias by gradually replacing gold documents with retrieved documents [Rajas]

  • Link sentences using clustering methods [Soumya]
SLIDE 63

Thank You!

Questions?