SLIDE 1

Query Focused Abstractive Summarization via Incorporating Query Relevance and Transfer Learning with Transformer Models

Md Tahmid Rahman Laskar 1,3, Enamul Hoque 2, Jimmy Huang 2,3

1 Department of Electrical Engineering and Computer Science, 2 School of Information Technology, 3 Information Retrieval & Knowledge Management Research Lab,

York University, Toronto, Canada

SLIDE 2

Introduction

SLIDE 3

Query Focused Abstractive Text Summarization

  • Problem Statement: A set of documents along with a query is given and the goal is to generate abstractive summaries from the document(s) based on the given query.

  • Abstractive summaries can contain novel words that did not appear in the source document.
  • Example:
    Document: Even if reality shows were not enlightening, they generate massive revenues that can be used for funding more sophisticated programs. Take BBC for example, it offers entertaining reality shows such as Total Wipeout as well as brilliant documentaries.
    Query: What is the benefit of reality shows?
    Summary: Reality show generates revenues.

SLIDE 4

Motivation

  • Challenges:
  • Lack of datasets.
  • Available datasets: Debatepedia, DUC.
  • Size of the available datasets is very small.
  • e.g. Debatepedia (Only around 10,000 training instances).
  • Few-shot Learning Problem.
  • Training a neural model end-to-end with small training data is challenging.
  • Solution: We introduce a transfer learning technique utilizing the Transformer architecture [Vaswani et al., 2017]:
  • First, we pre-train a transformer-based model on a large generic abstractive summarization dataset.
  • Then, we fine-tune the pre-trained model on the target query focused abstractive summarization dataset.

SLIDE 5

Contributions

  • Our proposed approach:
  • is the first work on the Query Focused Abstractive Summarization task where transfer learning is utilized with the Transformer architecture.
  • sets a new state-of-the-art result on the Debatepedia dataset.
  • does not require any in-domain data augmentation for Few-shot Learning.
  • The source code of our proposed model is also made publicly available: https://github.com/tahmedge/QR-BERTSUM-TL-for-QFAS

SLIDE 6

Literature Review

SLIDE 7

Related Work

  • Generic Abstractive Summarization
  • Pointer Generator Network (PGN) [See et al., 2017].
  • Sequence-to-sequence model based on Recurrent Neural Networks (RNN).
  • Avoids repeating the same word in the generated summaries via the copy and coverage mechanisms (see the equations after this list).
  • BERT for SUMmarization (BERTSUM) [Liu and Lapata, 2019].
  • Utilizes BERT [Devlin et al., 2018] as the encoder and the Transformer as the decoder.
  • Outperforms PGN for abstractive text summarization on several datasets.
  • Limitations:
  • Cannot incorporate query relevance.
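In brief, the copy and coverage mechanisms referenced above, following See et al. [2017] (notation from that paper): the final word distribution mixes generating from the vocabulary with copying from the source via the attention weights, and a coverage penalty discourages attending repeatedly to the same source positions.

$$
P(w) = p_{\mathrm{gen}}\, P_{\mathrm{vocab}}(w) + (1 - p_{\mathrm{gen}}) \sum_{i:\, w_i = w} a_i^t,
\qquad
c^t = \sum_{t'=0}^{t-1} a^{t'},
\qquad
\mathrm{covloss}_t = \sum_i \min\left(a_i^t, c_i^t\right)
$$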

SLIDE 8

Related Work (cont’d)

  • Query Focused Abstractive Summarization (QFAS)
  • Diversity Driven Attention (DDA) Model [Nema et al., 2017].
  • A neural encoder-decoder model based on RNN.
  • Introduced a new dataset for the QFAS task from Debatepedia.
  • Limitations:
  • Only performs well when the Debatepedia dataset is augmented.

SLIDE 9

Related Work (cont’d)

  • Query Focused Abstractive Summarization (QFAS)
  • Relevance Sensitive Attention for Query Focused Summarization (RSA-QFS) [Baumel et al., 2018].
  • First, pre-trained the PGN model on a generic abstractive summarization dataset.
  • Then, incorporated query relevance into the pre-trained model to predict query focused summaries in the target datasets.
  • Limitations:
  • Did not fine-tune their model on the QFAS datasets.
  • Obtained a very low Precision score in the Debatepedia dataset.

SLIDE 10

Methodology

SLIDE 11

Proposed Approach

  • Our proposed model works in two steps via transfer learning:
  • Step 1: Pre-train the BERTSUM model on a generic abstractive summarization corpus (e.g., XSUM).
  • Step 2: Incorporate query relevance in the pre-trained model and fine-tune it for the QFAS task in the target domain (i.e., Debatepedia).
  • We choose the XSUM dataset for pre-training since the generated summaries in this dataset are more abstractive compared to the other datasets [Liu et al., 2019].
  • To incorporate the query relevance in BERTSUM, we concatenate the query with the document as the input to the encoder [Lewis et al., 2019].

SLIDE 12

Proposed Approach (cont’d)

(a) Pre-train the BERTSUM model on a generic abstractive summarization corpus.

  • Architecture: BERT Encoder + Transformer Decoder.
  • Input: Document{Sent1, Sent2, ... SentN}, encoded as [CLS] Sent1 [SEP] [CLS] Sent2 [SEP] … [CLS] SentN [SEP].
  • Example (generic summarization):
    Document: The argument that evil can be prevented by assassination is highly questionable. The figurehead of an evil government is not necessarily the lynchpin that holds it together. Therefore, if Hitler had been assassinated, it is pure supposition that the Nazis would have acted any differently to how they did act.
    Summary: The idea that assassinations can prevent injustice is questionable.

(b) Fine-tune the pre-trained model for the QFAS task on the target domain (transfer learning).

  • Architecture: the pre-trained BERT Encoder + Transformer Decoder from (a).
  • Input: Query{SentQ}, Document{Sent1 ... SentN}, encoded as [CLS] SentQ [SEP] [CLS] Sent1 [SEP] … [CLS] SentN [SEP].
  • Example (query focused summarization):
    Query: What is the benefit of reality shows?
    Document: Even if reality shows were not enlightening, they generate massive revenues that can be used for funding more sophisticated programs. Take BBC for example, it offers entertaining reality shows such as Total Wipeout as well as brilliant documentaries.
    Summary: Reality show generates revenues.
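As a small illustration of the two encoder input formats above, the following Python sketch (our own minimal example, not the authors' implementation; the function name and sentence handling are simplified) builds the document-only input used in pre-training and the query-prefixed input used in fine-tuning:

```python
# Minimal sketch of the BERTSUM-style encoder inputs shown in panels (a) and (b).
# Panel (a): [CLS] Sent1 [SEP] ... [CLS] SentN [SEP]         (document only)
# Panel (b): [CLS] SentQ [SEP] [CLS] Sent1 [SEP] ...         (query prepended)

from typing import List, Optional

def bertsum_input(doc_sentences: List[str], query: Optional[str] = None) -> str:
    """Wrap every sentence as '[CLS] sentence [SEP]'; if a query is given,
    prepend it as an extra leading sentence (this is how query relevance
    is incorporated into the encoder input)."""
    segments = ([query] if query is not None else []) + doc_sentences
    return " ".join(f"[CLS] {s} [SEP]" for s in segments)

doc = [
    "Even if reality shows were not enlightening, they generate massive revenues "
    "that can be used for funding more sophisticated programs.",
    "Take BBC for example, it offers entertaining reality shows such as Total "
    "Wipeout as well as brilliant documentaries.",
]

print(bertsum_input(doc))                                                  # format (a)
print(bertsum_input(doc, query="What is the benefit of reality shows?"))  # format (b)
```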

SLIDE 13

Datasets

SLIDE 14

Debatepedia Dataset

Original Version:

  • The Debatepedia dataset is created from the Debatepedia1 website.
  • Previous work on this dataset for the QFAS task used 10-fold cross validation.

Average Number of Instances in each fold (Original Dataset):
Train    Dev     Test
10859    1357    1357


1. http://www.debatepedia.org/en/index.php/Welcome_to_Debatepedia%21

SLIDE 15

Debatepedia Dataset (cont’d)

Augmented Version:

  • We find in the official source code of the DDA model that the dataset was augmented by creating more instances in the training set.
  • In the augmented dataset:
  • The average number of training instances in each fold was 95,843.
  • However, the test and the validation data were the same as in the original dataset.

Average Number of Instances in each fold (Augmented Dataset):
Train    Dev     Test
95843    1357    1357

SLIDE 16

Data Augmentation Approach: Debatepedia Dataset

  • We describe the data augmentation approach based on the source code2 of DDA.
  • We find that for each training instance, 8 new training instances were created.
  • First, a pre-defined vocabulary was created containing 24,822 words with their synonyms.
  • i. Then, each new training instance was created by randomly replacing:
    • M (1 ≤ M ≤ 3) words in each query.
    • N (10 ≤ N ≤ 17) words in each document.
  • ii. Each word was replaced with a synonym found in the pre-defined vocabulary.
  • iii. When a word was not found in the pre-defined vocabulary, the GloVe vocabulary was used.
  • iv. Steps i, ii, and iii were repeated 8 times to create 8 new training instances (see the sketch below).
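A minimal, self-contained sketch of the augmentation steps above (our own illustration based on the description, not the DDA source code; the synonym vocabulary is stubbed with a plain dictionary and the GloVe fallback of step iii is omitted):

```python
import random
from typing import Dict, List

def augment_instance(query: str, document: str,
                     synonyms: Dict[str, List[str]],
                     n_copies: int = 8) -> List[dict]:
    """Create n_copies new (query, document) pairs by randomly replacing
    M (1-3) query words and N (10-17) document words with synonyms."""
    def replace_words(text: str, low: int, high: int) -> str:
        words = text.split()
        k = min(random.randint(low, high), len(words))
        for idx in random.sample(range(len(words)), k):
            options = synonyms.get(words[idx].lower())
            if options:                 # GloVe fallback (step iii) omitted here
                words[idx] = random.choice(options)
        return " ".join(words)

    return [{"query": replace_words(query, 1, 3),
             "document": replace_words(document, 10, 17)}
            for _ in range(n_copies)]

# Toy usage with a tiny stand-in vocabulary.
vocab = {"benefit": ["advantage", "gain"], "revenues": ["income", "earnings"]}
new_instances = augment_instance("What is the benefit of reality shows?",
                                 "Reality shows generate massive revenues.",
                                 vocab)
print(len(new_instances))   # 8
```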


2. Source Code of DDA: https://git.io/JeBZX

SLIDE 17

Experimental Details

SLIDE 18
Experimental Setup

  • Dataset:
  • We used the original version of the Debatepedia dataset to evaluate our proposed model.
  • Evaluation Metrics:
  • ROUGE-1, ROUGE-2, and ROUGE-L scores with Recall, Precision, and F1 (a small computation sketch follows the table below).
  • Baselines:

Baseline Model            Description
QR-BERTSUM                BERTSUM model with only query relevance incorporated.
BERTSUMXSUM               Pre-trained BERTSUM model on the XSUM dataset without any fine-tuning.
RSA-QFS                   The result of the RSA-QFS model reported in [Baumel et al., 2018].
DDA                       The result of the DDA model reported in [Nema et al., 2017].
DDA (Original dataset)    We run the DDA model on the original version of Debatepedia.
DDA (Augmented dataset)   We run the DDA model on the augmented version of Debatepedia.
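For illustration, the metrics above can be computed with the open-source rouge-score package; this is a hedged sketch of the evaluation setup, and the authors may have used a different ROUGE implementation:

```python
from rouge_score import rouge_scorer  # pip install rouge-score

# ROUGE-1, ROUGE-2, and ROUGE-L, each with Recall, Precision, and F1.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

reference = "Reality show generates revenues."           # gold summary
prediction = "Reality shows generate massive revenues."  # model output

for name, score in scorer.score(reference, prediction).items():
    print(f"{name}: R={score.recall:.2f} P={score.precision:.2f} F={score.fmeasure:.2f}")
```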

SLIDE 19

Results and Analyses

SLIDE 20

Results

Models                     ROUGE-1 (R / P / F)      ROUGE-2 (R / P / F)      ROUGE-L (R / P / F)
QR-BERTSUM                 22.31 / 35.68 / 26.42    9.94 / 16.73 / 11.90     21.22 / 33.85 / 25.09
BERTSUMXSUM                17.36 / 11.48 / 13.32    3.03 / 2.47 / 2.75       14.96 / 9.88 / 11.46
RSA-QFS [Baumel et al.]    53.09 / - / -            16.10 / - / -            46.18 / - / -
DDA [Nema et al.]          41.26 / - / -            18.75 / - / -            40.43 / - / -
DDA* (Original Dataset)    7.52 / 7.67 / 7.35       2.83 / 2.88 / 2.84       7.13 / 7.54 / 7.24
DDA* (Augmented Dataset)   37.80 / 47.38 / 40.49    27.55 / 33.74 / 29.37    37.27 / 46.68 / 39.90
Our Model: QR-BERTSUM-TL   57.96 / 60.44 / 58.50    45.20 / 46.11 / 45.47    57.05 / 59.33 / 57.73

Here, ‘Recall’, ‘Precision’, and ‘F1’ are denoted by ‘R’, ‘P’, and ‘F’ respectively. ‘*’ denotes our implementation of the DDA model. For RSA-QFS and DDA, only a single ROUGE score per metric is reported.

  • An improvement of 9.17% and 23.54% in terms of ROUGE-1 and ROUGE-L respectively over RSA-QFS + PGN.
  • A huge gain in terms of ROUGE-2 compared to the previous models, with an improvement of 141.67% over DDA and an improvement of 180.75% over RSA-QFS + PGN.
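As a quick sanity check, assuming the comparison uses the recall scores from the table above, the quoted improvements over RSA-QFS work out as:

$$
\frac{57.96 - 53.09}{53.09} \approx 9.17\%\ (\text{ROUGE-1}),
\qquad
\frac{57.05 - 46.18}{46.18} \approx 23.54\%\ (\text{ROUGE-L}),
\qquad
\frac{45.20 - 16.10}{16.10} \approx 180.75\%\ (\text{ROUGE-2})
$$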

SLIDE 21

Discussions

  • In the original version of the Debatepedia dataset:
  • The Transformer-based QR-BERTSUM outperforms the RNN-based DDA model.
  • This suggests the effectiveness of using the Transformer instead of an RNN.
  • We find that data augmentation significantly improves the performance of DDA.
  • Our proposed model significantly outperformed the baselines:
  • The QR-BERTSUM model (which did not leverage transfer learning).
  • The BERTSUMXSUM model (which did not utilize fine-tuning).
  • Our proposed model sets a new state-of-the-art result without any in-domain data augmentation.

SLIDE 22

Conclusions and Future Work

  • There is a lack of datasets for QFAS, and the available datasets are small in size.
  • To address this problem, we presented a transfer learning technique with the BERTSUM model for QFAS.
  • Our approach achieves a state-of-the-art result on the Debatepedia dataset.
  • In the future, we will investigate the performance of our proposed approach on more datasets (e.g., DUC).

SLIDE 23

Acknowledgements

This research is supported by the Natural Sciences & Engineering Research Council (NSERC) of Canada and an ORF-RE (Ontario Research Fund - Research Excellence) award in the BRAIN Alliance. We also thank Compute Canada for providing us with computing resources.

SLIDE 24

Questions?

SLIDE 25

References

1. Baumel, T., et al.: Query Focused Abstractive Summarization: Incorporating Query Relevance, Multi-Document Coverage, and Summary Length Constraints into seq2seq Models. arXiv preprint arXiv:1801.07704 (2018)
2. Devlin, J., et al.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In: Proc. of NAACL-HLT. pp. 4171-4186 (2019)
3. Lewis, M., et al.: BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. arXiv preprint arXiv:1910.13461 (2019)
4. Liu, Y., Lapata, M.: Text Summarization with Pretrained Encoders. In: Proc. of EMNLP-IJCNLP. pp. 3721-3731 (2019)
5. Nema, P., et al.: Diversity Driven Attention Model for Query-Based Abstractive Summarization. In: Proc. of ACL. pp. 1063-1072 (2017)
6. See, A., et al.: Get To The Point: Summarization with Pointer-Generator Networks. In: Proc. of ACL. pp. 1073-1083 (2017)
7. Vaswani, A., et al.: Attention Is All You Need. In: Proc. of NIPS. pp. 5998-6008 (2017)
