SLIDE 1

Investigating Citation Linkage as a Sentence Similarity Measurement Task using Deep Learning

Computer Science

Sudipta Singha Roy, Robert E. Mercer, Felipe Urra

(ssinghar@uwo.ca)

The University of Western Ontario

SLIDE 2

Overview

  • Introduction: Citation Linkage
  • Problem Formulation and Research Contributions
  • Background Studies
  • Corpus Creation
  • Citation Linkage as a Sentence Similarity Measurement Task
  • Experimental Results
  • Conclusion

SLIDE 3

Introduction: Citation Linkage

SLIDE 4

Problem Formulation and Research Contributions

SLIDE 5

Citation Linkage as Semantic Similarity

  • The citation linkage problem is modelled as a semantic relatedness task.
  • Given a citing sentence, the framework pairs it with each sentence from the reference document and then determines which sentence pairs are semantically similar and which are not.
  • The citation context span may contain one or more sentences.
  • For this work, the span is restricted to a single sentence, and citation linkage is designed as a semantic relatedness measurement task at the sentence level.
  • This semantic relatedness task is formulated as a two-class classification problem that operates on sentence pairs.
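
The pairing step can be sketched as follows. This is only an illustration of the formulation on this slide, not the authors' code; the function name and the example sentences are made up.

```python
from typing import List, Tuple


def build_candidate_pairs(citing_sentence: str,
                          reference_sentences: List[str]) -> List[Tuple[str, str]]:
    """Pair one citing sentence with every sentence of the cited document.

    Each returned pair is one input to the two-class classifier
    (semantically related / not related).
    """
    return [(citing_sentence, ref) for ref in reference_sentences]


# Hypothetical example data, for illustration only.
pairs = build_candidate_pairs(
    "We used the lysis protocol described previously.",
    ["Cells were lysed in buffer A.", "Samples were stored at 4 C."],
)
```

A cited document with k sentences therefore yields k candidate pairs per citing sentence.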

SLIDE 6

Contributions

  • Building a framework to determine the appropriate cited sentences from a cited paper given a citation sentence.
  • Building a corpus for the citation linkage task containing more than sixty thousand sentence pairs from the biomedical domain.
  • Developing a method for cleaning and preprocessing sentences from different biomedical domains.

SLIDE 7

Background Studies

SLIDE 8

Background Studies

  • Neural Network Based Word Embedding
  • Neural Network Based Sentence Embedding
  • Attention in Natural Language Processing
  • Works for Citation Linkage

SLIDE 9

Background Studies

  • Neural Network Based Word Embedding
      • FastText (Bojanowski et al. 2017)
  • Neural Network Based Sentence Embedding
  • Attention in Natural Language Processing
  • Works for Citation Linkage

SLIDE 10

Background Studies

  • Neural Network Based Word Embedding
  • Neural Network Based Sentence Embedding
      • Sent2Vec (Pagliardini et al. 2017)
  • Attention in Natural Language Processing
  • Works for Citation Linkage

SLIDE 11

Background Studies

  • Neural Network Based Word Embedding
  • Neural Network Based Sentence Embedding
  • Attention in Natural Language Processing
      • Inner Attention (Liu et al. 2016)
      • Hierarchical Attention (Yang et al. 2016)
  • Works for Citation Linkage

SLIDE 12

Background Studies

  • Neural Network Based Word Embedding
  • Neural Network Based Sentence Embedding
  • Attention in Natural Language Processing
  • Works for Citation Linkage
      • Houngbo and Mercer (2017)
      • Li et al. (2018)

SLIDE 13

Works for Citation Linkage

  • In 2017, Houngbo and Mercer developed a framework to detect cited sentences given a citation sentence.
  • They built a small corpus which was annotated by a domain expert.
  • The annotation was done over sentence pairs containing 23 citation sentences and 3857 candidate cited sentences.
  • They used different machine learning models. However, the accuracy they achieved was low.

SLIDE 14

Works for Citation Linkage

  • In 2018, Li et al. applied rule-based and deep learning-based approaches to determine the citation linkage between citation and cited sentence pairs.
  • For this task, they used textual semantic similarity at the sentence level.
  • They used both traditional and deep learning models to compute the textual semantic similarity.
      • Traditional models: inverse document frequency (idf) and Jaccard similarity
      • Deep learning models: Word2Vec, convolutional neural network, and cosine similarity
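
As a reminder of what one of the traditional measures computes, here is a minimal sketch of Jaccard similarity between two sentences. It is not Li et al.'s implementation; lowercasing and whitespace tokenization are assumptions made for the example.

```python
def jaccard_similarity(sentence_a: str, sentence_b: str) -> float:
    """Jaccard similarity of the two sentences' token sets: |A & B| / |A | B|."""
    # Assumption: lowercase + whitespace tokenization (a real system would
    # likely use a proper tokenizer and stop-word handling).
    tokens_a = set(sentence_a.lower().split())
    tokens_b = set(sentence_b.lower().split())
    union = tokens_a | tokens_b
    if not union:
        return 0.0  # both sentences empty: define similarity as 0
    return len(tokens_a & tokens_b) / len(union)


score = jaccard_similarity("protein binding assay", "binding assay results")
```

Here the two sentences share two of four distinct tokens, so the score is 0.5.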

SLIDE 15

Corpus Creation

SLIDE 16

Corpus Creation

  • Houngbo and Mercer built a small corpus for the citation linkage task in the biomedical domain, with 3857 sentence pairs.
  • This dataset is highly imbalanced: only 81 samples are positive.
  • Problems with this dataset:
      • It is very small for training deep learning models.
      • The ratio of positive to negative samples would bias the model.

SLIDE 17

Corpus Creation

  • We have developed a synthetic corpus of 68,898 sentence pairs over three biomedical topics: cell biology, biochemistry, and chemical biology.
  • The synthetic corpus has been annotated, not by humans, but by Sent2Vec, followed by computing the cosine of the angle between the resulting sentence vectors as a measure of the semantic similarity of the two sentences in each pair.
  • 45.89% of the samples are positive and the remaining are negative.
  • We have used the corpus built by Houngbo and Mercer for validation and testing.

SLIDE 18

Corpus Creation

  • Data source for training Sent2Vec:
      • A set of 28,310 full-text articles (4,843,756 sentences) from a wide spectrum of biomedical journals
  • Sent2Vec is trained with various parameters to generate sentence vectors.
  • The best model is chosen against a validation set, which is a portion of the human-annotated dataset from Houngbo and Mercer's work.

SLIDE 19

Corpus Creation

  • Data sources for building the synthetic corpus:
      • 112 papers from the 28,310 articles are randomly selected as the reference articles.
      • Citation sentences from 2289 articles that cite at least one of these reference articles are manually collected from the web.

SLIDE 20

Annotated Sentence Pair Creation

  • Source: PubMed (28,310 research papers)
  • 112 reference papers; 2,289 citing papers (manually collected)
  • 475,807 sentence pairs
  • Pretrained Sent2Vec: cosine similarity > 0.57?
      • Yes: 31,624 +ve sentence pairs
      • No: 37,274 -ve sentence pairs (selected using the -ve algorithm)
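
The thresholding decision in this flowchart can be sketched in a few lines. This is an illustrative sketch only: the vectors are assumed to come from a trained Sent2Vec model (not loaded here), and the function names are hypothetical. Only the 0.57 threshold is taken from the slide.

```python
import math
from typing import Sequence

SIMILARITY_THRESHOLD = 0.57  # value shown on the slide


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: treat a zero vector as dissimilar
    return dot / (norm_u * norm_v)


def label_pair(citing_vec: Sequence[float], cited_vec: Sequence[float],
               threshold: float = SIMILARITY_THRESHOLD) -> int:
    """Return 1 (+ve) if the similarity exceeds the threshold, else 0."""
    return 1 if cosine_similarity(citing_vec, cited_vec) > threshold else 0
```

Pairs labelled 0 here are only candidates for the negative set; the slide indicates that the final -ve pairs are chosen by a separate selection algorithm.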

SLIDE 21

Annotated Sentence Pair Creation

  • -ve Sample Selection Algorithm:

For each citing sentence Ci:
    if the number of +ve samples is n and n > 0:
        choose n -ve samples randomly where the citing sentence is Ci
    else if n == 0:
        choose 5 -ve samples randomly where the citing sentence is Ci
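
A runnable sketch of the selection rule above, under a few assumptions that the slide does not state: pairs are grouped by citing sentence in dictionaries, sampling is without replacement, and when fewer -ve candidates exist than requested, all of them are taken. The names are hypothetical.

```python
import random
from typing import Dict, List, Optional, Tuple

Pair = Tuple[str, str]  # (citing sentence, candidate cited sentence)


def select_negative_samples(positives: Dict[str, List[Pair]],
                            negatives: Dict[str, List[Pair]],
                            default_count: int = 5,
                            rng: Optional[random.Random] = None) -> List[Pair]:
    """Pick -ve pairs per citing sentence: n if it has n > 0 +ve pairs, else 5."""
    rng = rng or random.Random()
    selected: List[Pair] = []
    for citing, candidates in negatives.items():
        n = len(positives.get(citing, []))
        wanted = n if n > 0 else default_count
        # Assumption: cap at the number of available candidates.
        k = min(wanted, len(candidates))
        selected.extend(rng.sample(candidates, k))
    return selected


# Hypothetical toy data: "c1" has two +ve pairs, "c2" has none.
pos = {"c1": [("c1", "r1"), ("c1", "r2")]}
neg = {"c1": [("c1", f"n{i}") for i in range(10)],
       "c2": [("c2", f"m{i}") for i in range(10)]}
chosen = select_negative_samples(pos, neg, rng=random.Random(0))
```

Matching the number of -ve pairs to the number of +ve pairs per citing sentence keeps the two classes roughly balanced.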

SLIDE 22

Annotated Sentence Pair Creation

  • Source: PubMed (28,310 research papers)
  • 112 reference papers; 2,289 citing papers (manually collected)
  • 475,807 sentence pairs
  • Pretrained Sent2Vec: cosine similarity > 0.57?
      • Yes: 31,624 +ve sentence pairs
      • No: 37,274 -ve sentence pairs (as described in the text)
  • Final corpus: 68,898 sentence pairs

SLIDE 23

Validation and Test Set

  • Validation set:
      • 780 negative samples are randomly chosen from the negative portion of the human-annotated corpus created by Houngbo and Mercer.
      • 20 positive samples are randomly chosen in a similar way.
  • Test set:
      • The test set is the remaining 3057 samples, which contain 61 positive samples.
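
The split sizes on this slide can be reproduced with a short sketch. The corpus below is a stand-in with the reported class counts (81 positive out of 3857), not the real data; the function name and the fixed seed are assumptions.

```python
import random
from typing import List, Tuple

Sample = Tuple[int, int]  # (pair id, label) with label 1 = +ve, 0 = -ve


def split_validation_test(samples: List[Sample], n_val_pos: int = 20,
                          n_val_neg: int = 780,
                          seed: int = 0) -> Tuple[List[Sample], List[Sample]]:
    """Draw a fixed number of +ve and -ve samples for validation; the rest is test."""
    rng = random.Random(seed)
    pos_idx = [i for i, (_, label) in enumerate(samples) if label == 1]
    neg_idx = [i for i, (_, label) in enumerate(samples) if label == 0]
    val_idx = set(rng.sample(pos_idx, n_val_pos)) | set(rng.sample(neg_idx, n_val_neg))
    validation = [s for i, s in enumerate(samples) if i in val_idx]
    test = [s for i, s in enumerate(samples) if i not in val_idx]
    return validation, test


# Stand-in corpus: 81 positive and 3776 negative pairs (3857 in total).
corpus = [(i, 1) for i in range(81)] + [(i, 0) for i in range(81, 3857)]
validation_set, test_set = split_validation_test(corpus)
```

With these counts the validation set has 800 samples and the test set has 3857 - 800 = 3057 samples, of which 81 - 20 = 61 are positive.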

SLIDE 24

Corpus Creation: Data Cleaning

SLIDE 25

Data Cleaning

  • Deletion of Unnecessary Symbols
  • Capturing Different Patterns for Equations, Numbers, Chemical Names and Citations

  • Symbols and Their Replacements

SLIDE 26

Deletion of Unnecessary Symbols

SLIDE 27

Capturing Different Patterns for Equations, Numbers, Chemical Names and Citations

SLIDE 28

Symbols and Their Replacements

SLIDE 29

Citation Linkage as a Sentence Similarity Measurement Task

SLIDE 30

Citation Linkage as a Sentence Similarity Measurement Task

  • We used different Infersent architectures for measuring the semantic relatedness between citing and cited sentence pairs.

  • Bi-LSTM with Max-pooling
  • Bi-LSTM with Inner Attention
  • Bi-LSTM with Hierarchical Attention
  • Hierarchical ConvNet
  • Bootstrapping approaches

SLIDE 31

Infersent Architecture

Infersent (Modified from Conneau et al., 2017)

SLIDE 32

Infersent with Bi-LSTM and Max-pooling

Infersent with Bi-LSTM and Max-pooling (Conneau et al., 2017)

[Figure not reproduced; input sentence label: s_cited / s_citing]
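
In the Bi-LSTM with max-pooling encoder of Conneau et al. (2017), the sentence vector is formed by taking, for each hidden dimension, the maximum over the hidden states of all time steps. The sketch below shows only that pooling step; the hidden states are hand-written stand-ins for Bi-LSTM outputs, and the function name is hypothetical.

```python
from typing import List


def max_pool_over_time(hidden_states: List[List[float]]) -> List[float]:
    """Element-wise maximum over the time axis.

    hidden_states[t][d] is dimension d of the hidden state at time step t;
    the result has one value per hidden dimension.
    """
    if not hidden_states:
        raise ValueError("need at least one hidden state")
    return [max(values) for values in zip(*hidden_states)]


# Three time steps, two hidden dimensions (made-up numbers).
sentence_vector = max_pool_over_time([[0.1, -0.3], [0.7, 0.2], [-0.5, 0.4]])
```

The result has a fixed size regardless of sentence length, which is what allows the citing and cited sentence vectors to be compared.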
SLIDE 33

Infersent with Inner and Hierarchical Attention

Infersent with Inner & Hierarchical Attention (Conneau et al., 2017)

[Figure not reproduced; panels: Inner Attention, Hierarchical Attention; input sentence label: s_cited / s_citing]
SLIDE 34

Infersent with Hierarchical ConvNet

Infersent with Hierarchical ConvNet (Conneau et al., 2017)

[Figure not reproduced; input sentence label: s_cited / s_citing]
SLIDE 35

Bootstrapping Approaches

[Diagram not reproduced; labels in extraction order: Synthetic Corpus Annotated by: Sent2Vec; Partition 3; Partition 2; Partition 1; Model; Annotated 1; Partition 1; Train]

SLIDE 36

Bootstrapping Approaches

[Diagram not reproduced; labels in extraction order: Partition 3; Partition 2; Annotated 1; Model; Annotated 2; Partition 2; Train; Annotated 1]

SLIDE 37

Bootstrapping Approaches

[Diagram not reproduced; labels in extraction order: Partition 3; Annotated 2; Annotated 1; Model; Annotated 3; Partition 3; Train; Annotated 2; Annotated 1]

SLIDE 38

Bootstrapping Approaches

[Diagram not reproduced; labels in extraction order: Annotated 3; Annotated 2; Annotated 1; Model; Human Annotated Test Data; Train; Test; Annotated 3; Annotated 2; Annotated 1; Performance Analysis]

SLIDE 39

Experimental Results

SLIDE 40

Hyperparameter Setting

FastText:

SLIDE 41

Hyperparameter Setting

Sent2Vec:

SLIDE 42

Hyperparameter Setting

  • Infersent:
      • Learning rate: 0.1
      • When the validation set accuracy decreased, the learning rate was divided by 5.
      • Learning rate threshold: 10^-5
      • Batch size: 50
      • Hidden layer for last multi-layer perceptron: 512
      • Optimizer: stochastic gradient descent
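
The learning-rate rule can be sketched as a small simulation. Two points are assumptions, since the slide does not spell them out: a "decrease" is measured against the previous epoch's validation accuracy, and training stops once the learning rate falls below the threshold. The function name and the accuracy values are made up.

```python
from typing import List


def simulate_schedule(val_accuracies: List[float], lr: float = 0.1,
                      shrink: float = 5.0, min_lr: float = 1e-5) -> List[float]:
    """Return the learning rate in effect after each epoch's validation check."""
    history: List[float] = []
    previous = None
    for accuracy in val_accuracies:
        # Assumption: compare with the previous epoch's validation accuracy.
        if previous is not None and accuracy < previous:
            lr /= shrink
        history.append(lr)
        # Assumption: the threshold acts as a stopping criterion.
        if lr < min_lr:
            break
        previous = accuracy
    return history


rates = simulate_schedule([0.80, 0.82, 0.81, 0.83, 0.82])
```

In this made-up run the accuracy drops at epochs 3 and 5, so the rate goes from 0.1 to 0.02 and then to 0.004.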

SLIDE 43

Result Analysis

Legend: M1: Hier. ConvNet; M2: Bi-LSTM with Max-pool; M3: Bi-LSTM with Inner Attn.; M4: Bi-LSTM with Hier. Attn.; Boot: Bootstrapped.

SLIDE 44

Conclusion

SLIDE 45

Contributions

  • Building a framework to determine the appropriate cited sentences from a cited paper given a citation sentence.
  • Building a corpus for the citation linkage task containing more than sixty thousand sentence pairs from the biomedical domain.
  • Developing a method for cleaning and preprocessing sentences from different biomedical domains.

SLIDE 46

Limitations and Future Works

  • Extend the text span from a single sentence to multiple contiguous sentences and sub-sentence spans.
  • Sequential models can’t work with phrases. Tree structures can be applied.
  • The test data set used in this study was created only for method citation sentences. It would be appropriate to have humans annotate a test set with a variety of citation types and see how well the proposed method performs on this expanded test set.

SLIDE 47

Thank You!
