
Investigating Citation Linkage as a Sentence Similarity Measurement Task using Deep Learning



  1. Investigating Citation Linkage as a Sentence Similarity Measurement Task using Deep Learning
     Sudipta Singha Roy, Robert E. Mercer, Felipe Urra (ssinghar@uwo.ca)
     Computer Science, The University of Western Ontario

  2. Overview
     • Introduction: Citation Linkage
     • Problem Formulation and Research Contributions
     • Background Studies
     • Corpus Creation
     • Citation Linkage as a Sentence Similarity Measurement Task
     • Experimental Results
     • Conclusion

  3. Introduction: Citation Linkage

  4. Problem Formulation and Research Contributions

  5. Citation Linkage as Semantic Similarity
     • This citation linkage problem has been modelled as a semantic relatedness task.
     • Given a citing sentence, the framework pairs it with each sentence from the reference document and then tries to determine which sentence pairs are semantically similar and which are not.
     • The citation context span may contain one or more sentences.
     • For this work, this span has been restricted to one sentence, and the citation linkage task has been designed as a semantic relatedness measurement task at the sentence level.
     • This semantic relatedness task is formulated as a two-class classification that operates on sentence pairs.
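The pairing step described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and `toy_classifier` is a word-overlap placeholder standing in for the deep learning classifier, which the slides do not specify at this point.

```python
from typing import Callable, List, Tuple

SentencePair = Tuple[str, str]


def make_candidate_pairs(citing: str, reference_sentences: List[str]) -> List[SentencePair]:
    """Pair one citing sentence with every sentence of the reference document."""
    return [(citing, ref) for ref in reference_sentences]


def link_citation(
    citing: str,
    reference_sentences: List[str],
    classify: Callable[[str, str], int],
) -> List[str]:
    """Return the reference sentences whose pair is classified as similar (label 1)."""
    pairs = make_candidate_pairs(citing, reference_sentences)
    return [ref for cit, ref in pairs if classify(cit, ref) == 1]


def toy_classifier(a: str, b: str) -> int:
    """Placeholder two-class classifier (assumption): 1 if two or more words are shared."""
    shared = set(a.lower().split()) & set(b.lower().split())
    return 1 if len(shared) >= 2 else 0


citing = "protein kinase activity increases under stress"
reference = [
    "stress raises protein kinase activity in yeast",
    "samples were stored at low temperature",
]
linked = link_citation(citing, reference, toy_classifier)
print(linked)  # only the first reference sentence is linked
```

Any binary classifier over sentence pairs can be passed in place of `toy_classifier`; the pairing logic stays the same.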

  6. Contributions
     • Building a framework to determine the appropriate cited sentences from a cited paper given a citation sentence.
     • Building a corpus for the citation linkage task containing more than sixty thousand sentence pairs from the biomedical domain.
     • Developing a method for cleaning and preprocessing sentences from different biomedical domains.

  7. Background Studies

  8. Background Studies
     • Neural Network Based Word Embedding
     • Neural Network Based Sentence Embedding
     • Attention in Natural Language Processing
     • Works for Citation Linkage

  9. Background Studies
     • Neural Network Based Word Embedding
         • fastText (Bojanowski et al. 2017)
     • Neural Network Based Sentence Embedding
     • Attention in Natural Language Processing
     • Works for Citation Linkage

  10. Background Studies
      • Neural Network Based Word Embedding
      • Neural Network Based Sentence Embedding
          • Sent2Vec (Pagliardini et al. 2017)
      • Attention in Natural Language Processing
      • Works for Citation Linkage

  11. Background Studies
      • Neural Network Based Word Embedding
      • Neural Network Based Sentence Embedding
      • Attention in Natural Language Processing
          • Inner Attention (Liu et al. 2016)
          • Hierarchical Attention (Yang et al. 2016)
      • Works for Citation Linkage

  12. Background Studies
      • Neural Network Based Word Embedding
      • Neural Network Based Sentence Embedding
      • Attention in Natural Language Processing
      • Works for Citation Linkage
          • Houngbo and Mercer (2017)
          • Li et al. (2018)

  13. Works for Citation Linkage
      • In 2017, Houngbo and Mercer developed a framework to detect cited sentences given a citation sentence.
      • They built a small corpus which was annotated by a domain expert.
      • The annotation was done over sentence pairs containing 23 citation sentences and 3857 candidate cited sentences.
      • They used different machine learning models. However, the accuracy they achieved was low.

  14. Works for Citation Linkage
      • In 2018, Li et al. applied rule-based and deep learning-based approaches to determine the citation linkage between citation and cited sentence pairs.
      • For this task, they used textual semantic similarity at the sentence level.
      • They used both traditional and deep learning models to compute the textual semantic similarity.
      • Traditional models: inverse document frequency (idf) and Jaccard similarity
      • Deep learning models: Word2Vec, Convolutional Neural Network and cosine similarity
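For readers unfamiliar with the traditional measures listed here, Jaccard similarity between two sentences can be sketched as the overlap of their token sets. The slides do not give Li et al.'s exact formulation, so the whitespace tokenization and lowercasing below are assumptions made for illustration.

```python
def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard similarity of two sentences: |A ∩ B| / |A ∪ B| over lowercased word sets."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0  # convention chosen here for two empty sentences
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


print(jaccard_similarity("a b c", "b c d"))  # 2 shared tokens out of 4 distinct -> 0.5
```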

  15. Corpus Creation

  16. Corpus Creation
      • Houngbo and Mercer built a small corpus for the citation linkage task for the biomedical domain with 3857 sentence pairs.
      • This dataset is highly imbalanced: only 81 samples are positive.
      • Problems with this dataset:
          • It is very small for training deep learning models.
          • This ratio of positive to negative samples would make the model biased.

  17. Corpus Creation
      • We have developed a synthetic corpus of 68,898 sentence pairs over three biomedical topics: cell biology, biochemistry, and chemical biology.
      • The synthetic corpus has been annotated not by humans but by Sent2Vec: the cosine of the angle between the two resulting sentence vectors is used as a measure of the semantic similarity of the sentences in each pair.
      • 45.89% of the samples are positive and the remainder are negative.
      • We have used the corpus built by Houngbo and Mercer for validation and testing.

  18. Corpus Creation
      • Data source for training Sent2Vec:
          • Set of 28,310 full-text articles (4,843,756 sentences) from a wide spectrum of biomedical journals
      • Sent2Vec is trained with various parameters to generate sentence vectors.
      • The best model is chosen against a validation set which is a portion of the human-annotated dataset from Houngbo and Mercer's work.

  19. Corpus Creation
      • Data sources for building the synthetic corpus:
          • 112 papers from the 28,310 articles are randomly selected as the reference articles.
          • Citation sentences from 2289 articles that cite at least one of these reference articles are manually collected from the web.

  20. Annotated Sentence Pair Creation (flowchart)
      • Source: PubMed, 28,310 research papers
      • 112 reference papers and 2,289 citing papers (manually collected)
      • 475,807 sentence pairs, embedded with the pretrained Sent2Vec model
      • Cosine similarity > 0.57?
          • Yes: 31,624 +ve sentence pairs
          • No: 37,274 -ve sentence pairs (using the -ve sample selection algorithm)
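The decision node in this flowchart, cosine similarity compared against 0.57, can be sketched as below. The vectors are plain lists of floats standing in for Sent2Vec sentence embeddings; no Sent2Vec call is shown, and the function names are hypothetical.

```python
import math
from typing import Sequence

THRESHOLD = 0.57  # cosine similarity cut-off shown in the flowchart


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def label_pair(citing_vec: Sequence[float], cited_vec: Sequence[float],
               threshold: float = THRESHOLD) -> int:
    """Return 1 (+ve) when the cosine similarity exceeds the threshold, else 0 (-ve)."""
    return 1 if cosine_similarity(citing_vec, cited_vec) > threshold else 0


print(label_pair([1.0, 1.0], [1.0, 0.0]))  # cosine is about 0.707, so the pair is +ve (1)
print(label_pair([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors, so the pair is -ve (0)
```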

  21. Annotated Sentence Pair Creation: -ve Sample Selection Algorithm
      For each citing sentence C_i:
          if the number of +ve samples is n and n > 0:
              choose n -ve samples randomly where the citing sentence is C_i
          else if n == 0:
              choose 5 -ve samples randomly where the citing sentence is C_i
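A minimal Python sketch of this selection rule follows. The dictionary layout, the function name, and the fixed seed are assumptions made for illustration; the slide also does not say what happens when fewer negatives are available than requested, so the sketch simply caps the sample at what exists.

```python
import random
from typing import Dict, List, Tuple


def select_negatives(
    positives_by_citing: Dict[str, List[str]],
    negatives_by_citing: Dict[str, List[str]],
    default_count: int = 5,
    seed: int = 0,
) -> List[Tuple[str, str]]:
    """For each citing sentence, sample as many -ve pairs as it has +ve pairs, or 5 if none."""
    rng = random.Random(seed)
    selected: List[Tuple[str, str]] = []
    for citing, negatives in negatives_by_citing.items():
        n = len(positives_by_citing.get(citing, []))
        wanted = n if n > 0 else default_count
        k = min(wanted, len(negatives))  # assumption: cap at the available negatives
        selected.extend((citing, neg) for neg in rng.sample(negatives, k))
    return selected


positives = {"c1": ["p1", "p2"], "c2": []}
negatives = {
    "c1": [f"n{i}" for i in range(10)],
    "c2": [f"m{i}" for i in range(10)],
}
pairs = select_negatives(positives, negatives)
print(len(pairs))  # 2 negatives for c1 plus 5 for c2 -> 7
```

Matching the number of negatives to the number of positives per citing sentence keeps the classes roughly balanced, which is consistent with the near-even split reported for the synthetic corpus.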

  22. Annotated Sentence Pair Creation (flowchart, with final corpus)
      • Source: PubMed, 28,310 research papers
      • 112 reference papers and 2,289 citing papers (manually collected)
      • 475,807 sentence pairs, embedded with the pretrained Sent2Vec model
      • Cosine similarity > 0.57?
          • Yes: 31,624 +ve sentence pairs
          • No: 37,274 -ve sentence pairs (as described in the text)
      • Final corpus: 68,898 sentence pairs

  23. Validation and Test Set
      • Validation set:
          • 780 negative samples are randomly chosen from the negative portion of the human-annotated corpus created by Houngbo and Mercer.
          • 20 positive samples are randomly chosen in a similar way.
      • Test set:
          • The test set is the remaining 3057 samples, which contain 61 positive samples.
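The split sizes on this slide follow from the corpus figures given earlier (3857 pairs, 81 of them positive). The sketch below reproduces the arithmetic with placeholder samples; the function name, the seed, and the shuffle-then-slice procedure are assumptions, since the slides only say the samples are chosen randomly.

```python
import random
from typing import List, Tuple

Sample = Tuple[str, int]


def split_validation_test(
    positives: List[Sample],
    negatives: List[Sample],
    n_val_pos: int = 20,
    n_val_neg: int = 780,
    seed: int = 0,
) -> Tuple[List[Sample], List[Sample]]:
    """Randomly draw the validation samples; everything left over forms the test set."""
    rng = random.Random(seed)
    pos, neg = list(positives), list(negatives)
    rng.shuffle(pos)
    rng.shuffle(neg)
    validation = pos[:n_val_pos] + neg[:n_val_neg]
    test = pos[n_val_pos:] + neg[n_val_neg:]
    return validation, test


# Placeholder samples with the sizes of the human-annotated corpus.
positives = [("pos", i) for i in range(81)]
negatives = [("neg", i) for i in range(3857 - 81)]
validation, test = split_validation_test(positives, negatives)

print(len(validation))                           # 800
print(len(test))                                 # 3057
print(sum(1 for s in test if s[0] == "pos"))     # 61
```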

  24. Corpus Creation: Data Cleaning

  25. Data Cleaning
      • Deletion of Unnecessary Symbols
      • Capturing Different Patterns for Equations, Numbers, Chemical Names and Citations
      • Symbols and Their Replacements

  26. Deletion of Unnecessary Symbols
