Abbreviation detection for biomedical articles by Sonja Kenari

SLIDE 1

by Sonja Kenari

Abbreviation detection for biomedical articles

SLIDE 2

Introduction
Background
Implementation
Results
Further Improvements

Agenda

SLIDE 3

Abbreviation detection Dictionary tagger NER Relationship extraction

Introduction

Full project description

COVID-19 Open Research Dataset Challenge (CORD-19):

What do we know about vaccines and therapeutics?

SLIDE 4

Introduction

Abbreviation Detection

?

spaCy: Python library for NLP

Makes it easier to:
  • Find articles of interest faster
  • Keep up with the number of new abbreviations

Abbreviation detection

SLIDE 5

Background

Abbreviation Detection

scispaCy:

AbbreviationDetector

Pre-trained models by spaCy
Detect: abbreviations & definitions
Accuracy?

long form short form
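A minimal sketch of how short form / long form pairs can be found in running text. scispaCy documents its AbbreviationDetector as an implementation of the Schwartz and Hearst (2003) algorithm; the standard-library code below is a simplified illustration of that idea, not scispaCy's own code. The function names, the regular expression and the example sentence are assumptions made for this sketch.

```python
import re

# Assumption: a short form is 2-10 characters in parentheses, starting with a letter.
SHORT_FORM = re.compile(r"\(([A-Za-z][A-Za-z0-9-]{1,9})\)")


def best_long_form(short_form, candidate):
    """Match the short form's characters right to left inside the candidate text.

    Returns the shortest tail of `candidate` that contains every character of
    the short form in order, or None when no such match exists.
    """
    s = len(short_form) - 1
    l = len(candidate) - 1
    while s >= 0:
        ch = short_form[s].lower()
        if ch.isalnum():
            # Walk left until the character matches; the first character of
            # the short form must additionally sit at the start of a word.
            while l >= 0 and (
                candidate[l].lower() != ch
                or (s == 0 and l > 0 and candidate[l - 1].isalnum())
            ):
                l -= 1
            if l < 0:
                return None
            l -= 1
        s -= 1
    # Extend the match to the beginning of the word it starts in.
    start = candidate.rfind(" ", 0, l + 1) + 1
    return candidate[start:]


def find_abbreviations(text):
    """Return (short form, long form) pairs for 'long form (SF)' patterns."""
    pairs = []
    for match in SHORT_FORM.finditer(text):
        short_form = match.group(1)
        words = text[: match.start()].split()
        # Search window suggested by Schwartz and Hearst: min(|SF| + 5, 2 * |SF|) words.
        max_words = min(len(short_form) + 5, len(short_form) * 2)
        candidate = " ".join(words[-max_words:])
        long_form = best_long_form(short_form, candidate)
        if long_form:
            pairs.append((short_form, long_form))
    return pairs


example = (
    "Severe acute respiratory syndrome (SARS) spreads fast. "
    "The virus binds angiotensin-converting enzyme 2 (ACE2)."
)
for short_form, long_form in find_abbreviations(example):
    print(short_form, "=", long_form)
```

The sketch only handles the "long form (short form)" order and skips the extra filtering a production detector applies, which is one reason long forms are harder to get right than short forms.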

SLIDE 6

data subset [json] metadata file [csv] pubannotation [json] 100 out of 60,000 articles

Implementation

Generate Pubannotations
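A hedged sketch of what generating a PubAnnotation file for one article could look like. The top-level keys ("text", "denotations" with "id", "span" and "obj") follow the PubAnnotation JSON format; the labels "LongForm" and "ShortForm", the default source values and the function name are hypothetical and not taken from the project.

```python
import json


def build_pubannotation(text, pairs, sourcedb="CORD-19", sourceid="example"):
    """Build a PubAnnotation-style dict for (short form, long form) pairs.

    Only the first occurrence of each surface string is annotated, which
    keeps the sketch short; the label names are hypothetical.
    """
    denotations = []
    for short_form, long_form in pairs:
        for label, surface in (("LongForm", long_form), ("ShortForm", short_form)):
            begin = text.find(surface)
            if begin == -1:
                continue  # surface string not present in this text
            denotations.append(
                {
                    "id": f"T{len(denotations) + 1}",
                    "span": {"begin": begin, "end": begin + len(surface)},
                    "obj": label,
                }
            )
    return {
        "sourcedb": sourcedb,
        "sourceid": sourceid,
        "text": text,
        "denotations": denotations,
    }


article_text = "Severe acute respiratory syndrome (SARS) is a viral disease."
annotation = build_pubannotation(
    article_text, [("SARS", "Severe acute respiratory syndrome")]
)
print(json.dumps(annotation, indent=2))
```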

SLIDE 7

Web scraping: metadata file [csv] → url → HTML parser (BeautifulSoup) → search for "abbreviation", "abbreviations", "Abbreviation", "Abbreviations" → csv files
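The web-scraping step can be sketched as follows. The project used BeautifulSoup; to keep the example free of third-party dependencies, this sketch swaps in Python's built-in html.parser instead. It assumes the article page has a heading that starts with "Abbreviation" followed by a paragraph of "SF: long form; SF: long form" entries. That layout, the class and function names and the CSV column names are all assumptions, and the example works on an HTML string rather than fetching a url.

```python
import csv
import io
from html.parser import HTMLParser


class AbbreviationSectionParser(HTMLParser):
    """Collect paragraph text that follows an 'Abbreviation(s)' heading."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self._in_heading = False
        self._heading_text = []
        self._capture = False  # True while inside the abbreviations section
        self._in_paragraph = False
        self.section_text = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._in_heading = True
            self._heading_text = []
            self._capture = False  # any new heading ends the previous section
        elif tag == "p" and self._capture:
            self._in_paragraph = True

    def handle_endtag(self, tag):
        if tag in self.HEADINGS:
            self._in_heading = False
            heading = "".join(self._heading_text).strip().lower()
            # Lower-casing covers all four spellings searched for on the slide.
            self._capture = heading.startswith("abbreviation")
        elif tag == "p":
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_heading:
            self._heading_text.append(data)
        elif self._in_paragraph:
            self.section_text.append(data)


def scrape_abbreviations(html):
    """Return (short form, long form) pairs from the abbreviations section."""
    parser = AbbreviationSectionParser()
    parser.feed(html)
    parser.close()
    pairs = []
    for entry in "".join(parser.section_text).split(";"):
        short_form, sep, long_form = entry.partition(":")
        if sep and short_form.strip() and long_form.strip():
            pairs.append((short_form.strip(), long_form.strip().rstrip(".")))
    return pairs


def pairs_to_csv(pairs):
    """Serialise the pairs as CSV text (column names are hypothetical)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["short_form", "long_form"])
    writer.writerows(pairs)
    return buffer.getvalue()


page = """
<h2>Results</h2><p>Not relevant: ignore.</p>
<h2>Abbreviations</h2>
<p>ACE2: angiotensin-converting enzyme 2; SARS: severe acute respiratory syndrome.</p>
<h2>References</h2><p>Ref: x</p>
"""
print(pairs_to_csv(scrape_abbreviations(page)))
```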

Implementation

Generating files of abbreviations

scispaCy: data subset [json] → full texts → AbbreviationDetector → csv files

  • Output file format

SLIDE 8

Compare the 2:
  • detected abbreviations with spaCy [csv]
  • detected abbreviations with web scraping [csv]

Number of unique short forms detected by spaCy / Number of short forms detected by web scraping = short forms hit rate (%)

Number of unique long forms detected by spaCy / Number of long forms detected by web scraping = long forms hit rate (%)

Implementation

Evaluation
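The comparison can be sketched as a small function. It rests on two assumptions not stated on the slide: only detected forms that also appear in the web-scraped list count toward the numerator, and forms are compared case-insensitively after trimming whitespace. The function name and the sample lists are illustrative only.

```python
def hit_rate(detected, reference):
    """Percentage of reference (web-scraped) forms also found by the detector.

    Returns None when the reference list is empty, since the rate is
    undefined in that case.
    """
    reference_set = {form.strip().lower() for form in reference}
    if not reference_set:
        return None
    detected_set = {form.strip().lower() for form in detected}
    return 100.0 * len(detected_set & reference_set) / len(reference_set)


# Illustrative lists, not project data.
detected_short_forms = ["ACE2", "sars", "PCR"]
scraped_short_forms = ["ACE2", "SARS", "RNA", "MERS"]
print(hit_rate(detected_short_forms, scraped_short_forms))
```

The same function applies to long forms by passing the long-form columns of the two CSV files.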

SLIDE 9

Short forms hit rate: highest 87.5%, lowest 25%

Long forms hit rate: highest 52.6%, lowest 0%

Notable faults
  • spaCy weak on long forms
  • text in the json files not updated to match the articles at the urls
  • faults in denotation extraction

20 out of 100 Abbreviation lists in

Results

SLIDE 10

web scraper

Update data

Further Improvements

spaCy

Improve the results

Extract from Pubannotations

Instead of full text extraction

Optimize programs

Make them more time-efficient

SLIDE 11

Thank you for listening!

Questions...?

Sonja Kenari nat14sta@student.lu.se 2020-05-29
