Knowledge discovery from patient forums, Anne Dirkson, 12 June 2019



SLIDE 1

Discover the world at Leiden University

Knowledge discovery from patient forums

Anne Dirkson 12 June 2019

SLIDE 2

Patient forums are a knowledge gold mine

SLIDE 3

Medical anecdotes to new knowledge

Knowledge: what is known
New knowledge: input for research / clinical trials

SLIDE 4

clinical records, patient forums, biomedical literature

Advantages
+ Uncensored
+ Unprompted
+ Volume
+ Available

SLIDE 5

Patient forums are very noisy

insomnia vs. can’t sleep
ablation vs. remove
“gistory”

SLIDE 6

Challenges for spelling correction

  • Lack of domain-specific resources
  • Language is dynamic
  • Key medical terms are frequently misspelt [4]

4. Zhou et al. 2015. Context-Sensitive Spelling Correction of Consumer-Generated Content on Health Care, JMIR Med Inform, 3(3)

SLIDE 7

Current methods do not suffice

  • Traditional methods are unsupervised but rely on dictionaries for detection
  • Modern methods are supervised and rely on training data
  • State of the art for social media [5]:
      • Relies on a manually created dictionary to detect mistakes
      • Uses a language model of generic Twitter data to correct them

5. Sarker, 2017, A customizable pipeline for social media text normalization. Social Network Analysis and Mining, 7 (45), 1-13

SLIDE 8

Research questions

  1. To what extent can corpus-driven spelling correction reduce the out-of-vocabulary rate in medical social media text?
  2. To what extent can our corpus-driven spelling correction improve the accuracy of health-related classification tasks with social media text?
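The out-of-vocabulary (OOV) rate in the first research question can be illustrated with a minimal sketch. The tokenizer, the toy vocabulary, and the example posts below are assumptions made for illustration; they are not the resources used in the talk.

```python
import re


def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes (a simplification)."""
    return re.findall(r"[a-z']+", text.lower())


def oov_rate(posts, vocabulary):
    """Return the fraction of tokens in `posts` that are absent from `vocabulary`."""
    tokens = [tok for post in posts for tok in tokenize(post)]
    if not tokens:
        return 0.0
    oov_count = sum(1 for tok in tokens if tok not in vocabulary)
    return oov_count / len(tokens)


# Hypothetical vocabulary and posts, invented for this example.
vocabulary = {"i", "take", "gleevec", "for", "my", "stomach", "pain"}
before = ["I take gleevac for my stomack pain"]
after = ["I take gleevec for my stomach pain"]

print(oov_rate(before, vocabulary))  # 2 of the 7 tokens are out of vocabulary
print(oov_rate(after, vocabulary))   # no token is out of vocabulary after correction
```

Comparing the rate before and after correction gives one number per corpus, which is how a reduction in OOV rate could be reported under these assumptions.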

SLIDE 9

Our data

  • Facebook forum for GIST patients
  • 36,722 posts

1000 posts
500 posts / 500 posts
34 unique non-word errors / 23 unique non-word errors
+ 230 random correct tokens / + 340 random correct tokens
TRAINING SET / TEST SET

SLIDE 10

Spelling correction

Method                            | Result
Absolute Edit Dist.               | 56.6%
Relative Edit Dist.               | 56.6%
Weighted Absolute Edit Dist.      | 54.7%
Weighted Relative Edit Dist. [6]  | 62.3%
Sarker                            | 20.8%
TISC                              | 24.5%

Mistake     | Correction | Absolute | Relative    | Weighted Absolute | Weighted Relative | Sarker     | TISC
Gleevac     | Gleevec    | Gleevec  | Gleevec     | Gleevec           | Gleevec           | Colonic    | Gleevac
Stomack     | Stomach    | Stomach  | Stomach     | Smack             | Stomach           | Smack      | Smack
Resectected | Resected   | Resected | Resurrected | Resected          | Resected          | Rusticated | Resectected
Sutant      | Sutent     | Mutant   | Mutant      | Sutent            | Sutent            | mutant     | dunant

6. Based on frequency of 1-edit alterations compiled by Peter Norvig, https://norvig.com/ngrams/
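To make the difference between absolute and relative edit distance concrete, here is a minimal sketch. It assumes that "relative" means the Levenshtein distance divided by the length of the misspelled token; the slides do not define it, so treat that as an assumption. The candidate list and the helper names (`levenshtein`, `relative_distance`, `best_candidate`) are hypothetical.

```python
def levenshtein(a, b):
    """Absolute edit distance: minimum number of insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def relative_distance(mistake, candidate):
    """Edit distance normalised by the length of the mistake (assumed definition)."""
    return levenshtein(mistake, candidate) / len(mistake)


def best_candidate(mistake, candidates):
    """Pick the candidate with the smallest edit distance; ties break alphabetically."""
    return min(candidates, key=lambda cand: (levenshtein(mistake, cand), cand))


# Hypothetical candidate list; in a corpus-driven setting it would come from the forum data.
candidates = ["gleevec", "stomach", "smack", "resected", "sutent", "mutant"]

print(best_candidate("gleevac", candidates))
print(best_candidate("stomack", candidates))
print(relative_distance("gleevac", "gleevec"))
```

For a single mistake the two measures rank candidates identically. The normalised version mainly matters when one threshold is applied across tokens of different lengths, since one edit in a 4-letter word is a larger change than one edit in an 11-letter word.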

SLIDE 11

Unsupervised data-driven spelling correction

[Diagram not recoverable; only the comparison symbols ≤ and ≥ survive]

SLIDE 12

Unsupervised data-driven spelling correction

[Diagram not recoverable; only the comparison symbols ≤ and ≥ survive]

SLIDE 13

Spelling mistake detection

Method           | F0.5  | F1    | Recall | Precision
CELEX            | 0.551 | 0.634 | 1.0    | 0.464
Decision process | 0.888 | 0.871 | 0.844  | 0.900
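The F0.5 and F1 columns are both instances of the general F-beta score, which weights precision more heavily than recall when beta is below 1. A short sketch using the standard formula, applied to the precision and recall reported for the decision process (the function name `f_beta` is ours, not from the talk):

```python
def f_beta(precision, recall, beta=1.0):
    """General F-beta score: (1 + beta^2) * P * R / (beta^2 * P + R)."""
    if precision == 0 and recall == 0:
        return 0.0
    beta_sq = beta ** 2
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Precision and recall of the decision process, taken from the table above.
precision, recall = 0.900, 0.844

print(round(f_beta(precision, recall, beta=1.0), 3))  # 0.871
print(round(f_beta(precision, recall, beta=0.5), 3))  # 0.888
```

F0.5 suits mistake detection when a false alarm (changing a correct word) is considered more costly than a missed error.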

SLIDE 14

Internal validation on a second cancer forum

[Diagram labels: identified mistakes; GIST forum; corrections; health-related]

SLIDE 15

Manual error analysis of 50 most frequent OOV

Category                   | GIST forum | Reddit
Spelling error             | 3          | 1
Real Word                  | 11         | 21
Abbreviation               | 14         | 9
Slang                      | 6          | 13
Name of person or hospital | 14         | 2
Drug name                  | 1          | 4
Not English                | 1          | -
Total                      | 50         | 50

SLIDE 16

Classification task evaluation

Data Set                   | Size   | Change in F1                           | % of words corrected
Task 1 SMM4H               | 16,141 | +0.006                                 | 1.1
Task 4 SMM4H Flu vaccine   | 6,738  | +0.001                                 | 0.47
Flu Vaccination            | 3,798  | +0.002                                 | 0.83
Twitter Health             | 2,598  | +0.010*                                | 0.64
Task 4 SMM4H Flu infection | 1,034  | +0.011                                 | 0.29
Zika Conspiracy Tweets     | 588    | 0.011 [sign not legible in source]     | 1.1

SLIDE 17

Generic social media normalization

Method           | F1    | Precision | Recall
State of the art | 0.836 | 0.880     | 0.796
Our method       | 0.522 | 0.646     | 0.577

SLIDE 18

Current work

1. More data
2. Improve generalization
3. Error analysis
4. Use of context of error to improve correction


SLIDE 20

a.r.dirkson@liacs.leidenuniv.nl
github.com/AnneDirkson
www.annedirkson.nl