
Knowledge discovery from patient forums. Anne Dirkson, 12 June 2019


  1. Knowledge discovery from patient forums. Anne Dirkson, 12 June 2019. Discover the world at Leiden University.

  2. Patient forums are a knowledge gold mine

  3. Medical anecdotes to new knowledge [diagram labels: What is known; New Knowledge; Knowledge; Input for research/clinical trials]

  4. Biomedical literature, clinical records, patient forums
     Advantages of patient forums:
     + Uncensored
     + Unprompted
     + Volume
     + Available

  5. Patient forums are very noisy: insomnia vs. can’t sleep; ablation vs. remove; “gistory”

  6. Challenges for spelling correction
     • Lack of domain-specific resources
     • Language is dynamic
     • Key medical terms are frequently misspelt [4]
     [4] Zhou et al. 2015. Context-Sensitive Spelling Correction of Consumer-Generated Content on Health Care. JMIR Med Inform, 3(3).

  7. Current methods do not suffice
     • Traditional methods are unsupervised but rely on dictionaries for detection
     • Modern methods are supervised and rely on training data
     • State of the art for social media [5]: relies on a manually created dictionary to detect mistakes and uses a language model of generic Twitter data to correct them
     [5] Sarker, 2017. A customizable pipeline for social media text normalization. Social Network Analysis and Mining, 7(45), 1-13.

  8. Research questions
     1. To what extent can corpus-driven spelling correction reduce the out-of-vocabulary rate in medical social media text?
     2. To what extent can our corpus-driven spelling correction improve the accuracy of health-related classification tasks with social media text?
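
The out-of-vocabulary (OOV) rate in the first research question can be illustrated with a minimal Python sketch. The tokenizer, the tiny `vocabulary` set, and the example post are assumptions made for illustration only (the misspellings are borrowed from the spelling correction slide); the slides do not say how the rate was actually computed.

```python
import re

def oov_rate(text, vocabulary):
    """Fraction of tokens in `text` that do not occur in `vocabulary`."""
    # Hypothetical tokenizer: lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    unknown = [t for t in tokens if t not in vocabulary]
    return len(unknown) / len(tokens)

# Toy vocabulary and post, invented for this example.
vocabulary = {"i", "take", "gleevec", "for", "my", "stomach"}
post = "I take gleevac for my stomack"

print(oov_rate(post, vocabulary))                              # 2 of 6 tokens are OOV
print(oov_rate("I take gleevec for my stomach", vocabulary))   # 0.0 after correction
```

Correcting the two misspellings brings the rate for this toy post from one third to zero, which is the kind of reduction the research question asks about.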

  9. Our data
     • Facebook forum for GIST patients
     • 36,722 posts
     • 1000 posts, divided into:
       TRAINING SET: 500 posts, 34 unique non-word errors + 340 random correct tokens
       TEST SET: 500 posts, 23 unique non-word errors + 230 random correct tokens

  10. Spelling correction
     Methods: Sarker; TISC; Absolute Edit Dist.; Relative Edit Dist.; Weighted Absolute Edit Dist. [6]; Weighted Relative Edit Dist.
     Accuracy: 56.6%, 56.6%, 54.7%, 62.3%, 20.8%, 24.5%
     | Mistake | Correction | Method outputs |
     |---|---|---|
     | Gleevac | Gleevec | Gleevec, Gleevec, Gleevec, Gleevec, Colonic, Gleevac |
     | Stomack | Stomach | Stomach, Stomach, Smack, Stomach, Smack, Smack |
     | Resectected | Resected | Resected, Resurrected, Resected, Resected, Rusticated, Resectected |
     | Sutant | Sutent | Mutant, Mutant, Sutent, Sutent, mutant, dunant |
     [6] Based on frequency of 1-edit alterations compiled by Peter Norvig, https://norvig.com/ngrams/
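
As a rough illustration of edit-distance-based correction, the sketch below ranks candidate words from a corpus by Levenshtein distance and breaks ties by corpus frequency. It is not the method from the slides: the slide does not define its distance variants, so "relative" is assumed here to mean the absolute distance divided by the length of the misspelled token, and `corpus_counts` and the `max_relative` threshold are invented for the example.

```python
from collections import Counter

def levenshtein(a, b):
    """Absolute edit distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def relative_distance(token, candidate):
    # Assumption: relative distance = absolute distance / length of the token.
    return levenshtein(token, candidate) / max(len(token), 1)

def correct(token, corpus_counts, max_relative=0.4):
    """Return the closest corpus word, preferring frequent words on ties."""
    best = min(corpus_counts,
               key=lambda w: (levenshtein(token, w), -corpus_counts[w]))
    # Leave the token unchanged if even the best candidate is too far away.
    return best if relative_distance(token, best) <= max_relative else token

# Hypothetical word counts standing in for a domain-specific forum corpus.
corpus_counts = Counter({"gleevec": 120, "stomach": 80, "smack": 2,
                         "sutent": 40, "mutant": 5, "resected": 30})

for mistake in ["gleevac", "stomack", "resectected", "sutant"]:
    print(mistake, "->", correct(mistake, corpus_counts))
```

In this toy setup "sutant" is one edit away from both "sutent" and "mutant"; the in-domain frequency decides in favour of the drug name, which hints at why counts from the forum itself can matter more than counts from generic text.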

  11. Unsupervised data-driven spelling correction [diagram: ≥, ≤]

  12. Unsupervised data-driven spelling correction [diagram: ≥, ≤]

  13. Spelling mistake detection
     | Method | F0.5 | F1 | Recall | Precision |
     |---|---|---|---|---|
     | CELEX | 0.551 | 0.634 | 1.0 | 0.464 |
     | Decision process | 0.888 | 0.871 | 0.844 | 0.900 |
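
For readers unfamiliar with the F0.5 and F1 columns, the following sketch computes the standard F-beta score from precision and recall, using the "Decision process" figures (precision 0.900, recall 0.844) as input. The function name `f_beta` is chosen for this example; F0.5 weights precision more heavily than recall, while F1 weights them equally.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (standard F-beta)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Precision and recall of the "Decision process" row.
f1 = f_beta(0.900, 0.844)
f05 = f_beta(0.900, 0.844, beta=0.5)
print(round(f1, 3), round(f05, 3))  # 0.871 0.888
```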

  14. Internal validation on a second cancer forum [figure labels: GIST forum; identified mistakes; corrections; health-related]

  15. Manual error analysis of 50 most frequent OOV
     | Category | GIST forum | Reddit |
     |---|---|---|
     | Spelling error | 3 | 1 |
     | Real Word | 11 | 21 |
     | Abbreviation | 14 | 9 |
     | Slang | 6 | 13 |
     | Name of person or hospital | 14 | 2 |
     | Drug name | 1 | 4 |
     | Not English | 1 | |
     | Total | 50 | 50 |

  16. Classification task evaluation
     | Data Set | Size | Change in F1 | % of words corrected |
     |---|---|---|---|
     | Task 1 SMM4H | 16,141 | +0.006 | 1.1 |
     | Task 4 SMM4H Flu vaccine | 6,738 | +0.001 | 0.47 |
     | Flu Vaccination | 3,798 | +0.002 | 0.83 |
     | Twitter Health | 2,598 | +0.010* | 0.64 |
     | Task 4 SMM4H Flu infection | 1,034 | +0.011 | 0.29 |
     | Zika Conspiracy Tweets | 588 | -0.011 | 1.1 |

  17. Generic social media normalization
     | Method | F1 | Precision | Recall |
     |---|---|---|---|
     | State of the art | 0.836 | 0.880 | 0.796 |
     | Our method | 0.522 | 0.646 | 0.577 |

  18. Current work
     1. More data
     2. Improve generalization
     3. Error analysis
     4. Use of context of error to improve correction

  20. a.r.dirkson@liacs.leidenuniv.nl, github.com/AnneDirkson, www.annedirkson.nl
