
Using high-volume unstructured GP notes to predict stroke

Using high-volume unstructured GP notes to predict stroke. Anneloes Louwe, Master's Thesis Project. Supervision: Hine van Os, Dept. of Neurology & Epidemiology, LUMC; Suzan Verberne, Text Mining & Information Retrieval, LIACS.


  1. Using high-volume unstructured GP notes to predict stroke. Anneloes Louwe, Master's Thesis Project. Supervision: • Hine van Os, Dept. of Neurology & Epidemiology, LUMC • Suzan Verberne, Text Mining & Information Retrieval, LIACS

  2. Contents • Study context and objectives • Preprocessing of primary care consultation notes • Cleaning and tokenization • Spelling correction • Keyphrase detection • Feature selection • Bag-of-words • Topic modeling • Prediction models (20-Nov-18)

  3. What is stroke? • Brain infarctions & brain hemorrhage • Netherlands: 43,000 strokes per year • 3rd leading cause of death (Cardiovasculair Risicomanagement, NHG; 6/12/19)

  4. Prevention of stroke is key • Prevention by general practitioner • Blood pressure & cholesterol medication • Lifestyle change • Simplistic risk chart, only 5 risk factors • Need for precision prevention (and thus prediction)! (Cardiovasculair Risicomanagement, NHG)

  5. Aim • Including free text in a prediction model for stroke • Identification of novel (women-specific) risk factors

  6. Free text • Captures patients' narrative • Supporting evidence • Uncertainty • Non-medical information (e.g. social problems) • Diagnosis descriptions • SOAP notes: S: Subjective, O: Objective, A: Assessment, P: Plan

  7. Data overview • Pipeline development: ELAN dataset (n = 87,000) • Proof of concept: NEO dataset (n ≈ 6,000); cases (including heart infarctions): 182; controls: 5,890 • Main dataset: STIZON dataset (n = 3,000,000)

  8. Preprocessing • Preparation: ICPC code (re)formatting (e.g. K90.00); grouping SOAP lines • Cleaning and tokenization: lowercasing and punctuation removal; token removal (stopwords, numbers, short words, medication specifications (e.g. 100mg or 100st), ZorgDomein codes) • Spelling correction: vocabulary from Clinspell, ICPC definitions and CoNLL; single-character edit identification using Symmetric Delete • Keyphrase detection: Kullback–Leibler divergence
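The cleaning and spelling-correction steps on slide 8 can be sketched roughly as follows. This is a minimal illustration, not the thesis pipeline: the stopword list, the unit pattern for medication specifications, the minimum token length, and the two-word vocabulary are all hypothetical stand-ins (the slide names Clinspell, ICPC definitions and CoNLL as the real vocabulary sources). The Symmetric Delete idea is that, instead of generating every insertion, substitution and transposition of a token, only single-character deletions of both the vocabulary words and the token are compared.

```python
import re
from collections import defaultdict

# Hypothetical stopword list; the list used in the thesis is not given.
STOPWORDS = {"de", "het", "een", "en", "van"}
# Assumed pattern for medication specifications such as "100mg" or "100st".
MED_SPEC = re.compile(r"\d+(mg|st)")


def clean_and_tokenize(text, min_len=3):
    """Lowercase, replace punctuation with spaces, and drop unwanted tokens."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    tokens = []
    for tok in text.split():
        if tok in STOPWORDS or tok.isdigit() or MED_SPEC.fullmatch(tok):
            continue
        if len(tok) < min_len:  # short words
            continue
        tokens.append(tok)
    return tokens


def deletes(word):
    """All strings obtained by deleting exactly one character from word."""
    return {word[:i] + word[i + 1:] for i in range(len(word))}


def build_index(vocabulary):
    """Map every vocabulary word and each of its single deletes to its source words."""
    index = defaultdict(set)
    for word in vocabulary:
        index[word].add(word)
        for d in deletes(word):
            index[d].add(word)
    return index


def candidates(token, index):
    """Vocabulary words that share the token, or one of its single deletes, in the index.

    This is only the candidate lookup: a few candidates can be two edits away,
    so a full implementation would verify the edit distance afterwards.
    """
    found = set(index.get(token, ()))
    for d in deletes(token):
        found |= index.get(d, set())
    return found


index = build_index({"hoofdpijn", "bloeddruk"})  # toy vocabulary
print(clean_and_tokenize("Pt. heeft hofdpijn, 100mg paracetamol 2 dd"))
print(candidates("hofdpijn", index))
```

Precomputing the deletes once per vocabulary word keeps each lookup to a handful of dictionary accesses, which matters when the same vocabulary is applied to millions of notes.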

  9. Cases vs. controls
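Slide 8 names Kullback–Leibler divergence for keyphrase detection without further detail. One common formulation scores each term by its contribution to the divergence between a foreground corpus and a background corpus. The sketch below assumes that formulation; treating case notes as foreground and control notes as background, the add-one smoothing, and the toy tokens are all assumptions made for illustration.

```python
import math
from collections import Counter


def kl_scores(foreground_tokens, background_tokens, smoothing=1.0):
    """Per-term contribution to KL(foreground || background), with additive smoothing."""
    fg = Counter(foreground_tokens)
    bg = Counter(background_tokens)
    vocab = set(fg) | set(bg)
    fg_total = sum(fg.values()) + smoothing * len(vocab)
    bg_total = sum(bg.values()) + smoothing * len(vocab)
    scores = {}
    for term in vocab:
        p = (fg[term] + smoothing) / fg_total  # smoothed foreground probability
        q = (bg[term] + smoothing) / bg_total  # smoothed background probability
        scores[term] = p * math.log(p / q)
    return scores


# Toy token lists standing in for case notes and control notes.
case_tokens = ["afasie", "afasie", "afasie", "hoofdpijn"]
control_tokens = ["hoofdpijn", "hoofdpijn", "hoest", "hoest"]
scores = kl_scores(case_tokens, control_tokens)
for term in sorted(scores, key=scores.get, reverse=True):
    print(term, round(scores[term], 3))
```

Terms with a high positive score are relatively more frequent in the foreground; the same function works on n-grams if phrases rather than single words are counted.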

  10. Feature selection • Unified Medical Language System (UMLS): medical concept extraction • Bag-of-words • Topic modeling: Latent Dirichlet Allocation (LDA); Non-negative Matrix Factorization (NMF); topic coherence: word embedding model (Word2Vec)
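As a small illustration of the bag-of-words representation listed on this slide, the sketch below turns tokenized notes into count vectors. The function names and toy documents are hypothetical; the thesis may well have used a library vectorizer instead.

```python
from collections import Counter


def build_vocabulary(docs):
    """Sorted list of distinct tokens, so every token gets a fixed column index."""
    return sorted({tok for doc in docs for tok in doc})


def bag_of_words(docs, vocabulary):
    """One count vector per document, aligned with the vocabulary order."""
    rows = []
    for doc in docs:
        counts = Counter(doc)
        rows.append([counts[tok] for tok in vocabulary])
    return rows


# Toy tokenized notes.
docs = [["hoofdpijn", "duizelig", "hoofdpijn"], ["bloeddruk", "hoofdpijn"]]
vocabulary = build_vocabulary(docs)
matrix = bag_of_words(docs, vocabulary)
print(vocabulary)
print(matrix)
```

Topic models such as LDA and NMF take a document-term matrix of this kind as input and factor it into document-topic and topic-term weights.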

  11. Topic Coherence
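The slides state only that topic coherence was assessed with a word embedding model (Word2Vec). One common way to do this, assumed here, is the mean pairwise cosine similarity between the embedding vectors of a topic's top words. The two-dimensional vectors below are made up for the example; real Word2Vec vectors would come from a trained model.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def topic_coherence(top_words, embeddings):
    """Mean pairwise cosine similarity of the top words that have a vector."""
    vectors = [embeddings[w] for w in top_words if w in embeddings]
    pairs = list(combinations(vectors, 2))
    if not pairs:  # fewer than two known words: no pair to compare
        return 0.0
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Hypothetical toy embeddings.
embeddings = {
    "hoofdpijn": [1.0, 0.1],
    "migraine": [0.9, 0.2],
    "fiets": [0.0, 1.0],
}
print(topic_coherence(["hoofdpijn", "migraine"], embeddings))
print(topic_coherence(["hoofdpijn", "fiets"], embeddings))
```

A score like this can be used to compare LDA and NMF topics, or to choose the number of topics, by preferring settings whose topics have higher average coherence.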

  12. Models • Logistic Regression • Random Forest
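Of the two model families on this slide, logistic regression is the simpler one to sketch from scratch. The example below fits it by batch gradient descent on toy count features; the learning rate, number of epochs, and data are arbitrary assumptions, and in practice a library implementation would normally be used.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(X, y, lr=0.5, epochs=500):
    """Fit weights and bias by batch gradient descent on the mean log-loss."""
    n_features = len(X[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for row, label in zip(X, y):
            z = sum(w * x for w, x in zip(weights, row)) + bias
            error = sigmoid(z) - label  # gradient of the log-loss w.r.t. z
            for j, x in enumerate(row):
                grad_w[j] += error * x
            grad_b += error
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(X)
    return weights, bias


def predict_proba(row, weights, bias):
    """Predicted probability of the positive class (here: a case)."""
    return sigmoid(sum(w * x for w, x in zip(weights, row)) + bias)


# Toy data: column 0 is a term seen mostly in cases, column 1 mostly in controls.
X = [[2, 0], [1, 0], [0, 1], [0, 2]]
y = [1, 1, 0, 0]
weights, bias = train_logistic_regression(X, y)
print(weights, bias)
print(predict_proba([2, 0], weights, bias))
```

The learned weights are directly interpretable per feature, which fits the stated aim of identifying novel risk factors; a random forest trades some of that interpretability for the ability to capture interactions.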

  13. Models

  14. Next steps • STIZON dataset: experimentation; pipeline optimization • Negation detection

  15. Thank you! Vrije Universiteit LUMC Neurologie • Mark Hoogendoorn • Hendrikus J. H. van Os • Ioannis Pantazis • Marieke J. H. Wermer LIACS LUMC PHEG • Matthijs de Leeuw • Mattijs A. Numans • Suzan Verberne • Tobias N. Bonten • Teddy Etoeharnowo • Niels H. Chavannes • Anneloes Louwe • Rolf H. H. Groenwold LUMC Statistiek • Janet Kist • Hein Putter • Michiel Meulenbroek • Erik van Zwet • Frederike Buechner Turku University (Finland) • Sepinoud Azimi
