Using high-volume unstructured GP notes to predict stroke

SLIDE 1

Using high-volume unstructured GP notes to predict stroke

Anneloes Louwe, Master’s Thesis Project

Supervision:

  • Hine van Os, dept. Neurology & Epidemiology, LUMC
  • Suzan Verberne, Text Mining & Information Retrieval, LIACS
SLIDE 2

Contents

  • Study context and objectives
  • Preprocessing of primary care consultation notes
      • Cleaning and tokenization
      • Spelling correction
      • Keyphrase detection
  • Feature selection
      • Bag-of-words
      • Topic modeling
  • Prediction models

20-Nov-18 2

SLIDE 3

What is stroke?

  • Brain infarctions & brain hemorrhage
  • Netherlands: 43,000 strokes per year
  • Third leading cause of death

6/12/19 3 Cardiovasculair Risicomanagement, NHG

SLIDE 4

Prevention of stroke is key

  • Prevention by the general practitioner
  • Blood pressure & cholesterol medication
  • Lifestyle change
  • Simplistic risk chart, only 5 risk factors
  • Need for precision prevention (and thus prediction)!

Cardiovasculair Risicomanagement, NHG

SLIDE 5

Aim

  • Including free text in a prediction model for stroke
  • Identification of novel (women-specific) risk factors

SLIDE 6

Free text

  • Captures patients’ narrative
  • Supporting evidence
  • Uncertainty
  • Non-medical information (e.g. social problems)
  • Diagnosis descriptions
  • SOAP notes
      • S: Subjective
      • O: Objective
      • A: Assessment
      • P: Plan
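As a hedged illustration of how the four SOAP fields could be represented in code, the sketch below groups labelled lines into one record per consultation. The `SoapNote` class, the `group_soap_lines` helper and the `(label, text)` input format are assumptions made for illustration; the actual export format of the GP notes is not described in these slides.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical mapping from SOAP label to attribute name.
SOAP_FIELDS = {"S": "subjective", "O": "objective", "A": "assessment", "P": "plan"}


@dataclass
class SoapNote:
    """One consultation, with free-text lines grouped per SOAP field."""
    subjective: List[str] = field(default_factory=list)
    objective: List[str] = field(default_factory=list)
    assessment: List[str] = field(default_factory=list)
    plan: List[str] = field(default_factory=list)


def group_soap_lines(lines: List[Tuple[str, str]]) -> SoapNote:
    """Collect (label, text) pairs into a single SoapNote."""
    note = SoapNote()
    for label, text in lines:
        getattr(note, SOAP_FIELDS[label.upper()]).append(text)
    return note


# Invented example lines, not taken from any dataset.
note = group_soap_lines([
    ("S", "headache for three days"),
    ("O", "blood pressure 150/95"),
    ("A", "tension headache"),
    ("P", "recheck blood pressure in two weeks"),
    ("O", "no neurological deficits"),
])
print(note.objective)
```

Keeping the fields separate would allow later steps to weight, for example, the subjective narrative differently from the assessment, although the slides do not say whether this was done.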

SLIDE 7

Data overview

  • Pipeline development: ELAN dataset (n = 87,000)
  • Proof of concept: NEO dataset (n ≈ 6,000)
      • Cases (including myocardial infarctions): 182
      • Controls: 5,890
  • Main dataset: STIZON dataset (n = 3,000,000)

SLIDE 8

Preprocessing

Preparation

  • ICPC code (re)formatting (e.g. K90.00)
  • Grouping SOAP lines

Cleaning and tokenization

  • Lowercasing and punctuation removal
  • Token removal: stopwords, numbers, short words, medication specifications (e.g. 100mg or 100st), ZorgDomein codes

Spelling correction

  • Vocabulary: Clinspell, ICPC definitions and CoNLL
  • Single-character edit identification using Symmetric Delete
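The cleaning and spelling-correction steps above can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: the stopword list, the toy vocabulary, the example note (written in English for readability) and all function names are assumptions. The Symmetric Delete idea is that deletions are generated for both the vocabulary and the input token, so candidates are found by dictionary lookup alone; here each candidate is then verified to be within one insertion, deletion or substitution, and ties are broken alphabetically rather than by word frequency.

```python
import re

# Assumed toy resources; the real stopword list and vocabulary
# (Clinspell, ICPC definitions, CoNLL) are not reproduced here.
STOPWORDS = {"the", "has", "and", "for"}
VOCABULARY = {"headache", "paracetamol", "dizziness"}

# Medication specifications such as "100mg" or "100st".
DOSAGE = re.compile(r"^\d+(mg|st)\W*$")


def clean_and_tokenize(text, min_len=3):
    """Lowercase, strip punctuation and digits, drop stopwords and short tokens."""
    tokens = []
    for raw in text.lower().split():
        if DOSAGE.match(raw):
            continue
        word = re.sub(r"[\W\d_]", "", raw)
        if len(word) >= min_len and word not in STOPWORDS:
            tokens.append(word)
    return tokens


def deletes(word):
    """The word itself plus every variant with one character removed."""
    return {word} | {word[:i] + word[i + 1:] for i in range(len(word))}


def build_index(vocabulary):
    """Map each delete-variant to the vocabulary words that produce it."""
    index = {}
    for entry in vocabulary:
        for variant in deletes(entry):
            index.setdefault(variant, set()).add(entry)
    return index


def within_one_edit(a, b):
    """True if a and b differ by at most one insertion, deletion or substitution."""
    if a == b:
        return True
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]
    return a[i:] == b[i + 1:]


def correct(token, index, vocabulary):
    """Return a vocabulary word within one edit of token, or token unchanged."""
    if token in vocabulary:
        return token
    candidates = set()
    for variant in deletes(token):
        candidates |= index.get(variant, set())
    verified = sorted(c for c in candidates if within_one_edit(token, c))
    return verified[0] if verified else token


index = build_index(VOCABULARY)
tokens = clean_and_tokenize("Pt has headche, BP 120/80. Paracetamol 500mg 30st")
corrected = [correct(tok, index, VOCABULARY) for tok in tokens]
print(corrected)
```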

Keyphrase detection

  • Kullback–Leibler divergence
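One common way to use Kullback–Leibler divergence for keyphrase detection is to score each candidate phrase by its pointwise contribution p · log(p / q), where p is the phrase's probability in the corpus of interest and q its probability in a background corpus. The sketch below assumes that formulation, bigram candidates, add-one smoothing and invented English token lists; the slides do not specify which variant or which background corpus was used.

```python
import math
from collections import Counter


def bigrams(tokens):
    """Adjacent token pairs, joined into candidate phrases."""
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def kl_scores(foreground_docs, background_docs, smoothing=1.0):
    """Pointwise KL contribution p * log(p / q) for each foreground bigram."""
    fg = Counter(p for doc in foreground_docs for p in bigrams(doc))
    bg = Counter(p for doc in background_docs for p in bigrams(doc))
    vocab_size = len(set(fg) | set(bg))
    fg_total = sum(fg.values()) + smoothing * vocab_size
    bg_total = sum(bg.values()) + smoothing * vocab_size
    scores = {}
    for phrase in fg:
        p = (fg[phrase] + smoothing) / fg_total
        q = (bg[phrase] + smoothing) / bg_total
        scores[phrase] = p * math.log(p / q)
    return scores


# Invented toy corpora for illustration only.
foreground = [
    ["chest", "pain", "on", "exertion"],
    ["chest", "pain", "at", "rest"],
    ["high", "blood", "pressure"],
]
background = [["at", "rest"], ["on", "exertion"], ["at", "rest"]]

scores = kl_scores(foreground, background)
best = max(scores, key=scores.get)
print(best)
```

Phrases that are frequent in the foreground but rare in the background receive high positive scores, while phrases that are relatively more common in the background score below zero.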

SLIDE 9

Cases vs. controls

SLIDE 10

Feature Selection

  • Unified Medical Language System (UMLS): medical concept extraction
  • Bag-of-words
  • Topic modeling
      • Latent Dirichlet Allocation (LDA)
      • Non-negative Matrix Factorization (NMF)
      • Topic coherence: word embedding model (Word2Vec)
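To make the bag-of-words and NMF steps concrete, the sketch below builds a document-term count matrix and factorizes it with the standard multiplicative update rules. It is a pure-Python illustration under assumptions: the toy documents, the number of topics and all function names are invented, and a real pipeline would more likely rely on an established library implementation.

```python
import random


def bag_of_words(docs):
    """Return (sorted vocabulary, document-term count matrix)."""
    vocab = sorted({tok for doc in docs for tok in doc})
    column = {tok: j for j, tok in enumerate(vocab)}
    matrix = [[0.0] * len(vocab) for _ in docs]
    for i, doc in enumerate(docs):
        for tok in doc:
            matrix[i][column[tok]] += 1.0
    return vocab, matrix


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    return [list(row) for row in zip(*a)]


def frobenius_error(v, w, h):
    """Squared reconstruction error ||V - WH||^2."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Approximate V by W H (all non-negative) using multiplicative updates."""
    rng = random.Random(seed)
    n_docs, n_terms = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(n_topics)] for _ in range(n_docs)]
    h = [[rng.random() + eps for _ in range(n_terms)] for _ in range(n_topics)]
    for _ in range(n_iter):
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(n_terms)]
             for k in range(n_topics)]
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(n_topics)]
             for i in range(n_docs)]
    return w, h


# Invented toy documents (already cleaned and tokenized).
docs = [
    ["headache", "dizzy"],
    ["headache", "dizzy", "dizzy"],
    ["smoking", "alcohol"],
    ["alcohol", "smoking", "smoking"],
]
vocab, matrix = bag_of_words(docs)
w, h = nmf(matrix, n_topics=2)

# Show the two highest-weighted terms for each topic.
for k, row in enumerate(h):
    top = sorted(zip(row, vocab), reverse=True)[:2]
    print(k, [term for _, term in top])
```

Each row of `h` holds term weights for one topic and each row of `w` holds topic weights for one document; the latter are the kind of features that could be passed on to a prediction model.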

SLIDE 11

Topic Coherence
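Embedding-based topic coherence is often computed as the mean pairwise cosine similarity between the vectors of a topic's top words. The sketch below assumes that definition and uses hand-made two-dimensional vectors in place of a trained Word2Vec model; the vectors, words and function names are all illustrative assumptions.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def topic_coherence(top_words, vectors):
    """Mean pairwise cosine similarity over top words that have a vector."""
    pairs = [(a, b) for a, b in combinations(top_words, 2)
             if a in vectors and b in vectors]
    if not pairs:
        return 0.0
    return sum(cosine(vectors[a], vectors[b]) for a, b in pairs) / len(pairs)


# Hand-made toy vectors standing in for trained word embeddings.
vectors = {
    "headache": [1.0, 0.1],
    "dizzy": [0.9, 0.2],
    "smoking": [0.1, 1.0],
}

coherent = topic_coherence(["headache", "dizzy"], vectors)
mixed = topic_coherence(["headache", "smoking"], vectors)
print(coherent > mixed)
```

A score like this could be used to compare LDA and NMF topics, or to choose the number of topics, without needing labelled data.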

SLIDE 12

Models

  • Logistic Regression
  • Random Forest
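As a rough illustration of the first model type, the sketch below trains a logistic regression with batch gradient descent on a tiny invented feature matrix. The features (counts of two hypothetical terms per patient), labels, learning rate and function names are assumptions for illustration only; they are not the study's data or settings, and a random forest is not shown.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(row, weights, bias):
    """Predicted probability of the positive class for one feature row."""
    return sigmoid(sum(w * x for w, x in zip(weights, row)) + bias)


def train_logistic_regression(X, y, lr=0.1, epochs=500):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n_features = len(X[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for row, label in zip(X, y):
            error = predict_proba(row, weights, bias) - label
            for j, x in enumerate(row):
                grad_w[j] += error * x
            grad_b += error
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(X)
    return weights, bias


# Invented toy data: columns are counts of two hypothetical terms per patient.
X = [[2, 1], [3, 2], [2, 2], [0, 0], [1, 0], [0, 1]]
y = [1, 1, 1, 0, 0, 0]

weights, bias = train_logistic_regression(X, y)
print(predict_proba([3, 2], weights, bias) > predict_proba([0, 0], weights, bias))
```

With bag-of-words or topic features as input, the fitted weights indicate which terms or topics push the predicted risk up or down, which is one reason a linear model is a natural baseline next to a random forest.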

SLIDE 13

Models

SLIDE 14

Next steps

  • STIZON dataset
      • Experimentation
      • Pipeline optimization
  • Negation detection

SLIDE 15

Thank you!


LUMC Neurologie

  • Hendrikus J. H. van Os
  • Marieke J. H. Wermer

LUMC PHEG

  • Mattijs A. Numans
  • Tobias N. Bonten
  • Niels H. Chavannes
  • Rolf H. H. Groenwold
  • Janet Kist
  • Michiel Meulenbroek
  • Frederike Buechner

Vrije Universiteit

  • Mark Hoogendoorn
  • Ioannis Pantazis

LIACS

  • Matthijs de Leeuw
  • Suzan Verberne
  • Teddy Etoeharnowo
  • Anneloes Louwe

LUMC Statistiek

  • Hein Putter
  • Erik van Zwet

Turku University (Finland)

  • Sepinoud Azimi