Institut für Maschinelle Sprachverarbeitung
Text Mining on Clinical Data
Robert McHardy
Outline
- Motivation
- Medical Entity Recognition
- Anonymization of Medical Reports
- Knowledge-based Biomedical Word Sense Disambiguation
- Extraction of Potential Adverse Drug Events
- Resources
5.12.2017 Universität Stuttgart 2
Motivation — Different Users
Motivation — Why do we need Text Mining on Clinical Data?
- Doctors need to know whether a drug is safe to use
  - As fast as possible
- We don't want to suffer from unsafe drugs
- Researchers want to use the data
  - It has to be anonymized
Motivation — PubMed, again!
Unified Medical Language System Metathesaurus (UMLS)
Medical Entity Recognition — Overview
- Abacha and Zweigenbaum: medical entity recognition consists of two parts
  - Detecting phrases that refer to medical entities
  - Assigning semantic categories to the detected entities
Medical Entity Recognition — Overview
Example of one concept with several surface forms: Diabetes type 1, Type 1 diabetes, IDDM, Juvenile diabetes, T1D
Medical Entity Recognition — Noun Phrase Chunking
Pharmacodynamic studies, including positron-emission tomography (PET) and computed tomography (CT) […]
- Many tools for NP chunking are available
- Maximum recall is desired
- Open-domain tools like IMS' TreeTagger are suitable
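As a rough illustration of NP chunking, the sketch below groups already POS-tagged tokens into noun phrases. The tag pattern (optional determiner and adjectives followed by nouns), the Penn Treebank-style tags and the function name `chunk_noun_phrases` are assumptions made for this example; this is not how TreeTagger chunks internally.

```python
def chunk_noun_phrases(tagged):
    """Group (word, tag) pairs into noun phrases using the assumed pattern DT/JJ* NN+."""
    chunks, current, has_noun = [], [], False

    def flush():
        # Close the current candidate; keep it only if it contains a noun.
        nonlocal current, has_noun
        if has_noun:
            chunks.append(" ".join(current))
        current, has_noun = [], False

    for word, tag in tagged:
        if tag.startswith("NN"):
            current.append(word)
            has_noun = True
        elif tag in ("DT", "JJ") and not has_noun:
            current.append(word)
        else:
            flush()
            if tag in ("DT", "JJ"):  # a modifier after a noun starts a new phrase
                current.append(word)
    flush()
    return chunks


# Hand-assigned tags for the example sentence (an assumption, not tagger output).
tagged_example = [
    ("Pharmacodynamic", "JJ"), ("studies", "NNS"), (",", ","), ("including", "VBG"),
    ("positron-emission", "NN"), ("tomography", "NN"), ("(", "("), ("PET", "NN"),
    (")", ")"), ("and", "CC"), ("computed", "JJ"), ("tomography", "NN"),
]
print(chunk_noun_phrases(tagged_example))
```

A permissive pattern like this favours recall, which matches the goal stated above: later steps can still filter out phrases that are not medical entities.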
Medical Entity Recognition — MetaMap and the UMLS
- MetaMap is a tool which maps noun phrases in raw text to UMLS concepts
- This is done according to a matching score
Medical Entity Recognition — MetaMap and the UMLS
- Three problems with MetaMap
  - Noun phrase chunking performs worse than with specialized NLP tools
  - Medical entity detection often returns verbs and general words that aren't medical entities (MEs)
  - Some ambiguity is left
    - UMLS can provide several concepts for a term
    - and several semantic categories for a concept
Medical Entity Recognition — MetaMap and the UMLS
Pharmacodynamic studies, including positron-emission tomography (PET) and computed tomography (CT) […]
Example of ambiguous UMLS terms and their concepts:
- Cold (term): Cold temperature, Common cold, Chronic obstructive lung disease
- Cold storage (term): Cold storage
Medical Entity Recognition — MetaMap+
- Use tools like TreeTagger for NP chunking
- Filter NPs with a stop-word list
- Search specialized lists for candidate terms
- Annotate entities with MetaMap
- Filter frequent errors and overly broad semantic types
Medical Entity Recognition — MetaMap+
- Voting mechanism to disambiguate semantic categories
Medical Entity Recognition — Support Vector Machines (SVMs)
- Word level features:
  - words of the NP
  - number of words of the NP
  - window of words around the NP
- Orthographical features:
  - first letter capitalized
  - all letters upper-/lowercase
  - contains abbreviation(s)
- POS tags
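The feature groups above could be collected for one noun phrase roughly as follows. This is a sketch under assumptions: the function name `np_features`, the feature names and the abbreviation heuristic (a short all-uppercase token) are invented for illustration, and a real SVM would still need these values encoded as numeric vectors.

```python
def np_features(tokens, start, end, pos_tags, window=2):
    """Feature dict for the noun phrase tokens[start:end] (end exclusive)."""
    words = tokens[start:end]
    text = " ".join(words)
    letters = [ch for ch in text if ch.isalpha()]
    return {
        # Word level features
        "words": tuple(w.lower() for w in words),
        "num_words": len(words),
        "left_context": tuple(tokens[max(0, start - window):start]),
        "right_context": tuple(tokens[end:end + window]),
        # Orthographical features
        "first_capitalized": text[:1].isupper(),
        "all_upper": bool(letters) and all(ch.isupper() for ch in letters),
        "all_lower": bool(letters) and all(ch.islower() for ch in letters),
        # Assumed heuristic: a short all-uppercase token counts as an abbreviation.
        "has_abbreviation": any(w.isupper() and 2 <= len(w) <= 5 for w in words),
        # POS tags of the phrase
        "pos_tags": tuple(pos_tags[start:end]),
    }


tokens = ["including", "computed", "tomography", "(", "CT", ")"]
pos_tags = ["VBG", "JJ", "NN", "(", "NN", ")"]  # hand-assigned for the example
print(np_features(tokens, 1, 3, pos_tags))
```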
Medical Entity Recognition — BIO-CRFs
Pharmacodynamic studies, including positron-emission tomography (PET) and computed tomography (CT) […]
- Words are annotated with the tags B, I and O
  - B-x: Beginning of a phrase of class x
  - I-x: Intermediate part of a phrase of class x
  - O: Outside of any entity
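The BIO scheme can be sketched as a conversion from annotated entity spans to one tag per token. The function name `to_bio` and the span format (token offsets with exclusive end) are assumptions for this example, and the label `test` on the span is an illustrative annotation rather than one taken from the corpus.

```python
def to_bio(tokens, spans):
    """Convert (start, end, label) token spans (end exclusive) to BIO tags."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the phrase
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the phrase
    return tags


tokens = ["Pharmacodynamic", "studies", ",", "including",
          "positron-emission", "tomography"]
print(list(zip(tokens, to_bio(tokens, [(4, 6, "test")]))))
```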
Medical Entity Recognition — BIO-CRFs
- Word level features:
  - the word itself
  - window of words
  - lemmas
- Orthographical features:
  - upper-/lowercase
  - contains a digit
  - prefixes and suffixes
- POS tags
- (Semantic category of the word, provided by MetaMap+)
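For a sequence labeller, the features above are computed per token rather than per phrase. The following is a minimal sketch; the function name `token_features`, the feature keys, the window size and the affix length of three characters are all assumptions, and the optional MetaMap+ category is left out.

```python
def token_features(tokens, i, pos_tags, lemmas, window=1, affix_len=3):
    """Feature dict for tokens[i], in the key/value style many CRF toolkits accept."""
    word = tokens[i]
    feats = {
        "word": word.lower(),
        "lemma": lemmas[i],
        "pos": pos_tags[i],
        "is_upper": word.isupper(),
        "is_lower": word.islower(),
        "has_digit": any(ch.isdigit() for ch in word),
        "prefix": word[:affix_len].lower(),
        "suffix": word[-affix_len:].lower(),
    }
    # Window of surrounding words, keyed by relative position such as word[-1].
    for offset in range(-window, window + 1):
        j = i + offset
        if offset != 0 and 0 <= j < len(tokens):
            feats[f"word[{offset:+d}]"] = tokens[j].lower()
    return feats


tokens = ["computed", "tomography", "(", "CT", ")"]
pos_tags = ["JJ", "NN", "(", "NN", ")"]          # hand-assigned for the example
lemmas = ["compute", "tomography", "(", "CT", ")"]
print(token_features(tokens, 3, pos_tags, lemmas))
```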
Medical Entity Recognition — Evaluation
- Corpus contains discharge summaries and progress notes
- De-identified and annotated by hand
- Entities: Problem, Treatment and Test
- Overall 76,665 sentences
| Setting | Precision | Recall | F-Score |
| --- | --- | --- | --- |
| MetaMap | 15.52 | 16.10 | 15.80 |
| MetaMap+ | 48.68 | 56.46 | 52.28 |
| SVM | 43.65 | 47.16 | 45.33 |
| BIO-CRF | 70.15 | 83.31 | 76.17 |
| BIO-CRF-Hybrid | 72.18 | 83.78 | 77.55 |
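The F-Score column is consistent with the balanced F1 measure, the harmonic mean of precision and recall. A short sketch for recomputing it from the table follows; the function name `f_score` is an assumption, and recomputed values can differ from the table in the last digit because the published precision and recall are already rounded.

```python
def f_score(precision, recall):
    """Balanced F-score (harmonic mean of precision and recall)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# BIO-CRF row of the table: precision 70.15, recall 83.31
print(round(f_score(70.15, 83.31), 2))
```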
Anonymization of Medical Reports
Anonymization of Medical Reports — What is anonymization?
- De-identification
  - Completely remove all personal health information (PHI)
- Anonymization
  - Identify and classify personal information in the documents
  - Do not delete it but replace it
Anonymization of Medical Reports — Anonymization vs De-Identification
- The documents remain (easily) readable
- It is unclear whether PHI remaining in a document is original or an arbitrary replacement
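The replacement step can be sketched as follows, assuming the PHI spans have already been detected and classified. The surrogate lists, the span format (character offsets plus category) and the function name `anonymize` are hypothetical choices for this example.

```python
import random

# Hypothetical surrogate values per PHI category.
SURROGATES = {
    "DOCTOR": ["Dr. Smith", "Dr. Jones"],
    "HOSPITAL": ["General Hospital", "City Clinic"],
}


def anonymize(text, phi_spans, rng=None):
    """Replace each (start, end, category) character span with a random surrogate."""
    rng = rng or random.Random(0)
    out, last = [], 0
    for start, end, category in sorted(phi_spans):
        out.append(text[last:start])                   # untouched text before the span
        out.append(rng.choice(SURROGATES[category]))   # surrogate instead of deletion
        last = end
    out.append(text[last:])
    return "".join(out)


text = "Seen by Dr. Miller at St. Mary Hospital."
doctor, hospital = "Dr. Miller", "St. Mary Hospital"
spans = [
    (text.index(doctor), text.index(doctor) + len(doctor), "DOCTOR"),
    (text.index(hospital), text.index(hospital) + len(hospital), "HOSPITAL"),
]
print(anonymize(text, spans))
```

Because every detected span is replaced by a plausible value, a reader cannot tell whether an undetected name is real or a surrogate.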
Anonymization of Medical Reports — The features
- Use Named Entity Recognition to detect the PHI
- No "deep knowledge information" such as syntactic information (POS tags) or domain-specific resources
- Orthographical features, frequency and phrasal information, dictionaries and contextual information
- Trigger words within a certain window indicate PHI
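A trigger-word feature could look like the sketch below. The trigger list, the window size and the function name `trigger_in_window` are assumptions for illustration; the source does not list the actual trigger words.

```python
TRIGGERS = {"dr", "dr.", "md", "hospital", "clinic"}  # hypothetical trigger words


def trigger_in_window(tokens, i, window=2, triggers=TRIGGERS):
    """True if a trigger word occurs within `window` tokens of position i."""
    lo = max(0, i - window)
    hi = min(len(tokens), i + window + 1)
    return any(tokens[j].lower() in triggers for j in range(lo, hi) if j != i)


tokens = "Seen by Dr. Miller yesterday".split()
print(trigger_in_window(tokens, 3))  # "Miller" directly follows the trigger "Dr."
```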
Anonymization of Medical Reports — Machine learning techniques
- Use of decision trees for NER
- C4.5 for building the tree
- Boosting for improving the performance of the decision tree
Anonymization of Medical Reports — Boosting and C4.5
- C4.5 is used for building the decision tree
  - The tree doesn't have to be binary
  - The data is split according to the information gain
- Boosting combines multiple (weak) classifiers into one strong classifier
  - The decision of each (weak) classifier is weighted
  - Training a Boosting classifier means finding these weights
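Both ideas can be sketched in a few lines: the information gain of a multi-way split on a categorical feature, and a weighted vote over weak classifiers. The function names and the toy data are assumptions. Note that C4.5 normally normalizes the gain into a gain ratio, and that boosting algorithms such as AdaBoost derive the weights from each classifier's training error; neither refinement is shown here.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(n / total * log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels):
    """Entropy reduction from splitting on one categorical feature (one branch per value)."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def boosted_vote(classifiers, weights, x):
    """Weighted vote of weak classifiers that each return +1 or -1."""
    score = sum(w * clf(x) for clf, w in zip(classifiers, weights))
    return 1 if score >= 0 else -1


labels = ["PHI", "PHI", "O", "O"]
print(information_gain(["cap", "cap", "low", "low"], labels))  # perfectly separating feature
print(information_gain(["a", "b", "a", "b"], labels))          # uninformative feature
```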
Anonymization of Medical Reports — Training
- Trusted entities are found in the first training phase
- The models are then trained iteratively with new features
Anonymization of Medical Reports — Performance of the system
- Different models and two baselines were tested
  - Majority baseline that only predicts non-PHI
  - C4.5 decision tree without Boosting but with domain-specific extensions
  - Iteratively trained models
| Model | Evaluation: 9-way F | Evaluation: 8-way P/R/F | Evaluation standardized: 9-way F | Evaluation standardized: 8-way P/R/F |
| --- | --- | --- | --- | --- |
| Majority | 94.29 | 0.00 | 94.29 | 0.00 |
| C4.5 | 99.46 | 97.92/92.12/94.93 | 99.52 | 97.92/93.19/95.49 |
| ITR1_BEST | 99.61 | 98.92/93.97/96.38 | 99.74 | 98.47/96.04/97.42 |
| ITR1_VOTE | 99.64 | 98.99/94.35/96.61 | 99.75 | 98.79/96.41/97.58 |
| ITR2_BEST | 99.65 | 98.79/94.72/96.71 | 99.75 | 98.81/96.39/97.58 |
| ITR2_VOTE | 99.65 | 98.79/94.73/96.71 | 99.75 | 98.89/96.42/97.64 |
Anonymization of Medical Reports — Analysis of the feature set
- The dictionaries were useless
- Basic features were the most important
- Orthographical and frequency information were the 2nd and 3rd most important
- Sentence position and quotation marks/brackets didn't help
Anonymization of Medical Reports — Errors
| Error | Percentage of errors |
| --- | --- |
| DOCTOR | 23 % |
| LOCATION | 20 % |
| HOSPITAL | 16 % |
| DATE | 10 % |
Knowledge-based Biomedical Word Sense Disambiguation — Ambiguity
- We already know genes, proteins and diseases are ambiguous
- Biomedical texts also contain some form of standard language
- Building a manually annotated corpus for statistical WSD is expensive
Knowledge-based Biomedical WSD — Machine Readable Dictionaries
- Build a vector for each concept the word represents
- Build a vector for the context
- Cosine similarity to select the most similar concept
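$$\mathrm{MRD}(c) = \arg\max_{c \in C_w} \frac{c \cdot c_x}{|c| \cdot |c_x|}$$
Here $C_w$ is the set of concept vectors of the ambiguous word $w$ and $c_x$ is the context vector.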
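The three steps above can be sketched with bag-of-words vectors. The concept definitions below are invented for the example (a real system would build the concept vectors from UMLS definitions and related text), and the function names are assumptions.

```python
from collections import Counter
from math import sqrt


def cosine(a, b):
    """Cosine similarity of two sparse bag-of-words vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def disambiguate(context_words, concept_definitions):
    """Return the concept whose definition vector is most similar to the context vector."""
    context = Counter(context_words)
    return max(concept_definitions,
               key=lambda c: cosine(Counter(concept_definitions[c]), context))


# Hypothetical definitions for two concepts of the ambiguous term "cold".
definitions = {
    "Common cold": "viral infection of the nose and throat".split(),
    "Cold temperature": "low temperature of the environment".split(),
}
context = "patient has a viral infection with sore throat".split()
print(disambiguate(context, definitions))
```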
Knowledge-based Biomedical WSD — PageRank
- Idea: Use the topology of the resource network
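One way to exploit the network topology is personalized PageRank in the spirit of Agirre and Soroa: the random walk restarts at the concepts of the context words, and the candidate sense with the highest rank wins. The sketch below uses plain power iteration on a tiny invented graph; the node names, the damping factor and the iteration count are assumptions, and every node is assumed to have at least one neighbour.

```python
def personalized_pagerank(graph, seeds, damping=0.85, iterations=50):
    """graph: node -> list of neighbours; seeds: nodes that receive the restart mass."""
    nodes = list(graph)
    teleport = {n: (1 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(teleport)
    for _ in range(iterations):
        new = {n: (1 - damping) * teleport[n] for n in nodes}
        for n in nodes:
            share = damping * rank[n] / len(graph[n])  # spread rank over neighbours
            for m in graph[n]:
                new[m] += share
        rank = new
    return rank


# Hypothetical concept graph around two senses of "implantation".
graph = {
    "implantation_natural": ["blastocyst", "pregnancy"],
    "implantation_procedure": ["surgery"],
    "blastocyst": ["implantation_natural", "pregnancy"],
    "pregnancy": ["implantation_natural", "blastocyst", "surgery"],
    "surgery": ["implantation_procedure", "pregnancy"],
}
rank = personalized_pagerank(graph, seeds={"blastocyst", "pregnancy"})
senses = ["implantation_natural", "implantation_procedure"]
print(max(senses, key=rank.get))
```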
Knowledge-based Biomedical WSD — Automatic Corpus Extraction
- Idea: Use one really big corpus (e.g. MEDLINE) and extract multiple corpora for training
- Get monosemous relatives of an ambiguous term
- Use the result to train a statistical model
"Surgical repair" OR ("repair" AND ["Corneal Transplantation" OR "Corneal Transplantations" OR "Corneal Graftings" OR "Corneal Grafting" OR "Cornea Transplantation" OR "Repair of the Middle Ear"])
- Monosemous synonyms: "Surgical repair"
- Ambiguous term: "repair"
- Monosemous terms from related concepts: the terms in square brackets
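A query of this shape can be assembled mechanically from the three ingredients. The sketch below only mirrors the bracket notation used in the example above; the function name `build_query` is an assumption, and the output is not checked against the query syntax of any real search interface.

```python
def build_query(ambiguous_term, synonyms, related_terms):
    """Combine monosemous synonyms with the ambiguous term restricted by related terms."""
    def quote(term):
        return f'"{term}"'

    parts = [quote(s) for s in synonyms]
    if related_terms:
        related = " OR ".join(quote(t) for t in related_terms)
        parts.append(f"({quote(ambiguous_term)} AND [{related}])")
    return " OR ".join(parts)


print(build_query("repair", ["Surgical repair"],
                  ["Corneal Transplantation", "Corneal Grafting"]))
```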
Knowledge-based Biomedical WSD — Journal Descriptor Indexing
- Idea: Use semantic types for disambiguation
In the mouse, the process of implantation is initiated by the attachment reaction between the blastocyst trophectoderm and uterine luminal epithelium that occurs at 2200–2300 h on day 4 (day 1 = vaginal plug) of pregnancy.
Candidate senses of "implantation":
- 1000 Implantation <1> (Blastocyst Implantation, natural) [Organism Function]
- 1000 Implantation <2> (Implantation procedure) [Therapeutic or Preventive Procedure]
Knowledge-based Biomedical WSD — ST vector for implantation
| Rank | Semantic Type abbreviation | Semantic Type | Score |
| --- | --- | --- | --- |
| 57 | aapp | Amino Acid, Peptide, or Protein | 0.3373 |
| 5 | diap | Diagnostic Procedure | 0.6637 |
| 39 | emst | Embryonic Structure | 0.4168 |
| 13 | orgf | Organism Function | 0.6013 |
| 1 | spco | Spatial Concept | 0.7027 |
| 2 | topp | Therapeutic or Preventive Procedure | 0.6937 |
| 108 | vtbt | Vertebrate | 0.1748 |
Knowledge-based Biomedical WSD — ST vector for blastocyst implantation
| Rank | Semantic Type abbreviation | Semantic Type | Score |
| --- | --- | --- | --- |
| 1 | orgf | Organism Function | 0.5506 |
| 4 | emst | Embryonic Structure | 0.5132 |
| 12 | spco | Spatial Concept | 0.4340 |
| 13 | topp | Therapeutic or Preventive Procedure | 0.4316 |
| 16 | diap | Diagnostic Procedure | 0.4182 |
| 45 | aapp | Amino Acid, Peptide, or Protein | 0.2766 |
| 92 | vtbt | Vertebrate | 0.1746 |
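Given such a semantic type (ST) vector, the sense can be chosen by comparing the scores of the candidates' semantic types. The sketch below uses the scores from the two tables; the function name `select_sense` and the dictionary layout are assumptions.

```python
def select_sense(candidates, st_scores):
    """candidates: sense -> ST abbreviation; st_scores: ST abbreviation -> score."""
    return max(candidates, key=lambda sense: st_scores.get(candidates[sense], 0.0))


candidates = {
    "Blastocyst Implantation, natural": "orgf",
    "Implantation procedure": "topp",
}
# Scores taken from the ST vectors for "implantation" and "blastocyst implantation".
scores_word_only = {"orgf": 0.6013, "topp": 0.6937}
scores_with_context = {"orgf": 0.5506, "topp": 0.4316}

print(select_sense(candidates, scores_word_only))
print(select_sense(candidates, scores_with_context))
```

With the word alone the procedure sense scores higher; adding "blastocyst" to the context shifts the choice to the natural sense.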
Knowledge-based Biomedical WSD — Evaluation
| Method | Accuracy all | Accuracy JDI set |
| --- | --- | --- |
| Machine Readable Dictionary | 0.639 | 0.653 |
| PageRank | 0.583 | 0.587 |
| Automatic Corpus Extraction | 0.683 | 0.693 |
| Journal Descriptor Indexing | - | 0.748 |
| Linear Combination | 0.762 | 0.779 |
| Combined Voting | 0.760 | 0.774 |
| Maximum Frequency Sense | 0.855 | 0.867 |
| Naive Bayes | 0.883 | 0.906 |
Extraction of Potential Adverse Drug Events — Where to get the data from?
- ADE corpus
  - Contains 2972 MEDLINE case reports
  - Relations are annotated only at sentence level
  - Drugs and conditions that are not part of a relation were not annotated
Extraction of Potential Adverse Drug Events — Where to get the data from?
- ADE-EXT corpus
  - NER with dictionaries for identification of drugs and conditions
  - Every drug-condition pair not previously annotated as a relation forms a False relation
  - Overall 5969 False relations
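The construction of False relations can be sketched per sentence: pair every detected drug with every detected condition and keep the pairs that were not annotated. The function name `false_relations` and the example annotation are assumptions for illustration.

```python
from itertools import product


def false_relations(drugs, conditions, true_relations):
    """Drug-condition pairs of one sentence that are not annotated as a relation."""
    return [(d, c) for d, c in product(drugs, conditions)
            if (d, c) not in true_relations]


# Hypothetical annotation: only fentanyl is marked as causing the condition.
drugs = ["citalopram", "fentanyl"]
conditions = ["serotonin syndrome"]
annotated = {("fentanyl", "serotonin syndrome")}
print(false_relations(drugs, conditions, annotated))
```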
Extraction of Potential Adverse Drug Events — Drug-Cause-Condition
Extraction of Potential Adverse Drug Events — How the system was built
- Nested annotations such as "acute lithium toxicity" were removed
- The resulting corpus was divided into a training and a test set (90:10 split)
Extraction of Potential Adverse Drug Events — How the system was built
- Tokens were enriched with:
  - POS tags, lemmas and named entity flags
Extraction of Potential Adverse Drug Events — Evaluation
- F-Score of 0.87 over the ADE-TRAIN-EXT corpus with 10-fold cross-validation
- Errors are typically caused by missing context and by distantly co-occurring, inter-related entities
- Example: "A 65-year-old patient chronically treated with the selective serotonin reuptake inhibitor (SSRI) citalopram developed confusion, agitation, tachycardia, tremors, myoclonic jerks and unsteady gait, consistent with serotonin syndrome, following initiation of fentanyl, and all symptoms and signs resolved following discontinuation of fentanyl"
Extraction of Potential Adverse Drug Events — Training size
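| #Documents | F-Score (mean) | F-Score (standard deviation) |
| --- | --- | --- |
| 10 | 0.55 | 0.38 |
| 20 | 0.64 | 0.37 |
| 50 | 0.82 | 0.09 |
| 100 | 0.78 | 0.04 |
| 200 | 0.84 | 0.04 |
| 500 | 0.84 | 0.02 |
| 1000 | 0.86 | 0.01 |
| 2000 | 0.87 | 0.01 |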
Take-away
- Different approaches for detecting medical entities
- What anonymization is and how to do it
- Disambiguation of biomedical terms
- What adverse drug events are and how to extract them
Resources
- Gurulingappa, Mateen-Rajput and Toldo, 2012: Extraction of potential adverse drug events from medical case reports
- Uzuner, South, Shen and DuVall, 2010: i2b2/VA challenge on concepts, assertions, and relations in clinical text
- Humphrey, Rogers, Kilicoglu, Demner-Fushman and Rindflesch, 2006: Word Sense Disambiguation by Selecting the Best Semantic Type Based on Journal Descriptor Indexing: Preliminary Experiment
- Szarvas, Farkas and Busa-Fekete, 2007: State-of-the-art Anonymization of Medical Records Using an Iterative Machine Learning Framework
- Jimeno-Yepes and Aronson, 2010: Knowledge-based biomedical word sense disambiguation: comparison of approaches
- Agirre and Soroa, 2009: Personalizing PageRank for Word Sense Disambiguation
- Abacha and Zweigenbaum, 2011: Medical Entity Recognition: A Comparison of Semantic and Statistical Methods