CLASSIFYING LABORATORY TEST RESULTS USING MACHINE LEARNING
Joy (Sizhe) Chen, Kenny Chiu, William Lu, Nilgoon Zarei
August 31, 2018

TEAM
Joy (Sizhe) Chen, Kenny Chiu, Nelly (Nilgoon) Zarei, William Lu

AGENDA
• Background
• Project Scope
• Dataset
• Machine Learning Approach and Results
• Symbolic Approach and Results
• Pipeline Architecture
• Future Work


BACKGROUND
Lab Result: semi-structured free-form text data from lab reports containing raw test results. Examples:
Specimen rejected | Test not performed. | No evidence of HCV infection.
No Bordetella pertussis DNA detected by PCR.
Result inconclusive. | Culture results to follow. | Varicella Zoster Virus | 'Isolated.'
'Organism identified as:' | Haemophilus influenzae | Biotype | | non serotypable (non encapsulated)

Manual classification process (expensive, slow)

Structured data used to analyze population-level disease trends:
Test Performed    Test Outcome     Organism Name
No                Negative         *Not Found
Yes               Negative         *Not Found
Yes               Indeterminate    *Not Found
Yes               Positive         Haemophilus influenzae

PROJECT SCOPE
Identify, implement, and test appropriate machine learning and natural language processing techniques for interpreting and labeling unstructured lab results.
Lab Result → ML / NLP → Label
Example lab result: "Influenza Type B RNA detected by RT-PCR."

DATASET
~1 million rows; ~360K usable rows after filtering out proficiency tests and purely numeric results
[Charts: "Test Performed?" (Yes / No) and "Test Outcome" (Positive / Negative / Indeterminate / Missing), each with its Labelled / Unlabelled split. Percentages shown on the slide: 32%, 68%, 100%, 6%, 17%, 13%, 1%, 69%, 94%]

DATASET
[Charts: Labelled / Unlabelled split for Organism Genus and Organism Name. Percentages shown on the slide: 11%, 89%]


DATASET
• Lab results may be incomplete sentences and may contain typographical errors
Example: BCCDC seretype: non froup 5 | Final | 12/Jun/2009 | Sputum | Streptococcus pneumoniae | STUDY
Example: Isolate not | Salmonella species
• Lab results may contain contradictory information
Example: TEST NOT PERFORMED | Galactomannan testing is valid only for Haematology and lung transplant patients with no recent antifungal exposure | Test performed at Provincial Laboratory of Public Health, Edmonton
Example: Organism identified as: | Neisseria meningitidis nongroupable | Upon further investigation | Organism identified as: | Moraxella osloensis | by 16S rRNA gene sequence analysis.

DATASET
• One organism may be positive, while another may be negative
Example: NEGATIVE for Shiga toxin stx1 and stx2 genes by PCR. | Isolate serotyped as: | Escherichia coli | not | O157:H7
• The full result description may mention many organisms that were not detected
Example: Rhinovirus or Enterovirus detected by multiplex NAT. | | Adenovirus detected by multiplex NAT. | | Multiplex NAT is capable of detecting Influenza A and B, Respiratory Syncytial Virus, Parainfluenza 1, 2, 3, and 4, Rhinovirus, Enterovirus, Adenovirus, Coronaviruses HKU1, NL63, OC43, and 229E, hMetapneumovirus, Bocavirus, C. pneumoniae, L. pneumophila, and M. pneumoniae. | | MULTIPLE INFECTION DETECTED

MACHINE LEARNING APPROACH
• Automatically learn patterns from existing categorized data to categorize new data
• Data is represented in terms of features
• The machine learning model has a number of parameters
• During training, old data is used to optimize the parameter values
• During classification, a computation on the new data's features and the optimized parameter values determines the classification
• Parameters are fitted to the training data, thus allowing the model to learn
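The training and classification steps above can be illustrated with a very small model. The sketch below uses a perceptron, which is swapped in purely for illustration (the slides report Random Forest, SVM, and Logistic Regression results); the toy rows and the function names `featurize`, `train`, and `predict` are hypothetical.

```python
from collections import Counter

def featurize(text):
    """Represent a lab result as lowercase word counts (its features)."""
    return Counter(text.lower().split())

def score(weights, bias, features):
    """Combine the features with the current parameter values."""
    return bias + sum(weights.get(word, 0.0) * count
                      for word, count in features.items())

def predict(weights, bias, text):
    """Classification: a computation on features and fitted parameters."""
    return "Positive" if score(weights, bias, featurize(text)) > 0 else "Negative"

def train(rows, epochs=10):
    """Training: adjust the parameters whenever a labelled row is misclassified."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for text, label in rows:
            if predict(weights, bias, text) != label:
                step = 1.0 if label == "Positive" else -1.0
                for word, count in featurize(text).items():
                    weights[word] = weights.get(word, 0.0) + step * count
                bias += step
    return weights, bias

# Hypothetical labelled rows standing in for the existing categorized data.
labelled_rows = [
    ("influenza rna detected", "Positive"),
    ("no dna detected", "Negative"),
    ("virus isolated", "Positive"),
    ("no evidence of infection", "Negative"),
]
weights, bias = train(labelled_rows)
print(predict(weights, bias, "virus detected"))
```

With only four rows the fitted parameters generalize poorly, which is one concrete way to see why the approach needs a large labelled dataset.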

RESULTS – BINARY TEST OUTCOME
• Started with the trivial case: binary Test Outcome (Positive / Negative)
• Bag-of-words: represent each document by a vector of integers that denote the number of times each unigram (single word) appears
• Simple and convenient, but loses word ordering information
Example: "Unable to differentiate between Streptococcus mitis and Streptococcus pneumoniae."
Unigram          Count
differentiate    1
identified       0
…                …
streptococcus    2
unable           1
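A minimal sketch of the bag-of-words representation, reproducing the counts in the example above. The tokenizer (lowercasing and splitting on non-alphanumeric characters) and the four-word vocabulary are assumptions for illustration; the slides do not state how the text was tokenized.

```python
import re
from collections import Counter

def bag_of_words(text, vocabulary):
    """Return one integer per vocabulary word: how often it occurs in text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

# Hypothetical vocabulary: only the unigrams shown on the slide.
vocabulary = ["differentiate", "identified", "streptococcus", "unable"]
text = ("Unable to differentiate between Streptococcus mitis "
        "and Streptococcus pneumoniae.")

vector = bag_of_words(text, vocabulary)
print(dict(zip(vocabulary, vector)))
# {'differentiate': 1, 'identified': 0, 'streptococcus': 2, 'unable': 1}
```

Because the vector records counts only, "Streptococcus mitis" and "mitis Streptococcus" would produce the same representation, which is the loss of word order noted above.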

RESULTS – BINARY TEST OUTCOME
RF (100 trees)    Predicted Positive    Predicted Negative    Recall
True Positive     3860                  41                    99%
True Negative     16                    2987                  99%
Precision         99%                   99%

SVM (Linear)      Predicted Positive    Predicted Negative    Recall
True Positive     3885                  16                    99%
True Negative     9                     2994                  99%
Precision         99%                   99%

RESULTS – BINARY TEST OUTCOME
[Figure: important unigrams for Negative and Positive, based on Logistic Regression weights]

RESULTS – BINARY TEST OUTCOME
[Figure: important bigrams for Test Outcome, as ranked by Random Forest]

RESULTS – 4 CLASS TEST OUTCOME


RESULTS – FEATURE SELECTION
• Remove unhelpful features to prevent overfitting and speed up training.
• For example, Test Outcome classifiers still do well with only 200 unigram features!

RESULTS – TEST PERFORMED
Support Vector Machine (Linear): 98% accuracy
SVM (Linear)    Predicted Yes    Predicted No    Recall
True Yes        67696            411             99%
True No         947              3475            79%
Precision       99%              89%
• Class imbalance caused the classifier to over-predict the majority class.
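The accuracy, recall, and precision figures follow directly from the confusion matrix counts. The sketch below recomputes them from the counts above; the function name `metrics` and the nested-list layout are illustrative choices, not part of the original work.

```python
def metrics(matrix, labels):
    """matrix[i][j] = rows whose true class is labels[i], predicted labels[j]."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(labels)))
    report = {"accuracy": correct / total}
    for i, label in enumerate(labels):
        true_count = sum(matrix[i])                      # row sum
        predicted_count = sum(row[i] for row in matrix)  # column sum
        report[label] = {
            "recall": matrix[i][i] / true_count,
            "precision": matrix[i][i] / predicted_count,
        }
    return report

# Counts from the linear SVM Test Performed confusion matrix above.
report = metrics([[67696, 411], [947, 3475]], ["Yes", "No"])

print(f"accuracy: {report['accuracy']:.1%}")
for label in ("Yes", "No"):
    print(f"{label}: recall {report[label]['recall']:.1%}, "
          f"precision {report[label]['precision']:.1%}")
```

Overall accuracy stays high even though recall on the minority class ("No") is much lower, because the majority class dominates the total.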

RESULTS – TEST PERFORMED
• Strategies to fix this:
• Down-sampling – in the training set, randomly throw out rows from the majority class until the classes are balanced.
• Disadvantage: throws out too much training data.
• Up-sampling – in the training set, randomly duplicate rows from the minority class until the classes are balanced.
• Disadvantage: takes too long to train.
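Both resampling strategies can be sketched in a few lines. This is a minimal illustration on hypothetical `(text, label)` rows, assuming uniform random sampling with a fixed seed; the slides do not describe the actual implementation.

```python
import random
from collections import Counter, defaultdict

def group_by_label(rows):
    """Collect (text, label) rows into one list per label."""
    groups = defaultdict(list)
    for text, label in rows:
        groups[label].append((text, label))
    return groups

def down_sample(rows, seed=0):
    """Randomly drop rows until every class matches the smallest class."""
    rng = random.Random(seed)
    groups = group_by_label(rows)
    target = min(len(group) for group in groups.values())
    sampled = [row for group in groups.values()
               for row in rng.sample(group, target)]
    rng.shuffle(sampled)
    return sampled

def up_sample(rows, seed=0):
    """Randomly duplicate rows until every class matches the largest class."""
    rng = random.Random(seed)
    groups = group_by_label(rows)
    target = max(len(group) for group in groups.values())
    sampled = []
    for group in groups.values():
        sampled.extend(group)
        sampled.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(sampled)
    return sampled

# Hypothetical imbalanced training set: 9 "Yes" rows, 1 "No" row.
train_rows = [(f"result {i}", "Yes") for i in range(9)] + [("test not performed", "No")]
down = down_sample(train_rows)
up = up_sample(train_rows)
print(Counter(label for _, label in down))
print(Counter(label for _, label in up))
```

The two outputs make the trade-off concrete: down-sampling keeps only 2 of the 10 rows, while up-sampling grows the set to 18 rows.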

RESULTS – TEST PERFORMED
• Class reweighting – during training, penalize the classifier more for misclassifying minority rows.
Support Vector Machine (Linear): 98% accuracy
SVM (Linear)    Predicted Yes    Predicted No    Recall
True Yes        66355            1800            97%
True No         429              3945            90%
Precision       99%              69%
• Disadvantage: reduces false positives at the expense of false negatives. Taking Yes as the positive class, rows wrongly predicted Yes fall from 947 to 429, while rows wrongly predicted No rise from 411 to 1800.
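One common way to choose the per-class penalties is to weight each class inversely to its frequency. The sketch below uses the formula n_samples / (n_classes * class_count), which scikit-learn documents for `class_weight="balanced"`; whether the original work used this exact weighting is an assumption, and the class counts are hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}

# Hypothetical training labels: 940 "Yes" rows and 60 "No" rows.
labels = ["Yes"] * 940 + ["No"] * 60
class_weights = balanced_class_weights(labels)
print(class_weights)

# With these weights, each class contributes the same total weight to the loss.
totals = {label: class_weights[label] * count
          for label, count in Counter(labels).items()}
print(totals)
```

Unlike up-sampling, this leaves the number of training rows unchanged; only the cost of each mistake differs by class.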

RESULTS – TEST PERFORMED
Add bigrams (pairs of consecutive words) and trigrams (triples of consecutive words) to the feature space to boost interpretability, at the cost of introducing overlapping, near-duplicate features.
Most important Test Performed features (ranked by Random Forest)
Unigrams only    Unigrams, bigrams, and trigrams
performed        missing
not              test not
test             test not performed
missing          not performed
routinely        performed
patient          not
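A short sketch of how unigram, bigram, and trigram features can be generated from one result string. Whitespace tokenization and the helper names `ngrams` and `ngram_features` are assumptions for illustration.

```python
def ngrams(tokens, n):
    """All runs of n consecutive tokens, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_features(text, max_n=3):
    """Unigrams, bigrams, ... up to max_n-grams of a lowercased text."""
    tokens = text.lower().split()
    features = []
    for n in range(1, max_n + 1):
        features.extend(ngrams(tokens, n))
    return features

features = ngram_features("Test not performed")
print(features)
# ['test', 'not', 'performed', 'test not', 'not performed', 'test not performed']
```

The three-word phrase yields six features that all encode overlapping evidence, which is why the ranked feature list contains near-duplicates such as "test not", "not performed", and "test not performed".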

SYMBOLIC APPROACH FOR ORGANISM NAME
• Problems with the machine learning approach:
• Data-hungry – there are not enough labelled rows for some organisms
• Can't find new organisms – there is no complete dictionary of organism names, so the approach must be able to find names that are not already in the database
• We must consider an alternative approach for classifying Organism Name.

MACHINE LEARNING VS. SYMBOLIC
Machine Learning
Description: automatically learn patterns from existing categorized data ("training set") to categorize new data ("test set")
Pros:
• Adapts to new coding styles
• More robust to typos and grammatical errors
Cons:
• Data hungry
• Long training time
• Requires domain knowledge

Symbolic
Description: tag each word by referring to a knowledge base, then apply domain rules to categorize data
Pros:
• More interpretable
• Can find labels that do not already exist in the database
Cons:
• Long tagging time
• Requires significant domain knowledge

METAMAP
MetaMap application: annotates text with UMLS Metathesaurus concepts
• e.g. Bacterium, Functional Concept, Finding
Example: NEGATIVE for Shiga toxin stx1 and stx2 genes by PCR. | Escherichia coli | not | O157:H7
Annotations shown on the slide: [Qualitative Concept] [Gene or Genome] [Bacterium] [Hazardous or Poisonous Substance, Organic Chemical] (Negation) [Functional Concept]
Usages:
1. Extract all recognized Bacterium and Virus concepts as microorganisms
2. Generalize classifiers by including UMLS concepts as classifier inputs

RESULTS – ORGANISM GENUS
• Training stage: construct a dictionary of all existing organisms in the database.
• We use a two-part algorithm for classifying the Organism Genus label.
• First, look at the Test Outcome classification.
• If Test Outcome is negative, Organism Genus is "*Not Found" by definition.
• Then, look at the list of organisms recognized by MetaMap:
• Pick the first organism that appears in the dictionary.
• Arbitrarily pick any organism if no organisms are in the dictionary.
• This approach achieves ~85% accuracy.
• Fails mostly on rows with lots of negative organisms.
Example: Rhinovirus or Enterovirus detected by multiplex NAT. | | Adenovirus detected by multiplex NAT. | | Multiplex NAT is capable of detecting Influenza A and B, Respiratory Syncytial Virus, Parainfluenza 1, 2, 3, and 4, Rhinovirus, Enterovirus, Adenovirus, Coronaviruses HKU1, NL63, OC43, and 229E, hMetapneumovirus, Bocavirus, C. pneumoniae, L. pneumophila, and M. pneumoniae. | | MULTIPLE INFECTION DETECTED
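The two-part rule can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the MetaMap step is represented only by a plain list of already-recognized organism names (no MetaMap API is called), the function names are hypothetical, the "arbitrary" pick is taken to be the first recognized organism, and returning "*Not Found" for an empty list is an added assumption.

```python
NOT_FOUND = "*Not Found"

def build_dictionary(labelled_genera):
    """Training stage: collect every organism already present in the database."""
    return {genus.lower() for genus in labelled_genera if genus != NOT_FOUND}

def classify_genus(test_outcome, recognized_organisms, dictionary):
    """Apply the two-part rule to one lab result."""
    # Part 1: a negative Test Outcome means no organism by definition.
    if test_outcome == "Negative":
        return NOT_FOUND
    # Assumption: nothing recognized is also treated as "*Not Found".
    if not recognized_organisms:
        return NOT_FOUND
    # Part 2: prefer the first recognized organism that is in the dictionary.
    for organism in recognized_organisms:
        if organism.lower() in dictionary:
            return organism
    # Otherwise pick arbitrarily; here, simply the first one recognized.
    return recognized_organisms[0]

# Hypothetical database labels and recognized organisms.
dictionary = build_dictionary(["Adenovirus", "Rhinovirus", NOT_FOUND])
recognized = ["Rhinovirus", "Enterovirus", "Adenovirus"]

print(classify_genus("Positive", recognized, dictionary))
print(classify_genus("Negative", recognized, dictionary))
print(classify_genus("Positive", ["Bocavirus"], dictionary))
```

The rule returns a single organism per result, so a row that lists many organisms the test could detect but did not find can easily yield the wrong one, consistent with the failure mode noted above.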
