and Coding of Cancer Pathology Reports Jon Patrick PhD, DipLSurv, - PowerPoint PPT Presentation

Automated Case Identification and Coding of Cancer Pathology Reports
Jon Patrick PhD, DipLSurv, MSc, BSc(Psych), Grad Dip Psych, FACHI, FACS, MAMIA, CEO

Awarded 7th June 2018

The Many Faces of NLP?
• Text Mining – rules, regular expressions, bag of words
  – Deterministic: cannot find anything that has not been defined in the rules
  – Strong on positives but typically over-generalises. No ability to find unseen cases.
• Real NLP – the field of computing the structure of language, now known as Computational Linguistics
• Statistical NLP (SNLP) – NLP plus machine learning from examples, which can then generalise
  – Non-deterministic: finds what it has not seen
• Language Engineering – building production-grade SNLP solutions
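The limits of the rule-based approach can be illustrated with a minimal Python sketch. The term list and the sample sentences are invented for illustration and are not taken from the project; the point is only that a fixed rule fires on anything it lists (including negated mentions) and on nothing else.

```python
import re

# Hypothetical rule set: a hand-written list of cancer terms (illustrative only).
CANCER_TERMS = ["carcinoma", "melanoma", "lymphoma"]
RULE = re.compile(r"\b(?:" + "|".join(CANCER_TERMS) + r")\b", re.IGNORECASE)

def rule_based_match(text: str) -> bool:
    """Return True only if the text contains a term listed in the rules."""
    return RULE.search(text) is not None

seen = "Invasive ductal carcinoma, grade 2."
negated = "No evidence of carcinoma."
unseen = "Features consistent with a malignant spindle cell neoplasm."

print(rule_based_match(seen))     # a listed term is present
print(rule_based_match(negated))  # the rule also fires here: over-generalisation
print(rule_based_match(unseen))   # no listed term, so the rule cannot fire
```

A statistical model trained on annotated examples is not bound to a fixed term list in this way, which is the contrast the slide draws.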

California Cancer Registry Problem
• 500,000 documents per annum, potentially 1 million in the coming years
• 50% unwanted and need to be filtered out
• Separation of non-reportable and reportable cancers
• Coding the reportables

California CR Project Objectives – Analysis of Histopathology Reports
• Develop an automated service to:
• Determine Reportability
• Codify 5 attributes:
  – Site
  – Histology
  – Grade
  – Behaviour
  – Laterality

The Language Engineering Issues
• How well can we do it – efficiency
• With what volume of training materials – 5000
• In what amount of time – 15 months
• To what accuracy – 90%
• For what cost – not enough
• The SNLP is our infrastructure; what has to be engineered is the language modelling using machine learning and the coding to the client deliverables.

Language Engineering Overview of Tasks
• Pathology Reports Classifier
  – In scope: Histopathology
  – Out of scope: Immunohistochemistry & Genetics
• Case Identification – Reportability Classifier
• Clinical Concept Recognizer
• Coding Inference Engine

Two Training Corpora needed to create a Gold Standard for learning
• 5000 Reportables in 10 Batches
• 5000 Non-Reportables
• From 50+ laboratories
• Covering
  – 133 Histology codes
  – 140 Site codes
• Subsequently, 212 Reportables found in the Non-Reportables corpus were transferred to Batch 11

Pathology Report Type Classifier
Class            TP    FP  FN  P      R      F-Score
Histopathology   4373  57  29  98.71  99.34  99.03
Other            653   29  57  95.75  91.97  93.82
OVERALL          5026  86  86  98.32  98.32  98.32
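The P, R and F-Score columns follow from the TP, FP and FN counts using the standard definitions of precision, recall and their harmonic mean. A minimal Python sketch, assuming those standard definitions (the function name `prf` is chosen here for illustration):

```python
def prf(tp: int, fp: int, fn: int) -> tuple:
    """Return (precision, recall, F-score) as percentages rounded to 2 places."""
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of true positives that were found
    f_score = 2 * precision * recall / (precision + recall)
    return tuple(round(100 * v, 2) for v in (precision, recall, f_score))

# Counts taken from the table above.
rows = {
    "Histopathology": (4373, 57, 29),
    "Other": (653, 29, 57),
    "OVERALL": (5026, 86, 86),
}
for name, counts in rows.items():
    print(name, prf(*counts))
```

Run on the three rows, this reproduces the percentages shown in the table.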

Reportability Results for Reportability Corpus
TP    FP  FN   TN    P      R      F
3510  27  111  1160  99.24  96.93  98.10
Interpreted as 0.76% FP and 3.07% FN (loss of cancer reports to the non-cancer class). Acceptable, but pride would want us to do better.
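The two percentages in the interpretation are the complements of precision and recall, that is FP / (TP + FP) and FN / (TP + FN), rather than rates computed against TN. A short Python sketch of that arithmetic, using the counts above (variable names are illustrative):

```python
# Confusion counts from the reportability results above.
tp, fp, fn, tn = 3510, 27, 111, 1160

precision = 100 * tp / (tp + fp)
recall = 100 * tp / (tp + fn)

# Share of reports classified as reportable that are not reportable.
fp_share = 100 - precision
# Share of truly reportable reports lost to the non-cancer class.
fn_share = 100 - recall

print(round(precision, 2), round(recall, 2))
print(round(fp_share, 2), round(fn_share, 2))
```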

Finding Clinical Concepts: Annotating and Tagging
• Design a schema of semantic tags
• Manually annotate the training corpus with the tags
• Build a machine learning Language Model with the training corpus to recognise the tags
• Check the consistency of the annotations with the model
• Iteratively correct the annotations and the structure of the model to improve it

Language Model Accuracy – Run 29B & Run 41 – Sample of 34/32 Tags
                 2017                                           2018
Tag              TP      FP   FN   P      R      F      N       F      N
Site             18072   50   59   99.72  99.67  99.70  18131   99.87  21943
Histology        11172   7    87   99.94  99.23  99.58  11259   99.78  13784
Behaviour        8247    5    27   99.94  99.67  99.81  8274    99.91  9603
Grade            897     1    19   99.89  97.93  98.90  916     98.79  1335
Laterality       6351    10   20   99.84  99.69  99.76  6371    99.89  8000
Total (34 tags)  115550  226  921  99.80  99.21  99.51  116471  99.71  141412
Tags >99%                          29     16     21             27
Tags >99.5%                        29     10     14             22
TOTAL tags                         34     34     34             32

Coding Reports to ICD-O-3 – Problem Definition
• Separate out each specimen.
• Clinical Concept Recognition: apply the Language Model to tag the correct concepts needed for coding.
• CODIFY: map the tags to the appropriate elements in the ICD-O-3 definitions for each coded attribute, including applying:
  – SEER multiple primaries rules, and
  – any other local rules.
• Evaluate each specimen for its cancer reportability.
• The Summary: select the specimens that are required to produce the correct case codes.
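The steps above can be sketched as a small pipeline. This is a toy illustration under stated assumptions, not the project's implementation: the lookup tables hold one or two entries each, the reportability test is a deliberate simplification, the SEER multiple primaries rules are not implemented, and every name (`SpecimenCode`, `code_specimen`, `summarise`) is hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Illustrative lookup tables only; a real system would use the full ICD-O-3 tables.
SITE_CODES = {"breast": "C50.9"}                       # breast, NOS
HISTOLOGY_CODES = {"infiltrating duct carcinoma": "8500"}
BEHAVIOUR_CODES = {"in situ": "2", "malignant": "3"}

@dataclass
class SpecimenCode:
    site: Optional[str]
    histology: Optional[str]
    behaviour: Optional[str]

    @property
    def reportable(self) -> bool:
        # Simplifying assumption: in situ (2) or malignant (3) counts as reportable.
        return self.behaviour in {"2", "3"}

def code_specimen(tags: Dict[str, str]) -> SpecimenCode:
    """Map recognised concept tags (tag name -> tagged text) to codes."""
    return SpecimenCode(
        site=SITE_CODES.get(tags.get("site", "")),
        histology=HISTOLOGY_CODES.get(tags.get("histology", "")),
        behaviour=BEHAVIOUR_CODES.get(tags.get("behaviour", "")),
    )

def summarise(specimens: List[SpecimenCode]) -> List[SpecimenCode]:
    """Keep only the specimens that contribute to the case codes."""
    return [s for s in specimens if s.reportable]

# One report already split into specimens and tagged (hypothetical tagger output).
tagged_specimens = [
    {"site": "breast", "histology": "infiltrating duct carcinoma", "behaviour": "malignant"},
    {"site": "breast"},  # no histology or behaviour found in this specimen
]
coded = [code_specimen(t) for t in tagged_specimens]
case = summarise(coded)
print(case)
```

Keeping the tag-to-code mapping separate from the reportability check mirrors the order of the steps on the slide, so local rules could be added at the mapping stage without touching the summary step.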

Coding Results Overview – All Reportables 2017.v1
Attribute             # of extractable records  # of correct coded  Accuracy
Site (4 digits)       3165                      3014                95.23%
hist_type (4 digits)  3165                      3050                96.37%
hist_grade            3165                      3126                98.77%
hist_behavior         3165                      3150                99.53%
laterality            3165                      3096                97.82%
TOTAL                 15825                     15436               487.72 (sum of the five accuracies)
Average                                                             97.54%
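The accuracy column and the average can be checked directly from the counts. A brief Python sketch of that arithmetic, using the figures above (variable names are illustrative):

```python
EXTRACTABLE = 3165  # extractable records per attribute, from the table above

correct = {
    "site": 3014,
    "hist_type": 3050,
    "hist_grade": 3126,
    "hist_behavior": 3150,
    "laterality": 3096,
}

# Per-attribute accuracy as a percentage.
accuracy = {name: round(100 * n / EXTRACTABLE, 2) for name, n in correct.items()}

total_correct = sum(correct.values())
total_records = EXTRACTABLE * len(correct)

# Pooled accuracy over all attribute codings, and the mean of the five accuracies.
pooled = round(100 * total_correct / total_records, 2)
mean_accuracy = round(sum(accuracy.values()) / len(accuracy), 2)

print(accuracy)
print(total_records, total_correct, pooled, mean_accuracy)
```

Because every attribute has the same number of extractable records, the pooled accuracy and the mean of the five accuracies coincide.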

FINAL Estimated Efficiency Gains
• 100% automated case finding (reportability)
• 72% automated coding at 94% overall accuracy, leaving 28% for manual coding
• Up to 90% automatic coding is possible
• Histology codes coverage: 93.9% of supplied examples
• Site codes coverage: 94.2% of supplied examples
• Code set coverage of automated coding: 97.5% of reports
• Reduce manual errors by 40-80%
• More improvements are being made

Epilogue – 2018 Revisions
• 187 Site codes – 35% increase
• 237 Histology codes – 85% increase
• 15% increase in reports
  – Drawn from audits, new samplings
• 2018.v1 is being tested by CCR
• 2017.v2 is released

The Foreseeable Future
• Increase accuracy for the more difficult complexity classes, which reduces manual processing
• Further investigate tumour stream classifications for improving accuracy
• Analyse Immunohistochemistry and Genetics reports
• Add more extraction and coding functions, e.g. biomarkers, recurrence, tumour size, margins, …
• Add more document types
  – Radiology, Nuclear Medicine
• For the CDPH – other health topics
  – First fractures of osteoarthritis, adherence to national guidelines for diagnostics
• Provide a Subscription Reportability Service for third parties

• END


