SLIDE 1
Automated Case Identification and Coding of Cancer Pathology Reports
Jon Patrick PhD, DipLSurv, MSc, BSc(Psych), Grad Dip Psych, FACHI, FACS, MAMIA
CEO Awarded 7th June 2018
SLIDE 2
SLIDE 3
The Many Faces of NLP?
- Text Mining - rules, regular expressions, bag of words
– deterministic
– cannot find anything that hasn’t been defined in the rules
– strong on positives but typically over-generalises; no ability to find unseen cases
- Real NLP – the field of computing the structure of language - now known as Computational Linguistics
- Statistical NLP (SNLP) – NLP plus Machine Learning from examples, which can then generalise
– non-deterministic
– finds what it hasn’t seen
- Language Engineering – Building production-grade SNLP solutions
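The determinism of the rule-based approach in the first bullet can be illustrated with a minimal sketch. The term list and the function name are hypothetical, chosen only for illustration; they are not taken from the presentation.

```python
import re

# Hypothetical rule: a report "mentions cancer" only if it contains one of
# a fixed list of terms. Nothing outside the list can ever match.
CANCER_TERMS = re.compile(r"\b(carcinoma|melanoma|lymphoma)\b", re.IGNORECASE)


def rule_flags_cancer(text: str) -> bool:
    """Return True if any term defined in the rule occurs in the text."""
    return CANCER_TERMS.search(text) is not None


defined_case = "Invasive ductal carcinoma, grade 2."
unseen_case = "Findings consistent with sarcoma of the left thigh."

print(rule_flags_cancer(defined_case))  # term is in the rule
print(rule_flags_cancer(unseen_case))   # term was never defined in the rule
```

The second report is missed because "sarcoma" was never written into the rule, which is the "no ability to find unseen cases" limitation. A statistical model trained on examples is what the later slides rely on to generalise beyond such a list.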
SLIDE 4
California Cancer Registry Problem
- 500,000 documents per annum
– potentially 1 million in the coming years
- 50% unwanted and need to be filtered out
- Separation of non-reportable and reportable cancers
- Coding the reportables
SLIDE 5
California CR Project Objectives - Analysis of Histopathology Reports
- Develop an automated service to:
- Determine Reportability
- Codify 5 attributes:
– Site
– Histology
– Grade
– Behaviour
– Laterality
SLIDE 6
The Language Engineering Issues
- How well can we do it - efficiency
- With what volume of training materials - 5000
- In what amount of time – 15 months
- To what accuracy - 90%
- For what cost – not enough
- The SNLP is our infrastructure: it is the language modelling using machine learning, together with the coding to the client deliverables, that has to be engineered.
SLIDE 7
Language Engineering Overview of Tasks
- Pathology Reports Classifier
– In scope – Histopathology
– Out-of-scope – Immunohistochemistry & Genetics
- Case Identification – Reportability Classifier
- Clinical Concept Recognizer
- Coding Inference Engine
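The four tasks listed above can be read as a staged pipeline. The sketch below is one possible arrangement, assuming the stages run in the order of the slide and that out-of-scope or non-reportable reports exit early. All names (`PipelineResult`, `run_pipeline`, the stage callables) are hypothetical and stand in for the trained components.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class PipelineResult:
    report_type: str
    reportable: Optional[bool] = None          # None: never evaluated
    concepts: Dict[str, str] = field(default_factory=dict)
    codes: Dict[str, str] = field(default_factory=dict)


def run_pipeline(
    text: str,
    classify_type: Callable[[str], str],
    is_reportable: Callable[[str], bool],
    recognise_concepts: Callable[[str], Dict[str, str]],
    infer_codes: Callable[[Dict[str, str]], Dict[str, str]],
) -> PipelineResult:
    """Run the four stages in order, stopping as soon as a report drops out."""
    result = PipelineResult(report_type=classify_type(text))
    if result.report_type != "histopathology":
        return result                          # out of scope
    result.reportable = is_reportable(text)
    if not result.reportable:
        return result                          # filtered out
    result.concepts = recognise_concepts(text)
    result.codes = infer_codes(result.concepts)
    return result
```

Passing the stages in as callables keeps the sketch independent of how each classifier is actually built.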
SLIDE 8
Two Training Corpora needed to create a Gold Standard for learning
- 5000 Reportables in 10 Batches
- 5000 Non-Reportables
- From 50+ laboratories
- Covering
– 133 Histology codes
– 140 Site codes
- Subsequently - 212 Reportables in Non-Reportables corpus transferred to Batch 11
SLIDE 9
Pathology Report Type Classifier
Classes          TP     FP   FN   P       R       F-Score
Histopathology   4373   57   29   98.71   99.34   99.03
Other            653    29   57   95.75   91.97   93.82
OVERALL          5026   86   86   98.32   98.32   98.32
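The P, R and F-Score columns follow from the TP, FP and FN counts using the standard definitions of precision, recall and F1. A minimal sketch that recomputes them (the function name is illustrative):

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return precision, recall and F1 as percentages."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return 100 * precision, 100 * recall, 100 * f1


# Counts taken from the table above.
rows = {
    "Histopathology": (4373, 57, 29),
    "Other": (653, 29, 57),
    "OVERALL": (5026, 86, 86),
}

for name, (tp, fp, fn) in rows.items():
    p, r, f = precision_recall_f1(tp, fp, fn)
    print(f"{name:<15} P={p:.2f} R={r:.2f} F={f:.2f}")
```

In the OVERALL row FP and FN are both 86, since every report misclassified out of one class is misclassified into the other. That is why P, R and F coincide at 98.32.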
SLIDE 10
Reportability Results for Reportability Corpus
TP     FP   FN    TN     P       R       F
3510   27   111   1160   99.24   96.93   98.10
Interpreted as 0.76% FP and 3.07% FN (loss of cancer reports to the non-cancer class). Acceptable, but pride would want us to do better.
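The two percentages in the interpretation are the complements of precision and recall. The short check below recomputes them from the counts; the variable names are illustrative.

```python
# Counts from the reportability results above.
tp, fp, fn, tn = 3510, 27, 111, 1160

precision = tp / (tp + fp)
recall = tp / (tp + fn)

# Share of reports predicted reportable that are not (1 - precision).
fp_share = 100 * (1 - precision)
# Share of truly reportable reports that were missed (1 - recall).
fn_share = 100 * (1 - recall)

# For comparison: the classical false-positive rate uses TN in the denominator.
false_positive_rate = 100 * fp / (fp + tn)

print(f"FP share of predicted positives: {fp_share:.2f}%")
print(f"FN share of actual positives:    {fn_share:.2f}%")
print(f"False-positive rate FP/(FP+TN):  {false_positive_rate:.2f}%")
```

Note that the 0.76% figure is 1 − precision, not the false-positive rate FP/(FP+TN), which is larger on these counts because the negative class is much smaller.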
SLIDE 11
Finding Clinical Concepts: Annotating and Tagging
- Design a schema of semantic tags
- Manually annotate the training corpus with the tags
- Build a machine learning Language Model with the training corpus to recognise the tags
- Check the consistency of the annotations with the model
- Iteratively correct the annotations and the structure of the model to improve it
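The consistency check in the annotation loop above can be sketched as a comparison of manual tags against model predictions, token by token. This is a simplified illustration: the token-level representation, the "O" (outside) tag and the function name are assumptions, not details from the presentation.

```python
from typing import List, Tuple

Tagged = List[Tuple[str, str]]  # (token, tag) pairs


def find_disagreements(gold: Tagged, predicted: Tagged) -> List[Tuple[int, str, str, str]]:
    """List positions where the manual tag and the model's tag differ.

    Each disagreement is either an annotation error to correct or a
    model error to learn from; a human reviews the list to decide.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same tokens")
    return [
        (i, token, gold_tag, pred_tag)
        for i, ((token, gold_tag), (_, pred_tag)) in enumerate(zip(gold, predicted))
        if gold_tag != pred_tag
    ]


gold = [("left", "Laterality"), ("breast", "Site"), ("carcinoma", "O")]
pred = [("left", "Laterality"), ("breast", "Site"), ("carcinoma", "Histology")]

for position, token, gold_tag, pred_tag in find_disagreements(gold, pred):
    print(position, token, gold_tag, pred_tag)
```

Here the disagreement points at a likely annotation slip ("carcinoma" left untagged), which is the kind of case the iterative correction step is meant to catch.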
SLIDE 12
Language Model Accuracy – RUN 29B & Run 41 - Sample of 34/32 Tags
                  2017 (Run 29B)                                        2018 (Run 41)
Tag               TP       FP    FN    P       R       F       N        F       N
Site              18072    50    59    99.72   99.67   99.70   18131    99.87   21943
Histology         11172    7     87    99.94   99.23   99.58   11259    99.78   13784
Behaviour         8247     5     27    99.94   99.67   99.81   8274     99.91   9603
Grade             897      1     19    99.89   97.93   98.90   916      98.79   1335
Laterality        6351     10    20    99.84   99.69   99.76   6371     99.89   8000
Total (34 tags)   115550   226   921   99.80   99.21   99.51   116471   99.71   141412

Number of tags above each threshold:
            2017                 2018
            P     R     F        F
>99%        29    16    21       27
>99.5%      29    10    14       22
TOTAL       34    34    34       32
SLIDE 13
Coding Reports to ICD-O-3 – Problem Definition
- Separate out each specimen.
- Clinical Concept Recognition: Apply the Language Model to tag the correct concepts needed for coding.
- CODIFY: Map the tags to the appropriate elements in the ICD-O-3 definitions for each coded attribute, including applying:
– SEER multiple primaries rules and
– any other local rules.
- Evaluate each specimen for its cancer reportability.
- The Summary - Select the specimens that are required to produce the correct case codes.
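The CODIFY step can be pictured as a lookup from recognised concept text to code values. The sketch below is a deliberately small illustration, not the inference engine described in the talk: the lookup tables hold only a handful of ICD-O-3 entries, the function name is hypothetical, and the SEER multiple primaries rules and local rules are not modelled.

```python
from typing import Dict, Optional

# Tiny illustrative subsets; the real ICD-O-3 tables are far larger.
SITE_CODES = {
    "breast": "C50.9",
    "upper lobe of lung": "C34.1",
}
# Morphology is written histology/behaviour, e.g. 8500/3.
MORPHOLOGY_CODES = {
    "infiltrating duct carcinoma": "8500/3",
    "adenocarcinoma": "8140/3",
}


def codify(tags: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Map tagged concept text to codes; None marks an attribute left for manual coding."""
    site = SITE_CODES.get(tags.get("Site", "").lower())
    morphology = MORPHOLOGY_CODES.get(tags.get("Histology", "").lower())
    histology, behaviour = morphology.split("/") if morphology else (None, None)
    return {"site": site, "histology": histology, "behaviour": behaviour}


print(codify({"Site": "Breast", "Histology": "Infiltrating duct carcinoma"}))
```

Returning `None` for anything outside the tables is one simple way to route a specimen to manual coding rather than guess a code.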
SLIDE 14
Coding Results Overview – All Reportables 2017.v1
Extractable            # of records   # correctly coded   Accuracy
Site (4 digits)        3165           3014                95.23%
hist_type (4 digits)   3165           3050                96.37%
hist_grade             3165           3126                98.77%
hist_behavior          3165           3150                99.53%
laterality             3165           3096                97.82%
TOTAL                  15825          15436               97.54% (average)
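The average accuracy can be checked from the counts. Because every attribute is scored on the same 3165 records, the mean of the five per-attribute accuracies equals the pooled accuracy (total correct over total coded). A minimal check, with illustrative variable names:

```python
records = 3165
correct = {
    "Site": 3014,
    "hist_type": 3050,
    "hist_grade": 3126,
    "hist_behavior": 3150,
    "laterality": 3096,
}

# Per-attribute accuracy in percent.
per_attribute = {name: 100 * n / records for name, n in correct.items()}

# Pooled accuracy: all correct codes over all coded attribute values.
pooled = 100 * sum(correct.values()) / (records * len(correct))

# Mean of the per-attribute percentages.
mean_accuracy = sum(per_attribute.values()) / len(per_attribute)

for name, accuracy in per_attribute.items():
    print(f"{name:<14} {accuracy:.2f}%")
print(f"pooled {pooled:.2f}%  mean {mean_accuracy:.2f}%")
```

Summing the percentages themselves has no meaning as an accuracy; only the pooled or averaged figure does.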
SLIDE 15
FINAL Estimated Efficiency Gains
- 100% automated Case finding - Reportability
- 72% automated coding at 94% overall accuracy => 28% manual coding.
- Up to 90% automatic coding is possible.
- Histology codes coverage: 93.9% of supplied examples
- Site codes coverage: 94.2% of supplied examples
- Automated coding code-set coverage: 97.5% of reports.
- Reduce manual errors by 40-80%.
- More improvements are being made.
SLIDE 16
Epilogue – 2018 Revisions
- 187 Site codes – 35% increase
- 237 Histology codes – 85% increase
- 15% increase in reports
– Drawn from audits, new samplings
- 2018.v1 is being tested by CCR
- 2017.v2 is released
SLIDE 17
The Foreseeable Future
- Increase accuracy for more difficult complexity classes -> reduces manual processing
- Further investigate tumour stream classifications for improving accuracy
- Analyse Immunohistochemistry and Genetics reports
- Add more extraction and coding functions – e.g. Biomarkers, recurrence, tumour size, margins, …
- Add more document types – Radiology, Nuclear Medicine,
- For the CDPH – Other Health topics – First Fractures of Osteoarthritis, Adherence to National guidelines for diagnostics
- Provide a Subscription Reportability Service for third parties.
SLIDE 18
- END