SLIDE 1

Automated Case Identification and Coding of Cancer Pathology Reports

Jon Patrick PhD, DipLSurv, MSc, BSc(Psych), Grad Dip Psych, FACHI, FACS, MAMIA CEO

SLIDE 2

Awarded 7th June 2018

SLIDE 3

The Many Faces of NLP?

  • Text Mining - rules, regular expressions, bag of words
    – deterministic
    – cannot find anything that hasn’t been defined in the rules
    – Strong on Positives but typically over-generalises. No ability to find unseens.
  • Real NLP – the field of computing the structure of language - now known as Computational Linguistics
  • Statistical NLP – NLP plus Machine Learning from examples and then can generalize
    – non-deterministic
    – finds what it hasn’t seen
  • Language Engineering – Building production grade SNLP solutions
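A minimal sketch of the rule-based "Text Mining" behaviour described above, assuming a hypothetical hand-written term list (the pattern and the example sentences are illustrative and not taken from the project). It shows both limits named on the slide: a term outside the rules is missed, and a negated mention is still flagged.

```python
import re

# Hypothetical rule: a fixed list of cancer terms (an assumption for illustration).
RULE = re.compile(r"\b(carcinoma|melanoma|sarcoma)\b", re.IGNORECASE)

def rule_flags_cancer(text: str) -> bool:
    """Return True when any term defined in the rule appears in the text."""
    return RULE.search(text) is not None

reports = {
    "defined term": "Invasive ductal carcinoma of the left breast.",
    "unseen term": "Findings consistent with glioblastoma.",      # not in the rule
    "negated term": "No evidence of carcinoma in the specimen.",  # over-generalised hit
}

# Deterministic: the same text always gives the same answer.
results = {name: rule_flags_cancer(text) for name, text in reports.items()}
print(results)
```

A statistical model trained on annotated examples is one way to address both cases, since it can learn from context rather than from a fixed term list.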

SLIDE 4

California Cancer Registry Problem

  • 500,000 documents per annum
    – potentially 1 million in the coming years
  • 50% unwanted and need to be filtered out
  • Separation of non-reportable and reportable cancers
  • Coding the reportables
SLIDE 5

California CR Project Objectives - Analysis of Histopathology Reports

  • Develop an automated service to:
  • Determine Reportability
  • Codify 5 attributes:
    – Site
    – Histology
    – Grade
    – Behaviour
    – Laterality

SLIDE 6

The Language Engineering Issues

  • How well can we do it - efficiency
  • With what volume of training materials - 5000
  • In what amount of time – 15 months
  • To what accuracy - 90%
  • For what cost – not enough
  • The SNLP is our infrastructure – it is the Language Modeling using machine learning, and coding to the client deliverables that have to be engineered.

SLIDE 7

Language Engineering Overview

  • Tasks
  • Pathology Reports Classifier

    – In scope – histopathology
    – Out-of-scope - Immunohistochemistry & Genetics

  • Case Identification – Reportability Classifier
  • Clinical Concept Recognizer
  • Coding Inference Engine
SLIDE 8

Two Training Corpora needed to create a Gold Standard for learning

  • 5000 Reportables in 10 Batches
  • 5000 Non-Reportables
  • From 50+ laboratories
  • Covering
    – 133 Histology codes
    – 140 Site codes
  • Subsequently - 212 Reportables in Non-Reportables corpus transferred to Batch 11

SLIDE 9

Pathology Report Type Classifier

Classes          TP     FP   FN   P (%)   R (%)   F-Score (%)
Histopathology   4373   57   29   98.71   99.34   99.03
Other            653    29   57   95.75   91.97   93.82
OVERALL          5026   86   86   98.32   98.32   98.32
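The P, R and F-Score columns follow from the TP, FP and FN counts. A minimal sketch, assuming the standard definitions (precision = TP/(TP+FP), recall = TP/(TP+FN), F = harmonic mean of the two); the function name `prf` is chosen here for illustration.

```python
def prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return precision, recall and F-score as percentages rounded to 2 places."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return tuple(round(100 * x, 2) for x in (precision, recall, f_score))

# Counts taken from the table above.
histopathology = prf(4373, 57, 29)
other = prf(653, 29, 57)
overall = prf(5026, 86, 86)
print(histopathology, other, overall)
```

Because the two classes are complementary, every false positive for one class is a false negative for the other, which is why FP equals FN in the OVERALL row and the three overall figures coincide.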

SLIDE 10

Reportability Results for Reportability Corpus

TP     FP   FN    TN     P (%)   R (%)   F (%)
3510   27   111   1160   99.24   96.93   98.10

Interpreted as 0.76% FP (100% - precision) and 3.07% FN (100% - recall, the loss of Cancer reports to the Non-cancer class). Acceptable but pride would want us to do better.
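The FP and FN percentages quoted on this slide can be reproduced from the counts, assuming they are shares of the predicted-reportable and truly-reportable reports respectively (that reading is an inference from the figures, not stated in the slide). A short sketch with a hypothetical helper name:

```python
def error_shares(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Return (FP share of predicted positives, FN share of actual positives) in %."""
    fp_share = fp / (tp + fp)  # equals 1 - precision
    fn_share = fn / (tp + fn)  # equals 1 - recall
    return round(100 * fp_share, 2), round(100 * fn_share, 2)

# Counts from the reportability results: TP=3510, FP=27, FN=111.
fp_pct, fn_pct = error_shares(3510, 27, 111)
print(fp_pct, fn_pct)
```

In absolute terms the FN share corresponds to the 111 reportable cancer reports that were classified as non-reportable.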

SLIDE 11

Finding Clinical Concepts: Annotating and Tagging

  • Design a schema of semantic tags
  • Manually annotate the training corpus with the tags
  • Build a machine learning Language Model with the training corpus to recognise the tags
  • Check the consistency of the annotations with the model
  • Iteratively correct the annotations and the structure of the model to improve it
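The consistency check in the steps above can be pictured as comparing the manual tags with the tags the model predicts and listing every disagreement for review. This is only a sketch under assumed data structures (token and tag pairs); the function name, the example tokens and the predicted tags are hypothetical, and the project's actual tooling is not described in the slides.

```python
def find_disagreements(gold, predicted):
    """Return (index, token, gold_tag, predicted_tag) wherever the two tags differ."""
    return [
        (i, token, gold_tag, pred_tag)
        for i, ((token, gold_tag), (_, pred_tag)) in enumerate(zip(gold, predicted))
        if gold_tag != pred_tag
    ]

# Manually annotated tokens (tag names follow the attributes on the slides).
gold = [("left", "Laterality"), ("breast", "Site"), ("carcinoma", "Histology"),
        ("invasive", "Histology")]
# Hypothetical model output for the same tokens.
predicted = [("left", "Laterality"), ("breast", "Site"), ("carcinoma", "Histology"),
             ("invasive", "Behaviour")]

to_review = find_disagreements(gold, predicted)
print(to_review)
```

Each listed item is either an annotation error or a model error; a reviewer decides which, corrects the corpus or adjusts the model, and the loop repeats.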

SLIDE 12

Language Model Accuracy – RUN 29B & Run 41 - Sample of 34/32 Tags

                  Run 29B (2017)                                          Run 41 (2018)
Tag               TP       FP    FN    P (%)   R (%)   F (%)   N        F (%)   N
Site              18072    50    59    99.72   99.67   99.70   18131    99.87   21943
Histology         11172    7     87    99.94   99.23   99.58   11259    99.78   13784
Behaviour         8247     5     27    99.94   99.67   99.81   8274     99.91   9603
Grade             897      1     19    99.89   97.93   98.90   916      98.79   1335
Laterality        6351     10    20    99.84   99.69   99.76   6371     99.89   8000
Total (34 tags)   115550   226   921   99.80   99.21   99.51   116471   99.71   141412

Number of tags above each threshold:

Threshold   P (2017)   R (2017)   F (2017)   F (2018)
>99%        29         16         21         27
>99.5%      29         10         14         22
TOTAL       34         34         34         32

SLIDE 13

Coding Reports to ICD-O-3 – Problem Definition

  • Separate out each specimen.
  • Clinical Concept Recognition: Apply the Language Model to tag the correct concepts needed for coding.
  • CODIFY: Map the tags to the appropriate elements in the ICD-O-3 definitions for each coded attribute, including applying:
    – SEER multiple primaries rules and
    – any other local rules.
  • Evaluate each specimen for its cancer reportability
  • The Summary - Select the specimens that are required to produce the correct case codes.
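The steps above can be pictured as a small pipeline. The sketch below is a toy illustration, not the project's inference engine: the lookup tables hold a handful of ICD-O-3 values only, the reportability test is a deliberately simplified assumption, and every name (`codify`, `is_reportable`, `Coded`) is hypothetical. SEER multiple primaries rules and local rules are not modelled.

```python
from dataclasses import dataclass
from typing import Optional

# Toy lookup tables (illustrative subset; real coding uses the full ICD-O-3 lists).
SITE = {"breast": "C509"}                            # C50.9, breast NOS
HISTOLOGY = {"infiltrating duct carcinoma": "8500"}
BEHAVIOUR = {"benign": "0", "in situ": "2", "malignant": "3"}

@dataclass
class Coded:
    specimen: str
    site: Optional[str]
    histology: Optional[str]
    behaviour: Optional[str]

def codify(specimen: str, tags: dict) -> Coded:
    """Map recognised concept tags to codes; unknown or missing tags become None."""
    return Coded(
        specimen=specimen,
        site=SITE.get(tags.get("Site", "")),
        histology=HISTOLOGY.get(tags.get("Histology", "")),
        behaviour=BEHAVIOUR.get(tags.get("Behaviour", "")),
    )

def is_reportable(coded: Coded) -> bool:
    """Simplifying assumption: coded site and histology, in situ or malignant behaviour."""
    return (coded.site is not None and coded.histology is not None
            and coded.behaviour in {"2", "3"})

# Step 1 (specimen separation) and step 2 (tagging) are assumed already done.
tagged_specimens = {
    "A": {"Site": "breast", "Histology": "infiltrating duct carcinoma",
          "Behaviour": "malignant"},
    "B": {"Site": "breast", "Behaviour": "benign"},
}
coded = [codify(name, tags) for name, tags in tagged_specimens.items()]
case_specimens = [c for c in coded if is_reportable(c)]  # the summary selection
print(case_specimens)
```

Keeping codification separate from the reportability decision mirrors the ordering on the slide, where each specimen is coded first and the case summary is assembled afterwards.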

SLIDE 14

Coding Results Overview – All Reportables 2017.v1

Extractable            # of records   # correctly coded   Accuracy
Site (4 digits)        3165           3014                95.23%
hist_type (4 digits)   3165           3050                96.37%
hist_grade             3165           3126                98.77%
hist_behavior          3165           3150                99.53%
laterality             3165           3096                97.82%
TOTAL                  15825          15436               487.72% (sum of the five accuracies)
Average                                                   97.54%
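The accuracy column and the average can be recomputed from the record counts. A short sketch, using the counts from the table; the variable names are chosen here for illustration.

```python
records = 3165  # records evaluated per attribute
correct = {
    "Site": 3014,
    "hist_type": 3050,
    "hist_grade": 3126,
    "hist_behavior": 3150,
    "laterality": 3096,
}

# Per-attribute accuracy in percent.
accuracy = {name: round(100 * n / records, 2) for name, n in correct.items()}

# Pooled accuracy over all 5 x 3165 coding decisions.
total_correct = sum(correct.values())
average = round(100 * total_correct / (records * len(correct)), 2)
print(accuracy, total_correct, average)
```

Since every attribute has the same number of records, the pooled accuracy equals the mean of the five per-attribute accuracies, which is how the 487.72% sum reduces to 97.54%.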

SLIDE 15

FINAL Estimated Efficiency Gains

  • 100% automated Case finding - Reportability
  • 72% automated coding at 94% overall accuracy => 28% manual coding.

  • Up to 90% automatic coding is possible.
  • Histology codes coverage: 93.9% of supplied examples
  • Site codes coverage: 94.2% of supplied examples
  • Automated Coding of code set coverage 97.5% of reports.

  • Reduce manual errors by 40-80%.
  • More improvements are being made.
SLIDE 16

Epilogue – 2018 Revisions

  • 187 Site codes – 35% increase
  • 237 Histology codes – 85% increase
  • 15% increase in reports
    – Drawn from audits, new samplings

  • 2018.v1 is being tested by CCR
  • 2017.v2 is released
SLIDE 17

The Foreseeable Future

  • Increase accuracy for more difficult complexity classes -> reduces manual processing
  • Further investigate tumour stream classifications for improving accuracy
  • Analyse Immunohistochemistry and Genetics reports
  • Add more extraction and coding functions – e.g. Biomarkers, recurrence, tumour size, margins, …
  • Add more document types – Radiology, Nuclear Medicine,
  • For the CDPH – Other Health topics – First Fractures of Osteoarthritis, Adherence to National guidelines for diagnostics
  • Provide a Subscription Reportability Service for third parties.

SLIDE 18
  • END