Text Mining in Clinical Domain: Dealing with Noise

SLIDE 1

Text Mining in Clinical Domain: Dealing with Noise

Author: Hoang Nguyen, Jon Patrick
Source: KDD’16
Advisor: Jia-Ling Koh
Speaker: Avon Yu
Date: 2018/12/4

SLIDE 2

Outline

  • Introduction
  • Method
  • Experiment
  • Conclusion

SLIDE 3

Introduction

  • Motivation
    • High level of noise in the clinical corpus:
      • unknown words (e.g. misspellings, acronyms, abbreviations)
      • non-words (clinical scores and measures, e.g. BP140/65, HR 72…)
      • poorly formed, ungrammatical sentences
    • Labelled data is costly and still often contains errors and inconsistencies.
    • Imbalanced data distribution.

SLIDE 4

Introduction

  • Goal
    • Introduce a general clinical data mining architecture with the potential to address these challenges using:
      • a pre-processing system (proof-reading)
      • interactive model development
      • active learning

SLIDE 5

Introduction

  • Framework

SLIDE 6

Outline

  • Introduction
  • Method
  • Experiment
  • Conclusion

SLIDE 7

Method

  • Standardisation
    • Ring-fencing tokeniser
    • A Finite State Recognizer (FSR) uses training examples to recognize token patterns constituting a score or measurement that requires standardisation.
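
As a rough illustration of what recognising and standardising a score token could look like, the sketch below uses two hand-written regular expressions in place of an FSR learned from training examples. The pattern set, the `standardise_scores` name and the chosen output form are assumptions made for this example, not the authors' implementation.

```python
import re

# Hypothetical hand-written patterns; the system described above learns
# its patterns from training examples instead.
SCORE_PATTERNS = [
    # Blood pressure such as "BP140/65" or "BP 140/65"
    (re.compile(r"\bBP\s*(\d{2,3})\s*/\s*(\d{2,3})\b", re.IGNORECASE),
     lambda m: f"BP {m.group(1)}/{m.group(2)}"),
    # Heart rate such as "HR72" or "hr  72"
    (re.compile(r"\bHR\s*(\d{2,3})\b", re.IGNORECASE),
     lambda m: f"HR {m.group(1)}"),
]


def standardise_scores(text: str) -> str:
    """Rewrite recognised score tokens into one consistent surface form."""
    for pattern, rewrite in SCORE_PATTERNS:
        text = pattern.sub(rewrite, text)
    return text


print(standardise_scores("Pt stable, BP140/65 and hr  72 on arrival"))
# Pt stable, BP 140/65 and HR 72 on arrival
```

Once every variant maps to one form, downstream features treat "BP140/65" and "BP 140/65" as the same kind of token.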

SLIDE 8

Method

  • Normalisation & Clinical Concepts Recognition
    • The Lexicon Management System (LMS) stores the accumulated lexical knowledge and contains categorizations of spelling errors, acronyms and non-word tokens.

Dictionary for English and medical terms
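
A lexicon lookup of this kind can be pictured as a categorised dictionary. The sketch below is only a toy stand-in: the `LexiconEntry` type, the `normalise` function and the two entries are invented for illustration and do not reflect the actual LMS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LexiconEntry:
    category: str      # e.g. "misspelling", "acronym", "non-word"
    normal_form: str   # text the token should be normalised to


# Toy entries invented for this example.
LEXICON = {
    "pnuemonia": LexiconEntry("misspelling", "pneumonia"),
    "cxr": LexiconEntry("acronym", "chest x-ray"),
}


def normalise(tokens):
    """Replace tokens found in the lexicon; leave unknown tokens unchanged."""
    out = []
    for tok in tokens:
        entry = LEXICON.get(tok.lower())
        out.append(entry.normal_form if entry else tok)
    return out


print(normalise(["CXR", "shows", "pnuemonia"]))
# ['chest x-ray', 'shows', 'pneumonia']
```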

SLIDE 9

Method

  • Iterative Model Development
    • The model is evaluated and the algorithm is revised in a feedback process to produce a more accurate result.

SLIDE 10

Method

  • Iterative Model Development
    • Feature selection:
      • Bag of words (BOW)
      • Proof reading
      • Ring-fencing
      • Lemma
      • Medical term and gazetteer
      • Bag of tags (BOT)
      • Context feature
      • Negation and modality
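
For readers unfamiliar with the first feature in the list, a bag of words is just a count of tokens that ignores their order. A minimal sketch, assuming whitespace tokenisation and lower-casing (both simplifications chosen for this example):

```python
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Count lower-cased, whitespace-separated tokens."""
    return Counter(text.lower().split())


features = bag_of_words("No mass seen no lesion seen")
print(features["no"], features["seen"], features["mass"])
# 2 2 1
```

The other features in the list would be layered on top of token counts like these, for example by counting lemmas or tags instead of raw tokens.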

SLIDE 11

Method

  • Iterative Model Development
    • The new model is delivered to the Visual Annotator (VA) to perform manual correction with the support of an annotation validation tool.

SLIDE 12

Method

  • Active Learning
    • The learner queries the most informative instances to retrain the model instead of making a random selection.

SLIDE 13

Method

  • Active Learning
    • Pool-based active learning
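
The general shape of a pool-based loop can be sketched as follows: train on the labelled set, pick one instance from the unlabelled pool, ask an annotator for its label, and retrain. All names here (`pool_based_active_learning`, `train`, `select`, `oracle`) and the one-dimensional toy problem are assumptions for illustration only.

```python
def pool_based_active_learning(labelled, pool, oracle, train, select, rounds):
    """Generic pool-based loop.

    labelled: list of (x, y) pairs; pool: list of unlabelled x.
    oracle(x) returns the label (the annotator's role).
    train(labelled) returns a model; select(model, pool) returns a pool index.
    """
    labelled, pool = list(labelled), list(pool)  # work on copies
    model = train(labelled)
    for _ in range(rounds):
        if not pool:
            break
        x = pool.pop(select(model, pool))   # query one instance
        labelled.append((x, oracle(x)))     # obtain its label
        model = train(labelled)             # retrain with the new example
    return model, labelled


# Toy 1-D problem: the "model" is a single threshold on a number line.
def train(labelled):
    """Put the threshold midway between the closest negative and positive."""
    highest_neg = max(x for x, y in labelled if y == 0)
    lowest_pos = min(x for x, y in labelled if y == 1)
    return (highest_neg + lowest_pos) / 2


def select(threshold, pool):
    """Query the pool point nearest the current threshold."""
    return min(range(len(pool)), key=lambda i: abs(pool[i] - threshold))


def oracle(x):
    """Stand-in for the annotator; 0.62 is an arbitrary hidden boundary."""
    return int(x >= 0.62)


seed = [(0.0, 0), (1.0, 1)]
pool = [i / 20 for i in range(1, 20)]
model, labelled = pool_based_active_learning(seed, pool, oracle, train, select, rounds=4)
print(len(labelled), model)
```

Swapping the `select` function is what distinguishes one query strategy from another; the loop itself stays the same.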

SLIDE 14

Method

  • Active Learning
    • Simple AL

Data within the margin is less imbalanced than the data as a whole.
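
In the active learning literature, "Simple" usually refers to margin-based selection for SVMs: query the unlabelled instance closest to the current decision boundary. The sketch below assumes a linear decision function with made-up weights; `simple_margin_query` and the toy pool are hypothetical.

```python
def decision_value(w, b, x):
    """Signed score w·x + b of a linear classifier."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def simple_margin_query(w, b, pool):
    """Index of the pool instance closest to the decision boundary.

    |w·x + b| is proportional to the distance from the hyperplane,
    so the smallest absolute score marks the closest instance.
    """
    return min(range(len(pool)), key=lambda i: abs(decision_value(w, b, pool[i])))


w, b = [1.0, -1.0], 0.0                      # hypothetical trained weights
pool = [[3.0, 0.0], [1.0, 0.8], [0.0, 2.5]]
print(simple_margin_query(w, b, pool))
# 1
```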

SLIDE 15

Method

  • Active Learning
    • Self Confident
      • Chooses the next example to be labeled so that, when it is added to the training data, the future generalization error probability is minimized.
      • Log-loss function:

SLIDE 16

Method

  • Active Learning
    • Kernel Farthest-First (KFF)
      • The most informative instance is the instance in the unseen pool that is farthest from the current training set.
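
A minimal sketch of farthest-first selection in kernel space, assuming that "distance from the training set" means the distance to the nearest labelled point and that an RBF kernel is used (both are assumptions of this example). The kernel-induced distance follows from d(x, y)² = K(x, x) + K(y, y) − 2K(x, y).

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel; gamma is an assumed hyperparameter."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq)


def kernel_distance(x, y, kernel=rbf_kernel):
    """Distance between x and y in the kernel-induced feature space."""
    sq = kernel(x, x) + kernel(y, y) - 2.0 * kernel(x, y)
    return math.sqrt(max(sq, 0.0))  # clamp tiny negative rounding errors


def kff_query(labelled_x, pool, kernel=rbf_kernel):
    """Index of the pool instance whose nearest labelled point is farthest away."""
    def dist_to_set(x):
        return min(kernel_distance(x, z, kernel) for z in labelled_x)
    return max(range(len(pool)), key=lambda i: dist_to_set(pool[i]))


labelled_x = [[0.0, 0.0], [1.0, 0.0]]
pool = [[0.5, 0.0], [1.0, 1.0], [3.0, 3.0]]
print(kff_query(labelled_x, pool))
# 2
```

Unlike margin-based selection, this rule ignores the current classifier entirely and only explores regions of the input space not yet covered by labelled data.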

SLIDE 17

Method

  • Active Learning
    • Balanced Exploration and Exploitation (Balance-EE)
      • A combination of Simple and KFF
      • The probability p for exploration will be updated as:

SLIDE 18

Outline

  • Introduction
  • Method
  • Experiment
  • Conclusion

SLIDE 19

Experiment

  • Dataset:
    • All reports provided in a year’s data collection by the imaging services in Australia.
    • A sample of 16,472 reports was drawn from Lake Imaging and assigned to the cancer (4,784 reports) or non-cancer (11,688 reports) class by the cancer registry.

SLIDE 20

Experiment

  • Descriptor (De)
    • morphology, topography, cell type, ...
  • Entity (En)
    • subject of the report
  • Linguistic (Li)
    • lexical polarity, normality and modifier
  • Radiologist’s coding (Ra)
    • cancer stage, TNM
  • Structure (St)
    • heading tags

SLIDE 21

Experiment

SLIDE 22

Experiment

The evaluation of the reportability classifier presented here was executed independently at the Cancer Registry. The final version is based on two ML algorithms: Conditional Random Fields (CRFs) and SVMs.

‘Sensitivity’ is the recall of the positive class (reportable); ‘specificity’ is the recall of the negative class (non-reportable).
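
The two measures can be computed directly from predictions, as in the sketch below. The label strings and the eight toy predictions are invented for this example and are not results from the evaluation; the function assumes both classes occur in `y_true`.

```python
def sensitivity_specificity(y_true, y_pred, positive="reportable"):
    """Return (recall of the positive class, recall of the negative class)."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1   # reportable, correctly flagged
            else:
                fn += 1   # reportable, missed
        else:
            if pred == positive:
                fp += 1   # non-reportable, wrongly flagged
            else:
                tn += 1   # non-reportable, correctly passed over
    return tp / (tp + fn), tn / (tn + fp)


R, N = "reportable", "non-reportable"
y_true = [R, R, R, N, N, N, N, N]
y_pred = [R, R, N, N, N, N, N, R]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(round(sens, 3), round(spec, 3))
# 0.667 0.8
```

With imbalanced classes, reporting both values matters: a classifier that labels every report non-reportable would score a specificity of 1.0 and a sensitivity of 0.0.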

SLIDE 23

Outline

  • Introduction
  • Method
  • Experiment
  • Conclusion

SLIDE 24

Conclusion

  • Presents a general system for text mining in the clinical domain, with a focus on dealing with multiple frequent kinds of noise.
  • Can dramatically reduce human effort in identifying relevant reports from the large imaging pool for further investigation of cancer.
  • The classifier is built on a large real-world dataset and can achieve high performance in filtering relevant reports.
