Seminars in Software and Services for the Information Society



slide-1
SLIDE 1

Master of Science in Engineering in Computer Science (MSE-CS)

DIPARTIMENTO DI INGEGNERIA INFORMATICA AUTOMATICA E GESTIONALE ANTONIO RUBERTI

(MSE-CS)

Seminars in Software and Services for the Information Society

Umberto Nanni

Lara Malfatti (MD-Thesis, March 2013)

1 Seminars of Software and Services for the Information Society Lara Malfatti - MD Thesis (Advisor: Umberto Nanni)

Data Mining for evaluating the risk of chemotherapy-associated thrombosis

slide-2
SLIDE 2

Outline

  • Problem and contextualization
  • Data Mining methodologies
  • Dataset preprocessing
  • Attribute selection
  • Classification
  • Cost evaluation
  • Conclusion
slide-3
SLIDE 3

Venous Thrombo-Embolism (VTE)

  • Its incidence increases from 0.1% in the general population to 3% in cancer patients
  • It is the second cause of mortality in cancer patients
  • Its treatment represents a large cost for the National Health Service (about 8,000 € per patient)

slide-4
SLIDE 4

Data set description

The dataset contains 565 instances (526 negative + 39 positive). Each entry contains 35 variables, which can be grouped into:

  • 1. Patient risk factors: such as age, sex, laboratory analyses and comorbid conditions (e.g. obesity)
  • 2. Cancer risk factors: such as site and stage of the tumor
  • 3. Treatment risk factors: such as the administration of chemotherapy or targeted therapy agents

slide-5
SLIDE 5

State of the art

slide-6
SLIDE 6

Terminology

  • Classification process: takes an instance as input and tries to predict whether it will be positive or negative
  • Medical evaluation metrics are derived from the related confusion matrix:
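
The confusion matrix figure is not part of this transcript. As a minimal sketch, assuming the standard definitions of the three metrics used in the later slides (Accuracy, PPV, NPV), they could be derived from the four cells of the matrix as follows. The class name and the example counts are illustrative and are not taken from the thesis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionMatrix:
    tp: int  # positive instances predicted as positive
    fp: int  # negative instances predicted as positive
    tn: int  # negative instances predicted as negative
    fn: int  # positive instances predicted as negative

    def accuracy(self) -> float:
        """Share of all instances that are classified correctly."""
        total = self.tp + self.fp + self.tn + self.fn
        return (self.tp + self.tn) / total

    def ppv(self) -> float:
        """Positive predictive value: share of positive predictions that are right."""
        return self.tp / (self.tp + self.fp)

    def npv(self) -> float:
        """Negative predictive value: share of negative predictions that are right."""
        return self.tn / (self.tn + self.fn)


# Made-up counts, for illustration only
cm = ConfusionMatrix(tp=30, fp=70, tn=880, fn=20)
print(f"accuracy={cm.accuracy():.2f} ppv={cm.ppv():.2f} npv={cm.npv():.2f}")
```

PPV and NPV are undefined when a model makes no positive (or no negative) predictions; this sketch does not guard against that case.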

slide-7
SLIDE 7

Statistical approach: Khorana’s score

This model uses 5 biological variables as predictors and classifies patients into three risk categories: low, intermediate and high risk

Risk category       Low    Intermediate    High
Num. of patients    280    252             33

Metric      Value
Accuracy    53%
PPV         10%
NPV         96%

Pros:

  • Simple and clear model
  • Low cost of predictive variables

Cons:

  • Too many patients classified as “intermediate risk”
  • Poor performance
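
The slide does not list the five variables or their thresholds. The sketch below assumes the point system as it is commonly published for the Khorana score (cancer site, platelet count, hemoglobin or use of erythropoiesis-stimulating agents, leukocyte count, body mass index) and should be checked against the original publication before any use. All function and parameter names are illustrative.

```python
# Site groupings and thresholds are an assumption based on the commonly
# published Khorana score; they do not come from the slides.
VERY_HIGH_RISK_SITES = {"stomach", "pancreas"}
HIGH_RISK_SITES = {"lung", "lymphoma", "gynecologic", "bladder", "testicular"}


def khorana_score(site, platelets, hemoglobin, uses_esa, leukocytes, bmi):
    """Return the score; platelets and leukocytes in 10^9/L, hemoglobin in g/dL."""
    score = 0
    site = site.lower()
    if site in VERY_HIGH_RISK_SITES:
        score += 2
    elif site in HIGH_RISK_SITES:
        score += 1
    if platelets >= 350:
        score += 1
    if hemoglobin < 10 or uses_esa:
        score += 1
    if leukocytes > 11:
        score += 1
    if bmi >= 35:
        score += 1
    return score


def risk_category(score):
    """Map a score to the three risk categories named on the slide."""
    if score == 0:
        return "low"
    if score <= 2:
        return "intermediate"
    return "high"
```

Under these assumptions the intermediate category covers both a score of 1 and a score of 2, which is consistent with the slide's remark that many patients end up in that category.
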
slide-8
SLIDE 8

Challenge:

  • Is it possible to find better variable combinations able to predict thrombosis through data mining?
  • What is the best predictive combination in terms of cost/benefit among all the possible ones?
  • Are the screening costs of these combinations sustainable by the National Health Service?

slide-9
SLIDE 9

Outline

  • Problem and contextualization
  • Data Mining methodologies
  • Dataset preprocessing
  • Attribute selection
  • Classification
  • Cost evaluation
  • Conclusion
slide-10
SLIDE 10

Knowledge Discovery in Health Care

slide-11
SLIDE 11

WEKA

WEKA: Waikato Environment for Knowledge Analysis

  • It is a free tool for data mining applications, written in Java
  • It implements all the steps of the KDD workflow, from data preprocessing to the visualization of discovered patterns
  • Attention is focused on data preprocessing, attribute selection and the learning phase

slide-12
SLIDE 12

WEKA: learning phase

Learning phase: training and testing data sets must be disjoint

An unbalanced data set causes:

  • Excessive influence of the majority class on the classification model
  • High global performance without forecasting a single instance of the minority class

The creation of balanced training and testing datasets is conducted manually during the preprocessing phase
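
To illustrate the second point with the class counts given in the data set description (526 negative, 39 positive), here is a small sketch of a trivial model that labels every patient as negative. The variable names are illustrative.

```python
negatives, positives = 526, 39  # class counts from the data set description

# A trivial model that predicts "negative" for every patient
tn, fn = negatives, positives
tp, fp = 0, 0

accuracy = (tp + tn) / (negatives + positives)
sensitivity = tp / positives  # share of real positives that are detected

print(f"accuracy = {accuracy:.1%}, detected positives = {tp} of {positives}")
# prints: accuracy = 93.1%, detected positives = 0 of 39
```

The accuracy looks high even though no thrombotic event is ever predicted, which is why accuracy alone is not a sufficient metric on this data set.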

slide-13
SLIDE 13

Outline

  • Problem and contextualization
  • Data Mining methodologies
  • Dataset preprocessing
  • Attribute selection
  • Classification
  • Cost evaluation
  • Conclusion
slide-14
SLIDE 14

Data set pre-processing: cleaning (1/3)

Create three balanced folds and combine the partial results

  • All the instances are classified exactly once
  • All the training sets have the same number of positive and negative instances
  • Training and testing datasets are disjoint

Extra cost: each experiment needs three runs
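
The slides do not describe how the three partitions were built. One scheme that satisfies the three properties listed above is sketched here as an assumption: split both classes into three parts, test on one part at a time, and train on the remaining positives plus an equally sized random sample of the remaining negatives. The function name and the tuple representation of instances are illustrative.

```python
import random


def balanced_folds(positives, negatives, k=3, seed=0):
    """Yield (train, test) pairs. Assumed scheme, not confirmed by the slides."""
    rng = random.Random(seed)
    pos, neg = list(positives), list(negatives)
    rng.shuffle(pos)
    rng.shuffle(neg)
    # Deal the shuffled instances into k parts per class
    pos_parts = [pos[i::k] for i in range(k)]
    neg_parts = [neg[i::k] for i in range(k)]
    for i in range(k):
        # Each instance appears in exactly one test set
        test = pos_parts[i] + neg_parts[i]
        train_pos = [x for j in range(k) if j != i for x in pos_parts[j]]
        neg_pool = [x for j in range(k) if j != i for x in neg_parts[j]]
        # Undersample the majority class so the training set is balanced
        train_neg = rng.sample(neg_pool, len(train_pos))
        yield train_pos + train_neg, test


# Placeholder instances with the class counts of the data set
positives = [("pos", i) for i in range(39)]
negatives = [("neg", i) for i in range(526)]
folds = list(balanced_folds(positives, negatives))
```

In this sketch the test sets keep the original class proportions, since every instance has to be classified exactly once; only the training sets are balanced.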

slide-15
SLIDE 15

Data set pre-processing: cleaning (2/3)

The objective is to remove noisy instances

  • VTE normally occurs within 6 months from the beginning of chemotherapy
  • The time interval is enlarged to 12 months to also cover asymptomatic events

Outliers are given by:

  • Intrinsic probability of having a thrombotic event
  • Changes in anticancer treatments

slide-16
SLIDE 16

Data set preprocessing: improvements (3/3)

Unstructured numerical data are aggregated so that they do not adversely influence the classification model (see figure)

Instances with missing values are discarded because:

  • Artificial values cannot correspond to real cases
  • They can create problems in both the training and the testing data sets
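
The figure is not available in this transcript, so the exact aggregation is unknown. The sketch below assumes that aggregation means mapping raw numbers to a few intervals, and shows the discard step for incomplete instances. The attribute names, values and interval edges are hypothetical.

```python
from bisect import bisect_right


def drop_incomplete(rows):
    """Keep only the instances in which every attribute has a value."""
    return [row for row in rows if all(v is not None for v in row.values())]


def to_interval(value, edges):
    """Return the index of the interval a number falls in, given sorted edges."""
    return bisect_right(edges, value)


# Hypothetical instances and interval edges, for illustration only
rows = [
    {"age": 63, "platelets": 410},
    {"age": 58, "platelets": None},  # discarded: missing value
]
age_edges = [50, 65]  # intervals: below 50, 50 to 64, 65 and above

complete = drop_incomplete(rows)
age_groups = [to_interval(row["age"], age_edges) for row in complete]
```
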

slide-17
SLIDE 17

Outline

  • Problem and contextualization
  • Data Mining methodologies
  • Dataset preprocessing
  • Attribute selection
  • Classification
  • Cost evaluation
  • Conclusion
slide-18
SLIDE 18

Attribute selection (1/2)

Feature selection returns meaningful subsets of the original attributes, ignoring the ones which provide no information

Filter methods:

  • they are independent of any learning algorithm and rely only on data properties
  • they can be seen as the combination of search techniques for proposing new subsets and evaluation metrics to rank them

WEKA provides many options

slide-19
SLIDE 19

Attribute selection (2/2)

GreedyStepwise: performs a greedy search through the space of attribute subsets in both directions (backward and forward) starting from the empty set

CorrelationFeatureSubsetEval (CfsSubsetEval in WEKA): prefers subsets with attributes highly correlated with the class but having low inter-correlation
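
The combination of the two components can be sketched outside WEKA as follows. This is a simplified illustration, not WEKA's implementation: it uses the absolute Pearson correlation as a stand-in for the correlation measure and searches only in the forward direction. The subset score is the usual correlation-based merit, k·r_cf / sqrt(k + k(k−1)·r_ff), where r_cf is the mean attribute-class correlation and r_ff the mean correlation between attributes. The toy columns are invented.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def cfs_merit(subset, columns, target):
    """Merit is high for class-correlated, mutually uncorrelated attributes."""
    k = len(subset)
    if k == 0:
        return 0.0
    r_cf = sum(abs(pearson(columns[f], target)) for f in subset) / k
    pairs = list(combinations(subset, 2))
    r_ff = 0.0
    if pairs:
        r_ff = sum(abs(pearson(columns[a], columns[b])) for a, b in pairs) / len(pairs)
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)


def greedy_forward(columns, target):
    """Start from the empty set; add the best attribute while the merit improves."""
    selected, best = [], 0.0
    while True:
        scored = [
            (cfs_merit(selected + [f], columns, target), f)
            for f in columns
            if f not in selected
        ]
        if not scored:
            return selected
        merit, feature = max(scored)
        if merit <= best:
            return selected
        selected.append(feature)
        best = merit


# Invented toy data: "redundant" closely follows "strong", "noise" is unrelated
target = [0, 0, 0, 1, 1, 1]
columns = {
    "strong": [1, 2, 1, 5, 6, 5],
    "redundant": [1, 2, 2, 5, 6, 4],
    "noise": [3, 1, 2, 3, 1, 2],
}
chosen = greedy_forward(columns, target)
```

On this toy data only the attribute named `strong` is kept: adding the redundant attribute lowers the merit because of its high inter-correlation, and the noise attribute carries no information about the class.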

slide-20
SLIDE 20

Outline

  • Problem and contextualization
  • Data Mining methodologies
  • Dataset preprocessing
  • Attribute selection
  • Classification
  • Cost evaluation
  • Conclusion
slide-21
SLIDE 21

Classification

Guidelines:

  • For each subset found in the previous step, experiments are conducted using different learning algorithms
  • PPV, NPV and Accuracy are compared; Khorana’s results are used as benchmarks
  • A constraint is fixed: no NPV values lower than 96% are allowed

WEKA provides a variety of learning algorithms; the ones used in the experiments are:

  • Bayes algorithms, Decision trees, Covering rules, Logistic regression functions and Lazy algorithms
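
The comparison rule in the guidelines can be sketched as a simple filter. The benchmark values are the ones reported on the Khorana's score slide; the candidate results are placeholders invented for the example and are not results from the thesis.

```python
# Benchmark from the Khorana's score slide
KHORANA = {"accuracy": 0.53, "ppv": 0.10, "npv": 0.96}
NPV_FLOOR = 0.96  # constraint: no NPV value lower than 96% is allowed


def admissible(results):
    """Drop every experiment whose NPV violates the constraint."""
    return [r for r in results if r["npv"] >= NPV_FLOOR]


def improves_on_benchmark(result, benchmark=KHORANA):
    """True if both accuracy and PPV are better than the benchmark."""
    return result["accuracy"] > benchmark["accuracy"] and result["ppv"] > benchmark["ppv"]


# Placeholder experiment results, for illustration only
experiments = [
    {"name": "subset A + algorithm X", "accuracy": 0.70, "ppv": 0.18, "npv": 0.97},
    {"name": "subset B + algorithm Y", "accuracy": 0.80, "ppv": 0.22, "npv": 0.94},
]
kept = [r["name"] for r in admissible(experiments) if improves_on_benchmark(r)]
```

The second placeholder experiment is rejected despite its higher accuracy and PPV, because its NPV falls below the fixed floor.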

slide-22
SLIDE 22

Classification: Accuracy

All the predictive groups have better accuracy than Pure-KS

slide-23
SLIDE 23

Classification: NPV

The Khorana group violates the NPV constraint: its NPV is under 96%

slide-24
SLIDE 24

Classification: PPV

The WEKA and ThP groups double the PPV obtained by Pure-KS

slide-25
SLIDE 25

Outline

  • Problem and contextualization
  • Data Mining methodologies
  • Dataset preprocessing
  • Attribute selection
  • Classification
  • Cost evaluation
  • Conclusion
slide-26
SLIDE 26

Cost Evaluation (1/2)

Evaluation of the screening cost and possible NHS savings

slide-27
SLIDE 27

Cost Evaluation (2/2)

  • In all cases, the National Health Service saves money from correctly predicted thrombosis (no treatment needed) and covers the screening costs at the same time
  • Augmented-KS is the best predictive combination from an economic point of view
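
The cost figures behind these statements are not included in this transcript. A minimal sketch of the kind of balance involved, assuming that every correctly predicted thrombosis avoids one treatment at the approximate per-patient cost quoted earlier (8,000 €) and that screening has a fixed cost per patient, might look as follows. The screening cost and the number of true positives in the example are hypothetical, and costs such as prophylaxis are ignored.

```python
TREATMENT_COST_EUR = 8000  # approximate per-patient treatment cost quoted in the slides


def net_saving(true_positives, screened_patients, screening_cost_per_patient):
    """Avoided treatment costs minus total screening costs (simplified model)."""
    saved = true_positives * TREATMENT_COST_EUR
    spent = screened_patients * screening_cost_per_patient
    return saved - spent


# Hypothetical inputs: 20 correctly predicted events, 565 screened patients,
# 50 EUR of screening cost per patient
example = net_saving(true_positives=20, screened_patients=565, screening_cost_per_patient=50.0)
```

A positive value means that the avoided treatments more than cover the cost of screening, which is the condition described in the first bullet.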

slide-28
SLIDE 28

Outline

  • Problem and contextualization
  • Data Mining methodologies
  • Dataset preprocessing
  • Attribute selection
  • Classification
  • Cost evaluation
  • Conclusion
slide-29
SLIDE 29

Conclusion and future work

From the use of data mining for the study of chemotherapy-associated thrombosis:

  • PPV increases by 150% with respect to the statistical approach
  • The NHS saves money from correctly predicted thrombosis and covers the screening costs at the same time

Due to the limited size of the analyzed dataset, better results could be reached by:

  • repeating the experiments by integrating more biological variables
  • repeating the experiments by integrating more instances into the dataset