Seminars in Software and Services for the Information Society

Dipartimento di Ingegneria Informatica, Automatica e Gestionale Antonio Ruberti
Master of Science in Engineering in Computer Science (MSE-CS)
Seminars in Software and Services for the Information Society (Umberto Nanni)
Data Mining for evaluating the risk of chemotherapy-associated thrombosis
Lara Malfatti, MD Thesis, March 2013 (Advisor: Umberto Nanni)

Outline
• Problem and contextualization
• Data Mining methodologies
• Dataset preprocessing
• Attribute selection
• Classification
• Cost evaluation
• Conclusion

Venous Thrombo-Embolism (VTE)
• Its rate rises from 0.1% in the general population to 3% in cancer patients
• It is the second leading cause of death in cancer patients
• Its treatment is a major cost for the National Health Service (about €8,000 per patient)

Data set description
The dataset contains 565 instances (526 negative + 39 positive). Each entry contains 35 variables, which can be grouped into:
1. Patient risk factors, such as age, sex, laboratory analyses and comorbid conditions (e.g. obesity)
2. Cancer risk factors, such as site and stage of the tumor
3. Treatment risk factors, such as the administration of chemotherapy or targeted therapy agents

State of the art

Terminology
• Classification process: takes an instance as input and tries to predict whether it will be positive or negative
• Medical evaluation metrics are derived from the related confusion matrix
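
The three metrics used throughout these slides (Accuracy, PPV, NPV) can be read directly off a binary confusion matrix. The sketch below uses the standard definitions; the function name and the example counts are illustrative assumptions and are not results from the thesis.

```python
def medical_metrics(tp, fp, tn, fn):
    """Compute Accuracy, PPV and NPV from confusion-matrix counts.

    tp/fp: instances predicted positive that are truly positive/negative
    tn/fn: instances predicted negative that are truly negative/positive
    """
    total = tp + fp + tn + fn
    predicted_pos = tp + fp
    predicted_neg = tn + fn
    return {
        # Fraction of all instances classified correctly
        "accuracy": (tp + tn) / total,
        # Positive predictive value: how often a positive prediction is right
        "ppv": tp / predicted_pos if predicted_pos else float("nan"),
        # Negative predictive value: how often a negative prediction is right
        "npv": tn / predicted_neg if predicted_neg else float("nan"),
    }


# Made-up counts, only to exercise the formulas
example = medical_metrics(tp=8, fp=2, tn=85, fn=5)
print(example)
```

PPV and NPV are returned as NaN when the classifier never predicts the corresponding class, since the ratio is undefined in that case.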

Statistical approach: Khorana’s score
This model uses 5 biological variables as predictors and classifies patients into three risk categories: low, intermediate and high risk.
Pros:
• Simple and clear model
• Low cost of predictive variables
Cons:
• Too many patients classified as “intermediate risk”
• Poor performance
Number of patients per risk category: low 280, intermediate 252, high 33
Metrics: Accuracy 53%, PPV 10%, NPV 96%
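
For readers unfamiliar with the score, here is a minimal sketch of how a Khorana-style score could be computed. The slide does not list the five variables or their cut-offs; the ones below follow the commonly cited published version of the score (cancer site, platelet count, hemoglobin or use of erythropoiesis-stimulating agents, leukocyte count, BMI) and should be treated as an assumption to verify against the original publication. Function and parameter names are hypothetical.

```python
# Assumed site groupings, taken from the commonly cited published score
VERY_HIGH_RISK_SITES = {"stomach", "pancreas"}
HIGH_RISK_SITES = {"lung", "lymphoma", "gynecologic", "bladder", "testicular"}


def khorana_score(site, platelets, hemoglobin, uses_esa, leukocytes, bmi):
    """Return the point total for one patient.

    platelets and leukocytes in 10^9/L, hemoglobin in g/dL, bmi in kg/m^2.
    All thresholds are assumptions not stated on the slide.
    """
    score = 0
    if site in VERY_HIGH_RISK_SITES:
        score += 2
    elif site in HIGH_RISK_SITES:
        score += 1
    if platelets >= 350:
        score += 1
    if hemoglobin < 10 or uses_esa:
        score += 1
    if leukocytes > 11:
        score += 1
    if bmi >= 35:
        score += 1
    return score


def risk_category(score):
    """Map a point total to the three categories named on the slide."""
    if score == 0:
        return "low"
    if score <= 2:
        return "intermediate"
    return "high"


print(risk_category(khorana_score("lung", 200, 13.0, False, 7, 24)))
```

Because any single point already moves a patient out of the low category, a wide intermediate band (scores 1 and 2) is a natural consequence of this design, which is consistent with the first drawback listed above.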

Challenge:
• Is it possible to find better variable combinations able to predict thrombosis through data mining?
• What is the best predictive combination in terms of cost/benefit among all the possible ones?
• Are the screening costs of these combinations sustainable for the National Health Service?

Knowledge Discovery in Health Care

WEKA
WEKA: Waikato Environment for Knowledge Analysis
• It is a free tool for data mining applications, written in Java
• It implements all the steps of the KDD workflow, from data preprocessing to the visualization of discovered patterns
• This work focuses on data preprocessing, attribute selection and the learning phase

WEKA: learning phase
Learning phase: training and testing data sets must be disjoint.
An unbalanced data set causes:
• Excessive influence of the majority class on the classification model
• High global performance without correctly predicting a single instance of the minority class
Balanced training and testing datasets are created manually during the preprocessing phase.
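
The second effect can be checked with simple arithmetic on the class counts given in the data set description (526 negative, 39 positive). The sketch below evaluates a trivial classifier that always answers "negative"; the variable names are illustrative.

```python
negatives, positives = 526, 39
total = negatives + positives

# A classifier that always predicts "negative"
tp, fp = 0, 0                  # it never predicts positive
tn, fn = negatives, positives  # every positive case is missed

accuracy = (tp + tn) / total
sensitivity = tp / (tp + fn)   # share of positive cases detected

print(f"accuracy    = {accuracy:.1%}")
print(f"sensitivity = {sensitivity:.1%}")
```

The accuracy comes out at roughly 93% even though no thrombotic event is ever detected, which is why global accuracy alone is a misleading target on this data set.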

Data set pre-processing: cleaning (1/3)
Create three balanced folds and combine the partial results:
• All the instances are classified exactly once
• All the training sets have the same number of positive and negative instances
• Training and testing datasets are disjoint
Extra cost: each experiment requires three runs.
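
One way to obtain folds with these three properties is sketched below. The slide does not say how the balanced training sets were built, so undersampling the negative class is an assumption here, as are the function name and the fixed random seed.

```python
import random


def balanced_folds(labels, n_folds=3, seed=0):
    """Return a list of (train_indices, test_indices) pairs.

    Test sets partition the whole data set, so every instance is
    classified exactly once. Each training set keeps all positives
    outside the test fold plus an equally sized random sample of the
    negatives outside the test fold (assumed undersampling strategy).
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    rng.shuffle(pos)
    rng.shuffle(neg)

    folds = []
    for k in range(n_folds):
        # Every n_folds-th instance of each class goes to test fold k
        test = set(pos[k::n_folds]) | set(neg[k::n_folds])
        train_pos = [i for i in pos if i not in test]
        remaining_neg = [i for i in neg if i not in test]
        # Undersample the majority class to match the positives
        train_neg = rng.sample(remaining_neg, len(train_pos))
        folds.append((sorted(train_pos + train_neg), sorted(test)))
    return folds


# Class counts from the data set description: 39 positive, 526 negative
labels = [1] * 39 + [0] * 526
folds = balanced_folds(labels)
for train, test in folds:
    print(len(train), len(test))
```

With this scheme each model is trained on a small balanced sample and tested on a disjoint third of the data, and the three partial confusion matrices can be summed to cover all 565 instances.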

Data set pre-processing: cleaning (2/3)
The objective is to remove noisy instances:
• VTE normally occurs within 6 months of the beginning of chemotherapy
• The time interval is extended to 12 months to also cover asymptomatic events
Outliers are caused by:
• The intrinsic probability of having a thrombotic event
• Changes in anticancer treatments

Data set preprocessing: improvements (3/3)
Unstructured numerical data are aggregated so that they do not distort the classification model (see figure).
Instances with missing values are discarded because:
• Artificial values cannot correspond to real cases
• They can create problems in both the training and testing data sets
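
Discarding incomplete instances, as opposed to imputing artificial values, amounts to a simple row filter. The sketch below is only illustrative: the attribute names and rows are hypothetical, and "?" is included because it is the missing-value marker in WEKA's ARFF files.

```python
# Values treated as missing; "?" is the ARFF missing-value marker
MISSING = {None, "", "?"}


def drop_incomplete(rows):
    """Keep only the rows in which every attribute has a value."""
    return [
        row for row in rows
        if not any(value in MISSING for value in row.values())
    ]


# Hypothetical instances, not taken from the thesis data set
rows = [
    {"age": 61, "platelets": 280, "vte": "no"},
    {"age": 54, "platelets": "?", "vte": "no"},
    {"age": None, "platelets": 410, "vte": "yes"},
]
complete = drop_incomplete(rows)
print(len(complete), "of", len(rows), "instances kept")
```

The trade-off is a smaller data set, which matters here because positive instances are already scarce.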

Attribute selection (1/2)
Feature selection returns meaningful subsets of the original attributes, ignoring those that provide no information.
Filter methods:
• They are independent of any learning algorithm and rely only on data properties
• They can be seen as the combination of search techniques, which propose new subsets, and evaluation metrics, which rank them
WEKA provides many options for both.

Attribute selection (2/2)
GreedyStepwise: performs a greedy search through the space of attribute subsets; the search can run forward or backward, and here it starts from the empty set
CfsSubsetEval (correlation-based feature subset evaluation): prefers subsets whose attributes are highly correlated with the class but have low inter-correlation
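
The interplay between the two components can be sketched in a few lines. The code below is not WEKA's implementation: it uses the standard correlation-based merit k·r_cf / sqrt(k + k(k−1)·r_ff) with absolute Pearson correlation as a stand-in correlation measure, and a plain forward search from the empty set. All names and the toy data are assumptions for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return 0.0 if vx == 0 or vy == 0 else cov / sqrt(vx * vy)


def cfs_merit(subset, columns, target):
    """High when features correlate with the class but not with each other."""
    k = len(subset)
    if k == 0:
        return 0.0
    r_cf = sum(abs(pearson(columns[f], target)) for f in subset) / k
    pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
    r_ff = (sum(abs(pearson(columns[a], columns[b])) for a, b in pairs)
            / len(pairs)) if pairs else 0.0
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)


def greedy_forward(columns, target):
    """Start from the empty set; add the best attribute while merit improves."""
    selected, best = [], 0.0
    while True:
        scored = [(cfs_merit(selected + [f], columns, target), f)
                  for f in columns if f not in selected]
        if not scored:
            break
        merit, feature = max(scored, key=lambda item: item[0])
        if merit <= best:
            break  # no candidate improves the subset: stop
        selected.append(feature)
        best = merit
    return selected, best


# Toy data: one informative attribute, an exact copy of it, and a weak one
target = [0, 0, 0, 1, 1, 1, 0, 1]
columns = {
    "signal": [0, 0, 0, 1, 1, 1, 0, 1],
    "copy": [0, 0, 0, 1, 1, 1, 0, 1],
    "noise": [1, 0, 1, 0, 1, 0, 1, 0],
}
selected, merit = greedy_forward(columns, target)
print(selected, round(merit, 3))
```

On the toy data the redundant copy is not added, because its perfect correlation with the already selected attribute cancels out its correlation with the class; this is the "low inter-correlation" preference described above.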

Classification
Guidelines:
• For each subset found in the previous step, experiments are conducted using different learning algorithms
• PPV, NPV and Accuracy are compared; Khorana’s results are used as benchmarks
• A constraint is imposed: no NPV values lower than 96% are allowed
WEKA provides a variety of learning algorithms; the ones used in the experiments are:
• Bayes algorithms, decision trees, covering rules, logistic regression functions and lazy algorithms
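
The comparison procedure in the guidelines can be expressed as a filter followed by a benchmark check. In the sketch below the benchmark values are the Khorana figures reported earlier in the slides (Accuracy 53%, PPV 10%, NPV 96%); the two candidate entries are purely hypothetical placeholders, not results from the thesis, and the function names are assumptions.

```python
# Benchmark reported on the Khorana's score slide (percentages)
KHORANA_BENCHMARK = {"accuracy": 53.0, "ppv": 10.0, "npv": 96.0}
MIN_NPV = 96.0  # constraint: NPV must not fall below this value


def admissible(metrics):
    """A result is kept only if it satisfies the NPV constraint."""
    return metrics["npv"] >= MIN_NPV


def improvements(metrics, benchmark=KHORANA_BENCHMARK):
    """List the metrics on which a result beats the benchmark."""
    return [name for name in ("accuracy", "ppv", "npv")
            if metrics[name] > benchmark[name]]


# HYPOTHETICAL results, invented only to show how the filter behaves
hypothetical_results = {
    "candidate_A": {"accuracy": 70.0, "ppv": 15.0, "npv": 97.0},
    "candidate_B": {"accuracy": 80.0, "ppv": 20.0, "npv": 94.0},
}

shortlist = {name: improvements(m)
             for name, m in hypothetical_results.items()
             if admissible(m)}
print(shortlist)
```

Note that a candidate with higher accuracy and PPV is still rejected if its NPV drops below the threshold, which reflects the priority given to not missing at-risk patients.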

Classification: Accuracy
All the predictive groups have better accuracy than Pure-KS
