Patients and Doctors Together to Discover Medical Knowledge with Statistics and Classification: of Patients, by Patients and for Patients
Ye-In Chang, Department of Computer Science and Engineering, National Sun Yat-sen University
Outline
Introduction
A Survey of Knowledge Discovery
Proposed Methods
  - Chronic Kidney Disease as an Important Risk Factor for Tumor Recurrences, Progression and Overall Survival in Primary Non-Muscle-Invasive Bladder Cancer
  - Applying the Chi-Square Test to Improve the Performance of the Decision Tree for Classification by Taking Baseball Database as an Example
Conclusion
Introduction
Knowledge discovery in databases focuses on methodologies for extracting useful information from collections of data.
One approach to knowledge discovery is data mining.
Data classification is one of the most widely used data mining techniques: it assigns categories to collected data in order to make accurate predictions.
One popular model for data classification is the decision tree.
In fact, a key property of a good decision tree is the choice of deciding factors in its internal nodes.
Among statistical tests, the chi-square test is a good way to analyze whether categorical variable A is a significant factor for categorical variable B.
From our observation of research papers on medicine, we consider that a risk factor (i.e., a significant factor under the chi-square test in statistics) is strongly related to an important deciding factor in the decision tree.
In this research, we first study chronic kidney disease as an important risk factor for bladder cancer, in cooperation with the Department of Urology, Chang Gung Memorial Hospital, Kaohsiung, Taiwan, and we propose a statistical approach to check the relation.
Second, we make use of the significant factors to improve the performance of the decision tree: we propose an approach that reduces the number of deciding factors and decides their order in the decision tree.
Survey
Statistical tests and data mining techniques have been widely studied, developed, and applied in many fields.
Statistics
In statistics, there are two types of attributes: continuous numbers and categorical variables.
Examples of continuous numbers are age and weight (e.g., "I am 30 years old and my weight is 60.5 kg").
Examples of categorical variables are gender and location (e.g., "I am a girl and I live in Kaohsiung").
The chi-square test is a statistical test designed to analyze whether a significant relationship exists between two categorical variables.
Furthermore, it is used in many fields, including medical studies, finance and marketing, education, and sports.
Chi-Square Test
There are four steps in the chi-square test, described as follows.
Step 1. State the null hypothesis (H0) and the alternative hypothesis (Ha).
Step 2. Determine the significance level.
Step 3. Analyze the database.
Step 4. Explain the results in terms of the hypotheses.
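The four steps above can be sketched in code. The following is an illustrative, standard-library-only implementation for a 2x2 contingency table; the observed counts are made-up example numbers, not data from this study.

```python
# Illustrative chi-square test for a 2x2 contingency table (stdlib only).
# The observed counts below are made-up example numbers, not study data.

def chi_square_statistic(table):
    """Compute the chi-square statistic for a 2D contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

# Step 1: H0 = the two variables are independent; Ha = they are dependent.
# Step 2: significance level alpha = 0.05; for 1 degree of freedom the
#         chi-square critical value is about 3.841.
CRITICAL_VALUE_1DF = 3.841

# Step 3: analyze the (example) database.
observed = [[30, 10],
            [20, 40]]
stat = chi_square_statistic(observed)

# Step 4: explain the result in terms of the hypotheses.
significant = stat > CRITICAL_VALUE_1DF
print(f"chi-square = {stat:.3f}, reject H0: {significant}")
```

With these example counts, the statistic exceeds the critical value, so H0 (independence) would be rejected at the 0.05 level.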
Data Mining
The preprocessing of the data mining technique includes:
- data selection
- data cleaning
- data transformation
Classification is one of the most common learning models in data mining.
One approach to classification is the decision tree, which makes the analysis of very large datasets effective.
Decision Tree
Player  Outlook   Humidity  Temperature  Action
001     sunny     low       hot          stay at home
002     overcast  medium    low          go to play
003     rainy     medium    hot          stay at home
004     sunny     low       low          go to play
005     overcast  low       hot          go to play
006     rainy     high      low          stay at home
007     overcast  medium    low          go to play
008     rainy     high      low          stay at home
009     rainy     high      hot          stay at home
010     sunny     medium    low          go to play
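To illustrate how a decision tree could be induced from a table like this, here is a minimal ID3-style sketch (entropy-based attribute selection). It is an illustration of the general technique, not the exact algorithm used in this study.

```python
import math
from collections import Counter

# The player table above: (outlook, humidity, temperature) -> action.
ROWS = [
    ("sunny",    "low",    "hot", "stay at home"),
    ("overcast", "medium", "low", "go to play"),
    ("rainy",    "medium", "hot", "stay at home"),
    ("sunny",    "low",    "low", "go to play"),
    ("overcast", "low",    "hot", "go to play"),
    ("rainy",    "high",   "low", "stay at home"),
    ("overcast", "medium", "low", "go to play"),
    ("rainy",    "high",   "low", "stay at home"),
    ("rainy",    "high",   "hot", "stay at home"),
    ("sunny",    "medium", "low", "go to play"),
]
ATTRS = {"outlook": 0, "humidity": 1, "temperature": 2}

def entropy(rows):
    """Shannon entropy of the class labels in the given rows."""
    counts = Counter(r[-1] for r in rows)
    total = len(rows)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def build(rows, attrs):
    """Recursively build an ID3 tree: {attr: {value: subtree-or-label}}."""
    labels = {r[-1] for r in rows}
    if len(labels) == 1:
        return labels.pop()
    if not attrs:
        return Counter(r[-1] for r in rows).most_common(1)[0][0]

    def weighted_entropy(a):
        # Weighted entropy of the children after splitting on attribute a.
        idx = ATTRS[a]
        values = {r[idx] for r in rows}
        return sum(len([r for r in rows if r[idx] == v]) / len(rows)
                   * entropy([r for r in rows if r[idx] == v])
                   for v in values)

    best = min(sorted(attrs), key=weighted_entropy)
    idx = ATTRS[best]
    rest = [a for a in attrs if a != best]
    return {best: {v: build([r for r in rows if r[idx] == v], rest)
                   for v in {r[idx] for r in rows}}}

def classify(tree, row):
    """Walk the tree until a leaf label is reached."""
    while isinstance(tree, dict):
        attr, branches = next(iter(tree.items()))
        tree = branches[row[ATTRS[attr]]]
    return tree

tree = build(ROWS, list(ATTRS))
print(classify(tree, ("overcast", "high", "hot")))
```

On this table, ID3 selects outlook as the root (it gives the lowest weighted entropy), which shows why choosing good deciding factors for the internal nodes matters.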
Bladder Cancer
We consider ten putative risk factors, and we follow five outcomes:
- Bladder tumor recurrence
- Upper urinary tract (UUT) tumor recurrence
- Cancer progression
- Cancer-specific survival
- Overall survival
Chronic Kidney Disease as an Important Risk Factor for Tumor Recurrences, Progression and Overall Survival in Primary Non-Muscle-Invasive Bladder Cancer
International Urology and Nephrology, Vol. 48, No. 6, pp. 993-999, June 2016. (SCI)
Method
This retrospective study was approved by the hospital review board of Kaohsiung Chang Gung Memorial Hospital and was performed in accordance with the ethical standards of the Declaration of Helsinki.
In the data selection step, we chose the medical records from the Cancer Center, Chang Gung Memorial Hospital, Kaohsiung, Taiwan.
The medical records contain 2140 bladder cancer patients, with 119 medical record fields per patient.
They were reviewed for the 10 putative variables: patient age, gender, white blood cell (leucocyte) count (WBC), NL ratio, tumor count, tumor size, grade, stage, eGFR, and squamous differentiation of histology.
A total of 158 patients with a primary diagnosis of TaT1 NMIBC (i.e., attribute stage = 'Ta' or stage = 'T1') from January 2008 to December 2010 were treated by transurethral resection of bladder tumors (TURBT) at the urologic department.
All of the patients were followed up for more than four years, until December 2014.
In the data cleaning step, we fill in missing or unclear data by checking the related records in the surgery database.
For example, if a patient's tumor count record is empty in the original database, we check his or her related data in the surgery database to obtain the correct record.
In the data enrichment step, we convert medical results from continuous numbers into categorical variables.
For example, we convert the putative factor eGFR into three categorical values: ≥60, 30-59, and <30.
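This enrichment step can be sketched as a simple binning function. The function name and the example readings below are hypothetical; the category cut-offs are the eGFR ranges stated above.

```python
def egfr_category(egfr):
    """Map a continuous eGFR value (ml/min) to one of three categories."""
    if egfr >= 60:
        return ">=60"   # normal or mildly reduced kidney function
    if egfr >= 30:
        return "30-59"  # CKD stage 3
    return "<30"        # CKD stage 4-5

# Example: enrich a few hypothetical patient readings.
readings = [95.0, 44.2, 12.7]
categories = [egfr_category(v) for v in readings]
print(categories)
```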
In the statistical test step, we analyze the P-values of the putative risk factors using the chi-square test.
In practice, we use the Number Cruncher Statistical System (NCSS) to perform the chi-square test.
We consider ten putative risk factors:
- Age
- Gender
- White blood cell (leucocyte) count (WBC)
- Neutrophil-to-lymphocyte ratio (NL ratio)
- Tumor count
- Tumor size
- Grade
- Stage
- Estimated glomerular filtration rate (eGFR)
- Squamous differentiation
Result
Patient characteristics (No. (%)):
Age: <40: 3 (2); 40-69: 85 (54); ≥70: 70 (44)
Gender: Male: 112 (71); Female: 46 (29)
WBC (k/ul): <10: 143 (91); ≥10: 15 (9)
NL ratio: <4: 79 (77); ≥4: 23 (23)
Tumor count: 1: 90 (57); 2-7: 63 (40); ≥8: 5 (3)
Tumor size (cm): <3: 153 (97); ≥3: 5 (3)
Grade: Low: 49 (31); High: 109 (69)
Stage: Ta: 104 (66); T1: 54 (34)
eGFR (ml/min): ≥60: 86 (54); 30-59 (CKD stage 3): 35 (22); <30 (CKD stage 4, 5): 37 (24)
Squamous differentiation: No: 148 (94); Yes: 10 (6)
Tumor recurrences, progression and overall survival of NMIBC patients with 4-year follow-up:

                          With CKD % (no.)   Without CKD % (no.)   Total % (no.)
Bladder tumor recurrence  40 (29/72)         26 (22/86)            32 (51/158)
UUT tumor recurrence       7 (5/72)           0 (0/86)              3 (5/158)
Progression                7 (5/72)           5 (4/86)              6 (9/158)
Overall survival          63 (45/72)         91 (78/86)            78 (123/158)
Significance of each putative factor across the five outcomes (bladder tumor recurrence, UUT tumor recurrence, progression, cancer-specific survival, overall survival; ✔ = significant):
1. eGFR (CKD): significant for four of the five outcomes
2. Grade: significant for four of the five outcomes
3. Stage: significant for four of the five outcomes
4. Tumor count: significant for two outcomes
5. Squamous differentiation: significant for one outcome
6. Tumor size: significant for two outcomes
7. Age: not significant
8. Gender: significant for one outcome
9. WBC: not significant
10. NL ratio: not significant
Summary
Chronic kidney disease (CKD) is an important risk factor for tumor recurrence and progression.
Our study shows that NMIBC patients with CKD should be intensively monitored in both the UUT and the bladder.
We thank the Department of Urology, Chang Gung Memorial Hospital, Kaohsiung, Taiwan, for providing the data for this study.
Applying the Chi-Square Test to Improve the Performance of the Decision Tree for Classification by Taking Baseball Database as an Example
Journal of Computers, Dec. 2018.
Baseball Database
In the real world, a large amount of baseball batting data has been collected in digital databases.
There are 659 baseball players with at bats identified through the Chinese Professional Baseball League (CPBL) team website from 2009 to 2015.
Since 2009, each team plays 120 games per year.
We focus on the data of the 132 baseball players whose at bats (AB) are greater than or equal to 372; the order is not important in this study.
Note that this at-bat threshold equals the number of games multiplied by 3.1, according to Rule 9.22(a) of the Official Baseball Rules (2017 edition) on the Major League Baseball (MLB) website and the CPBL, 2017.
Therefore, we have 120 × 3.1 = 372.
They are reviewed for thirteen factors:
- games (G)
- plate appearances (PA)
- at bats (AB)
- runs batted in (RBI)
- runs (R)
- hits (H)
- one-base hits (1B)
- two-base hits (2B)
- three-base hits (3B)
- home runs (HR)
- total bases (TB)
- strikeouts (SO)
- stolen bases (SB)

Furthermore, batting average (AVG) is reviewed in this study.
In general, the cut-off value of batting average (AVG) is defined as 0.300.
When a player's batting average is greater than or equal to the cut-off value, the batter has excellent batting performance.
Method
Our proposed method contains five steps:
(1) splitting factors and batting data
(2) analyzing the significant factors
(3) dividing training data
(4) constructing the decision tree
(5) classifying testing data
In Step 1, we split factors and batting data so that only categorical symbols remain.
Each of the thirteen factors is split at its statistical mean.
If the possible value of an attribute is a continuous number, we convert it into a categorical symbol.
For example, if the statistical mean of stolen bases (SB) is 10, we categorize SB < 10 as symbol 1 and SB ≥ 10 as symbol 2.
Moreover, we symbolize the newly added attribute, batting average (AVG): if it is greater than or equal to 0.300, we symbolize it as class A; otherwise, as class B.
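Step 1 can be sketched as follows. The helper names and the stolen-base counts are illustrative; the mean split and the 0.300 cut-off follow the rules above.

```python
def symbolize_factor(values):
    """Split a continuous factor at its statistical mean: 1 below, 2 at/above."""
    mean = sum(values) / len(values)
    return [1 if v < mean else 2 for v in values]

def symbolize_avg(avg, cutoff=0.300):
    """Class A for an excellent batting average, class B otherwise."""
    return "A" if avg >= cutoff else "B"

# Hypothetical stolen-base counts for five players (mean = 10).
sb = [2, 5, 10, 13, 20]
print(symbolize_factor(sb))   # -> [1, 1, 2, 2, 2]
print(symbolize_avg(0.312))   # -> A
print(symbolize_avg(0.287))   # -> B
```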
In Step 2, we analyze the input data containing only the categorical values converted in Step 1.
We perform the chi-square test to analyze whether each factor is significantly related to AVG.
For example, for stolen bases (SB), we state the null hypothesis (H0) and the alternative hypothesis (Ha) as follows:
H0: batting average (AVG) and stolen bases (SB) are independent.
Ha: batting average (AVG) and stolen bases (SB) are dependent.
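Testing H0 against Ha can be sketched by building a 2x2 contingency table of SB symbols versus AVG classes and computing the chi-square statistic. This is a standard-library-only sketch; the eight player records are made up for illustration.

```python
from collections import Counter

# Hypothetical (SB symbol, AVG class) pairs for eight players.
pairs = [(1, "A"), (1, "A"), (1, "B"), (2, "B"),
         (2, "B"), (2, "A"), (1, "A"), (2, "B")]

# Build the 2x2 contingency table: rows = SB symbol, columns = AVG class.
counts = Counter(pairs)
table = [[counts[(s, c)] for c in ("A", "B")] for s in (1, 2)]

# Chi-square statistic from observed vs. expected counts.
row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
n = sum(row_totals)
stat = sum((table[i][j] - row_totals[i] * col_totals[j] / n) ** 2
           / (row_totals[i] * col_totals[j] / n)
           for i in range(2) for j in range(2))

# Compare with the 0.05 critical value for 1 degree of freedom (3.841):
# reject H0 (independence) only if the statistic exceeds it.
print(f"table={table}, chi-square={stat:.3f}, dependent={stat > 3.841}")
```

For these made-up pairs the statistic stays below 3.841, so H0 would not be rejected; the study's actual SB result (P = 0.3531) similarly fails to reach significance.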
In Step 3, in order to compare the performance of database DB13Factors with that of database DBSignificantFactors, we divide each database into training data and testing data for constructing the decision tree.
We consider different cases of training data by selecting different percentages of the data.
In this study, we use one case with 60% training data and another case with 80% training data.
Furthermore, we denote the four databases for constructing decision trees as:
- DB13Factors60%
- DB13Factors80%
- DBSignificantFactors60%
- DBSignificantFactors80%
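Step 3 can be sketched as a reproducible shuffled split. The 132-player count comes from the study; the helper function and the seed are illustrative assumptions.

```python
import random

def split_train_test(rows, train_fraction, seed=0):
    """Shuffle the rows reproducibly and split them by the given fraction."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

players = list(range(132))  # stand-ins for the 132 player records

train60, test60 = split_train_test(players, 0.60)
train80, test80 = split_train_test(players, 0.80)
print(len(train60), len(test60))  # 79 53
print(len(train80), len(test80))  # 105 27
```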
In Step 4, we use the training data to construct a decision tree.
Since we consider the four databases DB13Factors60%, DB13Factors80%, DBSignificantFactors60%, and DBSignificantFactors80% for performance comparison, we obtain four resulting decision trees.
Finally, in Step 5, we classify the testing data with the related decision tree and then calculate the correct ratio.
That is, for the decision tree constructed from training data DB13Factors80%, we use the remaining 20% of the original baseball database as the testing data.
Similarly, we test the other three databases.
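The correct ratio in Step 5 is simply the fraction of test records the tree classifies correctly; a minimal sketch with hypothetical predictions:

```python
def correct_ratio(predicted, actual):
    """Fraction of test records whose predicted class matches the true class."""
    assert len(predicted) == len(actual)
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)

# Hypothetical predictions for four held-out players (AVG classes A/B).
predicted = ["A", "B", "A", "A"]
actual    = ["A", "B", "B", "A"]
print(correct_ratio(predicted, actual))  # -> 0.75
```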
Result
The number of players in each category after splitting each factor at its statistical mean (number of players, %):
Game: <110: 60 (45); ≥110: 72 (55)
Plate appearances: <453: 65 (49); ≥453: 67 (51)
At bats: <402: 64 (48); ≥402: 68 (52)
Runs batted in: <59: 79 (60); ≥59: 53 (40)
Runs: <62: 73 (55); ≥62: 59 (45)
Hits: <123: 72 (55); ≥123: 60 (45)
One-base hits: <91: 71 (54); ≥91: 61 (46)
Two-base hits: <21: 62 (47); ≥21: 70 (53)
Three-base hits: <3: 81 (61); ≥3: 51 (39)
Home runs: <8: 74 (56); ≥8: 58 (44)
Total bases: <174: 74 (56); ≥174: 58 (44)
Strikeouts: <57: 66 (50); ≥57: 66 (50)
Stolen bases: <10: 74 (56); ≥10: 58 (44)
Univariate analysis of each factor against batting average (chi-square, P-value):
Game: 0.637, 0.4250
Plate appearances: 8.738, 0.0031
At bats: 5.840, 0.0157
Runs batted in: 18.589, < 0.001
Runs: 30.930, < 0.001
Hits: 41.149, < 0.001
One-base hits: 19.912, < 0.001
Two-base hits: 26.936, < 0.001
Three-base hits: 0.004, 0.9480
Home runs: 16.018, < 0.001
Total bases: 41.831, < 0.001
Strikeouts: 0.122, 0.7266
Stolen bases: 0.862, 0.3531
Performance
The resulting decision tree for the nine
significant factors and batting average (in the case of 80% training data)
The resulting decision tree for the
thirteen factors and batting average (in the case of 80% training data)
The number of classifying levels of each leaf node, and the average, with the nine significant factors:
Leaf levels (20 leaves): 3, 5, 5, 4, 4, 4, 3, 3, 4, 4, 3, 5, 6, 6, 8, 8, 7, 6, 6, 6
Average number of classifying levels: (3 + 5 + 5 + 4 + 4 + 4 + 3 + 3 + 4 + 4 + 3 + 5 + 6 + 6 + 8 + 8 + 7 + 6 + 6 + 6) / 20 = 5
The number of classifying levels of each leaf node, and the average, with the thirteen original factors:
Leaf levels (32 leaves): 3, 4, 5, 5, 3, 6, 6, 4, 3, 5, 6, 6, 5, 5, 6, 6, 5, 3, 4, 5, 7, 8, 9, 11, 11, 10, 7, 7, 7, 4, 5, 5
Average number of classifying levels: (3 + 4 + 5 + 5 + 3 + 6 + 6 + 4 + 3 + 5 + 6 + 6 + 5 + 5 + 6 + 6 + 5 + 3 + 4 + 5 + 7 + 8 + 9 + 11 + 11 + 10 + 7 + 7 + 7 + 4 + 5 + 5) / 32 = 5.8125 ≈ 6
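The two averages can be checked with a short script; the leaf levels are the ones reported for the two trees.

```python
# Leaf levels of the two decision trees, as reported above.
nine_factor_levels = [3, 5, 5, 4, 4, 4, 3, 3, 4, 4,
                      3, 5, 6, 6, 8, 8, 7, 6, 6, 6]
thirteen_factor_levels = [3, 4, 5, 5, 3, 6, 6, 4, 3, 5, 6, 6,
                          5, 5, 6, 6, 5, 3, 4, 5, 7, 8, 9, 11,
                          11, 10, 7, 7, 7, 4, 5, 5]

avg_nine = sum(nine_factor_levels) / len(nine_factor_levels)
avg_thirteen = sum(thirteen_factor_levels) / len(thirteen_factor_levels)
print(avg_nine)       # -> 5.0
print(avg_thirteen)   # -> 5.8125
```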
A summary of the storage cost of the decision trees with nine significant factors and thirteen original factors:

No.  Factors           Number of Factors  Internal Nodes  Leaf Nodes  Total Nodes
1    Nine factors       9                 19              20          39
2    Thirteen factors  13                 31              32          63
Summary
We have proposed a method that applies the P-value from the chi-square test (i.e., the significant factors) to analyze the relationship between each factor and the class, and then constructs a compact decision tree.
Our performance study shows that the decision tree built from only the nine significant factors provides lower storage cost, faster prediction (due to the smaller average number of levels in the decision tree), and higher classification accuracy than the decision tree built from the original thirteen factors related to AVG, in both the 80% and 60% training data cases.
Conclusion
Knowledge discovery in databases focuses on methodologies for extracting useful information from collections of data.
First, we used the P-value resulting from the chi-square test to identify chronic kidney disease as an important risk factor for tumor recurrences, progression and overall survival in primary non-muscle-invasive bladder cancer.
Second, we applied the chi-square test to improve the performance of the decision tree for classification, taking a baseball database as an example.