Patients and Doctors Together to Discover Medical Knowledge with Statistics and Classification: of Patients, by Patients and for Patients
Ye-In Chang, Department of Computer Science and Engineering, National Sun Yat-sen University
Outline
Introduction
A Survey of Knowledge Discovery
Proposed Methods
  - Chronic Kidney Disease as an Important Risk Factor for Tumor Recurrences, Progression and Overall Survival in Primary Non-Muscle-Invasive Bladder Cancer
  - Applying the Chi-Square Test to Improve the Performance of the Decision Tree for Classification by Taking Baseball Database as an Example
Conclusion
Introduction
Knowledge discovery in databases focuses on methodologies for extracting useful information from collections of data.
One approach to knowledge discovery is data mining.
Data classification is one of the most widely used data mining techniques: it assigns categories to collected data in order to make accurate predictions.
One popular model for data classification is the decision tree.
In fact, a key property of a good decision tree is the choice of deciding factors in its internal nodes.
Among statistical tests, the chi-square test is a good way to analyze whether categorical variable A is a significant factor for categorical variable B.
From our observation of research papers on medicine, we consider that a risk factor (i.e., a significant factor under the chi-square test in statistics) is strongly related to an important deciding factor in the decision tree.
In this research, we first study chronic kidney disease as an important risk factor for bladder cancer, in cooperation with the Department of Urology, Chang Gung Memorial Hospital, Kaohsiung, Taiwan, and we propose a statistical approach to check the relation.
Second, we make use of the significant factors to improve the performance of the decision tree: we propose an approach that reduces the number of deciding factors and decides their order in the decision tree.
Survey
Statistical tests and data mining techniques have been widely studied, developed, and applied in many fields.
Statistics
In statistics, there are two types of attributes: continuous numbers and categorical variables.
Examples of continuous numbers are age and weight (e.g., "I am 30 years old and my weight is 60.5 kg").
Examples of categorical variables are gender and location (e.g., "I am a girl and I live in Kaohsiung").
The chi-square test is a statistical test designed to analyze whether a significant relationship exists between two categorical variables.
Furthermore, it is used in many fields, including medical studies, finance and marketing, education, and sports.
Chi-Square Test
There are four steps in the chi-square test, described as follows.
Step 1. State the null hypothesis (H0) and the alternative hypothesis (Ha).
Step 2. Determine the significance level.
Step 3. Analyze the database.
Step 4. Explain the results in terms of the hypotheses.
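The four steps above can be sketched in code. The following is an illustrative, standard-library-only implementation for a 2x2 contingency table; the observed counts are made-up example numbers, not data from this study.

```python
# Illustrative chi-square test for a 2x2 contingency table (stdlib only).
# The observed counts below are made-up example numbers, not study data.

def chi_square_statistic(table):
    """Compute the chi-square statistic for a 2D contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

# Step 1: H0 = the two variables are independent; Ha = they are dependent.
# Step 2: significance level alpha = 0.05; for 1 degree of freedom the
#         chi-square critical value is about 3.841.
CRITICAL_VALUE_1DF = 3.841

# Step 3: analyze the (example) database.
observed = [[30, 10],
            [20, 40]]
stat = chi_square_statistic(observed)

# Step 4: explain the result in terms of the hypotheses.
significant = stat > CRITICAL_VALUE_1DF
print(f"chi-square = {stat:.3f}, reject H0: {significant}")
```

With these example counts, the statistic exceeds the critical value, so H0 (independence) would be rejected at the 0.05 level.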
Data Mining
The preprocessing of the data mining technique includes:
- data selection
- data cleaning
- data transformation
Classification is one of the most common learning models in data mining.
One approach to classification is the decision tree, which makes the analysis of very large datasets effective.
Decision Tree
Player  Outlook   Humidity  Temperature  Action
001     sunny     low       hot          stay at home
002     overcast  medium    low          go to play
003     rainy     medium    hot          stay at home
004     sunny     low       low          go to play
005     overcast  low       hot          go to play
006     rainy     high      low          stay at home
007     overcast  medium    low          go to play
008     rainy     high      low          stay at home
009     rainy     high      hot          stay at home
010     sunny     medium    low          go to play
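To illustrate how a decision tree could be induced from a table like this, here is a minimal ID3-style sketch (entropy-based attribute selection). It is an illustration of the general technique, not the exact algorithm used in this study.

```python
import math
from collections import Counter

# The player table above: (outlook, humidity, temperature) -> action.
ROWS = [
    ("sunny",    "low",    "hot", "stay at home"),
    ("overcast", "medium", "low", "go to play"),
    ("rainy",    "medium", "hot", "stay at home"),
    ("sunny",    "low",    "low", "go to play"),
    ("overcast", "low",    "hot", "go to play"),
    ("rainy",    "high",   "low", "stay at home"),
    ("overcast", "medium", "low", "go to play"),
    ("rainy",    "high",   "low", "stay at home"),
    ("rainy",    "high",   "hot", "stay at home"),
    ("sunny",    "medium", "low", "go to play"),
]
ATTRS = {"outlook": 0, "humidity": 1, "temperature": 2}

def entropy(rows):
    """Shannon entropy of the class labels in the given rows."""
    counts = Counter(r[-1] for r in rows)
    total = len(rows)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def build(rows, attrs):
    """Recursively build an ID3 tree: {attr: {value: subtree-or-label}}."""
    labels = {r[-1] for r in rows}
    if len(labels) == 1:
        return labels.pop()
    if not attrs:
        return Counter(r[-1] for r in rows).most_common(1)[0][0]

    def weighted_entropy(a):
        # Weighted entropy of the children after splitting on attribute a.
        idx = ATTRS[a]
        values = {r[idx] for r in rows}
        return sum(len([r for r in rows if r[idx] == v]) / len(rows)
                   * entropy([r for r in rows if r[idx] == v])
                   for v in values)

    best = min(sorted(attrs), key=weighted_entropy)
    idx = ATTRS[best]
    rest = [a for a in attrs if a != best]
    return {best: {v: build([r for r in rows if r[idx] == v], rest)
                   for v in {r[idx] for r in rows}}}

def classify(tree, row):
    """Walk the tree until a leaf label is reached."""
    while isinstance(tree, dict):
        attr, branches = next(iter(tree.items()))
        tree = branches[row[ATTRS[attr]]]
    return tree

tree = build(ROWS, list(ATTRS))
print(classify(tree, ("overcast", "high", "hot")))
```

On this table, ID3 selects outlook as the root (it gives the lowest weighted entropy), which shows why choosing good deciding factors for the internal nodes matters.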
Bladder Cancer
We consider ten putative risk factors, and we follow five outcomes:
- Bladder tumor recurrence
- Upper urinary tract (UUT) tumor recurrence
- Cancer progression
- Cancer-specific survival
- Overall survival
Chronic Kidney Disease as an Important Risk Factor for Tumor Recurrences, Progression and Overall Survival in Primary Non-Muscle-Invasive Bladder Cancer
International Urology and Nephrology, Vol. 48, No. 6, pp. 993-999, June 2016. (SCI)
Method
This retrospective study was approved by the hospital review board of Kaohsiung Chang Gung Memorial Hospital and was performed in accordance with the ethical standards of the Declaration of Helsinki.
In the data selection step, we chose the medical records from the Cancer Center, Chang Gung Memorial Hospital, Kaohsiung, Taiwan.
The medical records contain 2140 bladder cancer patients, with 119 medical record fields per patient.
They were reviewed for the 10 putative variables: patient age, gender, white blood cell (leucocyte) count (WBC), NL ratio, tumor count, tumor size, grade, stage, eGFR, and squamous differentiation of histology.
A total of 158 patients with a primary diagnosis of TaT1 NMIBC (i.e., attribute stage = 'Ta' or stage = 'T1') from January 2008 to December 2010 were treated by transurethral resection of bladder tumors (TURBT) at the urologic department.
All of the patients were followed up for more than four years, until December 2014.
In the data cleaning step, we fill in missing or unclear data by checking the related records in the surgery database.
For example, if a patient's tumor count record is empty in the original database, we check his or her related data in the surgery database to obtain the correct record.
In the data enrichment step, we convert medical results from continuous numbers into categorical variables.
For example, we convert the putative factor eGFR into three categorical values: ≥60, 30-59, and <30.
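This enrichment step can be sketched as a simple binning function. The function name and the example readings below are hypothetical; the category cut-offs are the eGFR ranges stated above.

```python
def egfr_category(egfr):
    """Map a continuous eGFR value (ml/min) to one of three categories."""
    if egfr >= 60:
        return ">=60"   # normal or mildly reduced kidney function
    if egfr >= 30:
        return "30-59"  # CKD stage 3
    return "<30"        # CKD stage 4-5

# Example: enrich a few hypothetical patient readings.
readings = [95.0, 44.2, 12.7]
categories = [egfr_category(v) for v in readings]
print(categories)
```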
In the statistical test step, we analyze the P-values of the putative risk factors using the chi-square test.
In practice, we use the Number Cruncher Statistical System (NCSS) to perform the chi-square test.
We consider ten putative risk factors:
- Age
- Gender
- White blood cell (leucocyte) count (WBC)
- Neutrophil-to-lymphocyte ratio (NL ratio)
- Tumor count
- Tumor size
- Grade
- Stage
- Estimated glomerular filtration rate (eGFR)
- Squamous differentiation
Result
Patient characteristics (No. (%)):
Age: <40: 3 (2); 40-69: 85 (54); ≥70: 70 (44)
Gender: Male: 112 (71); Female: 46 (29)
WBC (k/ul): <10: 143 (91); ≥10: 15 (9)
NL ratio: <4: 79 (77); ≥4: 23 (23)
Tumor count: 1: 90 (57); 2-7: 63 (40); ≥8: 5 (3)
Tumor size (cm): <3: 153 (97); ≥3: 5 (3)
Grade: Low: 49 (31); High: 109 (69)
Stage: Ta: 104 (66); T1: 54 (34)
eGFR (ml/min): ≥60: 86 (54); 30-59 (CKD stage 3): 35 (22); <30 (CKD stage 4, 5): 37 (24)
Squamous differentiation: No: 148 (94); Yes: 10 (6)
Tumor recurrences, progression and overall survival of NMIBC patients with 4-year follow-up:

                          With CKD % (no.)   Without CKD % (no.)   Total % (no.)
Bladder tumor recurrence  40 (29/72)         26 (22/86)            32 (51/158)
UUT tumor recurrence       7 (5/72)           0 (0/86)              3 (5/158)
Progression                7 (5/72)           5 (4/86)              6 (9/158)
Overall survival          63 (45/72)         91 (78/86)            78 (123/158)
Significance of each putative factor across the five outcomes (bladder tumor recurrence, UUT tumor recurrence, progression, cancer-specific survival, overall survival; ✔ = significant):
1. eGFR (CKD): significant for four of the five outcomes
2. Grade: significant for four of the five outcomes
3. Stage: significant for four of the five outcomes
4. Tumor count: significant for two outcomes
5. Squamous differentiation: significant for one outcome
6. Tumor size: significant for two outcomes
7. Age: not significant
8. Gender: significant for one outcome
9. WBC: not significant
10. NL ratio: not significant
Summary
Chronic kidney disease (CKD) is an important risk factor for tumor recurrence and progression.
Our study shows that NMIBC patients with CKD should be intensively monitored in both the UUT and the bladder.
We thank the Department of Urology, Chang Gung Memorial Hospital, Kaohsiung, Taiwan, for providing the data for this study.
Applying the Chi-Square Test to Improve the Performance of the Decision Tree for Classification by Taking Baseball Database as an Example
Journal of Computers, Dec. 2018.
Baseball Database
In the real world, a large amount of baseball batting data has been collected in digital databases.
There are 659 baseball players with at bats identified through the Chinese Professional Baseball League (CPBL) team website from 2009 to 2015.
Since 2009, each team plays 120 games per year.
We focus on the data of the 132 baseball players whose at bats (AB) are greater than or equal to 372; the order is not important in this study.
Note that this at-bat threshold equals the number of games multiplied by 3.1, according to Rule 9.22(a) of the Official Baseball Rules (2017 edition) on the Major League Baseball (MLB) website and the CPBL, 2017.
Therefore, we have 120 × 3.1 = 372.
They are reviewed for thirteen factors:
- games (G)
- plate appearances (PA)
- at bats (AB)
- runs batted in (RBI)
- runs (R)
- hits (H)
- one-base hits (1B)
- two-base hits (2B)
- three-base hits (3B)
- home runs (HR)
- total bases (TB)
- strikeouts (SO)
- stolen bases (SB)

Furthermore, batting average (AVG) is reviewed in this study.
In general, the cut-off value of batting average (AVG) is defined as 0.300.
When a player's batting average is greater than or equal to the cut-off value, the batter has excellent batting performance.
Method
Our proposed method contains five steps:
(1) splitting factors and batting data
(2) analyzing the significant factors
(3) dividing training data
(4) constructing the decision tree
(5) classifying testing data
In Step 1, we split factors and batting data so that only categorical symbols remain.
Each of the thirteen factors is split at its statistical mean.
If the possible value of an attribute is a continuous number, we convert it into a categorical symbol.
For example, if the statistical mean of stolen bases (SB) is 10, we categorize SB < 10 as symbol 1 and SB ≥ 10 as symbol 2.
Moreover, we symbolize the newly added attribute, batting average (AVG): if it is greater than or equal to 0.300, we symbolize it as class A; otherwise, as class B.
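Step 1 can be sketched as follows. The helper names and the stolen-base counts are illustrative; the mean split and the 0.300 cut-off follow the rules above.

```python
def symbolize_factor(values):
    """Split a continuous factor at its statistical mean: 1 below, 2 at/above."""
    mean = sum(values) / len(values)
    return [1 if v < mean else 2 for v in values]

def symbolize_avg(avg, cutoff=0.300):
    """Class A for an excellent batting average, class B otherwise."""
    return "A" if avg >= cutoff else "B"

# Hypothetical stolen-base counts for five players (mean = 10).
sb = [2, 5, 10, 13, 20]
print(symbolize_factor(sb))   # -> [1, 1, 2, 2, 2]
print(symbolize_avg(0.312))   # -> A
print(symbolize_avg(0.287))   # -> B
```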
In Step 2, we analyze the input data containing only the categorical values converted in Step 1.
We perform the chi-square test to analyze whether each factor is significantly related to AVG.
For example, for stolen bases (SB), we state the null hypothesis (H0) and the alternative hypothesis (Ha) as follows:
H0: batting average (AVG) and stolen bases (SB) are independent.
Ha: batting average (AVG) and stolen bases (SB) are dependent.
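Testing H0 against Ha can be sketched by building a 2x2 contingency table of SB symbols versus AVG classes and computing the chi-square statistic. This is a standard-library-only sketch; the eight player records are made up for illustration.

```python
from collections import Counter

# Hypothetical (SB symbol, AVG class) pairs for eight players.
pairs = [(1, "A"), (1, "A"), (1, "B"), (2, "B"),
         (2, "B"), (2, "A"), (1, "A"), (2, "B")]

# Build the 2x2 contingency table: rows = SB symbol, columns = AVG class.
counts = Counter(pairs)
table = [[counts[(s, c)] for c in ("A", "B")] for s in (1, 2)]

# Chi-square statistic from observed vs. expected counts.
row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
n = sum(row_totals)
stat = sum((table[i][j] - row_totals[i] * col_totals[j] / n) ** 2
           / (row_totals[i] * col_totals[j] / n)
           for i in range(2) for j in range(2))

# Compare with the 0.05 critical value for 1 degree of freedom (3.841):
# reject H0 (independence) only if the statistic exceeds it.
print(f"table={table}, chi-square={stat:.3f}, dependent={stat > 3.841}")
```

For these made-up pairs the statistic stays below 3.841, so H0 would not be rejected; the study's actual SB result (P = 0.3531) similarly fails to reach significance.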
In Step 3, in order to compare the performance of database DB13Factors with that of database DBSignificantFactors, we divide each database into training data and testing data for constructing the decision tree.
We consider different cases of training data by selecting different percentages of the data.
In this study, we use one case with 60% training data and another case with 80% training data.
Furthermore, we denote the four databases for constructing decision trees as:
- DB13Factors60%
- DB13Factors80%
- DBSignificantFactors60%
- DBSignificantFactors80%
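Step 3 can be sketched as a reproducible shuffled split. The 132-player count comes from the study; the helper function and the seed are illustrative assumptions.

```python
import random

def split_train_test(rows, train_fraction, seed=0):
    """Shuffle the rows reproducibly and split them by the given fraction."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

players = list(range(132))  # stand-ins for the 132 player records

train60, test60 = split_train_test(players, 0.60)
train80, test80 = split_train_test(players, 0.80)
print(len(train60), len(test60))  # 79 53
print(len(train80), len(test80))  # 105 27
```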
In Step 4, we use the training data to construct a decision tree.
Since we consider the four databases DB13Factors60%, DB13Factors80%, DBSignificantFactors60%, and DBSignificantFactors80% for performance comparison, we obtain four resulting decision trees.
Finally, in Step 5, we classify the testing data with the related decision tree and then calculate the correct ratio.
That is, for the decision tree constructed from training data DB13Factors80%, we use the remaining 20% of the original baseball database as the testing data.
Similarly, we test the other three databases.
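The correct ratio in Step 5 is simply the fraction of test records the tree classifies correctly; a minimal sketch with hypothetical predictions:

```python
def correct_ratio(predicted, actual):
    """Fraction of test records whose predicted class matches the true class."""
    assert len(predicted) == len(actual)
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)

# Hypothetical predictions for four held-out players (AVG classes A/B).
predicted = ["A", "B", "A", "A"]
actual    = ["A", "B", "B", "A"]
print(correct_ratio(predicted, actual))  # -> 0.75
```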
Result
The number of players in each category after splitting each factor at its statistical mean (number of players, %):
Game: <110: 60 (45); ≥110: 72 (55)
Plate appearances: <453: 65 (49); ≥453: 67 (51)
At bats: <402: 64 (48); ≥402: 68 (52)
Runs batted in: <59: 79 (60); ≥59: 53 (40)
Runs: <62: 73 (55); ≥62: 59 (45)
Hits: <123: 72 (55); ≥123: 60 (45)
One-base hits: <91: 71 (54); ≥91: 61 (46)
Two-base hits: <21: 62 (47); ≥21: 70 (53)
Three-base hits: <3: 81 (61); ≥3: 51 (39)
Home runs: <8: 74 (56); ≥8: 58 (44)
Total bases: <174: 74 (56); ≥174: 58 (44)
Strikeouts: <57: 66 (50); ≥57: 66 (50)
Stolen bases: <10: 74 (56); ≥10: 58 (44)
Univariate analysis of each factor against batting average (chi-square, P-value):
Game: 0.637, 0.4250
Plate appearances: 8.738, 0.0031
At bats: 5.840, 0.0157
Runs batted in: 18.589, < 0.001
Runs: 30.930, < 0.001
Hits: 41.149, < 0.001
One-base hits: 19.912, < 0.001
Two-base hits: 26.936, < 0.001
Three-base hits: 0.004, 0.9480
Home runs: 16.018, < 0.001
Total bases: 41.831, < 0.001
Strikeouts: 0.122, 0.7266
Stolen bases: 0.862, 0.3531
Performance
The resulting decision tree for the nine
significant factors and batting average (in the case of 80% training data)
The resulting decision tree for the
thirteen factors and batting average (in the case of 80% training data)
The number of classifying levels of each leaf node, and the average, with the nine significant factors:
Leaf levels (20 leaves): 3, 5, 5, 4, 4, 4, 3, 3, 4, 4, 3, 5, 6, 6, 8, 8, 7, 6, 6, 6
Average number of classifying levels: (3 + 5 + 5 + 4 + 4 + 4 + 3 + 3 + 4 + 4 + 3 + 5 + 6 + 6 + 8 + 8 + 7 + 6 + 6 + 6) / 20 = 5
The number of classifying levels of each leaf node, and the average, with the thirteen original factors:
Leaf levels (32 leaves): 3, 4, 5, 5, 3, 6, 6, 4, 3, 5, 6, 6, 5, 5, 6, 6, 5, 3, 4, 5, 7, 8, 9, 11, 11, 10, 7, 7, 7, 4, 5, 5
Average number of classifying levels: (3 + 4 + 5 + 5 + 3 + 6 + 6 + 4 + 3 + 5 + 6 + 6 + 5 + 5 + 6 + 6 + 5 + 3 + 4 + 5 + 7 + 8 + 9 + 11 + 11 + 10 + 7 + 7 + 7 + 4 + 5 + 5) / 32 = 5.8125 ≈ 6
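The two averages can be checked with a short script; the leaf levels are the ones reported for the two trees.

```python
# Leaf levels of the two decision trees, as reported above.
nine_factor_levels = [3, 5, 5, 4, 4, 4, 3, 3, 4, 4,
                      3, 5, 6, 6, 8, 8, 7, 6, 6, 6]
thirteen_factor_levels = [3, 4, 5, 5, 3, 6, 6, 4, 3, 5, 6, 6,
                          5, 5, 6, 6, 5, 3, 4, 5, 7, 8, 9, 11,
                          11, 10, 7, 7, 7, 4, 5, 5]

avg_nine = sum(nine_factor_levels) / len(nine_factor_levels)
avg_thirteen = sum(thirteen_factor_levels) / len(thirteen_factor_levels)
print(avg_nine)       # -> 5.0
print(avg_thirteen)   # -> 5.8125
```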
A summary of the storage cost of the decision trees with nine significant factors and thirteen original factors:

No.  Factors           Number of Factors  Internal Nodes  Leaf Nodes  Total Nodes
1    Nine factors       9                 19              20          39
2    Thirteen factors  13                 31              32          63
Summary
We have proposed a method that applies the P-value from the chi-square test (i.e., the significant factors) to analyze the relationship between each factor and the class, and then constructs a compact decision tree.
Our performance study shows that the decision tree built from only the nine significant factors provides lower storage cost, faster prediction (due to the smaller average number of levels in the decision tree), and higher classification accuracy than the decision tree built from the original thirteen factors related to AVG, in both the 80% and 60% training data cases.
Conclusion
Knowledge discovery in databases focuses on methodologies for extracting useful information from collections of data.
First, we used the P-value resulting from the chi-square test to identify chronic kidney disease as an important risk factor for tumor recurrences, progression and overall survival in primary non-muscle-invasive bladder cancer.
Second, we applied the chi-square test to improve the performance of the decision tree for classification, taking a baseball database as an example.