Supervised Learning: Classification
Sept. 24, 2018

Outline:
- Classification: Basic Concepts
- Decision Tree Induction
- Bayes Classification Methods
- Model Evaluation and Selection
- Techniques to Improve Classification
Model construction: describing a set of predetermined classes
- Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
- The set of tuples used for model construction is the training set
- The model is represented as classification rules, decision trees, or mathematical formulae
Model usage: classifying future or unknown objects
- Estimate the accuracy of the model: the known label of each test sample is compared with the result predicted from the model
- Accuracy is the percentage of test set samples that are correctly classified by the model
Training set (tenure example):

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no
Test set (unseen data):

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes
[Figure: decision tree for buys_computer; the root node splits on age into the branches <=30, 31..40, and >40.]
age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31..40  high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31..40  low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31..40  medium  no       excellent      yes
31..40  high    yes      fair           yes
>40     medium  no       excellent      no
Attribute selection measure: information gain (ID3/C4.5). Let p_i be the probability that an arbitrary tuple in D belongs to class C_i; with m classes (here m = 2), the expected information (entropy) needed to classify a tuple in D is

$$Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$$

The information still needed after using attribute A to split D into v partitions D_1, ..., D_v is

$$Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)$$

and the information gained by branching on A is

$$Gain(A) = Info(D) - Info_A(D)$$
Example: selecting the root attribute for the buys_computer training data above. Class P: buys_computer = "yes" (9 tuples); class N: buys_computer = "no" (5 tuples):

$$Info(D) = I(9,5) = -\frac{9}{14}\log_2\left(\frac{9}{14}\right) - \frac{5}{14}\log_2\left(\frac{5}{14}\right) = 0.940$$

Splitting on age:

age     p_i  n_i  I(p_i, n_i)
<=30    2    3    0.971
31..40  4    0    0.000
>40     3    2    0.971

$$Info_{age}(D) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694$$

Here $\frac{5}{14} I(2,3)$ means "age <= 30" has 5 out of 14 samples, with 2 yes'es and 3 no's. Hence

$$Gain(age) = Info(D) - Info_{age}(D) = 0.246$$

Similarly, $Gain(income) = 0.029$, $Gain(student) = 0.151$, and $Gain(credit\_rating) = 0.048$, so age is chosen as the splitting attribute.
Gain ratio (C4.5) normalizes information gain by the split information:

$$SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \log_2\left(\frac{|D_j|}{|D|}\right)$$

The attribute with the maximum gain ratio, $GainRatio(A) = Gain(A) / SplitInfo_A(D)$, is selected as the splitting attribute.
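To make these formulas concrete, here is a minimal Python sketch (illustrative code, not from the slides; all names are my own) that reproduces the numbers above on the buys_computer table:

```python
import math
from collections import Counter

# The 14-tuple buys_computer training set from the table above:
# (age, income, student, credit_rating, buys_computer)
data = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]

def info(rows):
    """Info(D) = -sum_i p_i log2(p_i), over the class label (last column)."""
    n = len(rows)
    counts = Counter(row[-1] for row in rows)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def info_after_split(rows, attr):
    """Info_A(D): weighted entropy of the partitions induced by attribute attr."""
    parts = {}
    for row in rows:
        parts.setdefault(row[attr], []).append(row)
    return sum(len(p) / len(rows) * info(p) for p in parts.values())

def gain(rows, attr):
    """Gain(A) = Info(D) - Info_A(D)."""
    return info(rows) - info_after_split(rows, attr)

def split_info(rows, attr):
    """SplitInfo_A(D), used by the C4.5 gain ratio."""
    n = len(rows)
    counts = Counter(row[attr] for row in rows)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

print(round(info(data), 3))         # 0.94  = Info(D) = I(9,5)
print(round(gain(data, 0), 3))      # 0.247 = Gain(age); the slide's 0.246 truncates the same value
print(round(gain(data, 0) / split_info(data, 0), 3))  # 0.156 = GainRatio(age)
```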
Gini index (CART): if a data set D contains examples from n classes, the Gini index is

$$gini(D) = 1 - \sum_{j=1}^{n} p_j^2$$

If a binary split on attribute A partitions D into D_1 and D_2, then

$$gini_A(D) = \frac{|D_1|}{|D|}\, gini(D_1) + \frac{|D_2|}{|D|}\, gini(D_2)$$

The attribute giving the largest reduction in impurity, $\Delta gini(A) = gini(D) - gini_A(D)$ (equivalently, the smallest $gini_A(D)$), is chosen to split the node.
Example: D has 9 tuples with buys_computer = "yes" and 5 with "no":

$$gini(D) = 1 - \left(\frac{9}{14}\right)^2 - \left(\frac{5}{14}\right)^2 = 0.459$$

Suppose attribute income partitions D into D_1 = {low, medium} (10 tuples) and D_2 = {high} (4 tuples):

$$gini_{income \in \{low,\, medium\}}(D) = \frac{10}{14}\, Gini(D_1) + \frac{4}{14}\, Gini(D_2)$$
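Continuing the sketch above (it reuses the `data` list and `Counter` import), the Gini numbers can be checked the same way:

```python
def gini(rows):
    """gini(D) = 1 - sum_j p_j^2 over the class labels."""
    n = len(rows)
    counts = Counter(row[-1] for row in rows)
    return 1 - sum((c / n) ** 2 for c in counts.values())

def gini_binary_split(rows, attr, left_values):
    """gini_A(D) for the binary split: attribute value in left_values vs. not."""
    d1 = [r for r in rows if r[attr] in left_values]
    d2 = [r for r in rows if r[attr] not in left_values]
    n = len(rows)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)

print(round(gini(data), 3))                                     # 0.459
print(round(gini_binary_split(data, 1, {"low", "medium"}), 3))  # 0.443 for income in {low, medium}
```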
Other attribute selection measures and notes:
– CHAID: a popular decision tree algorithm; its measure is based on the χ2 test for independence
– MDL (Minimal Description Length) principle (i.e., the simplest solution is preferred): the best tree is the one that requires the fewest # of bits to both (1) encode the tree, and (2) encode the exceptions to the tree
– Multivariate splits: CART finds multivariate splits based on a linear comb. of attrs.
– Which measure is best? Most give good results; none is significantly superior to the others
Bayes' theorem basics:
– Let X be a data sample ("evidence") whose class label is unknown
– Let H be the hypothesis that X belongs to class C
– Classification is to determine P(H|X), the posterior probability: the probability that the hypothesis holds given the observed data sample X
– P(H) (prior probability): the initial probability of H, before X is observed
– P(X): the probability that the sample data is observed
– P(X|H) (likelihood): the probability of observing sample X given that the hypothesis holds, e.g., given that X will buy a computer, the probability that X is 31..40 with medium income
Total probability theorem:

$$P(B) = \sum_{i=1}^{M} P(B|A_i)\, P(A_i)$$

Bayes' theorem then gives the posterior probability as

$$P(H|X) = \frac{P(X|H)\, P(H)}{P(X)}$$
Under the naive assumption that attributes are class-conditionally independent:

$$P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i) = P(x_1|C_i) \times P(x_2|C_i) \times \cdots \times P(x_n|C_i)$$

Since P(X) is the same for all classes, X is assigned to the class C_i that maximizes P(X|C_i) P(C_i).
For a continuous-valued attribute A_k, P(x_k|C_i) is usually estimated with a Gaussian density with mean μ and standard deviation σ:

$$g(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \qquad P(x_k|C_i) = g(x_k, \mu_{C_i}, \sigma_{C_i})$$
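For illustration only (the mean and standard deviation below are made-up numbers, not estimates from the slides), evaluating this density is a one-liner:

```python
import math

def gaussian(x, mu, sigma):
    """g(x, mu, sigma): Gaussian density for a continuous attribute value."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# Hypothetical: P(age = 35 | Ci) with class mean 38 and std. dev. 12.
print(gaussian(35, 38.0, 12.0))
```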
(Training data: the buys_computer table shown above.)

Test tuple: X = (age <= 30, income = medium, student = yes, credit_rating = fair)
Priors: P(buys_computer = "yes") = 9/14 = 0.643; P(buys_computer = "no") = 5/14 = 0.357
P(age = "<=30" | buys_computer = "yes") = 2/9 = 0.222
P(age = "<=30" | buys_computer = "no") = 3/5 = 0.600
P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444
P(income = "medium" | buys_computer = "no") = 2/5 = 0.400
P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
P(student = "yes" | buys_computer = "no") = 1/5 = 0.200
P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
P(credit_rating = "fair" | buys_computer = "no") = 2/5 = 0.400
P(X|Ci):
P(X | buys_computer = "yes") = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
P(X | buys_computer = "no") = 0.600 × 0.400 × 0.200 × 0.400 = 0.019
P(X|Ci) × P(Ci):
P(X | buys_computer = "yes") × P(buys_computer = "yes") = 0.028
P(X | buys_computer = "no") × P(buys_computer = "no") = 0.007
Therefore, X belongs to class buys_computer = "yes".
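Reusing the `data` list from the earlier sketch, a few illustrative lines of Python reproduce these numbers:

```python
def nb_score(x, label, rows):
    """P(X|Ci) * P(Ci) under the class-conditional independence assumption."""
    class_rows = [r for r in rows if r[-1] == label]
    prior = len(class_rows) / len(rows)
    likelihood = 1.0
    for k, value in enumerate(x):
        likelihood *= sum(1 for r in class_rows if r[k] == value) / len(class_rows)
    return prior * likelihood

x = ("<=30", "medium", "yes", "fair")
print(round(nb_score(x, "yes", data), 3))  # 0.028 -> predict buys_computer = "yes"
print(round(nb_score(x, "no", data), 3))   # 0.007
```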
Note that naive Bayes prediction requires every conditional probability in $P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i)$ to be non-zero, since a single zero factor makes the whole product zero; the Laplacian correction (adding 1 to each count) avoids this.
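A sketch of the correction (the counts below are hypothetical, chosen only to show the effect):

```python
def laplace(count, class_size, n_values):
    """Laplacian correction: add 1 to each value's count so no estimate is zero."""
    return (count + 1) / (class_size + n_values)

# Hypothetical class with 1000 tuples: income = low 0 times, medium 990, high 10.
for c in (0, 990, 10):
    print(laplace(c, 1000, 3))  # 1/1003, 991/1003, 11/1003 -- none is zero
```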
Limitation: in practice dependencies exist among variables, e.g., in patient records symptoms (fever, cough, etc.) depend on the disease (lung cancer, etc.); such dependencies cannot be modeled by a naive Bayes classifier.
Confusion matrix (two classes):

Actual class \ Predicted class  C1                    ¬C1
C1                              True Positives (TP)   False Negatives (FN)
¬C1                             False Positives (FP)  True Negatives (TN)

Example of confusion matrix:

Actual class \ Predicted class  buys_computer = yes  buys_computer = no  Total
buys_computer = yes             6954                 46                  7000
buys_computer = no              412                  2588                3000
Total                           7366                 2634                10000
Classifier evaluation metrics:

A \ P  C    ¬C
C      TP   FN   P
¬C     FP   TN   N
       P'   N'   All

- Accuracy = (TP + TN)/All; Error rate = (FP + FN)/All
- Class imbalance problem: one class may be rare (e.g., fraud), so the significant majority of the tuples belong to the negative class
- Sensitivity: true positive recognition rate; Sensitivity = TP/P
- Specificity: true negative recognition rate; Specificity = TN/N
Precision (exactness: what fraction of tuples labeled positive are actually positive) and recall (completeness: what fraction of positive tuples are labeled positive):
– Precision = TP/(TP + FP) = 90/230 = 39.13%
– Recall = TP/(TP + FN) = 90/300 = 30.00%

Actual class \ Predicted class  cancer = yes  cancer = no  Total  Recognition (%)
cancer = yes                    90            210          300    30.00 (sensitivity)
cancer = no                     140           9560         9700   98.56 (specificity)
Total                           230           9770         10000  96.50 (accuracy)
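A small helper (illustrative naming) computes all of these metrics from the four confusion-matrix counts; run on the cancer table above, it reproduces the percentages shown:

```python
def metrics(tp, fn, fp, tn):
    """Evaluation metrics from the confusion-matrix counts defined above."""
    p, n = tp + fn, fp + tn          # actual positives / negatives
    total = p + n
    return {
        "accuracy":    (tp + tn) / total,
        "error_rate":  (fp + fn) / total,
        "sensitivity": tp / p,       # true positive recognition rate (= recall)
        "specificity": tn / n,       # true negative recognition rate
        "precision":   tp / (tp + fp),
        "recall":      tp / p,
    }

# Cancer example: TP = 90, FN = 210, FP = 140, TN = 9560.
for name, value in metrics(90, 210, 140, 9560).items():
    print(f"{name}: {value:.4f}")
```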
Bagging (bootstrap aggregation):
– Given a set D of d tuples, at each iteration i a training set Di of d tuples is sampled with replacement from D (i.e., a bootstrap sample)
– A classifier model Mi is learned from each training set Di
– Classification: each classifier Mi returns its class prediction, and the bagged classifier M* counts the votes and assigns the class with the most votes to X
– Prediction (continuous targets): take the average of each classifier's prediction for a given test tuple
– Accuracy: often significantly better than a single classifier derived from D; for noisy data, not considerably worse and more robust; proven improved accuracy in prediction (see the sketch below)
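A minimal sketch of the procedure, assuming scikit-learn's DecisionTreeClassifier as the base learner (any classifier could be plugged in):

```python
import random
from collections import Counter
from sklearn.tree import DecisionTreeClassifier  # assumed base learner

def bagging_fit(X, y, k):
    """Learn k classifiers, each from a bootstrap sample of (X, y)."""
    n, models = len(X), []
    for _ in range(k):
        idx = [random.randrange(n) for _ in range(n)]  # sample d tuples with replacement
        models.append(DecisionTreeClassifier().fit([X[i] for i in idx],
                                                   [y[i] for i in idx]))
    return models

def bagging_predict(models, x):
    """M*: each model votes; return the class with the most votes."""
    votes = Counter(m.predict([x])[0] for m in models)
    return votes.most_common(1)[0][0]
```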
Boosting (AdaBoost):
– Tuples from D are sampled (with replacement) to form a training set Di of the same size; each tuple's chance of being selected is based on its weight
– A classification model Mi is derived from Di, and its error rate is calculated using Di as a test set
– If a tuple is misclassified, its weight is increased; otherwise it is decreased
– The error rate of Mi is the sum of the weights of the misclassified tuples:

$$error(M_i) = \sum_{j=1}^{d} w_j \times err(X_j)$$

– The weight of classifier Mi's vote is

$$\log \frac{1 - error(M_i)}{error(M_i)}$$

A sketch of one boosting round follows.
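This sketch implements only the bookkeeping of one round, following the two formulas above; it assumes 0 < error(Mi) < 1 and leaves the base learner out:

```python
import math

def boost_round(weights, misclassified):
    """weights: tuple weights summing to 1; misclassified: parallel booleans
    from testing Mi on Di. Returns (new_weights, vote_weight_of_Mi)."""
    error = sum(w for w, bad in zip(weights, misclassified) if bad)
    alpha = math.log((1 - error) / error)  # weight of classifier Mi's vote
    # Scale up the misclassified tuples' weights, then renormalize; after
    # normalization this matches "increase if misclassified, o.w. decrease".
    new = [w * math.exp(alpha) if bad else w
           for w, bad in zip(weights, misclassified)]
    total = sum(new)
    return [w / total for w in new], alpha
```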
Random forest:
– Each classifier in the ensemble is a decision tree classifier, generated using a random selection of attributes at each node to determine the split; during classification, each tree votes and the most popular class is returned
– Forest-RI (random input selection): randomly select, at each node, F attributes as candidates for the split at that node; the CART methodology is used to grow the trees to maximum size
– Forest-RC (random linear combinations): creates new attributes (features) that are a linear combination of the existing attributes (reduces the correlation between individual classifiers)
– Insensitive to the number of attributes selected for consideration at each split, and faster than bagging or boosting
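In practice this is available off the shelf; a short example using scikit-learn's RandomForestClassifier on synthetic stand-in data, where max_features plays the role of F in Forest-RI:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for a real training set.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# 100 trees; at each node only sqrt(20) randomly chosen attributes are
# candidates for the split, and each tree votes at prediction time.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X, y)
print(forest.predict(X[:1]))  # majority vote over the trees
```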
Summary:
- Classification is a form of data analysis that extracts models describing important data classes
- Effective and scalable methods have been developed for decision tree induction, Bayes classification, and ensemble techniques
- Evaluation metrics include: accuracy, sensitivity, specificity, precision, and recall
- Stratified k-fold cross-validation is recommended for accuracy estimation
- There have been numerous comparisons of the different classification methods; the matter remains a research topic
- No single method has been found to be superior over all others for all data sets
- Issues such as accuracy, training time, robustness, scalability, and interpretability must be considered and can involve trade-offs
- References: http://hanj.cs.illinois.edu/