CS570 Introduction to Data Mining
Classification and Prediction
Partial slide credits: Han and Kamber; Tan, Steinbach, and Kumar
Overview
Decision tree induction
Bayesian classification
kNN classification
Support Vector Machines (SVM)
Neural Networks
Skin     Color   Size    Flesh   Conclusion
hairy    brown   large   hard    safe
hairy    green   large   hard    safe
smooth   …       small   hard    dangerous
hairy    green   large   soft    safe
smooth   red     …       soft    dangerous
hairy    green   large   hard    safe
Classification
Predicts categorical class labels; constructs a model based on the training set and the class labels, and uses it to classify new data
Prediction (Regression)
Models continuous-valued functions, i.e., predicts unknown or missing numeric values
Typical applications
Credit approval
Target marketing
Medical diagnosis
Fraud detection
Name     Age   Income   …   Credit
Clark    35    High     …   Excellent
Milton   38    High     …   Excellent
Neo      25    Medium   …   Fair
…        …     …        …   …
If age = "31…40" and income = high then credit_rating = excellent
Paul: age = 35, income = high ⇒ excellent credit rating
John: age = 20, income = medium ⇒ fair credit rating
Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
The set of tuples used for model construction is the training set
The model is represented as classification rules, decision trees, or mathematical formulae
Estimate the accuracy of the model
The known label of each test sample is compared with the classified result from the model
The accuracy rate is the percentage of test set samples that are correctly classified by the model
The test set is independent of the training set; otherwise over-fitting will occur
If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known (a sketch of this workflow follows)
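A minimal sketch of this evaluate-then-deploy workflow using scikit-learn (an assumption of this example; the iris dataset merely stands in for a labeled training set such as the credit data above):

# Minimal train/test evaluation sketch (assumes scikit-learn is installed).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)            # stand-in for a labeled training set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)     # test set kept independent of training set

model = DecisionTreeClassifier().fit(X_train, y_train)     # model construction
accuracy = accuracy_score(y_test, model.predict(X_test))   # compare known vs. predicted labels
print(f"accuracy on held-out test set: {accuracy:.3f}")

# If the accuracy is acceptable, model.predict(...) is then applied to
# tuples whose class labels are not known.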
Supervised learning (classification)
Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
New data is classified based on the training set
Unsupervised learning (clustering)
The class labels of the training data are unknown
Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
Accuracy
Speed
time to construct the model (training time)
time to use the model (classification/prediction time)
Robustness: handling noise and missing values
Scalability: efficiency for disk-resident databases
Interpretability
understanding and insight provided by the model
Other measures, e.g., goodness of rules, decision tree size, compactness of classification rules
Decision tree induction
Bayesian classification
kNN classification
Support Vector Machines (SVM)
Others
Training dataset:

age      income   student   credit_rating   buys_computer
<=30     high     no        fair            no
<=30     high     no        excellent       no
31…40    high     no        fair            yes
>40      medium   no        fair            yes
>40      low      yes       fair            yes
>40      low      yes       excellent       no
31…40    low      yes       excellent       yes
<=30     medium   no        fair            no
<=30     low      yes       fair            yes
>40      medium   yes       fair            yes
<=30     medium   yes       excellent       yes
31…40    medium   no        excellent       yes
31…40    high     yes       fair            yes
>40      medium   no        excellent       no
Basic algorithm (see the sketch after this list): the tree is constructed in a top-down, recursive partitioning manner
A test attribute is selected that "best" separates the data into partitions
Samples are partitioned recursively based on the selected attributes
Conditions for stopping partitioning:
All samples for a given node belong to the same class
There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf
There are no samples left
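A minimal sketch of the recursive partitioning loop described above, assuming tuples are Python dicts and that select_attribute is any caller-supplied attribute selection measure (e.g., information gain, introduced next):

# Top-down recursive partitioning for categorical attributes (sketch).
from collections import Counter

def majority_class(rows, label):
    return Counter(r[label] for r in rows).most_common(1)[0][0]

def build_tree(rows, attributes, label, select_attribute):
    """attributes: dict mapping attribute name -> set of possible values."""
    classes = {r[label] for r in rows}
    if len(classes) == 1:                     # all samples belong to the same class
        return classes.pop()
    if not attributes:                        # no remaining attributes: majority voting
        return majority_class(rows, label)
    best = select_attribute(rows, list(attributes), label)   # "best" separating attribute
    node = {"attribute": best, "branches": {}}
    remaining = {a: v for a, v in attributes.items() if a != best}
    for value in attributes[best]:
        subset = [r for r in rows if r[best] == value]
        if not subset:                        # no samples left: use majority of the parent
            node["branches"][value] = majority_class(rows, label)
        else:
            node["branches"][value] = build_tree(subset, remaining, label, select_attribute)
    return node

# e.g., with a trivial selector that just picks the first candidate attribute:
rows = [{"age": "<=30", "buys": "no"}, {"age": ">40", "buys": "yes"}]
attrs = {"age": {"<=30", "31..40", ">40"}}
print(build_tree(rows, attrs, "buys", lambda r, a, l: a[0]))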
Idea: select the attribute that partitions the samples into subsets that are as pure as possible
Measures
Information gain (ID3)
Gain ratio (C4.5)
Gini index (CART)
Select the attribute with the highest information gain
Let pi be the probability that an arbitrary tuple in D belongs to class Ci, estimated by |Ci,D|/|D|
Expected information (entropy) needed to classify a tuple in D:
    Info(D) = - Σ(i=1..m) pi log2(pi)
Information needed (after using attribute A to split D into v partitions) to classify D:
    InfoA(D) = Σ(j=1..v) (|Dj|/|D|) × Info(Dj)
Information gained by branching on attribute A:
    Gain(A) = Info(D) - InfoA(D)
Class P: buys_computer = "yes" (9 tuples); Class N: buys_computer = "no" (5 tuples)

Info(D) = I(9,5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

age      pi   ni   I(pi, ni)
<=30     2    3    0.971
31…40    4    0    0
>40      3    2    0.971

Infoage(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

I(2,3) reflects that "age <= 30" covers 5 of the 14 samples, with 2 yes's and 3 no's.

Gain(age) = Info(D) - Infoage(D) = 0.246

Similarly: Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048

A sketch reproducing these numbers follows.
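A minimal sketch, assuming the 14-tuple buys_computer table above, that reproduces these information-gain values:

# Information gain on the buys_computer example (sketch).
from collections import Counter
from math import log2

data = [  # (age, income, student, credit_rating, buys_computer)
    ("<=30","high","no","fair","no"), ("<=30","high","no","excellent","no"),
    ("31..40","high","no","fair","yes"), (">40","medium","no","fair","yes"),
    (">40","low","yes","fair","yes"), (">40","low","yes","excellent","no"),
    ("31..40","low","yes","excellent","yes"), ("<=30","medium","no","fair","no"),
    ("<=30","low","yes","fair","yes"), (">40","medium","yes","fair","yes"),
    ("<=30","medium","yes","excellent","yes"), ("31..40","medium","no","excellent","yes"),
    ("31..40","high","yes","fair","yes"), (">40","medium","no","excellent","no"),
]
attrs = ["age", "income", "student", "credit_rating"]

def info(labels):
    """Expected information (entropy) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain(rows, attr_index):
    """Information gain of splitting rows on the attribute at attr_index."""
    labels = [r[-1] for r in rows]
    partitions = Counter(r[attr_index] for r in rows)
    info_a = sum((cnt / len(rows)) * info([r[-1] for r in rows if r[attr_index] == v])
                 for v, cnt in partitions.items())
    return info(labels) - info_a

for i, a in enumerate(attrs):
    print(f"Gain({a}) = {gain(data, i):.3f}")   # age 0.246, income 0.029, student 0.151, credit_rating 0.048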
Let A be a continuous-valued attribute; the best split point for A must be determined
Sort the values of A in increasing order; typically, the midpoint between each pair of adjacent values is considered as a possible split point:
(ai + ai+1)/2 is the midpoint between the values of ai and ai+1
The point with the minimum expected information requirement for A is selected as the split point for A
Split: D1 is the set of tuples in D satisfying A ≤ split point, and D2 is the set of tuples satisfying A > split point (a sketch follows)
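A minimal sketch of split-point selection for a continuous attribute, using midpoints of adjacent sorted values and the expected information requirement (the example values at the end are illustrative, not from the slides):

# Best binary split point for a continuous attribute (sketch).
from collections import Counter
from math import log2

def info(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_split_point(values, labels):
    """values: continuous attribute values; labels: matching class labels."""
    pairs = sorted(zip(values, labels))
    best = (float("inf"), None)
    for i in range(len(pairs) - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue
        split = (pairs[i][0] + pairs[i + 1][0]) / 2       # midpoint of adjacent values
        left = [l for v, l in pairs if v <= split]        # D1: A <= split point
        right = [l for v, l in pairs if v > split]        # D2: A > split point
        expected = (len(left) / len(pairs)) * info(left) + \
                   (len(right) / len(pairs)) * info(right)
        best = min(best, (expected, split))
    return best[1]

# Illustrative only: ages with yes/no labels
print(best_split_point([22, 25, 30, 35, 40, 45], ["no", "no", "yes", "yes", "yes", "no"]))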
The information gain measure is biased towards attributes with a large number of values
C4.5 uses gain ratio to overcome the problem (a normalization of information gain), where SplitInfo measures the potential information generated by splitting D into v partitions:
GainRatio(A) = Gain(A)/SplitInfo(A)
The attribute with the maximum gain ratio is selected as the splitting attribute
SplitInfoA(D) = - Σ(j=1..v) (|Dj|/|D|) log2(|Dj|/|D|)
Ex.: SplitInfoincome(D) = -(4/14) log2(4/14) - (6/14) log2(6/14) - (4/14) log2(4/14) = 1.557,
so gain_ratio(income) = 0.029/1.557 ≈ 0.019
If a data set D contains examples from n classes, the gini index gini(D) is defined as
gini(D) = 1 - Σj pj²
where pj is the relative frequency of class j in D
If D is split on attribute A into two subsets D1 and D2, the gini index giniA(D) is defined as
giniA(D) = (|D1|/|D|) gini(D1) + (|D2|/|D|) gini(D2)
Reduction in impurity: Δgini(A) = gini(D) - giniA(D)
The attribute providing the smallest giniA(D) (or the largest reduction in impurity) is chosen to split the node

Ex.: D has 9 tuples in buys_computer = "yes" and 5 in "no":
gini(D) = 1 - (9/14)² - (5/14)² = 0.459
Suppose income partitions D into 10 tuples in D1: {low, medium} and 4 in D2: {high}:
gini income∈{low,medium}(D) = (10/14) gini(D1) + (4/14) gini(D2) = 0.443
Similarly, gini{low,high} = 0.458 and gini{medium,high} = 0.450, so the split on {low, medium} (and {high}) is the best since it has the lowest gini index (a sketch checking these values follows)
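A minimal sketch, using the income and buys_computer columns of the table above, that checks these Gini values:

# Gini index of the candidate binary splits on income (sketch).
labels = ["no","no","yes","yes","yes","no","yes","no","yes","yes","yes","yes","yes","no"]
income = ["high","high","high","medium","low","low","low","medium","low","medium","medium","medium","high","medium"]

def gini(ls):
    n = len(ls)
    return 1 - sum((ls.count(c) / n) ** 2 for c in set(ls))

def gini_split(subset):
    """Gini index of splitting on income into `subset` vs. the rest."""
    d1 = [l for l, v in zip(labels, income) if v in subset]
    d2 = [l for l, v in zip(labels, income) if v not in subset]
    n = len(labels)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)

print(f"gini(D) = {gini(labels):.3f}")                    # 0.459
for s in [{"low", "medium"}, {"low", "high"}, {"medium", "high"}]:
    print(sorted(s), f"{gini_split(s):.3f}")              # 0.443, 0.458, 0.450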
The three measures, in general, return good results, but each has its own bias:
Information gain:
biased towards multivalued attributes
Gain ratio:
tends to prefer unbalanced splits in which one partition is much smaller than the others
Gini index:
biased towards multivalued attributes
tends to favor tests that result in equal-sized partitions and purity in both partitions
CHAID: a popular decision tree algorithm; measure based on the χ² test for independence
MDL (Minimal Description Length) principle (i.e., the simplest solution is preferred):
The best tree is the one that requires the fewest number of bits to both (1) encode the tree, and (2) encode the exceptions to the tree
CART: finds multivariate splits based on a linear combination of attributes
Overfitting: An induced tree may overfit the training data
Too many branches, some of which may reflect anomalies due to noise or outliers, resulting in poor accuracy for unseen samples
Prepruning: Halt tree construction early—do not split a node if this would result in the goodness measure falling below a threshold
Difficult to choose an appropriate threshold
Postpruning: Remove branches from a "fully grown" tree, yielding a sequence of progressively pruned trees
Use a set of data different from the training data to decide which is the "best pruned tree"
Occam's razor: prefers smaller decision trees (simpler theories)
Allow for continuous-valued attributes
Dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals
Handle missing attribute values
Assign the most common value of the attribute
Assign a probability to each of the possible values
Attribute construction
Create new attributes based on existing ones that are sparsely represented
This reduces fragmentation, repetition, and replication
SLIQ (EDBT’96 — Mehta et al.)
Builds an index for each attribute; only the class list and the current attribute list reside in memory
SPRINT (VLDB’96 — J. Shafer et al.)
Constructs an attribute list data structure
PUBLIC (VLDB’98 — Rastogi & Shim)
Integrates tree splitting and tree pruning: stops growing the tree earlier
RainForest (VLDB’98 — Gehrke, Ramakrishnan & Ganti)
Builds an AVC-list (attribute, value, class label)
BOAT (PODS’99 — Gehrke, Ganti, Ramakrishnan & Loh)
Uses bootstrapping to create several small samples
Separates the scalability aspects from the criteria that determine the quality of the tree
AVC-set (of an attribute):
Projection of the training dataset onto the attribute and the class label, where counts of the individual class labels are aggregated
AVC-group (of a node):
Set of AVC-sets of all predictor attributes at the node
The training examples (the buys_computer table above) give the following AVC-sets; a sketch that builds them follows.

AVC-set on Age:
age      yes   no
<=30     2     3
31…40    4     0
>40      3     2

AVC-set on income:
income   yes   no
high     2     2
medium   4     2
low      3     1

AVC-set on Student:
student  yes   no
yes      6     1
no       3     4

AVC-set on credit_rating:
credit_rating   yes   no
fair            6     2
excellent       3     3
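A minimal sketch, assuming the same 14-tuple buys_computer table, that builds the AVC-sets (and thus the AVC-group of the root node):

# Building AVC-sets: class-label counts per value of each attribute (sketch).
from collections import Counter

data = [  # (age, income, student, credit_rating, buys_computer)
    ("<=30","high","no","fair","no"), ("<=30","high","no","excellent","no"),
    ("31..40","high","no","fair","yes"), (">40","medium","no","fair","yes"),
    (">40","low","yes","fair","yes"), (">40","low","yes","excellent","no"),
    ("31..40","low","yes","excellent","yes"), ("<=30","medium","no","fair","no"),
    ("<=30","low","yes","fair","yes"), (">40","medium","yes","fair","yes"),
    ("<=30","medium","yes","excellent","yes"), ("31..40","medium","no","excellent","yes"),
    ("31..40","high","yes","fair","yes"), (">40","medium","no","excellent","no"),
]
attrs = ["age", "income", "student", "credit_rating"]

def avc_set(rows, attr_index):
    """AVC-set: counts of each class label per value of one attribute."""
    counts = Counter((r[attr_index], r[-1]) for r in rows)
    values = {r[attr_index] for r in rows}
    return {v: {"yes": counts[(v, "yes")], "no": counts[(v, "no")]} for v in values}

# AVC-group of the root node: one AVC-set per predictor attribute
for i, a in enumerate(attrs):
    print(a, avc_set(data, i))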
Uses a statistical technique called bootstrapping to create several smaller samples (subsets), each fitting in memory
Each subset is used to create a tree, resulting in several trees
These trees are examined and used to construct a new tree T'
It turns out that T' is very close to the tree that would be generated using the whole data set
Advantage: requires only two scans of the database; an incremental algorithm
Relatively faster learning speed than other classification methods
Convertible to simple and easy-to-understand classification rules
Comparable classification accuracy with other methods
Decision tree induction
Bayesian classification
kNN classification
Support Vector Machines (SVM)
Others
A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities
Foundation: based on Bayes' theorem
Naïve Bayesian classifier: independence assumption
Bayesian networks: concept; using a Bayesian network; training/learning a Bayesian network
Bayes' theorem (rule, law) relates the conditional and marginal probabilities of two random events:
P(H|X) = P(X|H) P(H) / P(X)
P(H) is the prior probability of H
P(H|X) is the conditional (posterior) probability of H given X
P(X|H) is the conditional probability of X given H
P(X) is the prior probability of X
Example: Bowl A holds 10 chocolate and 30 plain cookies; Bowl B holds 20 chocolate and 20 plain cookies. Pick a bowl, then pick a cookie. If it is a plain cookie, what is the probability that it was picked out of Bowl A? (A worked computation follows.)
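Assuming each bowl is picked with equal probability 1/2 (an assumption, since the example does not state it), Bayes' theorem gives:
P(Bowl A | plain) = P(plain | Bowl A) P(Bowl A) / P(plain)
= (30/40)(1/2) / [ (30/40)(1/2) + (20/40)(1/2) ]
= 0.375 / 0.625 = 0.6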
Naïve Bayesian (also called idiot Bayesian or simple Bayesian) classifier
Let D be a training set of tuples and their associated class labels; each tuple is represented by an n-dimensional attribute vector X = (x1, x2, …, xn)
Suppose there are m classes C1, C2, …, Cm
Classification is to derive the maximum a posteriori, i.e., the maximal P(Ci|X)
By Bayes' theorem, P(Ci|X) = P(X|Ci) P(Ci) / P(X)
Since P(X) is constant for all classes, only P(X|Ci) P(Ci) needs to be maximized
A simplified assumption: attributes are conditionally independent given the class (i.e., no dependence relations between attributes):
P(X|Ci) = Π(k=1..n) P(xk|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci)
This greatly reduces the computation cost: only the class distributions need to be counted
If Ak is categorical, P(xk|Ci) is the number of tuples in Ci having value xk for Ak, divided by |Ci,D| (the number of tuples of Ci in D)
If Ak is continuous-valued, P(xk|Ci) is usually computed based on a Gaussian distribution with mean μ and standard deviation σ:
g(x, μ, σ) = (1 / (√(2π) σ)) exp( -(x - μ)² / (2σ²) )
and P(xk|Ci) = g(xk, μCi, σCi)
Classes: C1: buys_computer = "yes"; C2: buys_computer = "no"
Data sample to classify: X = (age <= 30, income = medium, student = yes, credit_rating = fair)
P(buys_computer = "yes") = 9/14 = 0.643
P(buys_computer = "no") = 5/14 = 0.357
P(age = "<=30" | buys_computer = "yes") = 2/9 = 0.222
P(age = "<=30" | buys_computer = "no") = 3/5 = 0.6
P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444
P(income = "medium" | buys_computer = "no") = 2/5 = 0.4
P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
P(student = "yes" | buys_computer = "no") = 1/5 = 0.2
P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
P(credit_rating = "fair" | buys_computer = "no") = 2/5 = 0.4

P(X | buys_computer = "yes") = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
P(X | buys_computer = "no") = 0.6 × 0.4 × 0.2 × 0.4 = 0.019
P(X | buys_computer = "yes") × P(buys_computer = "yes") = 0.028
P(X | buys_computer = "no") × P(buys_computer = "no") = 0.007
Therefore, X belongs to class buys_computer = "yes" (a sketch reproducing these numbers follows)
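A minimal sketch, assuming the 14-tuple buys_computer table, that reproduces the naïve Bayesian computation for X:

# Naive Bayesian classification of X on the buys_computer example (sketch).
data = [  # (age, income, student, credit_rating, buys_computer)
    ("<=30","high","no","fair","no"), ("<=30","high","no","excellent","no"),
    ("31..40","high","no","fair","yes"), (">40","medium","no","fair","yes"),
    (">40","low","yes","fair","yes"), (">40","low","yes","excellent","no"),
    ("31..40","low","yes","excellent","yes"), ("<=30","medium","no","fair","no"),
    ("<=30","low","yes","fair","yes"), (">40","medium","yes","fair","yes"),
    ("<=30","medium","yes","excellent","yes"), ("31..40","medium","no","excellent","yes"),
    ("31..40","high","yes","fair","yes"), (">40","medium","no","excellent","no"),
]
attrs = ["age", "income", "student", "credit_rating"]
x = {"age": "<=30", "income": "medium", "student": "yes", "credit_rating": "fair"}

def naive_bayes(rows, x):
    labels = [r[-1] for r in rows]
    scores = {}
    for c in set(labels):
        rows_c = [r for r in rows if r[-1] == c]
        score = len(rows_c) / len(rows)                      # prior P(Ci)
        for i, a in enumerate(attrs):
            # class-conditional P(xk | Ci), estimated by counting
            score *= sum(1 for r in rows_c if r[i] == x[a]) / len(rows_c)
        scores[c] = score                                    # P(X|Ci) * P(Ci)
    return max(scores, key=scores.get), scores

print(naive_bayes(data, x))   # ('yes', {'yes': ~0.028, 'no': ~0.007})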
Advantages
Fast to train and use
Can be highly effective in most cases
Disadvantages
Based on a false assumption: class conditional independence, which may cause loss of accuracy in practice
("Idiot's Bayesian, not so stupid after all?" David J. Hand and Keming Yu, International Statistical Review, 2001)
How to deal with dependencies? Bayesian belief networks
A Bayesian network (also called a belief network, Bayesian belief network, or probabilistic network) is a graphical model that represents a set of variables and their probabilistic independencies
One of the most significant contributions in AI
Used for classification and reasoning
Applications include recognition and diagnostic systems
Example network over Boolean variables A, B, C, D, with arcs A → B, B → C, and B → D, and the following conditional probability tables (CPTs):

P(A):
A       P
false   0.6
true    0.4

P(B | A):
A       B       P
false   false   0.01
false   true    0.99
true    false   0.7
true    true    0.3

P(C | B):
B       C       P
false   false   0.4
false   true    0.6
true    false   0.9
true    true    0.1

P(D | B):
B       D       P
false   false   0.02
false   true    0.98
true    false   0.05
true    true    0.95
For a Boolean variable with k Boolean parents, how many probabilities need to be stored?
The full joint probability factors according to the network:
P(A = true, B = true, C = true, D = true)
= P(A = true) × P(B = true | A = true) × P(C = true | B = true) × P(D = true | B = true)
= 0.4 × 0.3 × 0.1 × 0.95 = 0.0114
(A sketch computing this from the CPTs follows.)
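A minimal sketch that encodes the CPTs given above as Python dictionaries and computes joint probabilities (and, by enumeration, a simple posterior):

# Joint probability from the A -> B, B -> C, B -> D network's CPTs (sketch).
from itertools import product

P_A = {True: 0.4, False: 0.6}
P_B_given_A = {(False, False): 0.01, (False, True): 0.99,
               (True, False): 0.7,  (True, True): 0.3}    # key: (A, B)
P_C_given_B = {(False, False): 0.4, (False, True): 0.6,
               (True, False): 0.9,  (True, True): 0.1}    # key: (B, C)
P_D_given_B = {(False, False): 0.02, (False, True): 0.98,
               (True, False): 0.05, (True, True): 0.95}   # key: (B, D)

def joint(a, b, c, d):
    """P(A=a, B=b, C=c, D=d) = P(a) * P(b|a) * P(c|b) * P(d|b)."""
    return P_A[a] * P_B_given_A[(a, b)] * P_C_given_B[(b, c)] * P_D_given_B[(b, d)]

print(joint(True, True, True, True))          # 0.4 * 0.3 * 0.1 * 0.95 = 0.0114

# Inference by enumeration, e.g. P(B = true | D = true):
num = sum(joint(a, True, c, True) for a, c in product([True, False], repeat=2))
den = sum(joint(a, b, c, True) for a, b, c in product([True, False], repeat=3))
print(num / den)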
The conditional probability table (CPT) for variable LungCancer has one entry for each combination of its parents' values: (FH, S), (FH, ¬S), (¬FH, S), (¬FH, ¬S), where FH = FamilyHistory and S = Smoker
Using the Bayesian Network: P(LungCancer | Smoker, PXRay, Dyspnea)?
Using a Bayesian network to compute probabilities is called inference
General form: P(X | E), where E is the evidence
Exact inference is feasible in small to medium-sized networks
Exact inference in large networks takes a very long time
Approximate inference techniques, which are much faster and usually give good results, are often used instead
Example (C = Cloudy, S = Sprinkler, R = Rain, W = WetGrass):
Joint probability: P(C, S, R, W) = P(C) × P(S|C) × P(R|C) × P(W|S,R)
Suppose the grass is wet; which is more likely, sprinkler or rain?
Several scenarios:
Given both the network structure and all variables observable: learn only the CPT entries
Network structure known, some variables hidden: gradient descent (greedy hill-climbing) method, analogous to neural network learning
Network structure unknown, all variables observable: search through the model space to reconstruct the network topology
Unknown structure, all variables hidden: no good algorithms are known for this purpose
Bayesian networks (directed graphical models)
Markov networks (undirected graphical models)
Conditional random field
Applications:
Sequential data
Natural language text
Protein sequences
Decision tree induction
Bayesian classification
kNN classification
Support Vector Machines (SVM)
Neural Networks