CS570 Introduction to Data Mining: Classification and Prediction


slide-1
SLIDE 1

CS570 Introduction to Data Mining

Classification and Prediction

Partial slide credits: Han and Kamber; Tan, Steinbach, Kumar

1

slide-2
SLIDE 2
  • Overview
  • Classification algorithms and methods
      Decision tree induction
      Bayesian classification
      kNN classification
      Support Vector Machines (SVM)
      Neural Networks
  • Regression
  • Evaluation and measures
  • Ensemble methods

2

slide-3
SLIDE 3

Example (Li Xiong): a training set of creatures described by Skin, Color, Size, and Flesh, labeled with a Conclusion (Safe / Dangerous). Recoverable rows:

Skin     Color   Size    Flesh   Conclusion
Hairy    Brown   Large   Hard    Safe
Hairy    Green   Large   Hard    Safe
Hairy    Green   Large   Soft    Safe
Smooth   …       Small   Hard    Dangerous
Smooth   Red     …       Soft    Dangerous
…        Red     Large   …       ?

3

slide-4
SLIDE 4

Classification
    predicts categorical class labels
    constructs a model based on the training set and uses it in classifying new data

Prediction (Regression)
    models continuous-valued functions, i.e., predicts unknown or missing values

Typical applications
    Credit approval, target marketing, medical diagnosis, fraud detection

4

slide-5
SLIDE 5

Name     Age   Income   …   Credit
Clark    35    High     …   Excellent
Milton   38    High     …   Excellent
Neo      25    Medium   …   Fair
…        …     …        …   …

  • Classification rule:
      If age = “31...40” and income = high then credit_rating = excellent

  • Future customers
      Paul: age = 35, income = high ⇒ excellent credit rating
      John: age = 20, income = medium ⇒ fair credit rating

5
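To make the rule concrete, here is a minimal sketch (not from the slides) of applying the mined rule to the two future customers; the function name and the fallback "fair" rating are illustrative assumptions.

```python
# Hedged sketch: applying the mined classification rule to new customers.
# The rule and the two customers come from the slide; the function name and
# the fallback "fair" rating are assumptions made for illustration.

def predict_credit_rating(age, income):
    """Apply: if age = 31...40 and income = high then credit_rating = excellent."""
    if 31 <= age <= 40 and income == "high":
        return "excellent"
    return "fair"  # assumed default for customers the rule does not cover

print(predict_credit_rating(35, "high"))    # Paul -> excellent
print(predict_credit_rating(20, "medium"))  # John -> fair
```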

slide-6
SLIDE 6

Classification: A Two-Step Process

  • Model construction: describing a set of predetermined classes
      Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
      The set of tuples used for model construction is the training set
      The model is represented as classification rules, decision trees, or mathematical formulae

  • Model usage: for classifying future or unknown objects
      Estimate the accuracy of the model
          The known label of each test sample is compared with the classified result from the model
          The accuracy rate is the percentage of test set samples that are correctly classified by the model
          The test set is independent of the training set, otherwise over-fitting will occur
      If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known

6
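A minimal sketch of the two-step process; scikit-learn and its iris dataset are illustrative assumptions, not prescribed by the slides. A model is constructed on the training set, then its accuracy is estimated on an independent test set.

```python
# Sketch of model construction + model usage with accuracy estimation.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Step 1: model construction on the training set only
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Step 2: model usage -- compare known test labels with the model's predictions;
# the test set is kept independent of the training set
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```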

slide-7
SLIDE 7

Process (1): Model Construction

(Figure: a model is constructed from the labeled training set by a learning algorithm.)

7
slide-8
SLIDE 8

Process (2): Using the Model in Prediction

(Figure: the learned model is applied to test/new data.)

8
slide-9
SLIDE 9

Supervised vs. Unsupervised Learning

Supervised learning (classification)
    Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
    New data is classified based on the training set

Unsupervised learning (clustering)
    The class labels of the training data are unknown
    Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

9

slide-10
SLIDE 10

Evaluating Classification Methods

Accuracy
Speed
    time to construct the model (training time)
    time to use the model (classification/prediction time)
Robustness: handling noise and missing values
Scalability: efficiency in disk-resident databases
Interpretability
    understanding and insight provided by the model
Other measures, e.g., goodness of rules, decision tree size, or compactness of classification rules

10

slide-11
SLIDE 11
  • Overview
  • Classification algorithms and methods
      Decision tree
      Bayesian classification
      kNN classification
      Support Vector Machines (SVM)
      Others
  • Evaluation and measures
  • Ensemble methods

11

slide-12
SLIDE 12

Training Dataset

age      income   student   credit_rating   buys_computer
<=30     high     no        fair            no
<=30     high     no        excellent       no
31…40    high     no        fair            yes
>40      medium   no        fair            yes
>40      low      yes       fair            yes
>40      low      yes       excellent       no
31…40    low      yes       excellent       yes
<=30     medium   no        fair            no
<=30     low      yes       fair            yes
>40      medium   yes       fair            yes
<=30     medium   yes       excellent       yes
31…40    medium   no        excellent       yes
31…40    high     yes       fair            yes
>40      medium   no        excellent       no

12

slide-13
SLIDE 13

Output: A Decision Tree for buys_computer

(Figure: the induced tree splits on age at the root; age <= 30 leads to a test on student (no → no, yes → yes), age 31…40 predicts yes, and age > 40 leads to a test on credit_rating (excellent → no, fair → yes).)

13
slide-14
SLIDE 14

Algorithm for Decision Tree Induction

  • ID3 (Iterative Dichotomiser), C4.5, by Quinlan
  • CART (Classification and Regression Trees)
  • Basic algorithm (a greedy algorithm) – the tree is constructed by top-down recursive partitioning (see the sketch below)
      At start, all the training examples are at the root
      A test attribute is selected that “best” separates the data into partitions
      Samples are partitioned recursively based on the selected attributes
  • Conditions for stopping partitioning
      All samples for a given node belong to the same class
      There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf
      There are no samples left

14
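The sketch below illustrates the greedy, top-down recursive partitioning and the stopping conditions; the `best_attribute` selection function and the (attribute-dict, label) data layout are assumptions for illustration, not the textbook's code.

```python
# Illustrative skeleton of top-down recursive partitioning with the stopping
# conditions from the slide. `best_attribute` stands in for any attribute
# selection measure (information gain, gain ratio, Gini index).
from collections import Counter

def build_tree(samples, attributes, best_attribute):
    labels = [label for _, label in samples]
    if len(set(labels)) == 1:               # stop: all samples belong to one class
        return labels[0]
    if not attributes:                      # stop: no attributes left -> majority vote
        return Counter(labels).most_common(1)[0][0]
    attr = best_attribute(samples, attributes)     # pick the "best" split attribute
    node = {"attribute": attr, "children": {}}
    for value in {x[attr] for x, _ in samples}:    # partition on the chosen attribute
        subset = [(x, c) for x, c in samples if x[attr] == value]
        remaining = [a for a in attributes if a != attr]
        # the "no samples left" case cannot arise here, because branches are
        # created only for attribute values that actually occur in `samples`
        node["children"][value] = build_tree(subset, remaining, best_attribute)
    return node
```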

slide-15
SLIDE 15

Attribute Selection Measures

Idea: select the attribute that partitions the samples into the most homogeneous groups

Measures
    Information gain (ID3)
    Gain ratio (C4.5)
    Gini index (CART)

15

slide-16
SLIDE 16

Attribute Selection Measure: Information Gain (ID3/C4.5)

  • Select the attribute with the highest information gain
  • Let pi be the probability that an arbitrary tuple in D belongs to class Ci, estimated by |Ci,D|/|D|
  • Information (entropy) needed to classify a tuple in D (before the split):

      Info(D) = − Σ_{i=1..m} pi log2(pi)

  • Information needed (after using A to split D into v partitions) to classify D:

      Info_A(D) = Σ_{j=1..v} (|Dj|/|D|) × Info(Dj)

  • Information gain – the difference between before and after splitting on attribute A:

      Gain(A) = Info(D) − Info_A(D)

16
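A small sketch of these formulas in code; the function names and the (attribute-dict, label) dataset layout are assumptions for illustration.

```python
# Entropy / information gain exactly as defined above, with log base 2.
from collections import Counter
from math import log2

def info(labels):
    """Info(D) = -sum_i p_i * log2(p_i)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_after_split(samples, attr):
    """Info_A(D) = sum_j |D_j|/|D| * Info(D_j) for a split on attribute attr."""
    n = len(samples)
    partitions = {}
    for x, label in samples:                     # samples: (attribute-dict, label)
        partitions.setdefault(x[attr], []).append(label)
    return sum(len(p) / n * info(p) for p in partitions.values())

def gain(samples, attr):
    """Gain(A) = Info(D) - Info_A(D)."""
    return info([label for _, label in samples]) - info_after_split(samples, attr)
```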

slide-17
SLIDE 17

Attribute Selection: Information Gain (Example)

Class P: buys_computer = “yes” (9 tuples); Class N: buys_computer = “no” (5 tuples), using the training data shown earlier.

Info(D) = I(9,5) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940

Partitioning on age:

age      pi   ni   I(pi, ni)
<=30     2    3    0.971
31…40    4    0    0
>40      3    2    0.971

Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

Gain(age) = Info(D) − Info_age(D) = 0.246

Similarly, Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048, so age is selected as the splitting attribute.

17
slide-18
SLIDE 18

Computing Information Gain for Continuous-Valued Attributes

Let attribute A be a continuous-valued attribute; we must determine the best split point for A
    Sort the values of A in increasing order
    Typically, the midpoint between each pair of adjacent values is considered as a possible split point
        (ai + ai+1)/2 is the midpoint between the values of ai and ai+1
    The point with the minimum expected information requirement for A is selected as the split-point for A

Split:
    D1 is the set of tuples in D satisfying A ≤ split-point, and
    D2 is the set of tuples in D satisfying A > split-point

18
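A sketch of the midpoint search for a continuous attribute; helper names are assumptions for illustration.

```python
# For a continuous attribute A: sort the values, evaluate the midpoint of
# every adjacent pair, and keep the split point with the minimum expected
# information requirement Info_A(D).
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_split_point(values, labels):
    pairs = sorted(zip(values, labels))
    best_info, best_split = float("inf"), None
    for i in range(len(pairs) - 1):
        split = (pairs[i][0] + pairs[i + 1][0]) / 2       # (a_i + a_{i+1}) / 2
        d1 = [c for v, c in pairs if v <= split]          # D1: A <= split-point
        d2 = [c for v, c in pairs if v > split]           # D2: A >  split-point
        if not d1 or not d2:                              # skip degenerate midpoints
            continue
        expected = (len(d1) * entropy(d1) + len(d2) * entropy(d2)) / len(pairs)
        if expected < best_info:
            best_info, best_split = expected, split
    return best_split, best_info
```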

slide-19
SLIDE 19

Attribute Selection Measure: Gain Ratio (C4.5)

The information gain measure is biased towards attributes with a large number of values (number of splits)

C4.5 uses gain ratio to overcome the problem (a normalization of information gain):

    SplitInfo_A(D) = − Σ_{j=1..v} (|Dj|/|D|) × log2(|Dj|/|D|)

    GainRatio(A) = Gain(A) / SplitInfo_A(D)

Ex. For income (4 low, 6 medium, 4 high tuples):

    SplitInfo_income(D) = −(4/14) log2(4/14) − (6/14) log2(6/14) − (4/14) log2(4/14)

    gain_ratio(income) = 0.029/0.926 = 0.031

The attribute with the maximum gain ratio is selected as the splitting attribute

19
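A sketch of the gain-ratio normalization, building on the information-gain functions sketched earlier; names and data layout are assumptions.

```python
# C4.5 gain ratio: normalise information gain by the split information.
from collections import Counter
from math import log2

def split_info(samples, attr):
    """SplitInfo_A(D) = -sum_j |D_j|/|D| * log2(|D_j|/|D|)."""
    n = len(samples)
    sizes = Counter(x[attr] for x, _ in samples).values()
    return -sum((s / n) * log2(s / n) for s in sizes)

def gain_ratio(samples, attr, gain):
    """GainRatio(A) = Gain(A) / SplitInfo_A(D); `gain` is an information-gain
    function such as the one sketched earlier."""
    return gain(samples, attr) / split_info(samples, attr)
```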
slide-20
SLIDE 20

Attribute Selection Measure: Gini Index (CART)

  • If a data set D contains examples from n classes, the gini index gini(D) is defined as

      gini(D) = 1 − Σ_{j=1..n} pj²

    where pj is the relative frequency of class j in D
  • If a data set D is split on A into two subsets D1 and D2, the gini index of the split is defined as

      gini_A(D) = (|D1|/|D|) gini(D1) + (|D2|/|D|) gini(D2)

  • Reduction in impurity:

      Δgini(A) = gini(D) − gini_A(D)

  • The attribute that provides the smallest gini_A(D) (or the largest reduction in impurity) is chosen to split the node

20
slide-21
SLIDE 21

Gini Index: Example

  • Ex. D has 9 tuples with buys_computer = “yes” and 5 with “no”:

      gini(D) = 1 − (9/14)² − (5/14)² = 0.459

  • Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 in D2: {high}:

      gini_{income ∈ {low,medium}}(D) = (10/14) gini(D1) + (4/14) gini(D2) = gini_{income ∈ {high}}(D)

    but gini_{income ∈ {medium,high}} is 0.30 and thus the best since it is the lowest

21
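A sketch reproducing the Gini arithmetic for this example; the 7/3 and 2/2 class counts inside the two income partitions are read off the training data shown earlier.

```python
# Gini index of D (9 "yes", 5 "no") and of the binary income split
# D1 = {low, medium} (10 tuples) vs D2 = {high} (4 tuples).
def gini(labels):
    """gini(D) = 1 - sum_j p_j^2."""
    n = len(labels)
    counts = {}
    for c in labels:
        counts[c] = counts.get(c, 0) + 1
    return 1 - sum((k / n) ** 2 for k in counts.values())

def gini_split(d1, d2):
    """gini_A(D) = |D1|/|D| * gini(D1) + |D2|/|D| * gini(D2)."""
    n = len(d1) + len(d2)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)

D = ["yes"] * 9 + ["no"] * 5
D1 = ["yes"] * 7 + ["no"] * 3          # income in {low, medium}
D2 = ["yes"] * 2 + ["no"] * 2          # income = high
print(round(gini(D), 3))               # 0.459
print(round(gini_split(D1, D2), 3))    # 0.443
```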

slide-22
SLIDE 22

Comparing Attribute Selection Measures

The three measures, in general, return good results, but:

Information gain:
    biased towards multivalued attributes

Gain ratio:
    tends to prefer unbalanced splits in which one partition is much smaller than the others

Gini index:
    biased towards multivalued attributes
    tends to favor tests that result in equal-sized partitions and purity in both partitions

22

slide-23
SLIDE 23

Other Attribute Selection Measures

  • CHAID: a popular decision tree algorithm, measure based on the χ2 test for independence
  • C-SEP: performs better than information gain and gini index in certain cases
  • G-statistic: has a close approximation to the χ2 distribution
  • MDL (Minimal Description Length) principle (i.e., the simplest solution is preferred):
      The best tree is the one that requires the fewest number of bits to both (1) encode the tree and (2) encode the exceptions to the tree
  • Multivariate splits (partition based on multiple variable combinations)
      CART: finds multivariate splits based on a linear combination of attributes
  • Which attribute selection measure is the best?
      Most give good results; none is significantly superior to the others

23

slide-24
SLIDE 24

Overfitting

Overfitting: an induced tree may overfit the training data
    Too many branches, some of which may reflect anomalies and noise

  • Tan, Steinbach, Kumar

24

slide-25
SLIDE 25

25

slide-26
SLIDE 26
  • Two approaches to avoid overfitting (see the pruning sketch below)

      Prepruning: halt tree construction early – do not split a node if this would result in the goodness measure falling below a threshold
          Difficult to choose an appropriate threshold

      Postpruning: remove branches from a “fully grown” tree
          Use a set of data different from the training data to decide which is the “best pruned tree”

      Occam's razor: prefer smaller decision trees (simpler theories) over larger ones

26
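As one concrete post-pruning illustration (an assumption, not the specific procedure on the slide), scikit-learn's cost-complexity pruning grows a full tree, enumerates pruned candidates, and lets held-out data pick the best pruned tree:

```python
# Post-pruning sketch: grow a full tree, enumerate cost-complexity pruned
# candidates, and choose the one that scores best on held-out data.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
candidates = [DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
              for a in path.ccp_alphas]
best = max(candidates, key=lambda t: t.score(X_val, y_val))   # validation data decides
print("leaves in the chosen pruned tree:", best.get_n_leaves())
```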

slide-27
SLIDE 27

Enhancements to Basic Decision Tree Induction

Allow for continuous-valued attributes
    Dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals

Handle missing attribute values
    Assign the most common value of the attribute
    Assign a probability to each of the possible values

Attribute construction
    Create new attributes based on existing ones that are sparsely represented
    This reduces fragmentation, repetition, and replication

27

slide-28
SLIDE 28

Scalable Decision Tree Induction Methods

SLIQ (EDBT’96 — Mehta et al.)
    Builds an index for each attribute; only the class list and the current attribute list reside in memory

SPRINT (VLDB’96 — J. Shafer et al.)
    Constructs an attribute list data structure

PUBLIC (VLDB’98 — Rastogi & Shim)
    Integrates tree splitting and tree pruning: stop growing the tree earlier

RainForest (VLDB’98 — Gehrke, Ramakrishnan & Ganti)
    Builds an AVC-list (attribute, value, class label)

BOAT (PODS’99 — Gehrke, Ganti, Ramakrishnan & Loh)
    Uses bootstrapping to create several small samples

28

slide-29
SLIDE 29

RainForest

Separates the scalability aspects from the criteria that determine the quality of the tree

Builds an AVC-list (Attribute, Value, Class label)

AVC-set (of an attribute)
    Projection of the training dataset onto the attribute and the class label, where counts of individual class labels are aggregated

AVC-group (of a node)
    Set of AVC-sets of all predictor attributes at the node

29

slide-30
SLIDE 30

RainForest: Training Set and Its AVC Sets

Training examples: the 14-tuple buys_computer dataset shown earlier.

AVC-set on age:
    age       buys_computer: yes   no
    <=30                     2     3
    31…40                    4     0
    >40                      3     2

AVC-set on income:
    income    buys_computer: yes   no
    high                     2     2
    medium                   4     2
    low                      3     1

AVC-set on student:
    student   buys_computer: yes   no
    yes                      6     1
    no                       3     4

AVC-set on credit_rating:
    credit_rating   buys_computer: yes   no
    fair                           6     2
    excellent                      3     3

30

slide-31
SLIDE 31

BOAT (Bootstrapped Optimistic Algorithm for Tree Construction)

Uses a statistical technique called bootstrapping to create several smaller samples (subsets), each of which fits in memory

Each subset is used to create a tree, resulting in several trees

These trees are examined and used to construct a new tree, which turns out to be very close to the tree that would be generated using the whole data set

Adv: requires only two scans of the DB; an incremental algorithm

31

slide-32
SLIDE 32

Why Decision Tree Induction?

Relatively fast learning speed (compared with other classification methods)

Convertible to simple and easy-to-understand classification rules

Comparable classification accuracy with other methods

32

slide-33
SLIDE 33
  • Overview
  • Classification algorithms and methods
      Decision tree induction
      Bayesian classification
      kNN classification
      Support Vector Machines (SVM)
      Others
  • Evaluation and measures
  • Ensemble methods

33

slide-34
SLIDE 34

Bayesian Classification

A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities

Foundation: based on Bayes’ theorem

Naïve Bayesian classifier: independence assumption

Bayesian networks
    Concept
    Using a Bayesian network
    Training/learning a Bayesian network

34

slide-35
SLIDE 35

Bayes' Theorem

Bayes' theorem/rule/law relates the conditional and marginal probabilities of stochastic events:

    P(H|X) = P(X|H) P(H) / P(X)

    P(H) is the prior probability of H
    P(H|X) is the conditional (posterior) probability of H given X
    P(X|H) is the conditional probability of X given H
    P(X) is the prior probability of X

  • Cookie example:
      Bowl A: 10 chocolate + 30 plain; Bowl B: 20 chocolate + 20 plain
      Pick a bowl at random, and then pick a cookie
      If it's a plain cookie, what's the probability the cookie was picked out of bowl A?

      P(A | plain) = P(plain | A) P(A) / P(plain) = (3/4 × 1/2) / (5/8) = 0.6

35
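A short check of the cookie example, with the counts taken from the slide:

```python
# P(Bowl A | plain) = P(plain | A) * P(A) / P(plain)
p_A = p_B = 0.5                      # a bowl is picked at random
p_plain_A = 30 / 40                  # Bowl A: 10 chocolate + 30 plain
p_plain_B = 20 / 40                  # Bowl B: 20 chocolate + 20 plain
p_plain = p_plain_A * p_A + p_plain_B * p_B
print(p_plain_A * p_A / p_plain)     # 0.6
```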

slide-36
SLIDE 36

Naïve Bayes Classifier

Naïve Bayesian / idiot Bayesian / simple Bayesian

Let D be a training set of tuples and their associated class labels, where each tuple is represented by an n-dimensional attribute vector X = (x1, x2, …, xn)

Suppose there are m classes C1, C2, …, Cm

Classification is to derive the maximum posteriori, i.e., the maximal P(Ci|X)

By Bayes' theorem, P(Ci|X) = P(X|Ci) P(Ci) / P(X)

Since P(X) is constant for all classes, it suffices to maximize P(X|Ci) P(Ci)

36
slide-37
SLIDE 37

Derivation of the Naïve Bayes Classifier

A simplified assumption: attributes are conditionally independent given the class (i.e., no dependence relation between attributes):

    P(X|Ci) = Π_{k=1..n} P(xk|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci)

If Ak is categorical, P(xk|Ci) is the number of tuples in Ci having value xk for Ak, divided by |Ci,D| (the number of tuples of Ci in D)

If Ak is continuous-valued, P(xk|Ci) is usually computed based on a Gaussian distribution with mean μ and standard deviation σ:

    g(x, μ, σ) = (1 / (√(2π) σ)) exp(−(x − μ)² / (2σ²))

    P(xk|Ci) = g(xk, μ_Ci, σ_Ci)

37
slide-38
SLIDE 38

Naïve Bayes Classifier: Training Dataset

Classes: C1: buys_computer = ‘yes’; C2: buys_computer = ‘no’

Training data: the 14-tuple buys_computer dataset shown earlier

Data sample to classify: X = (age <= 30, income = medium, student = yes, credit_rating = fair)

38

slide-39
SLIDE 39

Naïve Bayes Classifier: An Example

  • P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643
           P(buys_computer = “no”) = 5/14 = 0.357

  • Compute P(X|Ci) for each class:
      P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
      P(age = “<=30” | buys_computer = “no”) = 3/5 = 0.6
      P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
      P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
      P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
      P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
      P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
      P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4

  • X = (age <= 30, income = medium, student = yes, credit_rating = fair)

      P(X|Ci):  P(X | buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
                P(X | buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019

      P(X|Ci) * P(Ci):  P(X | buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028
                        P(X | buys_computer = “no”) * P(buys_computer = “no”) = 0.007

      Therefore, X belongs to class “buys_computer = yes”

39
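The same arithmetic as a short script, with all probabilities taken from the slide:

```python
# Naive Bayes posterior comparison for
# X = (age <= 30, income = medium, student = yes, credit_rating = fair).
p_yes, p_no = 9 / 14, 5 / 14
like_yes = (2 / 9) * (4 / 9) * (6 / 9) * (6 / 9)     # P(X | yes) ~ 0.044
like_no = (3 / 5) * (2 / 5) * (1 / 5) * (2 / 5)      # P(X | no)  ~ 0.019
print(round(like_yes * p_yes, 3))                    # 0.028
print(round(like_no * p_no, 3))                      # 0.007 -> predict "yes"
```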

slide-40
SLIDE 40

Naïve Bayes Classifier: Comments

Advantages
    Fast to train and use
    Can be highly effective in most cases

Disadvantages
    Based on a false assumption: class conditional independence – in practice, dependencies exist among variables

Idiot's Bayesian, not so stupid after all? David J. Hand, Keming Yu, International Statistical Review, 2001

How to deal with dependencies? Bayesian belief networks

40

slide-41
SLIDE 41

Bayesian Belief Networks

(Figure: an example belief network.)

41

slide-42
SLIDE 42

Bayesian Belief Networks

  • Bayesian belief networks (belief networks, Bayesian networks, probabilistic networks): a graphical model that represents a set of variables and their probabilistic independencies

  • One of the most significant contributions in AI

  • Trained Bayesian networks can be used for classification and reasoning

  • Many applications: spam filtering, speech recognition, diagnostic systems

42

slide-43
SLIDE 43

Bayesian Network: An Example

(Figure: a network over Boolean variables A, B, C, D with edges A → B, B → C, and B → D; each node stores a conditional probability table.)

P(A)
    A = false   0.6
    A = true    0.4

P(B | A)
    A = false:  B = false  0.01    B = true  0.99
    A = true:   B = false  0.7     B = true  0.3

P(C | B)
    B = false:  C = false  0.4     C = true  0.6
    B = true:   C = false  0.9     C = true  0.1

P(D | B)
    B = false:  D = false  0.02    D = true  0.98
    B = true:   D = false  0.05    D = true  0.95

43

slide-44
SLIDE 44


44

slide-45
SLIDE 45

Conditional Probability Tables

Each variable has a conditional probability table (CPT) giving its distribution conditioned on its parents; the CPTs for A, B, C, D are shown on the previous slide.

  • Each row of a CPT adds up to 1

  • For a Boolean variable with k Boolean parents, how many probabilities need to be stored?

45
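A tiny sketch answering the storage question: each CPT row corresponds to one configuration of the k Boolean parents, so 2^k independent probabilities suffice (the P(X = false | …) entries follow because each row sums to 1).

```python
# Number of stored probabilities for a Boolean node with k Boolean parents.
for k in range(4):
    rows = 2 ** k                 # one row per parent configuration
    print(f"k = {k}: {rows} independent probabilities "
          f"({2 * rows} table entries, rows sum to 1)")
```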

slide-46
SLIDE 46

A Bayesian network:

1. Encodes the conditional independence relationships between the variables in the graph structure

2. Is a compact representation of the joint probability distribution over the variables

46

slide-47
SLIDE 47
  • The Markov condition: given its parents (P1, P2), a node (X) is conditionally independent of its non-descendants (ND1, ND2)

(Figure: node X with parents P1, P2 and non-descendants ND1, ND2.)

47

slide-48
SLIDE 48

Computing the Joint Probability Distribution

Due to the Markov condition, we can compute the joint probability distribution over all the variables X1, …, Xn in the Bayesian net using the formula:

    P(X1 = x1, …, Xn = xn) = Π_{i=1..n} P(Xi = xi | Parents(Xi))

  • Example:
      P(A = true, B = true, C = true, D = true)
        = P(A = true) * P(B = true | A = true) * P(C = true | B = true) * P(D = true | B = true)
        = (0.4) * (0.3) * (0.1) * (0.95)

48
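The same product, evaluated (CPT values from the example slide):

```python
# P(A=true, B=true, C=true, D=true) = P(A) * P(B|A) * P(C|B) * P(D|B)
p = 0.4 * 0.3 * 0.1 * 0.95
print(p)   # 0.0114
```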
slide-49
SLIDE 49

Bayesian Belief Network: An Example

(Figure: a belief network whose variables include Smoker, LungCancer, PXRay, and Dyspnea; the conditional probability table (CPT) for the variable LungCancer lists P(LungCancer | parents) for each combination of its parents' values.)

Using the Bayesian network: P(LungCancer | Smoker, PXRay, Dyspnea)?

49

slide-50
SLIDE 50

Inference in Bayesian Networks

Using a Bayesian network to compute probabilities is called inference

General form: P(X | E), where X is the query variable(s) and E is the evidence variable(s)

Exact inference is feasible in small to medium-sized networks

Exact inference in large networks takes a very long time

Approximate inference techniques are much faster and give pretty good results

(A small enumeration-based inference sketch follows below.)

50
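A minimal exact-inference sketch by enumeration, using the small A → B → C, B → D network and the CPTs from the earlier example slide; the query P(B | D = true) is chosen only for illustration.

```python
# Exact inference by enumeration on the A -> B -> C, B -> D example network.
from itertools import product

def p_a(a):      return 0.4 if a else 0.6
def p_b_a(b, a): return {(True, True): 0.3,  (True, False): 0.99,
                         (False, True): 0.7, (False, False): 0.01}[(b, a)]
def p_c_b(c, b): return {(True, True): 0.1,  (True, False): 0.6,
                         (False, True): 0.9, (False, False): 0.4}[(c, b)]
def p_d_b(d, b): return {(True, True): 0.95, (True, False): 0.98,
                         (False, True): 0.05, (False, False): 0.02}[(d, b)]

def joint(a, b, c, d):
    return p_a(a) * p_b_a(b, a) * p_c_b(c, b) * p_d_b(d, b)

# General form P(X | E): query X = B, evidence E = {D = true}.
num = sum(joint(a, True, c, True) for a, c in product([True, False], repeat=2))
den = sum(joint(a, b, c, True) for a, b, c in product([True, False], repeat=3))
print("P(B = true | D = true) =", round(num / den, 3))
```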

slide-51
SLIDE 51

Joint probability: P(C, S, R, W) = P(C) * P(S|C) * P(R|C) * P(W|S,R)

Suppose the grass is wet, which is more likely?

51

slide-52
SLIDE 52

Training Bayesian Networks

Several scenarios:

    Given both the network structure and all variables observable: learn only the CPTs

    Network structure known, some hidden variables: gradient descent (greedy hill-climbing) method, analogous to neural network learning

    Network structure unknown, all variables observable: search through the model space to reconstruct the network topology

    Unknown structure, all hidden variables: no good algorithms known for this purpose

  • Ref. D. Heckerman: Bayesian networks for data mining

52

slide-53
SLIDE 53

Graphical Models

Bayesian networks (directed graphical models)
Markov networks (undirected graphical models)
    Conditional random fields

Applications: sequential data
    Natural language text
    Protein sequences

53

slide-54
SLIDE 54
  • Overview
  • Classification algorithms and methods
      Decision tree induction
      Bayesian classification
      kNN classification
      Support Vector Machines (SVM)
      Neural Networks
  • Regression
  • Evaluation and measures
  • Ensemble methods

54