  • Data Mining Lecture 4: Classification 2


  • !"

# $!!

%! ! &!&!! '&! &!$! !!&! !()*+',+-./+


  • 0$(

# 123+4+5 &! 163+4+!7 # ! 23+8+4+!5 # 123+4.+5 &!!! # %!&!+ # %!&! &!! # %!&!+9

Training Data

Tid  Refund         Marital Status  Taxable Income  Cheat
     (categorical)  (categorical)   (continuous)    (class)
1    Yes            Single          125K            No
2    No             Married         100K            No
3    No             Single          70K             No
4    Yes            Married         120K            No
5    No             Divorced        95K             Yes
6    No             Married         60K             No
7    Yes            Divorced        220K            No
8    No             Single          85K             Yes
9    No             Married         75K             No
10   No             Single          90K             Yes

Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc)

Refund?
  Yes -> NO
  No  -> MarSt?
           Married          -> NO
           Single, Divorced -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES

A second decision tree for the same training data:

MarSt?
  Married          -> NO
  Single, Divorced -> Refund?
                        Yes -> NO
                        No  -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES

There could be more than one tree that fits the same data!


"

Training Set

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

[Figure: the Training Set is fed to the tree induction algorithm, which learns the model (induction), here a decision tree; the model is then applied to the Test Set (deduction).]

! #$%

Test Data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

Refund?
  Yes -> NO
  No  -> MarSt?
           Married          -> NO
           Single, Divorced -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES

Start from the root of tree.


Refund = No leads to the MarSt test, and MarSt = Married leads to a leaf labeled NO, so assign Cheat to “No” for the test record.
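The walk from the root to a leaf can be sketched in a few lines. This is a minimal illustration, not the lecture's own code: the function name `classify`, the dictionary keys, and income expressed in thousands are assumptions made for the example. The slide labels the income branches "< 80K" and "> 80K" only, so sending exactly 80K to the ">" side is also an assumption.

```python
def classify(record):
    """Walk the Refund -> MarSt -> TaxInc tree and return the predicted Cheat label."""
    if record["Refund"] == "Yes":
        return "No"
    if record["MarSt"] == "Married":
        return "No"
    # Single or Divorced: decide on taxable income (in thousands).
    # Assumption: a value of exactly 80 is treated as the "> 80K" branch.
    return "No" if record["TaxInc"] < 80 else "Yes"


# Test record from the slide: Refund = No, Married, 80K.
test_record = {"Refund": "No", "MarSt": "Married", "TaxInc": 80}
print(classify(test_record))
```

The test record never reaches the income test, because the Married branch ends in a leaf.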


&'()!

  • : !

!!

  • 0(

# ' ! !+ ! # ' + ! !+ # ' ! !+! ! . $ ! !.

[Figure: the training data table (Tid 1 to 10) labeled Dt, feeding a node marked “?”]

()!

[Figure: the tree grown in four stages on the training data (Tid 1 to 10)]

Stage 1: a single leaf.
  Don’t Cheat

Stage 2: split on Refund.
  Refund = Yes -> Don’t Cheat
  Refund = No  -> Don’t Cheat

Stage 3: split the Refund = No branch on Marital Status.
  Refund = Yes -> Don’t Cheat
  Refund = No  -> Marital Status
                    Single, Divorced -> Cheat
                    Married          -> Don’t Cheat

Stage 4: split the Single, Divorced branch on Taxable Income.
  Refund = Yes -> Don’t Cheat
  Refund = No  -> Marital Status
                    Single, Divorced -> Taxable Income
                                          < 80K  -> Don’t Cheat
                                          >= 80K -> Cheat
                    Married          -> Don’t Cheat


*%

# ; !! <.

'

# !& !

)& != )&! =

# &!


' !

Gender Height M F 1.0 2.5


(' #%+

# # > #

&

# 8& # ?&


' ,%-!. ?& ( @ $. A ( $$&. .

CarType, multi-way split: Family | Sports | Luxury

CarType, binary split: {Family, Luxury} vs {Sports}  OR  {Sports, Luxury} vs {Family}

' ,%%!. ?& ( @ $. A ( $$&. . B!! =

Size, multi-way split: Small | Medium | Large

Size, binary split: {Medium, Large} vs {Small}  OR  {Small, Medium} vs {Large}

Size, binary split: {Small, Large} vs {Medium}

&!

# <

  • ;# < !

# "$ C+""C +.

# A(6$≥ $

! $

' ,%!.

(i) Binary split: Taxable Income > 80K? with branches Yes | No

(ii) Multi-way split: Taxable Income? with branches < 10K | [10K,25K) | [25K,50K) | [50K,80K) | > 80K


Balanced Deep


*%* !; >; ; ; ; <


(%,'

Before Splitting: 10 records of class 0, 10 records of class 1

Own Car?
  Yes:    C0: 6, C1: 4
  No:     C0: 4, C1: 6

Car Type?
  Family: C0: 1, C1: 3
  Sports: C0: 8, C1: 0
  Luxury: C0: 1, C1: 7

Student ID?
  c1:     C0: 1, C1: 0
  ...
  c10:    C0: 1, C1: 0
  c11:    C0: 0, C1: 1
  ...
  c20:    C0: 0, C1: 1

Which test condition is the best?

(%,' 0 !(

# &!!

  • (

C0: 5, C1: 5  -> Non-homogeneous, high degree of impurity
C0: 9, C1: 1  -> Homogeneous, low degree of impurity

(/%,'

Before Splitting: impurity measure M0

Split on A?  Yes -> Node N1 (M1), No -> Node N2 (M2); combined measure M12
Split on B?  Yes -> Node N3 (M3), No -> Node N4 (M4); combined measure M34

Gain = M0 – M12 vs M0 – M34

$* #0&*-*

& *% $(

>%(p( j | t) !$"9.

# ?D3 3E&!" + # ?F.F&!+

$$GINI(t) = 1 - \sum_j [p(j \mid t)]^2$$

where p(j | t) is the relative frequency of class j at node t.

&*-*

  • C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
    Gini = 1 – P(C1)^2 – P(C2)^2 = 1 – 0 – 1 = 0
  • C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6
    Gini = 1 – (1/6)^2 – (5/6)^2 = 0.278
  • C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6
    Gini = 1 – (2/6)^2 – (4/6)^2 = 0.444
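The three node examples above can be checked with a short function. This is a minimal sketch; the name `gini` and the choice to pass class counts as a tuple are assumptions made for illustration.

```python
def gini(counts):
    """Gini index of a node, given the number of records in each class."""
    n = sum(counts)
    if n == 0:
        return 0.0  # convention chosen here for an empty node
    return 1.0 - sum((c / n) ** 2 for c in counts)


# (C1, C2) counts of the three example nodes.
for counts in [(0, 6), (1, 5), (2, 4)]:
    print(counts, round(gini(counts), 3))
```

A pure node scores 0, and the value grows as the classes become more evenly mixed.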


' ,%&*-* @ B! C !+!" ( &!+ 1!+ 1 .

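$$GINI_{split} = \sum_{i=1}^{k} \frac{n_i}{n} \, GINI(i)$$

where n_i is the number of records at child i and n is the number of records at the node being split.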
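As a rough illustration of the weighted formula, the sketch below scores the three candidate test conditions from the "Which test condition is the best?" example (Own Car, Car Type, Student ID), using the class counts given there. The helper names `gini` and `gini_split` and the dictionary layout are assumptions for this example.

```python
def gini(counts):
    """Gini index of a node, given the number of records in each class."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0


def gini_split(children):
    """Weighted Gini of a split; each child is a tuple of (C0, C1) counts."""
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * gini(child) for child in children)


# (C0, C1) counts per child, taken from the slide.
splits = {
    "Own Car": [(6, 4), (4, 6)],
    "Car Type": [(1, 3), (8, 0), (1, 7)],
    "Student ID": [(1, 0)] * 10 + [(0, 1)] * 10,
}
scores = {name: gini_split(children) for name, children in splits.items()}
for name, score in scores.items():
    print(name, round(score, 4))
```

Student ID gets the lowest score only because every child holds a single record, so a low weighted Gini by itself does not make a test condition useful.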


; & %B! (

# :!. B?

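Yes -> Node N1 (7 of 12 records: 5 of one class, 2 of the other)
No  -> Node N2 (5 of 12 records: 1 of one class, 4 of the other)

  • Gini(N1) = 1 – (5/7)^2 – (2/7)^2 = 0.408
  • Gini(N2) = 1 – (1/5)^2 – (4/5)^2 = 0.320
  • Gini(Children) = 7/12 * 0.408 + 5/12 * 0.320 = 0.371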

,#!.0 &*-**%


G!$+!! ! @!DC

Multi-way split: CarType = Family | Sports | Luxury (class counts C1, C2 per value)

Two-way split (find best partition of values): CarType = {Sports, Luxury} vs {Family}, or CarType = {Sports} vs {Family, Luxury} (class counts C1, C2 per subset)

!.0 &*-**%


@A $ ;$!! $

# $ 1$

%! $! D&!

# !! +6$≥ $

; !!$

# G!$+! !D 0 D # 'H &C.


!.0 &*-**%


  • G (!+

# ;!$ # :!$+! !D D # !! !!! D

Sorted Values

Cheat            No   No   No   Yes  Yes  Yes  No   No   No   No
Taxable Income   60   70   75   85   90   95   100  120  125  220

Split Positions (class counts on each side of the split, and Gini index of the split)

Split   Yes (<=)  Yes (>)  No (<=)  No (>)  Gini
55      0         3        0        7       0.420
65      0         3        1        6       0.400
72      0         3        2        5       0.375
80      0         3        3        4       0.343
87      1         2        3        4       0.417
92      2         1        3        4       0.400
97      3         0        3        4       0.300
110     3         0        4        3       0.343
122     3         0        5        2       0.375
172     3         0        6        1       0.400
230     3         0        7        0       0.420
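The search over split positions can be reproduced with a brute-force sketch that recounts both sides at every candidate, rather than updating a count matrix incrementally during a single scan. The candidate positions are the ones listed on the slide; the names `split_gini` and `best` are assumptions for this example.

```python
income = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]  # sorted, in K
cheat = ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No"]
positions = [55, 65, 72, 80, 87, 92, 97, 110, 122, 172, 230]


def gini(counts):
    """Gini index of a node, given the number of records in each class."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0


def split_gini(threshold):
    """Weighted Gini of the binary split income <= threshold vs income > threshold."""
    left = [c for v, c in zip(income, cheat) if v <= threshold]
    right = [c for v, c in zip(income, cheat) if v > threshold]
    total = len(income)
    score = 0.0
    for side in (left, right):
        counts = (side.count("Yes"), side.count("No"))
        score += len(side) / total * gini(counts)
    return score


best = min(positions, key=split_gini)
print(best, round(split_gini(best), 3))
```

Recounting costs more than an incremental scan over the sorted values, but it keeps the example short and makes each Gini value easy to check by hand.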

!.0 &*-**%


*

*%.%*#


*% B!!!&D + $. B!!!& !!!& !+$. *%1


*2 #

0$ 3+ 8+..+ &!3+ # ( % ! .

# # 1F


#

log (1/p) H(p,1-p)


*3 ! ! D . ',! ! $!!!(


(

Name       Gender  Height  Output1  Output2
Kristina   F       1.60    Short    Medium
Jim        M       2.02    Tall     Medium
Maggie     F       1.90    Medium   Tall
Martha     F       1.88    Medium   Tall
Stephanie  F       1.71    Short    Medium
Bob        M       1.85    Medium   Medium
Kathy      F       1.60    Short    Medium
Dave       M       1.72    Short    Medium
Worth      M       2.12    Tall     Tall
Steven     M       2.10    Tall     Tall
Debbie     F       1.78    Medium   Medium
Todd       M       1.95    Medium   Medium
Kim        F       1.89    Medium   Tall
Amy        F       1.81    Medium   Medium
Wynette    F       1.75    Medium   Medium

*3 4

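  • Starting state entropy (Output1: 4 Short, 8 Medium, 3 Tall; logarithms are base 10):
    4/15 log(15/4) + 8/15 log(15/8) + 3/15 log(15/3) = 0.4384
  • Gain using Gender:
    – Female: 3/9 log(9/3) + 6/9 log(9/6) = 0.2764
    – Male: 1/6 log(6/1) + 2/6 log(6/2) + 3/6 log(6/3) = 0.4392
    – Weighted sum: (9/15)(0.2764) + (6/15)(0.4392) = 0.34152
    – Gain: 0.4384 – 0.34152 = 0.09688
  • Gain using Height:
    0.4384 – (2/15)(0.301) = 0.3983
  • Height gives the larger gain, so it is chosen as the splitting attribute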
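The entropy and the gain for Gender can be recomputed from the Gender and Output1 columns of the table above. This is a sketch under stated assumptions: base-10 logarithms are assumed, and the names `entropy`, `records`, and `gain_gender` are illustrative.

```python
import math
from collections import Counter

# (Gender, Output1) pairs in table order, Kristina through Wynette.
records = [
    ("F", "Short"), ("M", "Tall"), ("F", "Medium"), ("F", "Medium"),
    ("F", "Short"), ("M", "Medium"), ("F", "Short"), ("M", "Short"),
    ("M", "Tall"), ("M", "Tall"), ("F", "Medium"), ("M", "Medium"),
    ("F", "Medium"), ("F", "Medium"), ("F", "Medium"),
]


def entropy(labels):
    """Entropy of a list of class labels, using base-10 logarithms (an assumption)."""
    n = len(labels)
    return sum(c / n * math.log10(n / c) for c in Counter(labels).values())


start = entropy([out for _, out in records])

# Weighted entropy of the two Gender subsets.
weighted = 0.0
for g in ("F", "M"):
    subset = [out for gender, out in records if gender == g]
    weighted += len(subset) / len(records) * entropy(subset)

gain_gender = start - weighted
print(round(start, 4), round(weighted, 4), round(gain_gender, 4))
```

Changing the logarithm base rescales every entropy by the same constant, so it does not change which attribute has the larger gain.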


567! ',$&!

$$$

' $$',(

# ? # # # # 0(

C!!


!80%8

A @ !! G! ++( :+ ! !& !!!.


!8 !+!D! !!"(

    – Φ(Gender) = 2(6/15)(9/15)(2/15 + 4/15 + 3/15) = 0.224
    – Φ(1.6) = 0
    – Φ(1.7) = 2(2/15)(13/15)(0 + 8/15 + 3/15) = 0.169
    – Φ(1.8) = 2(5/15)(10/15)(4/15 + 6/15 + 3/15) = 0.385
    – Φ(1.9) = 2(9/15)(6/15)(4/15 + 2/15 + 3/15) = 0.256
    – Φ(2.0) = 2(12/15)(3/15)(4/15 + 8/15 + 3/15) = 0.32

Split at 1.8


,% $(

# 'D $ # %DC& # % < # ! !"


,%#

[Figure: unit square with axes x and y, partitioned by a decision tree with root test x < 0.43? and second-level tests y < 0.47? (leaf class counts 4 : 0 and 0 : 4) and y < 0.33? (leaf class counts 0 : 3 and 4 : 0)]

  • Border line between two neighboring regions of different classes is known as decision boundary
  • Decision boundary is parallel to axes because test condition involves a single attribute at-a-time


.

x + y < 1

Class = + Class =

  • Test condition may involve multiple attributes
  • More expressive representation
  • Finding optimal test condition is computationally expensive


8

[Figure: decision tree with test nodes P, Q, R and S, in which the subtree testing Q and then S occurs twice]

  • Same subtree appears in multiple branches


8 '! (16+7

# +"

?!!" +. !(0+N+3+';?


&8


48!


48


8*'$!


8*'$


68

! &!! . C. !$ . >C

  • .


$9 ?&(

  • G

1

  • G
  • 1O

1 1O

@: :;; %'%:;;

$$\text{Accuracy} = \frac{a + d}{a + b + c + d} = \frac{TP + TN}{TP + TN + FP + FN}$$


!# 8

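    – Number of class 0 examples = 9990
    – Number of class 1 examples = 10

If the model predicts everything to be class 0, accuracy is 9990/10000 = 99.9%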
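A small sketch shows how accuracy, computed as correct predictions over all predictions, behaves on a heavily imbalanced data set. The scenario assumed here is 9990 records of class 0 and 10 of class 1, with a model that predicts class 0 for every record; treating class 1 as the positive class and the function name `accuracy` are assumptions for this example.

```python
def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)


# Predicting class 0 everywhere: all 9990 negatives are right,
# all 10 positives are missed, and nothing is predicted positive.
acc = accuracy(tp=0, tn=9990, fp=0, fn=10)
print(f"{acc:.1%}")
```

The score is very high even though not one class 1 record is detected, which is why accuracy alone can mislead on imbalanced classes.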

# 3D


!# '%( !

  • (%%0

# C&&

8E, 3E,

# %!

8%.: 0

# !!!CQ # >$C!$ ! .


!# ;:%:%0

# C&;+CD$ RS;3+;8+4+;C D" <Q # @!; # !$ $! !

::0

# $&!CT;T.


*!# ,0

# !R$SQ # &&.

,0

# !R$SQ # $&! !!*Q # &&.


*!#<%+ ' +!! ; ;'E> '