

SLIDE 1

CS145: INTRODUCTION TO DATA MINING

4: Vector Data: Decision Tree

Instructor: Yizhou Sun
yzsun@cs.ucla.edu

October 10, 2017

SLIDE 2

Methods to Learn

2

|                         | Vector Data                                              | Set Data           | Sequence Data   | Text Data            |
|-------------------------|----------------------------------------------------------|--------------------|-----------------|----------------------|
| Classification          | Logistic Regression; Decision Tree; KNN; SVM; NN         |                    |                 | Naïve Bayes for Text |
| Clustering              | K-means; hierarchical clustering; DBSCAN; Mixture Models |                    |                 | PLSA                 |
| Prediction              | Linear Regression; GLM*                                  |                    |                 |                      |
| Frequent Pattern Mining |                                                          | Apriori; FP growth | GSP; PrefixSpan |                      |
| Similarity Search       |                                                          |                    | DTW             |                      |

SLIDE 3

Vector Data: Trees

  • Tree-based Prediction and Classification
  • Classification Trees
  • Regression Trees*
  • Random Forest
  • Summary

3

SLIDE 4

Tree-based Models

  • Use trees to partition the data into different regions and make predictions

4

[Figure: decision tree with root node "age?" branching on <=30, 31..40, >40. The <=30 branch leads to internal node "student?" (no → no, yes → yes); the 31..40 branch is a "yes" leaf; the >40 branch leads to internal node "credit rating?" (fair → yes, excellent → no). Labels mark the root node, the internal nodes, and the leaf nodes.]

SLIDE 5

Easy to Interpret

  • A path from the root to a leaf node corresponds to a rule
  • E.g., if age <= 30 and student = no, then target value = no

5

[Figure: the same decision tree as on the previous slide.]
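To make the rule reading concrete, here is a minimal Python sketch (not from the slides) of the example tree written as nested if/else statements; the function name and the string encodings of the attribute values are illustrative assumptions.

```python
# A hypothetical encoding of the example tree: each root-to-leaf path is one rule.
def predict_buys_xbox(age, student, credit_rating):
    """Return "yes"/"no" following the example decision tree."""
    if age <= 30:                      # branch: age <= 30 -> test student
        return "yes" if student == "yes" else "no"
    elif age <= 40:                    # branch: 31..40 -> leaf "yes"
        return "yes"
    else:                              # branch: age > 40 -> test credit rating
        return "yes" if credit_rating == "fair" else "no"

print(predict_buys_xbox(25, "no", "fair"))   # -> "no" (age<=30 and student=no)
```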

SLIDE 6

Vector Data: Trees

  • Tree-based Prediction and Classification
  • Classification Trees
  • Regression Trees*
  • Random Forest
  • Summary

6

SLIDE 7

Decision Tree Induction: An Example

7

  • Training data set: Buys_xbox
  • The data set follows an example of Quinlan's ID3 (Playing Tennis)
  • Resulting tree: see the figure below

| age   | income | student | credit_rating | buys_xbox |
|-------|--------|---------|---------------|-----------|
| <=30  | high   | no      | fair          | no        |
| <=30  | high   | no      | excellent     | no        |
| 31…40 | high   | no      | fair          | yes       |
| >40   | medium | no      | fair          | yes       |
| >40   | low    | yes     | fair          | yes       |
| >40   | low    | yes     | excellent     | no        |
| 31…40 | low    | yes     | excellent     | yes       |
| <=30  | medium | no      | fair          | no        |
| <=30  | low    | yes     | fair          | yes       |
| >40   | medium | yes     | fair          | yes       |
| <=30  | medium | yes     | excellent     | yes       |
| 31…40 | medium | no      | excellent     | yes       |
| 31…40 | high   | yes     | fair          | yes       |
| >40   | medium | no      | excellent     | no        |

[Figure: the resulting decision tree. Root "age?"; <=30 → "student?" (no → no, yes → yes); 31..40 → yes; >40 → "credit rating?" (fair → yes, excellent → no).]

SLIDE 8

How to choose attributes?

8

[Figure: the 14 training samples split by Age (<=30: 2 yes / 3 no; 31…40: 4 yes / 0 no; >40: 3 yes / 2 no) vs. split by Credit_Rating (excellent: 3 yes / 3 no; fair: 6 yes / 2 no).]

Q: Which attribute is better for the classification task?

SLIDE 9

Brief Review of Entropy

  • Entropy (Information Theory)
  • A measure of uncertainty (impurity) associated with a random variable
  • Calculation: for a discrete random variable Y taking m distinct values $\{y_1, \ldots, y_m\}$,
  • $H(Y) = -\sum_{i=1}^{m} p_i \log(p_i)$, where $p_i = P(Y = y_i)$

  • Interpretation:
  • Higher entropy => higher uncertainty
  • Lower entropy => lower uncertainty

[Figure: entropy curve for a binary variable (m = 2) as a function of p.]

9
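As a quick check of the formula above, here is a minimal Python sketch (not from the slides); the `entropy` helper and the use of NumPy are illustrative assumptions.

```python
import numpy as np

def entropy(labels):
    """H(Y) = -sum_i p_i log2(p_i) for a list/array of discrete labels."""
    _, counts = np.unique(np.asarray(labels), return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# 9 "yes" and 5 "no", as in the buys_xbox data: entropy is about 0.940
print(entropy(["yes"] * 9 + ["no"] * 5))
```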

SLIDE 10

Conditional Entropy

  • How much uncertainty remains in Y if we know an attribute X?
  • $H(Y \mid X) = \sum_{x} P(X = x)\, H(Y \mid X = x)$

10

[Figure: the 14 samples split by Age (<=30: 2 yes / 3 no; 31…40: 4 yes / 0 no; >40: 3 yes / 2 no). The conditional entropy is the weighted average of the entropy at each branch!]
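A matching sketch (not from the slides) of the conditional entropy as the weighted average of branch entropies; it reuses the hypothetical `entropy` helper from the previous sketch.

```python
import numpy as np

def conditional_entropy(attribute_values, labels):
    """H(Y|X): weighted average of the label entropy within each branch of X."""
    attribute_values = np.asarray(attribute_values)
    labels = np.asarray(labels)
    total = len(labels)
    h = 0.0
    for v in np.unique(attribute_values):
        branch = labels[attribute_values == v]     # samples falling in this branch
        h += len(branch) / total * entropy(branch)
    return h
```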

SLIDE 11

Attribute Selection Measure: Information Gain (ID3/C4.5)

  • Select the attribute with the highest information gain
  • Let $p_i$ be the probability that an arbitrary tuple in D belongs to class $C_i$, estimated by $|C_{i,D}|/|D|$
  • Expected information (entropy) needed to classify a tuple in D: $Info(D)$
  • Information needed (after using A to split D into v partitions) to classify D (conditional entropy): $Info_A(D)$
  • Information gained by branching on attribute A: $Gain(A)$

11

$$Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$$

$$Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)$$

$$Gain(A) = Info(D) - Info_A(D)$$

SLIDE 12

Attribute Selection: Information Gain

Class P: buys_xbox = "yes"; Class N: buys_xbox = "no"

$$Info(D) = I(9,5) = -\frac{9}{14}\log_2\left(\frac{9}{14}\right) - \frac{5}{14}\log_2\left(\frac{5}{14}\right) = 0.940$$

| age   | p_i | n_i | I(p_i, n_i) |
|-------|-----|-----|-------------|
| <=30  | 2   | 3   | 0.971       |
| 31…40 | 4   | 0   | 0           |
| >40   | 3   | 2   | 0.971       |

$$Info_{age}(D) = \frac{5}{14}I(2,3) + \frac{4}{14}I(4,0) + \frac{5}{14}I(3,2) = 0.694$$

Here $\frac{5}{14}I(2,3)$ means "age <= 30" has 5 out of 14 samples, with 2 yes'es and 3 no's. Hence

$$Gain(age) = Info(D) - Info_{age}(D) = 0.246$$

Similarly,

$$Gain(income) = 0.029, \quad Gain(student) = 0.151, \quad Gain(credit\_rating) = 0.048$$

[The 14-tuple buys_xbox training table from Slide 7 is repeated here.]

12
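Putting the previous sketches together (not from the slides), information gain is just the entropy minus the conditional entropy, and the Gain(age) value above can be reproduced; the helper names are assumptions, and the two columns below are copied from the buys_xbox training table.

```python
def information_gain(attribute_values, labels):
    """Gain(A) = Info(D) - Info_A(D), reusing the earlier sketched helpers."""
    return entropy(labels) - conditional_entropy(attribute_values, labels)

# age column and buys_xbox labels of the 14 training tuples
age  = ["<=30", "<=30", "31...40", ">40", ">40", ">40", "31...40",
        "<=30", "<=30", ">40", "<=30", "31...40", "31...40", ">40"]
buys = ["no", "no", "yes", "yes", "yes", "no", "yes",
        "no", "yes", "yes", "yes", "yes", "yes", "no"]

print(information_gain(age, buys))   # ~0.2467, the Gain(age) = 0.246 above (rounding)
```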

SLIDE 13

Attribute Selection for a Branch

13

[Figure: partial tree with root "age?"; the 31..40 branch is already a "yes" leaf, while the <=30 and >40 branches are still open ("?"). Which attribute next?]

Subset of the training data with age <= 30:

| age  | income | student | credit_rating | buys_xbox |
|------|--------|---------|---------------|-----------|
| <=30 | high   | no      | fair          | no        |
| <=30 | high   | no      | excellent     | no        |
| <=30 | medium | no      | fair          | no        |
| <=30 | low    | yes     | fair          | yes       |
| <=30 | medium | yes     | excellent     | yes       |

For the branch $D_{age \le 30}$:

  • $Info(D_{age \le 30}) = -\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5} = 0.971$
  • $Gain_{age \le 30}(income) = Info(D_{age \le 30}) - Info_{income}(D_{age \le 30}) = 0.571$
  • $Gain_{age \le 30}(student) = 0.971$
  • $Gain_{age \le 30}(credit\_rating) = 0.02$

[Figure: student is selected next for the age <= 30 branch. Updated tree: root "age?"; <=30 → "student?" (no → no, yes → yes); 31..40 → yes; >40 → still "?".]

SLIDE 14

Algorithm for Decision Tree Induction

  • Basic algorithm (a greedy algorithm); a simplified code sketch follows below
  • Tree is constructed in a top-down recursive divide-and-conquer manner
  • At start, all the training examples are at the root
  • Attributes are categorical (if continuous-valued, they are discretized in advance)
  • Examples are partitioned recursively based on selected attributes
  • Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
  • Conditions for stopping partitioning
  • All samples for a given node belong to the same class
  • There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf
  • There are no samples left – use majority voting in the parent partition

14
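The greedy recursion above can be sketched in a few lines of Python. This is a simplified illustration rather than the instructor's code: it reuses the hypothetical `information_gain` helper from the earlier sketches, represents internal nodes as dicts, and falls back to majority voting as in the stopping conditions.

```python
from collections import Counter

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def build_tree(rows, labels, attributes):
    """rows: list of dicts {attribute: value}; labels: parallel list of classes."""
    if len(set(labels)) == 1:              # all samples belong to the same class
        return labels[0]
    if not attributes or not rows:         # no attributes (or no samples) left
        return majority(labels)
    # greedily pick the attribute with the highest information gain
    best = max(attributes,
               key=lambda a: information_gain([r[a] for r in rows], labels))
    node = {"attribute": best, "branches": {}}
    remaining = [a for a in attributes if a != best]
    for value in set(r[best] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        node["branches"][value] = build_tree([rows[i] for i in idx],
                                             [labels[i] for i in idx],
                                             remaining)
    return node
```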

SLIDE 15

Computing Information-Gain for Continuous-Valued Attributes

  • Let attribute A be a continuous-valued attribute
  • Must determine the best split point for A
  • Sort the values of A in increasing order
  • Typically, the midpoint between each pair of adjacent values is considered as a possible split point
  • (a_i + a_{i+1})/2 is the midpoint between the values of a_i and a_{i+1}
  • The point with the minimum expected information requirement for A is selected as the split point for A
  • Split:
  • D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is the set of tuples in D satisfying A > split-point

15
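A minimal sketch (not from the slides) of the midpoint search described above, reusing the hypothetical `conditional_entropy` helper from the earlier sketch; the function name is an assumption.

```python
def best_split_point(values, labels):
    """Try midpoints of adjacent sorted values; keep the one with the minimum
    expected information (conditional entropy) of the resulting binary split."""
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    xs = [v for v, _ in pairs]
    ys = [y for _, y in pairs]
    best_point, best_info = None, float("inf")
    for a, b in zip(xs, xs[1:]):
        if a == b:
            continue
        midpoint = (a + b) / 2.0
        sides = ["<=" if v <= midpoint else ">" for v in xs]
        info = conditional_entropy(sides, ys)
        if info < best_info:
            best_point, best_info = midpoint, info
    return best_point, best_info
```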

SLIDE 16

Gain Ratio for Attribute Selection (C4.5)

  • The information gain measure is biased towards attributes with a large number of values
  • C4.5 (a successor of ID3) uses gain ratio to overcome the problem (a normalization of information gain)
  • GainRatio(A) = Gain(A) / SplitInfo(A)
  • Ex.
  • gain_ratio(income) = 0.029 / 1.557 = 0.019
  • The attribute with the maximum gain ratio is selected as the splitting attribute

$$SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \log_2\left(\frac{|D_j|}{|D|}\right)$$

16
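For illustration only (not from the slides), SplitInfo and GainRatio can be sketched as follows, reusing the hypothetical `information_gain` helper from earlier.

```python
import numpy as np

def split_info(attribute_values):
    """SplitInfo_A(D) = -sum_j (|D_j|/|D|) log2(|D_j|/|D|)."""
    _, counts = np.unique(np.asarray(attribute_values), return_counts=True)
    frac = counts / counts.sum()
    return -np.sum(frac * np.log2(frac))

def gain_ratio(attribute_values, labels):
    return information_gain(attribute_values, labels) / split_info(attribute_values)
```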

SLIDE 17

Gini Index (CART, IBM IntelligentMiner)

  • If a data set D contains examples from n classes, the gini index gini(D) is defined as below, where $p_j$ is the relative frequency of class j in D
  • If a data set D is split on A into two subsets D1 and D2, the gini index of the split, $gini_A(D)$, is defined as the weighted sum of the subsets' gini indices
  • Reduction in impurity: $\Delta gini(A)$
  • The attribute that provides the smallest $gini_{split}(D)$ (or the largest reduction in impurity) is chosen to split the node (need to enumerate all the possible splitting points for each attribute)

$$gini(D) = 1 - \sum_{j=1}^{n} p_j^2$$

$$gini_A(D) = \frac{|D_1|}{|D|}\, gini(D_1) + \frac{|D_2|}{|D|}\, gini(D_2)$$

$$\Delta gini(A) = gini(D) - gini_A(D)$$

17
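A minimal sketch (not from the slides) of the Gini index and of the Gini index of a binary split, matching the formulas above; the function names are assumptions.

```python
import numpy as np

def gini(labels):
    """gini(D) = 1 - sum_j p_j^2."""
    _, counts = np.unique(np.asarray(labels), return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_split(labels_d1, labels_d2):
    """gini_A(D) for a binary split of D into D1 and D2."""
    n1, n2 = len(labels_d1), len(labels_d2)
    n = n1 + n2
    return n1 / n * gini(labels_d1) + n2 / n * gini(labels_d2)
```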

SLIDE 18

Computation of Gini Index

  • Ex. D has 9 tuples in buys_computer = "yes" and 5 in "no"

$$gini(D) = 1 - \left(\frac{9}{14}\right)^2 - \left(\frac{5}{14}\right)^2 = 0.459$$

  • Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 tuples in D2: {high}

$$gini_{income \in \{low,\, medium\}}(D) = \frac{10}{14}\, gini(D_1) + \frac{4}{14}\, gini(D_2) = 0.443$$

  • Gini_{low,high} is 0.458 and Gini_{medium,high} is 0.450. Thus, split on {low, medium} (and {high}) since it has the lowest Gini index

18
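Using the sketched `gini`/`gini_split` helpers from the previous block, the numbers in this example can be checked; the per-partition class counts below are read off the training table on Slide 7.

```python
labels = ["yes"] * 9 + ["no"] * 5        # all 14 tuples
print(round(gini(labels), 3))            # -> 0.459

d1 = ["yes"] * 7 + ["no"] * 3            # income in {low, medium}: 10 tuples
d2 = ["yes"] * 2 + ["no"] * 2            # income = high: 4 tuples
print(round(gini_split(d1, d2), 3))      # -> 0.443, the lowest of the three splits
```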

SLIDE 19

Comparing Attribute Selection Measures

  • The three measures, in general, return good results, but:
  • Information gain:
  • biased towards multivalued attributes
  • Gain ratio:
  • tends to prefer unbalanced splits in which one partition is much smaller than the others (why?)
  • Gini index:
  • biased towards multivalued attributes

19

SLIDE 20

*Other Attribute Selection Measures

  • CHAID: a popular decision tree algorithm; measure based on the χ2 test for independence
  • C-SEP: performs better than info. gain and gini index in certain cases
  • G-statistic: has a close approximation to the χ2 distribution
  • MDL (Minimal Description Length) principle (i.e., the simplest solution is preferred):
  • The best tree is the one that requires the fewest # of bits to both (1) encode the tree, and (2) encode the exceptions to the tree
  • Multivariate splits (partition based on multiple variable combinations)
  • CART: finds multivariate splits based on a linear comb. of attrs.
  • Which attribute selection measure is the best?
  • Most give good results, none is significantly superior to the others

20

SLIDE 21

Overfitting and Tree Pruning

  • Overfitting: an induced tree may overfit the training data
  • Too many branches, some may reflect anomalies due to noise or outliers
  • Poor accuracy for unseen samples
  • Two approaches to avoid overfitting
  • Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold
  • Difficult to choose an appropriate threshold
  • Postpruning: remove branches from a "fully grown" tree to get a sequence of progressively pruned trees
  • Use a validation dataset to decide which is the "best pruned tree"

21

SLIDE 22

Vector Data: Trees

  • Tree-based Prediction and Classification
  • Classification Trees
  • Regression Trees*
  • Random Forest
  • Summary

22

SLIDE 23

From Classification to Prediction

  • Target variable
  • From a categorical variable to a continuous variable
  • Attribute selection criterion
  • Measure the purity of the continuous target variable in each partition
  • Leaf node
  • A simple model for that partition, e.g., the average

23

SLIDE 24

Attribute Selection

  • Reduction of variance
  • For attribute A, the weighted average variance is

$$Var_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|}\, Var(D_j),$$

where the variance within partition $D_j$ is

$$Var(D_j) = \frac{1}{|D_j|} \sum_{y \in D_j} (y - \bar{y}_j)^2, \qquad \bar{y}_j = \frac{1}{|D_j|} \sum_{y \in D_j} y$$

  • Pick the attribute with the lowest weighted average variance

24
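A minimal sketch (not from the slides) of the weighted average variance $Var_A(D)$ used as the splitting criterion for regression trees; the function name is an assumption.

```python
import numpy as np

def weighted_variance(attribute_values, targets):
    """Var_A(D): weighted average of the target variance within each branch."""
    attribute_values = np.asarray(attribute_values)
    targets = np.asarray(targets, dtype=float)
    total = len(targets)
    wv = 0.0
    for v in np.unique(attribute_values):
        branch = targets[attribute_values == v]
        wv += len(branch) / total * np.var(branch)   # np.var = mean squared deviation
    return wv
```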

SLIDE 25

Leaf Node Model

  • Take the average of the partition for leaf node l:

$$\hat{y}_l = \frac{1}{|D_l|} \sum_{y \in D_l} y$$

25

SLIDE 26

Example: Predict Baseball Player Salary

  • Dataset: (years, hits) => salary
  • Colors indicate the value of salary (blue: low, red: high)

26

SLIDE 27

A Regression Tree Built

27

SLIDE 28

A Different Angle to View the Tree

  • Each leaf corresponds to a box in the plane

28

[Figure: the predictor space partitioned into three boxes, R1 (Year < 4.5), R2 (Year > 4.5 & Hits < 117.5), and R3 (Year > 4.5 & Hits >= 117.5).]

SLIDE 29

Trees vs. Linear Models

29

[Figure: four panels comparing a fitted linear model and a fitted tree when the ground-truth decision boundary is linear vs. non-linear.]

SLIDE 30

Vector Data: Trees

  • Tree-based Prediction and Classification
  • Classification Trees
  • Regression Trees*
  • Random Forest
  • Summary

30

SLIDE 31

A Single Tree or a Set of Trees?

  • Limitation of single tree
  • Accuracy is not very high
  • Overfitting
  • A set of trees
  • The idea of ensemble

31

SLIDE 32

The Idea of Bagging

  • Bagging: Bootstrap Aggregating

32
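For illustration (not from the slides), the bootstrap step of bagging, drawing n objects with replacement from the original data, can be sketched as follows; it assumes the data are NumPy arrays.

```python
import numpy as np

def bootstrap_sample(X, y, seed=0):
    """One bootstrap replicate: n indices drawn with replacement."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(y), size=len(y))
    return X[idx], y[idx]
```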

SLIDE 33

Why Does It Work?

  • Each classifier produces a prediction $f_i(x)$
  • The error (variance) will be reduced if we use the average of the multiple classifiers:

$$var\left(\frac{\sum_{i} f_i(x)}{t}\right) = \frac{var(f_i(x))}{t}$$

(assuming the t individual predictions are independent with equal variance)

33

SLIDE 34

Random Forest

  • Sample t data collections: for each, randomly sample objects with replacement, with $n' \le n$
  • Sample $p'$ variables: select a subset of the variables for each data collection, e.g., $p' = \sqrt{p}$
  • Construct t trees, one per data collection, using its selected subset of variables
  • Aggregate the prediction results for new data
  • Majority voting for classification
  • Average for prediction

34
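As a usage illustration (not from the slides), the same recipe (t bootstrapped trees, a random subset of variables, majority voting) is available off the shelf in scikit-learn; note that scikit-learn draws the variable subset per split rather than once per data collection, and the dataset below is synthetic only to keep the snippet self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,        # t trees
    max_features="sqrt",     # p' = sqrt(p) variables considered at each split
    bootstrap=True,          # sample objects with replacement
    random_state=0,
)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))   # accuracy of the majority-vote ensemble
```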

SLIDE 35

Properties of Random Forest

  • Strengths
  • Good accuracy for classification tasks
  • Can handle large-scale datasets
  • Can handle missing data to some extent
  • Weaknesses
  • Not so good for prediction tasks
  • Lacks interpretability

35

SLIDE 36

Vector Data: Trees

  • Tree-based Prediction and Classification
  • Classification Trees
  • Regression Trees*
  • Random Forest
  • Summary

36

SLIDE 37

Summary

  • Classification Trees
  • Predict categorical labels; information gain; tree construction
  • Regression Trees*
  • Predict a numerical variable; variance reduction
  • Random Forest
  • A set of trees; bagging

37