Supervised Learning: Classification (Sept. 24, 2018)


  1. Supervised Learning: Classification — Sept. 24, 2018

  2. Classification: Basic Concepts
     • Classification: Basic Concepts
     • Decision Tree Induction
     • Bayes Classification Methods
     • Model Evaluation and Selection
     • Techniques to Improve Classification Accuracy: Ensemble Methods
     • Summary

  3. Supervised vs. Unsupervised Learning
     • Supervised learning (classification)
       – Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
       – New data is classified based on the training set
     • Unsupervised learning (clustering)
       – The class labels of the training data are unknown
       – Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

  4. Prediction Problems: Classification vs. Numeric Prediction
     • Classification
       – predicts categorical class labels (discrete or nominal)
       – constructs a model based on the training set and the values (class labels) in a classifying attribute, and uses it to classify new data
     • Numeric prediction
       – models continuous-valued functions, i.e., predicts unknown or missing values
     • Typical applications
       – Credit/loan approval
       – Medical diagnosis: is a tumor cancerous or benign?
       – Fraud detection: is a transaction fraudulent?
       – Web page categorization: which category does a page belong to?

  5. Classification: A Two-Step Process
     • Model construction: describing a set of predetermined classes
       – Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
       – The set of tuples used for model construction is the training set
       – The model is represented as classification rules, decision trees, or mathematical formulae
     • Model usage: classifying future or unknown objects
       – Estimate the accuracy of the model:
         • The known label of each test sample is compared with the model's prediction
         • The accuracy rate is the percentage of test-set samples correctly classified by the model
         • The test set must be independent of the training set (otherwise over-fitting results)
       – If the accuracy is acceptable, use the model to classify new data
     • Note: if the test set is used to select models, it is called a validation (test) set

  6. Process (1): Model Construction
     Training data is fed to a classification algorithm, which produces the classifier (model).

     Training Data:
       NAME   RANK            YEARS  TENURED
       Mike   Assistant Prof  3      no
       Mary   Assistant Prof  7      yes
       Bill   Professor       2      yes
       Jim    Associate Prof  7      yes
       Dave   Assistant Prof  6      no
       Anne   Associate Prof  3      no

     Learned model (classification rule):
       IF rank = 'professor' OR years > 6
       THEN tenured = 'yes'

  7. Process (2): Using the Model in Prediction
     The classifier is first evaluated on testing data, then applied to unseen data.

     Testing Data:
       NAME     RANK            YEARS  TENURED
       Tom      Assistant Prof  2      no
       Merlisa  Associate Prof  7      no
       George   Professor       5      yes
       Joseph   Assistant Prof  7      yes

     Unseen data: (Jeff, Professor, 4) → Tenured? (the learned rule predicts 'yes', since rank = 'professor')

  8. Chapter 8. Classification: Basic Concepts
     • Classification: Basic Concepts
     • Decision Tree Induction
     • Bayes Classification Methods
     • Model Evaluation and Selection
     • Techniques to Improve Classification Accuracy: Ensemble Methods
     • Summary

  9. Decision Tree Induction: An Example
     • Training data set: buys_computer. The data set follows an example from Quinlan's ID3 (Playing Tennis).

       age     income  student  credit_rating  buys_computer
       <=30    high    no       fair           no
       <=30    high    no       excellent      no
       31…40   high    no       fair           yes
       >40     medium  no       fair           yes
       >40     low     yes      fair           yes
       >40     low     yes      excellent      no
       31…40   low     yes      excellent      yes
       <=30    medium  no       fair           no
       <=30    low     yes      fair           yes
       >40     medium  yes      fair           yes
       <=30    medium  yes      excellent      yes
       31…40   medium  no       excellent      yes
       31…40   high    yes      fair           yes
       >40     medium  no       excellent      no

     • Resulting tree:
       age?
       ├─ <=30  → student?
       │          ├─ no  → no
       │          └─ yes → yes
       ├─ 31…40 → yes
       └─ >40   → credit_rating?
                  ├─ excellent → no
                  └─ fair      → yes

  10. Algorithm for Decision Tree Induction
     • Basic algorithm (a greedy algorithm)
       – The tree is constructed in a top-down, recursive, divide-and-conquer manner
       – At the start, all the training examples are at the root
       – Attributes are categorical (continuous-valued attributes are discretized in advance)
       – Examples are partitioned recursively based on selected attributes
       – Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
     • Conditions for stopping partitioning
       – All samples for a given node belong to the same class
       – There are no remaining attributes for further partitioning (majority voting is employed for classifying the leaf)
       – There are no samples left
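The greedy algorithm above can be sketched in Python. This is a minimal ID3-style illustration of the slide's steps, run here on the buys_computer training set from the example slide; the function names (`build_tree`, `info_gain`) are ours, not from any particular library.

```python
# Minimal sketch of greedy, top-down, divide-and-conquer decision-tree
# induction (ID3-style), using information gain as the selection measure.
import math
from collections import Counter

def entropy(labels):
    """Info(D): expected bits needed to classify a tuple in D."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Gain(A) = Info(D) - Info_A(D) for a categorical attribute."""
    n = len(labels)
    parts = {}
    for row, y in zip(rows, labels):
        parts.setdefault(row[attr], []).append(y)
    info_a = sum(len(p) / n * entropy(p) for p in parts.values())
    return entropy(labels) - info_a

def build_tree(rows, labels, attrs):
    if len(set(labels)) == 1:          # stop: all samples in one class
        return labels[0]
    if not attrs:                      # stop: no attributes left -> majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    branches = {}
    for v in {row[best] for row in rows}:   # partition on the chosen attribute
        idx = [i for i, row in enumerate(rows) if row[best] == v]
        branches[v] = build_tree([rows[i] for i in idx],
                                 [labels[i] for i in idx],
                                 [a for a in attrs if a != best])
    return (best, branches)

# The 14-tuple buys_computer training set from the example slide
DATA = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]
ATTRS = ["age", "income", "student", "credit_rating"]
rows = [dict(zip(ATTRS, t[:4])) for t in DATA]
labels = [t[4] for t in DATA]
tree = build_tree(rows, labels, ATTRS)
```

On this data the sketch reproduces the tree from the example slide: age at the root, with student and credit_rating tested below it.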

  11. Brief Review of Entropy
     For a variable with m possible outcomes of probabilities p_1, …, p_m, entropy is H = −Σ p_i log2(p_i). In the two-class case (m = 2), H(p) = −p log2 p − (1−p) log2(1−p); it is 0 when p ∈ {0, 1} and reaches its maximum of 1 bit at p = 0.5. [The slide shows the entropy curve for m = 2.]
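As a quick numeric check of the m = 2 case, a minimal sketch (the function name is ours):

```python
# Binary (m = 2) entropy: H(p) = -p*log2(p) - (1-p)*log2(1-p)
import math

def binary_entropy(p):
    """Entropy in bits of a two-class distribution (p, 1 - p)."""
    if p in (0.0, 1.0):      # lim x->0 of x*log2(x) is 0
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)
```

For example, a fair 50/50 split yields the maximum of 1 bit, and the 9-yes/5-no class distribution used later on these slides yields about 0.940 bits.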

  12. Attribute Selection Measure: Information Gain (ID3/C4.5)
     • Select the attribute with the highest information gain
     • Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_i,D| / |D|
     • Expected information (entropy) needed to classify a tuple in D:

       Info(D) = −Σ_{i=1}^{m} p_i log2(p_i)

     • Information needed (after using A to split D into v partitions) to classify D:

       Info_A(D) = Σ_{j=1}^{v} (|D_j| / |D|) × Info(D_j)

     • Information gained by branching on attribute A:

       Gain(A) = Info(D) − Info_A(D)

  13. Attribute Selection: Information Gain
     • Class P: buys_computer = "yes" (9 tuples); Class N: buys_computer = "no" (5 tuples), using the training data from slide 9:

       Info(D) = I(9, 5) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940

     • For age, I(p_i, n_i) is the entropy of each partition; e.g., "age <=30" has 5 of the 14 samples, with 2 yes'es and 3 no's:

       age     p_i  n_i  I(p_i, n_i)
       <=30    2    3    0.971
       31…40   4    0    0
       >40     3    2    0.971

       Info_age(D) = (5/14) I(2, 3) + (4/14) I(4, 0) + (5/14) I(3, 2) = 0.694

       Gain(age) = Info(D) − Info_age(D) = 0.246

     • Similarly,
       Gain(income) = 0.029
       Gain(student) = 0.151
       Gain(credit_rating) = 0.048
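The worked numbers above can be reproduced programmatically. A sketch: the per-value (yes, no) class counts below are read off the 14-tuple training set, and the helper name `info` is ours.

```python
import math

def info(counts):
    """I(p, n, ...): entropy of a vector of class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

# (yes, no) class counts for each value of each attribute,
# tallied from the 14-tuple buys_computer training set
partitions = {
    "age":           [(2, 3), (4, 0), (3, 2)],   # <=30, 31..40, >40
    "income":        [(3, 1), (4, 2), (2, 2)],   # low, medium, high
    "student":       [(6, 1), (3, 4)],           # yes, no
    "credit_rating": [(6, 2), (3, 3)],           # fair, excellent
}

info_D = info((9, 5))                                   # Info(D)
gains = {}
for attr, parts in partitions.items():
    info_A = sum(sum(c) / 14 * info(c) for c in parts)  # Info_A(D)
    gains[attr] = info_D - info_A                       # Gain(A)
```

The computed gains match the slide's values (0.246, 0.029, 0.151, 0.048) to within rounding, so age is indeed the first splitting attribute.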

  14. Computing Information Gain for Continuous-Valued Attributes
     • Let attribute A be a continuous-valued attribute
     • Must determine the best split point for A
       – Sort the values of A in increasing order
       – Typically, the midpoint between each pair of adjacent values is considered as a possible split point: (a_i + a_{i+1})/2 is the midpoint between the values of a_i and a_{i+1}
       – The point with the minimum expected information requirement for A is selected as the split point for A
     • Split:
       – D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is the set of tuples in D satisfying A > split-point
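The midpoint search described above can be sketched as follows. The function name and the toy data are ours, chosen only to illustrate the procedure:

```python
import math

def info(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum(labels.count(v) / n * math.log2(labels.count(v) / n)
                for v in set(labels))

def best_split_point(values, labels):
    """Try the midpoint between each pair of adjacent sorted values;
    return the split with minimum expected information Info_A(D)."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (float("inf"), None)
    for i in range(n - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue                             # no midpoint between equal values
        mid = (pairs[i][0] + pairs[i + 1][0]) / 2
        d1 = [y for x, y in pairs if x <= mid]   # D1: A <= split-point
        d2 = [y for x, y in pairs if x > mid]    # D2: A >  split-point
        info_a = len(d1) / n * info(d1) + len(d2) / n * info(d2)
        best = min(best, (info_a, mid))
    return best[1], best[0]

# Toy continuous attribute (e.g., raw ages) with class labels
split, expected_info = best_split_point([21, 25, 33, 45],
                                        ["no", "no", "yes", "yes"])
```

Here the midpoint 29.0 separates the classes perfectly, so its expected information is 0 and it is chosen as the split point.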

  15. Gain Ratio for Attribute Selection (C4.5)
     • The information gain measure is biased towards attributes with a large number of values
     • C4.5 (a successor of ID3) uses gain ratio to overcome the problem (a normalization of information gain):

       SplitInfo_A(D) = −Σ_{j=1}^{v} (|D_j| / |D|) × log2(|D_j| / |D|)

       GainRatio(A) = Gain(A) / SplitInfo_A(D)

     • Ex.: gain_ratio(income) = 0.029 / 1.557 = 0.019
     • The attribute with the maximum gain ratio is selected as the splitting attribute
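The income example can be checked numerically. A sketch: the 4/6/4 partition sizes are read off the 14-tuple training set (low, medium, high), and Gain(income) = 0.029 is taken from the earlier slide.

```python
import math

def split_info(sizes):
    """SplitInfo_A(D) = -sum(|Dj|/|D| * log2(|Dj|/|D|))."""
    total = sum(sizes)
    return -sum(s / total * math.log2(s / total) for s in sizes)

# income splits the 14 tuples into |low| = 4, |medium| = 6, |high| = 4
si = split_info([4, 6, 4])          # SplitInfo_income(D), about 1.557
gain_ratio_income = 0.029 / si      # GainRatio(income), about 0.019
```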

  16. Gini Index (CART, IBM IntelligentMiner)
     • If a data set D contains examples from n classes, the gini index gini(D) is defined as

       gini(D) = 1 − Σ_{j=1}^{n} p_j²

       where p_j is the relative frequency of class j in D
     • If D is split on A into two subsets D1 and D2, the gini index gini_A(D) is defined as

       gini_A(D) = (|D1| / |D|) gini(D1) + (|D2| / |D|) gini(D2)

     • Reduction in impurity:

       Δgini(A) = gini(D) − gini_A(D)

     • The attribute that provides the smallest gini_A(D) (or the largest reduction in impurity) is chosen to split the node (need to enumerate all the possible splitting points for each attribute)

  17. Computation of Gini Index
     • Ex.: D has 9 tuples with buys_computer = "yes" and 5 with "no":

       gini(D) = 1 − (9/14)² − (5/14)² = 0.459

     • Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 in D2: {high}:

       gini_{income ∈ {low,medium}}(D) = (10/14) Gini(D1) + (4/14) Gini(D2) = 0.443

       Gini_{low,high} is 0.458 and Gini_{medium,high} is 0.450. Thus, split on {low, medium} (and {high}) since it has the lowest Gini index
     • All attributes are assumed continuous-valued
     • May need other tools, e.g., clustering, to get the possible split values
     • Can be modified for categorical attributes
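The Gini computations on this slide can be reproduced with a short sketch. The per-subset class counts are read off the training data: {low, medium} holds 7 yes / 3 no tuples, and {high} holds 2 yes / 2 no.

```python
def gini(counts):
    """gini(D) = 1 - sum(p_j^2), from a vector of class counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

gini_D = gini([9, 5])                   # gini(D), about 0.459

# Split on income into {low, medium} (D1) vs {high} (D2)
d1, d2 = [7, 3], [2, 2]                 # (yes, no) counts per subset
n = sum(d1) + sum(d2)
gini_split = (sum(d1) / n * gini(d1)
              + sum(d2) / n * gini(d2)) # gini_income(D), about 0.443
```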

  18. Comparing Attribute Selection Measures
     • The three measures, in general, return good results, but:
       – Information gain: biased towards multivalued attributes
       – Gain ratio: tends to prefer unbalanced splits in which one partition is much smaller than the others
       – Gini index: biased towards multivalued attributes; has difficulty when the number of classes is large; tends to favor tests that result in equal-sized partitions and purity in both partitions
