Transforming a continuous attribute into a discrete (ordinal) attribute


  1. Transforming a continuous attribute into a discrete (ordinal) attribute
     Ricco Rakotomalala
     Université Lumière Lyon 2
     Tutoriels Tanagra - http://tutoriels-data-mining.blogspot.fr/

  2. Outline
     1. What is the discretization process? Why discretize?
     2. Unsupervised approaches
     3. Supervised approaches
     4. Conclusion
     5. References

  3. Why and how to discretize a continuous attribute?

  4. Converting a continuous attribute into a discrete one (with a small set of values)
     X is a quantitative (continuous) variable; it is converted into an ordinal (discrete) variable in 2 steps:
     (1) define (calculate) the cut points;
     (2) encode the values according to the corresponding interval.
     Example (each value of X and its class):
     14 C1, 15 C1, 17 C1, 19 C1, 21 C1, 23 C1, 25 C1, 26 C1, 27 C1, 28 C1, 29 C1, 31 C1,
     38 C2, 39 C2, 41 C2, 44 C2, 45 C2,
     49 C3, 51 C3, 53 C3
     [Figure: the 20 values plotted on an axis from 10 to 60, partitioned into C1, C2, C3]
     Main issues: (1) how to choose the number of intervals K; (2) how to define the cut points... which must be relevant for the studied problem.
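To make step (2) concrete, here is a minimal Python/NumPy sketch (not from the slides; the cut points 32 and 47 are my own arbitrary values, chosen inside the gaps 31..38 and 45..49 so the encoding reproduces the slide's classes):

```python
import numpy as np

# The 20 values of X from the slide.
x = np.array([14, 15, 17, 19, 21, 23, 25, 26, 27, 28, 29, 31,
              38, 39, 41, 44, 45, 49, 51, 53])

# Step 1 is assumed already done: the two cut points are given.
cut_points = [32, 47]

# Step 2: np.digitize returns 0 where x < 32, 1 where 32 <= x < 47,
# and 2 where x >= 47; we map these codes to the labels C1, C2, C3.
codes = np.digitize(x, cut_points)
labels = np.array(["C1", "C2", "C3"])[codes]
print(labels)   # C1 for 14..31, C2 for 38..45, C3 for 49..53
```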

  5. Why do we need to discretize?
     • Some statistical methods or machine learning algorithms can handle discrete attributes only. E.g. rule induction methods (predictive rules, association rules, ...).
     • Harmonization of the types of variables in heterogeneous data sets. E.g. discretize the continuous attributes and perform a multiple correspondence analysis on the whole database (including the original nominal variables). (NB: we can also harmonize the data set by turning the categorical variables into numerical ones, e.g. dummy coding.)
     • Transforming the characteristics of the data set in order to make the subsequent statistical algorithm more efficient. E.g. to correct a skewed distribution, to reduce the influence of outliers, to handle nonlinearities.

  6. The reference method - discretization based on domain knowledge
     A domain expert may adapt the discretization to the context and the goal of the study. He takes into account other information than that provided by the available dataset alone.
     Example: X = Age, with the legal age of majority as a possible cut point:
     X < 18 : minor
     X ≥ 18 : adult
     On the 20 values above, 14, 15 and 17 are encoded as minor; 19 through 53 as adult.
     • But this solution is not workable in most cases, because such expertise is rare.
     • We need statistical approaches based on the dataset characteristics.

  7. Methods that use only the characteristics of the variable to discretize

  8. A few statistical indicators calculated from the data (the 20 values of X)
     n = 20
     Min = 14 ; Max = 53
     1st quartile = 22.5 ; Median = 28.5 ; 3rd quartile = 41.75
     Mean = 31.75 ; Standard deviation = 12.07
     How to use such information as a guide to obtain a relevant discretization?


  10. Equal-width intervals method (1)
      K, the number of intervals, is a parameter. The range of the data is divided into K intervals of equal width.
      We calculate from the data the width of the intervals:
      a = (max - min) / K
      We derive the (K - 1) cut points:
      b1 = min + a ; b2 = b1 + a = min + 2a ; etc.
      Example with K = 3: max = 53, min = 14, so a = 13, b1 = 27.0, b2 = 40.0.
      C1 : x < 27
      C2 : 27 ≤ x < 40
      C3 : x ≥ 40
      Hence 14 ... 26 → C1 ; 27, 28, 29, 31, 38, 39 → C2 ; 41 ... 53 → C3.
      Whether the inequalities are strict or not is defined arbitrarily.
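A minimal sketch of the method in Python/NumPy (the function name `equal_width` is my own); on the 20 ages it reproduces the slide's cut points b1 = 27 and b2 = 40:

```python
import numpy as np

def equal_width(x, k=3):
    """Equal-width discretization: a = (max - min) / k, as on the slide."""
    x = np.asarray(x, dtype=float)
    a = (x.max() - x.min()) / k                  # interval width
    cut_points = x.min() + a * np.arange(1, k)   # b1, b2, ..., b_{k-1}
    return cut_points, np.digitize(x, cut_points)

x = [14, 15, 17, 19, 21, 23, 25, 26, 27, 28, 29, 31,
     38, 39, 41, 44, 45, 49, 51, 53]
cuts, codes = equal_width(x, k=3)
print(cuts)    # [27. 40.] -- matches b1 = 27, b2 = 40
```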

  11. Equal-width intervals method (2)
      Pros:
      • Simplicity and quickness.
      • The shape of the distribution is not modified.
      [Figure: density plot of a skewed sample (n = 1000) and the bar chart of its 3 equal-width intervals [0.226, 5.94), [5.94, 11.6), [11.6, 17.3)]
      Cons:
      • The choice of K (the number of intervals) is not obvious.
      • Sensitivity to outliers.
      • Possibility of obtaining intervals with very few instances, or even empty ones.
      • The cut points do not take into account the proximities of the values.

  12. Equal-frequency intervals (1)
      K, the number of intervals, is a parameter. The aim is to have (roughly) the same number of instances in each interval.
      We calculate the quantiles from the data; the quantiles are the cut points (q1, q2, etc.). E.g. the quantile of order 0.25 is the 1st quartile, the quantile of order 0.5 is the median, etc.
      Example with K = 3: q(0.33) = 25.33 and q(0.66) = 38.67, so each interval contains about n/3 instances.
      C1 : x < 25.33
      C2 : 25.33 ≤ x < 38.67
      C3 : x ≥ 38.67
      Hence 14 ... 25 → C1 ; 26 ... 38 → C2 ; 39 ... 53 → C3.
      Whether the inequalities are strict or not is defined arbitrarily.
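The same kind of sketch for the equal-frequency variant (again the function name is my own; note that quantile estimates differ slightly across tools depending on the interpolation rule - NumPy's default linear interpolation happens to reproduce the slide's values here):

```python
import numpy as np

def equal_frequency(x, k=3):
    """Equal-frequency discretization: the cut points are the k-quantiles."""
    x = np.asarray(x, dtype=float)
    probs = np.arange(1, k) / k                  # e.g. 1/3, 2/3 for k = 3
    cut_points = np.quantile(x, probs)
    return cut_points, np.digitize(x, cut_points)

x = [14, 15, 17, 19, 21, 23, 25, 26, 27, 28, 29, 31,
     38, 39, 41, 44, 45, 49, 51, 53]
cuts, codes = equal_frequency(x, k=3)
print(cuts)    # approximately [25.33, 38.67], as on the slide
```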

  13. Equal-frequency intervals (2)
      Pros:
      • Simplicity and quickness (we only need to sort the instances).
      • Not sensitive to outliers.
      • The number of instances in each interval is defined a priori.
      • Balances the shape of the distribution.
      [Figure: density plot of the same skewed sample (n = 1000) and the bar chart of its 3 equal-frequency intervals [0.243, 3.08), [3.08, 5.74), [5.74, 17.3]]
      Cons:
      • The choice of K (the number of intervals) is not obvious.
      • The cut points do not take into account the proximities of the values.

  14. Some formulas to determine the "right" number of intervals K (http://www.info.univ-angers.fr/~gh/wstat/discr.php)
      Approach             Formula (without rounding)               K (on the 20 ages above)
      Brooks-Carruthers    5 × log10(n)                             6.51
      Huntsberger          1 + 3.332 × log10(n)                     5.34
      Sturges              log2(n + 1)                              4.39
      Scott                (max - min) / (3.5 × σ × n^(-1/3))       2.50
      Freedman-Diaconis    (max - min) / (2 × IQ × n^(-1/3))        2.75
      σ : standard deviation ; IQ : interquartile range.
      The last two formulas use more information from the data (range, dispersion).
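These rules of thumb are easy to reproduce; a hedged sketch (the function name is my own; the population standard deviation is used because it matches the slide's value of 12.07):

```python
import numpy as np

def suggested_k(x):
    """The slide's rules of thumb for K (results must still be rounded)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    rng = x.max() - x.min()
    sigma = x.std()                              # population sd, as on slide 8
    iq = np.quantile(x, 0.75) - np.quantile(x, 0.25)
    return {
        "Brooks-Carruthers": 5 * np.log10(n),
        "Huntsberger": 1 + 3.332 * np.log10(n),
        "Sturges": np.log2(n + 1),
        "Scott": rng / (3.5 * sigma * n ** (-1 / 3)),
        "Freedman-Diaconis": rng / (2 * iq * n ** (-1 / 3)),
    }
```

Applied to the 20 ages, it returns the values in the table above; each result must then be rounded to an integer K.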

  15. Other unsupervised approaches
      • Nested means. Top-down algorithm. The first cut point is the mean computed from the whole sample. Then we subdivide the sample with this cut point, calculate the local mean in each subsample, subdivide again, etc. The number of intervals is a power of 2.
      • Large relative difference. We sort the data in increasing order and identify the large differences between successive values. We define a cut point when the gap is greater than a threshold expressed as a percentage of the standard deviation of the values (or as a percentage of the MAD - median absolute deviation - if we want to avoid the outliers problem). See the sketch below.
      • Difference to the mean. K is a parameter. If K is even, the first intervals on both sides of the mean have a width of σ [or m × σ, where m is a parameter], etc., until we have K intervals [the last intervals have a different width]. If K is odd, the first interval is centered on the mean.
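As announced in the list above, a minimal sketch of the "large relative difference" idea (the function name, the midpoint placement of the cut, and the default threshold of 50% of the standard deviation are my own choices; the slide leaves them open):

```python
import numpy as np

def large_gap_cuts(x, threshold=0.5):
    """Cut wherever the gap between successive sorted values exceeds
    threshold * standard deviation (replace the sd with the MAD to be
    robust to outliers, as the slide suggests)."""
    xs = np.sort(np.asarray(x, dtype=float))
    limit = threshold * xs.std()
    # Cut midway inside each large gap (the midpoint is an arbitrary choice).
    return [(lo + hi) / 2 for lo, hi in zip(xs[:-1], xs[1:]) if hi - lo > limit]
```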


  17. Taking into account the proximity between the values
      Data can be organized into "clusters"; we are interested in the dispersion characteristics of the data.
      [Figure: the values on an axis from 10 to 60, grouped into 3 clusters x1, x2, x3]
      ANOVA equation (analysis of variance), T = B + W:

      \sum_{k=1}^{K} \sum_{i=1}^{n_k} (x_{ik} - \bar{x})^2 = \sum_{k=1}^{K} n_k (\bar{x}_k - \bar{x})^2 + \sum_{k=1}^{K} \sum_{i=1}^{n_k} (x_{ik} - \bar{x}_k)^2

      Goal: maximizing the differences between the conditional means, i.e. maximizing the correlation ratio \eta^2 = B / T.
      An optimal approach exists (the Fisher algorithm), but:
      • the number of intervals K remains a parameter;
      • the algorithm requires quadratic time O(n^2);
      • it is not available in data mining tools.
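A hedged sketch of the criterion itself (not the Fisher algorithm, just the computation of η² = B/T for a given discretization; the function name is my own) - the optimal algorithm searches for the partition maximizing this quantity:

```python
import numpy as np

def correlation_ratio(x, codes):
    """eta^2 = B / T from the slide's ANOVA decomposition T = B + W."""
    x = np.asarray(x, dtype=float)
    codes = np.asarray(codes)
    t = ((x - x.mean()) ** 2).sum()              # total sum of squares T
    b = 0.0                                      # between-group sum of squares B
    for k in np.unique(codes):
        group = x[codes == k]
        b += len(group) * (group.mean() - x.mean()) ** 2
    return b / t                                 # eta^2, between 0 and 1
```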

  18. Top-down strategy (1)
      (1) Find the best binary separation.
      (2) Continue recursively in each subgroup until the stopping rules are met.
      What stopping rules?
      • A new partition does not imply a significant difference between the conditional means (α significance level; pay attention to the multiple comparisons problem).
      • Minimum number of instances before or after a split.
      • Maximum number of intervals.
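A minimal sketch of this top-down strategy (all names and defaults are my own; it implements only two of the three stopping rules - the minimum segment size and, via a depth limit, a maximum number of intervals - and omits the significance test):

```python
import numpy as np

def top_down_cuts(x, min_size=3, max_depth=2):
    """Find the best binary split (largest between-group sum of squares),
    then recurse on each side; at most 2**max_depth intervals."""
    xs = np.sort(np.asarray(x, dtype=float))

    def split(seg, depth):
        if depth == 0 or len(seg) < 2 * min_size:
            return []
        best_i, best_b = None, 0.0
        for i in range(min_size, len(seg) - min_size + 1):
            left, right = seg[:i], seg[i:]
            # Between-group sum of squares of the binary partition.
            b = (len(left) * (left.mean() - seg.mean()) ** 2
                 + len(right) * (right.mean() - seg.mean()) ** 2)
            if b > best_b:
                best_i, best_b = i, b
        if best_i is None:
            return []
        cut = (seg[best_i - 1] + seg[best_i]) / 2    # cut midway in the gap
        return (split(seg[:best_i], depth - 1)
                + [cut]
                + split(seg[best_i:], depth - 1))

    return split(xs, max_depth)
```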
