Transforming a continuous attribute into a discrete (ordinal) - PowerPoint PPT Presentation

Transforming a continuous attribute into a discrete (ordinal) attribute Ricco RAKOTOMALALA Université Lumière Lyon 2 Ricco Rakotomalala 1 Tutoriels Tanagra - http://tutoriels-data-mining.blogspot.fr/

Outline 1. What is the discretization process? Why discretize? 2. Unsupervised approaches 3. Supervised approaches 4. Conclusion 5. References Ricco Rakotomalala 2 Tutoriels Tanagra - http://tutoriels-data-mining.blogspot.fr/

Why and how to discretize a continuous attribute? Ricco Rakotomalala 3 Tutoriels Tanagra - http://tutoriels-data-mining.blogspot.fr/

Converting a continuous attribute into a discrete one (with a small set of values) X Classes X is a quantitative (continuous) variable 14 C1 It is converted into an ordinal (discrete) variable 15 C1 17 C1 19 C1 21 C1 2 steps: 23 C1 Define (calculate) the cut points 25 C1 26 C1 Encode the values according to the corresponding interval 27 C1 28 C1 29 C1 31 C1 C1 C2 C3 38 C2 39 C2 41 C2 44 C2 45 C2 49 C3 10 20 30 40 50 60 51 C3 53 C3 (1) How to choose the number of intervals K Main issues (2) How to define the cut points … which are relevant according to the studied problem…. Ricco Rakotomalala 4 Tutoriels Tanagra - http://tutoriels-data-mining.blogspot.fr/

Why do we need to discretize? Some statistical method or machine E.g. Rule induction methods (predictive rules, association learning algorithms can handle rules,…) discrete attributes only. E.g. Discretize the continuous Harmonization of the type of variables in attributes and perform a multiple heterogeneous data sets. correspondence analysis on the (NB: We can also harmonize the data set by turning whole database (including the categorical variables in a numerical ones e.g. dummy coding) original nominal variables) Transforming the characteristics of the data set E.g. To correct skewed distribution, to reduce the in order to make the subsequent statistical influence of outliers, to handle algorithm more efficient. nonlinearities. Ricco Rakotomalala 5 Tutoriels Tanagra - http://tutoriels-data-mining.blogspot.fr/

The reference method - Discretization based on the domain knowledge A domain expert may adapt the discretization to the context and the goal of the study. He takes into account other information than those only provided by the available dataset. Age Majorité 14 Mineur 15 Mineur 17 Mineur X = Age 19 Majeur 21 Majeur 23 Majeur The majority age is a possible cut 25 Majeur 26 Majeur point. 27 Majeur 28 Majeur Adult Minor 29 Majeur X < 18 : minor 31 Majeur X  18 : adult 38 Majeur 39 Majeur 41 Majeur 44 Majeur 45 Majeur 49 Majeur 51 Majeur 53 Majeur  But this solution is not obvious on the most of cases, because the expertise is rare.  We need statistical approach based on the dataset characteristics. Ricco Rakotomalala 6 Tutoriels Tanagra - http://tutoriels-data-mining.blogspot.fr/

Methods that use only the characteristics of the variable to discretize Ricco Rakotomalala 7 Tutoriels Tanagra - http://tutoriels-data-mining.blogspot.fr/

A few statistical indicators calculated from data X n 20 14 15 Min 14 17 Max 53 19 21 1st quartile 22.5 23 Median 28.5 25 3rd quartile 41.75 26 27 Mean 31.75 28 Standard deviation 12.07 29 31 38 39 10 20 30 40 50 60 41 44 45 49 51 53 How to use such information as guides to obtain a relevant discretization? Ricco Rakotomalala 8 Tutoriels Tanagra - http://tutoriels-data-mining.blogspot.fr/

Ricco Rakotomalala 9 Tutoriels Tanagra - http://tutoriels-data-mining.blogspot.fr/

Equal width intervals method (1) max  min  K : the number of interval is a parameter a K Divide the dataset into K ranges of equal size   b min a 1 We calculate from the data the width of the intervals (a)      b b a min 2 a 2 1 We derive the (K-1) cut points (b 1 , b 2 , etc.)  X Classes 14 C1 15 C1 K 3 17 C1 19 C1 max 53 a = 13 a = 13 a = 13 21 C1 min 14 23 C1 25 C1 a 13 C1 C2 C3 26 C1 27 C2 b1 27.0 28 C2 b2 40.0 29 C2 31 C2 10 20 30 40 50 60 38 C2 39 C2 41 C3 44 C3 45 C3 C1 : x < 27 Strict or not inequalities 49 C3 C2 : 27  x < 40 51 C3 are defined arbitrarily. C3 : x  40 53 C3 Ricco Rakotomalala 10 Tutoriels Tanagra - http://tutoriels-data-mining.blogspot.fr/

Equal width intervals method (2) Simplicity and quickness The shape of the distribution is not modified density.default(x = x) 600 0.12 500 400 0.08 Density 300 0.04 200 100 0.00 0 0 5 10 15 20 [0.226,5.94) [5.94,11.6) [11.6,17.3) N = 1000 Bandwidth = 0.6965 The choice of K (the number of intervals) is not obvious Sensitivity to the outliers Possibility of obtaining intervals with a very few instances or even empty The cut points do not take into account the proximities of the values Ricco Rakotomalala 11 Tutoriels Tanagra - http://tutoriels-data-mining.blogspot.fr/

Equal frequency intervals (1) K : the number of interval is a parameter Ex. quantile of order 0.25 = The same (roughly) number of instances in each interval 1 st quartile ; quantile of order 0.5 = median ; etc. We calculate the quantiles from the data (q 1 , q 2 ) Quantiles = cut points (q 1 , q 2 , etc.) X Classes 14 C1 15 C1 17 C1 q(0.33) 25.33 19 C1 q(0.66) 38.67  n/3  n/3  n/3 21 C1 23 C1 C1 C2 C3 25 C1 26 C2 27 C2 28 C2 29 C2 10 20 30 40 50 60 31 C2 38 C2 39 C3 41 C3 44 C3 45 C3 C1 : x < 25.33 Strict or not inequalities 49 C3 C2 : 25.33  x < 38.67 are defined arbitrarily. 51 C3 C3 : x  38.67 53 C3 Ricco Rakotomalala 12 Tutoriels Tanagra - http://tutoriels-data-mining.blogspot.fr/

Equal frequency intervals (2) Simplicity and quickness (we need to sort the instances) Not sensitive to the outliers The number of instances in each interval is defined a priori Balancing the shape of the distribution density.default(x = x) 300 0.12 250 0.08 Density 200 150 0.04 100 50 0.00 0 0 5 10 15 20 [0.243,3.08) [3.08,5.74) [5.74,17.3] N = 1000 Bandwidth = 0.6965 The choice of K (the number of intervals) is not obvious The cut points do not take into account the proximities of the values Ricco Rakotomalala 13 Tutoriels Tanagra - http://tutoriels-data-mining.blogspot.fr/

Some formulas to determine the "right" number of intervals K (http://www.info.univ-angers.fr/~gh/wstat/discr.php) K Approach Formula (without rounding) Brooks-Carruthers 5 x log 10 (n) 6.51 Huntsberger 1 + 3.332 x log 10 (n) 5.34 Sturges log2(n + 1) 4.39 (max – min)/(3.5 x  x n -1/3 ) Scott 2.50 (max – min)/(2 x IQ x n -1/3 ) Freedman-Diaconis 2.75  : standard deviation IQ : interquartile range The two last formulas use more information from the data (range, dispersion). 10 20 30 40 50 60 Ricco Rakotomalala 14 Tutoriels Tanagra - http://tutoriels-data-mining.blogspot.fr/

Other unsupervised approaches Nested means . Top-down algorithm. The first cut point is the mean computed from the whole sample. Then, we subdivide the sample with this cut point. We calculate the local average in each subsample. We subdivide again. Etc. The number of intervals is a power of 2. Large relative difference . We sort the data in increasing order. We identify the large difference between 2 successive values. We define a cut point if the gap is upper than a threshold expressed in % of the standard deviation of the values (or in percentage of MAD - median absolute deviation - if you want to avoid the outliers problem). Difference to the average . K is a parameter. If K is even, on both side of the average, the first intervals are a range  [or m x  , m is a parameter], etc. until we have K intervals [last intervals have a different width]. If K is odd, the first interval is around the average. Ricco Rakotomalala 15 Tutoriels Tanagra - http://tutoriels-data-mining.blogspot.fr/

Ricco Rakotomalala 16 Tutoriels Tanagra - http://tutoriels-data-mining.blogspot.fr/

Take into account the proximity between the values Data can be organized into "clusters". We are interested in the characteristics of x x x 1 2 3 dispersion of data. 10 20 30 40 50 60   T B W ANOVA equation n n K K          k      2 2 2 x x n x x x x (analysis of variance) i k k ik k     i 1 k 1 k 1 i 1  is the Goal: Maximizing the differences B   2 correlation ratio between the conditional means W An optimal approach exists (Fisher algorithm), but…  The number of intervals K remains a parameter  The algorithm requires quadratic time O(n 2 ).  It is not available into data mining tools. Ricco Rakotomalala 17 Tutoriels Tanagra - http://tutoriels-data-mining.blogspot.fr/

Top down strategy (1) (1) Find the best binary separation Continue recursively in each subgroup Until stopping rules are met (2) Etc. What stopping rules? • A new partition does not imply a significant difference between conditional means (  significance level, pay attention about the multiple comparisons problem) • Minimum number of instances before or after a splitting process • Maximum number of intervals Ricco Rakotomalala 18 Tutoriels Tanagra - http://tutoriels-data-mining.blogspot.fr/

Transforming a continuous attribute into a discrete (ordinal) - PowerPoint PPT Presentation

Transforming a continuous attribute into a discrete (ordinal) attribute Ricco RAKOTOMALALA Universit Lumire Lyon 2 Ricco Rakotomalala 1 Tutoriels Tanagra - http://tutoriels-data-mining.blogspot.fr/ Outline 1. What is the discretization

TRANSFORMING TRANSFORMING TRANSFORMING TRANSFORMING FINANCIAL SERVICES FINANCIAL SERVICES FOR

CMSC 222: Discrete Mathematics Prof S Fall 2018 What is Discrete Mathematics? Discrete

Attribute Grammars Wilhelm/Seidl/Hack: Compiler Design, Syntactic and Semantic Analysis

Incorporating Off-Line Attribute Delegation into Hierarchical Group and Attribute-Based Access

MSJC Equity Pledge Board of Trustees Meeting November 14, 2019 Equity @ MSJC Equity @ MSJC

Discrete Mathematics Jeremy Siek Spring 2010 Jeremy Siek Discrete Mathematics 1 / 118 Jeremy

Cyber-Physical Systems Discrete Dynamics IECE 553/453 Fall 2019 Prof. Dola Saha 1 Discrete

Cyber-Physical Systems Discrete Dynamics ICEN 553/453 Fall 2018 Prof. Dola Saha 1 Discrete

Plan Discrete paths as Heyting algebras Discrete paths as categories Discrete paths as quantales

Discrete-time Systems in the Time Domain Chaiwoot Boonyasiriwat August 21, 2020 Discrete-time

Evidence evaluation for discrete data Evidence evaluation for discrete data Evidence evaluation

Discrete Structures of Computer Science Amanda Watson What is Discrete Mathematics?

Proposal to add masking function to GFM Proposal part 1 Adding a masking reference attribute on

Decorated Attribute Grammars Attribute Evaluation Meets Strategic Programming CC 2009, York, UK

Attribute Dependencies Wilhelm/Seidl/Hack: Compiler Design, Syntactic and Semantic Analysis

The Syntax of Classes and Objects in Python Defining a Class - "Inventing a Composite Data

PSS718 - Data Mining Lecture 5 - Transforming Data Asst.Prof.Dr. Burkay Gen Hacettepe

Benchmarking Methodology for IPv6 Transition Technologies

Atypical Behavior Identification in Large Scale Network Traffic Daniel Best {daniel.best@pnnl.gov}

M2S2 - Distributions Professor Jarad Niemi STAT 226 - Iowa State University August 29, 2018

Medians MPM2D: Principles of Mathematics Consider ABD below. Centroid of a Triangle J.

A Fast Algorithm for Rectilinear Steiner Trees with Length Restrictions on Obstacles Stephan Held

Algorithms with provable guarantees for clustering problems Ola Svensson Where to place rescue

Location of a line in the three-dimensional space Daniel Scholz Institute for Numerical and

Sambuz

Useful Links

Newsletter

Mail Us

Transforming a continuous attribute into a discrete (ordinal) - PowerPoint PPT Presentation

Transforming a continuous attribute into a discrete (ordinal) attribute Ricco RAKOTOMALALA Universit Lumire Lyon 2 Ricco Rakotomalala 1 Tutoriels Tanagra - http://tutoriels-data-mining.blogspot.fr/ Outline 1. What is the discretization

TRANSFORMING TRANSFORMING TRANSFORMING TRANSFORMING FINANCIAL SERVICES FINANCIAL SERVICES FOR

CMSC 222: Discrete Mathematics Prof S Fall 2018 What is Discrete Mathematics? Discrete

Attribute Grammars Wilhelm/Seidl/Hack: Compiler Design, Syntactic and Semantic Analysis

Incorporating Off-Line Attribute Delegation into Hierarchical Group and Attribute-Based Access

MSJC Equity Pledge Board of Trustees Meeting November 14, 2019 Equity @ MSJC Equity @ MSJC

Discrete Mathematics Jeremy Siek Spring 2010 Jeremy Siek Discrete Mathematics 1 / 118 Jeremy

Cyber-Physical Systems Discrete Dynamics IECE 553/453 Fall 2019 Prof. Dola Saha 1 Discrete

Cyber-Physical Systems Discrete Dynamics ICEN 553/453 Fall 2018 Prof. Dola Saha 1 Discrete

Plan Discrete paths as Heyting algebras Discrete paths as categories Discrete paths as quantales

Discrete-time Systems in the Time Domain Chaiwoot Boonyasiriwat August 21, 2020 Discrete-time

Evidence evaluation for discrete data Evidence evaluation for discrete data Evidence evaluation

Discrete Structures of Computer Science Amanda Watson What is Discrete Mathematics?

Proposal to add masking function to GFM Proposal part 1 Adding a masking reference attribute on

Decorated Attribute Grammars Attribute Evaluation Meets Strategic Programming CC 2009, York, UK

Attribute Dependencies Wilhelm/Seidl/Hack: Compiler Design, Syntactic and Semantic Analysis

The Syntax of Classes and Objects in Python Defining a Class - &quot;Inventing a Composite Data

PSS718 - Data Mining Lecture 5 - Transforming Data Asst.Prof.Dr. Burkay Gen Hacettepe

Benchmarking Methodology for IPv6 Transition Technologies

Atypical Behavior Identification in Large Scale Network Traffic Daniel Best {daniel.best@pnnl.gov}

M2S2 - Distributions Professor Jarad Niemi STAT 226 - Iowa State University August 29, 2018

Medians MPM2D: Principles of Mathematics Consider ABD below. Centroid of a Triangle J.

A Fast Algorithm for Rectilinear Steiner Trees with Length Restrictions on Obstacles Stephan Held

Algorithms with provable guarantees for clustering problems Ola Svensson Where to place rescue

Location of a line in the three-dimensional space Daniel Scholz Institute for Numerical and

Sambuz

Useful Links

Newsletter

Mail Us

The Syntax of Classes and Objects in Python Defining a Class - "Inventing a Composite Data