SLIDE 5 17
Discretization
- Divide the range of a continuous attribute into intervals
- Some methods require discrete values, e.g. most versions of Naïve
Bayes, CHAID
- Reduce data size by discretization
- Prepare for further analysis
- Discretization is very useful for generating a summary of data
- Also called “binning”
18
Top-down (Splitting) versus Bottom-up (Merging)
- Top-down methods start with an empty list of cut-points (or
split-points) and keep on adding new ones to the list by ‘splitting’ intervals as the discretization progresses.
- Bottom-up methods start with the complete list of all the
continuous values of the feature as cut-points and remove some
of them by ‘merging’ intervals as the discretization progresses.
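The two directions can be contrasted with a minimal Python sketch. The slide does not say how to choose which interval to split or merge, so the criteria here (split the widest interval at its midpoint; merge the two closest cut-points) are placeholder assumptions. The function names `top_down` and `bottom_up` and the parameter `n_cuts` are hypothetical as well.

```python
def top_down(values, n_cuts):
    """Start with an empty cut-point list and add cut-points by splitting."""
    bounds = [min(values), max(values)]  # outer range limits, not cut-points
    cuts = []
    for _ in range(n_cuts):
        edges = sorted(bounds + cuts)
        # Assumed criterion: split the widest current interval at its midpoint.
        lo, hi = max(zip(edges, edges[1:]), key=lambda p: p[1] - p[0])
        cuts.append((lo + hi) / 2)
    return sorted(cuts)


def bottom_up(values, n_cuts):
    """Start with every distinct value as a cut-point and remove by merging."""
    cuts = sorted(set(values))
    while len(cuts) > n_cuts:
        # Assumed criterion: find the adjacent cut-points with the smallest gap.
        i = min(range(len(cuts) - 1), key=lambda j: cuts[j + 1] - cuts[j])
        del cuts[i + 1]  # dropping a cut-point merges the intervals around it
    return cuts


temperatures = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
print(top_down(temperatures, 3))
print(bottom_up(temperatures, 3))
```

Both functions stop at the same number of cut-points but reach it from opposite ends: `top_down` grows the list from empty, `bottom_up` shrinks it from the full set of values.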
19
Equal-width Binning
- It divides the range into N intervals of equal size (range): uniform grid
- If A and B are the lowest and highest values of the attribute, the width of
intervals will be: W = (B - A)/N.
Equal width, bins Low <= value < High (the last bin also includes its upper bound)
Temperature values: 64 65 68 69 70 71 72 72 75 75 80 81 83 85
Bin     [64,67)  [67,70)  [70,73)  [73,76)  [76,79)  [79,82)  [82,85]
Count      2        2        4        2        0        2        2
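The temperature example can be reproduced with a short Python sketch. It assumes N = 7 (which gives W = (85 - 64)/7 = 3) and assumes the maximum value is placed in the last bin, as the closed interval [82,85] suggests. The function name `equal_width_bins` is hypothetical.

```python
def equal_width_bins(values, n_bins):
    """Return (edges, counts) for equal-width binning of a list of numbers."""
    low, high = min(values), max(values)
    if high == low:
        raise ValueError("all values are identical; the width would be zero")
    width = (high - low) / n_bins  # W = (B - A) / N
    edges = [low + i * width for i in range(n_bins + 1)]
    counts = [0] * n_bins
    for v in values:
        # Low <= value < High; the maximum is clamped into the last bin.
        idx = min(int((v - low) / width), n_bins - 1)
        counts[idx] += 1
    return edges, counts


temperatures = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
edges, counts = equal_width_bins(temperatures, 7)
print(edges)   # bin boundaries 64.0, 67.0, ..., 85.0
print(counts)  # [2, 2, 4, 2, 0, 2, 2]
```

The bin [76,79) comes out empty: equal-width binning fixes the grid from the range alone, regardless of where the values actually fall.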
20
Equal-width Binning
Disadvantages:
- Unsupervised
- Where does N come from?
- Sensitive to outliers
Advantages:
- Simple and easy to implement
- Produces a reasonable abstraction of the data
[Figure: Count versus salary in a corporation, with bins from [0 – 200,000) … to [1,800,000 – 2,000,000]; a count of 1 is marked.]
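The sensitivity to outliers can be illustrated with a small Python sketch. The salary figures are invented for illustration and are not from the slide; the helper name `equal_width_counts` is hypothetical too.

```python
from collections import Counter


def equal_width_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins  # W = (B - A) / N
    # The maximum value is clamped into the last bin.
    hits = Counter(min(int((v - low) / width), n_bins - 1) for v in values)
    return [hits.get(i, 0) for i in range(n_bins)]


# Hypothetical salaries: nine ordinary values plus one extreme outlier.
salaries = [30_000, 35_000, 42_000, 48_000, 55_000,
            61_000, 70_000, 85_000, 120_000, 2_000_000]

with_outlier = equal_width_counts(salaries, 10)
without_outlier = equal_width_counts(salaries[:-1], 10)
print(with_outlier)     # [9, 0, 0, 0, 0, 0, 0, 0, 0, 1]
print(without_outlier)
```

A single very large value stretches the range so far that all the other salaries fall into the first bin, and the bins in between stay empty.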