Data Preparation: Discretization, Data Cleaning (Data Pre-processing) - PowerPoint PPT Presentation

Data Preparation

(Data pre-processing)

2

Data Preparation

  • Why prepare the data?
  • Discretization
  • Data cleaning
  • Data integration and transformation
  • Data reduction, Feature selection

3

Why Prepare Data?

  • Some data preparation is needed for all mining tools
  • The purpose of preparation is to transform data sets so that their information content is best exposed to the mining tool
  • The error prediction rate should be lower (or the same) after preparation than before it

4

Why Prepare Data?

  • Preparing data also prepares the miner: using prepared data, the miner produces better models, faster
  • GIGO (garbage in, garbage out): good data is a prerequisite for producing effective models of any type


5

Why Prepare Data?

  • Data need to be formatted for a given software tool
  • Data in the real world is dirty
  • incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  • e.g., occupation=“”
  • noisy: containing errors or outliers
  • e.g., Salary=“-10”, Age=“222”
  • inconsistent: containing discrepancies in codes or names
  • e.g., Age=“42”, Birthday=“03/07/1997”
  • e.g., rating was “1,2,3”, now rating is “A, B, C”
  • e.g., discrepancy between duplicate records

6

Major Tasks in Data Preparation

  • Data discretization
  • Part of data reduction, but of particular importance, especially for numerical data
  • Data cleaning
  • Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
  • Data integration
  • Integration of multiple databases, data cubes, or files
  • Data transformation
  • Normalization and aggregation
  • Data reduction
  • Obtains a reduced representation, smaller in volume, that produces the same or similar analytical results

7

Data Preparation as a step in the Knowledge Discovery Process

(Diagram of the Knowledge Discovery process:)

Databases / Data Warehouse → Cleaning and Integration → Selection and Transformation → Data Mining → Evaluation and Presentation → Knowledge

Data preparation covers the cleaning, integration, selection and transformation steps.

8

Types of Data Measurements

  • Measurements differ in their nature and the amount of information they give
  • Qualitative vs. Quantitative

9

Types of Measurements

  • Nominal scale
  • Gives unique names to objects - no other information deducible
  • Names of people

10

Types of Measurements

  • Nominal scale
  • Categorical scale
  • Names categories of objects
  • Although maybe numerical, not ordered
  • ZIP codes
  • Hair color
  • Gender: Male, Female
  • Marital Status: Single, Married, Divorcee, Widower

11

Types of Measurements

  • Nominal scale
  • Categorical scale
  • Ordinal scale
  • Measured values can be ordered naturally
  • Transitivity: (A > B) and (B > C) ⇒ (A > C)
  • “blind” tasting of wines
  • Classifying students as: Very Good, Good, Sufficient,...
  • Temperature: Cool, Mild, Hot

12

Types of Measurements

  • Nominal scale
  • Categorical scale
  • Ordinal scale
  • Interval scale
  • The scale has a means to indicate the distance that separates measured values
  • Temperature

13

Types of Measurements

  • Nominal scale
  • Categorical scale
  • Ordinal scale
  • Interval scale
  • Ratio scale
  • Measurement values can be used to determine a meaningful ratio between them

  • Bank account balance
  • Weight
  • Salary

14

Types of Measurements

  • Nominal scale
  • Categorical scale
  • Ordinal scale
  • Interval scale
  • Ratio scale

(Nominal and categorical scales are qualitative; interval and ratio scales are quantitative, discrete or continuous. Information content increases from nominal to ratio.)

15

Data Conversion

  • Some tools can deal with nominal values but others need fields to be numeric
  • Convert ordinal fields to numeric to be able to use “>” and “<” comparisons on such fields:
  • A 4.0
  • A- 3.7
  • B+ 3.3
  • B 3.0
  • Multi-valued, unordered attributes with a small number of values:
  • e.g. Color = Red, Orange, Yellow, …, Violet
  • for each value v create a binary “flag” variable C_v, which is 1 if Color = v, 0 otherwise
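The two conversions above can be sketched in Python. The grade table and colour list are just the slide's examples; `grade_to_numeric` and `one_hot` are hypothetical helper names:

```python
# Sketch of the two conversions described above.
GRADE_POINTS = {"A": 4.0, "A-": 3.7, "B+": 3.3, "B": 3.0}

def grade_to_numeric(grade):
    """Ordinal -> numeric, so '>' and '<' comparisons become meaningful."""
    return GRADE_POINTS[grade]

def one_hot(value, categories):
    """Unordered nominal -> binary flag variables C_v (1 if value == v, else 0)."""
    return {"C_" + v: int(value == v) for v in categories}

flags = one_hot("Orange", ["Red", "Orange", "Yellow"])
```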

16

Conversion: Nominal, Many Values

  • Examples:
  • US State Code (50 values)
  • Profession Code (7,000 values, but only few frequent)
  • Ignore ID-like fields whose values are unique for each record
  • For other fields, group values “naturally”:
  • e.g. 50 US states → 3 or 5 regions
  • Profession – select most frequent ones, group the rest
  • Create binary flag-fields for selected values

17

Discretization

  • Divide the range of a continuous attribute into intervals
  • Some methods require discrete values, e.g. most versions of Naïve Bayes, CHAID

  • Reduce data size by discretization
  • Prepare for further analysis
  • Discretization is very useful for generating a summary of data
  • Also called “binning”

18

Top-down (Splitting) versus Bottom-up (Merging)

  • Top-down methods start with an empty list of cut-points (or split-points) and keep adding new ones to the list by ‘splitting’ intervals as the discretization progresses.
  • Bottom-up methods start with the complete list of all the continuous values of the feature as cut-points and remove some of them by ‘merging’ intervals as the discretization progresses.

19

Equal-width Binning

  • It divides the range into N intervals of equal size (range): uniform grid
  • If A and B are the lowest and highest values of the attribute, the width of the intervals will be: W = (B - A)/N

Equal width bins, Low <= value < High
Temperature values: 64 65 68 69 70 71 72 72 75 75 80 81 83 85

  Bin:   [64,67) [67,70) [70,73) [73,76) [76,79) [79,82) [82,85]
  Count:    2       2       4       2       0       2       2
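A minimal sketch of equal-width binning on the temperature values above (pure Python; `equal_width_bins` is an illustrative helper, not a library function):

```python
from collections import Counter

def equal_width_bins(values, n):
    """Assign each value to one of n equal-width intervals of width (B - A)/n."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n
    # the maximum value falls into the last (closed) bin
    return [min(int((v - lo) / width), n - 1) for v in values]

temps = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
counts = Counter(equal_width_bins(temps, 7))  # width = (85 - 64) / 7 = 3
```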

20

Equal-width Binning

Advantages: (a) simple and easy to implement; (b) produces a reasonable abstraction of data.
Disadvantages: (a) unsupervised; (b) where does N come from? (c) sensitive to outliers.

(Figure: equal-width binning of salary in a corporation, bins [0 – 200,000) … [1,800,000 – 2,000,000]; one outlier salary leaves a count of 1 in the top bin and crowds almost all values into the first.)


21

Equal-depth (or height) Binning

  • It divides the range into N intervals, each containing approximately the same number of samples
  • Generally preferred because it avoids clumping
  • In practice, “almost-equal” height binning is used to give more intuitive breakpoints
  • Additional considerations:
  • don’t split frequent values across bins
  • create separate bins for special values (e.g. 0)
  • readable breakpoints (e.g. round numbers)

22

Equal-depth (or height) Binning

Equal height = 4, except for the last bin

Temperature values: 64 65 68 69 70 71 72 72 75 75 80 81 83 85

  Bin:   [64 .. 69] [70 .. 72] [73 .. 81] [83 .. 85]
  Count:     4          4          4          2
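A sketch of almost-equal-height binning that reproduces the table above. The target bin height (here the ceiling of count/N) and the "don't split repeated values" rule are the assumptions; real tools choose breakpoints in their own ways:

```python
import math

def equal_depth_bins(sorted_values, n):
    """Split sorted values into n bins of almost-equal count, never
    splitting a run of repeated values across two bins."""
    size = math.ceil(len(sorted_values) / n)   # target bin height
    bins, current = [], []
    boundary = size
    for i, v in enumerate(sorted_values):
        current.append(v)
        nxt = sorted_values[i + 1] if i + 1 < len(sorted_values) else None
        # close the bin once it is full, but only where the value changes
        if i + 1 >= boundary and nxt != v:
            bins.append(current)
            current = []
            boundary += size
    if current:
        bins.append(current)
    return bins

temps = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
bins = equal_depth_bins(temps, 4)
```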

23

Discretization considerations

  • Class-independent methods
  • Equal Width is simpler, good for many classes
  • can fail miserably for unequal distributions
  • Equal Height gives better results
  • Class-dependent methods can be better for classification
  • Note: decision tree methods build discretization on the fly
  • Naïve Bayes requires initial discretization
  • Many other methods exist …

24

Method 1R

  • Developed by Holte (1993).
  • It is a supervised discretization method using binning.
  • After sorting the data, the range of continuous values is divided into a number of disjoint intervals, and the boundaries of those intervals are adjusted based on the class labels associated with the values of the feature.
  • Each interval should contain a given minimum of instances (6 by default), with the exception of the last one.
  • The adjustment of the boundary continues until the next value belongs to a class different from the majority class in the adjacent interval.


25

1R Example

26

Entropy Based Discretization

Class dependent (classification)

  • 1. Sort examples in increasing order
  • 2. Each value forms an interval (‘m’ intervals)
  • 3. Calculate the entropy measure of this discretization
  • 4. Find the binary split boundary T that minimizes the entropy function over all possible boundaries; the split is selected as a binary discretization:

E(S, T) = (|S1| / |S|) · Ent(S1) + (|S2| / |S|) · Ent(S2)

  • 5. Apply the process recursively until some stopping criterion is met, e.g.,

Ent(S) − E(T, S) > δ
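Steps 1 to 4 can be sketched directly from the formulas above. The temperature/play data is taken from the worked example later in this deck; `ent`, `split_entropy` and `best_split` are illustrative names:

```python
import math
from collections import Counter

def ent(labels):
    """Ent(S) = -sum_c p_c * log2(p_c) over the classes present in S."""
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())

def split_entropy(pairs, t):
    """E(S, T) = |S1|/|S| * Ent(S1) + |S2|/|S| * Ent(S2) for cut-point t."""
    s1 = [c for v, c in pairs if v < t]
    s2 = [c for v, c in pairs if v >= t]
    n = len(pairs)
    return len(s1) / n * ent(s1) + len(s2) / n * ent(s2)

def best_split(pairs):
    """Step 4: try a cut-point between each adjacent pair of distinct values."""
    values = sorted({v for v, _ in pairs})
    cuts = [(a + b) / 2 for a, b in zip(values, values[1:])]
    return min(cuts, key=lambda t: split_entropy(pairs, t))

# Temperature / play data from the example slides below
temps = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
play = ["Y", "N", "Y", "Y", "Y", "N", "N", "Y", "Y", "Y", "N", "Y", "Y", "N"]
pairs = list(zip(temps, play))
```

On this data the cut at 71.5 scores 0.939 and the minimum is reached at 84, matching the example.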

27

Entropy

Ent = − Σ (c = 1 to N) p_c · log2(p_c)

Two classes (entropy maximal at p = 0.5, Ent = log2(2) = 1):

  p    1−p   Ent
  0.2  0.8   0.72
  0.4  0.6   0.97
  0.5  0.5   1
  0.6  0.4   0.97
  0.8  0.2   0.72

Three classes (entropy maximal at p1 = p2 = p3 = 1/3, Ent = log2(3) ≈ 1.58):

  p1    p2    p3    Ent
  0.1   0.1   0.8   0.92
  0.2   0.2   0.6   1.37
  0.1   0.45  0.45  1.37
  0.2   0.4   0.4   1.52
  0.3   0.3   0.4   1.57
  0.33  0.33  0.33  1.58

28

Entropy/Impurity

  • S - training set, C1, ..., CN classes
  • Entropy E(S) - measure of the impurity in a group of examples
  • p_c - proportion of C_c in S

Impurity(S) = − Σ (c = 1 to N) p_c · log2(p_c)


29

Impurity

(Figure: three example groups, from very impure, to less impure, to minimum impurity.)

30

An example

  Temp.  Play?
  64     Yes
  65     No
  68     Yes
  69     Yes
  70     Yes
  71     No
  72     No
  72     Yes
  75     Yes
  75     Yes
  80     No
  81     Yes
  83     Yes
  85     No

Test: temp < 71.5 gives class counts (4,2) below and (5,3) above the split:

Ent([4,2],[5,3]) = (6/14)·Ent([4,2]) + (8/14)·Ent([5,3]) = 0.939

The method tests all split point possibilities and chooses to split at the point where the entropy is smallest. The cleanest division is at 84.

31

An example (cont.)

(The same temperature table, now divided into six candidate intervals 1-6.)

The fact that recursion only occurs in the first interval in this example is an artifact. In general, both intervals have to be split.

32

Outliers

  • Outliers are values thought to be out of range.
  • They can be detected by standardizing observations and labelling the standardized values outside a predetermined bound as outliers
  • Outlier detection can be used for fraud detection or data cleaning
  • Approaches:
  • do nothing
  • enforce upper and lower bounds
  • let binning handle the problem

33

Outlier detection

  • Univariate
  • Compute the mean and standard deviation s. For k = 2 or 3, x is an outlier if it lies outside the limits (mean − k·s, mean + k·s) (normal distribution assumed)
  • Boxplot: an observation is declared an extreme outlier if it lies outside the interval (Q1 − 3×IQR, Q3 + 3×IQR), where IQR = Q3 − Q1 (Inter Quartile Range), and a mild outlier if it lies outside the interval (Q1 − 1.5×IQR, Q3 + 1.5×IQR).
34

Outlier detection

  • Multivariate
  • Clustering
  • Very small clusters are outliers
  • Distance based
  • An instance with very few neighbours within distance λ is regarded as an outlier

35

Data Transformation

  • Smoothing: remove noise from data (binning, regression, clustering)
  • Aggregation: summarization, data cube construction
  • Generalization: concept hierarchy climbing
  • Attribute/feature construction
  • New attributes constructed from the given ones (e.g. add attribute area, which is based on height and width)
  • Normalization
  • Scale values to fall within a smaller, specified range

36

Data Cube Aggregation

  • Data can be aggregated so that the resulting data summarize, for example, sales per year instead of sales per quarter.
  • This reduced representation contains all the relevant information if we are concerned with the analysis of yearly sales.


37

Concept Hierarchies

Country → State → County → City

Jobs, food classification, time measures...

38

Normalization

  • For distance-based methods, normalization helps to prevent attributes with large ranges from out-weighing attributes with small ranges

  • min-max normalization
  • z-score normalization
  • normalization by decimal scaling

39

Normalization

  • min-max normalization:

v' = (v − min) / (max − min) · (new_max − new_min) + new_min

  • z-score normalization:

v' = (v − mean_v) / σ_v

  • normalization by decimal scaling:

v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1

  • e.g. range: -986 to 917 ⇒ j = 3, so -986 → -0.986 and 917 → 0.917
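The three normalizations translate one-to-one into code; this is a minimal sketch with the slide's decimal-scaling example:

```python
def min_max(v, vmin, vmax, new_min=0.0, new_max=1.0):
    """Map v from [vmin, vmax] onto [new_min, new_max]."""
    return (v - vmin) / (vmax - vmin) * (new_max - new_min) + new_min

def z_score(v, mean, std):
    """Standardize v given the attribute's mean and standard deviation."""
    return (v - mean) / std

def decimal_scaling(values):
    """Divide by 10^j, with j the smallest integer making max(|v'|) < 1."""
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]
```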

40

Missing Data

  • Data is not always available
  • E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
  • Missing data may be due to
  • equipment malfunction
  • inconsistent with other recorded data and thus deleted
  • data not entered due to misunderstanding
  • certain data may not be considered important at the time of entry
  • history or changes of the data not registered
  • Missing data may need to be inferred.
  • Missing values may carry some information content: e.g. a credit application may carry information by noting which field the applicant did not complete


41

Missing Values

  • There are always MVs in a real dataset
  • MVs may have an impact on modelling, in fact, they can destroy it!
  • Some tools ignore missing values, others use some metric to fill in replacements
  • The modeller should avoid default automated replacement techniques
  • It is difficult to know their limitations, problems and introduced bias
  • Replacing missing values without capturing that information elsewhere removes information from the dataset

42

How to Handle Missing Data?

  • Ignore records (use only cases with all values)
  • Usually done when the class label is missing, as most prediction methods do not handle missing data well
  • Not effective when the percentage of missing values per attribute varies considerably, as it can lead to insufficient and/or biased sample sizes
  • Ignore attributes with missing values
  • Use only features (attributes) with all values (may leave out important features)
  • Fill in the missing value manually
  • tedious + infeasible?

43

How to Handle Missing Data?

  • Use a global constant to fill in the missing value
  • e.g., “unknown”. (May create a new class!)
  • Use the attribute mean to fill in the missing value
  • It will do the least harm to the mean of existing data
  • If the mean is to be unbiased
  • What if the standard deviation is to be unbiased?
  • Use the attribute mean for all samples belonging to the same class to fill in the missing value
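A sketch of mean and class-mean filling, with `None` standing for a missing value (the income/label lists are invented toy data):

```python
def fill_with_mean(values):
    """Replace None with the overall attribute mean."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

def fill_with_class_mean(values, classes):
    """Replace None with the mean over samples of the same class."""
    means = {}
    for c in set(classes):
        known = [v for v, cc in zip(values, classes) if cc == c and v is not None]
        means[c] = sum(known) / len(known)
    return [means[c] if v is None else v for v, c in zip(values, classes)]

incomes = [30, None, 50, 70, None]
labels = ["low", "low", "high", "high", "high"]
```

The class-conditional version fills the two gaps differently (30.0 for the "low" record, 60.0 for the "high" one), which is why it generally does less harm than one global mean.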

44

How to Handle Missing Data?

  • Use the most probable value to fill in the missing value
  • Inference-based such as Bayesian formula or decision tree
  • Identify relationships among variables
  • Linear regression, Multiple linear regression, Nonlinear regression
  • Nearest-Neighbour estimator
  • Find the k neighbours nearest to the point and fill in the most frequent value or the average value
  • Finding neighbours in a large dataset may be slow
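A brute-force sketch of the nearest-neighbour estimator for a numeric attribute (Euclidean distance on the other attributes; `knn_fill` and the 2-column table are illustrative only, and the linear scan is exactly the slowness the slide warns about):

```python
def knn_fill(records, row, col, k=3):
    """Fill records[row][col] with the average of that attribute over the
    k nearest neighbours (Euclidean distance on the other attributes)."""
    target = records[row]
    feats = [i for i in range(len(target)) if i != col]

    def dist(other):
        return sum((target[i] - other[i]) ** 2 for i in feats) ** 0.5

    donors = [r for j, r in enumerate(records)
              if j != row and r[col] is not None]
    nearest = sorted(donors, key=dist)[:k]
    return sum(r[col] for r in nearest) / len(nearest)

table = [[1.0, 10.0], [1.1, 12.0], [5.0, 50.0], [1.2, None]]
```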

45

How to Handle Missing Data?

  • Note that it is as important to avoid adding bias and distortion to the data as it is to make the information available.
  • Bias is added when a wrong value is filled in
  • No matter what techniques you use to conquer the problem, it comes at a price. The more guessing you have to do, the further away from the real data the database becomes. This, in turn, can affect the accuracy and validation of the mining results.

46

Data Integration

  • Turn a collection of pieces of information into an integrated and consistent whole
  • Detecting and resolving data value conflicts
  • For the same real world entity, attribute values from different sources may be different
  • Which source is more reliable?
  • Is it possible to induce the correct value?
  • Possible reasons: different representations, different scales, e.g., metric vs. British units
  • Data integration requires knowledge of the “business”

47

Types of Inter-schema Conflicts

  • Classification conflicts
  • Corresponding types describe different sets of real world elements. DB1: authors of journal and conference papers; DB2: authors of conference papers only.

  • Generalization / specialization hierarchy
  • Descriptive conflicts
  • naming conflicts : synonyms , homonyms
  • cardinalities: first name - one , two , N values
  • domains: salary : $, Euro ... ; student grade : [ 0 : 20 ] , [1 : 5 ]
  • Solution depends upon the type of the descriptive conflict

48

Types of Inter-schema Conflicts

  • Structural conflicts
  • DB1 : Book is a class; DB2 : books is an attribute of Author
  • Choose the less constrained structure (Book is a class)
  • Fragmentation conflicts
  • DB1: Class Road_segment; DB2: Classes Way_segment, Separator

  • Aggregation relationship

49

Handling Redundancy in Data Integration

  • Redundant data occur often when integrating databases
  • The same attribute may have different names in different databases
  • False predictors are fields correlated to target behavior which describe events that happen at the same time as, or after, the target behavior
  • Example: Service cancellation date is a leaker when predicting attriters
  • One attribute may be a “derived” attribute in another table, e.g., annual revenue
  • For numerical attributes, redundancy may be detected by correlation analysis:

r_XY = Σ (x_n − x̄)(y_n − ȳ) / sqrt( Σ (x_n − x̄)² · Σ (y_n − ȳ)² ),  with −1 ≤ r_XY ≤ 1   (sums over n = 1 … N)
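The correlation coefficient above in a minimal sketch, using the raw-sum form that is algebraically equivalent to the centred form (the quarterly/annual example is invented to show a perfectly "derived" attribute):

```python
def pearson_r(xs, ys):
    """Sample correlation coefficient, -1 <= r <= 1; values near +/-1
    flag a likely redundant (derived) attribute."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    num = n * sxy - sx * sy                                   # raw-sum form,
    den = ((n * sxx - sx ** 2) * (n * syy - sy ** 2)) ** 0.5  # equal to centred
    return num / den

quarterly = [10, 20, 30, 40]
annual = [40, 80, 120, 160]   # a "derived" attribute: 4 x quarterly
```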

50

Scatter Matrix

51

(Almost) Automated False Predictor Detection

  • For each field, build a 1-field decision tree (or compute the correlation with the target field)
  • Rank all suspects by 1-field prediction accuracy (or correlation)
  • Remove suspects whose accuracy is close to 100% (note: the threshold is domain dependent)

  • Verify top “suspects” with domain expert

52

Data Reduction Strategies

  • A warehouse may store terabytes of data: complex data analysis/mining may take a very long time to run on the complete data set
  • Data reduction
  • Obtains a reduced representation of the data set that is much smaller in volume but produces the same (or almost the same) analytical results

  • Data reduction strategies
  • Data cube aggregation
  • Dimensionality reduction
  • Numerosity reduction
  • Discretization and concept hierarchy generation

53

Data Reduction: Selecting Most Relevant Fields

  • If there are too many fields, select a subset that is most relevant.
  • Can select the top N fields using 1-field predictive accuracy, as computed for detecting false predictors.
  • What is a good N?
  • Rule of thumb: keep the top 50 fields

54

Numerosity Reduction

  • Parametric methods
  • Assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers), e.g. regression
  • Non-parametric methods
  • Do not assume models
  • Major families: histograms, clustering, sampling

55

Clustering

  • Partitioning a data set into clusters makes it possible to store the cluster representation only
  • Can be very effective if data is clustered, but not if data is “smeared”
  • There are many choices of clustering definitions and clustering algorithms, further detailed in next lessons

56

Histograms

  • A popular data reduction technique
  • Divide data into buckets and store the average (sum) for each bucket
  • Can be constructed optimally in one dimension using dynamic programming
  • Optimal if it has minimum variance
  • Histogram variance is a weighted sum of the variance of the source values in each bucket
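A sketch of the basic idea, reducing raw values to one (range, count, mean) triple per equal-width bucket (the price list is invented; optimal variance-minimizing buckets would need the dynamic program mentioned above):

```python
def histogram_reduce(values, n_buckets):
    """Keep only (range, count, mean) per equal-width bucket
    instead of the raw values."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_buckets
    buckets = [[] for _ in range(n_buckets)]
    for v in values:
        buckets[min(int((v - lo) / width), n_buckets - 1)].append(v)
    return [((lo + i * width, lo + (i + 1) * width),
             len(b),
             sum(b) / len(b) if b else None)
            for i, b in enumerate(buckets)]

prices = [10000, 12000, 30000, 31000, 50000, 52000, 90000]
summary = histogram_reduce(prices, 4)
```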



57

Sampling

  • The cost of sampling is proportional to the sample size and not to the original dataset size; therefore, a mining algorithm’s complexity is potentially sub-linear in the size of the data
  • Choose a representative subset of the data
  • Simple random sampling (with or without replacement)
  • Stratified sampling:
  • Approximate the percentage of each class (or subpopulation of interest) in the overall database

  • Used in conjunction with skewed data
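Stratified sampling can be sketched in a few lines: draw the same fraction from each class so the skewed proportions survive in the sample (`stratified_sample` and the buy/no_buy data are illustrative):

```python
import random

def stratified_sample(records, classes, fraction, seed=0):
    """Draw the same fraction from each class so that skewed class
    proportions are preserved in the sample."""
    rng = random.Random(seed)
    sample = []
    for c in sorted(set(classes)):
        members = [r for r, cc in zip(records, classes) if cc == c]
        k = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, k))
    return sample

customers = list(range(100))
classes = ["buy"] * 10 + ["no_buy"] * 90   # skewed: 1 buyer per 9 non-buyers
sub = stratified_sample(customers, classes, 0.2)
```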

58

Increasing Dimensionality

  • In some circumstances the dimensionality of a variable needs to be increased:

  • Color from a category list to the RGB values
  • ZIP codes from category list to latitude and longitude

59

Unbalanced Target Distribution

  • Sometimes, classes have very unequal frequency
  • Attrition prediction: 97% stay, 3% attrite (in a month)
  • medical diagnosis: 90% healthy, 10% disease
  • eCommerce: 99% don’t buy, 1% buy
  • Security: >99.99% of Americans are not terrorists
  • Similar situation with multiple classes
  • Majority class classifier can be 97% correct, but useless

60

Handling Unbalanced Data

  • With two classes: let positive targets be the minority
  • Separate a raw held-aside set (e.g. 30% of data) from the raw train set
  • put aside the raw held-aside set and don’t use it till the final model
  • Select the remaining positive targets (e.g. 70% of all targets) from the raw train set
  • Join them with an equal number of negative targets from the raw train set, and randomly sort the result
  • Separate the randomized balanced set into balanced train and balanced test sets
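The steps above can be sketched as one function; undersampling the negatives to match the positive count is the balancing choice the slide describes, while the 30% held-aside fraction and 50/50 train/test split are the example values:

```python
import random

def make_balanced_sets(records, targets, held_frac=0.3, seed=0):
    """Sketch of the steps above: split off a raw held-aside set, balance
    the remaining positives with an equal number of negatives, shuffle,
    and split the balanced set into train and test halves."""
    rng = random.Random(seed)
    data = list(zip(records, targets))
    rng.shuffle(data)
    n_held = int(held_frac * len(data))
    held, raw_train = data[:n_held], data[n_held:]

    pos = [d for d in raw_train if d[1] == 1]
    neg = [d for d in raw_train if d[1] == 0]
    balanced = pos + rng.sample(neg, len(pos))   # equal number of negatives
    rng.shuffle(balanced)
    half = len(balanced) // 2
    return held, balanced[:half], balanced[half:]

records = list(range(200))
targets = [1] * 20 + [0] * 180            # 10% positive targets
held, btrain, btest = make_balanced_sets(records, targets)
```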


61

Building Balanced Train Sets

(Diagram: the raw data is split into a raw held set plus targets and non-targets; equal numbers of targets and non-targets form the balanced set, which is split into balanced train and balanced test.)

62

Learning with Unbalanced Data

  • Build models on balanced train/test sets
  • Estimate the final results (lift curve) on the raw held set
  • Can generalize “balancing” to multiple classes
  • stratified sampling
  • Ensure that each class is represented with approximately equal proportions in train and test

63

Summary

  • Every real world data set needs some kind of data pre-processing
  • Deal with missing values
  • Correct erroneous values
  • Select relevant attributes
  • Adapt the data set format to the software tool to be used
  • In general, data pre-processing consumes more than 60% of a data mining project effort

64

References

  • ‘Data Preparation for Data Mining’, Dorian Pyle, 1999
  • ‘Data Mining: Concepts and Techniques’, Jiawei Han and Micheline Kamber, 2000
  • ‘Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations’, Ian H. Witten and Eibe Frank, 1999
  • ‘Data Mining: Practical Machine Learning Tools and Techniques’, 2nd edition, Ian H. Witten and Eibe Frank, 2005
  • ‘DM: Introduction: Machine Learning and Data Mining’, Gregory Piatetsky-Shapiro and Gary Parker (http://www.kdnuggets.com/data_mining_course/dm1-introduction-ml-data-mining.ppt)
  • ESMA 6835 Mineria de Datos (http://math.uprm.edu/~edgar/dm8.ppt)

65

Thank you !!!