 
Data Preparation
(Data pre-processing)

Data Preparation
• Introduction to Data Preparation
• Types of Data and Basic Statistics
• Discretization of Continuous Variables
• Working in the R environment
• Outliers
• Data Transformation
• Missing Data
• Data Integration
• Data Reduction

INTRODUCTION TO DATA PREPARATION

Why Prepare Data?
• Some data preparation is needed for all mining tools.
• The purpose of preparation is to transform data sets so that their information content is best exposed to the mining tool.
• The prediction error rate should be lower (or the same) after the preparation as before it.
Why Prepare Data? (continued)
• Data need to be formatted for a given software tool.
• Data need to be made adequate for a given method.
• Data in the real world is dirty:
  • incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
    • e.g., occupation=“”
  • noisy: containing errors or outliers
    • e.g., Salary=“-10”, Age=“222”
  • inconsistent: containing discrepancies in codes or names
    • e.g., Age=“42” Birthday=“03/07/1997”
    • e.g., Was rating “1,2,3”, now rating “A, B, C”
    • e.g., discrepancy between duplicate records
    • e.g., Address: “Travessa da Igreja de Nevogilde”, Parish: “Paranhos” (the street name points to Nevogilde, but the recorded parish is Paranhos)
• Preparing data also prepares the miner, so that when using prepared data the miner produces better models, faster.
• GIGO (garbage in, garbage out): good data is a prerequisite for producing effective models of any type.

Major Tasks in Data Preparation
• Data discretization
  • Part of data reduction but with particular importance, especially for numerical data
• Data cleaning
  • Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration
  • Integration of multiple databases, data cubes, or files
• Data transformation
  • Normalization and aggregation
• Data reduction
  • Obtains a representation that is reduced in volume but produces the same or similar analytical results

Data Preparation as a step in the Knowledge Discovery Process
[Figure: DB → Cleaning and Integration → DW → Selection and Transformation → Data Mining → Evaluation and Presentation → Knowledge]
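The dirty-data examples listed under "Why Prepare Data?" (an empty occupation, Salary = -10, Age = 222, an age that contradicts the birthday) can be flagged with a few simple checks. The following is a minimal sketch in Python rather than a prescribed method: the records, the field names, the plausible age range of 0 to 120, the one-year tolerance, and the reading of 03/07/1997 as 3 July 1997 are all assumptions made for illustration.

```python
from datetime import date

# Hypothetical records; field names and values are illustrative assumptions.
records = [
    {"id": 1, "occupation": "", "salary": 1200, "age": 42,
     "birthday": date(1997, 7, 3)},   # assumes 03/07/1997 means 3 July 1997
    {"id": 2, "occupation": "clerk", "salary": -10, "age": 222,
     "birthday": None},
    {"id": 3, "occupation": "nurse", "salary": 1500, "age": 30,
     "birthday": date(1990, 1, 15)},
]


def age_on(birthday, today):
    """Age in whole years on the given reference date."""
    before_birthday = (today.month, today.day) < (birthday.month, birthday.day)
    return today.year - birthday.year - before_birthday


def find_problems(rec, today):
    """Return a list of data-quality problems found in one record."""
    problems = []
    if not rec["occupation"]:
        problems.append("incomplete: occupation is empty")
    if rec["salary"] < 0:
        problems.append("noisy: negative salary")
    if not 0 <= rec["age"] <= 120:  # assumed plausible range
        problems.append("noisy: implausible age")
    if rec["birthday"] is not None:
        # Allow one year of slack for records captured at different times.
        if abs(age_on(rec["birthday"], today) - rec["age"]) > 1:
            problems.append("inconsistent: age does not match birthday")
    return problems


reference_date = date(2020, 1, 1)  # fixed so the checks are reproducible
report = {rec["id"]: find_problems(rec, reference_date) for rec in records}
for rec_id, problems in report.items():
    print(rec_id, problems)
```

Checks like these only detect problems; whether to correct, remove, or ignore a flagged value remains a decision for the data cleaning task.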
CRISP-DM
• CRISP-DM is a comprehensive data mining methodology and process model that provides anyone, from novices to data mining experts, with a complete blueprint for conducting a data mining project.
• CRISP-DM breaks down the life cycle of a data mining project into six phases.

CRISP-DM Phases and Tasks
• Business Understanding: Determine Business Objectives; Assess Situation; Determine Data Mining Goals; Produce Project Plan
• Data Understanding: Collect Initial Data; Describe Data; Explore Data; Verify Data Quality
• Data Preparation: Select Data; Clean Data; Construct Data; Integrate Data; Format Data
• Modelling: Select Modeling Technique; Generate Test Design; Build Model; Assess Model
• Evaluation: Evaluate Results; Review Process; Determine Next Steps
• Deployment: Plan Deployment; Plan Monitoring & Maintenance; Produce Final Report; Review Project

CRISP-DM: Data Understanding
• Collect data
  • List the datasets acquired (locations, methods used to acquire, problems encountered and solutions achieved).
• Describe data
  • Check data volume and examine its gross properties.
  • Accessibility and availability of attributes. Attribute types, range, correlations, the identities.
  • Understand the meaning of each attribute and attribute value in business terms.
  • For each attribute, compute basic statistics (e.g., distribution, average, max, min, standard deviation, variance, mode, skewness).
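The per-attribute statistics named in the last bullet can be computed with the Python standard library alone. This is a sketch under stated assumptions: the `describe` helper and the `ages` values are hypothetical, and the variance, standard deviation, and skewness use population formulas (a sample-based variant would divide by n - 1 instead).

```python
import statistics
from collections import Counter


def describe(values):
    """Basic statistics for one numeric attribute (population formulas)."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    std = variance ** 0.5
    # Population skewness: third central moment divided by std cubed.
    if std > 0:
        skewness = sum((v - mean) ** 3 for v in values) / n / std ** 3
    else:
        skewness = 0.0  # a constant attribute has no asymmetry
    return {
        "distribution": Counter(values),  # frequency of each distinct value
        "min": min(values),
        "max": max(values),
        "mean": mean,
        "variance": variance,
        "std": std,
        "mode": statistics.mode(values),
        "skewness": skewness,
    }


ages = [23, 25, 25, 31, 46]  # hypothetical attribute values
summary = describe(ages)
for name, value in summary.items():
    print(name, value)
```

For this toy attribute the skewness is positive, reflecting the single large value (46) that stretches the right tail.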
CRISP-DM: Data Understanding (continued)
• Explore data
  • Analyze properties of interesting attributes in detail.
  • Distribution, relations between pairs or small numbers of attributes, properties of significant sub-populations, simple statistical analyses.
• Verify data quality
  • Identify special values and catalogue their meaning.
  • Does it cover all the cases required? Does it contain errors and how common are they?
  • Identify missing attributes and blank fields. Meaning of missing data.
  • Do the meanings of attributes and contained values fit together?
  • Check spelling of values (e.g., same value but sometimes beginning with a lower case letter, sometimes with an upper case letter).
  • Check for plausibility of values, e.g. all fields have the same or nearly the same values.

CRISP-DM: Data Preparation
• Select data
  • Reconsider data selection criteria.
  • Decide which dataset will be used.
  • Collect appropriate additional data (internal or external).
  • Consider use of sampling techniques.
  • Explain why certain data was included or excluded.
• Clean data
  • Correct, remove or ignore noise.
  • Decide how to deal with special values and their meaning (99 for marital status).
  • Aggregation level, missing values, etc.
  • Outliers?

CRISP-DM: Data Preparation (continued)
• Construct data
  • Derived attributes.
  • Background knowledge.
  • How can missing attributes be constructed or imputed?
• Integrate data
  • Integrate sources and store result (new tables and records).
• Format data
  • Rearranging attributes (some tools have requirements on the order of the attributes, e.g. first field being a unique identifier for each record or last field being the outcome field the model is to predict).
  • Reordering records (perhaps the modelling tool requires that the records be sorted according to the value of the outcome attribute).
  • Reformatting within values (these are purely syntactic changes made to satisfy the requirements of the specific modelling tool: removing illegal characters, changing uppercase/lowercase).

TYPES OF DATA AND BASIC STATISTICS
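The "Format data" tasks from CRISP-DM: Data Preparation (rearranging attributes, reordering records, reformatting within values), together with the special-value decision mentioned under "Clean data", can be sketched in a few lines of Python. Everything specific here is an assumption for illustration: the records, the field names, the required column order, the choice to treat 99 as a missing marital status, and the rule that only letters, digits, and spaces are legal characters.

```python
# Hypothetical records; field names and codes are illustrative assumptions.
records = [
    {"outcome": "yes", "name": " alice ", "marital_status": 1, "id": 3},
    {"outcome": "no", "name": "BOB;", "marital_status": 99, "id": 1},
    {"outcome": "no", "name": "Carol", "marital_status": 2, "id": 2},
]

# Assumed tool requirement: identifier first, outcome field last.
COLUMN_ORDER = ["id", "name", "marital_status", "outcome"]


def clean_text(value):
    """Within-value reformatting: trim, drop illegal characters, uppercase."""
    kept = "".join(ch for ch in value.strip() if ch.isalnum() or ch == " ")
    return kept.upper()


def format_record(rec):
    """Return a copy of the record that satisfies the assumed tool format."""
    fixed = dict(rec)
    fixed["name"] = clean_text(rec["name"])
    if fixed["marital_status"] == 99:  # one possible decision: 99 means missing
        fixed["marital_status"] = None
    # Rearranging attributes: rebuild the dict in the required column order.
    return {col: fixed[col] for col in COLUMN_ORDER}


formatted = [format_record(rec) for rec in records]
# Reordering records: sort by the outcome attribute, then by id.
formatted.sort(key=lambda rec: (rec["outcome"], rec["id"]))
for rec in formatted:
    print(rec)
```

The changes are purely syntactic, as the slide describes: no record is dropped and no value other than the special code changes its meaning.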