
Data Preparation
(Data pre-processing)

  • Introduction to Data Preparation
  • Types of Data and Basic Statistics
  • Discretization of Continuous Variables
  • Working in the R Environment
  • Outliers
  • Data Transformation
  • Missing Data
  • Data Integration
  • Data Reduction

INTRODUCTION TO DATA PREPARATION

Why Prepare Data?

  • Some data preparation is needed for all mining tools
  • The purpose of preparation is to transform data sets so that their information content is best exposed to the mining tool
  • The error prediction rate should be lower (or the same) after preparation than before it

Why Prepare Data?

  • Preparing data also prepares the miner, so that when using prepared data the miner produces better models, faster
  • GIGO: good data is a prerequisite for producing effective models of any type

Why Prepare Data?

  • Data need to be formatted for a given software tool
  • Data need to be made adequate for a given method
  • Data in the real world is dirty
  • incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  • e.g., occupation=“”
  • noisy: containing errors or outliers
  • e.g., Salary=“-10”, Age=“222”
  • inconsistent: containing discrepancies in codes or names
  • e.g., Age=“42”, Birthday=“03/07/1997”
  • e.g., was rating “1,2,3”, now rating “A, B, C”
  • e.g., discrepancy between duplicate records
  • e.g., Endereço (address): travessa da Igreja de Nevogilde; Freguesia (parish): Paranhos

Major Tasks in Data Preparation

  • Data discretization
  • Part of data reduction but with particular importance, especially for numerical data
  • Data cleaning
  • Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
  • Data integration
  • Integration of multiple databases, data cubes, or files
  • Data transformation
  • Normalization and aggregation
  • Data reduction
  • Obtains a reduced representation in volume but produces the same or similar analytical results

Data Preparation as a Step in the Knowledge Discovery Process

[Figure: the KDD process, from the databases (DB) and data warehouse (DW), through Cleaning and Integration, Selection and Transformation, and Data Mining, to Evaluation and Presentation of Knowledge]

CRISP-DM

CRISP-DM is a comprehensive data mining methodology and process model that provides anyone, from novices to data mining experts, with a complete blueprint for conducting a data mining project. CRISP-DM breaks down the life cycle of a data mining project into six phases.

CRISP-DM Phases and Tasks

  • Business Understanding: Determine Business Objectives, Assess Situation, Determine Data Mining Goals, Produce Project Plan
  • Data Understanding: Collect Initial Data, Describe Data, Explore Data, Verify Data Quality
  • Data Preparation: Select Data, Clean Data, Construct Data, Integrate Data, Format Data
  • Modelling: Select Modeling Technique, Generate Test Design, Build Model, Assess Model
  • Evaluation: Evaluate Results, Review Process, Determine Next Steps
  • Deployment: Plan Deployment, Plan Monitoring & Maintenance, Produce Final Report, Review Project

CRISP-DM: Data Understanding

  • Collect data
  • List the datasets acquired (locations, methods used to acquire, problems encountered and solutions achieved).
  • Describe data
  • Check data volume and examine its gross properties.
  • Accessibility and availability of attributes. Attribute types, range, correlations, the identities.
  • Understand the meaning of each attribute and attribute value in business terms.
  • For each attribute, compute basic statistics (e.g., distribution, average, max, min, standard deviation, variance, mode, skewness).

CRISP-DM: Data Understanding

  • Explore data
  • Analyze properties of interesting attributes in detail.
  • Distribution, relations between pairs or small numbers of attributes, properties of significant sub-populations, simple statistical analyses.
  • Verify data quality
  • Identify special values and catalogue their meaning.
  • Does it cover all the cases required? Does it contain errors, and how common are they?
  • Identify missing attributes and blank fields. Meaning of missing data.
  • Do the meanings of attributes and contained values fit together?
  • Check spelling of values (e.g., same value but sometimes beginning with a lower-case letter, sometimes with an upper-case letter).
  • Check for plausibility of values, e.g. all fields have the same or nearly the same values.

CRISP-DM: Data Preparation

  • Select data
  • Reconsider data selection criteria.
  • Decide which dataset will be used.
  • Collect appropriate additional data (internal or external).
  • Consider use of sampling techniques.
  • Explain why certain data was included or excluded.
  • Clean data
  • Correct, remove or ignore noise.
  • Decide how to deal with special values and their meaning (e.g., 99 for marital status).
  • Aggregation level, missing values, etc.
  • Outliers?

CRISP-DM: Data Preparation

  • Construct data
  • Derived attributes.
  • Background knowledge.
  • How can missing attributes be constructed or imputed?
  • Integrate data
  • Integrate sources and store the result (new tables and records).
  • Format data
  • Rearranging attributes (some tools have requirements on the order of the attributes, e.g. the first field being a unique identifier for each record, or the last field being the outcome field the model is to predict).
  • Reordering records (perhaps the modelling tool requires that the records be sorted according to the value of the outcome attribute).
  • Reformatting within values (purely syntactic changes made to satisfy the requirements of the specific modelling tool, e.g. removing illegal characters, changing upper/lower case).

TYPES OF DATA AND BASIC STATISTICS

Types of Data Measurements

  • Measurements differ in their nature and the amount of information they give
  • Qualitative vs. Quantitative

Types of Measurements

  • Nominal scale (Qualitative)
  • Categorical scale (Qualitative)
  • Ordinal scale (Qualitative)
  • Interval scale (Quantitative)
  • Ratio scale (Quantitative)

Information content increases from the nominal scale down to the ratio scale. Values may be Discrete or Continuous.

Types of Measurements: Examples

  • Nominal:
  • ID numbers, names of people
  • Categorical:
  • eye color, zip codes
  • Ordinal:
  • rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}
  • Interval:
  • calendar dates, temperatures in Celsius or Fahrenheit, GRE score
  • Ratio:
  • temperature in Kelvin, length, time, counts

Types of Measurements: Examples

Numeric attributes:

Day Outlook  Temperature Humidity Wind   PlayTennis?
1   Sunny    85          85       Light  No
2   Sunny    80          90       Strong No
3   Overcast 83          86       Light  Yes
4   Rain     70          96       Light  Yes
5   Rain     68          80       Light  Yes
6   Rain     65          70       Strong No
7   Overcast 64          65       Strong Yes
8   Sunny    72          95       Light  No
9   Sunny    69          70       Light  Yes
10  Rain     75          80       Light  Yes
11  Sunny    75          70       Strong Yes
12  Overcast 72          90       Strong Yes
13  Overcast 81          75       Light  Yes
14  Rain     71          91       Strong No

The same data with categorical attributes:

Day Outlook  Temperature Humidity Wind   PlayTennis?
1   Sunny    Hot         High     Light  No
2   Sunny    Hot         High     Strong No
3   Overcast Hot         High     Light  Yes
4   Rain     Mild        High     Light  Yes
5   Rain     Cool        Normal   Light  Yes
6   Rain     Cool        Normal   Strong No
7   Overcast Cool        Normal   Strong Yes
8   Sunny    Mild        High     Light  No
9   Sunny    Cool        Normal   Light  Yes
10  Rain     Mild        Normal   Light  Yes
11  Sunny    Mild        Normal   Strong Yes
12  Overcast Mild        High     Strong Yes
13  Overcast Hot         Normal   Light  Yes
14  Rain     Mild        High     Strong No

Basic Statistics

Datasets: http://www.liaad.up.pt/~ltorgo/Regression/DataSets.html

Summary Statistics

[Figure: summary statistics computed in Excel]

Histograms

[Figure: histograms produced in SPSS]

Box Plots

[Figure: box plots produced in SPSS]

Data Conversion

  • Some tools can deal with nominal values but others need fields to be numeric
  • Convert ordinal fields to numeric to be able to use “>” and “<” comparisons on such fields:
  • A → 4.0
  • A- → 3.7
  • B+ → 3.3
  • B → 3.0
  • Multi-valued, unordered attributes with a small number of values
  • e.g. Color = Red, Orange, Yellow, …, Violet
  • for each value v create a binary “flag” variable C_v, which is 1 if Color=v, 0 otherwise
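The flag-variable construction above can be sketched in a few lines. A minimal Python illustration (the course tools are R, Excel and SPSS; the function name here is purely illustrative):

```python
def to_flags(values):
    """For each distinct value v, build the binary flag C_v:
    1 where the record equals v, 0 otherwise."""
    return {v: [1 if x == v else 0 for x in values]
            for v in sorted(set(values))}

colors = ["Red", "Orange", "Red", "Violet"]
flags = to_flags(colors)
# flags["Red"] == [1, 0, 1, 0]; flags["Violet"] == [0, 0, 0, 1]
```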

Conversion: Nominal, Many Values

  • Examples:
  • US State Code (50 values)
  • Profession Code (7,000 values, but only a few frequent)
  • Ignore ID-like fields whose values are unique for each record
  • For other fields, group values “naturally”:
  • e.g. 50 US States → 3 or 5 regions
  • Profession: select the most frequent ones, group the rest
  • Create binary flag-fields for the selected values

DISCRETIZATION OF CONTINUOUS VARIABLES

Discretization

  • Divide the range of a continuous attribute into intervals
  • Some methods require discrete values, e.g. most versions of Naïve Bayes, CHAID
  • Reduce data size by discretization
  • Prepare for further analysis
  • Discretization is very useful for generating a summary of data
  • Also called “binning”

Equal-width Binning

  • Divides the range into N intervals of equal size (range): uniform grid
  • If A and B are the lowest and highest values of the attribute, the width of the intervals will be: W = (B - A)/N

Temperature values: 64 65 68 69 70 71 72 72 75 75 80 81 83 85

Equal-width bins (Low <= value < High):

      [64,67) [67,70) [70,73) [73,76) [76,79) [79,82) [82,85]
Count    2       2       4       2       0       2       2
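The W = (B - A)/N rule is easy to check in code. A minimal Python sketch (a hypothetical helper, not from the slides) that reproduces the counts above:

```python
def equal_width_bins(values, n):
    """Assign each value an interval index in [0, n), width W = (B - A)/n.
    The maximum value B would land in bin n, so it is clamped into the last bin."""
    a, b = min(values), max(values)
    w = (b - a) / n
    return [min(int((v - a) // w), n - 1) for v in values]

temps = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
bins = equal_width_bins(temps, 7)
counts = [bins.count(i) for i in range(7)]
# counts == [2, 2, 4, 2, 0, 2, 2] for the bins [64,67) ... [82,85]
```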

Equal-width Binning

[Figure: histogram of salary in a corporation with equal-width bins from [0 - 200,000) up to [1,800,000 - 2,000,000]; almost all values fall in the first bin, with a count of 1 in the last]

Advantages: (a) simple and easy to implement; (b) produces a reasonable abstraction of data
Disadvantages: (a) unsupervised; (b) where does N come from?; (c) sensitive to outliers

Equal-depth (or height) Binning

  • Divides the range into N intervals, each containing approximately the same number of samples
  • Generally preferred because it avoids clumping
  • In practice, “almost-equal” height binning is used to give more intuitive breakpoints
  • Additional considerations:
  • don’t split frequent values across bins
  • create separate bins for special values (e.g. 0)
  • readable breakpoints (e.g. round breakpoints)

Equal-depth (or height) Binning

Temperature values: 64 65 68 69 70 71 72 72 75 75 80 81 83 85

Equal height = 4, except for the last bin:

      [64 .. 69] [70 .. 72] [73 .. 81] [83 .. 85]
Count     4          4          4          2
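A naive equal-depth split can be sketched as below. This is a minimal Python stand-in (names are mine): it cuts the sorted values into runs of nearly equal count, whereas the slide's bins are additionally hand-adjusted to avoid splitting tied values, which is why its last bin holds only 2 values.

```python
def equal_depth_bins(values, n):
    """Naive equal-depth binning: sort, then cut into n runs of (nearly) equal
    count. Practical tools also adjust breakpoints (keep ties together, round)."""
    s = sorted(values)
    step = len(s) / n
    return [s[int(i * step):int((i + 1) * step)] for i in range(n)]

temps = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
bins = equal_depth_bins(temps, 4)
# bin sizes: [3, 4, 3, 4] -- close to, but not identical to, the slide's
# breakpoint-adjusted [4, 4, 4, 2]
```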

Discretization Considerations

  • Class-independent methods
  • Equal-width is simpler, good for many classes
  • can fail miserably for unequal distributions
  • Equal-height gives better results
  • Class-dependent methods can be better for classification
  • Decision tree methods build discretization on the fly
  • Naïve Bayes requires initial discretization
  • Many other methods exist …

Method 1R

  • Developed by Holte (1993).
  • It is a supervised discretization method using binning.
  • After sorting the data, the range of continuous values is divided into a number of disjoint intervals and the boundaries of those intervals are adjusted based on the class labels associated with the values of the feature.
  • Each interval should contain a given minimum of instances (6 by default), with the exception of the last one.
  • The adjustment of a boundary continues until the next value belongs to a class different from the majority class in the adjacent interval.

1R Example

Each interval contains at least 6 elements. The adjustment of a boundary continues until the next value belongs to a class different from the majority class in the adjacent interval.

Var:      65 78 79 79 81 81 82 82 82 82 82 82 83 83 83 83 83 84 84 84 84 84 84 84 84 84 85 85 85 85 85
Class:     2  1  2  2  2  1  1  2  1  2  2  2  2  1  2  2  2  1  2  2  1  1  2  2  1  1  1  2  2  2  2
Interval:  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  2  2  2  2  2  2  2  3  3  3  3

Exercise

  • Discretize the following values using EW and ED binning:
  • 13, 15, 16, 16, 19, 20, 21, 22, 22, 25, 30, 33, 35, 35, 36, 40, 45

Entropy-Based Discretization

Class-dependent (classification):

  1. Sort the examples in increasing order
  2. Each value forms an interval (‘m’ intervals)
  3. Calculate the entropy measure of this discretization
  4. Find the binary split boundary T that minimizes the entropy function over all possible boundaries. The split is selected as a binary discretization:

     E(S,T) = (|S1|/|S|) Ent(S1) + (|S2|/|S|) Ent(S2)

  5. Apply the process recursively until some stopping criterion is met, e.g.

     Ent(S) - E(T,S) < δ

Entropy

For two classes, Ent = -p log2 p - (1-p) log2 (1-p); the maximum is log2(2) = 1:

  p    1-p   Ent
  0.2  0.8   0.72
  0.4  0.6   0.97
  0.5  0.5   1
  0.6  0.4   0.97
  0.8  0.2   0.72

For three classes the maximum is log2(3) ≈ 1.58:

  p1    p2    p3    Ent
  0.1   0.1   0.8   0.92
  0.2   0.2   0.6   1.37
  0.1   0.45  0.45  1.37
  0.2   0.4   0.4   1.52
  0.3   0.3   0.4   1.57
  0.33  0.33  0.33  1.58

In general, Ent = -Σc pc log2 pc.

Entropy/Impurity

  • S - training set; C1,...,CN - classes
  • Entropy E(S) - measure of the impurity in a group of examples
  • pc - proportion of Cc in S

  Impurity(S) = -Σ (c=1..N) pc log2 pc

Impurity

[Figure: three groups of examples, from a very impure group, through a less impure one, to one with minimum impurity]

An Example of Entropy Discretization

Temp. Play?
64    Yes
65    No
68    Yes
69    Yes
70    Yes
71    No
72    No
72    Yes
75    Yes
75    Yes
80    No
81    Yes
83    Yes
85    No

Test split at temp < 71.5 (below: 4 yes, 2 no; above: 5 yes, 3 no):

  Ent_split(71.5) = 6/14 (-4/6 log2 4/6 - 2/6 log2 2/6)
                  + 8/14 (-5/8 log2 5/8 - 3/8 log2 3/8) = 0.939

Test split at temp < 77 (below: 7 yes, 3 no; above: 2 yes, 2 no):

  Ent_split(77) = 10/14 (-7/10 log2 7/10 - 3/10 log2 3/10)
                + 4/14 (-2/4 log2 2/4 - 2/4 log2 2/4) = 0.915
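The two split evaluations above can be reproduced programmatically. A small Python sketch (function names are mine, not the course's):

```python
from math import log2

def ent(labels):
    """Entropy -sum p_c log2 p_c over the class labels in a list."""
    n = len(labels)
    return -sum((labels.count(v) / n) * log2(labels.count(v) / n)
                for v in set(labels))

def split_ent(pairs, t):
    """Weighted entropy E(S, T) of a binary split of (value, class) pairs at t."""
    left = [c for v, c in pairs if v < t]
    right = [c for v, c in pairs if v >= t]
    n = len(pairs)
    return len(left) / n * ent(left) + len(right) / n * ent(right)

data = [(64, "Y"), (65, "N"), (68, "Y"), (69, "Y"), (70, "Y"), (71, "N"),
        (72, "N"), (72, "Y"), (75, "Y"), (75, "Y"), (80, "N"), (81, "Y"),
        (83, "Y"), (85, "N")]
round(split_ent(data, 71.5), 3)  # 0.939
round(split_ent(data, 77), 3)    # 0.915
```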

An Example (cont.)

The method tests all split possibilities and chooses the split with the smallest entropy. In the first iteration a split at 84 is chosen. The two resulting branches are processed recursively.

[Figure: the sorted Temp./Play? table with the successive split points marked, from the 1st split at 84 down to the 6th split]

The fact that recursion only occurs in the first interval in this example is an artifact. In general both intervals have to be split.

The Stopping Criterion

The previous slide did not take the stopping criterion into account. Recursive splitting of a set S of N instances stops when

  Ent(S) - E(T,S) < δ(T,S)

where

  δ(T,S) = log2(N-1)/N + [log2(3^c - 2) - (c Ent(S) - c1 Ent(S1) - c2 Ent(S2))]/N

  c  is the number of classes in S
  c1 is the number of classes in S1
  c2 is the number of classes in S2

This is called the Minimum Description Length Principle (MDLP).

Exercise

  • Compute the gain of splitting this data in half

Humidity Play
65       Yes
70       No
70       Yes
70       Yes
70       Yes
75       Yes
80       Yes
80       Yes
80       Yes
85       No
86       Yes
90       No
90       Yes
91       No
95       No
96       Yes

WORKING IN THE R ENVIRONMENT

Brief Introduction to R

  • http://www.r-project.org/
  • http://cran.r-project.org/doc/contrib/Short-refcard.pdf
  • Examples of expressions:
  • 3+5*6
  • a <- 2+2 (assigns the result of an expression to a variable)
  • 3^(3+2)
  • b <- 1:10 (defines a sequence)
  • b*3
  • log(b)
  • b+2
  • seq(1,15,2) (defines a sequence)

More R Examples

  • ?log - help on a function
  • help.search("clustering")
  • objects() - lists existing objects
  • rm(obj1, obj2,...) - removes existing objects
  • str(obj) - displays the internal structure of an object
  • Menu “File; Change dir...”
  • dir()
  • v <- c(1,2,3,4,5) - defines a vector
  • m <- matrix(c(1,2,3,4),2,2) - defines a 2x2 matrix
  • a <- array(1:8, c(2,2,2)) - defines a 2x2x2 array
  • m*2
  • m[1,1]
  • m[1,]

The California Housing Dataset in R

  • File/Change dir - to the directory with the dataset
  • cal_housing <- read.table("aula_02_dataset_california.txt")
  • cal_housing[1:10,] - first 10 rows
  • cal_housing <- read.table("aula_02_dataset_california.txt", header = TRUE) - with headers
  • summary(cal_housing) - summary statistics
  • hist(cal_housing$totalRooms) - histogram
  • hist(cal_housing[,4:4])
  • pairs(cal_housing[,3:8]) - scatters for pairs of variables
  • plot(cal_housing$population,cal_housing$households) - scatter of 2 vars
  • cor(cal_housing[,3:8]) - correlation matrix
  • boxplot(cal_housing[,3:8]) - boxplots

Discretization with R

  • Load dataset
  • data <- read.table("aula_02_1R_exemplo.txt")
  • Load the data preparation package
  • library(dprep)
  • Equal width
  • disc_data_ew <- disc.ew(data,1:1)
  • disc_data_ew
  • Equal depth
  • disc_data_ef <- disc.ef(data,1:1,3)
  • disc_data_ef
  • Holte 1R
  • disc_data_1r <- disc.1r(data,1:1,6)
  • disc_data_1r
  • Entropy
  • disc_data_ent <- disc.mentr(data,1:2)
  • disc_data_ent

OUTLIERS

Outliers

  • Outliers are values thought to be out of range.
  • “An outlier is an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism”
  • Can be detected by standardizing observations and labelling the standardized values outside a predetermined bound as outliers
  • Outlier detection can be used for fraud detection or data cleaning
  • Approaches:
  • do nothing
  • enforce upper and lower bounds
  • let binning handle the problem

Outlier Detection

  • Univariate
  • Compute the mean and standard deviation s. For k=2 or 3, x is an outlier if it falls outside the limits (a normal distribution is assumed):

    (x̄ - ks, x̄ + ks)

  • Boxplot (IQR = Q3 - Q1, the inter-quartile range): an observation is an extreme outlier if it lies outside the interval (Q1 - 3·IQR, Q3 + 3·IQR), and is declared a mild outlier if it lies outside the interval (Q1 - 1.5·IQR, Q3 + 1.5·IQR).
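Both univariate rules translate directly into code. A minimal Python sketch (the rank-based quartile estimate is deliberately simplistic; real statistics packages interpolate):

```python
def zscore_outliers(xs, k=2):
    """Values outside (mean - k*s, mean + k*s)."""
    n = len(xs)
    mean = sum(xs) / n
    s = (sum((x - mean) ** 2 for x in xs) / n) ** 0.5
    return [x for x in xs if abs(x - mean) > k * s]

def iqr_outliers(xs, factor=1.5):
    """Values outside (Q1 - factor*IQR, Q3 + factor*IQR); factor=1.5 flags
    mild outliers, factor=3 extreme ones. Quartiles estimated by rank."""
    s = sorted(xs)
    q1, q3 = s[len(s) // 4], s[(3 * len(s)) // 4]
    iqr = q3 - q1
    return [x for x in xs if x < q1 - factor * iqr or x > q3 + factor * iqr]

sample = [10, 12, 11, 13, 12, 11, 10, 12, 95]
zscore_outliers(sample)  # [95]
iqr_outliers(sample)     # [95]
```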

Outlier Detection

  • Multivariate
  • Clustering
  • Very small clusters are outliers
  • Distance-based
  • An instance with very few neighbors within a radius λ is regarded as an outlier

[Figure: a bi-dimensional outlier that is not an outlier in either of its projections]

DATA TRANSFORMATION

Data Transformation

  • Smoothing: remove noise from data (binning, regression, clustering)
  • Aggregation: summarization, data cube construction
  • Generalization: concept hierarchy climbing
  • Attribute/feature construction
  • New attributes constructed from the given ones (e.g. add an attribute area based on height and width)
  • Normalization
  • Scale values to fall within a smaller, specified range

Data Cube Aggregation

  • Data can be aggregated so that the resulting data summarize, for example, sales per year instead of sales per quarter.
  • Reduced representation which contains all the relevant information if we are concerned with the analysis of yearly sales

Concept Hierarchies

  Country - State - County - City

Jobs, food classification, time measures...

Normalization

  • For distance-based methods, normalization helps to prevent attributes with large ranges from out-weighing attributes with small ranges
  • min-max normalization
  • z-score normalization
  • normalization by decimal scaling

Normalization

  • min-max normalization (in R: mmnorm(data,minval=0,maxval=1)):

    v' = (v - min_v)/(max_v - min_v) · (new_max - new_min) + new_min

  • z-score normalization (in R: boxplot(znorm(cal_housing[,3:8]))):

    v' = (v - mean_v)/stdev_v

  • normalization by decimal scaling:

    v' = v / 10^j, where j is the smallest integer such that Max(|v'|) < 1

    Example: for a range of -986 to 917, j = 3, so -986 → -0.986 and 917 → 0.917
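The three formulas can be sketched in Python (the slides call the R functions mmnorm and znorm from dprep; these stand-ins are mine):

```python
def min_max(xs, new_min=0.0, new_max=1.0):
    """v' = (v - min)/(max - min) * (new_max - new_min) + new_min"""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) * (new_max - new_min) + new_min for x in xs]

def z_score(xs):
    """v' = (v - mean) / stdev (population standard deviation)."""
    n = len(xs)
    m = sum(xs) / n
    sd = (sum((x - m) ** 2 for x in xs) / n) ** 0.5
    return [(x - m) / sd for x in xs]

def decimal_scale(xs):
    """v' = v / 10^j for the smallest j such that max(|v'|) < 1."""
    j = 0
    while max(abs(x) for x in xs) / 10 ** j >= 1:
        j += 1
    return [x / 10 ** j for x in xs]

min_max([1, 2, 3])          # [0.0, 0.5, 1.0]
decimal_scale([-986, 917])  # [-0.986, 0.917], as in the slide's example
```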

MISSING DATA

Missing Data

  • Data is not always available
  • E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
  • Missing data may be due to
  • equipment malfunction
  • inconsistent with other recorded data and thus deleted
  • data not entered due to misunderstanding
  • certain data may not be considered important at the time of entry
  • not registering history or changes of the data
  • Missing data may need to be inferred.
  • Missing values may carry some information content: e.g. a credit application may carry information by noting which field the applicant did not complete

Missing Values

  • There are always MVs in a real dataset
  • MVs may have an impact on modelling; in fact, they can destroy it!
  • Some tools ignore missing values, others use some metric to fill in replacements
  • The modeller should avoid default automated replacement techniques
  • Difficult to know limitations, problems and introduced bias
  • Replacing missing values without elsewhere capturing that information removes information from the dataset

How to Handle Missing Data?

  • Ignore records (use only cases with all values)
  • Usually done when the class label is missing, as most prediction methods do not handle missing data well
  • Not effective when the percentage of missing values per attribute varies considerably, as it can lead to insufficient and/or biased sample sizes
  • Ignore attributes with missing values
  • Use only features (attributes) with all values (may leave out important features)
  • Fill in the missing value manually
  • tedious + infeasible?

How to Handle Missing Data?

  • Use a global constant to fill in the missing value
  • e.g., “unknown”. (May create a new class!)
  • Use the attribute mean to fill in the missing value
  • It will do the least harm to the mean of the existing data
  • If the mean is to be unbiased
  • What if the standard deviation is to be unbiased?
  • Use the attribute mean for all samples belonging to the same class to fill in the missing value

How to Handle Missing Data?

  • Use the most probable value to fill in the missing value
  • Inference-based, such as a Bayesian formula or a decision tree
  • Identify relationships among variables
  • Linear regression, multiple linear regression, nonlinear regression
  • Nearest-neighbour estimator
  • Find the k neighbours nearest to the point and fill in the most frequent value or the average value
  • Finding neighbours in a large dataset may be slow

Nearest-Neighbour

[Figure: nearest-neighbour imputation illustrated]
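The class-conditional mean strategy from the previous slide can be sketched as follows (a minimal Python stand-in for what the course later does with R's ce.impute; all names here are hypothetical):

```python
def impute_class_mean(rows):
    """rows: (value-or-None, class) pairs. Fill each missing value with the
    mean of the observed values belonging to the same class."""
    observed = {}
    for v, c in rows:
        if v is not None:
            observed.setdefault(c, []).append(v)
    means = {c: sum(vs) / len(vs) for c, vs in observed.items()}
    return [(means[c] if v is None else v, c) for v, c in rows]

rows = [(10, "a"), (None, "a"), (30, "a"), (100, "b"), (None, "b")]
impute_class_mean(rows)
# -> [(10, "a"), (20.0, "a"), (30, "a"), (100, "b"), (100.0, "b")]
```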

How to Handle Missing Data?

  • Note that it is as important to avoid adding bias and distortion to the data as it is to make the information available.
  • Bias is added when a wrong value is filled in
  • No matter what techniques you use to conquer the problem, it comes at a price. The more guessing you have to do, the further away from the real data the database becomes. This, in turn, can affect the accuracy and validation of the mining results.

Missing Data with R

  • library(dprep)
  • data(hepatitis) - loads dataset
  • str(hepatitis) - gives the dataset structure
  • summary(hepatitis)
  • short_hep <- hepatitis[1:15,]
  • ?ce.impute - gives information about the fill-missing-values method
  • res <- ce.impute(short_hep,"median",19)
  • ce.impute(hepatitis,"median",1:19)
  • ce.impute(hepatitis,"knn",k1=10)
  • clean() - eliminates rows and columns that have more than the set limit of missings
  • clean(res,0.3,0.2)
  • imagmiss(hepatitis) - gives the percentage of missing values

DATA INTEGRATION

Data Integration

  • Turn a collection of pieces of information into an integrated and consistent whole
  • Detecting and resolving data value conflicts
  • For the same real-world entity, attribute values from different sources may be different
  • Which source is more reliable?
  • Is it possible to induce the correct value?
  • Possible reasons: different representations, different scales, e.g., metric vs. British units

Data integration requires knowledge of the “business”

Types of Inter-schema Conflicts

  • Classification conflicts
  • Corresponding types describe different sets of real-world elements.
  • DB1: authors of journal and conference papers; DB2: authors of conference papers only.
  • Generalization / specialization hierarchy
  • Descriptive conflicts
  • naming conflicts: synonyms, homonyms
  • cardinalities: first name - one, two, N values
  • domains: salary: $, Euro ...; student grade: [0 : 20], [1 : 5]
  • Solution depends upon the type of the descriptive conflict

Data Type Inconsistency Example

  • 1999 Sep 23: The $125 million Mars Climate Orbiter was presumed lost after it hit the Martian atmosphere. The crash was later blamed on navigation confusion due to 2 teams using conflicting English and metric units.
  • http://en.wikipedia.org/wiki/Mars_Climate_Orbiter

Types of Inter-schema Conflicts

  • Structural conflicts
  • DB1: Book is a class; DB2: books is an attribute of Author
  • Choose the less constrained structure (Book is a class)
  • Fragmentation conflicts
  • DB1: class Road_segment; DB2: classes Way_segment, Separator
  • Aggregation relationship

H dli R d d i D t I t ti Handling Redundancy in Data Integration

  • Redundant data occur often when integrating databases
  • The same attribute may have different names in different databases

y

  • False predictors are fields correlated to target behavior, which

describe events that happen at the same time or after the target b h i behavior

  • Example: Service cancellation date is a leaker when predicting attriters
  • One attribute may be a “derived” attribute in another table, e.g.,

annual revenue

  • For numerical attributes, redundancy may be detected by correlation analysis:

$$r_{XY} = \frac{\sum_{n=1}^{N} (x_n - \bar{x})(y_n - \bar{y})}{\sqrt{\sum_{n=1}^{N} (x_n - \bar{x})^2}\,\sqrt{\sum_{n=1}^{N} (y_n - \bar{y})^2}}$$

80
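The correlation check above can be sketched in plain Python; `pearson_r` is an illustrative helper name, not a library function:

```python
import math

def pearson_r(xs, ys):
    """Sample correlation coefficient r_XY between two numeric attributes."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# A "derived" attribute (annual = 12 * monthly revenue) is perfectly
# correlated with its source, so one of the two can be dropped:
monthly = [1000.0, 1500.0, 2000.0, 2500.0]
annual = [12 * m for m in monthly]
print(round(pearson_r(monthly, annual), 6))  # → 1.0
```

Attribute pairs with |r| near 1 are candidates for removal of one attribute.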

slide-21
SLIDE 21

Scatter Matrix

81

(Almost) Automated False Predictor Detection

  • For each field
  • Build 1-field decision trees for each field
  • (or compute correlation with the target field)
  • Rank all suspects by 1-field prediction accuracy (or correlation)
  • Remove suspects whose accuracy is close to 100% (Note: the threshold is domain dependent)

  • Verify top “suspects” with domain expert

82
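A minimal sketch of the ranking step, using correlation with the target as the 1-field score; the function name `rank_suspects` and the 0.95 threshold are illustrative assumptions:

```python
import math

def pearson_r(xs, ys):
    """Sample correlation between a field and the target."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0

def rank_suspects(fields, target, threshold=0.95):
    """Rank fields by |correlation| with the target and flag near-perfect
    ones as false-predictor suspects; the threshold is domain dependent."""
    ranked = sorted(fields, key=lambda f: -abs(pearson_r(fields[f], target)))
    flagged = [f for f in ranked
               if abs(pearson_r(fields[f], target)) >= threshold]
    return ranked, flagged

target = [1, 1, 1, 0, 0, 0]
fields = {
    "has_cancel_date": [1, 1, 1, 0, 0, 0],   # leaker: known only after attrition
    "tenure_months":   [3, 40, 2, 40, 36, 50],
}
ranked, flagged = rank_suspects(fields, target)
print(flagged)  # → ['has_cancel_date'] — verify with a domain expert
```

The flagged fields are only "suspects": the final removal decision still needs a domain expert, as the slide says.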

DATA REDUCTION

83

Data Reduction

  • Selecting Most Relevant Attributes
  • If there are too many attributes, select a subset that is most relevant (according to your knowledge of the business).
  • Select top N fields using 1-field predictive accuracy, as computed for detecting false predictors.

  • Attribute Numerosity Reduction
  • Parametric methods
  • Assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers); e.g., regression
  • Non-parametric methods
  • Do not assume models
  • Major families: histograms, clustering, sampling

84
slide-22
SLIDE 22

Clustering

  • Partitioning a data set into clusters makes it possible to store cluster representations only
  • Can be very effective if data is clustered, but not if data is “smeared”
  • There are many choices of clustering definitions and clustering algorithms, further detailed in next lessons

85
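As an illustration of storing cluster representations only, here is a toy 1-D k-means sketch (an assumption for demonstration, not how the later lessons' algorithms are defined): six values are reduced to two centroids.

```python
def kmeans_1d(values, k=2, iters=20):
    """Tiny 1-D k-means: after clustering, only the k centroids
    (optionally with counts) need to be stored, not the raw values."""
    # Seed centroids with evenly spaced sorted values
    centroids = sorted(values)[::max(1, len(values) // k)][:k]
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

data = [1.0, 1.2, 0.9, 10.0, 10.3, 9.8]
print(sorted(kmeans_1d(data, k=2)))  # two centroids ≈ [1.03, 10.03]
```

This works well here because the data falls into two tight groups; on "smeared" data the centroids would summarize the values poorly.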

Histograms

  • A popular data reduction technique
  • Divide data into buckets and store the average (or sum) for each bucket
  • Can be constructed optimally in one dimension using dynamic programming: the optimal histogram has minimum variance, where histogram variance is a weighted sum of the variance of the source values in each bucket

[Figure: bar chart of an example histogram over values 10000 to 90000]

86
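The bucket-and-summarize idea can be sketched as follows (equi-width buckets; `equi_width_histogram` is an assumed helper name, and optimal variance-minimizing bucket boundaries would need the dynamic program mentioned above):

```python
def equi_width_histogram(values, n_buckets):
    """Reduce data to n_buckets (lo, hi, count, mean) summaries."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_buckets
    buckets = [[] for _ in range(n_buckets)]
    for v in values:
        i = min(int((v - lo) / width), n_buckets - 1)  # clamp max value
        buckets[i].append(v)
    return [(lo + i * width, lo + (i + 1) * width,
             len(b), sum(b) / len(b) if b else None)
            for i, b in enumerate(buckets)]

salaries = [12000, 18000, 35000, 41000, 52000, 58000, 71000, 88000]
for lo_b, hi_b, count, mean in equi_width_histogram(salaries, 4):
    print(f"[{lo_b:.0f}, {hi_b:.0f}): n={count}, mean={mean}")
```

Eight raw values are reduced to four bucket summaries; queries can then be answered approximately from the summaries alone.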

Increasing Dimensionality

  • In some circumstances the dimensionality of a variable needs to be increased:
  • Color: from a category list to the RGB values
  • ZIP codes: from a category list to latitude and longitude

87

Sampling

  • The cost of sampling is proportional to the sample size and not to the original dataset size; therefore, a mining algorithm’s complexity is potentially sub-linear in the size of the data
  • Choose a representative subset of the data
  • Simple random sampling (SRS) (with or without replacement)
  • Stratified sampling:
  • Approximate the percentage of each class (or subpopulation of interest) in the overall database
  • Used in conjunction with skewed data

88
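A minimal stratified-sampling sketch (the helper name and the 30% fraction are illustrative): each class is sampled separately so the sample preserves the class proportions of the full data.

```python
import random

def stratified_sample(rows, key, frac, seed=0):
    """Sample frac of each class so class proportions match the full data."""
    rng = random.Random(seed)
    strata = {}
    for row in rows:
        strata.setdefault(key(row), []).append(row)
    sample = []
    for members in strata.values():
        k = max(1, round(frac * len(members)))  # keep at least one per class
        sample.extend(rng.sample(members, k))
    return sample

# Skewed data: 97% stay, 3% attrite — a plain SRS of 30 rows could
# easily miss the attriters entirely; stratification keeps them.
rows = [("stay",)] * 97 + [("attrite",)] * 3
s = stratified_sample(rows, key=lambda r: r[0], frac=0.3)
print(len(s), sum(1 for r in s if r[0] == "attrite"))  # → 30 1
```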

slide-23
SLIDE 23

Unbalanced Target Distribution

  • Sometimes, classes have very unequal frequency
  • Attrition prediction: 97% stay, 3% attrite (in a month)
  • Medical diagnosis: 90% healthy, 10% disease
  • eCommerce: 99% don’t buy, 1% buy
  • Security: >99.99% of Americans are not terrorists
  • Similar situation with multiple classes
  • Majority class classifier can be 97% correct, but useless

89

Handling Unbalanced Data

  • With two classes: let positive targets be a minority
  • Separate a raw held-aside set (e.g. 30% of data) and a raw train set
  • Put aside the raw held-aside set and don’t use it till the final model
  • Select remaining positive targets (e.g. 70% of all targets) from raw train
  • Join with an equal number of negative targets from raw train, and randomly sort it
  • Separate the randomized balanced set into balanced train and balanced test

90
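The steps above can be sketched as follows; the function name and split fractions are illustrative assumptions, not a library API:

```python
import random

def build_balanced_sets(rows, is_target, held_frac=0.3, train_frac=0.7, seed=0):
    """Hold aside raw data, then join sampled positives with an equal
    number of negatives, shuffle, and split into balanced train/test."""
    rng = random.Random(seed)
    rows = rows[:]
    rng.shuffle(rows)
    n_held = int(held_frac * len(rows))
    raw_held, raw_train = rows[:n_held], rows[n_held:]
    pos = [r for r in raw_train if is_target(r)]
    neg = [r for r in raw_train if not is_target(r)]
    pos_sample = rng.sample(pos, int(train_frac * len(pos)))
    balanced = pos_sample + rng.sample(neg, len(pos_sample))  # 50/50 classes
    rng.shuffle(balanced)
    cut = int(0.7 * len(balanced))
    return balanced[:cut], balanced[cut:], raw_held

rows = [("Y",)] * 30 + [("N",)] * 970
train, test, held = build_balanced_sets(rows, is_target=lambda r: r[0] == "Y")
print(len(train), len(test), len(held))
```

The raw held-aside set is untouched by balancing, so it can still estimate the final model's accuracy on the true class distribution.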

Building Balanced Train Sets

[Diagram: targets (Y) and an equal number of non-targets (N), drawn by SRS so the balanced set has the same percentage of Y and N, are joined and randomly sorted; a 70/30 split yields the balanced train and balanced test sets, while the raw held-aside set is kept for estimating the accuracy of the final model]

91

Summary

  • Every real world data set needs some kind of data pre-processing
  • Deal with missing values
  • Correct erroneous values
  • Select relevant attributes
  • Adapt data set format to the software tool to be used
  • In general, data pre-processing consumes more than 60% of a data mining project effort

92

slide-24
SLIDE 24

References

  • ‘Data Preparation for Data Mining’, Dorian Pyle, 1999
  • ‘Data Mining: Concepts and Techniques’, Jiawei Han and Micheline Kamber, 2000
  • ‘Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations’, Ian H. Witten and Eibe Frank, 1999
  • ‘Data Mining: Practical Machine Learning Tools and Techniques, second edition’, Ian H. Witten and Eibe Frank, 2005
  • DM: Introduction: Machine Learning and Data Mining, Gregory Piatetsky-Shapiro and Gary Parker (http://www.kdnuggets.com/data_mining_course/dm1-introduction-ml-data-mining.ppt)
  • ESMA 6835 Mineria de Datos (http://math.uprm.edu/~edgar/dm8.ppt)

93

Thank you !!!

94