
Data Mining and Exploration: Preprocessing

Amos Storkey, School of Informatics, January 23, 2006
http://www.inf.ed.ac.uk/teaching/courses/dme/

These lecture slides are based extensively on previous versions of the course written by Chris Williams.


Data Preprocessing

Data preparation is a big issue for data mining. Cabena et al (1998) estimate that data preparation accounts for 60% of the effort in a data mining application.

◮ Data cleaning
◮ Data integration and transformation
◮ Data reduction

Reading: Han and Kamber, chapter 3


Why Data Preprocessing?

Data in the real world is dirty. It is:

◮ incomplete, e.g. lacking attribute values
◮ noisy, e.g. containing errors or outliers
◮ inconsistent, e.g. containing discrepancies in codes or names

GIGO: need quality data to get quality results


Major Tasks in Data Preprocessing

◮ Data cleaning
◮ Data integration
◮ Data transformation
◮ Data reduction

Figure from Han and Kamber: forms of data preprocessing, e.g. transformation of the attribute values 2, 32, 100, 59, 48 into 0.02, 0.32, 1.00, 0.59, 0.48, and reduction of a table with attributes A1, A2, A3, ..., A126 and transactions T1, T2, T3, T4, ..., T2000 to one with attributes A1, A3, ..., A115 and transactions T1, T4, ..., T1456.


Data Cleaning Tasks

◮ Handle missing values
◮ Identify outliers, smooth out noisy data
◮ Correct inconsistent data


Missing Data and Outliers

◮ What happens if input data is missing? Is it missing at random (MAR) or is there a systematic reason for its absence?
◮ Let x_m denote those values missing, and x_p those values that are present. If MAR, some “solutions” are:
  ◮ Model P(x_m | x_p) and average (correct, but hard)
  ◮ Replace data with its mean value (?)
  ◮ Look for similar (close) input patterns and use them to infer missing values (crude version of density model)
◮ Reference: Statistical Analysis with Missing Data, R. J. A. Little, D. B. Rubin, Wiley (1987)
◮ Outliers detected by clustering, or combined computer and human inspection
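As a rough illustration of the mean-value and nearest-neighbour “solutions” above, here is a minimal numpy sketch (not from the original slides; the function names and the choice k=2 are illustrative). It fills each missing entry either with its column mean or from the k rows closest on the jointly observed attributes, a crude stand-in for modelling P(x_m | x_p).

import numpy as np

def mean_impute(X):
    # Replace each NaN with the mean of the observed values in its column.
    X = X.copy()
    col_means = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]
    return X

def knn_impute(X, k=3):
    # Fill each missing entry from the k rows closest on the attributes
    # this row does have: a crude version of a density model.
    X_filled = mean_impute(X)            # provisional fill so distances are defined
    X_out = X.copy()
    for i, row in enumerate(X):
        miss = np.isnan(row)
        if not miss.any():
            continue
        d = np.sqrt(((X_filled[:, ~miss] - X_filled[i, ~miss]) ** 2).sum(axis=1))
        d[i] = np.inf                    # exclude the row itself
        neighbours = np.argsort(d)[:k]
        X_out[i, miss] = X_filled[neighbours][:, miss].mean(axis=0)
    return X_out

X = np.array([[2.0, 32.0], [np.nan, 100.0], [59.0, np.nan], [48.0, 20.0]])
print(knn_impute(X, k=2))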


Data Integration

Combines data from multiple sources into a coherent store

◮ Entity identification problem: identify real-world entities from multiple data sources, e.g. A.cust-id ≡ B.cust-num
◮ Detecting and resolving data value conflicts: for the same real-world entity, attribute values are different, e.g. measurement in different units
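A small pandas sketch of both problems (the table and column names here are invented for illustration): two sources keyed by cust_id and cust_num respectively, with the same attribute recorded in different units.

import pandas as pd

# Hypothetical sources: A keys customers by cust_id (height in cm),
# B keys the same customers by cust_num (height in inches).
a = pd.DataFrame({"cust_id": [1, 2], "height_cm": [170.0, 182.0]})
b = pd.DataFrame({"cust_num": [1, 2], "height_in": [66.9, 71.7]})

# Entity identification: A.cust_id and B.cust_num name the same real-world entity.
merged = pd.merge(a, b, left_on="cust_id", right_on="cust_num")

# Data value conflict: same attribute in different units; convert before comparing.
merged["height_in_as_cm"] = merged["height_in"] * 2.54
print(merged[["cust_id", "height_cm", "height_in_as_cm"]])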


Data Transformation

◮ Normalization, e.g. to zero mean, unit standard deviation:

  new data = (old data − mean) / (std deviation)

◮ or max-min normalization to [0, 1]:

  new data = (old data − min) / (max − min)

◮ Normalization useful for e.g. k nearest neighbours, or for neural networks
◮ New features constructed, e.g. with PCA or with hand-crafted features
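A minimal numpy sketch of the two normalizations above (the function names are illustrative, not from the slides):

import numpy as np

def zero_mean_unit_std(x):
    # new data = (old data - mean) / (std deviation)
    return (x - x.mean()) / x.std()

def min_max(x):
    # new data = (old data - min) / (max - min), mapping x into [0, 1]
    return (x - x.min()) / (x.max() - x.min())

x = np.array([2.0, 32.0, 100.0, 59.0, 48.0])
print(min_max(x))              # approximately [0.00, 0.31, 1.00, 0.58, 0.47]
print(zero_mean_unit_std(x))   # zero mean, unit standard deviation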


Data Reduction

◮ Feature selection: Select a minimum set of features x̃ from x so that:
  ◮ P(class | x̃) closely approximates P(class | x)
  ◮ The classification accuracy does not significantly decrease
◮ Data Compression (lossy)
  ◮ PCA, Canonical variates
◮ Sampling: choose a representative subset of the data
  ◮ Simple random sampling vs stratified sampling
◮ Hierarchical reduction: e.g. country-county-town
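As a sketch of lossy compression with PCA (not from the slides; a plain numpy/SVD version rather than any particular library routine), project the data onto the top few principal components and measure the reconstruction error:

import numpy as np

def pca_reduce(X, n_components):
    # Centre the data, take its SVD, and keep only the top principal directions.
    mean = X.mean(axis=0)
    Xc = X - mean
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = Xc @ Vt[:n_components].T                  # reduced representation
    X_approx = Z @ Vt[:n_components] + mean       # lossy reconstruction
    return Z, X_approx

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                    # stand-in for a 10-attribute data set
Z, X_approx = pca_reduce(X, n_components=3)
print(Z.shape)                                    # (200, 3)
print(np.mean((X - X_approx) ** 2))               # information lost by the compression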


Feature Selection

Usually done as part of supervised learning

◮ Stepwise strategies:
  ◮ (a) Forward selection: Start with no features. Add the one which is the best predictor. Then add a second one to maximize performance using the first feature and the new one; and so on until a stopping criterion is satisfied
  ◮ (b) Backward elimination: Start with all features; delete the one whose removal reduces performance least, recursively, until a stopping criterion is satisfied
◮ Forward selection is unable to anticipate interactions between features
◮ Backward elimination can suffer from problems of overfitting
◮ Both are heuristics to avoid considering all subsets of size k of the d features
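A minimal sketch of forward selection, strategy (a) above, assuming a scoring function supplied by the caller; the least-squares R² scorer and the stopping rule here are illustrative choices, and in practice the score would be measured on held-out data.

import numpy as np

def forward_selection(X, y, fit_score, max_features):
    # Greedily add the feature that most improves the score of a model
    # fitted on the currently selected set, until no candidate helps.
    selected, remaining = [], list(range(X.shape[1]))
    best_score = -np.inf
    while remaining and len(selected) < max_features:
        score, j = max((fit_score(X[:, selected + [j]], y), j) for j in remaining)
        if score <= best_score:          # stopping criterion: no improvement
            break
        best_score = score
        selected.append(j)
        remaining.remove(j)
    return selected

def lsq_r2(Xs, y):
    # Illustrative scorer: R^2 of an ordinary least-squares fit (with intercept).
    A = np.column_stack([Xs, np.ones(len(y))])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1.0 - resid.var() / y.var()

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
y = 2 * X[:, 0] - X[:, 3] + 0.1 * rng.normal(size=100)
print(forward_selection(X, y, lsq_r2, max_features=3))   # picks features 0 and 3 early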
