
Data Mining and Exploration: Preprocessing


  1. Amos Storkey, School of Informatics
     January 23, 2006
     http://www.inf.ed.ac.uk/teaching/courses/dme/
     These lecture slides are based extensively on previous versions of the course written by Chris Williams.

     Data Preprocessing
     Data preparation is a big issue for data mining. Cabena et al (1998) estimate that data preparation accounts for 60% of the effort in a data mining application.
     ◮ Data cleaning
     ◮ Data integration and transformation
     ◮ Data reduction
     Reading: Han and Kamber, chapter 3

     Why Data Preprocessing?
     Data in the real world is dirty. It is:
     ◮ incomplete, e.g. lacking attribute values
     ◮ noisy, e.g. containing errors or outliers
     ◮ inconsistent, e.g. containing discrepancies in codes or names
     GIGO: need quality data to get quality results

     Major Tasks in Data Preprocessing
     ◮ Data cleaning
     ◮ Data integration
     ◮ Data transformation
     ◮ Data reduction
     [Figure from Han and Kamber, with one panel per task: data cleaning; data integration; data transformation (2, 32, 100, 59, 48 becomes 0.02, 0.32, 1.00, 0.59, 0.48); data reduction (attributes A1, A2, A3, ..., A126 by transactions T1, T2, T3, T4, ..., T2000 reduced to attributes A1, A3, ..., A115 by transactions T1, T4, ..., T1456)]

  2. Data Cleaning Tasks
     ◮ Handle missing values
     ◮ Identify outliers, smooth out noisy data
     ◮ Correct inconsistent data

     Missing Data and Outliers
     ◮ What happens if input data is missing? Is it missing at random (MAR) or is there a systematic reason for its absence? Let x_m denote those values missing, and x_p those values that are present. If MAR, some “solutions” are
       ◮ Model P(x_m | x_p) and average (correct, but hard)
       ◮ Replace data with its mean value (?)
       ◮ Look for similar (close) input patterns and use them to infer missing values (crude version of density model)
     ◮ Reference: Statistical Analysis with Missing Data, R. J. A. Little, D. B. Rubin, Wiley (1987)
     ◮ Outliers detected by clustering, or combined computer and human inspection

     Data Integration
     Combines data from multiple sources into a coherent store
     ◮ Entity identification problem: identify real-world entities from multiple data sources, e.g. A.cust-id ≡ B.cust-num
     ◮ Detecting and resolving data value conflicts: for the same real-world entity, attribute values are different, e.g. measurement in different units

     Data Transformation
     ◮ Normalization, e.g. to zero mean, unit standard deviation
         new data = (old data − mean) / std deviation
       or max-min normalization to [0, 1]
         new data = (old data − min) / (max − min)
     ◮ Normalization useful for e.g. k nearest neighbours, or for neural networks
     ◮ New features constructed, e.g. with PCA or with hand-crafted features
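
Two of the missing-data strategies listed above, mean replacement and inferring a value from the closest observed pattern, can be sketched in a few lines of Python. This is a minimal illustration, not the course's own code: the function names `mean_impute` and `nearest_neighbour_impute`, the use of `None` to mark a missing entry, the Euclidean distance, and the toy data are all assumptions made for the example.

```python
import math

def mean_impute(rows):
    """Replace each missing entry (None) with the mean of the observed values in its column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        present = [r[j] for r in rows if r[j] is not None]
        means.append(sum(present) / len(present))
    return [[means[j] if r[j] is None else r[j] for j in range(n_cols)] for r in rows]

def nearest_neighbour_impute(rows):
    """Fill missing entries from the closest complete row, compared on the attributes that are present."""
    complete = [r for r in rows if None not in r]
    filled = []
    for r in rows:
        if None not in r:
            filled.append(list(r))
            continue
        present_idx = [j for j, v in enumerate(r) if v is not None]
        # Euclidean distance restricted to the present attributes (an assumed choice of metric).
        nearest = min(
            complete,
            key=lambda c: math.dist([r[j] for j in present_idx], [c[j] for j in present_idx]),
        )
        filled.append([nearest[j] if v is None else v for j, v in enumerate(r)])
    return filled

# Toy data (invented): the second attribute of the last row is missing.
data = [[1.0, 10.0], [2.0, 20.0], [9.0, 90.0], [2.1, None]]
mean_filled = mean_impute(data)
nn_filled = nearest_neighbour_impute(data)
```

On this toy data the mean fill ignores the fact that the incomplete row sits next to `[2.0, 20.0]`, whereas the nearest-pattern fill uses it, which is one reason the slide marks mean replacement with a question mark.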

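The two normalization formulas from the Data Transformation slide can be written out as a short Python sketch. The function names `z_score` and `min_max` are illustrative, and the use of the population standard deviation is an assumption, since the slide does not say which estimator is intended. The sample values are the ones shown in the Han and Kamber figure.

```python
import statistics

def z_score(values):
    """new data = (old data - mean) / std deviation, giving zero mean and unit standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population std deviation (assumed; the slide does not specify)
    return [(v - mean) / std for v in values]

def min_max(values):
    """new data = (old data - min) / (max - min), mapping the values onto [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

raw = [2, 32, 100, 59, 48]
standardized = z_score(raw)
scaled = min_max(raw)
```

Both functions assume the attribute is not constant, since a zero standard deviation or a zero range would cause a division by zero.
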
  3. Data Reduction
     ◮ Feature selection: Select a minimum set of features x̃ from x so that:
       ◮ P(class | x̃) closely approximates P(class | x)
       ◮ The classification accuracy does not significantly decrease
     ◮ Data Compression (lossy)
       ◮ PCA, Canonical variates
     ◮ Sampling: choose a representative subset of the data
       ◮ Simple random sampling vs stratified sampling
     ◮ Hierarchical reduction: e.g. country-county-town

     Feature Selection
     Usually as part of supervised learning
     ◮ Stepwise strategies
       ◮ (a) Forward selection: Start with no features. Add the one which is the best predictor. Then add a second one to maximize performance using first feature and new one; and so on until a stopping criterion is satisfied
       ◮ (b) Backwards elimination: Start with all features, delete the one which reduces performance least, recursively until a stopping criterion is satisfied
     ◮ Forward selection is unable to anticipate interactions
     ◮ Backward selection can suffer from problems of overfitting
     ◮ They are heuristics to avoid considering all subsets of size k of d features
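
The forward selection strategy described above can be sketched as a greedy loop. This is an illustrative outline under stated assumptions: `score` is a hypothetical callable that returns a performance estimate for a feature subset (in practice it might be cross-validated accuracy), `min_gain` is an assumed stopping criterion, and the toy `weights` table is invented purely to make the example runnable.

```python
def forward_selection(features, score, min_gain=1e-9):
    """Greedily add the feature that most improves score(subset); stop when no addition helps."""
    selected = []
    best_score = score(selected)
    remaining = list(features)
    while remaining:
        # Evaluate every remaining feature together with the ones already chosen.
        candidates = [(score(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(candidates, key=lambda t: t[0])
        if top_score - best_score <= min_gain:
            break  # assumed stopping criterion: no candidate improves the score
        selected.append(top_feature)
        remaining.remove(top_feature)
        best_score = top_score
    return selected, best_score

# Hypothetical scoring function: each feature contributes a fixed usefulness,
# and every feature used costs a small complexity penalty.
weights = {"A1": 0.5, "A2": 0.3, "A3": 0.01}

def toy_score(subset):
    return sum(weights[f] for f in subset) - 0.05 * len(subset)

chosen, final_score = forward_selection(["A1", "A2", "A3"], toy_score)
```

The loop evaluates on the order of d² subsets rather than all 2^d, which is the saving the slide refers to. Because each step only scores one added feature at a time, a pair of features that is useful only in combination would never be picked up, which is the interaction weakness noted above.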

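The contrast between simple random sampling and stratified sampling mentioned under Data Reduction can be illustrated with the standard library alone. This is a sketch under assumptions: the function names, the `key` argument that assigns each record to a stratum, the fixed seed, and the rule of keeping at least one record per stratum are choices made for the example, not details from the slides.

```python
import random
from collections import defaultdict

def simple_random_sample(records, fraction, seed=0):
    """Draw a fraction of the records uniformly at random, ignoring any grouping."""
    rng = random.Random(seed)
    return rng.sample(records, round(len(records) * fraction))

def stratified_sample(records, key, fraction, seed=0):
    """Draw the same fraction from every stratum, so small groups stay represented."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for r in records:
        strata[key(r)].append(r)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))  # assumed rule: at least one record per stratum
        sample.extend(rng.sample(group, k))
    return sample

# Toy data (invented): 90 records of a common class and 10 of a rare one.
records = [("common", i) for i in range(90)] + [("rare", i) for i in range(10)]
srs = simple_random_sample(records, 0.1)
strat = stratified_sample(records, key=lambda r: r[0], fraction=0.1)
rare_in_strat = sum(1 for label, _ in strat if label == "rare")
```

With stratification the rare class is guaranteed a place in the sample, whereas a simple random sample of the same size may contain no rare records at all.
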