
Data Mining and Exploration: Preprocessing

Amos Storkey, School of Informatics, January 23, 2006
http://www.inf.ed.ac.uk/teaching/courses/dme/

These lecture slides are based extensively on previous versions of the course written by Chris Williams.


Data Preprocessing

Data preparation is a big issue for data mining. Cabena et al (1998) estimate that data preparation accounts for 60% of the effort in a data mining application.

◮ Data cleaning
◮ Data integration and transformation
◮ Data reduction

Reading: Han and Kamber, chapter 3


Why Data Preprocessing?

Data in the real world is dirty. It is:

◮ incomplete, e.g. lacking attribute values
◮ noisy, e.g. containing errors or outliers
◮ inconsistent, e.g. containing discrepancies in codes or names

GIGO: need quality data to get quality results


Major Tasks in Data Preprocessing

◮ Data cleaning
◮ Data integration
◮ Data transformation
◮ Data reduction

Figure from Han and Kamber: forms of data preprocessing, e.g. transformation of the attribute values 2, 32, 100, 59, 48 into 0.02, 0.32, 1.00, 0.59, 0.48, and reduction of a table with attributes A1, A2, A3, ..., A126 and transactions T1, T2, T3, T4, ..., T2000 to one with attributes A1, A3, ..., A115 and transactions T1, T4, ..., T1456.


Data Cleaning Tasks

◮ Handle missing values
◮ Identify outliers, smooth out noisy data
◮ Correct inconsistent data


Missing Data and Outliers

◮ What happens if input data is missing? Is it missing at random (MAR) or is there a systematic reason for its absence?
◮ Let x_m denote those values missing, and x_p those values that are present. If MAR, some “solutions” are:
  ◮ Model P(x_m | x_p) and average (correct, but hard)
  ◮ Replace data with its mean value (?)
  ◮ Look for similar (close) input patterns and use them to infer missing values (crude version of density model)
◮ Reference: Statistical Analysis with Missing Data, R. J. A. Little, D. B. Rubin, Wiley (1987)
◮ Outliers detected by clustering, or combined computer and human inspection
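As a rough illustration of the mean-value and nearest-neighbour “solutions” above, here is a minimal numpy sketch (not from the original slides; the function names and the choice k=2 are illustrative). It fills each missing entry either with its column mean or from the k rows closest on the jointly observed attributes, a crude stand-in for modelling P(x_m | x_p).

import numpy as np

def mean_impute(X):
    # Replace each NaN with the mean of the observed values in its column.
    X = X.copy()
    col_means = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]
    return X

def knn_impute(X, k=3):
    # Fill each missing entry from the k rows closest on the attributes
    # this row does have: a crude version of a density model.
    X_filled = mean_impute(X)            # provisional fill so distances are defined
    X_out = X.copy()
    for i, row in enumerate(X):
        miss = np.isnan(row)
        if not miss.any():
            continue
        d = np.sqrt(((X_filled[:, ~miss] - X_filled[i, ~miss]) ** 2).sum(axis=1))
        d[i] = np.inf                    # exclude the row itself
        neighbours = np.argsort(d)[:k]
        X_out[i, miss] = X_filled[neighbours][:, miss].mean(axis=0)
    return X_out

X = np.array([[2.0, 32.0], [np.nan, 100.0], [59.0, np.nan], [48.0, 20.0]])
print(knn_impute(X, k=2))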


Data Integration

Combines data from multiple sources into a coherent store

◮ Entity identification problem: identify real-world entities from multiple data sources, e.g. A.cust-id ≡ B.cust-num
◮ Detecting and resolving data value conflicts: for the same real-world entity, attribute values are different, e.g. measurement in different units
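A small pandas sketch of both problems (the table and column names here are invented for illustration): two sources keyed by cust_id and cust_num respectively, with the same attribute recorded in different units.

import pandas as pd

# Hypothetical sources: A keys customers by cust_id (height in cm),
# B keys the same customers by cust_num (height in inches).
a = pd.DataFrame({"cust_id": [1, 2], "height_cm": [170.0, 182.0]})
b = pd.DataFrame({"cust_num": [1, 2], "height_in": [66.9, 71.7]})

# Entity identification: A.cust_id and B.cust_num name the same real-world entity.
merged = pd.merge(a, b, left_on="cust_id", right_on="cust_num")

# Data value conflict: same attribute in different units; convert before comparing.
merged["height_in_as_cm"] = merged["height_in"] * 2.54
print(merged[["cust_id", "height_cm", "height_in_as_cm"]])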


Data Transformation

◮ Normalization, e.g. to zero mean, unit standard deviation:

  new data = (old data − mean) / (std deviation)

◮ or max-min normalization to [0, 1]:

  new data = (old data − min) / (max − min)

◮ Normalization useful for e.g. k nearest neighbours, or for neural networks
◮ New features constructed, e.g. with PCA or with hand-crafted features
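A minimal numpy sketch of the two normalizations above (the function names are illustrative, not from the slides):

import numpy as np

def zero_mean_unit_std(x):
    # new data = (old data - mean) / (std deviation)
    return (x - x.mean()) / x.std()

def min_max(x):
    # new data = (old data - min) / (max - min), mapping x into [0, 1]
    return (x - x.min()) / (x.max() - x.min())

x = np.array([2.0, 32.0, 100.0, 59.0, 48.0])
print(min_max(x))              # approximately [0.00, 0.31, 1.00, 0.58, 0.47]
print(zero_mean_unit_std(x))   # zero mean, unit standard deviation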


Data Reduction

◮ Feature selection: Select a minimum set of features x̃ from x so that:
  ◮ P(class | x̃) closely approximates P(class | x)
  ◮ The classification accuracy does not significantly decrease
◮ Data Compression (lossy)
  ◮ PCA, Canonical variates
◮ Sampling: choose a representative subset of the data
  ◮ Simple random sampling vs stratified sampling
◮ Hierarchical reduction: e.g. country-county-town
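As a sketch of lossy compression with PCA (not from the slides; a plain numpy/SVD version rather than any particular library routine), project the data onto the top few principal components and measure the reconstruction error:

import numpy as np

def pca_reduce(X, n_components):
    # Centre the data, take its SVD, and keep only the top principal directions.
    mean = X.mean(axis=0)
    Xc = X - mean
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = Xc @ Vt[:n_components].T                  # reduced representation
    X_approx = Z @ Vt[:n_components] + mean       # lossy reconstruction
    return Z, X_approx

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                    # stand-in for a 10-attribute data set
Z, X_approx = pca_reduce(X, n_components=3)
print(Z.shape)                                    # (200, 3)
print(np.mean((X - X_approx) ** 2))               # information lost by the compression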


Feature Selection

Usually done as part of supervised learning

◮ Stepwise strategies:
  ◮ (a) Forward selection: Start with no features. Add the one which is the best predictor. Then add a second one to maximize performance using the first feature and the new one; and so on until a stopping criterion is satisfied
  ◮ (b) Backward elimination: Start with all features; delete the one whose removal reduces performance least, recursively, until a stopping criterion is satisfied
◮ Forward selection is unable to anticipate interactions between features
◮ Backward elimination can suffer from problems of overfitting
◮ Both are heuristics to avoid considering all subsets of size k of the d features
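A minimal sketch of forward selection, strategy (a) above, assuming a scoring function supplied by the caller; the least-squares R² scorer and the stopping rule here are illustrative choices, and in practice the score would be measured on held-out data.

import numpy as np

def forward_selection(X, y, fit_score, max_features):
    # Greedily add the feature that most improves the score of a model
    # fitted on the currently selected set, until no candidate helps.
    selected, remaining = [], list(range(X.shape[1]))
    best_score = -np.inf
    while remaining and len(selected) < max_features:
        score, j = max((fit_score(X[:, selected + [j]], y), j) for j in remaining)
        if score <= best_score:          # stopping criterion: no improvement
            break
        best_score = score
        selected.append(j)
        remaining.remove(j)
    return selected

def lsq_r2(Xs, y):
    # Illustrative scorer: R^2 of an ordinary least-squares fit (with intercept).
    A = np.column_stack([Xs, np.ones(len(y))])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1.0 - resid.var() / y.var()

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
y = 2 * X[:, 0] - X[:, 3] + 0.1 * rng.normal(size=100)
print(forward_selection(X, y, lsq_r2, max_features=3))   # picks features 0 and 3 early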
