

SLIDE 1

Data Mining

Data preprocessing Hamid Beigy

Sharif University of Technology

Fall 1396

Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 1 / 15

SLIDE 2

Table of contents

1. Introduction
2. Data preprocessing
3. Data cleaning
4. Data integration
5. Data transformation
6. Reading


SLIDE 4

Data mining process

Real-world databases are highly susceptible to noisy, missing, and inconsistent data due to their typically huge size and their likely origin from multiple, heterogeneous sources. Low-quality data will lead to low-quality mining results. Data have quality if they satisfy the requirements of the intended use. Factors comprising data quality are

- Accuracy (does not contain errors)
- Completeness (all attributes of interest are filled in)
- Consistency
- Timeliness
- Believability
- Interpretability



SLIDE 6

Data preprocessing

How can the data be preprocessed in order to improve the quality of the data and, consequently, of the mining results? How can the data be preprocessed so as to improve the efficiency and ease of the mining process? There are several data preprocessing techniques.

- Data cleaning can be applied to remove noise and correct inconsistencies in the data.
- Data integration merges data from multiple sources into a coherent data store, such as a data warehouse.
- Data reduction can reduce data size by aggregating, eliminating redundant features, or clustering.
- Data transformations (e.g., normalization) may be applied, where data are scaled to fall within a smaller range, such as 0.0 to 1.0. This can improve the accuracy and efficiency of mining algorithms involving distance measurements.

These techniques are not mutually exclusive; they may work together.



SLIDE 8

Data cleaning

Data cleaning routines attempt to clean the data by

- Filling in missing values
- Smoothing out noisy data
- Identifying or removing outliers
- Correcting inconsistencies in the data (for example, the attribute for customer identification may be referred to as customer-id in one data store and cust-id in another)


SLIDE 9

Filling missing values

In real-world data, many tuples have no recorded value for several attributes. How can you go about filling in the missing values for an attribute?

- Ignore the tuple.
- Fill in the missing value manually.
- Use a global constant to fill in the missing value, such as "unknown" or −∞.
- Use a measure of central tendency for the attribute (e.g., the mean or median) to fill in the missing value.
- Use the attribute mean or median of all samples belonging to the same class as the given tuple.
- Use the most probable value to fill in the missing value (using regression, inference-based tools using a Bayesian formalism, or decision tree induction).
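As a sketch of the central-tendency option, the following Python snippet fills missing entries with the attribute median; the price values are hypothetical, chosen only for illustration:

```python
from statistics import median

def fill_missing_with_median(values):
    """Replace None entries with the median of the observed (non-missing) values."""
    observed = [v for v in values if v is not None]
    med = median(observed)
    return [med if v is None else v for v in values]

prices = [4, 8, None, 21, 21, None, 25]
print(fill_missing_with_median(prices))  # missing entries become the median, 21
```

The median is often preferred over the mean here because it is robust to the very outliers that noisy data tend to contain.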


SLIDE 10

Smooth out noisy data

What is noise? Noise is a random error or variance in a measured variable. Given a numeric attribute, how can we smooth out the data to remove the noise?

Binning: binning methods smooth a sorted data value by consulting its neighborhood.

- Data partitioning: equal-frequency versus equal-width
- Smoothing methods: smoothing by bin means versus bin medians and bin boundaries

Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34

- Partition into (equal-frequency) bins: Bin 1: 4, 8, 15; Bin 2: 21, 21, 24; Bin 3: 25, 28, 34
- Smoothing by bin means: Bin 1: 9, 9, 9; Bin 2: 22, 22, 22; Bin 3: 29, 29, 29
- Smoothing by bin boundaries: Bin 1: 4, 4, 15; Bin 2: 21, 21, 24; Bin 3: 25, 25, 34
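The worked example above can be reproduced with a short Python sketch. It assumes the data are already sorted and divide evenly into the requested number of bins:

```python
def equal_frequency_bins(sorted_values, n_bins):
    """Partition sorted data into bins holding an equal number of values."""
    size = len(sorted_values) // n_bins
    return [sorted_values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by whichever bin boundary (min or max) is closer."""
    return [[min((b[0], b[-1]), key=lambda bd: abs(v - bd)) for v in b]
            for b in bins]

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)
print(smooth_by_means(bins))       # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
print(smooth_by_boundaries(bins))  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```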


SLIDE 11

Smooth out noisy data (cont.)

How can we smooth out the data to remove the noise?

- Regression: data smoothing can also be done by regression.
- Outlier analysis: outliers may be detected by clustering; intuitively, values that fall outside of the set of clusters may be considered outliers.


SLIDE 12

Data cleaning as a process

Missing values, noise, and inconsistencies contribute to inaccurate data. The data cleaning process consists of discrepancy detection followed by data transformations that correct the discrepancies.

Discrepancy detection. Discrepancies can be caused by several factors, including

- Poorly designed data entry forms with many optional fields
- Human error in data entry
- Data decay (e.g., outdated addresses)
- Inconsistent data representation
- Inconsistent use of codes
- Errors in instrumentation devices

As a starting point, use any domain knowledge, for example the date format. Data should also be examined with respect to

- Unique rule: each attribute value must be different from all other values for that attribute.
- Consecutive rule: there are no missing values between the lowest and highest values for the attribute, and all values must also be unique.
- Null rule: specifies the use of blanks, question marks, and special characters, and how such values should be handled.

Some data inconsistencies may be corrected manually using external references (e.g., using a paper trace). Most errors, however, will require data transformations: define and apply a series of transformations to correct the given attribute.
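A minimal sketch of checking the unique rule and the null rule in Python; the customer-id values and the particular set of null markers are hypothetical:

```python
def violates_unique_rule(values):
    """True if any attribute value occurs more than once."""
    return len(values) != len(set(values))

def violates_null_rule(values, null_markers=("", "?", "N/A")):
    """True if any entry uses a disallowed placeholder for a missing value."""
    return any(v in null_markers for v in values)

customer_ids = ["c01", "c02", "c02", "?"]
print(violates_unique_rule(customer_ids))  # True: "c02" is duplicated
print(violates_null_rule(customer_ids))    # True: "?" used as a null marker
```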



SLIDE 14

Data integration

Data mining often requires data integration: the merging of data from multiple data stores. Careful integration can help reduce and avoid redundancies and inconsistencies in the resulting data set, which can improve the accuracy and speed of the subsequent data mining process. Issues in data integration:

- Entity identification: schema integration and object matching can be tricky.
- Redundancy and correlation analysis: an attribute (such as annual revenue) may be redundant if it can be derived from another attribute or set of attributes.
- Tuple duplication: two or more records may refer to the same object.
- Data value conflict detection and resolution: for the same real-world entity, attribute values from different sources may differ (e.g., telephone numbers). This may be due to differences in representation, scaling, or encoding.
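Redundancy between numeric attributes can be detected with correlation analysis. A minimal Python sketch computing the Pearson correlation coefficient, with hypothetical revenue figures where the annual column is exactly twelve times the monthly column:

```python
def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two numeric attributes."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

monthly_revenue = [10, 20, 30, 40]
annual_revenue = [120, 240, 360, 480]  # derivable from monthly_revenue
print(pearson_correlation(monthly_revenue, annual_revenue))  # 1.0
```

A coefficient near ±1 flags one of the attributes as a candidate for removal during integration.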


SLIDE 15

Data reduction

The given dataset may be huge and data analysis may take a long time. Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data.

[Figure: data reduction, shrinking a data set of 2,000 transactions (T1, ..., T2000) over 126 attributes (A1, ..., A126) to a subset of transactions (T1, T4, ..., T1456) over a subset of attributes (A1, A3, ..., A115).]


SLIDE 16

Data reduction (cont.)

Data reduction strategies:

- Dimensionality reduction: the process of reducing the number of attributes under consideration.
  - Feature extraction (PCA, MDS, ...)
  - Feature selection
- Numerosity reduction: these techniques replace the original data volume by an alternative, smaller form of data representation.
  - Linear regression
  - Histograms
  - Clustering
  - Sampling
  - Data cube aggregation
- Data compression: transformations are applied so as to obtain a reduced or compressed representation of the original data.
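As a sketch of one numerosity-reduction technique, simple random sampling without replacement can be done with the standard library; the transaction ids below are hypothetical:

```python
import random

def srswor(data, n, seed=0):
    """Simple random sample without replacement: keep n of the tuples.
    The fixed seed makes the reduction reproducible."""
    rng = random.Random(seed)
    return rng.sample(data, n)

transactions = list(range(2000))      # stand-in for 2,000 transaction ids
sample = srswor(transactions, 200)    # reduce the volume by a factor of 10
print(len(sample))  # 200
```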



SLIDE 18

Data transformation

In this step, the data are transformed or consolidated so that the resulting mining process may be more efficient, and the patterns found may be easier to understand. Data transformation strategies

- Smoothing (binning, regression, clustering)
- Attribute construction (new attributes are constructed to help the mining process)
- Aggregation
- Normalization (min-max normalization, ...)
- Discretization (binning, histogram analysis, decision trees, clustering)
- Concept hierarchy generation for nominal data
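Min-max normalization, listed above, maps values linearly onto a new range. A minimal Python sketch, with hypothetical income values:

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values into [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    span = old_max - old_min
    return [(v - old_min) / span * (new_max - new_min) + new_min
            for v in values]

incomes = [12000, 30000, 54000, 98000]
print(min_max_normalize(incomes))  # 12000 maps to 0.0, 98000 maps to 1.0
```

Note that the rescaling divides by the observed range, so this sketch assumes the attribute is not constant.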


SLIDE 19

Concept hierarchy generation

Attributes such as street can be generalized to higher-level concepts, like city or country. There are four methods for the generation of concept hierarchies for nominal data:

- Specification of a partial ordering of attributes explicitly at the schema level by users or experts. A user or expert can easily define a concept hierarchy by specifying a partial or total ordering of the attributes at the schema level.
- Specification of a portion of a hierarchy by explicit data grouping. In a large database, it is unrealistic to define an entire concept hierarchy by explicit value enumeration.
- Specification of a set of attributes, but not of their partial ordering. A user may specify a set of attributes forming a concept hierarchy but omit to explicitly state their partial ordering. The system can then try to automatically generate the attribute ordering so as to construct a meaningful concept hierarchy.
- Specification of only a partial set of attributes. Sometimes a user can be careless when defining a hierarchy, or have only a vague idea about what should be included in a hierarchy.

[Figure: a concept hierarchy generated automatically from distinct-value counts: country (15 distinct values) > province_or_state (365) > city (3,567) > street (674,339).]
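The heuristic behind automatic ordering can be sketched in Python: the attribute with the fewest distinct values is placed at the top of the hierarchy (the counts are those from the figure above):

```python
def hierarchy_from_distinct_counts(distinct_counts):
    """Order attributes by ascending distinct-value count: the attribute
    with the fewest distinct values becomes the top (most general) level."""
    return sorted(distinct_counts, key=distinct_counts.get)

counts = {"street": 674339, "country": 15,
          "city": 3567, "province_or_state": 365}
print(hierarchy_from_distinct_counts(counts))
# ['country', 'province_or_state', 'city', 'street']
```

This is only a heuristic; for example, a time attribute with 24 distinct hours would wrongly outrank one with 365 distinct days, so the generated ordering may need manual review.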


SLIDE 21

Reading

Read chapter 3 of the following book:

- J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, 3rd ed., Morgan Kaufmann, 2012.
