SLIDE 1

Data Preprocessing

Week 2

SLIDE 2

Topics

  • Data Types
  • Data Repositories
  • Data Preprocessing
  • Present homework assignment #1
SLIDE 3

Team Homework Assignment #2

  • Read pp. 227 – 240, pp. 250 – 250, and pp. 259 – 263 of the textbook.

  • Do Examples 5.3, 5.4, 5.8, 5.9, and Exercise 5.5.


  • Write an R program to verify your answer for Exercise 5.5.

Refer to pp. 453 – 458 of the lab book.

  • Explore frequent pattern mining tools and play with them for Exercise 5.5.

  • Prepare for the results of the homework assignment.

  • Due date

– beginning of the lecture on Friday February 11th.

SLIDE 4

Team Homework Assignment #3

  • Prepare for the one‐page description of your group project topic

  • Prepare for presentation using slides

  • Due date

– beginning of the lecture on Friday February 11th.

SLIDE 5

Figure 1.4 Data Mining as a step in the process of knowledge discovery

SLIDE 6

Why Data Preprocessing Is Important?

  • Welcome to the Real World!
  • No quality data, no quality mining results!
  • Preprocessing is one of the most critical steps in a data mining

process

SLIDE 7

Major Tasks in Data Preprocessing


Figure 2.1 Forms of data preprocessing

SLIDE 8

Why Data Preprocessing Is Beneficial to Data Mining?

  • Less data

– data mining methods can learn faster

  • Higher accuracy

– data mining methods can generalize better

  • Simple results

– they are easier to understand

  • Fewer attributes

– For the next round of data collection, savings can be made by removing redundant and irrelevant features

SLIDE 9

Data Cleaning

SLIDE 10

Remarks on Data Cleaning

  • “Data cleaning is one of the biggest problems in data

warehousing” ‐‐ Ralph Kimball

  • “Data cleaning is the number one problem in data

warehousing” ‐‐ DCI survey

SLIDE 11

Why Data Is “Dirty”?

  • Incomplete, noisy, and inconsistent data are commonplace properties of large real‐world databases …. (p. 48)

  • There are many possible reasons for noisy data …. (p. 48)

SLIDE 12

Types of Dirty Data and Cleaning Methods

  • Missing values

– Fill in missing values

  • Noisy data (incorrect values)

– Identify outliers and smooth out noisy data

SLIDE 13

Methods for Missing Values (1)

  • Ignore the tuple
  • Fill in the missing value manually

  • Use a global constant to fill in the missing value

SLIDE 14

Methods for Missing Values (2)

  • Use the attribute mean to fill in the missing value
  • Use the attribute mean for all samples belonging to the same

class as the given tuple

  • Use the most probable value to fill in the missing value
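As a minimal R sketch of the fill-in methods on these two slides (the data frame, its values, and the constant -1 are hypothetical, not from the textbook):

    # Hypothetical attribute with missing values, plus a class label
    df <- data.frame(income = c(30, NA, 50, 70, NA, 90),
                     class  = c("A", "A", "A", "B", "B", "B"))

    # Use a global constant to fill in the missing value
    df$income_const <- ifelse(is.na(df$income), -1, df$income)

    # Use the attribute mean to fill in the missing value
    df$income_mean <- ifelse(is.na(df$income),
                             mean(df$income, na.rm = TRUE), df$income)

    # Use the attribute mean of the samples in the same class
    class_means <- ave(df$income, df$class,
                       FUN = function(x) mean(x, na.rm = TRUE))
    df$income_class <- ifelse(is.na(df$income), class_means, df$income)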

SLIDE 15

Methods for Noisy Data

  • Binning
  • Regression
  • Clustering

SLIDE 16

Binning
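The original slide's figure is not reproduced here. As a sketch of the technique, the R fragment below partitions a sorted attribute into equal-depth bins and smooths by bin means; the price values mirror the textbook's classic binning example, and the choice of three bins of size three is an assumption:

    # Sorted values of the attribute price, split into 3 equal-depth bins
    price  <- c(4, 8, 15, 21, 21, 24, 25, 28, 34)
    bin_id <- rep(1:3, each = 3)

    # Smoothing by bin means: each value is replaced by its bin's mean
    smoothed <- ave(price, bin_id, FUN = mean)
    # bins (4,8,15), (21,21,24), (25,28,34) -> 9, 22, 29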

SLIDE 17

Regression

SLIDE 18

Clustering

Figure 2.12 A 2‐D plot of customer data with respect to customer locations in a city, showing three data clusters. Each cluster centroid is marked with a “+”, representing the average point in space for that cluster. Outliers may be detected as values that fall outside of the sets of clusters.

SLIDE 19

Data Integration

SLIDE 20

Data Integration

  • Schema integration and object matching

– Entity identification problem

  • Redundant data (between attributes) occur often when integrating multiple databases
– Redundant attributes may be detected by correlation analysis and the chi‐square method

SLIDE 21

Schema Integration and Object Matching

  • custom_id and cust_number

– Schema conflict

  • “H” and ”S”, and 1 and 2 for pay_type in one database

– Value conflict

  • Solutions

– metadata (data about data)

SLIDE 22

Detecting Redundancy (1)

  • If an attribute can be “derived” from another attribute or a

set of attributes, it may be redundant

SLIDE 23

Detecting Redundancy (2)

  • Some redundancies can be detected by correlation analysis

– Correlation coefficient for numeric data
– Chi‐square test for categorical data

  • These can also be used for data reduction

SLIDE 24

Chi-square Test

  • For categorical (discrete) data, a correlation relationship between two attributes, A and B, can be discovered by a χ2 test

  • Given the degrees of freedom, the value of χ2 is used to decide correlation based on a significance level

SLIDE 25

Chi-square Test for Categorical Data

$$\chi^2 = \sum \frac{(Observed - Expected)^2}{Expected} = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}$$

$$e_{ij} = \frac{count(A = a_i) \times count(B = b_j)}{N}$$

  • p. 68

The larger the χ2 value, the more likely the variables are related.

SLIDE 26

Chi-square Test

              male    female    Total
fiction       250     200       450
non_fiction   50      1000      1050
Total         300     1200      1500

Table 2.2 A 2 X 2 contingency table for the data of Example 2.1.

Are gender and preferred_reading correlated?

  • The χ2 statistic tests the hypothesis that gender and preferred_reading are independent. The test is based on a significance level, with (r ‐ 1) x (c ‐ 1) degrees of freedom.
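The same test is easy to reproduce in R; in this sketch, correct = FALSE disables the Yates continuity correction so the statistic matches the hand calculation, χ2 = (250−90)²/90 + (200−360)²/360 + (50−210)²/210 + (1000−840)²/840 ≈ 507.93:

    # 2 x 2 contingency table from Table 2.2
    tbl <- matrix(c(250,  200,
                     50, 1000),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(preferred_reading = c("fiction", "non_fiction"),
                                  gender = c("male", "female")))

    result <- chisq.test(tbl, correct = FALSE)  # Pearson's chi-square test
    result$statistic  # X-squared = 507.93 on (2-1)*(2-1) = 1 degree of freedom
    result$expected   # e.g., e11 = 450 * 300 / 1500 = 90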

SLIDE 27

Table of Percentage Points of the χ2 Distribution


SLIDE 28

Correlation Coefficient

$$r_{A,B} = \frac{\sum_{i=1}^{N} (a_i - \bar{A})(b_i - \bar{B})}{N \sigma_A \sigma_B} = \frac{\sum_{i=1}^{N} a_i b_i - N \bar{A}\bar{B}}{N \sigma_A \sigma_B}$$

$$-1 \leq r_{A,B} \leq +1$$

  • p. 68
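A short R sketch of this formula (the two attribute vectors are hypothetical); the built-in cor() agrees with the population-σ form above because the N versus N−1 factors cancel:

    A <- c(2, 4, 6, 8, 10)  # hypothetical numeric attribute
    B <- c(1, 3, 4, 9, 12)  # hypothetical numeric attribute
    N <- length(A)

    sigma_A <- sqrt(mean((A - mean(A))^2))  # population standard deviation
    sigma_B <- sqrt(mean((B - mean(B))^2))
    r_AB <- sum((A - mean(A)) * (B - mean(B))) / (N * sigma_A * sigma_B)

    all.equal(r_AB, cor(A, B))  # TRUE: both give the Pearson correlation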

SLIDE 29

http://upload.wikimedia.org/wikipedia/commons/0/02/Correlation_examples.png

SLIDE 30

Data Transformation

SLIDE 31

Data Transformation/Consolidation

  • Smoothing √
  • Aggregation
  • Generalization
  • Normalization √
  • Attribute construction √

SLIDE 32

Smoothing

  • Remove noise from the data
  • Binning, regression, and clustering

SLIDE 33

Data Normalization

  • Min-max normalization

$$v' = \frac{v - min_A}{max_A - min_A}(new\_max_A - new\_min_A) + new\_min_A$$

  • z-score normalization

$$v' = \frac{v - \mu_A}{\sigma_A}$$

  • Normalization by decimal scaling

$$v' = \frac{v}{10^j}$$

where j is the smallest integer such that Max(|ν′|) < 1

SLIDE 34

Data Normalization

  • Suppose that the minimum and maximum values for attribute income are $12,000 and $98,000, respectively. We would like to map income to the range [0.0, 1.0]. Do min-max normalization, z-score normalization, and decimal scaling for the attribute income
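A worked sketch of this exercise in R. Only the minimum ($12,000) and maximum ($98,000) come from the slide; the other income values, and the use of the sample mean and standard deviation for the z-score, are assumptions:

    income <- c(12000, 35000, 73600, 98000)

    # Min-max normalization to [new_min, new_max] = [0.0, 1.0]
    v_minmax <- (income - 12000) / (98000 - 12000) * (1.0 - 0.0) + 0.0
    # e.g., 73600 maps to 61600 / 86000 = 0.716

    # z-score normalization
    v_z <- (income - mean(income)) / sd(income)

    # Decimal scaling: j = 5 is the smallest integer with Max(|v'|) < 1
    v_dec <- income / 10^5  # 98000 -> 0.98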

SLIDE 35

Attribute Construction

  • New attributes are constructed from given attributes and added in order to help improve accuracy and understanding of structure in high‐dimensional data
  • Example
– Add the attribute area based on the attributes height and width
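In R this is a one-liner on a data frame (the rectangle data are hypothetical):

    rect <- data.frame(height = c(2, 3, 5), width = c(4, 4, 6))
    rect$area <- rect$height * rect$width  # constructed attribute: 8, 12, 30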

SLIDE 36

Data Reduction

SLIDE 37

Data Reduction

  • Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data

SLIDE 38

Data Reduction

  • (Data Cube) Aggregation
  • Attribute (Subset) Selection
  • Dimensionality Reduction
  • Numerosity Reduction
  • Data Discretization
  • Concept Hierarchy Generation

SLIDE 39

“The Curse of Dimensionality” (1)

  • Size

– The size of a data set yielding the same density of data points in an n‐dimensional space increases exponentially with the number of dimensions

  • Radius

– A larger radius is needed to enclose a fraction of the data points in a high‐dimensional space

SLIDE 40

“The Curse of Dimensionality” (2)

  • Distance

– Almost every point is closer to an edge than to another sample point in a high‐dimensional space

  • Outlier

– Almost every point is an outlier in a high‐dimensional space

SLIDE 41

Data Cube Aggregation

  • Summarize (aggregate) data based on dimensions
  • The resulting data set is smaller in volume, without loss of information necessary for the analysis task

  • Concept hierarchies may exist for each attribute, allowing the

analysis of data at multiple levels of abstraction

SLIDE 42

Data Aggregation

Figure 2.13 Sales data for a given branch of AllElectronics for the years 2002 to 2004. On the left, the sales are shown per quarter. On the right, the data are aggregated to provide the annual sales.
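A sketch of this aggregation in R; only the year range comes from the figure, and the quarterly amounts are made up:

    # Quarterly sales per year (left side of Figure 2.13, amounts hypothetical)
    sales <- data.frame(year    = rep(2002:2004, each = 4),
                        quarter = rep(1:4, times = 3),
                        amount  = c(224, 408, 350, 586,
                                    301, 352, 400, 512,
                                    310, 290, 450, 520))

    # Aggregate to annual sales (right side of the figure)
    annual <- aggregate(amount ~ year, data = sales, FUN = sum)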

SLIDE 43

Data Cube

  • Provide fast access to pre‐computed, summarized data, thereby benefiting on‐line analytical processing as well as data mining

SLIDE 44

Data Cube - Example

Figure 2.14 A data cube for sales at AllElectronics.

SLIDE 45

Attribute Subset Selection (1)

  • Attribute selection can help in the phases of the data mining (knowledge discovery) process
– By attribute selection,
  • we can improve data mining performance (speed of learning, predictive accuracy, or simplicity of rules)
  • we can visualize the data for the model selected
  • we reduce dimensionality and remove noise

SLIDE 46

Attribute Subset Selection (2)

  • Attribute (Feature) selection is a search problem

– Search directions

  • (Sequential) Forward selection
  • (Sequential) Backward selection (elimination)
  • Bidirectional selection

  • Decision tree algorithm (induction)

SLIDE 47

Attribute Subset Selection (3)

  • Attribute (Feature) selection is a search problem

– Search strategies

  • Exhaustive search
  • Heuristic search

– Selection criteria

  • Statistical significance
  • Information gain
  • etc.

SLIDE 48

Attribute Subset Selection (4)

Figure 2.15 Greedy (heuristic) methods for attribute subset selection

SLIDE 49

Data Discretization

  • Reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals
  • Interval labels can then be used to replace actual data values
  • Split (top‐down) vs. merge (bottom‐up)
  • Discretization can be performed recursively on an attribute
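A minimal sketch of interval-based discretization in R; the age values, the choice of three equal-width intervals, and the labels are all assumptions:

    age <- c(13, 15, 19, 22, 25, 33, 35, 40, 45, 52, 61, 70)

    # Divide the range into 3 equal-width intervals and replace the
    # actual values with interval labels
    age_disc <- cut(age, breaks = 3, labels = c("young", "middle_aged", "senior"))
    table(age_disc)  # counts per interval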

SLIDE 50

Why Discretization is Used?

  • Reduce data size.
  • Transform quantitative data to qualitative data.

SLIDE 51

Interval Merge by χ2 Analysis

  • Merging‐based (bottom‐up)
  • Merge: Find the best neighboring intervals and merge them to

form larger intervals recursively

  • ChiMerge [Kerber AAAI 1992, See also Liu et al. DMKD 2002]

SLIDE 52
  • Initially, each distinct value of a numerical attribute A is considered to be one interval

  • χ2 tests are performed for every pair of adjacent intervals
  • Adjacent intervals with the least χ2 values are merged together, since low χ2 values for a pair indicate similar class distributions

  • This merge process proceeds recursively until a predefined stopping criterion is met

SLIDE 53

Entropy-Based Discretization

  • The goal of this algorithm is to find the split with the

maximum information gain.

  • The boundary that minimizes the entropy over all

possible boundaries is selected

  • The process is recursively applied to partitions obtained until some stopping criterion is met
  • Such a boundary may reduce data size and improve classification accuracy

SLIDE 54

What is Entropy?

  • The entropy is a measure of the uncertainty associated with a random variable
  • As uncertainty and/or randomness increases for a result set, so does the entropy
  • Values range from 0 – 1 to represent the entropy of information

SLIDE 55

Entropy Example

SLIDE 56

Entropy Example

SLIDE 57

Entropy Example (cont’d)

SLIDE 58

Calculating Entropy

For m classes:

$$Entropy(S) = -\sum_{i=1}^{m} p_i \log_2 p_i$$

For 2 classes:

$$Entropy(S) = -p_1 \log_2 p_1 - p_2 \log_2 p_2$$

  • Calculated based on the class distribution of the samples in set S.
  • p_i is the probability of class i in S
  • m is the number of classes (class values)
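A direct R implementation of this formula (the helper name entropy and the sample label vectors are assumptions):

    # Entropy(S) = -sum_i p_i * log2(p_i), from the class distribution of S
    entropy <- function(labels) {
      p <- table(labels) / length(labels)  # class probabilities p_i
      -sum(p * log2(p))
    }

    entropy(c("yes", "yes", "no", "no"))  # 1.0: two equally likely classes
    entropy(rep("yes", 4))                # 0.0: a single class, no uncertainty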

SLIDE 59

Calculating Entropy From Split

  • Entropy of subsets S1 and S2 are calculated.

  • The calculations are weighted by their probability of being in

set S and summed.

  • In formula below,

– S is the set
– T is the value used to split S into S1 and S2

$$E(S, T) = \frac{|S_1|}{|S|} Entropy(S_1) + \frac{|S_2|}{|S|} Entropy(S_2)$$

SLIDE 60

Calculating Information Gain

  • Information Gain = Difference in entropy between original set (S) and weighted split (S1 + S2)

$$Gain(S, T) = Entropy(S) - E(S, T)$$

$$Gain(S, 56) = 0.991076 - 0.766289 = 0.224788$$

compare to

$$Gain(S, 46) = 0.091091$$
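Putting the last two slides together, a sketch in R that reuses the entropy() helper above; the convention that S1 takes the values with x <= T is an assumption:

    # E(S, T): weighted entropy after splitting S at threshold T, and
    # Gain(S, T) = Entropy(S) - E(S, T)
    info_gain <- function(x, labels, thresh) {
      s1 <- labels[x <= thresh]  # subset S1
      s2 <- labels[x >  thresh]  # subset S2
      e_split <- length(s1) / length(labels) * entropy(s1) +
                 length(s2) / length(labels) * entropy(s2)
      entropy(labels) - e_split
    }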

SLIDE 61

Numeric Concept Hierarchy

  • A concept hierarchy for a given numerical attribute

defines a discretization of the attribute

  • Recursively reduce the data by collecting and

replacing low level concepts by higher level concepts

SLIDE 62

A Concept Hierarchy for the Attribute Price

Figure 2.22 A concept hierarchy for the attribute price.

SLIDE 63

Segmentation by Natural Partitioning

  • A simple 3‐4‐5 rule can be used to segment numeric data into relatively uniform, “natural” intervals
– If an interval covers 3, 6, 7 or 9 distinct values at the most significant digit, partition the range into 3 equi‐width intervals
– If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 intervals
– If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 intervals

SLIDE 64

Figure 2.23 Automatic generation of a concept hierarchy for profit based on the 3-4-5 rule.

SLIDE 65

Concept Hierarchy Generation for Categorical Data

  • Specification of a partial ordering of attributes explicitly at the schema level by users or experts
  • Specification of a portion of a hierarchy by explicit data grouping
  • Specification of a set of attributes, but not of their partial ordering

SLIDE 66

Automatic Concept Hierarchy Generation

country: 15 distinct values
province_or_state: 365 distinct values
city: 3,567 distinct values
street: 674,339 distinct values

Based on the number of distinct values per attribute, p. 95
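A sketch of this heuristic in R (the location data frame is hypothetical): count the distinct values per attribute and sort, so the attribute with the fewest distinct values (country) becomes the top level of the generated hierarchy:

    loc <- data.frame(country = c("US", "US", "US", "CA"),
                      state   = c("NY", "NY", "CA", "ON"),
                      city    = c("Buffalo", "NYC", "Oakland", "Toronto"))

    # Fewest distinct values -> highest level of the concept hierarchy
    sort(sapply(loc, function(col) length(unique(col))))
    # country: 2, state: 3, city: 4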

SLIDE 67

Data preprocessing
  • Data cleaning
– Missing values: use the most probable value to fill in the missing value (and five other methods)
– Noisy data: binning; regression; clustering
  • Data integration
– Entity ID problem: metadata
– Redundancy: correlation analysis (correlation coefficient, chi‐square test)
  • Data transformation
– Smoothing: data cleaning
– Aggregation: data reduction
– Generalization: data reduction
– Normalization: min‐max; z‐score; decimal scaling
– Attribute construction: data reduction
  • Data reduction
– Data cube aggregation: data cubes store multidimensional aggregated information
– Attribute subset selection: stepwise forward selection; stepwise backward selection; combination; decision tree induction
– Dimensionality reduction: discrete wavelet transforms (DWT); principal components analysis (PCA)
– Numerosity reduction: regression and log‐linear models; histograms; clustering; sampling
– Data discretization: binning; histogram analysis; entropy‐based discretization; interval merging by chi‐square analysis; cluster analysis; intuitive partitioning
– Concept hierarchy: concept hierarchy generation