SLIDE 1

Data Preprocessing

Week 2

SLIDE 2

Topics

  • Data Types
  • Data Repositories
  • Data Preprocessing
  • Present homework assignment #1
SLIDE 3

Team Homework Assignment #2

  • Read pp. 227 – 240, pp. 250 – 250, and pp. 259 – 263 of the textbook.

  • Do Examples 5.3, 5.4, 5.8, 5.9, and Exercise 5.5.


  • Write an R program to verify your answer for Exercise 5.5.

Refer to pp. 453 – 458 of the lab book.

  • Explore frequent pattern mining tools and play with them for Exercise 5.5.

  • Prepare for the results of the homework assignment.

  • Due date

– beginning of the lecture on Friday February 11th.

SLIDE 4

Team Homework Assignment #3

  • Prepare for the one‐page description of your group project topic

  • Prepare for presentation using slides

  • Due date

– beginning of the lecture on Friday February 11th.

SLIDE 5

Figure 1.4 Data Mining as a step in the process of knowledge discovery

SLIDE 6

Why Data Preprocessing Is Important?

  • Welcome to the Real World!
  • No quality data, no quality mining results!
  • Preprocessing is one of the most critical steps in a data mining

process

SLIDE 7

Major Tasks in Data Preprocessing


Figure 2.1 Forms of data preprocessing

SLIDE 8

Why Data Preprocessing Is Beneficial to Data Mining?

  • Less data

– data mining methods can learn faster

  • Higher accuracy

– data mining methods can generalize better

  • Simple results

– they are easier to understand

  • Fewer attributes

– For the next round of data collection, savings can be made by removing redundant and irrelevant features

SLIDE 9

Data Cleaning

SLIDE 10

Remarks on Data Cleaning

  • “Data cleaning is one of the biggest problems in data

warehousing” ‐‐ Ralph Kimball

  • “Data cleaning is the number one problem in data

warehousing” ‐‐ DCI survey

SLIDE 11

Why Data Is “Dirty”?

  • Incomplete, noisy, and inconsistent data are commonplace properties of large real‐world databases …. (p. 48)

  • There are many possible reasons for noisy data …. (p. 48)

SLIDE 12

Types of Dirty Data and Cleaning Methods

  • Missing values

– Fill in missing values

  • Noisy data (incorrect values)

– Identify outliers and smooth out noisy data

SLIDE 13

Methods for Missing Values (1)

  • Ignore the tuple
  • Fill in the missing value manually

  • Use a global constant to fill in the missing value

SLIDE 14

Methods for Missing Values (2)

  • Use the attribute mean to fill in the missing value
  • Use the attribute mean for all samples belonging to the same

class as the given tuple

  • Use the most probable value to fill in the missing value
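As a minimal R sketch of the fill-in methods on these two slides (the data frame, its values, and the constant -1 are hypothetical, not from the textbook):

    # Hypothetical attribute with missing values, plus a class label
    df <- data.frame(income = c(30, NA, 50, 70, NA, 90),
                     class  = c("A", "A", "A", "B", "B", "B"))

    # Use a global constant to fill in the missing value
    df$income_const <- ifelse(is.na(df$income), -1, df$income)

    # Use the attribute mean to fill in the missing value
    df$income_mean <- ifelse(is.na(df$income),
                             mean(df$income, na.rm = TRUE), df$income)

    # Use the attribute mean of the samples in the same class
    class_means <- ave(df$income, df$class,
                       FUN = function(x) mean(x, na.rm = TRUE))
    df$income_class <- ifelse(is.na(df$income), class_means, df$income)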

SLIDE 15

Methods for Noisy Data

  • Binning
  • Regression
  • Clustering

SLIDE 16

Binning
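The original slide's figure is not reproduced here. As a sketch of the technique, the R fragment below partitions a sorted attribute into equal-depth bins and smooths by bin means; the price values mirror the textbook's classic binning example, and the choice of three bins of size three is an assumption:

    # Sorted values of the attribute price, split into 3 equal-depth bins
    price  <- c(4, 8, 15, 21, 21, 24, 25, 28, 34)
    bin_id <- rep(1:3, each = 3)

    # Smoothing by bin means: each value is replaced by its bin's mean
    smoothed <- ave(price, bin_id, FUN = mean)
    # bins (4,8,15), (21,21,24), (25,28,34) -> 9, 22, 29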

SLIDE 17

Regression

SLIDE 18

Clustering

Figure 2.12 A 2‐D plot of customer data with respect to customer locations in a city, showing three data clusters. Each cluster centroid is marked with a “+”, representing the average point in space for that cluster. Outliers may be detected as values that fall outside of the sets of clusters.

SLIDE 19

Data Integration

SLIDE 20

Data Integration

  • Schema integration and object matching

– Entity identification problem

  • Redundant data (between attributes) occur often when integrating multiple databases
– Redundant attributes may be detected by correlation analysis and the chi‐square method

SLIDE 21

Schema Integration and Object Matching

  • custom_id and cust_number

– Schema conflict

  • “H” and ”S”, and 1 and 2 for pay_type in one database

– Value conflict

  • Solutions

– metadata (data about data)

SLIDE 22

Detecting Redundancy (1)

  • If an attribute can be “derived” from another attribute or a

set of attributes, it may be redundant

SLIDE 23

Detecting Redundancy (2)

  • Some redundancies can be detected by correlation analysis

– Correlation coefficient for numeric data
– Chi‐square test for categorical data

  • These can also be used for data reduction

SLIDE 24

Chi-square Test

  • For categorical (discrete) data, a correlation relationship between two attributes, A and B, can be discovered by a χ2 test

  • Given the degrees of freedom, the value of χ2 is used to decide correlation based on a significance level

SLIDE 25

Chi-square Test for Categorical Data

$$\chi^2 = \sum \frac{(Observed - Expected)^2}{Expected} = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}$$

$$e_{ij} = \frac{count(A = a_i) \times count(B = b_j)}{N}$$

  • p. 68

The larger the χ2 value, the more likely the variables are related.

SLIDE 26

Chi-square Test

              male    female    Total
fiction       250     200       450
non_fiction   50      1000      1050
Total         300     1200      1500

Table 2.2 A 2 X 2 contingency table for the data of Example 2.1.

Are gender and preferred_reading correlated?

  • The χ2 statistic tests the hypothesis that gender and preferred_reading are independent. The test is based on a significance level, with (r ‐ 1) x (c ‐ 1) degrees of freedom.
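The same test is easy to reproduce in R; in this sketch, correct = FALSE disables the Yates continuity correction so the statistic matches the hand calculation, χ2 = (250−90)²/90 + (200−360)²/360 + (50−210)²/210 + (1000−840)²/840 ≈ 507.93:

    # 2 x 2 contingency table from Table 2.2
    tbl <- matrix(c(250,  200,
                     50, 1000),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(preferred_reading = c("fiction", "non_fiction"),
                                  gender = c("male", "female")))

    result <- chisq.test(tbl, correct = FALSE)  # Pearson's chi-square test
    result$statistic  # X-squared = 507.93 on (2-1)*(2-1) = 1 degree of freedom
    result$expected   # e.g., e11 = 450 * 300 / 1500 = 90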

SLIDE 27

Table of Percentage Points of the χ2 Distribution


SLIDE 28

Correlation Coefficient

$$r_{A,B} = \frac{\sum_{i=1}^{N} (a_i - \bar{A})(b_i - \bar{B})}{N \sigma_A \sigma_B} = \frac{\sum_{i=1}^{N} a_i b_i - N \bar{A}\bar{B}}{N \sigma_A \sigma_B}$$

$$-1 \leq r_{A,B} \leq +1$$

  • p. 68
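A short R sketch of this formula (the two attribute vectors are hypothetical); the built-in cor() agrees with the population-σ form above because the N versus N−1 factors cancel:

    A <- c(2, 4, 6, 8, 10)  # hypothetical numeric attribute
    B <- c(1, 3, 4, 9, 12)  # hypothetical numeric attribute
    N <- length(A)

    sigma_A <- sqrt(mean((A - mean(A))^2))  # population standard deviation
    sigma_B <- sqrt(mean((B - mean(B))^2))
    r_AB <- sum((A - mean(A)) * (B - mean(B))) / (N * sigma_A * sigma_B)

    all.equal(r_AB, cor(A, B))  # TRUE: both give the Pearson correlation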

SLIDE 29

http://upload.wikimedia.org/wikipedia/commons/0/02/Correlation_examples.png

SLIDE 30

Data Transformation

SLIDE 31

Data Transformation/Consolidation

  • Smoothing √
  • Aggregation
  • Generalization
  • Normalization √
  • Attribute construction √

SLIDE 32

Smoothing

  • Remove noise from the data
  • Binning, regression, and clustering

SLIDE 33

Data Normalization

  • Min-max normalization

$$v' = \frac{v - min_A}{max_A - min_A}(new\_max_A - new\_min_A) + new\_min_A$$

  • z-score normalization

$$v' = \frac{v - \mu_A}{\sigma_A}$$

  • Normalization by decimal scaling

$$v' = \frac{v}{10^j}$$

where j is the smallest integer such that Max(|ν′|) < 1

SLIDE 34

Data Normalization

  • Suppose that the minimum and maximum values for attribute income are $12,000 and $98,000, respectively. We would like to map income to the range [0.0, 1.0]. Do min-max normalization, z-score normalization, and decimal scaling for the attribute income
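A worked sketch of this exercise in R. Only the minimum ($12,000) and maximum ($98,000) come from the slide; the other income values, and the use of the sample mean and standard deviation for the z-score, are assumptions:

    income <- c(12000, 35000, 73600, 98000)

    # Min-max normalization to [new_min, new_max] = [0.0, 1.0]
    v_minmax <- (income - 12000) / (98000 - 12000) * (1.0 - 0.0) + 0.0
    # e.g., 73600 maps to 61600 / 86000 = 0.716

    # z-score normalization
    v_z <- (income - mean(income)) / sd(income)

    # Decimal scaling: j = 5 is the smallest integer with Max(|v'|) < 1
    v_dec <- income / 10^5  # 98000 -> 0.98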

SLIDE 35

Attribute Construction

  • New attributes are constructed from given attributes and added in order to help improve accuracy and understanding of structure in high‐dimensional data
  • Example
– Add the attribute area based on the attributes height and width
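In R this is a one-liner on a data frame (the rectangle data are hypothetical):

    rect <- data.frame(height = c(2, 3, 5), width = c(4, 4, 6))
    rect$area <- rect$height * rect$width  # constructed attribute: 8, 12, 30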

SLIDE 36

Data Reduction

SLIDE 37

Data Reduction

  • Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data

SLIDE 38

Data Reduction

  • (Data Cube) Aggregation
  • Attribute (Subset) Selection
  • Dimensionality Reduction
  • Numerosity Reduction
  • Data Discretization
  • Concept Hierarchy Generation

SLIDE 39

“The Curse of Dimensionality” (1)

  • Size

– The size of a data set yielding the same density of data points in an n‐dimensional space increases exponentially with the number of dimensions

  • Radius

– A larger radius is needed to enclose a fraction of the data points in a high‐dimensional space

SLIDE 40

“The Curse of Dimensionality” (2)

  • Distance

– Almost every point is closer to an edge than to another sample point in a high‐dimensional space

  • Outlier

– Almost every point is an outlier in a high‐dimensional space

SLIDE 41

Data Cube Aggregation

  • Summarize (aggregate) data based on dimensions
  • The resulting data set is smaller in volume, without loss of information necessary for the analysis task

  • Concept hierarchies may exist for each attribute, allowing the

analysis of data at multiple levels of abstraction

SLIDE 42

Data Aggregation

Figure 2.13 Sales data for a given branch of AllElectronics for the years 2002 to 2004. On the left, the sales are shown per quarter. On the right, the data are aggregated to provide the annual sales.
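A sketch of this aggregation in R; only the year range comes from the figure, and the quarterly amounts are made up:

    # Quarterly sales per year (left side of Figure 2.13, amounts hypothetical)
    sales <- data.frame(year    = rep(2002:2004, each = 4),
                        quarter = rep(1:4, times = 3),
                        amount  = c(224, 408, 350, 586,
                                    301, 352, 400, 512,
                                    310, 290, 450, 520))

    # Aggregate to annual sales (right side of the figure)
    annual <- aggregate(amount ~ year, data = sales, FUN = sum)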

SLIDE 43

Data Cube

  • Provide fast access to pre‐computed, summarized data, thereby benefiting on‐line analytical processing as well as data mining

SLIDE 44

Data Cube - Example

Figure 2.14 A data cube for sales at AllElectronics.

SLIDE 45

Attribute Subset Selection (1)

  • Attribute selection can help in the phases of the data mining (knowledge discovery) process
– By attribute selection,
  • we can improve data mining performance (speed of learning, predictive accuracy, or simplicity of rules)
  • we can visualize the data for the model selected
  • we reduce dimensionality and remove noise

SLIDE 46

Attribute Subset Selection (2)

  • Attribute (Feature) selection is a search problem

– Search directions

  • (Sequential) Forward selection
  • (Sequential) Backward selection (elimination)
  • Bidirectional selection

  • Decision tree algorithm (induction)

SLIDE 47

Attribute Subset Selection (3)

  • Attribute (Feature) selection is a search problem

– Search strategies

  • Exhaustive search
  • Heuristic search

– Selection criteria

  • Statistical significance
  • Information gain
  • etc.

SLIDE 48

Attribute Subset Selection (4)

Figure 2.15 Greedy (heuristic) methods for attribute subset selection

SLIDE 49

Data Discretization

  • Reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals
  • Interval labels can then be used to replace actual data values
  • Split (top‐down) vs. merge (bottom‐up)
  • Discretization can be performed recursively on an attribute
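A minimal sketch of interval-based discretization in R; the age values, the choice of three equal-width intervals, and the labels are all assumptions:

    age <- c(13, 15, 19, 22, 25, 33, 35, 40, 45, 52, 61, 70)

    # Divide the range into 3 equal-width intervals and replace the
    # actual values with interval labels
    age_disc <- cut(age, breaks = 3, labels = c("young", "middle_aged", "senior"))
    table(age_disc)  # counts per interval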

SLIDE 50

Why Discretization is Used?

  • Reduce data size.
  • Transform quantitative data to qualitative data.

SLIDE 51

Interval Merge by χ2 Analysis

  • Merging‐based (bottom‐up)
  • Merge: Find the best neighboring intervals and merge them to

form larger intervals recursively

  • ChiMerge [Kerber AAAI 1992, See also Liu et al. DMKD 2002]

SLIDE 52
  • Initially, each distinct value of a numerical attribute A is considered to be one interval

  • χ2 tests are performed for every pair of adjacent intervals
  • Adjacent intervals with the least χ2 values are merged together, since low χ2 values for a pair indicate similar class distributions

  • This merge process proceeds recursively until a predefined stopping criterion is met

SLIDE 53

Entropy-Based Discretization

  • The goal of this algorithm is to find the split with the

maximum information gain.

  • The boundary that minimizes the entropy over all

possible boundaries is selected

  • The process is recursively applied to partitions obtained until some stopping criterion is met
  • Such a boundary may reduce data size and improve classification accuracy

SLIDE 54

What is Entropy?

  • The entropy is a measure of the uncertainty associated with a random variable
  • As uncertainty and/or randomness increases for a result set, so does the entropy
  • Values range from 0 – 1 to represent the entropy of information

SLIDE 55

Entropy Example

SLIDE 56

Entropy Example

SLIDE 57

Entropy Example (cont’d)

SLIDE 58

Calculating Entropy

For m classes:

$$Entropy(S) = -\sum_{i=1}^{m} p_i \log_2 p_i$$

For 2 classes:

$$Entropy(S) = -p_1 \log_2 p_1 - p_2 \log_2 p_2$$

  • Calculated based on the class distribution of the samples in set S.
  • p_i is the probability of class i in S
  • m is the number of classes (class values)
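A direct R implementation of this formula (the helper name entropy and the sample label vectors are assumptions):

    # Entropy(S) = -sum_i p_i * log2(p_i), from the class distribution of S
    entropy <- function(labels) {
      p <- table(labels) / length(labels)  # class probabilities p_i
      -sum(p * log2(p))
    }

    entropy(c("yes", "yes", "no", "no"))  # 1.0: two equally likely classes
    entropy(rep("yes", 4))                # 0.0: a single class, no uncertainty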

SLIDE 59

Calculating Entropy From Split

  • Entropy of subsets S1 and S2 are calculated.

  • The calculations are weighted by their probability of being in

set S and summed.

  • In formula below,

– S is the set
– T is the value used to split S into S1 and S2

$$E(S, T) = \frac{|S_1|}{|S|} Entropy(S_1) + \frac{|S_2|}{|S|} Entropy(S_2)$$

SLIDE 60

Calculating Information Gain

  • Information Gain = Difference in entropy between original set (S) and weighted split (S1 + S2)

$$Gain(S, T) = Entropy(S) - E(S, T)$$

$$Gain(S, 56) = 0.991076 - 0.766289 = 0.224788$$

compare to

$$Gain(S, 46) = 0.091091$$
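Putting the last two slides together, a sketch in R that reuses the entropy() helper above; the convention that S1 takes the values with x <= T is an assumption:

    # E(S, T): weighted entropy after splitting S at threshold T, and
    # Gain(S, T) = Entropy(S) - E(S, T)
    info_gain <- function(x, labels, thresh) {
      s1 <- labels[x <= thresh]  # subset S1
      s2 <- labels[x >  thresh]  # subset S2
      e_split <- length(s1) / length(labels) * entropy(s1) +
                 length(s2) / length(labels) * entropy(s2)
      entropy(labels) - e_split
    }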

SLIDE 61

Numeric Concept Hierarchy

  • A concept hierarchy for a given numerical attribute

defines a discretization of the attribute

  • Recursively reduce the data by collecting and

replacing low level concepts by higher level concepts

SLIDE 62

A Concept Hierarchy for the Attribute Price

Figure 2.22 A concept hierarchy for the attribute price.

SLIDE 63

Segmentation by Natural Partitioning

  • A simple 3‐4‐5 rule can be used to segment numeric data into relatively uniform, “natural” intervals
– If an interval covers 3, 6, 7 or 9 distinct values at the most significant digit, partition the range into 3 equi‐width intervals
– If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 intervals
– If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 intervals

SLIDE 64

Figure 2.23 Automatic generation of a concept hierarchy for profit based on the 3-4-5 rule.

SLIDE 65

Concept Hierarchy Generation for Categorical Data

  • Specification of a partial ordering of attributes explicitly at the schema level by users or experts
  • Specification of a portion of a hierarchy by explicit data grouping
  • Specification of a set of attributes, but not of their partial ordering

SLIDE 66

Automatic Concept Hierarchy Generation

country: 15 distinct values
province_or_state: 365 distinct values
city: 3,567 distinct values
street: 674,339 distinct values

Based on the number of distinct values per attribute, p. 95
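A sketch of this heuristic in R (the location data frame is hypothetical): count the distinct values per attribute and sort, so the attribute with the fewest distinct values (country) becomes the top level of the generated hierarchy:

    loc <- data.frame(country = c("US", "US", "US", "CA"),
                      state   = c("NY", "NY", "CA", "ON"),
                      city    = c("Buffalo", "NYC", "Oakland", "Toronto"))

    # Fewest distinct values -> highest level of the concept hierarchy
    sort(sapply(loc, function(col) length(unique(col))))
    # country: 2, state: 3, city: 4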

SLIDE 67

Data preprocessing
  • Data cleaning
– Missing values: use the most probable value to fill in the missing value (and five other methods)
– Noisy data: binning; regression; clustering
  • Data integration
– Entity ID problem: metadata
– Redundancy: correlation analysis (correlation coefficient, chi‐square test)
  • Data transformation
– Smoothing: data cleaning
– Aggregation: data reduction
– Generalization: data reduction
– Normalization: min‐max; z‐score; decimal scaling
– Attribute construction: data reduction
  • Data reduction
– Data cube aggregation: data cubes store multidimensional aggregated information
– Attribute subset selection: stepwise forward selection; stepwise backward selection; combination; decision tree induction
– Dimensionality reduction: discrete wavelet transforms (DWT); principal components analysis (PCA)
– Numerosity reduction: regression and log‐linear models; histograms; clustering; sampling
– Data discretization: binning; histogram analysis; entropy‐based discretization; interval merging by chi‐square analysis; cluster analysis; intuitive partitioning
– Concept hierarchy: concept hierarchy generation