Data Mining and Exploration: Preprocessing
Amos Storkey, School of Informatics January 23, 2006 http://www.inf.ed.ac.uk/teaching/courses/dme/
These lecture slides are based extensively on previous versions of the course written by Chris Williams.
1 / 1
Data Preprocessing
Data preparation is a big issue for data mining. Cabena et al (1998) estimate that data preparation accounts for 60% of the effort in a data mining application.
◮ Data cleaning ◮ Data integration and transformation ◮ Data reduction
Reading: Han and Kamber, chapter 3
2 / 1
Why Data Preprocessing?
Data in the real world is dirty. It is:
◮ incomplete, e.g. lacking attribute values ◮ noisy, e.g. containing errors or outliers ◮ inconsistent, e.g. containing discrepancies in codes or
names GIGO: need quality data to get quality results
3 / 1
Major Tasks in Data Preprocessing
Data cleaning Data integration Data transformation Data reduction attributes attributes A1 A2 A3 ... A126
- 2, 32, 100, 59, 48
- 0.02, 0.32, 1.00, 0.59, 0.48
T1 T2 T3 T4 ... T2000 transactions transactions A1 A3 ... T1 T4 ... T1456 A115
◮ Data cleaning ◮ Data integration ◮ Data transformation ◮ Data reduction
Figure from Han and Kamber
4 / 1