

SLIDE 1

Panorama des méthodes de détection et de traitement des anomalies
(Overview of methods for detecting and treating anomalies)

Laure Berti-Équille
IRD, www.ird.fr, laure.berti@ird.fr

AAFD 2012

SLIDE 2

In search of… data quality problems

"Dirty data":
– Badly formatted data
– Aberrant data (outliers)
– Duplicates
– Inconsistent data
– Obsolete data
– False, incorrect, erroneous data
– Incomplete, truncated, censored data
– Missing data

AAFD'12, Univ. Paris 13, Institut Galilée, 28-29 juin 2012

SLIDE 3

Outline

  • 1. Motivating Example
  • 2. Generic Guidelines
  • 3. Methods for Anomaly Detection
  • 4. Techniques for Cleaning Dirty Data
  • 5. Summary and Conclusions



SLIDE 5

IP Data Streams: A Picture

• 10 attributes, sampled every 5 minutes, over four weeks
• Axes transformed for plotting

*L. Berti-Équille, T. Dasu, D. Srivastava. Discovery of Complex Glitch Patterns: A Novel Approach to Quantitative Data Cleaning. Proc. of ICDE 2011, pp. 733–744, Hannover, Germany, 2011.

SLIDE 6

Detection of Patterns of Anomalies

[Figure: patterns of missing values, outliers, and duplicates across the stream attributes Interfaces, Utilization_Out, Utilization_In, Bytes_Out, Bytes_In, Memory, CPU, Latency, Syslog_Events, CPU_Poll]

SLIDE 7

Detection: Main Issues

A large variety of detection methods with conflicting results:
• No benchmark
• DQ problems are not necessarily rare events
• DQ problems may be (partially) correlated
• Mutual masking effects impair detection, e.g.:
  – missing values affect the detection of duplicates
  – duplicate records affect the detection of outliers
  – imputation methods may mask the presence of duplicates
• Classical assumptions won't hold (e.g., MCAR/MAR, normality, symmetry, uni-modality)

SLIDE 8

Cleaning: What Can Be Done?

• Ad hoc cleaning strategies:
  – Impute missing values: component-wise median?
  – De-duplicate: retain a random record?
  – Handle outliers: identify and remove? So many methods, but contradicting results?
  – Drop all records that have any imperfection
  – Add special categories and analyze singularities in isolation
• Almost all existing approaches are one-shot treatments of univariate glitches. Why?
• Cleaning introduces new errors!?

SLIDE 9

So Many Choices…

• Missing values: deletion, imputation, modeling
• Duplicates: deletion, fusion, random selection
• Outliers: deletion, winsorization, trimming

SLIDE 10

Outline

  • 1. Motivating Example
  • 2. Generic Guidelines
  • 3. Methods for Anomaly Detection
  • 4. Techniques for Cleaning Dirty Data
  • 5. Summary and Conclusions


SLIDE 11

Guidelines Step 1 – Explore the data distributions

Goal:
– Detect and count missing, extreme, and aberrant data values
– Decide not to consider some values or variables
– Decide which transformations and corrective actions to apply

For continuous variables:
– Discretization
– Test for normality (essential for small datasets) and normalization
– Optional test for homoscedasticity (equality of variance-covariance matrices)
– Detect non-linearity and non-monotonicity

For discrete variables:
– Group the categories with small populations
– Create new relevant aggregates

SLIDE 12

Step 1 – Data Distribution Characteristics

$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2}$, $\quad CV = \frac{\sigma}{\mu}$, $\quad S_x = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{x_i - \bar{x}}{\sigma}\right)^3$, $\quad K_x = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{x_i - \bar{x}}{\sigma}\right)^4$

• Dispersion
  – Standard deviation σ
  – Coefficient of Variation (CV): a normalized measure of the dispersion of a probability distribution
  – IQR: Q3 − Q1
  – Homoscedasticity: equality of variances of a variable over different subsets, using Levene, Bartlett, or Fisher tests (if p < .05 ⇒ heteroscedasticity)
• Skewness S_x: measure of the asymmetry of the probability distribution of a real-valued random variable
  – = 0: the distribution is symmetrical
  – > 0: the mass of the distribution is concentrated on the left
  – < 0: the mass of the distribution is concentrated on the right
• Kurtosis K_x: measure of the peakedness or flatness of the distribution
  – = 3: like the normal distribution
  – > 3: more concentrated (peaked) than the Gaussian
  – < 3: flatter than the Gaussian
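As a concrete illustration of the formulas above, this minimal pure-Python sketch computes the standard deviation, CV, skewness, and kurtosis of a sample (population versions, matching the 1/N definitions; the function name is ours, not from the tutorial):

```python
import math

def moments(xs):
    """Population standard deviation, CV, skewness, and kurtosis of a sample."""
    n = len(xs)
    mean = sum(xs) / n
    sigma = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    cv = sigma / mean                                        # normalized dispersion
    skew = sum(((x - mean) / sigma) ** 3 for x in xs) / n    # 0 for symmetric data
    kurt = sum(((x - mean) / sigma) ** 4 for x in xs) / n    # 3 for a Gaussian
    return sigma, cv, skew, kurt

# Classic sample with mean 5 and population standard deviation 2:
sigma, cv, skew, kurt = moments([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```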

SLIDE 13

Step 1 – Test for Normality

• Many DM methods assume multivariate normal distributions
• Multivariate normality can be assessed by inspecting the indices of multivariate skewness and kurtosis
• Lack of univariate normality occurs when the skewness index > 3.0 and the kurtosis index > 10
• Non-normal distributions can sometimes be corrected by transforming variables
• Tests:
  – Kolmogorov-Smirnov test: non-parametric test that quantifies the maximum distance between the empirical distribution function of the variable and the CDF of the normal distribution
  – Anderson-Darling test: variant of the K-S test that weights the tails of the distributions
  – Lilliefors test: variant of the K-S test for unknown mean and standard deviation
  – Shapiro-Wilk test: orders the sample values in ascending order and uses the correlation to detect small departures from normality; not suitable for very large sample sizes (SAS proc UNIVARIATE)
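The K-S statistic described above is simple enough to sketch in pure Python: the maximum gap between the empirical CDF and a normal CDF with known mean and standard deviation (no p-value computation here; function names are ours):

```python
import math

def normal_cdf(x, mu, sigma):
    """CDF of N(mu, sigma) via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def ks_statistic(xs, mu, sigma):
    """Maximum distance between the empirical CDF and the normal CDF N(mu, sigma)."""
    xs = sorted(xs)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        cdf = normal_cdf(x, mu, sigma)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, abs(cdf - i / n), abs(cdf - (i + 1) / n))
    return d

# A small sample placed near N(0,1) quantiles gives a small statistic...
close = ks_statistic([-1.28, -0.52, 0.0, 0.52, 1.28], 0.0, 1.0)
# ...while a heavily skewed sample gives a large one.
far = ks_statistic([0.0, 0.1, 0.2, 0.3, 9.0], 0.0, 1.0)
```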

SLIDE 14

Guidelines Step 2 – Analyze data relationships

Goal:
– Detect inconsistencies between two or more variables
– Determine relationships between one target variable and one or more explanatory variables, in order to eliminate variables with no effect
– Determine relationships between explanatory variables, in order to avoid multicollinearity, which may cause regression techniques to fail
– Quantify the strength of the relationship and its sensitivity in the presence of outliers
– Detect spurious correlations

Methods:
– Bivariate statistics measuring pair-wise correlations
– Discovery of functional dependencies (FDs)

SLIDE 15

Guidelines Steps 1 & 2 – Use the toolbox for detection

• UV (univariate) statistics
  – Distributional techniques: skewness, kurtosis
  – Goodness-of-fit tests: normality, chi-square tests, analysis of residuals, Kullback-Leibler divergence
  – Control charts: X-Bar, CUSUM, R
• MV (multivariate) statistics
  – Model-based methods: linear and logistic regression, probabilistic methods, MCD, MVE, robust estimators
  – Clustering: distance-based, density-based, subspace-based techniques
• Classification
  – Rule-based techniques; SVM, neural networks, Bayesian networks; information-theoretic measures; kernel-based methods
• Rule & pattern discovery
  – Association rule discovery; FD, AFD, CFD mining
• Visualization
  – Graphics, Q-Q plots, confusion matrix

Ultimate research goals: benchmarking, optimization, refinement, scalability, tuning, real-time, interactivity

SLIDE 16

Guidelines Step 3 – Data Preparation: Major Tasks

• Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration: integration of multiple databases, data cubes, or files
• Data transformation: normalization and aggregation
• Data reduction: obtains a representation reduced in volume that produces the same or similar analytical results
• Data discretization: part of data reduction, of particular importance for numerical data


SLIDE 18

Outline

• 1. Motivating Example
• 2. Methods for Anomaly Detection
  – Non-standardized, misfielded/misformatted
  – Duplicates
  – Outliers
  – Inconsistencies
  – Missing, truncated
  – Out-of-date
  – Erroneous, contradicting, false


SLIDE 20

Example: ICDM Steering Committee (misfielded values, non-standard representations)

| Name | Affiliation | City, State, Zip, Country | Phone |
|------|-------------|---------------------------|-------|
| Piatetsky-Shapiro G., PhD | U. of Massachusetts | | 617-264-9914 |
| David J. Hand | Imperial College | London, UK | |
| Benjamin W. Wah | Univ. of Illinois | IL 61801, USA | (217) 333-6903 |
| Hand D.J. | | | |
| Vippin Kumar | U. of Minnesota | MI, USA | |
| Xindong Wu | U. of Vermont | Burlington-4000 USA | NULL |
| Philip S. Yu | U. of Illinois | Chicago IL, USA | 999-999-9999 |
| Osmar R. Zaiiane | U. of Alberta | CA | 111-111-1111 |

SLIDE 21

Extract-Transform-Load (1/4)

Goals:
• Format detection, verification, and conversion
• Standardization of values with loose or predictable structure (e.g., addresses, names, bibliographic entries) [Christen et al., 2002]
• Abbreviation enforcement
• Data consolidation based on dictionaries and constraints

Approaches:
• Declarative language extensions
• Machine learning and HMMs for field and record segmentation
• Constraint-based methods [Fan et al., 2008]

SLIDE 22

ETL Operators

| Operators | Category | Application |
|-----------|----------|-------------|
| Mapping, Convert, Select, Drop, Add, Merge, Format | Row-level | Locally applied to a single row |
| Copy, Filter, Split, Switch | Router | Locally decide, for each row, which of the many (output) destinations it should be sent to |
| Pivot/Unpivot, Aggregate, Clustering | Unary grouper | Transform a set of rows into a single row |
| Union, Merge, Join, Look-up, Compare, Divide | Binary or N-ary | Combine many inputs into one output |
| Sort | Unary holistic | Perform a transformation on the entire dataset |

[Vassiliadis et al. 2007]

SLIDE 23

Open Source ETL: 2 of Many

• Kettle (PDI): http://www.pentaho.com/
• Febrl: http://cs.anu.edu.au/~Peter.Christen/Febrl/febrl-0.3/

SLIDE 24

Extract-Transform-Load (4/4)

Limitations:
• Design of ad hoc scenarios
• Performance/scalability issues due to dependencies among ETL jobs and sequential processing
• DB bottleneck for bulk ETL operators
• Mainly for structured (relational) data

Research directions:
• Optimization of ETL workflows*
• Active data warehousing
• Cleaning of data streams

*A. Simitsis, P. Vassiliadis, T. K. Sellis. State-Space Optimization of ETL Workflows. IEEE Transactions on Knowledge and Data Engineering (IEEE TKDE), vol. 17, no. 10, pp. 1404–1419, October 2005.

SLIDE 25

Outline

• 1. Motivating Example
• 2. Methods for Anomaly Detection
  – Non-standardized, misfielded/misformatted
  – Duplicates
  – Outliers
  – Inconsistencies
  – Missing, truncated
  – Out-of-date
  – Erroneous, contradicting, false

SLIDE 26

Record Linkage (RL): Blocking, Comparison, Classification, Fusion

• 1. Reduce the search space by partitioning the dataset into mutually exclusive blocks to compare
  – Hashing, sorted keys, sorted nearest neighbors, (multiple) windowing, clustering
• 2. Select and compute a comparison function measuring the similarity distance between pairs of records
  – Token-based: N-gram comparison, Jaccard, TF-IDF, cosine similarity
  – Edit-based: Jaro distance, edit distance, Levenshtein, Soundex
  – Domain-dependent: data types, ad hoc rules, relationship-aware similarity measures
• 3. Select a decision model to classify pairs of records as matching, non-matching, or potentially matching
• 4. Select the deduplication method
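Steps 2 and 3 can be sketched with one token-based measure, character-bigram Jaccard similarity, plus a toy threshold-based decision model (the thresholds 0.7/0.4 and all function names are our own illustrative choices, not from the tutorial):

```python
def ngrams(s, n=2):
    """Set of character n-grams of a case- and whitespace-normalized string."""
    s = " ".join(s.lower().split())
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def jaccard(a, b, n=2):
    """Token-based similarity: |ngrams(a) & ngrams(b)| / |ngrams(a) | ngrams(b)|."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    return len(ga & gb) / len(ga | gb)

def classify(a, b, hi=0.7, lo=0.4):
    """Toy decision model: matching / potentially matching / non-matching."""
    s = jaccard(a, b)
    return "matching" if s >= hi else "potential" if s >= lo else "non-matching"
```

In a real RL pipeline this comparison would only be applied inside the blocks produced by step 1.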

SLIDE 27

Record Linkage (RL): Blocking, Comparison, Classification, Fusion

• ELMAGARMID, AHMED K., IPEIROTIS, PANAGIOTIS G., & VERYKIOS, VASSILIOS S. Duplicate Record Detection: A Survey. IEEE Trans. Knowl. Data Eng., 19(1), 1–16, 2007.
• SimMetrics: Similarity Metric Java Library. http://sourceforge.net/projects/simmetrics/
• KOUDAS, NICK, SARAWAGI, SUNITA, & SRIVASTAVA, DIVESH. Record Linkage: Similarity Measures and Algorithms. Tutorial of SIGMOD 2006.
• DONG, LUNA, & NAUMANN, FELIX. Data Fusion: Resolving Data Conflicts for Integration. Tutorial of VLDB 2009.
SLIDE 28

Chaining or Spurious Linkage

| ID | Name | Address |
|----|------|---------|
| 1 | AT&T | 180 Park. Av Florham Park |
| 2 | ATT | 180 park Ave. Florham Park NJ |
| 3 | AT&T Labs | 180 Park Avenue Florham Park |
| 4 | ATT | Park Av. 180 Florham Park |
| 5 | TAT | 180 park Av. NY |
| 6 | ATT | 180 Park Avenue. NY NY |
| 7 | ATT | Park Avenue, NY No. 180 |
| 8 | ATT | 180 Park NY NY |

[Figure: hierarchical clustering of the addresses, chaining the Florham Park records (1, 3, 4) to the New York records (5, 6, 8) through intermediate near-matches]

Limitations:
• Expertise required for method selection and parameterization
• No benchmark

SLIDE 29

Outline

• 1. Motivating Example
• 2. Methods for Anomaly Detection
  – Non-standardized, misfielded/misformatted
  – Duplicates
  – Outliers
  – Inconsistencies
  – Missing, truncated
  – Out-of-date
  – Erroneous, contradicting, false

SLIDE 30

Outlier Taxonomy

Anomaly detection:
• Point anomaly detection
  – Classification-based: rule-based, neural-network-based, SVM-based
  – Nearest-neighbor-based: density-based, distance-based
  – Statistical: parametric, non-parametric
  – Clustering-based
  – Others: information-theory-based, spectral-decomposition-based, visualization-based
• Contextual anomaly detection
• Collective anomaly detection
• Online anomaly detection
• Distributed anomaly detection

Chandola, V., Banerjee, A., & Kumar, V. (2009). Anomaly Detection: A Survey. ACM Computing Surveys, 41(3), 1–58.

SLIDE 31

Example

• N1 and N2 are normal regions
• o1, o2, and o4 are point anomalies
• Region O3 is a collective anomaly

[Figure: 2-D scatter plot with normal regions N1 and N2, isolated points o1, o2, o4, and anomalous region O3]

SLIDE 32

So many detection methods…

Comparison of bivariate and multivariate analysis on the same X, Y, Z data:
• Bivariate analysis: rejection area = the data space excluding the region between the 2% and 98% quantiles of X and Y
• Multivariate analysis: rejection area based on Mahalanobis_dist(cov(X,Y)) > χ²(.98, 2)

Legitimate outliers, or data quality problems?
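The multivariate rejection rule above can be sketched in pure Python. For 2 degrees of freedom the chi-square CDF has the closed form 1 − exp(−x/2), so the 98% cutoff χ²(.98, 2) is exactly −2 ln(0.02) ≈ 7.82 (function names are ours):

```python
import math

def mahalanobis2_2d(point, mean, cov):
    """Squared Mahalanobis distance of a 2-D point, inverting the 2x2 covariance by hand."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    return dx * (inv[0][0] * dx + inv[0][1] * dy) + dy * (inv[1][0] * dx + inv[1][1] * dy)

# Chi-square(df=2) CDF is 1 - exp(-x/2), so the 98% quantile is -2*ln(0.02).
CUTOFF_98 = -2.0 * math.log(0.02)

def is_outlier(point, mean, cov):
    """Flag points whose squared Mahalanobis distance exceeds the 98% cutoff."""
    return mahalanobis2_2d(point, mean, cov) > CUTOFF_98
```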

SLIDE 33

Contextual Anomaly

aka "conditional anomalies"*

[Figure: the same value labeled normal in one context and anomalous in another]

*Xiuyao Song, Mingxi Wu, Christopher Jermaine, Sanjay Ranka. Conditional Anomaly Detection. IEEE Transactions on Data and Knowledge Engineering, 2006.

SLIDE 34

Collective Anomaly

• A collection of abnormal observations
• Requires the existence of a certain type of relationship between the observations:
  – Sequential
  – Spatial
  – Connectivity (graph)
• Each individual instance of a collective anomaly is not abnormal by itself

[Figure: subsequence anomaly in a time series]

SLIDE 35

Outlier Detection (1/4)

• Detection by inspecting frequency distributions and univariate measures of skewness and kurtosis
• Numerous detection techniques:
  – Distributional univariate technique: 3σ away from the mean
  – Goodness-of-fit tests: tests for normality, χ² test, analysis of residuals, Q-Q plots, Kullback-Leibler divergence
  – Control charts (X-Bar, R, CUSUM), error bounds, tolerance limits
  – Regression-based technique: measures the outlyingness with respect to a model, not an individual data point
  – Geometric techniques: define layers of increasing depth; outer layers contain the outlying points
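The simplest of these, the distributional 3σ rule, fits in a few lines of pure Python (population standard deviation; function name ours):

```python
import math

def three_sigma_outliers(xs):
    """Flag values lying more than 3 standard deviations away from the mean."""
    n = len(xs)
    mean = sum(xs) / n
    sigma = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    return [x for x in xs if abs(x - mean) > 3 * sigma]
```

Note the masking caveat from earlier slides: a large outlier inflates σ itself, so this rule can miss moderate outliers; robust variants use the median and MAD instead.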

SLIDE 36

Outlier Detection Methods (2/4)

• Popular methods: LOF, INFLO, LOCI; see the tutorial of [Kriegel et al., 2009] and ELKI: http://elki.dbs.ifi.lmu.de/wiki
• Mixture distribution: anomaly detection over noisy data using learned probability distributions [Eskin, 2000]
• Entropy: discovering cluster-based local outliers [He, 2003]
• Projection into higher-dimensional space: kernel methods for pattern analysis [Shawe-Taylor, Cristianini, 2005]

SLIDE 37

Distance-based Outliers (3/4)

Nearest-neighbour-based approaches:
• A point O in a dataset is a DB(p,d)-outlier if at least a fraction p of the points in the dataset lies at a distance greater than d from O. [Knorr, Ng, 1998]
• Outliers are the top n points whose distance to their k-th nearest neighbor is greatest. [Ramaswamy et al., 2000]

Limitations:
• When normal points do not have a sufficient number of neighbours
• In high-dimensional spaces, due to data sparseness
• When datasets have modes with varying density
• Computationally expensive
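The DB(p,d) definition above translates almost literally into a brute-force O(n²) sketch (we count the fraction over the other n−1 points; function name ours):

```python
import math

def db_outliers(points, p, d):
    """DB(p,d)-outliers in the sense of [Knorr, Ng, 1998]: points for which
    at least a fraction p of the other points lies at a distance greater than d."""
    out = []
    for o in points:
        far = sum(1 for q in points if q is not o and math.dist(o, q) > d)
        if far / (len(points) - 1) >= p:
            out.append(o)
    return out
```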

SLIDE 38

Density-based Outliers (4/4)

Goal: compute the local densities of particular regions and declare data points in low-density regions to be potential anomalies.

Methods:
• Local Outlier Factor (LOF) [Breunig et al., 2000]
• Connectivity Outlier Factor (COF) [Tang et al., 2002]
• Multi-Granularity Deviation Factor [Papadimitriou et al., 2003]

Example: for two points O1 and O2 near clusters of different density, the nearest-neighbor distance criterion flags O2 but not O1, while LOF flags O1 but not O2.

Limitations:
• Difficult choice between methods with contradicting results
• In high-dimensional spaces, factor values tend to cluster, because density is defined in terms of distance
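A minimal brute-force LOF sketch, following the definitions of [Breunig et al., 2000] (k-distance, reachability distance, local reachability density); O(n²) per query and with no tie handling, so an illustration rather than a production implementation (helper names ours):

```python
import math

def _knn(pts, i, k):
    """Indices of the k nearest neighbors of pts[i] (brute force)."""
    order = sorted((j for j in range(len(pts)) if j != i),
                   key=lambda j: math.dist(pts[i], pts[j]))
    return order[:k]

def _k_distance(pts, i, k):
    """Distance from pts[i] to its k-th nearest neighbor."""
    return math.dist(pts[i], pts[_knn(pts, i, k)[-1]])

def _lrd(pts, i, k):
    """Local reachability density: inverse mean reachability distance to the k-NN."""
    nbrs = _knn(pts, i, k)
    reach = sum(max(_k_distance(pts, j, k), math.dist(pts[i], pts[j])) for j in nbrs)
    return len(nbrs) / reach

def lof(pts, i, k):
    """LOF is about 1 for inliers and substantially larger for low-density outliers."""
    nbrs = _knn(pts, i, k)
    return sum(_lrd(pts, j, k) for j in nbrs) / (len(nbrs) * _lrd(pts, i, k))
```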

SLIDE 39

Outline

• 1. Motivating Example
• 2. Generic Guidelines
• 3. Methods for Anomaly Detection
• 4. Techniques for Cleaning Dirty Data
• 5. Summary and Conclusions

SLIDE 40

How to Handle Missing Data?

• Inclusion (applicable for less than 15% missing)
  – Anomalies are treated as a specific category
• Deletion
  – List-wise deletion omits the complete record (for less than 2% missing)
  – Pair-wise deletion excludes only the anomalous value from a calculation
• Substitution (applicable for less than 15% missing)
  – Single imputation based on mean, mode, or median replacement
  – Linear regression imputation
  – Multiple imputation (MI)
  – Full Information Maximum Likelihood (FIML)
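The single-imputation variants can be sketched in a few lines of pure Python, with `None` standing in for missing values (MI and FIML are out of scope for a sketch; function name ours):

```python
def impute(column, strategy="median"):
    """Single imputation: replace None entries by the mean, median, or mode
    of the observed values."""
    obs = sorted(v for v in column if v is not None)
    if strategy == "mean":
        fill = sum(obs) / len(obs)
    elif strategy == "median":
        mid = len(obs) // 2
        fill = obs[mid] if len(obs) % 2 else (obs[mid - 1] + obs[mid]) / 2
    elif strategy == "mode":
        fill = max(set(obs), key=obs.count)
    else:
        raise ValueError(strategy)
    return [fill if v is None else v for v in column]
```

As the earlier slides warn, one-shot imputation like this can itself mask duplicates and distort the distribution.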

SLIDE 41

How to Handle Dirty Data?

• Binning / smoothing
  – First sort the data and partition it into bins
  – Then smooth by bin means, bin medians, bin boundaries, etc.
• Clustering
  – Detect and remove outliers
• Combined computer and human inspection
  – Detect suspicious values and have a human check them
• Regression
  – Smooth by fitting the data with regression functions
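Smoothing by bin means, the first technique above, can be sketched as: sort, split into equal-depth bins, and replace each value by the mean of its bin (function name ours):

```python
def smooth_by_bin_means(xs, n_bins):
    """Sort the values, split them into equal-depth bins, and replace
    each value by the mean of its bin (the last bin absorbs any remainder)."""
    xs = sorted(xs)
    size = len(xs) // n_bins
    out = []
    for b in range(n_bins):
        bin_ = xs[b * size:(b + 1) * size] if b < n_bins - 1 else xs[b * size:]
        mean = sum(bin_) / len(bin_)
        out.extend([mean] * len(bin_))
    return out
```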

SLIDE 42

Discretization (Binning) (1/3)

Goal: transform continuous variables into a set of ranges treated as (ordered) categories.

Advantages:
– Simultaneous analysis of quantitative and qualitative variables
– Ability to capture non-linear correlations between continuous variables
– Neutralizes extreme values
– Handles missing values through the creation of a specific category
– Cardinality reduction

SLIDE 43

Discretization (Binning) (2/3)

Recommendations:
– Avoid large differences between the numbers of distinct values (categories) per variable
– Avoid categories with small populations
– The appropriate number of categories for a discrete or categorical variable is 4 or 5
– Remember:
  • the weight of a variable is proportional to its number of distinct values
  • the weight of a category is inversely proportional to its population
– Cardinality reduction applies to observations, variables, and categories:
  • very few variables implies possible information loss
  • too many variables implies very small populations and less interpretable results

SLIDE 44

Binning Methods (3/3)

• Equal-width (distance) partitioning:
  – Divides the range into N intervals of equal size: a uniform grid
  – If A and B are the lowest and highest values of the attribute, the width of the intervals is W = (B − A)/N
  – The most straightforward method
  – But outliers may dominate the presentation, and skewed data is not handled well
• Equal-depth (frequency) partitioning:
  – Divides the range into N intervals, each containing (approximately) the same number of samples
  – Good data scaling
  – Managing categorical attributes can be tricky
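The two partitioning schemes can be contrasted in a short sketch: equal-width returns interval boundaries from W = (B − A)/N, while equal-depth returns groups of (approximately) equal count (function names ours):

```python
def equal_width_bins(xs, n):
    """Equal-width partitioning: boundaries of N intervals of width W = (B - A) / N."""
    a, b = min(xs), max(xs)
    w = (b - a) / n
    return [a + i * w for i in range(n + 1)]

def equal_depth_bins(xs, n):
    """Equal-depth partitioning: N sorted groups with (approximately) the same
    number of samples; the last group absorbs any remainder."""
    xs = sorted(xs)
    size = len(xs) // n
    return [xs[i * size:(i + 1) * size] if i < n - 1 else xs[i * size:]
            for i in range(n)]
```

With a single extreme value in `xs`, the equal-width grid stretches to cover it (the outlier dominates), while the equal-depth groups barely change: exactly the trade-off the slide describes.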

SLIDE 45

Data Transformation

• Smoothing: remove noise from the data
• Aggregation: summarization, data cube construction
• Generalization: concept hierarchy climbing
• Normalization: scale values to fall within a small, specified range
  – min-max normalization
  – z-score normalization
  – normalization by decimal scaling
• Attribute/feature construction: new attributes constructed from the given ones
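The three normalization schemes listed above can be sketched in pure Python (population standard deviation for the z-score; function names ours):

```python
def min_max(xs, new_min=0.0, new_max=1.0):
    """Min-max normalization into [new_min, new_max]."""
    lo, hi = min(xs), max(xs)
    return [new_min + (x - lo) * (new_max - new_min) / (hi - lo) for x in xs]

def z_score(xs):
    """z-score normalization: zero mean, unit (population) standard deviation."""
    n = len(xs)
    mean = sum(xs) / n
    sigma = (sum((x - mean) ** 2 for x in xs) / n) ** 0.5
    return [(x - mean) / sigma for x in xs]

def decimal_scaling(xs):
    """Divide by the smallest power of 10 that maps every value into (-1, 1)."""
    k = 0
    while max(abs(x) for x in xs) / (10 ** k) >= 1:
        k += 1
    return [x / 10 ** k for x in xs]
```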

SLIDE 46

Summary

• Data preparation is a big issue for both data warehousing and data mining
• Data preparation includes:
  – Anomaly detection
  – Data cleaning
  – Data transformation
  – Discretization
  – Data reduction and feature selection
• A lot of methods have been developed: an extremely active area of research

SLIDE 47

Conclusions

Still a lot needs to be done to offer:
– An iterative process with performance and quality guarantees
– Benchmarks
– Optimization
– Formalized guidelines and rigorous methodologies
– User assistance

Iterative detection and cleaning, exploiting patterns and dependencies among anomalies, in a loop of detection, cleaning, and explanation:
– Duplicates: deduplication
– Outliers: uni- and multivariate detection
– Missing data: imputation
– Inconsistent data: constraint-based repair

SLIDE 48

Any questions?

SLIDE 49

Limited Bibliography

SLIDE 50

References

• Tutorials
  – BATINI, CARLO, CATARCI, TIZIANA, & SCANNAPIECO, MONICA. A Survey of Data Quality Issues in Cooperative Systems. Tutorial of the 23rd International Conference on Conceptual Modeling, ER 2004.
  – KOUDAS, NICK, SARAWAGI, SUNITA, & SRIVASTAVA, DIVESH. Record Linkage: Similarity Measures and Algorithms. Tutorial of SIGMOD 2006.
• Books
  – NAUMANN, FELIX. Quality-Driven Query Answering for Integrated Information Systems. Lecture Notes in Computer Science, vol. 2261. Springer-Verlag, 2002.
  – BATINI, CARLO, & SCANNAPIECO, MONICA. Data Quality: Concepts, Methodologies and Techniques. Data-Centric Systems and Applications. Springer-Verlag, 2006.
  – DASU, TAMRAPARNI, & JOHNSON, THEODORE. Exploratory Data Mining and Data Cleaning. John Wiley, 2003.
  – WANG, RICHARD Y., ZIAD, MOSTAPHA, & LEE, YANG W. Data Quality. Advances in Database Systems, vol. 23. Kluwer Academic Publishers, 2002.
• Data Profiling
  – DASU, TAMRAPARNI, JOHNSON, THEODORE, MUTHUKRISHNAN, S., & SHKAPENYUK, V. Mining Database Structure; Or, How to Build a Data Quality Browser. Proc. SIGMOD Conf., 2002.
  – CARUSO, FRANCESCO, COCHINWALA, MUNIR, GANAPATHY, UMA, LALK, GAIL, & MISSIER, PAOLO. Telcordia's Database Reconciliation and Data Quality Analysis Tool. Pages 615–618 of: Proceedings of the 26th International Conference on Very Large Data Bases, VLDB 2000. Cairo, Egypt, 2000.

AAFD'12, Univ. Paris 13, Institut Galilée, 28-29 juin 2012

SLIDE 51

References

• ETL
  – CHRISTEN, PETER. Febrl: An Open Source Data Cleaning, Deduplication and Record Linkage System with a Graphical User Interface. KDD 2008, pp. 1065–1068, 2008.
  – CHRISTEN, PETER, CHURCHES, TIM, & ZHU, XI. Probabilistic Name and Address Cleaning and Standardization. Australasian Data Mining Workshop, 2002.
  – RAHM, E., & DO, H. H. Data Cleaning: Problems and Current Approaches. Data Engineering Bulletin 23(4), 3–13, 2000.
  – GALHARDAS, HELENA, FLORESCU, DANIELA, SHASHA, DENNIS, SIMON, ERIC, & SAITA, CRISTIAN-AUGUSTIN. Declarative Data Cleaning: Language, Model, and Algorithms. Proc. VLDB Conf., pp. 371–380, 2001.
  – JOHNSON, THEODORE, MARATHE, AMIT, & DASU, TAMRAPARNI. Database Exploration and Bellman. IEEE Data Eng. Bull. 26(3), 34–39, 2003.
  – VASSILIADIS, PANOS, VAGENA, Z., SKIADOPOULOS, S., KARAYANNIDIS, N., & SELLIS, T. ARKTOS: A Tool For Data Cleaning and Transformation in Data Warehouse Environments. Bulletin of the Technical Committee on Data Engineering, vol. 23, no. 4, pp. 42–47, December 2000.
  – VASSILIADIS, PANOS, KARAGIANNIS, ANASTASIOS, TZIOVARA, VASILIKI, & SIMITSIS, ALKIS. Towards a Benchmark for ETL Workflows. QDB 2007, pp. 49–60, 2007.
  – ELFEKY, MOHAMED G., ELMAGARMID, AHMED K., & VERYKIOS, VASSILIOS S. TAILOR: A Record Linkage Tool Box. Pages 17–28 of: Proceedings of the 18th International Conference on Data Engineering, ICDE 2002. San Jose, CA, USA, 2002.
  – ELMAGARMID, AHMED K., IPEIROTIS, PANAGIOTIS G., & VERYKIOS, VASSILIOS S. Duplicate Record Detection: A Survey. IEEE Trans. Knowl. Data Eng., 19(1), 1–16, 2007.

SLIDE 52

References

• ETL
  – LIM, EE-PENG, SRIVASTAVA, JAIDEEP, PRABHAKAR, SATYA, & RICHARDSON, JAMES. Entity Identification in Database Integration. Pages 294–301 of: Proceedings of the 9th International Conference on Data Engineering, ICDE 1993. Vienna, Austria, 1993.
  – LOW, WAI LUP, LEE, MONG-LI, & LING, TOK WANG. A Knowledge-Based Approach for Duplicate Elimination in Data Cleaning. Inf. Syst., 26(8), 585–606, 2001.
  – SIMITSIS, ALKIS, VASSILIADIS, PANOS, & SELLIS, TIMOS K. Optimizing ETL Processes in Data Warehouses. Pages 564–575 of: Proceedings of the 21st International Conference on Data Engineering, ICDE 2005. Tokyo, Japan, 2005.
  – TEJADA, SHEILA, KNOBLOCK, CRAIG A., & MINTON, STEVEN. Learning Domain-Independent String Transformation Weights for High Accuracy Object Identification. Pages 350–359 of: Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2002. Edmonton, AL, Canada, 2002.
  – SIMITSIS, A., VASSILIADIS, P., & SELLIS, T. K. State-Space Optimization of ETL Workflows. IEEE Transactions on Knowledge and Data Engineering (IEEE TKDE), vol. 17, no. 10, pp. 1404–1419, October 2005.
• Approximate String Matching
  – NAVARRO, GONZALO. A Guided Tour to Approximate String Matching. ACM Comput. Surv., 33(1), 31–88, 2001.
  – GRAVANO, LUIS, IPEIROTIS, PANAGIOTIS G., JAGADISH, H. V., KOUDAS, NICK, MUTHUKRISHNAN, S., PIETARINEN, LAURI, & SRIVASTAVA, DIVESH. Using q-grams in a DBMS for Approximate String Processing. IEEE Data Eng. Bull., 24(4), 28–34, 2001.

SLIDE 53

References

• Record Linkage
  – ANANTHAKRISHNA, ROHIT, CHAUDHURI, SURAJIT, & GANTI, VENKATESH. Eliminating Fuzzy Duplicates in Data Warehouses. Pages 586–597 of: Proc. of VLDB 2002.
  – BAXTER, ROHAN A., CHRISTEN, PETER, & CHURCHES, TIM. A Comparison of Fast Blocking Methods for Record Linkage. Pages 27–29 of: Proceedings of the KDD'03 Workshop on Data Cleaning, Record Linkage and Object Consolidation, 2003.
  – BILENKO, MIKHAIL, BASU, SUGATO, & SAHAMI, MEHRAN. Adaptive Product Normalization: Using Online Learning for Record Linkage in Comparison Shopping. Pages 58–65 of: Proceedings of the 5th IEEE International Conference on Data Mining, ICDM 2005. Houston, TX, USA, 2005.
  – BHATTACHARYA, INDRAJIT, & GETOOR, LISE. Iterative Record Linkage for Cleaning and Integration. Pages 11–18 of: Proceedings of the 9th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, DMKD, 2004.
  – FELLEGI, IVAN P., & SUNTER, A. B. A Theory for Record Linkage. Journal of the American Statistical Association, 64, 1183–1210, 1969.
  – WINKLER, WILLIAM E. The State of Record Linkage and Current Research Problems. Tech. Rept., Statistics of Income Division, Internal Revenue Service Publication R99/04. U.S. Bureau of the Census, Washington, DC, USA, 1999.
  – WINKLER, WILLIAM E. Methods for Evaluating and Creating Data Quality. Inf. Syst., 29(7), 531–550, 2004.
  – WINKLER, WILLIAM E., & THIBAUDEAU, YVES. An Application of the Fellegi-Sunter Model of Record Linkage to the 1990 U.S. Decennial Census. Tech. Rept., Statistical Research Report Series RR91/09. U.S. Bureau of the Census, Washington, DC, USA, 1991.

SLIDE 54

References

• Duplicate Detection
  – HERNANDEZ, M., & STOLFO, S. The Merge/Purge Problem for Large Databases. Proc. SIGMOD Conf., pp. 127–135, 1995.
  – HERNANDEZ, M., & STOLFO, S. Real-World Data is Dirty: Data Cleansing and the Merge/Purge Problem. Data Mining and Knowledge Discovery, 2(1), 9–37, 1998.
  – BILENKO, MIKHAIL, & MOONEY, RAYMOND J. Adaptive Duplicate Detection Using Learnable String Similarity Measures. Proc. of the 9th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, pp. 39–48, Washington, DC, USA, 2003.
  – BILKE, ALEXANDER, BLEIHOLDER, JENS, BÖHM, CHRISTOPH, DRABA, KARSTEN, NAUMANN, FELIX, & WEIS, MELANIE. Automatic Data Fusion with HumMer. Proc. of the 31st Intl. Conf. on Very Large Data Bases, VLDB 2005, pp. 1251–1254. Trondheim, Norway, 2005.
  – CHAUDHURI, SURAJIT, GANTI, VENKATESH, & KAUSHIK, RAGHAV. A Primitive Operator for Similarity Joins in Data Cleaning. Page 5 of: Proceedings of the 22nd International Conference on Data Engineering, ICDE 2006. Atlanta, GA, USA, 2006.
  – GRAVANO, LUIS, IPEIROTIS, PANAGIOTIS G., KOUDAS, NICK, & SRIVASTAVA, DIVESH. Text Joins for Data Cleansing and Integration in an RDBMS. Proc. of the 19th Intl. Conf. on Data Engineering, ICDE 2003, pp. 729–731, Bangalore, India, 2003.
  – MCCALLUM, ANDREW, NIGAM, KAMAL, & UNGAR, LYLE H. Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching. Proc. of the 6th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, KDD 2000, pp. 169–178. Boston, MA, USA, 2000.
  – MONGE, ALVARO E. Matching Algorithms within a Duplicate Detection System. IEEE Data Eng. Bull., 23(4), 14–20, 2000.
  – WEIS, MELANIE, & NAUMANN, FELIX. Detecting Duplicate Objects in XML Documents, 2004.
  – WEIS, MELANIE, NAUMANN, FELIX, & BROSY, FRANZISKA. A Duplicate Detection Benchmark for XML (and Relational) Data. Proc. of the 3rd Intl. ACM SIGMOD 2006 Workshop on Information Quality in Information Systems, IQIS 2006. Chicago, IL, USA, 2006.

SLIDE 55

References

• Data Preparation
  – STATNOTES: Topics in Multivariate Analysis. Retrieved 10/17/2008 from http://www2.chass.ncsu.edu/garson/pa765/statnote.htm
  – KLINE, R. B. Data Preparation and Screening. Chapter 3 in Principles and Practice of Structural Equation Modeling. NY: Guilford Press, pp. 45–62, 2005.
  – BANSAL, NIKHIL, BLUM, AVRIM, & CHAWLA, SHUCHI. Correlation Clustering. Machine Learning, 56(1-3), 89–113, 2004.
  – PARSONS, SIMON. Current Approaches to Handling Imperfect Information in Data and Knowledge Bases. IEEE Trans. Knowl. Data Eng., 8(3), 353–372, 1996.
  – PEARSON, RONALD K. The Problem of Disguised Missing Data. SIGKDD Explorations, 8(1), 83–92, 2006.
  – PEARSON, RONALD K. Surveying Data for Patchy Structure. SDM 2005.
  – PEARSON, RONALD K. Mining Imperfect Data: Dealing with Contamination and Incomplete Records. Philadelphia: SIAM, 2005.

SLIDE 56

References

• Geometric Outliers
  – PREPARATA, F. P., & SHAMOS, M. I. Computational Geometry: An Introduction. Springer-Verlag, 1988.
• Distributional Outliers
  – KNORR, EDWIN M., & NG, RAYMOND T. Algorithms for Mining Distance-Based Outliers in Large Datasets. Proc. of the 24th International Conference on Very Large Data Bases, VLDB 1998, pp. 392–403. New York City, NY, USA, 1998.
  – BREUNIG, MARKUS M., KRIEGEL, HANS-PETER, NG, RAYMOND T., & SANDER, JÖRG. LOF: Identifying Density-Based Local Outliers. Proc. of the 2000 ACM SIGMOD International Conference on Management of Data, pp. 93–104. Dallas, TX, USA, 2000.
• Missing Value Imputation
  – SCHAFER, J. L. Analysis of Incomplete Multivariate Data. New York: Chapman and Hall, 1997.
  – LITTLE, R. J. A., & RUBIN, D. B. Statistical Analysis with Missing Data. New York: John Wiley & Sons, 1987.
  – McKNIGHT, P. E., FIGUEREDO, A. J., & SIDANI, S. Missing Data: A Gentle Introduction. Guilford Press, 2007.
  – DEMPSTER, ARTHUR PENTLAND, LAIRD, NAN M., & RUBIN, DONALD B. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, 39, 1–38, 1977.