Overview of Methods for Detecting and Handling Anomalies
Laure Berti-Équille
IRD
AAFD 2012
www.ird.fr laure.berti@ird.fr
In search of data quality problems. Dirty data:
– Malformed data
– Outliers (aberrant data)
– Duplicates
– Inconsistent data
– Obsolete data
– False, incorrect, or erroneous data
– Incomplete, truncated, or censored data
– Missing data
AAFD'12, Univ. Paris 13, Institut Galilée, 28-29 June 2012
*L. Berti-Équille, T. Dasu, D. Srivastava: Discovery of complex glitch patterns: A novel approach to Quantitative Data Cleaning. Proc. of ICDE 2011, pp. 733-744, Hannover, Germany, 2011.
[Figure labels: Missing, Outliers, Duplicate, Outliers; attributes: Interfaces, Utilization_Out, Utilization_In, Bytes_Out, Bytes_In, Memory, CPU, Latency, Syslog_Events, CPU_Poll]
A large variety of detection methods with
– No benchmark
– Data quality (DQ) problems are not necessarily rare events
– DQ problems may be (partially) correlated
– Mutual masking effects impair detection (e.g., missing values affect the detection of duplicates)
– Classical assumptions won't work (e.g., MCAR/MAR, normality, symmetry, unimodality)
Data glitches and their treatment options:
– Missing Values: Deletion, Imputation, Modeling
– Duplicates: Deletion, Fusion, Random Selection
– Outliers: Deletion, Winsorization, Trimming
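To make the difference between two of the outlier treatments concrete, here is a minimal sketch in plain Python. The percentile bounds (10 and 90), the interpolation rule, and the sample data are assumptions chosen for illustration; they are not taken from the slides.

```python
def percentile(sorted_values, q):
    """Linear-interpolation percentile (q in [0, 100]) of an already sorted list."""
    pos = (len(sorted_values) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (pos - lo)

def trim(values, lower_q, upper_q):
    """Trimming: drop the values that fall outside the percentile bounds."""
    s = sorted(values)
    low, high = percentile(s, lower_q), percentile(s, upper_q)
    return [v for v in values if low <= v <= high]

def winsorize(values, lower_q, upper_q):
    """Winsorization: keep every value but clamp it to the percentile bounds."""
    s = sorted(values)
    low, high = percentile(s, lower_q), percentile(s, upper_q)
    return [min(max(v, low), high) for v in values]

# Hypothetical sample: 250 is an injected extreme value.
data = [12, 14, 13, 15, 14, 13, 12, 16, 15, 250]
print(trim(data, 10, 90))
print(winsorize(data, 10, 90))
```

Trimming shortens the sample, whereas winsorization preserves its size and only pulls the extreme value back to the upper bound.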
– Detect and count missing, extreme and aberrant data values – Decide not to consider some values or variables – Decide the transformation and corrective actions to apply
– Discretization – Test for normality (essential for small datasets) and normalization – Optional test for homoscedasticity (equality of variance-covariance matrices) – Detect non-linearity and non-monotonicity
– Group the variables with small populations – Create new relevant aggregates
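Standard deviation:
\[ \sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2} \]
Coefficient of variation:
\[ CV = \frac{\sigma}{\mu} \]
Skewness:
\[ S = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{x_i - \bar{x}}{\sigma} \right)^3 \]
Kurtosis:
\[ K = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{x_i - \bar{x}}{\sigma} \right)^4 \]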
– Standard deviation
– Coefficient of Variation (CV): a normalized measure of dispersion
– IQR: Q3 − Q1
– Homoscedasticity: equality of variances for a variable on different subsets, tested with the Levene, Bartlett, or Fisher tests (p < .05 ⇒ heteroscedasticity)
distribution of a real-valued random variable
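The dispersion and shape measures listed above can be computed in a few lines. This is a sketch that assumes the population form (division by N) for the standard deviation, skewness, and kurtosis, and relies on the default quartile method of Python's `statistics.quantiles` for the IQR; the function name `describe` and the sample are illustrative.

```python
import math
import statistics

def describe(values):
    """Standard deviation, CV, skewness, kurtosis (all with 1/N) and IQR of a sample."""
    n = len(values)
    mean = sum(values) / n
    sigma = math.sqrt(sum((x - mean) ** 2 for x in values) / n)
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    return {
        "mean": mean,
        "std": sigma,
        "cv": sigma / mean,
        "skewness": sum(((x - mean) / sigma) ** 3 for x in values) / n,
        "kurtosis": sum(((x - mean) / sigma) ** 4 for x in values) / n,
        "iqr": q3 - q1,
    }

# Hypothetical symmetric sample: its skewness should be zero.
stats = describe([1, 2, 3, 4, 5])
for name, value in stats.items():
    print(name, round(value, 4))
```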
– Kolmogorov-Smirnov test: non-parametric test that quantifies the maximum distance between the empirical distribution function of the variable and the CDF of the normal distribution
– Anderson-Darling test: variant of the K-S test that gives more weight to the tails of the distributions
– Lilliefors test: variant of the K-S test for unknown mean and standard deviation
– Shapiro-Wilk test: sorts the sample values in ascending order and uses the correlation to detect small departures from normality; not suitable for very large sample sizes (SAS proc UNIVARIATE)
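As an illustration of the K-S idea, the sketch below computes only the test statistic: the largest gap between the empirical distribution function and a normal CDF. Because the mean and standard deviation are estimated from the sample, this corresponds to the Lilliefors setting; no p-value is computed, since that would require the Lilliefors critical values. The two deterministic samples are assumptions built for the example.

```python
import math
from statistics import NormalDist, mean, stdev

def ks_statistic_normal(sample):
    """Largest gap between the empirical CDF and a normal CDF fitted to the sample."""
    xs = sorted(sample)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))
    d = 0.0
    for i, x in enumerate(xs):
        cdf = fitted.cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides of the step.
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d

# Hypothetical samples: evenly spaced normal quantiles and exponential quantiles.
normal_like = [NormalDist().inv_cdf((i + 0.5) / 50) for i in range(50)]
skewed = [-math.log(1 - (i + 0.5) / 50) for i in range(50)]
print(ks_statistic_normal(normal_like))
print(ks_statistic_normal(skewed))
```

The statistic stays small for the normal-like sample and grows for the skewed one, which is the signal the tests above turn into an accept or reject decision.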
– Detect inconsistencies between two or more variables
– Determine relationships between one target variable and one or more explanatory variables, in order to eliminate variables that have no effect
– Determine relationships between explanatory variables, in order to avoid multicollinearity, which may cause regression techniques to fail
– Quantify the strength of the relationship and sensitivity in presence
– Detect spurious correlations
– Bivariate statistics measuring pair-wise correlations – Discover FDs
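A functional dependency (FD) X → Y holds when every value of X is associated with exactly one value of Y, so checking a candidate FD also exposes inconsistent rows. The following is a minimal sketch; the helper name `fd_violations`, the column names, and the rows are hypothetical.

```python
from collections import defaultdict

def fd_violations(rows, lhs, rhs):
    """Return the lhs values that map to more than one rhs value (violations of lhs -> rhs)."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[lhs]].add(row[rhs])
    return {key: values for key, values in seen.items() if len(values) > 1}

# Hypothetical rows: the last one contradicts the dependency zip -> state.
rows = [
    {"zip": "61801", "state": "IL"},
    {"zip": "61801", "state": "IL"},
    {"zip": "60601", "state": "IL"},
    {"zip": "60601", "state": "CA"},
]
print(fd_violations(rows, "zip", "state"))
```

An empty result means the dependency holds on the data; each returned key points at a group of rows that cannot all be correct.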
MV statistics Model-based methods
Linear, logistic regression Probabilistic methods
MCD, MVE, Robust estimators
Clustering
Distance-based techniques Density-based techniques Subspace-based techniques
Visualization
Graphics Q-Q plot Confusion Matrix Distributional techniques Skewness, Kurtosis Goodness of fit tests: normality, Chi-square tests, analysis of residuals, Kullback-Leibler divergence Control Charts: X-Bar, CUSUM, R
UV statistics Classification
Rule-based techniques SVM, Neural Networks, Bayesian Networks Information theoretic measures Kernel-based methods
Rule & Pattern Discovery
Association Rule Discovery FD, AFD, CFD mining
– Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
– Data integration: integration of multiple databases, data cubes, or files
– Data transformation: normalization and aggregation
– Data reduction: obtain a reduced representation in volume that produces the same or similar analytical results
– Data discretization: part of data reduction, but of particular importance, especially for numerical data
– Non standardized, misfielded/formatted – Duplicates – Outliers – Inconsistencies – Missing, truncated – Out-of-date – Erroneous, contradicting, false
ICDM Steering Committee (annotations: Misfielded Value, Non-standard representation)
Name | Affiliation | City, State, Zip, Country | Phone
Piatetsky-Shapiro G.,PhD | | | 617-264-9914
David J. Hand | Imperial College | London, UK |
Benjamin W. Wah | | IL 61801, USA | (217) 333-6903
Hand D.J. | | |
Vippin Kumar | | |
Xindong Wu | | Burlington-4000 USA | NULL
Philip S. Yu | | Chicago IL, USA | 999-999-9999
Osmar R. Zaiiane | | CA | 111-111-1111
e.g., addresses, names, bibliographic entries
[Christen et al., 2002]
Operators | Category | Application
Mapping, Convert, Select, Drop, Add, Merge, Format | Row-level | Locally applied to a single row
Copy, Filter, Split, Switch | Router | Locally decide, for each row, which [...] should be sent to
Pivot/Unpivot, Aggregate, Clustering | Unary Grouper | Transform a set of rows to a single row
Union, Merge, Join, Look-up, Compare, Divide | Binary or N-ary | Combine many inputs into one
Sort | Unary Holistic | Perform a transformation to the entire dataset
[Vassiliadis et al. 2007]
http://cs.anu.edu.au/~Peter.Christen/Febrl/febrl-0.3/
http://www.pentaho.com/
*A. Simitsis, P. Vassiliadis, T. K. Sellis. State-Space Optimization of ETL Workflows. IEEE Transactions on Knowledge and Data Engineering (IEEE TKDE) vol. 17, no. 10, pp. 1404-1419, October 2005.
Windowing, Clustering
similarity
aware similarity measures
ELMAGARMID, AHMED K., IPEIROTIS, PANAGIOTIS G., & VERYKIOS, VASSILIOS S. Duplicate Record Detection: A Survey. IEEE Trans. Knowl. Data Eng., 19(1), 1–16, 2007.
http://sourceforge.net/projects/simmetrics/
Resolving Data Conflicts for Integration. Tutorial
ID | Name | Address
1 | AT&T | 180 Park. Av Florham Park
2 | ATT | 180 park Ave. Florham Park NJ
3 | AT&T Labs | 180 Park Avenue Florham Park
4 | ATT | Park Av. 180 Florham Park
5 | TAT | 180 park Av. NY
6 | ATT | 180 Park Avenue. NY NY
7 | ATT | Park Avenue, NY No. 180
8 | ATT | 180 Park NY NY

Addresses grouped by similarity:
– Park Av. 180 Florham Park
– 180 Park Avenue Florham Park
– 180 Park. Av Florham Park
– 180 park Ave. Florham Park NJ
– 180 Park Avenue. NY NY
– 180 park Av. NY
– 180 Park NY NY
– Park Avenue, NY No. 180

[Figure labels: record IDs 1, 3, 4, 5, 6, 8]
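One simple way to find candidate duplicates among the addresses above is a token-based similarity. The sketch below uses Jaccard similarity on normalized word sets; this is only one of many possible measures, and the abbreviation table and the 0.75 threshold are assumptions made for the example rather than values from the slides.

```python
import re
from itertools import combinations

# Addresses copied from the table above, keyed by record ID.
records = {
    1: "180 Park. Av Florham Park",
    2: "180 park Ave. Florham Park NJ",
    3: "180 Park Avenue Florham Park",
    4: "Park Av. 180 Florham Park",
    5: "180 park Av. NY",
    6: "180 Park Avenue. NY NY",
    7: "Park Avenue, NY No. 180",
    8: "180 Park NY NY",
}

# Hypothetical abbreviation table; a real standardizer would need a much larger one.
ABBREVIATIONS = {"av": "avenue", "ave": "avenue"}

def tokens(address):
    """Lowercase the address, strip punctuation, expand abbreviations, return the word set."""
    words = re.findall(r"[a-z0-9]+", address.lower())
    return frozenset(ABBREVIATIONS.get(w, w) for w in words)

def jaccard(a, b):
    """Size of the intersection divided by the size of the union of two token sets."""
    return len(a & b) / len(a | b)

def candidate_pairs(records, threshold):
    """Record pairs whose token sets have a Jaccard similarity of at least `threshold`."""
    toks = {rid: tokens(addr) for rid, addr in records.items()}
    pairs = []
    for i, j in combinations(sorted(toks), 2):
        score = jaccard(toks[i], toks[j])
        if score >= threshold:
            pairs.append((i, j, score))
    return pairs

for i, j, score in candidate_pairs(records, threshold=0.75):
    print(i, j, round(score, 2))
```

Because word order is ignored, "Park Av. 180 Florham Park" and "180 Park Avenue Florham Park" get identical token sets, while the Florham Park and NY groups stay apart at this threshold.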
Anomaly Detection
– Point Anomaly Detection
  – Classification Based: Rule Based, Neural Networks Based, SVM Based
  – Nearest Neighbor Based: Density Based, Distance Based
  – Statistical: Parametric, Non-parametric
  – Clustering Based
  – Others: Information Theory Based, Spectral Decomposition Based, Visualization Based
– Contextual Anomaly Detection
– Collective Anomaly Detection
– Online Anomaly Detection
– Distributed Anomaly Detection
Chandola, V., Banerjee, A., & Kumar, V. (2009). Anomaly Detection – A Survey. ACM Computing Surveys, 41(3), 1–58.
[Figure labels: X, Z, N1, N2, O3, Y, O4]
Comparison of multivariate analysis and bivariate analysis (variables X, Y, Z):
– Rejection area: the data space excluding the area between the 2% and 98% quantiles for X and Y
– Rejection area based on: Mahalanobis_dist(cov(X,Y)) > χ2(.98,2)
Legitimate
data quality problems?
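The Mahalanobis rejection rule can be sketched for two variables as follows. The sketch assumes that the squared Mahalanobis distance is the quantity compared with the chi-square quantile, and it uses the closed form of that quantile for 2 degrees of freedom, −2 ln(1 − p). The data set, including the injected point, is hypothetical.

```python
import math

def chi2_quantile_2df(p):
    """Quantile of the chi-square distribution with 2 degrees of freedom (closed form)."""
    return -2.0 * math.log(1.0 - p)

def mahalanobis_sq(points):
    """Squared Mahalanobis distance of each (x, y) point to the sample mean."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # [dx dy] * inverse(cov) * [dx dy]^T with the 2x2 inverse written out
        distances.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return distances

# Hypothetical data: Y follows X closely, plus one point that is unremarkable
# on each axis alone but far from the X-Y trend.
points = [(i / 10, i / 10 + 0.3 * math.sin(i)) for i in range(50)]
points.append((0.5, 4.5))
threshold = chi2_quantile_2df(0.98)
flagged = [i for i, d2 in enumerate(mahalanobis_sq(points)) if d2 > threshold]
print(round(threshold, 3), flagged)
```

The injected point falls inside the range of X and inside the range of Y, so a per-variable quantile rule would tend to miss it, whereas the covariance-aware distance singles it out.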
* Xiuyao Song, Mingxi Wu, Christopher Jermaine, Sanjay Ranka, Conditional Anomaly Detection, IEEE Transactions
Normal Anomaly
– Sequential – Spatial – Connectivity (graph)
Subsequence anomaly
– When normal points do not have sufficient number of neighbours – In high dimensional spaces due to data sparseness – When datasets have modes with varying density – Computationally expensive
[Figure labels: O, d]
A point O in a dataset is a DB(p,d)-outlier if at least a fraction p of the points in the dataset lie at a distance greater than d from O. [Knorr, Ng, 1998]
Outliers are the top n points whose distance to their k-th nearest neighbor is greatest. [Ramaswamy et al., 2000]
[Figure labels: O, NNd]
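Both distance-based definitions can be written directly as brute-force loops. This is a sketch for small data only (quadratic cost); it assumes Euclidean distance and computes the fraction p over the other points of the dataset, excluding O itself. The sample points are hypothetical.

```python
import math

def db_outliers(points, p, d):
    """DB(p, d)-outliers: points with at least a fraction p of the other points farther than d."""
    result = []
    for i, o in enumerate(points):
        others = [q for j, q in enumerate(points) if j != i]
        far = sum(1 for q in others if math.dist(o, q) > d)
        if far / len(others) >= p:
            result.append(o)
    return result

def top_n_knn_outliers(points, k, n):
    """The n points whose distance to their k-th nearest neighbor is greatest."""
    def kth_nn_distance(i):
        dists = sorted(math.dist(points[i], q) for j, q in enumerate(points) if j != i)
        return dists[k - 1]
    ranked = sorted(range(len(points)), key=kth_nn_distance, reverse=True)
    return [points[i] for i in ranked[:n]]

# Hypothetical data: a 3x3 grid of points and one isolated point.
points = [(float(x), float(y)) for x in range(3) for y in range(3)]
points.append((10.0, 10.0))
print(db_outliers(points, p=0.9, d=3.0))
print(top_n_knn_outliers(points, k=2, n=1))
```

The first definition needs two parameters (p and d) that are hard to choose; the second replaces them with a ranking, which only requires k and the number n of outliers to report.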
d1 d2
Compute local densities of particular regions and declare data points in low density regions as potential anomalies
For the two points O1 and O2 in the figure:
– NN approach: O2 is an outlier but O1 is not
– LOF approach: O1 is an outlier but O2 is not
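The contrast between the nearest-neighbor and LOF views can be reproduced with a small sketch. It is a simplified LOF that uses exactly k neighbors per point (ties broken by index) and assumes no duplicate points; the data layout is hypothetical, with `near_dense` playing the role of a point close to a dense cluster and `in_sparse` a member of a sparse cluster.

```python
import math

def lof_scores(points, k):
    """Return (k-th nearest neighbor distance, LOF score) lists for all points."""
    n = len(points)

    def dist(i, j):
        return math.dist(points[i], points[j])

    neighbors = []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: (dist(i, j), j))
        neighbors.append(order[:k])
    k_dist = [dist(i, neighbors[i][-1]) for i in range(n)]
    # Local reachability density: inverse of the mean reachability distance to the neighbors.
    lrd = [k / sum(max(k_dist[j], dist(i, j)) for j in neighbors[i]) for i in range(n)]
    # LOF: average density of the neighbors relative to the point's own density.
    lof = [sum(lrd[j] for j in neighbors[i]) / (k * lrd[i]) for i in range(n)]
    return k_dist, lof

# Hypothetical data: a dense 5x5 grid, a sparse 4x4 grid, and one point near the dense grid.
dense = [(0.1 * x, 0.1 * y) for x in range(5) for y in range(5)]
sparse = [(10.0 + 2 * x, 10.0 + 2 * y) for x in range(4) for y in range(4)]
points = dense + sparse + [(1.0, 0.2)]
near_dense = len(points) - 1
in_sparse = len(dense)

k_dist, lof = lof_scores(points, k=3)
print("k-NN distance:", round(k_dist[near_dense], 3), round(k_dist[in_sparse], 3))
print("LOF:", round(lof[near_dense], 3), round(lof[in_sparse], 3))
```

Ranked by k-th neighbor distance, the sparse-cluster point looks more anomalous; ranked by LOF, the point next to the dense cluster stands out while the sparse-cluster point scores close to 1.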
– Avoid large differences between the numbers of distinct values (categories) per variable – Avoid categories with small population – The appropriate number of categories for a discrete or categorical variable is 4 or 5 – Remember :
distinct values
population – Cardinality reduction on observations, variables, and categories
interpretable results
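Avoiding categories with a small population usually means merging them. A minimal sketch, assuming a frequency threshold `min_count` and a catch-all label "Other" (both hypothetical choices):

```python
from collections import Counter

def group_rare_categories(values, min_count, other_label="Other"):
    """Replace every category seen fewer than `min_count` times with a catch-all label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]

# Hypothetical categorical variable with two sparsely populated categories.
values = ["A"] * 10 + ["B"] * 8 + ["C"] * 1 + ["D"] * 2
grouped = group_rare_categories(values, min_count=5)
print(Counter(grouped))
```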
Iterative Detection and Cleaning
Patterns and Dependencies among Anomalies
Detection, Cleaning, Explanation
– Duplicates: Deduplication
– Outliers: Uni- and MV- Detection
– Missing Data: Imputation
– Inconsistent Data: Constraint
– BATINI, CARLO, TIZIANA, CATARCI, & SCANNAPIECO, MONICA. 2004. A Survey of Data Quality Issues in Cooperative Systems. Tutorial of the 23rd International Conference on Conceptual Modeling, ER 2004. – KOUDAS, NICK, SARAWAGI SUNITA, SRIVASTAVA DIVESH. Record Linkage: Similarity Measures and Algorithms.Tutorial of SIGMOD 2006.
– NAUMANN, FELIX. Quality-Driven Query Answering for Integrated Information Systems. Lecture Notes in Computer Science, vol. 2261. Springer-Verlag,2002. – BATINI, CARLO, & SCANNAPIECO, MONICA. Data Quality: Concepts, Methodologies and Techniques. Data-Centric Systems and Applications. Springer-Verlag, 2006. – DASU, TAMRAPARNI, & JOHNSON, THEODORE. Exploratory Data Mining and Data Cleaning. John Wiley, 2003. – WANG, RICHARD Y., ZIAD, MOSTAPHA, & LEE, YANG W. Data Quality.Advances in Database Systems,
– DASU, TAMRAPARNI, JOHNSON, THEODORE, S. Muthukrishnan, V. Shkapenyuk, Mining Database Structure; Or, How to Build a Data Quality Browser, Proc. SIGMOD Conf. 2002 – CARUSO, FRANCESCO, COCHINWALA, MUNIR, GANAPATHY, UMA, LALK, GAIL, & MISSIER, PAOLO.
Proceedings of 26th International Conference on Very Large Data Bases, VLDB 2000. Cairo, Egypt.
– CHRISTEN, PETER: Febrl -: an open source data cleaning, deduplication and record linkage system with a graphical user interface. KDD 2008: 1065-1068, 2008. – CHRISTEN, PETER, CHURCHES, TIM, ZHU, XI. Probabilistic name and address cleaning and
– RAHM, E., DO, H.H., Data Cleaning: Problems and Current Approaches, Data Engineering Bulletin 23(4) 3-13, 2000. – GALHARDAS, HELENA, FLORESCU, DANIELA, SHASHA, DENNIS, SIMON, ERIC, SAITA, CRISTIAN-
380, 2001.
– JOHNSON THEODORE, MARATHE, AMIT, DASU TAMRAPARNI. Database Exploration and Bellman. IEEE Data Eng. Bull. 26(3): 34-39, 2003.
– VASSILIADIS, PANOS, VAGENA Z., SKIADOPOULOS S., KARAYANNIDIS N. and SELLIS, T. ARKTOS: A Tool For Data Cleaning and Transformation in Data Warehouse Environments. Bulletin of the Technical Committee on Data Engineering, vol. 23, no. 4, pp. 42-47, December 2000.
– VASSILIADIS, PANOS, KARAGIANNIS ANASTASIOS, TZIOVARA, VASILIKI, SIMITSIS, ALKIS. Towards a Benchmark for ETL Workflows. QDB 2007: 49-60, 2007.
– ELFEKY, MOHAMED G., ELMAGARMID, AHMED K., & VERYKIOS, VASSILIOS S. TAILOR: A Record Linkage Tool Box. Pages 17–28 of: Proceedings of the 18th International Conference on Data Engineering, ICDE 2002. San Jose, CA, USA, 2002.
– ELMAGARMID, AHMED K., IPEIROTIS, PANAGIOTIS G., & VERYKIOS, VASSILIOS S. Duplicate Record Detection: A Survey. IEEE Trans. Knowl. Data Eng., 19(1), 1–16, 2007.
– LIM, EE-PENG, SRIVASTAVA, JAIDEEP, PRABHAKAR, SATYA, & RICHARDSON, JAMES. 1993. Entity Identification in Database Integration. Pages 294–301 of: Proceedings of the 9th International Conference on Data Engineering, ICDE 1993. Vienna, Austria. – LOW, WAI LUP, LEE, MONG-LI, & LING, TOK WANG. 2001. A Knowledge-Based Approach for Duplicate Elimination in Data Cleaning. Inf. Syst., 26(8), 585–606. – SIMITSIS, ALKIS, VASSILIADIS, PANOS, & SELLIS, TIMOS K. 2005. Optimizing ETL Processes in Data
Engineering, ICDE 2005. Tokyo, Japan. – TEJADA, SHEILA, KNOBLOCK, CRAIG A., & MINTON, STEVEN. 2002. Learning Domain-Independent String Transformation Weights for High Accuracy Object Identification. Pages 350–359 of: Proceedings of the 8thACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2002. Edmonton, AL, Canada. –
Knowledge and Data Engineering (IEEE TKDE) vol. 17, no. 10, pp. 1404-1419, October 2005.
– NAVARRO, GONZALO. 2001. A Guided Tour to Approximate String Matching. ACM Comput. Surv., 33(1), 31–88. – GRAVANO, LUIS, IPEIROTIS, PANAGIOTIS G., JAGADISH, H. V., KOUDAS, NICK, MUTHUKRISHNAN, S., PIETARINEN, LAURI, & SRIVASTAVA, DIVESH. 2001. Using q-grams in a DBMS for Approximate String
– ANANTHAKRISHNA, ROHIT, CHAUDHURI, SURAJIT, & GANTI, VENKATESH. Eliminating Fuzzy Duplicates in Data Warehouses. pp. 586–597, Proc. of VLDB 2002. – BAXTER, ROHAN A., CHRISTEN, PETER, & CHURCHES, TIM. A Comparison of Fast Blocking Methods for Record Linkage. Pages 27–29 of: Proceedings of the KDD’03 Workshop on Data Cleaning, Record Linkage and Object Consolidation, 2003. – BILENKO, MIKHAIL, BASU, SUGATO, & SAHAMI, MEHRAN. 2005. Adaptive Product Normalization: Using Online Learning for Record Linkage in Comparison Shopping. Pages 58–65 of: Proceedings of the 5th IEEE International Conference on Data Mining, ICDM 2005. Houston, TX, USA, 2005. – BHATTACHARYA, INDRAJIT, & GETOOR, LISE. Iterative Record Linkage for Cleaning and
Issues in Data Mining and Knowledge Discovery, DMKD, 2004. – FELLEGI, IVAN P., & SUNTER, A.B. A Theory for Record Linkage. Journal of the American Statistical Association, 64, 1183–1210, 1969. – WINKLER, WILLIAM E. The State of Record Linkage and Current Research Problems. Tech.
Bureau of the Census, Washington, DC, USA, 1999.
– WINKLER, WILLIAM E. Methods for Evaluating and Creating Data Quality. Inf. Syst., 29(7), 531–550, 2004.
– WINKLER, WILLIAM E., & THIBAUDEAU, YVES. An Application of the Fellegi-Sunter Model of Record Linkage to the 1990 U.S. Decennial Census. Tech. Rept. Statistical Research Report Series RR91/09. U.S. Bureau of the Census, Washington, DC, USA, 1991.
– HERNANDEZ, M., STOLFO, S., The Merge/Purge Problem for Large Databases, Proc. SIGMOD Conf pg 127-135, 1995. – HERNANDEZ, M., STOLFO, S., Real-World Data is Dirty: Data Cleansing and the Merge/Purge Problem, Data Mining and Knowledge Discovery, 2(1)9-37, 1998. – BILENKO, MIKHAIL, & MOONEY, RAYMOND J. Adaptive Duplicate Detection Using Learnable String Similarity Measures. Proc. of the 9th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, pp. 39–48, Washington, DC, USA, 2003. – BILKE, ALEXANDER, BLEIHOLDER, JENS, BÖHM, CHRISTOPH, DRABA, KARSTEN, NAUMANN, FELIX, &WEIS, MELANIE. 2005. Automatic Data Fusion with HumMer. of: Proc. of the 31st Intl. Conf. on Very Large Data Bases, VLDB 2005, pp. 1251–1254 Trondheim, Norway. – CHAUDHURI, SURAJIT, GANTI, VENKATESH, &KAUSHIK, RAGHAV. 2006. A Primitive Operator for Similarity Joins in Data Cleaning. Page 5 of: Proceedings of the 22nd International Conference on Data Engineering, ICDE 2006. Atlanta, GA, USA. – GRAVANO, LUIS, IPEIROTIS, PANAGIOTIS G., KOUDAS, NICK, & SRIVASTAVA, DIVESH. Text Joins for Data Cleansing and Integration in an RDBMS. Proc.of the 19th Intl. Conf. on Data Engineering, ICDE 2003, pp. 729–731, Bangalore, India, 2003. – MCCALLUM, ANDREW, NIGAM, KAMAL, &UNGAR, LYLE H. 2000. Efficient Clustering of High- Dimensional Data Sets with Application to Reference Matching. Proc. of the 6th ACM SIGKDD Intl.
– MONGE, ALVARO E. 2000. Matching Algorithms within a Duplicate Detection System. IEEE Data Eng. Bull., 23(4), 14–20.
– WEIS, MELANIE, & NAUMANN, FELIX. 2004. Detecting Duplicate Objects in XML
– WEIS, MELANIE, NAUMANN, FELIX, & BROSY, FRANZISKA. 2006. A Duplicate Detection Benchmark for XML (and Relational) Data. Proc. of the 3rd Intl. ACM SIGMOD 2006 Workshop on Information Quality in Information Systems, IQIS 2006. Chicago, IL, USA.
– STATNOTES: Topics in Multivariate Analysis. Retrieved 10/17/2008 from http://www2.chass.ncsu.edu/garson/pa765/statnote.htm – KLINE, R.B., Data Preparation and Screening, Chapter 3. in Principles and Practice of Structural Equation Modeling, NY: Guilford Press, pp. 45-62, 2005. – BANSAL, NIKHIL, BLUM, AVRIM, and CHAWLA, SHUCHI. Correlation clustering. Machine Learning, 56(1- 3):89–113, 2004. – PARSONS, SIMON. Current Approaches to Handling Imperfect Information in Data and Knowledge
– PEARSON, RONALD K. The problem of disguised missing data. SIGKDD Explorations 8(1): 83-92, 2006.
– PEARSON, RONALD K. Surveying Data for Patchy Structure. SDM 2005
– PEARSON, RONALD K. Mining Imperfect Data: Dealing with Contamination and Incomplete Records. Philadelphia: SIAM 2005.
– PREPARATA SHAMOS. Computational Geometry: An Introduction, Springer-Verlag 1988
– KNORR, EDWIN M., & NG, RAYMOND T. Algorithms for Mining Distance-Based Outliers in Large Datasets.
City, NY, USA, 1998. – BREUNIG, MARKUS M., KRIEGEL, HANS-PETER, NG, RAYMOND T., & SANDER, JÖRG. LOF: Identifying Density-Based Local Outliers. Proc. of the 2000 ACM SIGMOD International Conference on Management
– SCHAFER, J. L., Analysis of Incomplete Multivariate Data, New York: Chapman and Hall, 1997
– LITTLE, R. J. A. and RUBIN, D. B., Statistical Analysis with Missing Data. New York: John Wiley & Sons, 1987.
– Mc KNIGHT, P. E., FIGUEREDO, A. J., SIDANI, S., Missing Data: A Gentle Introduction. Guilford Press, 2007.
– DEMPSTER, ARTHUR PENTLAND, LAIRD, NAN M., & RUBIN, DONALD B. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, 39, 1–38, 1977.