SLIDE 1
Imputation by Gaussian Copula Model with an Application to Incomplete Customer Satisfaction Data

Meelis Käärik, Ene Käärik, Institute of Mathematical Statistics, University of Tartu, Estonia

COMPSTAT 2010, Paris, France, August 24

SLIDE 2

OVERVIEW

  • 1. Motivating example
  • 2. Imputation. Basic definitions
  • 3. Framework
  • 4. Problem setting
  • 5. Copula. Gaussian copula approach
  • 6. Imputation algorithm
  • 7. Application to Incomplete Customer Satisfaction Data
  • 8. Summary. Remarks


SLIDE 4

  • 1. Motivating example

MOTIVATING EXAMPLE

Customer satisfaction survey:
  • Questionnaire: respondents (customers) give scores from least to most satisfied
  • Blocks of similar questions (correlated variables)
  • Each customer represents a company
  • Individual scores are important!
⇒ Finding reasonable substitutes for missing values is of high interest

SLIDE 5

  • 2. Imputation. Basic definitions

INCOMPLETE DATA

Consider correlated incomplete data

  • DEF. Imputation (filling in, substitution) is a strategy for completing missing values in data with plausible estimates. Little & Rubin (1987)
  • Imputation might seem like a minor technical step, but there are many situations where the non-response mechanism needs to be considered explicitly, since it is of scientific interest in itself.
  • It therefore makes sense to consider imputation of missing values separately from modelling the data.

SLIDE 6

  • 3. Framework

FRAMEWORK

Let Y = (Y_1, ..., Y_v) be a random vector with correlated components Y_j. Consider data on n subjects:

Y = (Y_1, ..., Y_v),   Y_j = (y_{1j}, ..., y_{nj})^T,   j = 1, ..., v

Ordered missingness: the columns of the data matrix are sorted from the column with the fewest missing values to the column with the most missing values.

Assume the first k (k ≥ 2) components are complete; then Y = (Y_c, Y_m), where
Y_c = (Y_1, ..., Y_k) – the complete data,
Y_m = (Y_{k+1}, ..., Y_v) – the incomplete data.

SLIDE 7

  • 3. Framework. Dependence

DEPENDENCE between variables

Consider Y_c = (Y_1, ..., Y_k) and Y_{k+1}.

Correlation matrix: R = (r_{ij}), r_{ij} = corr(Y_i, Y_j), i, j = 1, ..., k+1

Partition of the correlation matrix:

R = [ R_k   r ]
    [ r^T   1 ]

R_k – the correlation matrix of the complete part Y_c = (Y_1, ..., Y_k)
r = (r_{1,k+1}, ..., r_{k,k+1})^T – the vector of correlations between Y_c and Y_{k+1}.

SLIDE 8

  • 4. Problem setting

PROBLEM SETTING

We use the idea of imputing a missing value from the conditional distribution of the missing value given the observed values. The joint distribution may be unknown, but using a copula function it is possible to find approximate joint and conditional distributions.

  • H. Joe (2001): "... if there is no natural multivariate family with a given parametric family for the univariate margins, a common approach has been through copulas."

SLIDE 9

  • 5. Copula. Basic definitions

COPULA

In 1959 Sklar introduced a new class of functions which he called copulas.

Sklar: if Q is a bivariate distribution function with margins F(x) and G(y), then there exists a copula C such that Q(x, y) = C(F(x), G(y)).
⇒ a copula links a joint distribution function to its one-dimensional marginals.

DEF. A copula is a function C : [0, 1]² → [0, 1] which satisfies:
  • for every u, v in [0, 1]: C(u, 0) = 0 = C(0, v), and C(u, 1) = u, C(1, v) = v;
  • for every u1, u2, v1, v2 in [0, 1] such that u1 ≤ u2 and v1 ≤ v2:
    C(u2, v2) − C(u2, v1) − C(u1, v2) + C(u1, v1) ≥ 0.

Example: the product copula Π(u, v) = uv characterizes independent random variables when the distribution functions are continuous.

SLIDE 10

  • 5. Copula. Gaussian copula approach

GAUSSIAN COPULA APPROACH (1)

DEFINITION: Let R be a symmetric, positive definite matrix with diag(R) = (1, 1, ..., 1)^T and let Φ_{k+1} be the (k+1)-variate standard normal distribution function with correlation matrix R. The multivariate GAUSSIAN COPULA is defined as

C(u_1, ..., u_{k+1}; R) = Φ_{k+1}(Φ^{-1}(u_1), ..., Φ^{-1}(u_{k+1}); R),   u_j ∈ (0, 1), j = 1, ..., k+1

Joint multivariate distribution function:

F_Y(y_1, ..., y_{k+1}; R) = C[F_1(y_1), ..., F_{k+1}(y_{k+1}); R] = Φ_{k+1}[Φ^{-1}(F_1(y_1)), ..., Φ^{-1}(F_{k+1}(y_{k+1})); R]
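This construction can be seen directly in simulation: draw correlated standard normals, push them through Φ to obtain uniforms carrying the Gaussian-copula dependence, and attach arbitrary margins via their quantile functions. A minimal sketch (the correlation matrix and the exponential margins are illustrative, not from the talk):

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(0)
phi = NormalDist()  # standard normal: cdf and inv_cdf

# Illustrative correlation matrix R (k + 1 = 3 variables)
R = np.array([[1.0, 0.7, 0.5],
              [0.7, 1.0, 0.7],
              [0.5, 0.7, 1.0]])

# Correlated standard normals via the Cholesky factor of R
L = np.linalg.cholesky(R)
z = rng.standard_normal((10_000, 3)) @ L.T

# u_j = Phi(z_j): uniform margins, Gaussian-copula dependence
u = np.vectorize(phi.cdf)(z)

# Attach arbitrary inverse marginals F_j^{-1}, e.g. exponential(1)
y = -np.log1p(-u)  # same copula, exponential margins
```

Each column of u is uniform on (0, 1), while the joint dependence between columns is exactly C(·; R); y keeps that copula but has exponential margins.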


SLIDE 12

  • 5. Copula. Gaussian copula approach

GAUSSIAN COPULA APPROACH (2)

Conditional probability density function (see Käärik and Käärik (2009)):

f_{Z_{k+1}|Z_1,...,Z_k}(z_{k+1} | z_1, ..., z_k; R)
  = exp{−(z_{k+1} − r^T R_k^{-1} z_k)² / [2(1 − r^T R_k^{-1} r)]} / √[2π(1 − r^T R_k^{-1} r)]   (1)

Z_j = Φ^{-1}[F_j(Y_j)], j = 1, ..., k+1 – standard normal r.v.-s obtained from Y_j
z_k = (z_1, ..., z_k)^T

As a result we have the (conditional) probability density function of a normal random variable with expectation r^T R_k^{-1} z_k and variance 1 − r^T R_k^{-1} r:

E(Z_{k+1} | Z_1 = z_1, ..., Z_k = z_k) = r^T R_k^{-1} z_k,   (2)

Var(Z_{k+1} | Z_1 = z_1, ..., Z_k = z_k) = 1 − r^T R_k^{-1} r.   (3)
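Since (1) is the density of a univariate normal, the moments (2) and (3) are all one needs in practice, and they amount to a few lines of linear algebra. A minimal sketch with an illustrative correlation matrix (k = 2 complete variables, one incomplete):

```python
import numpy as np

# Illustrative correlation matrix for (Z1, Z2, Z3); Z3 plays Z_{k+1}
R = np.array([[1.0, 0.6, 0.5],
              [0.6, 1.0, 0.7],
              [0.5, 0.7, 1.0]])
k = 2
Rk = R[:k, :k]              # correlation matrix of the complete part
r = R[:k, k]                # correlations between (Z1, ..., Zk) and Z_{k+1}

zk = np.array([1.2, -0.4])  # observed standard normal scores for one subject

w = np.linalg.solve(Rk, r)  # R_k^{-1} r
cond_mean = w @ zk          # formula (2): r^T R_k^{-1} z_k
cond_var = 1.0 - r @ w      # formula (3): 1 - r^T R_k^{-1} r
```

Here R_k^{-1} r = (0.125, 0.625), so the conditional mean is −0.1 and the conditional variance 0.5.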


SLIDE 14

  • 6. Imputation algorithm

IMPUTATION FORMULA

The formula (2) leads us to the general formula for replacing the missing value z_{k+1} by the estimate ẑ_{k+1} using conditional mean imputation:

ẑ_{k+1} = r^T R_k^{-1} z_k   (4)

r – the vector of correlations between (Z_1, ..., Z_k) and Z_{k+1}
R_k^{-1} – the inverse of the correlation matrix of (Z_1, ..., Z_k)
z_k = (z_1, ..., z_k)^T – the vector of complete observations for the subject with missing value z_{k+1}.

From expression (3) we obtain the (conditional) variance of the imputed value:

(σ̂_{k+1})² = 1 − r^T R_k^{-1} r   (5)

These results for dropouts are proved in Käärik and Käärik (2009).


SLIDE 16

  • 6. Imputation algorithm

DEPENDENCE STRUCTURES

Start from a simple correlation structure, depending on one parameter only.

(1) Compound symmetry (CS), or constant correlation structure: the correlations between all measurements are equal, r_{ij} = ρ, i, j = 1, ..., m, i ≠ j.

(2) First order autoregressive (AR) correlation structure: observations on the same subject that are closer together are more highly correlated than measurements further apart, r_{ij} = ρ^{|j−i|}, i, j = 1, ..., m, i ≠ j.

The imputation strategy in the case of a CS correlation structure is studied in detail in Käärik and Käärik (2009). For ordered missing data with a CS correlation structure we had the following imputation formula:

ẑ^{CS}_{k+1} = [ρ / (1 + (k − 1)ρ)] Σ_{j=1}^{k} z_j,   (6)

z_1, ..., z_k – the observed values for the subject with missing value z_{k+1}.
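Formula (6) can be cross-checked against the general formula (4): for a CS matrix R_k = (1 − ρ)I + ρ11^T and r = ρ1, every component of the weight vector r^T R_k^{-1} equals ρ/(1 + (k − 1)ρ). A small numerical sketch (ρ, k and the scores are illustrative):

```python
import numpy as np

rho, k = 0.6, 4
Rk = (1 - rho) * np.eye(k) + rho * np.ones((k, k))  # CS correlation matrix
r = np.full(k, rho)                                 # corr(Z_j, Z_{k+1}) = rho

zk = np.array([0.3, -1.1, 0.8, 0.2])                # observed z-scores

general = r @ np.linalg.solve(Rk, zk)               # general formula (4)
cs_weight = rho / (1 + (k - 1) * rho)
closed_form = cs_weight * zk.sum()                  # closed form (6)
```

Both routes give the same imputed value, which is why the CS case needs only the single parameter ρ and a sum of the observed scores.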

SLIDE 17

  • 6. Imputation algorithm

AR STRUCTURE

Lemma 1. Let Z = (Z_1, ..., Z_{k+1}) be a random vector with standard normal components and let the corresponding correlation matrix have AR correlation structure with correlation coefficient ρ. Then the following assertions hold:

E(Z_{k+1} | Z_1 = z_1, ..., Z_k = z_k) = E(Z_{k+1} | Z_k = z_k) = ρ z_k,   (7)

Var(Z_{k+1} | Z_1 = z_1, ..., Z_k = z_k) = 1 − ρ².   (8)

By Lemma 1, the conditional mean imputation formula for standardized measurements with an AR structure has the simple form

ẑ^{AR}_{k+1} = ρ z_k,   (9)

z_k – the last observed value for the subject. The corresponding variance is

(σ̂^{AR}_{k+1})² = 1 − ρ².   (10)
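Lemma 1 reflects the Markov property of the AR structure: in the general weight vector r^T R_k^{-1} every coordinate vanishes except the last, which equals ρ. A numerical sketch confirming this (ρ = 0.784 is borrowed from the case study later in the talk, purely as an illustration):

```python
import numpy as np

rho, k = 0.784, 4
idx = np.arange(k + 1)
R = rho ** np.abs(idx[:, None] - idx[None, :])  # AR matrix: r_ij = rho^|j-i|

Rk = R[:k, :k]
r = R[:k, k]                      # (rho^k, ..., rho^2, rho)

weights = np.linalg.solve(Rk, r)  # R_k^{-1} r: should be (0, ..., 0, rho)
cond_var = 1 - r @ weights        # should equal 1 - rho^2, as in (8)
```

So the conditional mean depends on the last observed value only, exactly as formulas (7) and (9) state.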


SLIDE 23

  • 6. Imputation algorithm

IMPUTATION ALGORITHM

Step 1. Sort the columns of the data matrix to get ordered missing data; fix Y_{k+1} (the column with the least number of missing values) as the starting point for imputation.

Step 2. Estimate the marginal distribution functions of Y_1, ..., Y_k, Y_{k+1}.

Step 3. Estimate the correlation structure between the variables Y_1, ..., Y_k, Y_{k+1}. If the hypothesis of compound symmetry or autoregressive structure can be accepted, estimate Spearman's correlation coefficient ρ. If there is no simple correlation structure, estimate R by the empirical correlation matrix.

Step 4. In the case of a CS correlation structure, use imputation formula (6). In the case of an AR correlation structure, use imputation formula (9) and estimate the variance of the imputed value by formula (10). If there is no simple correlation structure, use the general formulas (4) and (5).

Step 5. Repeat Step 4 until all missing values in column Y_{k+1} are imputed. If k < m − 1, set k = k + 1, take a new Y_{k+1}, estimate the marginal distribution of Y_{k+1} and go to Step 3. In the following steps the imputed values are treated as if they were observed.
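The steps above can be sketched end-to-end for a single incomplete column. This is our illustration, not the authors' code: Step 2's marginal estimation is replaced by rank-based normal scores, Step 3 uses the empirical correlation matrix, Step 4 applies the general formulas (4)–(5), and the back-transform reuses the observed mean and standard deviation of the target column in the spirit of formula (11):

```python
import numpy as np
from statistics import NormalDist

phi = NormalDist()

def normal_scores(col):
    """Rank-based normal scores Phi^{-1}(rank/(n+1)) -- a simple
    stand-in for Step 2's marginal distribution estimation."""
    n = len(col)
    ranks = col.argsort().argsort() + 1.0
    return np.array([phi.inv_cdf(u) for u in ranks / (n + 1.0)])

def impute_column(Yc, y):
    """Fill NaNs in column y from complete columns Yc (n x k) using the
    conditional mean formula (4); returns the filled column and the
    imputation variance (5) on the z-scale."""
    Zc = np.column_stack([normal_scores(Yc[:, j]) for j in range(Yc.shape[1])])
    obs = ~np.isnan(y)

    zy = np.full(len(y), np.nan)
    zy[obs] = normal_scores(y[obs])

    # Step 3: empirical correlation matrix of (Z_1, ..., Z_k, Z_{k+1})
    R = np.corrcoef(np.column_stack([Zc[obs], zy[obs]]), rowvar=False)
    Rk, r = R[:-1, :-1], R[:-1, -1]
    w = np.linalg.solve(Rk, r)              # R_k^{-1} r

    # Step 4: impute on the z-scale, then map back to the y-scale
    m, s = y[obs].mean(), y[obs].std(ddof=1)
    y_out = y.copy()
    y_out[~obs] = m + s * (Zc[~obs] @ w)
    return y_out, 1.0 - r @ w
```

With several incomplete columns, Step 5 would simply call `impute_column` repeatedly, each time treating the freshly imputed column as observed.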

SLIDE 24

  • 7. Application

CASE STUDY: INCOMPLETE CS DATA

Questionnaire where the respondents (customers) are requested to give scores (in our example on a scale from 0 to 10, from least to most satisfied).

We focus on a group of five questions (from 20 customers) directly related to customer satisfaction. We have complete data; we delete the values of one variable step by step and analyze the reliability of the proposed method.


SLIDE 26

  • 7. Application

IMPUTATION (1)

The imputation study has the following general steps:

  • 1. Estimation of marginal distributions.

Kolmogorov-Smirnov and Anderson-Darling tests for normality did not reject the normality assumption.

  • 2. Estimation of the correlation structure.

Calculation of the 'working' correlation matrix gave Spearman's ρ̂ = 0.784 as an estimate of the parameter of the AR structure.

SLIDE 27

  • 7. Application

IMPUTATION (2)

  • 3. Estimation of the missing values.

To validate the imputation algorithm we repeat the imputation procedure for every value in the data column Y_5. Modified formulas (for non-standard normal variables, instead of (9) and (10)):

ẑ^{AR}_{k+1} = ρ (s_{k+1}/s_k)(z_k − Z̄_k) + Z̄_{k+1},   (11)

Z̄_k, Z̄_{k+1} – the mean values of data columns Z_k and Z_{k+1}, respectively
s_k, s_{k+1} – the corresponding standard deviations

(σ̂^{AR}_{k+1})² = s²_{k+1}(1 − ρ²).   (12)
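Formulas (11) and (12) simply rescale the standardized AR imputation back to the scale of the observed columns, which also yields a normal-approximation confidence interval. A small sketch (all numbers are illustrative, not the case-study data, apart from ρ = 0.784):

```python
import math

def impute_ar(z_k, rho, mean_k, sd_k, mean_k1, sd_k1):
    """Conditional mean imputation (11) and its variance (12)
    for non-standard normal margins under an AR structure."""
    z_hat = rho * (sd_k1 / sd_k) * (z_k - mean_k) + mean_k1  # formula (11)
    var_hat = sd_k1 ** 2 * (1 - rho ** 2)                    # formula (12)
    return z_hat, var_hat

# Illustrative call: last observed score 7, hypothetical column statistics
z_hat, var_hat = impute_ar(z_k=7.0, rho=0.784,
                           mean_k=6.9, sd_k=1.8, mean_k1=7.4, sd_k1=1.9)
half = 1.96 * math.sqrt(var_hat)   # half-width of the 0.95 CI
ci = (z_hat - half, z_hat + half)
```

The interval `ci` is the kind of 0.95 confidence interval reported in the results table.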

SLIDE 28

  • 7. Application

QUALITY OF IMPUTATION

L1 error (mean absolute distance between the observed and imputed values): e1 = 0.641
L2 error (root mean square distance): e2 = 0.744
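Both error measures can be reproduced from the observed and imputed values in the case-study results table; the small differences come from the rounding of the displayed imputed values:

```python
import math

# Observed y5 and imputed z5 values from the case-study results table
y5   = [6, 8, 9, 6, 9, 10, 10, 10, 9, 9, 8, 4, 7, 5, 10, 8, 7, 8, 7, 9]
zhat = [6.77, 8.52, 8.46, 5.85, 8.46, 9.30, 9.30, 9.30, 8.46, 8.46,
        7.57, 5.43, 6.68, 6.89, 9.30, 8.52, 6.68, 8.52, 7.62, 9.40]

n = len(y5)
e1 = sum(abs(a - b) for a, b in zip(y5, zhat)) / n                # L1 error
e2 = math.sqrt(sum((a - b) ** 2 for a, b in zip(y5, zhat)) / n)   # L2 error
```

The rounded table values give e1 ≈ 0.643 and e2 ≈ 0.745, matching the reported 0.641 and 0.744 up to rounding.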

SLIDE 29

  • 7. Application

RESULTS

  • 4. Estimation of the variance of imputed values.

No  y5  ẑ5^AR  0.95 CI          No  y5  ẑ5^AR  0.95 CI
 1   6   6.77  (5.12; 8.41)     11   8   7.57  (5.89; 9.24)
 2   8   8.52  (6.84; 10.19)    12   4   5.43  (3.89; 6.97)
 3   9   8.46  (6.79; 10.13)    13   7   6.68  (5.01; 8.35)
 4   6   5.85  (4.21; 7.50)     14   5   6.89  (5.28; 8.49)
 5   9   8.46  (6.79; 10.13)    15  10   9.30  (7.66; 10.95)
 6  10   9.30  (7.66; 10.95)    16   8   8.52  (6.84; 10.19)
 7  10   9.30  (7.66; 10.95)    17   7   6.68  (5.01; 8.35)
 8  10   9.30  (7.66; 10.95)    18   8   8.52  (6.84; 10.19)
 9   9   8.46  (6.79; 10.13)    19   7   7.62  (5.95; 9.29)
10   9   8.46  (6.79; 10.13)    20   9   9.40  (7.73; 11.07)

y5 – the observed value, ẑ5^AR – the corresponding imputed value
0.95 CI – 0.95-level confidence interval based on the normal approximation

SLIDE 30

  • 8. Summary. Remarks

SUMMARY

It is important to remember that imputation methodology does not give us qualitatively new information, but it enables us to use all available information about the data with maximal efficiency.

In general, most missing data handling methods deal with incomplete data primarily from the perspective of estimating parameters and computing test statistics rather than predicting values for specific cases. We, on the other hand, are interested in small samples where every value is essential and the imputation results are of scientific interest in themselves.

The results of this study indicate that, in the empirical context of the current study, the algorithm performs well for modeling missing values in correlated data. As importantly, the following advantages can be pointed out:
(1) The marginals of the variables do not have to be normal; they can even be different.
(2) The simplicity of formulas (9)–(12).

SLIDE 31

  • 8. Summary. Remarks

REMARKS

The class of copulas is wide and growing; the copula approach used here can be extended to other copulas. Choosing a copula to fit given data is an important but difficult problem, and in some cases analytical solutions are not available (the copula density might not exist). These problems merit further research.

Acknowledgements. This work is supported by Estonian Science Foundation grants No 7313 and No 8294.

SLIDE 32

References

  • 1. Clemen, R.T., Reilly, T. (1999). Correlations and Copulas for Decision and Risk Analysis. Management Science, 45(2), 208–224.
  • 2. Käärik, E. (2007). Handling dropouts in repeated measurements using copulas. Diss. Math. Universitas Tartuensis, 51, Tartu, UT Press.
  • 3. Käärik, E., Käärik, M. (2009). Modelling Dropouts by Conditional Distribution, a Copula-Based Approach. Journal of Statistical Planning and Inference, 139(11), 3830–3835.
  • 4. Little, R.J.A., Rubin, D.B. (1987). Statistical Analysis with Missing Data. New York: Wiley.
  • 5. Nelsen, R.B. (2006). An Introduction to Copulas. 2nd edition. Springer, New York.
  • 6. Song, P.X.K. (2007). Correlated Data Analysis: Modeling, Analytics, and Applications. Springer, New York.
  • 7. Song, P.X.K., Li, M., Yuan, Y. (2009). Joint Regression Analysis of Correlated Data Using Gaussian Copulas. Biometrics, 64(2), 60–68.

SLIDE 33

THANK YOU!
