 
Confounder Adjustment in Multiple Hypothesis Testing
Qingyuan Zhao
Department of Statistics, Stanford University
January 28, 2016
Slides are available at http://web.stanford.edu/~qyzhao/.

Collaborators
Jingshu Wang, Trevor Hastie, Art Owen

Outline: Introduction; Background; Motivating Examples; Previous Work; Model and Inference; Model and Identifiability; Estimation; Hypothesis Tests; Numerical Examples; Summary

Microarray experiments
- Responses: normalized gene expression level.
- Primary variables (variables of interest): treatment, disease status, etc.
- Control covariates: age, gender, batch, date, etc.

Microarray data analysis
Biologist: “Which genes are (causally) related to this disease?”
Statistician: “Let me run some analysis.”
Two common practices:
1. Sparse regression: regress the primary variable on the genes. More common for SNP data and predictive tasks.
2. Association tests/screening (this talk): for each gene, perform a significance test of correlation with the primary variable.
Statistician: “Here is a short list of candidate genes with false discovery rate (FDR) ≤ 20%.”
Biologist: “Good, let me validate these discoveries.”

Concerns
J. P. Ioannidis. Why most published research findings are false. Chance, 18(4):40–47, 2005.
Two major challenges to reproducibility in genetic screening:
1. Correlated tests: Is the FDR still controlled? If not, can we correct the statistical analysis?
   Well studied in the last 15 years [Benjamini and Yekutieli, 2001, Storey et al., 2004, Efron, 2007, Fan et al., 2012].
2. Confounded tests (this talk): the individual association tests are biased in the presence of unobserved confounders. Can we still provide a good candidate list?
   Equally long history [e.g. Alter et al., 2000, Price et al., 2006]. Still many open questions.

Confounding
Brief history:
- Fisher [1935] first uses the term in experiment designs.
- Kish [1959] first uses its modern meaning: a mixing of effects of unobserved extraneous factors (called confounders) with the effect of interest.
- Huge literature, but mostly in causal inference.
Aliases for confounders in genetic screening:
- “systematic ancestry differences” [Price et al., 2006].
- “batch effects” (widely used by biologists).
- “surrogate variables” [Leek and Storey, 2007, 2008].
- “unwanted variation” [Gagnon-Bartsch and Speed, 2012].
- “latent effects” [Sun et al., 2012].

Example 1: gender study
Which genes are more expressed in males or females?
A microarray experiment by Vawter et al. [2004]:
- Postmortem samples from the brains of 10 individuals.
- For each individual, 3 samples from different cortices.
- Each sample is sent to 3 different labs for analysis.
- Two different microarray platforms are used by the labs.
- In total, 10 × 3 × 3 = 90 samples.
This example was first used by Gagnon-Bartsch and Speed [2012] to demonstrate the importance of removing “unwanted variation”.

Screening
Notation:
- $Y$: $n \times p$ matrix of gene expression.
- $X$: $n \times 1$ vector of gender.
Simplest association test:
- Regress each column of $Y$ (gene) on $X$.
- In R, run summary(lm(Y ~ X)).
- Equivalent to a two-sample t-test with equal variance.

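The equivalence claimed on this slide can be checked numerically. The following is a minimal NumPy sketch, not the analysis from the talk: the data are simulated, and the function names (`simple_t_stats`, `pooled_two_sample_t`) are hypothetical. It computes the slope t-statistic for every gene at once and compares it with the pooled (equal-variance) two-sample t-statistic.

```python
import numpy as np

def simple_t_stats(Y, x):
    """t-statistic of the slope when each column of Y is regressed on x (with intercept)."""
    n = Y.shape[0]
    xc = x - x.mean()                       # centered predictor
    Yc = Y - Y.mean(axis=0)                 # centered responses
    sxx = xc @ xc
    beta = xc @ Yc / sxx                    # one slope per gene, shape (p,)
    resid = Yc - np.outer(xc, beta)         # residuals after intercept and slope
    sigma2 = (resid ** 2).sum(axis=0) / (n - 2)
    return beta / np.sqrt(sigma2 / sxx)

def pooled_two_sample_t(Y, x):
    """Equal-variance two-sample t-statistic (group 1 minus group 0) per column."""
    a, b = Y[x == 1], Y[x == 0]
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * a.var(axis=0, ddof=1)
           + (nb - 1) * b.var(axis=0, ddof=1)) / (na + nb - 2)
    return (a.mean(axis=0) - b.mean(axis=0)) / np.sqrt(sp2 * (1 / na + 1 / nb))

# Simulated data (an assumption for illustration, not the Vawter et al. data).
rng = np.random.default_rng(0)
n, p = 90, 200
x = np.tile([0.0, 1.0], n // 2)             # binary group label
Y = rng.normal(size=(n, p))

t_reg = simple_t_stats(Y, x)
t_two = pooled_two_sample_t(Y, x)
print(np.allclose(t_reg, t_two))
```

The two vectors agree up to floating-point error because, for a binary predictor, the regression slope is the difference in group means and the residual variance is the pooled variance.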
Histogram of t-statistics
[Figure: histogram of the t-statistics (x-axis: t-statistics, from −1.0 to 1.0; y-axis: density), annotated with N(0.055, 0.066^2).]
Skewed and very underdispersed.

What happened?
[Figure: scatter plot of the samples on the first two principal components (PC1 vs. PC2), with a legend for lab (1, 2, 3) and platform (0, 1).]

Association test
Notation:
- $Y$: $n \times p$ matrix of gene expression.
- $X$: $n \times 1$ vector of gender.
- $Z$: $n \times d$ matrix of control covariates (lab and platform).
Modified association test:
- Regress each column of $Y$ (gene) on $X$ and $Z$.
- In R, run summary(lm(Y ~ X + Z)).
- Report the significance of the coefficients of $X$.

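To illustrate why adding $Z$ matters, here is a small NumPy sketch under simulated, hypothetical conditions: no gene is related to $X$, but an observed covariate that is correlated with $X$ shifts every gene. The helper name `t_stats_for_x` and all simulation settings are assumptions made for this example.

```python
import numpy as np

def t_stats_for_x(Y, x, Z=None):
    """t-statistics for the coefficient of x when each column of Y is
    regressed on an intercept, x, and (optionally) the columns of Z."""
    n = Y.shape[0]
    parts = [np.ones(n), x] if Z is None else [np.ones(n), x, Z]
    D = np.column_stack(parts)                      # n x (2 + d) design matrix
    coef, *_ = np.linalg.lstsq(D, Y, rcond=None)    # (2 + d) x p coefficients
    resid = Y - D @ coef
    sigma2 = (resid ** 2).sum(axis=0) / (n - D.shape[1])
    var_x = np.linalg.inv(D.T @ D)[1, 1]            # unscaled variance of the x coefficient
    return coef[1] / np.sqrt(sigma2 * var_x)

# Simulated data (hypothetical): beta = 0 for every gene, but a covariate z
# (say, a lab indicator) agrees with x on 60 of 90 samples and shifts all genes.
rng = np.random.default_rng(1)
n, p = 90, 500
x = np.tile([0.0, 1.0], n // 2)
z = np.where(np.arange(n) < 60, x, 1.0 - x)
Y = np.outer(z, np.ones(p)) + rng.normal(size=(n, p))

t_unadjusted = t_stats_for_x(Y, x)       # regress on x only
t_adjusted = t_stats_for_x(Y, x, z)      # regress on x and z
print(t_unadjusted.mean(), t_adjusted.mean())
```

In this setup the unadjusted t-statistics are shifted away from zero even though no gene is associated with $X$, while the adjusted ones are centered near zero. This only helps when the confounder is actually observed.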
Histogram of t-statistics
[Figure: histogram of the t-statistics after adjusting for lab and platform (x-axis: t-statistics, from −1.0 to 1.0; y-axis: density), annotated with N(0.043, 0.24^2).]
Better, but still problematic.
Reasonable guess: there are more unobserved confounders!

Example 2: COPD study
COPD = chronic obstructive pulmonary disease.
Singh et al. [2011] tried to find genes associated with the severity of COPD (moderate or severe).
[Figure: histogram of the t-statistics (x-axis: t-statistics, from −5 to 5; y-axis: density), annotated with N(0.024, 2.6^2).]
Overdispersed and skewed.

Example 3: Mutual fund selection
Barras et al. [2010] used the following model to select mutual funds:
$$Y_{it} = \alpha_i + \gamma_i^T Z_t + e_{it}, \quad i = 1, \dots, n, \quad t = 1, \dots, p.$$
- $Y_{it}$: observed log-return of fund $i$ at time $t$.
- $\alpha_i$: risk-adjusted return (goal: find funds with positive $\alpha$).
- $Z_t$: systematic risk factors.
They assumed:
- $\alpha$ is sparse (Berk and Green equilibrium);
- No unobserved risk factors (is that possible/necessary?).

Idea 0: Remove the largest principal component(s)
EIGENSTRAT [Price et al., 2006]
Regression model:
$$Y_{n \times p} = X_{n \times 1} \beta_{p \times 1}^T + Z_{n \times r} \Gamma_{p \times r}^T + E_{n \times p},$$
where $Z$ is the first $r$ PC(s) of $Y$.
- Motivation: in SNP data, the largest PC(s) usually correspond to ancestry differences.
- Weakness: can easily remove true signals.

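The PC-removal idea on this slide can be sketched in a few lines of NumPy. This is a simplified illustration on simulated data, not the EIGENSTRAT software; the helper names (`top_pc_scores`, `t_stats_for_x`) and the simulation settings are assumptions. It takes the first $r$ principal component scores of $Y$ as $Z$ and then regresses each gene on $X$ and $Z$.

```python
import numpy as np

def top_pc_scores(Y, r):
    """Scores of the first r principal components of the column-centered Y (n x r)."""
    Yc = Y - Y.mean(axis=0)
    U, s, _ = np.linalg.svd(Yc, full_matrices=False)
    return U[:, :r] * s[:r]

def t_stats_for_x(Y, x, Z):
    """t-statistics for x when each column of Y is regressed on intercept, x, and Z."""
    n = Y.shape[0]
    D = np.column_stack([np.ones(n), x, Z])
    coef, *_ = np.linalg.lstsq(D, Y, rcond=None)
    resid = Y - D @ coef
    sigma2 = (resid ** 2).sum(axis=0) / (n - D.shape[1])
    var_x = np.linalg.inv(D.T @ D)[1, 1]
    return coef[1] / np.sqrt(sigma2 * var_x)

# Simulated data (hypothetical): one unobserved factor z with loadings gamma, beta = 0.
rng = np.random.default_rng(2)
n, p, r = 90, 500, 1
x = np.tile([0.0, 1.0], n // 2)
z = rng.normal(size=n)                    # unobserved confounder
gamma = rng.normal(size=p)                # its effect on each gene
Y = np.outer(z, gamma) + rng.normal(size=(n, p))

Z_hat = top_pc_scores(Y, r)               # estimated confounder
t_adj = t_stats_for_x(Y, x, Z_hat)
corr = np.corrcoef(Z_hat[:, 0], z)[0, 1]  # sign of a PC is arbitrary
print(abs(corr))
```

Here the first PC recovers the simulated factor almost perfectly because $\beta = 0$. When $X\beta^T$ itself carries a large share of the variance, the leading PCs can absorb it, which is the weakness named on the slide.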
Idea 1: Use control genes
Same regression model:
$$Y_{n \times p} = X_{n \times 1} \beta_{p \times 1}^T + Z_{n \times r} \Gamma_{p \times r}^T + E_{n \times p},$$
RUV2 [Gagnon-Bartsch and Speed, 2012]
If we know $\beta_C = 0$ (negative controls),
1. Run PCA on $\mathrm{col}_C(Y)$ to obtain $Z$.
2. Run the regression for $\mathrm{col}_{-C}(Y)$.
Example: bacterial RNAs (spike-in controls).
Limited by the availability and number of negative controls.

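The two steps above can be sketched as follows. This is a simplified illustration of the idea on simulated data, not the published RUV2 implementation. The simulation settings, the choice of the first 100 genes as negative controls, and the helper name `t_stats_for_x` are all assumptions.

```python
import numpy as np

def t_stats_for_x(Y, x, Z=None):
    """t-statistics for x when each column of Y is regressed on intercept, x, and Z."""
    n = Y.shape[0]
    parts = [np.ones(n), x] if Z is None else [np.ones(n), x, Z]
    D = np.column_stack(parts)
    coef, *_ = np.linalg.lstsq(D, Y, rcond=None)
    resid = Y - D @ coef
    sigma2 = (resid ** 2).sum(axis=0) / (n - D.shape[1])
    var_x = np.linalg.inv(D.T @ D)[1, 1]
    return coef[1] / np.sqrt(sigma2 * var_x)

# Simulated data (hypothetical): an unobserved factor z correlated with x.
rng = np.random.default_rng(3)
n, p, r = 90, 500, 1
x = np.tile([0.0, 1.0], n // 2)
z = x + 0.5 * rng.normal(size=n)          # unobserved confounder
gamma = rng.normal(size=p)
beta = np.zeros(p)
beta[100:120] = 1.0                       # a few true effects among non-control genes
Y = np.outer(x, beta) + np.outer(z, gamma) + rng.normal(size=(n, p))

controls = np.arange(100)                 # genes assumed to have beta = 0
rest = np.arange(100, p)

# Step 1: PCA on the control columns only, to estimate Z.
Yc = Y[:, controls] - Y[:, controls].mean(axis=0)
U, s, _ = np.linalg.svd(Yc, full_matrices=False)
Z_hat = U[:, :r] * s[:r]

# Step 2: regress the remaining columns on x and the estimated Z.
t_unadj = t_stats_for_x(Y[:, rest], x)
t_adj = t_stats_for_x(Y[:, rest], x, Z_hat)

null = beta[rest] == 0
print(t_unadj[null].std(), t_adj[null].std())
```

In this simulation the unadjusted null t-statistics are strongly overdispersed, while the adjusted ones have a spread close to 1. The quality of the adjustment depends on how many control genes are available and on whether they are truly unaffected by $X$.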
Idea 2: Sparsity
Same regression model:
$$Y_{n \times p} = X_{n \times 1} \beta_{p \times 1}^T + Z_{n \times r} \Gamma_{p \times r}^T + E_{n \times p},$$
Idea: If $\beta$ contains actual effects, it should be a sparse vector.
SVA [Leek and Storey, 2008]
Iterate between
1. Weighted PCA on $Y$ (based on how likely $\beta = 0$).
2. Regress $Y$ on $X$ and the estimated PCs.
Does not always converge.

Idea 2: Sparsity
Same regression model:
$$Y_{n \times p} = X_{n \times 1} \beta_{p \times 1}^T + Z_{n \times r} \Gamma_{p \times r}^T + E_{n \times p},$$
Idea: If $\beta$ contains actual effects, it should be a sparse vector.
LEAPP [Sun, Zhang, and Owen, 2012]
1. Run PCA on the residuals of $Y \sim X$.
2. Run a sparse regression.