Confounder adjustment in large-scale linear structural models

Confounder adjustment in large-scale linear structural models

Qingyuan Zhao
Department of Statistics, The Wharton School, University of Pennsylvania
June 19, 2018, EcoStat

Based on
◮ Wang, J., Zhao, Q., Hastie, T., & Owen, A. B. (2017). Confounder adjustment in multiple hypothesis testing. Annals of Statistics, 45(5), 1863-1894.
◮ Song, Y., & Zhao, Q. Performance evaluation in presence of latent factors. (In preparation.)

Slides are available at http://www-stat.wharton.upenn.edu/~qyzhao/.

Setting

Multivariate linear regression

$$Y_{n \times p} = X_{n \times 1}\,\alpha_{p \times 1}^T + Z_{n \times d}\,\beta_{p \times d}^T + \epsilon_{n \times p}.$$

◮ $Y$: “Panel data” or “transposable data”. Modern datasets are often high dimensional (both $n, p \gg 1$).
◮ $X$: “Primary variable”, whose coefficients $\alpha$ are of interest.
◮ $Z$: “Control variables”, whose coefficients $\beta$ are not of interest (i.e. nuisance parameters).
◮ Noise $\epsilon \sim \mathrm{MN}(0, I_n, \Sigma)$, where $\Sigma = \mathrm{diag}(\sigma_1^2, \dots, \sigma_p^2)$.

Two examples
◮ Gene discovery: $Y$ is gene expression (row: tissue; column: gene), $X$ is the treatment.
◮ Mutual fund selection: $Y$ is the monthly return of mutual funds (row: month; column: fund), $X$ is the intercept, $Z$ includes systematic risk factors.

The confounding problem

$$Y_{n \times p} = X_{n \times 1}\,\alpha_{p \times 1}^T + Z_{n \times d}\,\beta_{p \times d}^T + \epsilon_{n \times p}.$$

Omitted variable bias
When not all $Z$ are known or measured, the OLS estimate of $\alpha$ can be severely biased. To see this, suppose

$$Z_{n \times d} = X_{n \times 1}\,\gamma_{d \times 1}^T + W_{n \times d}, \quad \text{where } W \perp\!\!\!\perp X.$$

Therefore

$$Y = X(\alpha + \beta\gamma)^T + W\beta^T + \epsilon,$$

and the OLS estimate of $\alpha$ indeed converges to $\alpha + \beta\gamma$.
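The bias formula above can be checked numerically. The following is a minimal simulation sketch, assuming a single outcome ($p = 1$) and a single confounder ($d = 1$); the parameter values, sample size, and function names are illustrative assumptions, not taken from the slides.

```python
import random


def ols_slope(x, y):
    """No-intercept OLS slope: (x'x)^{-1} x'y."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


def residualize(v, z):
    """Remove from v its no-intercept OLS projection on z."""
    b = ols_slope(z, v)
    return [vi - b * zi for vi, zi in zip(v, z)]


# Hypothetical parameter values chosen only for illustration.
alpha, beta, gamma = 1.0, 2.0, 0.5
n = 20000
rng = random.Random(0)

x = [rng.gauss(0, 1) for _ in range(n)]
w = [rng.gauss(0, 1) for _ in range(n)]           # W independent of X
z = [gamma * xi + wi for xi, wi in zip(x, w)]     # Z = X * gamma + W
y = [alpha * xi + beta * zi + rng.gauss(0, 1) for xi, zi in zip(x, z)]

# Omitting Z: the slope targets alpha + beta * gamma, not alpha.
naive = ols_slope(x, y)

# Observing Z: partial Z out of both X and Y (Frisch-Waugh), then regress.
adjusted = ols_slope(residualize(x, z), residualize(y, z))

print("naive estimate:   ", round(naive, 3))
print("adjusted estimate:", round(adjusted, 3))
print("alpha + beta*gamma:", alpha + beta * gamma)
```

With these values the naive estimate should land near $\alpha + \beta\gamma = 2$ rather than $\alpha = 1$, while the adjusted estimate should be close to $\alpha$.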

An illustrative example

The gender study [1]
Question: Which genes are more expressed in male/female?

A microarray experiment was conducted in this study:
◮ Postmortem samples from the brains of 10 individuals.
◮ For each individual, 3 samples from different cortices.
◮ Each sample is sent to 3 different labs for analysis.
◮ Two different microarray platforms are used by the labs.

In total, there are 10 × 3 × 3 = 90 samples.

This example was first used by Gagnon-Bartsch and Speed [2] to demonstrate the importance of “removing unwanted variation” (RUV).

[1] Vawter, M. P., et al. (2004). “Gender-specific gene expression in post-mortem human brain: localization to sex chromosomes.” Neuropsychopharmacology, 29(2).
[2] Gagnon-Bartsch, J. A., & Speed, T. P. (2012). “Using control genes to correct for unwanted variation in microarray data.” Biostatistics, 13(3).

A simple association test

◮ Regress each column of $Y$ (gene) on $X$.
◮ In R, run summary(lm(Y ~ X)).
◮ Equivalent to a two-sample t-test with equal variance.

[Figure: histogram of t-statistics (x-axis: t-statistics; y-axis: density), skewed and underdispersed; legend: N(0.055, 0.066^2).]
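The stated equivalence between the regression slope test and the equal-variance two-sample t-test can be verified on a small example. This is an illustrative sketch for one gene with a binary $X$; the data values and function names are made up for the demonstration and do not come from the study.

```python
import math


def regression_t(x, y):
    """t-statistic of the slope in a simple linear regression with intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    rss = sum((b - intercept - slope * a) ** 2 for a, b in zip(x, y))
    return slope / math.sqrt(rss / (n - 2) / sxx)


def pooled_t(y1, y0):
    """Two-sample t-statistic with pooled (equal) variance."""
    n1, n0 = len(y1), len(y0)
    m1, m0 = sum(y1) / n1, sum(y0) / n0
    ss = sum((v - m1) ** 2 for v in y1) + sum((v - m0) ** 2 for v in y0)
    sp2 = ss / (n1 + n0 - 2)
    return (m1 - m0) / math.sqrt(sp2 * (1 / n1 + 1 / n0))


# Hypothetical expression values for one gene; x is the group indicator.
x = [0, 0, 0, 0, 1, 1, 1, 1, 1]
y = [5.1, 4.8, 5.6, 5.0, 5.9, 6.3, 5.7, 6.1, 6.4]

t_reg = regression_t(x, y)
t_two = pooled_t([b for a, b in zip(x, y) if a == 1],
                 [b for a, b in zip(x, y) if a == 0])
print(t_reg, t_two)  # the two statistics agree up to rounding error
```

The agreement follows because, for a binary regressor, the slope equals the difference in group means and the residual sum of squares equals the pooled within-group sum of squares.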

What happened?

Plot of the largest principal components

[Figure: scatter plot of the samples on PC1 (x-axis) and PC2 (y-axis), with points marked by lab (1, 2, 3) and platform (0, 1).]

Our solution in a nutshell

Recall that (for simplicity, assume $Z$ is entirely unobserved)

$$Y_{n \times p} = X_{n \times 1}\,\alpha_{p \times 1}^T + Z_{n \times d}\,\beta_{p \times d}^T + \epsilon_{n \times p}, \qquad Z_{n \times d} = X_{n \times 1}\,\gamma_{d \times 1}^T + W_{n \times d}$$

$$\Downarrow$$

$$Y = X(\underbrace{\alpha + \beta\gamma}_{\tau})^T + W\beta^T + \epsilon.$$

Confounder adjusted testing and estimation (CATE)
1. OLS using the observed regressors: $\hat\tau^T = (X^T X)^{-1} X^T Y \approx (\alpha + \beta\gamma)^T$, $R = (I - P_X) Y \approx W\beta^T + \epsilon$.
2. Factor analysis of $R$ ⇒ loading matrix $\hat\beta$.
3. Path analysis: $\hat\tau_{p \times 1} \approx \alpha_{p \times 1} + \hat\beta_{p \times d}\,\gamma_{d \times 1}$.

Problem: the third step is not going to work because it has $(p + d)$ parameters but only $p$ equations, i.e. $\alpha$ is not identified.
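Steps 1 and 2 above can be sketched on simulated data. This is a rough illustration under strong assumptions: a single latent factor ($d = 1$), and the leading principal direction of $R$ (found by power iteration) used as a stand-in for a proper factor analysis. The function name, parameter values, and simulation design are hypothetical and are not the authors' implementation.

```python
import math
import random


def cate_steps_1_2(x, Y, n_iter=50):
    """Step 1: column-wise OLS of Y on x and residuals R = (I - P_x) Y.
    Step 2 (d = 1 only): leading right singular vector of R by power iteration."""
    n, p = len(Y), len(Y[0])
    xx = sum(xi * xi for xi in x)
    tau_hat = [sum(x[i] * Y[i][j] for i in range(n)) / xx for j in range(p)]
    R = [[Y[i][j] - x[i] * tau_hat[j] for j in range(p)] for i in range(n)]
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(n_iter):
        u = [sum(R[i][j] * v[j] for j in range(p)) for i in range(n)]   # u = R v
        v = [sum(R[i][j] * u[i] for i in range(n)) for j in range(p)]   # v = R' u
        norm = math.sqrt(sum(c * c for c in v))
        v = [c / norm for c in v]
    return tau_hat, v


# Simulated data with hypothetical parameter values.
rng = random.Random(0)
n, p, gamma = 200, 30, 0.8
alpha = [2.0 if j < 3 else 0.0 for j in range(p)]   # sparse primary effects
beta = [rng.gauss(0, 1) for _ in range(p)]          # factor loadings
x = [rng.gauss(0, 1) for _ in range(n)]
w = [rng.gauss(0, 1) for _ in range(n)]
z = [gamma * x[i] + w[i] for i in range(n)]         # unobserved confounder
Y = [[x[i] * alpha[j] + z[i] * beta[j] + rng.gauss(0, 0.5) for j in range(p)]
     for i in range(n)]

tau_hat, beta_dir = cate_steps_1_2(x, Y)

# The loadings are recovered only up to sign and scale, so compare directions.
cosine = abs(sum(a * b for a, b in zip(beta_dir, beta))) / math.sqrt(
    sum(b * b for b in beta))
print("alignment of estimated and true loading direction:", round(cosine, 3))
```

Only the direction of $\beta$ is recovered here, which matches the point that the column space of $\beta$ is what the residuals can reveal; $\hat\tau$ still mixes $\alpha$ with $\beta\gamma$.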

Identification

Path analysis equation:

$$\tau_{p \times 1} \approx \alpha_{p \times 1} + \beta_{p \times d}\,\gamma_{d \times 1}.$$

◮ $\tau$ and (the column space of) $\beta$ can be identified from data.
◮ $\alpha$ and $\gamma$ cannot be identified from data. In other words, different values of $(\alpha, \gamma)$ may correspond to the same distribution of the observed data.
◮ Solution to non-identifiability: impose additional restrictions.

Proposition
Suppose $\beta$ can be identified from the factor analysis. Then $\alpha$ is identifiable under either of the following two conditions:
1. Negative control: $\alpha_{\mathcal{C}} = 0$ for a known set $\mathcal{C}$ such that $|\mathcal{C}| \ge d$ and $\mathrm{rank}(\beta_{\mathcal{C}}) = d$.
2. Sparsity: $\|\alpha\|_0 \le \lfloor (p - d)/2 \rfloor$, and $\mathrm{rank}(\beta_{\mathcal{C}}) = d$ for all $\mathcal{C} \subset \{1, \dots, p\}$ such that $|\mathcal{C}| = d$.
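To see the non-identifiability concretely, and how a negative control set removes it, the following worked equations may help. They use only the symbols above; the shift $\delta$ is introduced here purely for illustration.

```latex
% For any shift \delta \in \mathbb{R}^d, define
\alpha' = \alpha - \beta\delta, \qquad \gamma' = \gamma + \delta .
% The observable quantity is unchanged:
\alpha' + \beta\gamma'
  = \alpha - \beta\delta + \beta\gamma + \beta\delta
  = \alpha + \beta\gamma
  = \tau .
% Under the negative control condition, \alpha_{\mathcal{C}} = 0, so
\tau_{\mathcal{C}} = \beta_{\mathcal{C}}\gamma
\;\Longrightarrow\;
\gamma = (\beta_{\mathcal{C}}^T \beta_{\mathcal{C}})^{-1} \beta_{\mathcal{C}}^T \tau_{\mathcal{C}},
\qquad
\alpha = \tau - \beta\gamma ,
% where the inverse exists because rank(\beta_{\mathcal{C}}) = d.
```

The first display shows that $(\alpha, \gamma)$ and $(\alpha', \gamma')$ produce the same $\tau$; the second shows that fixing $\alpha_{\mathcal{C}} = 0$ pins down $\gamma$ and hence $\alpha$.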

Estimation under sparsity

Is sparsity reasonable? Not always, but acceptable in our examples:
◮ In genomics screening, most genes are probably unrelated.
◮ Most mutual funds likely have no “alpha” (otherwise they will be quickly identified by the investors) [3].

Estimation via robust regression in CATE
Using a robust loss function $\rho(\cdot)$ (such as Huber’s), solve

$$\hat\gamma = \arg\min_{\gamma} \sum_{j=1}^{p} \rho\!\left(\frac{\hat\tau_j - \hat\beta_j^T \gamma}{\hat\sigma_j}\right), \qquad \hat\alpha = \hat\tau - \hat\beta\hat\gamma.$$

This is similar to solving a penalized regression in outlier detection [4]:

$$(\hat\gamma, \hat\alpha) = \arg\min_{\alpha, \gamma} \left\| \hat\tau - \alpha - \hat\beta\gamma \right\|_{\hat\Sigma}^2 + P_\rho(\alpha).$$

[3] Berk, J. B., & Green, R. C. (2004). “Mutual fund flows and performance in rational markets.” Journal of Political Economy, 112(6).
[4] She, Y., & Owen, A. B. (2011). “Outlier detection using nonconvex penalized regression.” JASA, 106.
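The robust regression step can be illustrated with a small sketch. It assumes a single factor ($d = 1$, so $\gamma$ is a scalar), treats $\hat\beta$ and $\hat\sigma_j$ as given, and minimizes the Huber objective by iteratively reweighted least squares (IRLS). IRLS, the tuning constant 1.345, and all simulated values are choices made for this example, not details from the slides.

```python
import random


def huber_weight(r, k=1.345):
    """IRLS weight psi(r) / r for Huber's loss with threshold k."""
    a = abs(r)
    return 1.0 if a <= k else k / a


def robust_gamma(tau, beta, sigma, n_iter=100):
    """Minimise sum_j rho((tau_j - beta_j * gamma) / sigma_j) over scalar gamma."""
    # Start from the ordinary least-squares fit of tau on beta.
    gamma = sum(b * t for b, t in zip(beta, tau)) / sum(b * b for b in beta)
    for _ in range(n_iter):
        num = den = 0.0
        for t, b, s in zip(tau, beta, sigma):
            w = huber_weight((t - b * gamma) / s) / (s * s)
            num += w * b * t
            den += w * b * b
        gamma = num / den
    return gamma


# Simulated inputs with hypothetical values: 10 of 200 entries of alpha are nonzero.
rng = random.Random(1)
p, gamma_true = 200, 0.7
beta = [rng.gauss(0, 1) for _ in range(p)]
alpha = [3.0 if j < 10 else 0.0 for j in range(p)]
sigma = [0.1] * p
tau = [alpha[j] + beta[j] * gamma_true + rng.gauss(0, sigma[j]) for j in range(p)]

gamma_hat = robust_gamma(tau, beta, sigma)
alpha_hat = [t - b * gamma_hat for t, b in zip(tau, beta)]   # alpha = tau - beta * gamma

flagged = {j for j, a in enumerate(alpha_hat) if abs(a) > 1.5}
print("gamma_hat:", round(gamma_hat, 3), "flagged indices:", sorted(flagged))
```

The nonzero entries of $\alpha$ act as outliers in the regression of $\hat\tau$ on $\hat\beta$; the bounded influence of Huber's loss keeps them from distorting $\hat\gamma$, so they reappear as large entries of $\hat\alpha$.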

Some theoretical guarantees

Theorem
When $n, p \to \infty$, if the factor analysis estimates [5] of $\beta$ and $\Sigma$ are uniformly consistent and the robust loss function $\rho$ is “nice”, then for a fixed $j$:
1. $\hat\alpha_j$ is consistent if $\|\alpha\|_1 / p \to 0$;
2. $\hat\alpha_j$ is asymptotically normal and has “oracle efficiency” if $\|\alpha\|_1 \sqrt{n} / p \to 0$.

◮ “Oracle efficiency” means it has the same variance as the OLS estimator that observes the latent factors $Z$.

[5] Bai, J., & Li, K. (2012). “Statistical analysis of factor models of high dimension.” Annals of Statistics, 40(1).

Mutual fund example

Dataset
Mutual fund returns from 1984 to 2015, obtained from the Center for Research in Security Prices (CRSP).

Factor model
In finance, it is common to fit a linear model to the returns:

$$\underbrace{Y_{tj} - r_t}_{\text{excess return}} = \underbrace{\alpha_j}_{\text{``skill'' of manager}} + \underbrace{\beta_j^T Z_t}_{\text{systematic risk}} + \underbrace{\epsilon_{tj}}_{\text{idiosyncratic risk}}.$$

People have discovered many systematic risk factors $Z$ over the years:
◮ Market-average: this is the Capital Asset Pricing Model (CAPM).
◮ Stock caps and book-to-market ratio [6].
◮ Momentum [7].
◮ ......

[6] Fama, E. F., & French, K. R. (1993). “Common risk factors in the returns on stocks and bonds.” Journal of Financial Economics, 33(1).
[7] Carhart, M. M. (1997). “On persistence in mutual fund performance.” Journal of Finance, 52(1).
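For the single-factor (CAPM) case, the factor model above reduces to a simple regression of a fund's excess return on the market's excess return, with the intercept playing the role of $\alpha_j$. The sketch below uses made-up, noiseless numbers so that the fit is exact; the function name and all values are hypothetical.

```python
def ols_alpha_beta(fund_excess, market_excess):
    """OLS with intercept: returns (alpha, beta) for one fund and one factor."""
    n = len(fund_excess)
    mx = sum(market_excess) / n
    my = sum(fund_excess) / n
    sxx = sum((x - mx) ** 2 for x in market_excess)
    sxy = sum((x - mx) * (y - my) for x, y in zip(market_excess, fund_excess))
    beta = sxy / sxx
    alpha = my - beta * mx
    return alpha, beta


# Hypothetical monthly market excess returns, and a fund built with
# alpha = 0.002 and beta = 1.1 and no idiosyncratic noise.
market = [0.021, -0.013, 0.008, 0.034, -0.027, 0.015, 0.002, -0.006]
fund = [0.002 + 1.1 * m for m in market]

capm_alpha, capm_beta = ols_alpha_beta(fund, market)
print(capm_alpha, capm_beta)
```

With real returns the idiosyncratic term is not zero, and any omitted systematic factor would be absorbed into the estimated intercept, which is the confounding concern raised earlier in the talk.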

Mutual fund selection by CAPM

A recent study [8] shows that
◮ Most investors use CAPM-alpha to select mutual funds.
◮ More sophisticated investors adjust for more risk factors.

Is CAPM-alpha a good indicator of future performance?

An empirical exercise:
◮ At the beginning of every quarter, we use data from the past five years to compute each fund's cash flow, average returns, and CAPM-alpha.
◮ For each metric, funds are then divided into 10 groups.
◮ We evaluate the performance of each group in the next year.

[8] Barber, B. M., Huang, X., & Odean, T. (2016). “Which factors matter to investors? Evidence from mutual fund flows.” Review of Financial Studies, 29(10).
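The grouping step of the exercise can be sketched as follows. This is a simplified illustration for a single formation date: funds are ranked by one metric, split into 10 groups of near-equal size, and each group's average subsequent return is computed. The fund names, metric values, and returns are hypothetical placeholders, not CRSP data.

```python
def decile_groups(metric, n_groups=10):
    """Rank funds by metric (ascending) and split them into n_groups groups."""
    if len(metric) < n_groups:
        raise ValueError("need at least as many funds as groups")
    ranked = sorted(metric, key=metric.get)
    n = len(ranked)
    return [ranked[i * n // n_groups:(i + 1) * n // n_groups]
            for i in range(n_groups)]


def mean_by_group(groups, outcome):
    """Average outcome (e.g. next-year return) within each group."""
    return [sum(outcome[f] for f in g) / len(g) for g in groups]


# Hypothetical metric (e.g. CAPM-alpha) and next-year returns for 20 funds.
metric = {f"fund{i:02d}": 0.001 * i for i in range(20)}
next_year = {f"fund{i:02d}": 0.01 * (i % 3) for i in range(20)}

groups = decile_groups(metric)
group_means = mean_by_group(groups, next_year)
for rank, (g, m) in enumerate(zip(groups, group_means), start=1):
    print(rank, g, round(m, 4))
```

Repeating this at the start of every quarter and for each metric (cash flow, average return, CAPM-alpha) would give the comparison described in the list above.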
