
Confounder adjustment in large-scale linear structural models



  1. Confounder adjustment in large-scale linear structural models

     Qingyuan Zhao
     Department of Statistics, The Wharton School, University of Pennsylvania
     June 19, 2018, EcoStat

     Based on
     ◮ Wang, J., Zhao, Q., Hastie, T., & Owen, A. B. Confounder adjustment in multiple hypothesis testing. Annals of Statistics, 45(5), 1863-1894, 2017.
     ◮ Song, Y., Zhao, Q. Performance evaluation in presence of latent factors. (In preparation).

     Slides are available at http://www-stat.wharton.upenn.edu/~qyzhao/

  2. Setting

     Multivariate linear regression

     $$ Y_{n \times p} = X_{n \times 1}\, \alpha_{p \times 1}^T + Z_{n \times d}\, \beta_{p \times d}^T + \epsilon_{n \times p}. $$

     ◮ $Y$: “Panel data” or “transposable data”. Modern datasets are often high dimensional (both $n, p \gg 1$).
     ◮ $X$: “Primary variable”, whose coefficients $\alpha$ are of interest.
     ◮ $Z$: “Control variables”, whose coefficients $\beta$ are not of interest (i.e. nuisance parameters).
     ◮ Noise $\epsilon \sim \mathrm{MN}(0, I_n, \Sigma)$, where $\Sigma = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_p^2)$.

     Two examples
     ◮ Gene discovery: $Y$ is gene expression (row: tissue; column: gene), $X$ is the treatment.
     ◮ Mutual fund selection: $Y$ is the monthly return of mutual funds (row: month; column: fund), $X$ is the intercept, $Z$ includes systematic risk factors.

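  3. The confounding problem

     $$ Y_{n \times p} = X_{n \times 1}\, \alpha_{p \times 1}^T + Z_{n \times d}\, \beta_{p \times d}^T + \epsilon_{n \times p}. $$

     Omitted variable bias
     When not all $Z$ are known or measured, the OLS estimate of $\alpha$ can be severely biased. To see this, suppose

     $$ Z_{n \times d} = X_{n \times 1}\, \gamma_{d \times 1}^T + W_{n \times d}, \qquad \text{where } W \perp\!\!\!\perp X. $$

     Therefore

     $$ Y = X (\alpha + \beta\gamma)^T + W \beta^T + \epsilon, $$

     and the OLS estimate of $\alpha$ indeed converges to $\alpha + \beta\gamma$.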
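The bias formula above can be checked numerically. The following is a minimal simulation sketch, not taken from the slides: it assumes a single confounder ($d = 1$), Gaussian $X$, $W$ and noise, and hypothetical parameter values. The names `ols_slope` and `simulate_bias` are illustrative.

```python
import random

def ols_slope(x, y):
    """Least-squares slope of y on x without an intercept: (x'x)^{-1} x'y."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

def simulate_bias(n=20000, seed=0):
    """Simulate Y = X alpha^T + Z beta^T + eps with Z = X gamma + W and d = 1.

    All parameter values are hypothetical, chosen only for illustration.
    Returns (naive OLS estimates, alpha + beta * gamma, true alpha).
    """
    rng = random.Random(seed)
    alpha = [1.0, 0.0, -0.5]   # effects of X on three outcome columns
    beta = [2.0, 1.0, 0.5]     # loadings of the columns on the confounder
    gamma = 0.8                # dependence of the confounder on X
    x = [rng.gauss(0, 1) for _ in range(n)]
    z = [gamma * xi + rng.gauss(0, 1) for xi in x]   # Z = X * gamma + W
    naive = []
    for a_j, b_j in zip(alpha, beta):
        y_j = [a_j * xi + b_j * zi + rng.gauss(0, 1) for xi, zi in zip(x, z)]
        naive.append(ols_slope(x, y_j))   # regression that omits Z
    limit = [a_j + b_j * gamma for a_j, b_j in zip(alpha, beta)]
    return naive, limit, alpha

naive, limit, alpha = simulate_bias()
for est, lim, a in zip(naive, limit, alpha):
    print(f"OLS {est:+.3f}   alpha + beta*gamma {lim:+.3f}   true alpha {a:+.3f}")
```

With a large $n$, the naive estimates should sit close to $\alpha + \beta\gamma$ rather than $\alpha$, which is the omitted variable bias described on this slide.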

  4. An illustrative example

     The gender study [1]
     Question: Which genes are more expressed in male/female?

     A microarray experiment was conducted in this study:
     ◮ Postmortem samples from the brains of 10 individuals.
     ◮ For each individual, 3 samples from different cortices.
     ◮ Each sample is sent to 3 different labs for analysis.
     ◮ Two different microarray platforms are used by the labs.

     In total, there are 10 × 3 × 3 = 90 samples.

     This example was first used by Gagnon-Bartsch and Speed [2] to demonstrate the importance of “removing unwanted variation” (RUV).

     [1] Vawter, Marquis P., et al. “Gender-specific gene expression in post-mortem human brain: localization to sex chromosomes.” Neuropsychopharmacology 29.2 (2004).
     [2] Gagnon-Bartsch, J. A., and Speed, T. P. “Using control genes to correct for unwanted variation in microarray data.” Biostatistics 13.3 (2012).

  5. A simple association test

     ◮ Regress each column of Y (gene) on X.
     ◮ In R, run summary(lm(Y ~ X)).
     ◮ Equivalent to a two-sample t-test with equal variance.

     [Figure: histogram of t-statistics, skewed and underdispersed. x-axis: t-statistics; y-axis: density; legend: N(0.055, 0.066^2).]
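The equivalence between the per-gene regression and the equal-variance two-sample t-test can be illustrated with a small sketch. This is an assumption-laden illustration rather than the analysis behind the histogram: the data are hypothetical, and `regression_t`, `pooled_t` and `column_t_statistics` are illustrative names.

```python
import math

def regression_t(x, y):
    """t-statistic for the slope in the OLS fit y = a + b*x + noise."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)   # standard error of the slope
    return slope / se

def pooled_t(group1, group0):
    """Two-sample t-statistic with equal (pooled) variance."""
    n1, n0 = len(group1), len(group0)
    m1, m0 = sum(group1) / n1, sum(group0) / n0
    ss = sum((v - m1) ** 2 for v in group1) + sum((v - m0) ** 2 for v in group0)
    sp2 = ss / (n1 + n0 - 2)
    return (m1 - m0) / math.sqrt(sp2 * (1 / n1 + 1 / n0))

def column_t_statistics(x, Y):
    """Apply regression_t to every column of Y (given as a list of rows)."""
    return [regression_t(x, [row[j] for row in Y]) for j in range(len(Y[0]))]

# Hypothetical data: x is a 0/1 group label, y is one gene's expression.
x = [0, 0, 0, 1, 1, 1]
y = [1.0, 2.0, 3.0, 2.0, 4.0, 6.0]
print(regression_t(x, y), pooled_t(y[3:], y[:3]))
```

When `x` is a 0/1 indicator, the two functions agree up to floating-point error, which is why the slide describes the regression as a two-sample t-test.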

  6. What happened?

     [Figure: plot of the largest principal components (PC1 vs. PC2); legend: lab (1, 2, 3) and platform (0, 1).]

  7. Our solution in a nutshell

     Recall that (for simplicity, assume $Z$ is entirely unobserved)

     $$ Y_{n \times p} = X_{n \times 1}\, \alpha_{p \times 1}^T + Z_{n \times d}\, \beta_{p \times d}^T + \epsilon_{n \times p}, \qquad Z_{n \times d} = X_{n \times 1}\, \gamma_{d \times 1}^T + W_{n \times d} $$

     $$ \Downarrow $$

     $$ Y = X (\underbrace{\alpha + \beta\gamma}_{\tau})^T + W \beta^T + \epsilon. $$

     Confounder adjusted testing and estimation (CATE)
     1. OLS using the observed regressors: $\hat\tau^T = (X^T X)^{-1} X^T Y \approx (\alpha + \beta\gamma)^T$, $\quad R = (I - P_X) Y \approx W \beta^T + \epsilon$.
     2. Factor analysis of $R$ ⇒ loading matrix $\hat\beta$.
     3. Path analysis: $\hat\tau_{p \times 1} \approx \alpha_{p \times 1} + \hat\beta_{p \times d}\, \gamma_{d \times 1}$.

     Problem: the third step is not going to work because it has $(p + d)$ parameters but only $p$ equations, i.e. $\alpha$ is not identified.
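The first two steps, and the reason the third one fails on its own, can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the CATE implementation: it takes $d = 1$, uses the first principal direction of $R$ (found by power iteration) as a stand-in for a proper factor analysis, and runs on hypothetical noise-free data. All function names are illustrative.

```python
import math

def ols_and_residuals(x, Y):
    """Step 1: regress every column of Y on x (no intercept).
    Returns tau_hat (length p) and the residual matrix R (n x p)."""
    n, p = len(x), len(Y[0])
    sxx = sum(v * v for v in x)
    tau_hat = [sum(x[i] * Y[i][j] for i in range(n)) / sxx for j in range(p)]
    R = [[Y[i][j] - x[i] * tau_hat[j] for j in range(p)] for i in range(n)]
    return tau_hat, R

def leading_loading(R, iters=200):
    """Step 2 (simplified): unit-norm first principal direction of R,
    computed by power iteration on R^T R. Assumes d = 1."""
    n, p = len(R), len(R[0])
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iters):
        scores = [sum(R[i][j] * v[j] for j in range(p)) for i in range(n)]
        w = [sum(R[i][j] * scores[i] for i in range(n)) for j in range(p)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:   # start vector orthogonal to the loading: give up
            break
        v = [c / norm for c in w]
    return v

def alpha_given_gamma(tau_hat, beta_hat, gamma):
    """Step 3 needs gamma: every gamma yields an alpha that fits tau_hat."""
    return [t - b * gamma for t, b in zip(tau_hat, beta_hat)]

# Hypothetical noise-free data with W orthogonal to X and gamma = 0.5.
alpha = [1.0, 0.0, 0.0, -2.0]
beta = [1.0, 2.0, -1.0, 0.5]
x = [1.0, -1.0, 2.0, -2.0, 3.0, -3.0]
w = [1.0, 1.0, -1.0, -1.0, 0.0, 0.0]
Y = [[xi * a + (0.5 * xi + wi) * b for a, b in zip(alpha, beta)]
     for xi, wi in zip(x, w)]

tau_hat, R = ols_and_residuals(x, Y)
beta_hat = leading_loading(R)
print("tau_hat :", tau_hat)
print("beta_hat:", beta_hat)
for g in (0.0, 1.25):   # two candidate gammas, two equally valid alphas
    print("gamma =", g, "->", alpha_given_gamma(tau_hat, beta_hat, g))
```

Both candidate values of `gamma` reproduce `tau_hat` exactly, which is the non-identification stated in the "Problem" line: without an extra restriction, the data cannot choose between them.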

  8. Identification

     Path analysis equation:

     $$ \tau_{p \times 1} \approx \alpha_{p \times 1} + \beta_{p \times d}\, \gamma_{d \times 1}. $$

     ◮ $\tau$ and (the column space of) $\beta$ can be identified from data.
     ◮ $\alpha$ and $\gamma$ cannot be identified from data. In other words, different values of $(\alpha, \gamma)$ may correspond to the same distribution of the observed data.
     ◮ Solution to non-identifiability: impose additional restrictions.

     Proposition
     Suppose $\beta$ can be identified from the factor analysis. Then $\alpha$ is identifiable under either of the following two conditions:
     1. Negative control: $\alpha_{\mathcal{C}} = 0$ for a known set $\mathcal{C}$ such that $|\mathcal{C}| \ge d$ and $\mathrm{rank}(\beta_{\mathcal{C}}) = d$.
     2. Sparsity: $\|\alpha\|_0 \le \lfloor (p - d)/2 \rfloor$ and $\mathrm{rank}(\beta_{\mathcal{C}}) = d$ for all $\mathcal{C} \subset \{1, \ldots, p\}$ such that $|\mathcal{C}| = d$.
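To see why negative controls restore identification, note that on the control set the path equation reduces to $\tau_{\mathcal{C}} \approx \beta_{\mathcal{C}} \gamma$, which can be solved for $\gamma$ when $\mathrm{rank}(\beta_{\mathcal{C}}) = d$. The sketch below assumes $d = 1$, treats $\tau$ and $\beta$ as known, and uses plain least squares over the controls, which is a simplification chosen for illustration. The numbers and function names are hypothetical.

```python
def gamma_from_negative_controls(tau, beta, controls):
    """Least-squares solution of tau_j = beta_j * gamma over j in controls (d = 1)."""
    num = sum(beta[j] * tau[j] for j in controls)
    den = sum(beta[j] ** 2 for j in controls)
    if den == 0.0:
        # rank(beta_C) < d: the controls carry no information about gamma
        raise ValueError("control loadings are all zero")
    return num / den

def alpha_from_gamma(tau, beta, gamma):
    """Back out alpha = tau - beta * gamma once gamma is pinned down."""
    return [t - b * gamma for t, b in zip(tau, beta)]

# Hypothetical values: columns 1 and 2 are known negative controls (alpha = 0).
tau = [1.5, 1.0, -0.5, -1.75]
beta = [1.0, 2.0, -1.0, 0.5]
controls = [1, 2]

gamma_hat = gamma_from_negative_controls(tau, beta, controls)
alpha_hat = alpha_from_gamma(tau, beta, gamma_hat)
print("gamma_hat:", gamma_hat)
print("alpha_hat:", alpha_hat)
```

The rank condition shows up as the guard on `den`: if every control has zero loading, the controls say nothing about $\gamma$ and $\alpha$ stays unidentified.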

  9. Estimation under sparsity

     Is sparsity reasonable?
     Not always, but acceptable in our examples:
     ◮ In genomics screening, most genes are probably unrelated.
     ◮ Most mutual funds likely have no “alpha” (otherwise they would be quickly identified by the investors) [3].

     Estimation via robust regression in CATE
     Using a robust loss function $\rho(\cdot)$ (such as Huber’s), solve

     $$ \hat\gamma = \arg\min_{\gamma} \sum_{j=1}^{p} \rho\!\left( \frac{\hat\tau_j - \hat\beta_j^T \gamma}{\hat\sigma_j} \right), \qquad \hat\alpha = \hat\tau - \hat\beta \hat\gamma. $$

     This is similar to solving a penalized regression in outlier detection [4]:

     $$ (\hat\gamma, \hat\alpha) = \arg\min_{\alpha, \gamma} \left\| \hat\tau - \alpha - \hat\beta\gamma \right\|_{\hat\Sigma}^2 + P_\rho(\alpha). $$

     [3] Berk, J. B., & Green, R. C. (2004). “Mutual fund flows and performance in rational markets.” Journal of Political Economy, 112(6).
     [4] She, Y., & Owen, A. B. (2011). “Outlier detection using nonconvex penalized regression.” JASA, 106.
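The robust regression step can be sketched with iteratively reweighted least squares (IRLS). This is one common way to minimize a Huber objective, and it is an assumption here rather than the algorithm the authors describe. The sketch takes $d = 1$, the conventional Huber threshold $k = 1.345$, and hypothetical inputs. The names `huber_weight` and `robust_gamma` are illustrative.

```python
def huber_weight(r, k=1.345):
    """IRLS weight psi(r) / r for Huber's loss with threshold k."""
    return 1.0 if abs(r) <= k else k / abs(r)

def robust_gamma(tau_hat, beta_hat, sigma_hat, k=1.345, iters=100):
    """Minimize sum_j rho((tau_j - beta_j * gamma) / sigma_j) over gamma (d = 1)
    by IRLS. With iters=1 this returns the ordinary weighted least-squares fit."""
    weights = [1.0] * len(tau_hat)
    gamma = 0.0
    for _ in range(iters):
        num = sum(w * b * t / s ** 2
                  for w, b, t, s in zip(weights, beta_hat, tau_hat, sigma_hat))
        den = sum(w * b * b / s ** 2
                  for w, b, s in zip(weights, beta_hat, sigma_hat))
        gamma = num / den
        # Downweight columns with large standardized residuals (likely alpha_j != 0).
        weights = [huber_weight((t - b * gamma) / s, k)
                   for b, t, s in zip(beta_hat, tau_hat, sigma_hat)]
    return gamma

# Hypothetical inputs: true gamma = 0.5, eight null columns, one with alpha = 10.
tau_hat = [0.5] * 8 + [10.5]
beta_hat = [1.0] * 9
sigma_hat = [1.0] * 9

gamma_ls = robust_gamma(tau_hat, beta_hat, sigma_hat, iters=1)
gamma_rr = robust_gamma(tau_hat, beta_hat, sigma_hat)
alpha_rr = [t - b * gamma_rr for t, b in zip(tau_hat, beta_hat)]
print("least squares gamma:", gamma_ls)
print("Huber gamma        :", gamma_rr)
print("alpha_hat          :", alpha_rr)
```

In this toy example the non-null column pulls the least-squares estimate of $\gamma$ well away from 0.5, while the Huber fit stays much closer. That is the sense in which sparse nonzero $\alpha_j$ are treated as outliers.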

  10. Some theoretical guarantees

      Theorem
      When $n, p \to \infty$, if the factor analysis estimates [5] of $\beta$ and $\Sigma$ are uniformly consistent and the robust loss function $\rho$ is “nice”, then for a fixed $j$,
      1. $\hat\alpha_j$ is consistent if $\|\alpha\|_1 / p \to 0$;
      2. $\hat\alpha_j$ is asymptotically normal and has “oracle efficiency” if $\|\alpha\|_1 \sqrt{n} / p \to 0$.

      ◮ “Oracle efficiency” means it has the same variance as the OLS estimator that observes the latent factors $Z$.

      [5] Bai, J., & Li, K. (2012). Statistical analysis of factor models of high dimension. Annals of Statistics, 40(1).

  11. Mutual fund example

      Dataset
      Mutual fund returns from 1984 to 2015, obtained from the Center for Research in Security Prices (CRSP).

      Factor model
      In finance, it is common to fit a linear model to the returns

      $$ \underbrace{Y_{tj} - r_t}_{\text{Excess return}} = \underbrace{\alpha_j}_{\text{``Skill'' of manager}} + \underbrace{\beta_j^T Z_t}_{\text{systematic risk}} + \underbrace{\epsilon_{tj}}_{\text{idiosyncratic risk}}. $$

      People have discovered many systematic risk factors $Z$ over the years:
      ◮ Market-average: this is the Capital Asset Pricing Model (CAPM).
      ◮ Stock caps and book-to-market ratio [6].
      ◮ Momentum [7].
      ◮ ...

      [6] Fama, E. F., & French, K. R. (1993). “Common risk factors in the returns on stocks and bonds.” Journal of Financial Economics, 33(1).
      [7] Carhart, M. M. (1997). “On persistence in mutual fund performance.” Journal of Finance, 52(1).
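For the single-factor (CAPM) case, the factor model above is a simple regression of a fund's excess return on the market's excess return, and the intercept is the fund's alpha. The sketch below assumes exactly that setup with made-up monthly numbers. `capm_alpha` is an illustrative name, not a function from any finance library.

```python
def capm_alpha(fund_excess, market_excess):
    """OLS fit of fund_excess = alpha + beta * market_excess + noise.
    Returns (alpha, beta)."""
    n = len(market_excess)
    zbar = sum(market_excess) / n
    ybar = sum(fund_excess) / n
    szz = sum((z - zbar) ** 2 for z in market_excess)
    szy = sum((z - zbar) * (y - ybar) for z, y in zip(market_excess, fund_excess))
    beta = szy / szz
    alpha = ybar - beta * zbar
    return alpha, beta

# Hypothetical monthly excess returns; the fund is built with alpha = 0.002
# and market beta = 1.2 so the fit can be checked by eye.
market = [0.01, -0.02, 0.03, 0.0, 0.015]
fund = [0.002 + 1.2 * m for m in market]
alpha_hat, beta_hat = capm_alpha(fund, market)
print("alpha:", alpha_hat, "beta:", beta_hat)
```

Adding more risk factors turns this into a multiple regression with the same interpretation of the intercept.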

  12. Mutual fund selection by CAPM

      A recent study [8] shows that
      ◮ Most investors use CAPM-alpha to select mutual funds.
      ◮ More sophisticated investors adjust for more risk factors.

      Is CAPM-alpha a good indicator for future performance?

      An empirical exercise:
      ◮ At the beginning of every quarter, we use data from the past five years to compute each fund's cash flow, average returns, and CAPM-alpha.
      ◮ For each metric, funds are then divided into 10 groups.
      ◮ We evaluate the performance of each group in the next year.

      [8] Barber, B. M., Huang, X., & Odean, T. (2016). “Which factors matter to investors? Evidence from mutual fund flows.” Review of Financial Studies, 29(10).
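The grouping step of this exercise can be sketched as a rank-and-split followed by a per-group average. The slides do not specify how ties or uneven group sizes are handled, so the rule below (sort by metric, then assign near-equal groups by rank) is an assumption. The fund identifiers and returns are made up, and the function names are illustrative.

```python
def assign_groups(metric, n_groups=10):
    """Map fund id -> group index (0 = lowest metric), near-equal group sizes."""
    ranked = sorted(metric, key=metric.get)
    n = len(ranked)
    return {fund: rank * n_groups // n for rank, fund in enumerate(ranked)}

def mean_by_group(values, groups, n_groups=10):
    """Average `values` (fund id -> number) within each group; None if empty."""
    totals = [0.0] * n_groups
    counts = [0] * n_groups
    for fund, g in groups.items():
        totals[g] += values[fund]
        counts[g] += 1
    return [t / c if c else None for t, c in zip(totals, counts)]

# Hypothetical example: 20 funds ranked by a past metric (e.g. CAPM-alpha),
# then evaluated on a made-up next-year return.
past_metric = {f"fund{i:02d}": i * 0.001 for i in range(20)}
next_year = {fund: 2 * value for fund, value in past_metric.items()}

groups = assign_groups(past_metric)
performance = mean_by_group(next_year, groups)
print(performance)
```

Repeating this at the start of each quarter with a rolling five-year window would give one performance series per group and per metric.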
