Dimension Reduction and High-Dimensional Data: Estimation and Inference with Application to Genomics and Neuroimaging
Maxime Turgeon
April 9, 2019
McGill University, Department of Epidemiology, Biostatistics, and Occupational Health

Introduction
• Data revolution fueled by technological developments: the era of “big data”.
• In genomics and neuroimaging, high-throughput technologies lead to high-dimensional data.
• High costs lead to small-to-moderate sample sizes.
• More features than samples (large p, small n).

Omnibus Hypotheses and Dimension Reduction
• Traditionally, analysis is performed one feature at a time:
  • Large computational burden
  • Conservative tests and low power
  • Ignores correlation between features
• From a biological standpoint, there are natural groupings of measurements.
• Key: Summarise group-wise information using latent features.
• Dimension Reduction

High-dimensional data–Estimation
• Several approaches use regularization:
  • Zou et al. (2006): Sparse PCA
  • Witten et al. (2009): Penalized Matrix Decomposition
• Other approaches use structured estimators:
  • Bickel & Levina (2008): Banded and thresholded covariance estimators
• All of these approaches require tuning parameters, which increases the computational burden.

High-dimensional data–Inference
• Double Wishart problem and largest root
• The distribution of the largest root is difficult to compute.
• Several approximation strategies have been presented:
  • Chiani found simple recursive equations, but they are computationally unstable.
  • A result of Johnstone gives an excellent approximation, but it does not work with high-dimensional data.

Contribution of the thesis
In this thesis, I address the limitations outlined above.
• Block-independence leads to a simple approach free of tuning parameters.
• Empirical estimator that extends Johnstone’s theorem to high-dimensional data.
• Application of these ideas to a sequencing study of DNA methylation and ACPA levels.

First Manuscript–Estimation

Principal Component of Explained Variance
Let $Y$ be a multivariate outcome of dimension $p$ and $X$ a vector of covariates. We assume a linear relationship:
$$Y = \beta^T X + \varepsilon.$$
The total variance of the outcome can then be decomposed as
$$\mathrm{Var}(Y) = \mathrm{Var}(\beta^T X) + \mathrm{Var}(\varepsilon) = V_M + V_R.$$
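The decomposition holds when the error term is uncorrelated with the covariates, an assumption the slide leaves implicit. A short derivation, writing $\Sigma_X = \mathrm{Var}(X)$ (a symbol introduced here for illustration, not used on the slides):

```latex
\begin{aligned}
\mathrm{Var}(Y)
  &= \mathrm{Var}(\beta^T X + \varepsilon) \\
  &= \mathrm{Var}(\beta^T X) + \mathrm{Var}(\varepsilon)
     + \mathrm{Cov}(\beta^T X, \varepsilon) + \mathrm{Cov}(\varepsilon, \beta^T X) \\
  &= \underbrace{\beta^T \Sigma_X \beta}_{V_M}
     + \underbrace{\mathrm{Var}(\varepsilon)}_{V_R}
  \qquad \text{when } \mathrm{Cov}(X, \varepsilon) = 0 .
\end{aligned}
```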

PCEV: Statistical Model
Decompose the total variance of Y into:
1. Variance explained by the covariates;
2. Residual variance.

PCEV: Statistical Model
The PCEV framework seeks a linear combination $w^T Y$ such that the proportion of variance explained by $X$ is maximised:
$$R^2(w) = \frac{w^T V_M w}{w^T (V_M + V_R) w}.$$
Maximisation uses a combination of Lagrange multipliers and linear algebra.
Key observation: $R^2(w)$ measures the strength of the association.
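A minimal NumPy sketch of this maximisation, assuming the low-dimensional case (more observations than outcomes plus covariates) and least-squares estimates of $V_M$ and $V_R$. The function name `pcev` and its signature are illustrative only and are not the interface of the R package; the generalised eigenproblem is solved here by Cholesky whitening, which is one possible choice.

```python
import numpy as np

def pcev(Y, X):
    """Return (w, r2): weights maximising R^2(w) and the maximised value.

    Y is an (n, p) outcome matrix, X an (n,) or (n, q) covariate matrix.
    Assumes n > p + q so that V_M + V_R is invertible.
    """
    Y = np.asarray(Y, dtype=float)
    X = np.asarray(X, dtype=float).reshape(len(Y), -1)
    n = Y.shape[0]
    Yc = Y - Y.mean(axis=0)
    Xc = X - X.mean(axis=0)

    # Least-squares fit of the linear model Y = beta^T X + eps.
    beta, *_ = np.linalg.lstsq(Xc, Yc, rcond=None)
    fitted = Xc @ beta
    resid = Yc - fitted
    V_M = fitted.T @ fitted / n   # variance explained by the covariates
    V_R = resid.T @ resid / n     # residual variance

    # Solve V_M w = lambda (V_M + V_R) w by whitening with a Cholesky factor.
    L_inv = np.linalg.inv(np.linalg.cholesky(V_M + V_R))
    evals, evecs = np.linalg.eigh(L_inv @ V_M @ L_inv.T)
    w = L_inv.T @ evecs[:, -1]    # back-transform the top eigenvector
    return w, float(evals[-1])
```

With a single covariate, the returned `r2` equals the squared correlation between the component `Y @ w` and the covariate, which gives a quick sanity check on the output.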

Block-diagonal Estimator
I propose a block approach to the computation of PCEV in the presence of high-dimensional outcomes.
• Suppose the outcome variables Y can be divided into blocks of variables in such a way that
  • Variables within blocks are correlated
  • Variables in different blocks are uncorrelated
$$\mathrm{Cov}(Y) = \begin{pmatrix} * & 0 & 0 \\ 0 & * & 0 \\ 0 & 0 & * \end{pmatrix}$$

Block-diagonal Estimator
• We can perform PCEV on each of these blocks, resulting in a component for each block.
• Treating all these “partial” PCEVs as a new, multivariate pseudo-outcome, we can perform PCEV again; the result is a linear combination of the original outcome variables.
• Mathematically equivalent to performing PCEV in a single step (under the block-independence assumption).
• An extensive simulation study shows good power and robustness of inference to violations of the assumption.
• Presented applications to genomics and neuroimaging data.
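The two-stage procedure above can be sketched as follows. This is an illustration under stated assumptions, not the R package's implementation: the names `pcev_weights` and `pcev_block` are hypothetical, the blocks are assumed to be supplied as lists of column indices, and each block (and the number of blocks) is assumed small enough relative to n for the within-stage covariance matrices to be invertible.

```python
import numpy as np

def pcev_weights(Y, X):
    """PCEV weights and maximised R^2 for an (n, p) outcome with n > p + q."""
    X = np.asarray(X, dtype=float).reshape(len(Y), -1)
    Yc = Y - Y.mean(axis=0)
    Xc = X - X.mean(axis=0)
    beta, *_ = np.linalg.lstsq(Xc, Yc, rcond=None)
    fitted = Xc @ beta
    V_M = fitted.T @ fitted       # model sum of squares
    V_T = Yc.T @ Yc               # total sum of squares (= V_M + V_R)
    L_inv = np.linalg.inv(np.linalg.cholesky(V_T))
    evals, evecs = np.linalg.eigh(L_inv @ V_M @ L_inv.T)
    return L_inv.T @ evecs[:, -1], float(evals[-1])

def pcev_block(Y, X, blocks):
    """Two-stage PCEV; `blocks` is a list of column-index lists partitioning Y."""
    Y = np.asarray(Y, dtype=float)
    W = np.zeros((Y.shape[1], len(blocks)))
    for j, idx in enumerate(blocks):
        # Stage 1: one "partial" PCEV per block.
        W[idx, j], _ = pcev_weights(Y[:, idx], X)
    pseudo = Y @ W                # pseudo-outcome: one component per block
    # Stage 2: PCEV on the pseudo-outcome.
    a, r2 = pcev_weights(pseudo, X)
    return W @ a, r2              # weights on the original outcome variables
```

In this sketch no tuning parameter appears: the only input beyond the data is the block partition, and the final weights are a linear combination of the original outcomes even when p exceeds n.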

Second Manuscript–Inference

Double Wishart Problem
• Recall that PCEV is maximising a Rayleigh quotient:
$$R^2(w) = \frac{w^T V_M w}{w^T (V_M + V_R) w}.$$
• Equivalent to finding the largest root $\lambda$ of a double Wishart problem:
$$\det\left(A - \lambda (A + B)\right) = 0,$$
where $A = V_M$, $B = V_R$.
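The equivalence can be seen from a standard Lagrange-multiplier argument, sketched here under the assumption that $A + B$ is positive definite so that the normalisation constraint is well defined:

```latex
\begin{aligned}
&\text{maximise } w^T A w \quad \text{subject to } w^T (A + B) w = 1, \\
&\mathcal{L}(w, \lambda) = w^T A w - \lambda \left( w^T (A + B) w - 1 \right), \\
&\frac{\partial \mathcal{L}}{\partial w} = 2 A w - 2 \lambda (A + B) w = 0
  \;\Longrightarrow\; \left( A - \lambda (A + B) \right) w = 0 .
\end{aligned}
% A non-zero solution w exists only if det(A - \lambda (A + B)) = 0.
% Pre-multiplying the stationarity condition by w^T gives
% \lambda = w^T A w = R^2(w), so the maximum of R^2 is the largest root.
```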

Inference
• Evidence in the literature that the null distribution of the largest root $\lambda$ should be related to the Tracy-Widom distribution.
• A result of Johnstone (2008) gives an excellent approximation to the distribution using an explicit location-scale family of the TW(1).

Inference
• However, Johnstone’s theorem requires a rank condition on the matrices (rarely satisfied in high dimensions).
• The null distribution of $\lambda$ is asymptotically equal to that of the largest root of a scaled Wishart (Srivastava).
• The null distribution of the largest root of a Wishart is also related to the Tracy-Widom distribution.
• More generally, random matrix theory suggests that the Tracy-Widom distribution is key in central-limit-like theorems for random matrices.

Empirical Estimate
I propose to obtain an empirical estimate of the null distribution as follows:
1. Perform a small number of permutations (∼50) on the rows of Y;
2. For each permutation, compute the largest root statistic;
3. Fit a location-scale variant of the Tracy-Widom distribution.
Numerical investigations support this approach for computing p-values.
The main advantage over a traditional permutation strategy is the computation time.
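The three steps above can be sketched as follows. Several choices here are assumptions rather than the slides' method: the location and scale are fitted by matching the first two moments to those of the TW(1) distribution (the slides do not say how the fit is done), the statistic is passed in as a callable, and the demonstration statistic `largest_root` only covers the low-dimensional case. All function names are hypothetical.

```python
import numpy as np

# Mean and standard deviation of the Tracy-Widom distribution of order 1.
TW1_MEAN = -1.2065335745820
TW1_SD = np.sqrt(1.607781034581)

def largest_root(Y, X):
    """Largest root of det(A - lambda (A + B)) = 0; requires n > p + q."""
    X = np.asarray(X, dtype=float).reshape(len(Y), -1)
    Yc = Y - Y.mean(axis=0)
    Xc = X - X.mean(axis=0)
    beta, *_ = np.linalg.lstsq(Xc, Yc, rcond=None)
    fitted = Xc @ beta
    A = fitted.T @ fitted         # model sum of squares
    T = Yc.T @ Yc                 # total sum of squares, equal to A + B
    L_inv = np.linalg.inv(np.linalg.cholesky(T))
    return float(np.linalg.eigvalsh(L_inv @ A @ L_inv.T)[-1])

def permutation_null(Y, X, statistic, n_perm=50, seed=0):
    """Steps 1-2: permute the rows of Y and recompute the statistic each time."""
    rng = np.random.default_rng(seed)
    n = Y.shape[0]
    return np.array([statistic(Y[rng.permutation(n)], X) for _ in range(n_perm)])

def fit_tw_location_scale(stats):
    """Step 3 (moment matching): find (mu, sigma) with stat ~ mu + sigma * TW1."""
    stats = np.asarray(stats, dtype=float)
    sigma = stats.std(ddof=1) / TW1_SD
    mu = stats.mean() - sigma * TW1_MEAN
    return mu, sigma
```

Given the fitted pair, the observed statistic would be standardised as `(observed - mu) / sigma` and compared with the upper tail of TW(1); evaluating that tail probability needs a Tracy-Widom CDF implementation, which NumPy does not provide and which is left out of this sketch.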

Third Manuscript–Application

Data
• Anti-citrullinated Protein Antibody (ACPA) levels were measured in 129 individuals without any symptom of Rheumatoid Arthritis (RA).
• DNA methylation levels were measured from whole-blood samples using a targeted sequencing technique.
• CpG dinucleotides were grouped in regions of interest before the sequencing.
• We have 23,350 regions to analyze individually, corresponding to multivariate datasets $Y_k$, $k = 1, \ldots, 23{,}350$.

Method
• PCEV was performed independently on all regions.
• Significant amount of missing data; complete-case analysis.
• Analysis was adjusted for age, sex, and smoking status.
• ACPA levels were dichotomized into high and low.
• For the 2519 regions with more CpGs than observations, we used the Tracy-Widom empirical estimator to obtain p-values.

Results
• There were 1062 statistically significant regions at the α = 0.05 level.
• Univariate analysis of 175,300 CpG dinucleotides yielded 42 significant results.
• These 42 CpG dinucleotides were in 5 distinct regions.

Discussion

Summary
• This thesis described specific approaches to dimension reduction with high-dimensional datasets.
• Manuscript 1: The block-independence assumption leads to a convenient estimation strategy that is free of tuning parameters.
• Manuscript 2: The empirical estimator provides valid p-values for high-dimensional data by leveraging Johnstone’s theorem.
• Manuscript 3: Application of this thesis’ ideas to a study of the association between ACPA levels and DNA methylation.
• All methods from Manuscripts 1 & 2 are part of the R package pcev.

Limitations
• Inference for PCEV-block is robust to block-independence violations, but estimation is not.
  • Could have an impact on downstream analyses.
• The empirical estimator does not address limitations due to power.
  • But combining it with a shrinkage estimator should improve power.
• Missing data and multivariate analysis

Future Work
• Estimate the effective number of independent tests in region-based analyses
• Multiple imputation and PCEV
• Nonlinear dimension reduction

Thank you
The slides can be found at maxturgeon.ca/talks.
