Nonparametric Graph Estimation
Han Liu
Department of Operations Research and Financial Engineering, Princeton University

Acknowledgement
Fang Han (JHU Biostats)
John Lafferty (Chicago CS/Stats)
Larry Wasserman (CMU Stats/ML)
Tuo Zhao (JHU CS)
http://www.princeton.edu/~hanliu

High Dimensional Data Analysis
The dimensionality $d$ increases with the sample size $n$.
Approximation Error + Estimation Error + Computing Error
This talk
Well studied under linear and Gaussian models
A little nonparametricity goes a long way.

Graph Estimation Problem
Infer conditional independence based on observational data.
$d$ variables $X_1, \ldots, X_d$ and $n$ samples $x^1, \ldots, x^n$, arranged as a data matrix from which the graph is inferred:
$$\begin{pmatrix} x^1_1 & \cdots & x^1_d \\ \vdots & \ddots & \vdots \\ x^n_1 & \cdots & x^n_d \end{pmatrix} \;\Rightarrow\; G = (V, E)$$
$(X_i, X_j) \notin E \;\Leftrightarrow\; X_i \perp X_j \mid \text{the rest}$
Applications: density estimation, computing, visualization...

Desired Statistical Properties
Characterize the performance using different criteria:
Persistency: $\mathrm{Risk}(\hat f) - \mathrm{Risk}(f_o) = o_P(1)$
Consistency: $\mathrm{Distance}(\hat f, f^*) = o_P(1)$
Sparsistency: $P\big(\mathrm{graph}(\hat f) \ne \mathrm{graph}(f^*)\big) = o(1)$
Minimax optimality
[Figure: the model class $\mathcal{F}$ with the estimator $\hat f$, the oracle estimator $f_o$, and the true function $f^*$]

Outline
Nonparanormal
Forest Density Estimation
Summary

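Gaussian Graphical Models
$X \sim N_d(\mu, \Sigma)$, $\Omega = \Sigma^{-1}$
$\Omega_{jk} = 0 \;\Leftrightarrow\; X_j \perp X_k \mid \text{the rest}$ (Lauritzen 96)
glasso: Graphical Lasso (Yuan and Lin 06, Banerjee 08, Friedman et al. 08)
$$\min_{\Omega \succ 0} \Big\{ \mathrm{tr}(\hat S \Omega) - \log|\Omega| + \lambda \sum_{j,k} |\Omega_{jk}| \Big\}$$
Here $\hat S$ is the sample covariance, $\mathrm{tr}(\hat S \Omega) - \log|\Omega|$ is the negative Gaussian log-likelihood, and $\lambda \sum_{j,k} |\Omega_{jk}|$ is the $L_1$-regularization.
Neighborhood selection (Meinshausen and Bühlmann 06)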

Gaussian Graphical Models
CLIME: Constrained $L_1$-Minimization Method (Cai et al. 2011)
$$\min_{\Omega} \sum_{j,k} |\Omega_{jk}| \quad \text{subject to} \quad \|\hat S \Omega - I\|_{\max} \le \lambda$$
gDantzig: Graphical Dantzig Selector (Yuan 2010)

Computation and Theory
Computing: scalable up to thousands of dimensions
glasso (Hastie et al.): language: Fortran; scalability: d < 3000; speed: very fast
huge (Zhao and Liu): language: C; scalability: d < 6000; speed: 3x faster
Theory: persistency, consistency, sparsistency, optimal rate, ...
Key result for analysis, with $\hat S$ the sample covariance and $\Sigma$ the population covariance:
$$\|\hat S - \Sigma\|_{\max} = O_P\Big(\sqrt{\frac{\log d}{n}}\Big)$$

Many Real Data are non-Gaussian
[Figure: normal Q-Q plot of one typical gene from the Arabidopsis data (Wille et al. 04), n = 118, d = 39; axes: theoretical quantile vs. sample quantile]
Relax the Gaussian assumption without losing statistical and computational efficiency?

The Nonparanormal
Gaussian ⇒ Gaussian Copula
Nonparanormal Definition (Liu, Lafferty, Wasserman 09)
A random vector $X = (X_1, \ldots, X_d)$ is nonparanormal, written $X \sim NPN_d(\Sigma, \{f_j\}_{j=1}^d)$, in case $f(X) = (f_1(X_1), \ldots, f_d(X_d))$ is normal: $f(X) \sim N_d(0, \Sigma)$.
Here the $f_j$'s are strictly monotone and $\mathrm{diag}(\Sigma) = 1$.
$f_j(t) = \frac{t - \mu_j}{\sigma_j}$ ⇒ recover arbitrary Gaussian distributions
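To make the definition concrete, here is a minimal sketch of drawing from a two-dimensional nonparanormal: sample a correlated Gaussian pair z and apply the inverse transformations. The choices f_1 = log and f_2 = signed cube root, and the name `sample_npn_2d`, are illustrative assumptions and do not come from the slides.

```python
import math
import random

def sample_npn_2d(n, r, seed=0):
    """Draw n points from a 2-d nonparanormal with latent correlation r.

    Assumed (hypothetical) transformations: f_1 = log, f_2 = cube root,
    so x_1 = exp(z_1) and x_2 = z_2 ** 3 where z ~ N(0, [[1, r], [r, 1]]).
    """
    rng = random.Random(seed)
    xs, zs = [], []
    for _ in range(n):
        e1, e2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        z1 = e1
        z2 = r * e1 + math.sqrt(1.0 - r * r) * e2  # corr(z1, z2) = r
        xs.append((math.exp(z1), z2 ** 3))          # apply inverse of f_j
        zs.append((z1, z2))
    return xs, zs

xs, zs = sample_npn_2d(100, r=0.6)
```

Both transformations are strictly increasing, so applying f_1 and f_2 to the sampled x recovers the latent Gaussian pair exactly.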

Visualization
[Figure: bivariate nonparanormal densities with different transformations]

Basic Properties
The graph is encoded in the inverse correlation matrix.
Let $X \sim NPN_d(\Sigma, \{f_j\}_{j=1}^d)$ and $\Omega = \Sigma^{-1}$, then
$$p_X(x) = \frac{1}{(2\pi)^{d/2} |\Omega|^{-1/2}} \exp\Big\{ -\frac{1}{2} f(x)^T \Omega f(x) \Big\} \prod_{j=1}^d f_j'(x_j)$$
$\Rightarrow\; \Omega_{ij} = 0 \;\Leftrightarrow\; X_i \perp X_j \mid \text{the rest}$
Not jointly convex, how to estimate the parameters?

Estimating Transformation Functions
Directly estimate $\{f_j\}_{j=1}^d$ without worrying about $\Omega$.
$f_j$ strictly monotone, $f_j(X_j) \sim N(0, 1)$.
CDF of $X_j$:
$$F_j(t) = P(X_j \le t) = P\big(f_j(X_j) \le f_j(t)\big) = \Phi\big(f_j(t)\big) \;\Rightarrow\; f_j(t) = \Phi^{-1}\big(F_j(t)\big)$$
Normal-score transformation:
$$\hat F_j(t) = \frac{1}{n+1} \sum_{i=1}^n I(x_j^i \le t)$$
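The normal-score transformation can be sketched in a few lines of standard-library Python. The function name `normal_scores` and the sample values are illustrative assumptions; the estimator itself is the 1/(n+1)-scaled empirical CDF from the slide, plugged into the standard normal quantile function.

```python
from statistics import NormalDist

def normal_scores(column):
    """Estimate f_j at each observed value of one variable.

    Uses the empirical CDF scaled by 1/(n + 1), so the estimated CDF
    never reaches 1 and the normal quantile stays finite.
    """
    n = len(column)
    std_normal = NormalDist()  # standard normal: mean 0, sd 1
    scores = []
    for t in column:
        # F_hat(t) = (number of observations <= t) / (n + 1)
        f_hat = sum(1 for x in column if x <= t) / (n + 1)
        scores.append(std_normal.inv_cdf(f_hat))
    return scores

sample = [0.1, 2.5, 0.7, 40.0, 1.3]  # made-up, heavily skewed data
scores = normal_scores(sample)
```

The scores depend on the data only through their ranks, so the outlier 40.0 is mapped to the same value it would get if it were 3.0.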

Estimating Inverse Correlation Matrix
Nonparanormal Algorithm (Liu, Han, Lafferty, Wasserman 12)
Step 1: calculate the Spearman's rank correlation coefficient matrix $\hat R^\rho$.
Step 2: transform $\hat R^\rho$ into $\hat\Sigma^\rho$ according to
$$\hat\Sigma^\rho_{jk} = 2 \sin\Big(\frac{\pi}{6} \hat R^\rho_{jk}\Big) \qquad (\ast)$$
$\hat\Sigma^\rho$ provides a good estimate of $\Sigma$.
Step 3: plug $\hat\Sigma^\rho$ into glasso / CLIME / gDantzig to get $\hat\Omega^\rho$ and the graph.
The same procedure was independently proposed by Xue and Zou (12).
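Steps 1 and 2 can be sketched without any external library. The helper names (`ranks`, `spearman`, `npn_correlation`) and the toy data are assumptions for illustration, and ties are assumed absent; Step 3 is left to an existing solver such as glasso.

```python
import math

def ranks(values):
    """Ranks 1..n of the values; assumes no ties (continuous data)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def spearman(a, b):
    # Step 1: Spearman's rho is the Pearson correlation of the ranks.
    return pearson(ranks(a), ranks(b))

def npn_correlation(columns):
    """Step 2: apply the sine transform (*) entrywise; diagonal is 1."""
    d = len(columns)
    sigma = [[1.0] * d for _ in range(d)]
    for j in range(d):
        for k in range(j + 1, d):
            rho = spearman(columns[j], columns[k])
            sigma[j][k] = sigma[k][j] = 2.0 * math.sin(math.pi / 6.0 * rho)
    return sigma

x1 = [0.2, 1.5, 0.9, 3.1, 2.4]
x2 = [math.exp(v) for v in x1]  # strictly increasing transform of x1
x3 = [5.0, 1.0, 4.0, 2.0, 3.0]
sigma_hat = npn_correlation([x1, x2, x3])
```

Because x2 is a monotone transform of x1, the two variables have identical ranks, so their estimated correlations with x3 coincide.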

Nonparanormal Theory
Theorem (Liu, Han, Lafferty, Wasserman 12)
Let $X \sim NPN_d(\Sigma, f)$ and $\Omega = \Sigma^{-1}$. Under whatever conditions on $\Sigma$ and $\Omega$ secure the consistency and sparsistency of glasso / CLIME / gDantzig under Gaussian models, the nonparanormal estimator is also consistent and sparsistent, with exactly the same parametric rates of convergence.
⇒ The nonparanormal is a safe replacement for the Gaussian model.

Proof of the Theorem
Proof: The key is to show that
$$\|\hat\Sigma^\rho - \Sigma\|_{\max} = O_P\Big(\sqrt{\frac{\log d}{n}}\Big).$$
For the Gaussian distribution, Kruskal (1948) shows
$$\Sigma_{jk} = 2 \sin\Big(\frac{\pi}{6} R^\rho_{jk}\Big),$$
where $\Sigma_{jk}$ is Pearson's correlation coefficient and $R^\rho_{jk}$ is the population Spearman's rank coefficient, which is invariant under monotone transformations.
The identity is therefore also true for the nonparanormal distribution.
By the theory of U-statistics,
$$\|\hat\Sigma^\rho - \Sigma\|_{\max} \lesssim \|\hat R^\rho - R^\rho\|_{\max} = O_P\Big(\sqrt{\frac{\log d}{n}}\Big).$$
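The step that bounds the covariance error by the rank-correlation error is not spelled out on the slide. One way to see it, offered as a sketch and not as the authors' exact argument, is that the sine transform is Lipschitz, applied entrywise:

```latex
\begin{align*}
\big|\hat\Sigma^\rho_{jk} - \Sigma_{jk}\big|
  &= 2\,\Big|\sin\!\Big(\tfrac{\pi}{6}\hat R^\rho_{jk}\Big)
           - \sin\!\Big(\tfrac{\pi}{6} R^\rho_{jk}\Big)\Big| \\
  &\le 2 \cdot \tfrac{\pi}{6}\,\big|\hat R^\rho_{jk} - R^\rho_{jk}\big|
   \qquad \text{since } |\sin a - \sin b| \le |a - b| \\
  &= \tfrac{\pi}{3}\,\big|\hat R^\rho_{jk} - R^\rho_{jk}\big| .
\end{align*}
```

Taking the maximum over all pairs (j, k) gives the max-norm inequality with constant π/3, so only the concentration of the Spearman estimates remains to be shown.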

Empirical Results
For non-Gaussian data, the nonparanormal >> glasso.
Sample $x^i \sim NPN_d(\Sigma, f)$ with n = 200, d = 40 and transformation $f_j$.
[Figure: true graph compared with the glasso and nonparanormal estimates, with false negatives (FN) and false positives (FP) marked]
Oracle graph: pick the best tuning parameter along the path.

Nonparanormal: Efficiency Loss
For Gaussian data, the nonparanormal loses almost no efficiency.
Computationally: no extra cost.
Statistically: sample $x^1, \ldots, x^n \sim N_d(0, \Sigma)$ with n = 80 and d = 100.
[Figure: ROC curve for graph recovery, axes 1-FP vs. 1-FN; almost no efficiency loss]

Arabidopsis Data
The nonparanormal behaves differently from glasso on the Arabidopsis data.
The paths are different.
[Figure: graphs at $\lambda_1$, $\lambda_2$, $\lambda_3$ for glasso, nonparanormal, and their difference; the estimated $\hat f_j$ for MECPS is highly nonlinear]
Nonlinear transformation causes graph difference.

Scientific Implications
Cross-pathway interactions?
[Figure: nonparanormal and glasso graphs over the MVA Pathway and MEP Pathway, with genes HMGR1, HMGR2, and MECPS labeled]
Still open in the current biological literature (Hou et al. 2010)

Tradeoff
Nonparanormal: unrestricted graphs, more flexible distributions
What if the true distribution is not nonparanormal?
Trade off structural flexibility for greater nonparametricity.

Forest Densities
Gaussian Copula ⇒ Fully nonparametric distribution
A forest $F = (V, E_F)$ is an acyclic graph.
A distribution is supported on a forest $F = (V, E_F)$ if
$$p_F(x) = \prod_{(i,j) \in E_F} \frac{p(x_i, x_j)}{p(x_i)\, p(x_j)} \cdot \prod_{k \in V} p(x_k)$$
Forest density estimator: plug in $\hat F = (V, E_{\hat F})$, $\hat p(x_i, x_j)$, and $\hat p(x_k)$.
Advantages: visualization, computing, distributional flexibility, inference

Some Previous Work
Most existing work on forests is for discrete distributions:
Chow and Liu (1968)
Bach and Jordan (2003)
Tan et al. (2010)
Chechetka and Guestrin (2007)
Our focus: statistical properties in high dimensions

Estimation
Find a forest, the projection of the true density $p(x)$ onto forests:
$$F^{(k)} = \arg\min_{F} \mathrm{KL}\big(p(x) \,\|\, p_F(x)\big) \quad \text{subject to} \quad |E_F| \le k$$
Maximum weight forest problem (Kruskal 56):
$$F^{(k)} = \arg\max_{F} \sum_{(i,j) \in E_F} I(p_{ij}) \quad \text{subject to} \quad |E_F| \le k$$
Mutual information:
$$I(p_{ij}) = \int p(x_i, x_j) \log \frac{p(x_i, x_j)}{p(x_i)\, p(x_j)} \, dx_i \, dx_j$$
Clipped KDE: $\hat p(x_i, x_j)$, $\hat p(x_k)$
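The equivalence between the KL projection and the maximum weight forest problem follows from substituting the forest factorization into the KL divergence. A short sketch, where the entropy notation H(·) is introduced here and is not on the slides, and all entropies are assumed finite:

```latex
\begin{align*}
\mathrm{KL}\big(p \,\|\, p_F\big)
  &= \int p(x)\log p(x)\,dx - \int p(x)\log p_F(x)\,dx \\
  &= -H(p)
     - \sum_{(i,j)\in E_F} \int p(x_i,x_j)
         \log\frac{p(x_i,x_j)}{p(x_i)\,p(x_j)}\,dx_i\,dx_j
     - \sum_{k\in V} \int p(x_k)\log p(x_k)\,dx_k \\
  &= -H(p) + \sum_{k\in V} H(p_k) - \sum_{(i,j)\in E_F} I(p_{ij}) .
\end{align*}
```

Only the last sum depends on the forest F, so minimizing the KL divergence over forests with at most k edges is the same as maximizing the total mutual information of the selected edges.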

Forest Density Estimation Algorithm
1. Sort edges according to empirical mutual information $I(\hat p_{ij})$.
2. Greedily pick a set of edges such that no cycles are formed.
3. Output the obtained forest after $k$ edges have been added.
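The three steps amount to Kruskal's algorithm stopped after k edges. A minimal sketch with a union-find structure follows; the function name `max_weight_forest` and the mutual-information values are hypothetical, and the weights are taken as given instead of being computed from kernel density estimates.

```python
def max_weight_forest(d, weights, k):
    """Greedy (Kruskal-style) forest with at most k edges.

    d       -- number of nodes, labeled 0..d-1
    weights -- dict mapping an edge (i, j) to its mutual information
    k       -- maximum number of edges to keep
    """
    parent = list(range(d))

    def find(v):
        # Root of v's component, with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    forest = []
    # Step 1: sort edges by decreasing weight.
    for (i, j), _ in sorted(weights.items(), key=lambda e: e[1], reverse=True):
        if len(forest) >= k:  # Step 3: stop once k edges are added
            break
        ri, rj = find(i), find(j)
        if ri != rj:          # Step 2: skip edges that would close a cycle
            parent[ri] = rj
            forest.append((i, j))
    return forest

# Made-up mutual information values for 4 variables.
mi = {(0, 1): 0.9, (1, 2): 0.8, (0, 2): 0.7, (2, 3): 0.1}
forest = max_weight_forest(4, mi, k=2)
```

With k = 3 the edge (0, 2) is skipped because nodes 0 and 2 are already connected through node 1, and the weaker edge (2, 3) is added instead.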

Assumptions for Forest Graph Estimation
(A1) Bivariate marginals $p(x_j, x_k)$ belong to a 2nd-order Hölder class.
(A2) $p(x)$ has bounded support (e.g. $[0,1]^d$) and $\kappa_1 \le \min_{j,k} p(x_j, x_k) \le \max_{j,k} p(x_j, x_k) \le \kappa_2$.
(A3) $p(x_j, x_k)$ has vanishing partial derivatives on the boundaries.
(A4) For a "crucial" set of edges, the mutual information values are distinct enough from each other.
(A4) secures enough signal-to-noise ratio for correct structure recovery (Tan, Anandkumar, Willsky 11).

Forest Density Estimation Theory
$$F^{(k)} = \arg\min_{F : |E_F| \le k} \mathrm{KL}\big(p(x) \,\|\, p_F(x)\big)$$
Here $p$ is the true density, $p_{F^{(k)}}$ is the oracle density estimator, and $\hat p_{\hat F^{(k)}}$ is the forest estimator.
Theorem: Oracle Sparsistency (Liu et al. 12)
For graph estimation, let $\mathcal{P}^{(k)}$ denote the densities supported by forests with at most $k$ edges. Let the 1d and 2d KDEs use the same bandwidth $h \asymp n^{-1/4}$ (undersmoothing), and let $\frac{\log d}{n} \to 0$ (parametric scaling). Then we have
$$\sup_{p \in \mathcal{P}^{(k)}} P\big(\hat F^{(k)} \ne F^{(k)}\big) = o(1).$$
