
Nonparametric Graph Estimation (Han Liu, Department of Operations Research and Financial Engineering) - PowerPoint PPT Presentation

Nonparametric Graph Estimation. Han Liu, Department of Operations Research and Financial Engineering, Princeton University. Acknowledgement: Fang Han, John Lafferty, Larry Wasserman, Tuo Zhao.


  1. Nonparametric Graph Estimation. Han Liu, Department of Operations Research and Financial Engineering, Princeton University.

  2. Acknowledgement. Fang Han (JHU Biostats), John Lafferty (Chicago CS/Stats), Larry Wasserman (CMU Stats/ML), Tuo Zhao (JHU CS). http://www.princeton.edu/~hanliu

  3. High Dimensional Data Analysis. The dimensionality $d$ increases with the sample size $n$. Error decomposition: Approximation Error + Estimation Error + Computing Error (slide annotations on the terms: "This talk" and "Well studied under linear and Gaussian models"). A little nonparametricity goes a long way.

  4. Graph Estimation Problem. Infer conditional independence based on observational data: $d$ variables $X_1, \dots, X_d$ and $n$ samples $x^1, \dots, x^n$, arranged as the $n \times d$ data matrix $\begin{pmatrix} x_1^1 & \cdots & x_d^1 \\ \vdots & \ddots & \vdots \\ x_1^n & \cdots & x_d^n \end{pmatrix}$ $\Rightarrow$ graph $G = (V, E)$, where $(X_i, X_j) \notin E \Leftrightarrow X_i \perp X_j \mid \text{the rest}$. Applications: density estimation, computing, visualization...

  5. Desired Statistical Properties. Characterize the performance using different criteria. Persistency: $\mathrm{Risk}(\hat f) - \mathrm{Risk}(f_o) = o_P(1)$. Consistency: $\mathrm{Distance}(\hat f, f^*) = o_P(1)$. Sparsistency: $P\big(\mathrm{graph}(\hat f) \neq \mathrm{graph}(f^*)\big) = o(1)$. Minimax optimality. [Diagram labels: model class $\mathcal{F}$, oracle estimator $f_o$, estimator $\hat f$, true function $f^*$.]

  6. Outline: Nonparanormal; Forest Density Estimation; Summary.

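  7. Gaussian Graphical Models. $X \sim N_d(\mu, \Sigma)$, $\Omega = \Sigma^{-1}$: $\Omega_{jk} = 0 \Leftrightarrow X_j \perp X_k \mid \text{the rest}$ (Lauritzen 96). glasso, the Graphical Lasso (Yuan and Lin 06, Banerjee 08, Friedman et al. 08): $\min_{\Omega \succ 0} \big\{ \mathrm{tr}(\hat S \Omega) - \log|\Omega| + \lambda \sum_{j,k} |\Omega_{jk}| \big\}$, where $\hat S$ is the sample covariance, $\mathrm{tr}(\hat S \Omega) - \log|\Omega|$ is the negative Gaussian log-likelihood, and $\lambda \sum_{j,k} |\Omega_{jk}|$ is the $L_1$-regularization. Neighborhood selection (Meinshausen and Bühlmann 06).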
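
To make the glasso criterion concrete, here is a minimal sketch that only evaluates the penalized objective for a given candidate $\Omega$; it does not solve the optimization. The function names `log_det_pd` and `glasso_objective` are illustrative, not part of the glasso or huge packages, and matrices are assumed to be plain nested lists.

```python
import math

def log_det_pd(a):
    """Log-determinant of a symmetric positive definite matrix via Cholesky."""
    d = len(a)
    chol = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = a[i][j] - sum(chol[i][k] * chol[j][k] for k in range(j))
            if i == j:
                if s <= 0:
                    raise ValueError("matrix is not positive definite")
                chol[i][j] = math.sqrt(s)
            else:
                chol[i][j] = s / chol[j][j]
    # |A| = prod(diag(L))^2 for A = L L^T
    return 2.0 * sum(math.log(chol[i][i]) for i in range(d))

def glasso_objective(s_hat, omega, lam):
    """tr(S_hat Omega) - log|Omega| + lam * sum_{j,k} |Omega_jk|."""
    d = len(omega)
    trace_term = sum(s_hat[i][k] * omega[k][i] for i in range(d) for k in range(d))
    l1_term = sum(abs(omega[j][k]) for j in range(d) for k in range(d))
    return trace_term - log_det_pd(omega) + lam * l1_term

identity = [[1.0, 0.0], [0.0, 1.0]]
value = glasso_objective(identity, identity, lam=0.1)
```

The penalty here sums over all entries, diagonal included, as in the formula on the slide; some implementations penalize only off-diagonal entries.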

  8. Gaussian Graphical Models. CLIME, the Constrained $L_1$-Minimization Method (Cai et al. 2011): $\min_{\Omega} \sum_{j,k} |\Omega_{jk}|$ subject to $\|\hat S \Omega - I\|_{\max} \le \lambda$. gDantzig, the Graphical Dantzig Selector (Yuan 2010).
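
The CLIME constraint can be checked directly for any candidate $\Omega$. The sketch below only tests feasibility under the max-norm constraint; it is not a CLIME solver, and the helper names are hypothetical.

```python
def max_norm_residual(s_hat, omega):
    """Return the max-norm of S_hat * Omega - I (largest absolute entry)."""
    d = len(s_hat)
    worst = 0.0
    for i in range(d):
        for j in range(d):
            entry = sum(s_hat[i][k] * omega[k][j] for k in range(d))
            target = 1.0 if i == j else 0.0
            worst = max(worst, abs(entry - target))
    return worst

def clime_feasible(s_hat, omega, lam):
    """True if omega satisfies the CLIME constraint for tuning parameter lam."""
    return max_norm_residual(s_hat, omega) <= lam

s_hat = [[1.0, 0.5], [0.5, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
residual = max_norm_residual(s_hat, identity)
```

A larger $\lambda$ enlarges the feasible set, which lets the $L_1$ objective select a sparser $\Omega$.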

  9. Computation and Theory. Computing: scalable up to thousands of dimensions. huge (Zhao and Liu): language C, scalability $d < 6000$, speed 3x faster. glasso (Hastie et al.): language Fortran, scalability $d < 3000$, speed very fast. Theory: persistency, consistency, sparsistency, optimal rate, ... Key result for analysis: $\|\hat S - \Sigma\|_{\max} = O_P\big(\sqrt{\log d / n}\big)$, where $\hat S$ is the sample covariance and $\Sigma$ is the population covariance.

  10. Many Real Data are non-Gaussian. [Figure: normal Q-Q plot of one typical gene from the Arabidopsis data (Wille et al. 04), $n = 118$, $d = 39$; axes: theoretical quantile vs. sample quantile.] Can we relax the Gaussian assumption without losing statistical and computational efficiency?

  11. The Nonparanormal. Gaussian $\Rightarrow$ Gaussian copula. Nonparanormal definition (Liu, Lafferty, Wasserman 09): a random vector $X = (X_1, \dots, X_d)$ is nonparanormal, $X \sim \mathrm{NPN}_d(\Sigma, \{f_j\}_{j=1}^d)$, in case $f(X) = (f_1(X_1), \dots, f_d(X_d))$ is normal: $f(X) \sim N_d(0, \Sigma)$. Here the $f_j$'s are strictly monotone and $\mathrm{diag}(\Sigma) = 1$. Taking $f_j(t) = \frac{t - \mu_j}{\sigma_j}$ recovers arbitrary Gaussian distributions.
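
One way to read the definition is generatively: draw a Gaussian vector $Z \sim N_d(0, \Sigma)$ and set $X_j = f_j^{-1}(Z_j)$. The sketch below does this for $d = 2$; the function name, the seed handling, and the specific transformations ($f_1 = \log$, $f_2 =$ cube root) are illustrative assumptions, not taken from the talk.

```python
import math
import random

def sample_npn_2d(n, rho, inverse_transforms, seed=0):
    """Draw n points from a bivariate nonparanormal with latent correlation rho.

    inverse_transforms holds (f_1^{-1}, f_2^{-1}); each must be strictly monotone.
    """
    rng = random.Random(seed)
    g1, g2 = inverse_transforms
    samples = []
    for _ in range(n):
        # Latent Gaussian pair with unit variances and correlation rho.
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        samples.append((g1(z1), g2(z2)))
    return samples

# f_1 = log (so f_1^{-1} = exp) and f_2 = cube root (so f_2^{-1} = cube).
samples = sample_npn_2d(500, 0.6, (math.exp, lambda z: z ** 3))
```

The marginals of `samples` are far from Gaussian, yet applying $f_1$ and $f_2$ coordinate-wise returns the latent Gaussian pair.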

  12. Visualization. [Figure: bivariate nonparanormal densities with different transformations.]

  13. Basic Properties. The graph is encoded in the inverse correlation matrix. Let $X \sim \mathrm{NPN}_d(\Sigma, \{f_j\}_{j=1}^d)$ and $\Omega = \Sigma^{-1}$; then $p_X(x) = \frac{1}{(2\pi)^{d/2} |\Omega|^{-1/2}} \exp\big\{ -\frac{1}{2} f(x)^T \Omega f(x) \big\} \prod_{j=1}^d |f_j'(x_j)|$ $\Rightarrow$ $\Omega_{ij} = 0 \Leftrightarrow X_i \perp X_j \mid \text{the rest}$. The likelihood is not jointly convex in $\Omega$ and $\{f_j\}$, so how do we estimate the parameters?

  14. Estimating Transformation Functions. Directly estimate $\{f_j\}_{j=1}^d$ without worrying about $\Omega$. Since $f_j$ is strictly monotone and $f_j(X_j) \sim N(0, 1)$, the CDF of $X_j$ satisfies $F_j(t) = P(X_j \le t) = P\big(f_j(X_j) \le f_j(t)\big) = \Phi\big(f_j(t)\big)$ $\Rightarrow$ $f_j(t) = \Phi^{-1}\big(F_j(t)\big)$. Normal-score transformation: $\hat F_j(t) = \frac{1}{n+1} \sum_{i=1}^n I(x_j^i \le t)$.
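
The normal-score transformation can be sketched in a few lines using only the Python standard library. The function name `normal_scores` is an assumption for illustration; it evaluates $\hat f_j = \Phi^{-1}(\hat F_j)$ at the observed values of one variable.

```python
from statistics import NormalDist

def normal_scores(column):
    """Apply Phi^{-1}(F_hat(t)) to every observed value t in one column.

    F_hat uses the 1/(n+1) scaling from the slide, which keeps its values
    strictly inside (0, 1) so that the inverse normal CDF stays finite.
    """
    n = len(column)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    scores = []
    for t in column:
        count = sum(1 for x in column if x <= t)
        f_hat = count / (n + 1)
        scores.append(std_normal.inv_cdf(f_hat))
    return scores

scores = normal_scores([3.0, 1.0, 2.0])
```

The loop is quadratic in $n$ for clarity; sorting the column once would give the same values in $O(n \log n)$.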

  15. Estimating Inverse Correlation Matrix. Nonparanormal algorithm (Liu, Han, Lafferty, Wasserman 12). Step 1: calculate the Spearman's rank correlation coefficient matrix $\hat R^\rho$. Step 2: transform $\hat R^\rho$ into $\hat\Sigma^\rho$ according to $\hat\Sigma^\rho_{jk} = 2 \sin\big(\frac{\pi}{6} \hat R^\rho_{jk}\big)$ $(*)$; $\hat\Sigma^\rho$ provides a good estimate of $\Sigma$. Step 3: plug $\hat\Sigma^\rho$ into glasso / CLIME / gDantzig to get $\hat\Omega^\rho$ and the graph. The same procedure was independently proposed by Xue and Zou (12).
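
Steps 1 and 2 can be sketched without external libraries. The helper names below are illustrative; the input is assumed to be a list of columns (one list of observations per variable), and Step 3 is left to an existing solver such as glasso.

```python
import math

def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def pearson(a, b):
    """Pearson correlation; assumes neither input is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def spearman_sine_matrix(columns):
    """Step 1 (Spearman's rho on ranks) followed by Step 2 (sine transform)."""
    ranked = [ranks(c) for c in columns]
    d = len(columns)
    sigma = [[1.0] * d for _ in range(d)]
    for j in range(d):
        for k in range(j + 1, d):
            rho = pearson(ranked[j], ranked[k])
            sigma[j][k] = sigma[k][j] = 2.0 * math.sin(math.pi / 6.0 * rho)
    return sigma

columns = [[1, 2, 3, 4], [1, 8, 27, 64], [4, 3, 2, 1]]
sigma_hat = spearman_sine_matrix(columns)
```

Because only ranks enter the computation, the second column (a monotone transformation of the first) yields the same estimate as an untransformed copy would.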

  16. Nonparanormal Theory. Theorem (Liu, Han, Lafferty, Wasserman 12): Let $X \sim \mathrm{NPN}_d(\Sigma, f)$ and $\Omega = \Sigma^{-1}$. Given whatever conditions on $\Sigma$ and $\Omega$ secure the consistency and sparsistency of glasso / CLIME / gDantzig under the Gaussian model, the nonparanormal is also consistent and sparsistent, with exactly the same parametric rates of convergence. $\Rightarrow$ The nonparanormal is a safe replacement for the Gaussian model.

  17. Proof of the Theorem. Proof: the key is to show that $\|\hat\Sigma^\rho - \Sigma\|_{\max} = O_P\big(\sqrt{\log d / n}\big)$. For the Gaussian distribution, Kruskal (1948) shows $\Sigma_{jk} = 2 \sin\big(\frac{\pi}{6} R^\rho_{jk}\big)$, where $\Sigma_{jk}$ is Pearson's correlation coefficient and $R^\rho_{jk}$ is the population Spearman's rank coefficient, which is invariant under monotone transformations. The identity is therefore also true for the nonparanormal distribution. Hence $\|\hat\Sigma^\rho - \Sigma\|_{\max} \lesssim \|\hat R^\rho - R^\rho\|_{\max} = O_P\big(\sqrt{\log d / n}\big)$ by the theory of U-statistics.
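
The step from the Spearman matrix to the transformed matrix follows because the sine function is 1-Lipschitz. A short worked version of that step, written entry-wise and using only the two identities stated on this slide and the previous one:

```latex
\begin{align*}
\bigl|\hat\Sigma^\rho_{jk} - \Sigma_{jk}\bigr|
  &= 2\,\Bigl|\sin\!\Bigl(\tfrac{\pi}{6}\hat R^\rho_{jk}\Bigr)
        - \sin\!\Bigl(\tfrac{\pi}{6} R^\rho_{jk}\Bigr)\Bigr| \\
  &\le 2 \cdot \tfrac{\pi}{6}\,\bigl|\hat R^\rho_{jk} - R^\rho_{jk}\bigr|
   = \tfrac{\pi}{3}\,\bigl|\hat R^\rho_{jk} - R^\rho_{jk}\bigr| .
\end{align*}
```

Taking the maximum over all pairs $(j, k)$ gives $\|\hat\Sigma^\rho - \Sigma\|_{\max} \le \frac{\pi}{3} \|\hat R^\rho - R^\rho\|_{\max}$, so a max-norm rate for the Spearman matrix transfers directly to $\hat\Sigma^\rho$.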

  18. Empirical Results. For non-Gaussian data, the nonparanormal greatly outperforms glasso. Sample $x^i \sim \mathrm{NPN}_d(\Sigma, f)$ with $n = 200$, $d = 40$ and transformation $f_j$. [Figure: true graph, glasso graph and nonparanormal graph, with false negative (FN) and false positive (FP) counts.] Oracle graph: pick the best tuning parameter along the path.

  19. Nonparanormal: Efficiency Loss. For Gaussian data, the nonparanormal loses almost no efficiency. Computationally: no extra cost. Statistically: sample $x^1, \dots, x^n \sim N_d(0, \Sigma)$ with $n = 80$ and $d = 100$; almost no efficiency loss. [Figure: ROC curve for graph recovery; axes: 1-FP vs. 1-FN.]

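  20. Arabidopsis Data. The nonparanormal behaves differently from glasso on the Arabidopsis data. The paths are different. [Figure: glasso graph, nonparanormal graph and their difference at tuning parameters $\lambda_1$, $\lambda_2$, $\lambda_3$; the estimated transformation $\hat f_j$ for MECPS is highly nonlinear.] Nonlinear transformation causes the graph difference.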

  21. Scientific Implications. Cross-pathway interactions? [Figure: nonparanormal and glasso graphs over the MVA pathway and the MEP pathway, with genes HMGR1, HMGR2 and MECPS labeled.] Still open in the current biological literature (Hou et al. 2010).

  22. Tradeoff. Nonparanormal: unrestricted graphs, more flexible distributions. What if the true distribution is not nonparanormal? Trade off structural flexibility for greater nonparametricity.

  23. Forest Densities. Gaussian copula $\Rightarrow$ fully nonparametric distribution. A forest $F = (V, E_F)$ is an acyclic graph. A distribution is supported on a forest $F = (V, E_F)$ if $p_F(x) = \prod_{(i,j) \in E_F} \frac{p(x_i, x_j)}{p(x_i)\, p(x_j)} \cdot \prod_{k \in V} p(x_k)$. Forest density estimator: plug in the estimates $\hat p(x_i, x_j)$, $\hat p(x_k)$ and $\hat F = (V, E_{\hat F})$. Advantages: visualization, computing, distributional flexibility, inference.
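
The factorization can be illustrated on a small discrete example. This is a toy sketch only: the talk concerns continuous densities, and the function name and the dictionary-based inputs are assumptions made for illustration.

```python
from itertools import product

def forest_density(x, edges, pair, single):
    """Evaluate p_F(x) = prod_edges p(xi,xj)/(p(xi)p(xj)) * prod_nodes p(xk).

    single[k][v] is the marginal P(X_k = v); pair[(i, j)][u][v] is the
    bivariate marginal P(X_i = u, X_j = v) for each forest edge (i, j).
    """
    value = 1.0
    for k, xk in enumerate(x):
        value *= single[k][xk]
    for (i, j) in edges:
        value *= pair[(i, j)][x[i]][x[j]] / (single[i][x[i]] * single[j][x[j]])
    return value

# Three binary variables; the forest has one edge (0, 1) and node 2 is isolated.
single = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
pair = {(0, 1): [[0.4, 0.1], [0.1, 0.4]]}
edges = [(0, 1)]

p_000 = forest_density((0, 0, 0), edges, pair, single)
total = sum(forest_density(x, edges, pair, single) for x in product([0, 1], repeat=3))
```

When the pairwise tables are consistent with the univariate marginals, as here, the factorized values sum to one over all states.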

  24. Some Previous Work. Most existing work on forests is for discrete distributions: Chow and Liu (1968), Bach and Jordan (2003), Tan et al. (2010), Chechetka and Guestrin (2007). Our focus: statistical properties in high dimensions.

  25. Estimation. Find a forest $F^{(k)} = \arg\min_F \mathrm{KL}\big(p(x) \,\|\, p_F(x)\big)$ subject to $|E_F| \le k$, where $p(x)$ is the true density and $p_F(x)$ is the projection of $p(x)$ onto $F$. This is a maximum weight forest problem (Kruskal 56): $F^{(k)} = \arg\max_F \sum_{(i,j) \in E_F} I(p_{ij})$ subject to $|E_F| \le k$, with mutual information $I(p_{ij}) = \int p(x_i, x_j) \log \frac{p(x_i, x_j)}{p(x_i)\, p(x_j)} \, dx_i \, dx_j$. Clipped KDE: $\hat p(x_i, x_j)$, $\hat p(x_k)$.
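
The edge weight is the mutual information of a bivariate marginal. As a simplified stand-in for the integral, the sketch below computes the plug-in mutual information of a discrete joint probability table; using a discretized table instead of the clipped KDE from the slide is an assumption made to keep the example short, and the function name is hypothetical.

```python
import math

def mutual_information(joint):
    """Mutual information (in nats) of a joint probability table.

    joint[i][j] = P(A = i, B = j); entries are assumed to sum to one.
    """
    row = [sum(r) for r in joint]
    col = [sum(c) for c in zip(*joint)]
    mi = 0.0
    for i, r in enumerate(joint):
        for j, p in enumerate(r):
            if p > 0:  # 0 * log(0) is treated as 0
                mi += p * math.log(p / (row[i] * col[j]))
    return mi

independent = [[0.25, 0.25], [0.25, 0.25]]
dependent = [[0.5, 0.0], [0.0, 0.5]]
mi_independent = mutual_information(independent)
mi_dependent = mutual_information(dependent)
```

Independent variables give weight zero, so their edge is never preferred over an edge between dependent variables.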

  26. Forest Density Estimation Algorithm. 1. Sort edges according to the empirical mutual information $I(\hat p_{ij})$. 2. Greedily pick a set of edges such that no cycles are formed. 3. Output the obtained forest after $k$ edges have been added.
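
The three steps amount to Kruskal's greedy algorithm stopped after $k$ edges. A minimal sketch with a union-find structure for cycle detection follows; the function name and the `(weight, i, j)` edge format are assumptions, and the weights are taken as already computed.

```python
def max_weight_forest(d, weighted_edges, k):
    """Greedy maximum weight forest on nodes 0..d-1 with at most k edges.

    weighted_edges is an iterable of (weight, i, j) tuples, where weight
    stands for the empirical mutual information of the pair (i, j).
    """
    parent = list(range(d))

    def find(u):
        # Find the component root, compressing the path along the way.
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    forest = []
    # Step 1: sort edges by decreasing weight.
    for _, i, j in sorted(weighted_edges, reverse=True):
        # Step 3: stop once k edges have been added.
        if len(forest) >= k:
            break
        root_i, root_j = find(i), find(j)
        # Step 2: skip any edge that would close a cycle.
        if root_i != root_j:
            parent[root_i] = root_j
            forest.append((i, j))
    return forest

edges = [(0.9, 0, 1), (0.8, 1, 2), (0.7, 0, 2), (0.1, 2, 3)]
forest = max_weight_forest(4, edges, k=3)
```

In the example, edge (0, 2) is skipped because nodes 0 and 2 are already connected through node 1.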

  27. Assumptions for Forest Graph Estimation. (A1) Bivariate marginals $p(x_j, x_k)$ belong to a 2nd-order Hölder class. (A2) $p(x)$ has bounded support (e.g. $[0,1]^d$) and $\kappa_1 \le \min_{j,k} p(x_j, x_k) \le \max_{j,k} p(x_j, x_k) \le \kappa_2$. (A3) $p(x_j, x_k)$ has vanishing partial derivatives on the boundaries. (A4) For a "crucial" set of edges, their mutual information values are distinct enough from each other, to secure enough signal-to-noise ratio for correct structure recovery (Tan, Anandkumar, Willsky 11).

  28. Forest Density Estimation Theory. Oracle forest: $F^{(k)} = \arg\min_{F : |E_F| \le k} \mathrm{KL}\big(p(x) \,\|\, p_F(x)\big)$, where $p$ is the true density, $p_{F^{(k)}}$ is the oracle density estimator and $\hat p_{\hat F^{(k)}}$ is the forest estimator. Theorem, Oracle Sparsistency (Liu et al. 12): let $\mathcal{P}^{(k)}$ be the densities supported by forests with at most $k$ edges. For graph estimation, if $\frac{\log d}{n} \to 0$ (parametric scaling) and the 1d and 2d KDEs use the same bandwidth $h \asymp n^{-1/4}$ (undersmoothing), we have $\sup_{p \in \mathcal{P}^{(k)}} P\big(\hat F^{(k)} \neq F^{(k)}\big) = o(1)$.
