
High-dimensional statistics: Some progress and challenges ahead - PowerPoint PPT Presentation



  1. High-dimensional statistics: Some progress and challenges ahead. Martin Wainwright, UC Berkeley, Departments of Statistics and EECS. University College London Master Class: Lecture 1. Joint work with: Alekh Agarwal, Arash Amini, Po-Ling Loh, Sahand Negahban, Garvesh Raskutti, Pradeep Ravikumar, Bin Yu.

  2. Introduction classical asymptotic theory: sample size n → + ∞ with number of parameters p fixed ◮ law of large numbers, central limit theory ◮ consistency of maximum likelihood estimation

  3. Introduction classical asymptotic theory: sample size n → + ∞ with number of parameters p fixed ◮ law of large numbers, central limit theory ◮ consistency of maximum likelihood estimation modern applications in science and engineering: ◮ large-scale problems: both p and n may be large (possibly p ≫ n ) ◮ need for high-dimensional theory that provides non-asymptotic results for ( n, p )

  4. Introduction classical asymptotic theory: sample size n → + ∞ with number of parameters p fixed ◮ law of large numbers, central limit theory ◮ consistency of maximum likelihood estimation modern applications in science and engineering: ◮ large-scale problems: both p and n may be large (possibly p ≫ n ) ◮ need for high-dimensional theory that provides non-asymptotic results for ( n, p ) curses and blessings of high dimensionality ◮ exponential explosions in computational complexity ◮ statistical curses (sample complexity) ◮ concentration of measure

  5. Introduction modern applications in science and engineering: ◮ large-scale problems: both p and n may be large (possibly p ≫ n ) ◮ need for high-dimensional theory that provides non-asymptotic results for ( n, p ) curses and blessings of high dimensionality ◮ exponential explosions in computational complexity ◮ statistical curses (sample complexity) ◮ concentration of measure Key ideas: what embedded low-dimensional structures are present in data? how can they be exploited algorithmically?

  6. Vignette I: High-dimensional matrix estimation. Want to estimate a covariance matrix Σ ∈ R^{p×p} given i.i.d. samples X_i ∼ N(0, Σ), for i = 1, 2, . . . , n.

  7. Vignette I: High-dimensional matrix estimation. Want to estimate a covariance matrix Σ ∈ R^{p×p} given i.i.d. samples X_i ∼ N(0, Σ), for i = 1, 2, . . . , n. Classical approach: estimate Σ via the sample covariance matrix $\widehat{\Sigma}_n := \frac{1}{n}\sum_{i=1}^n X_i X_i^T$, an average of p × p rank-one matrices.

  8. Vignette I: High-dimensional matrix estimation. Want to estimate a covariance matrix Σ ∈ R^{p×p} given i.i.d. samples X_i ∼ N(0, Σ), for i = 1, 2, . . . , n. Classical approach: estimate Σ via the sample covariance matrix $\widehat{\Sigma}_n := \frac{1}{n}\sum_{i=1}^n X_i X_i^T$, an average of p × p rank-one matrices. Reasonable properties (p fixed, n increasing): unbiased, $\mathbb{E}[\widehat{\Sigma}_n] = \Sigma$; consistent, $\widehat{\Sigma}_n \xrightarrow{\text{a.s.}} \Sigma$ as n → +∞; asymptotic distributional properties available.

  9. Vignette I: High-dimensional matrix estimation. Want to estimate a covariance matrix Σ ∈ R^{p×p} given i.i.d. samples X_i ∼ N(0, Σ), for i = 1, 2, . . . , n. Classical approach: estimate Σ via the sample covariance matrix $\widehat{\Sigma}_n := \frac{1}{n}\sum_{i=1}^n X_i X_i^T$, an average of p × p rank-one matrices. An alternative experiment: fix some α > 0 and study behavior over sequences with p/n = α. Does $\widehat{\Sigma}_n(p)$ converge to anything reasonable?

  10. Empirical vs. Marchenko-Pastur (MP) law (α = 0.5). [Figure: rescaled density of the sample-covariance eigenvalues compared with the theoretical MP density; eigenvalue on the horizontal axis.] Marcenko & Pastur, 1967.

  11. Empirical vs. Marchenko-Pastur (MP) law (α = 0.2). [Figure: rescaled density of the sample-covariance eigenvalues compared with the theoretical MP density; eigenvalue on the horizontal axis.] Marcenko & Pastur, 1967.
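To see where these figures come from, here is a minimal simulation sketch (assuming numpy is available; Σ = I, the aspect ratio α, and the sample size are illustrative choices, not values taken from the slides). With p/n = α held fixed, the eigenvalues of the sample covariance do not concentrate at 1; they spread over the Marchenko-Pastur support $[(1-\sqrt{\alpha})^2, (1+\sqrt{\alpha})^2]$.

```python
import numpy as np

# Illustrative sketch: eigenvalues of the sample covariance when p/n = alpha is held fixed.
# With Sigma = I, the classical (fixed-p) limit would put all eigenvalues at 1; instead they
# spread over [(1 - sqrt(alpha))^2, (1 + sqrt(alpha))^2], as described by the MP law.
rng = np.random.default_rng(0)
alpha, n = 0.5, 2000                      # assumed aspect ratio p/n and sample size
p = int(alpha * n)

X = rng.standard_normal((n, p))           # rows X_i ~ N(0, I_p)
Sigma_hat = X.T @ X / n                   # sample covariance (1/n) sum_i X_i X_i^T
eigs = np.linalg.eigvalsh(Sigma_hat)

# Marchenko-Pastur support and density on a grid, for comparison
lam_min, lam_max = (1 - np.sqrt(alpha)) ** 2, (1 + np.sqrt(alpha)) ** 2
grid = np.linspace(lam_min, lam_max, 200)
mp_density = np.sqrt((lam_max - grid) * (grid - lam_min)) / (2 * np.pi * alpha * grid)

print(f"empirical eigenvalue range: [{eigs.min():.2f}, {eigs.max():.2f}]")
print(f"MP support:                 [{lam_min:.2f}, {lam_max:.2f}]")
```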

  12. Low-dimensional structure: Gaussian graphical models. [Figure: a 5-node graph and the zero pattern of its inverse covariance.] $P(x_1, x_2, \ldots, x_p) \propto \exp\big(-\tfrac{1}{2} x^T \Theta^* x\big)$.

  13. Maximum likelihood with ℓ1-regularization. [Figure: a 5-node graph and the zero pattern of its inverse covariance.] Set-up: samples from a random vector with sparse covariance Σ or sparse inverse covariance Θ* ∈ R^{p×p}.

  14. Maximum likelihood with ℓ1-regularization. [Figure: a 5-node graph and the zero pattern of its inverse covariance.] Set-up: samples from a random vector with sparse covariance Σ or sparse inverse covariance Θ* ∈ R^{p×p}. Estimator (for the inverse covariance): $\widehat{\Theta} \in \arg\min_{\Theta} \Big\{ \Big\langle \tfrac{1}{n}\sum_{i=1}^n x_i x_i^T,\; \Theta \Big\rangle - \log\det(\Theta) + \lambda_n \sum_{j \neq k} |\Theta_{jk}| \Big\}$. Some past work: Yuan & Lin, 2006; d'Aspremont et al., 2007; Bickel & Levina, 2007; El Karoui, 2007; Rothman et al., 2007; Zhou et al., 2007; Friedman et al., 2008; Lam & Fan, 2008; Ravikumar et al., 2008; Zhou, Cai & Huang, 2009.
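As a rough illustration of this estimator (a sketch, not the solver used in the cited work): scikit-learn's GraphicalLasso computes an ℓ1-penalized log-determinant estimate of the inverse covariance. The chain-graph precision matrix, sample size, and penalty level below are assumptions made for the example.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

# Sketch: recover a sparse inverse covariance via l1-penalized log-determinant
# estimation (graphical lasso). The data model and penalty level are illustrative.
rng = np.random.default_rng(1)
p, n = 5, 500

# A sparse precision matrix Theta* corresponding to the chain graph 1-2-3-4-5
Theta_star = np.eye(p)
for j in range(p - 1):
    Theta_star[j, j + 1] = Theta_star[j + 1, j] = 0.4
Sigma_star = np.linalg.inv(Theta_star)

X = rng.multivariate_normal(np.zeros(p), Sigma_star, size=n)

model = GraphicalLasso(alpha=0.05).fit(X)   # alpha plays the role of lambda_n
Theta_hat = model.precision_

print(np.round(Theta_hat, 2))               # near-zero entries should match the non-edges
```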

  15. Gauss-Markov models with hidden variables. [Figure: hidden variable Z connected to the observed X_1, X_2, X_3, X_4.] Problems with hidden variables: conditioned on the hidden Z, the vector X = (X_1, X_2, X_3, X_4) is Gauss-Markov.

  16. Gauss-Markov models with hidden variables. [Figure: hidden variable Z connected to the observed X_1, X_2, X_3, X_4.] Problems with hidden variables: conditioned on the hidden Z, the vector X = (X_1, X_2, X_3, X_4) is Gauss-Markov. The inverse covariance of X satisfies a {sparse, low-rank} decomposition: the 4 × 4 matrix with diagonal entries 1 − µ and off-diagonal entries −µ equals $I_{4\times 4} - \mu\,\mathbf{1}\mathbf{1}^T$, a sparse (identity) term minus a rank-one term. (Chandrasekaran, Parrilo & Willsky, 2010)
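This decomposition can be checked numerically via the Schur complement: if (X, Z) is jointly Gaussian with a sparse joint precision matrix, marginalizing out the hidden Z leaves the precision of X equal to a sparse matrix minus a low-rank (here rank-one) term. The value of µ and the coupling structure below are assumptions made for the sketch.

```python
import numpy as np

# Sketch: marginalizing out one hidden variable turns a sparse joint precision matrix
# into "sparse minus low-rank" for the observed block (Schur complement formula).
mu = 0.2                                   # assumed; needs mu < 1/4 for positive definiteness
K_XX = np.eye(4)                           # observed block: sparse (X's independent given Z)
K_XZ = np.sqrt(mu) * np.ones((4, 1))       # coupling of X_1,...,X_4 to the hidden Z
K_ZZ = np.array([[1.0]])

# Precision of the X-marginal: K_XX - K_XZ K_ZZ^{-1} K_ZX = I - mu * 11^T
Theta_X = K_XX - K_XZ @ np.linalg.inv(K_ZZ) @ K_XZ.T

print(np.round(Theta_X, 3))                # dense: 1 - mu on the diagonal, -mu off-diagonal
print(np.allclose(Theta_X, np.eye(4) - mu * np.ones((4, 4))))   # sparse minus rank-one
```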

  17. Vignette II: High-dimensional sparse linear regression. [Figure: observation model y = Xθ* + w, with X ∈ R^{n×p} and θ* supported on a small subset S, zero on S^c.] Set-up: noisy observations y = Xθ* + w with sparse θ*. Estimator: the Lasso program $\widehat{\theta} \in \arg\min_{\theta} \Big\{ \frac{1}{n}\sum_{i=1}^n (y_i - x_i^T\theta)^2 + \lambda_n \sum_{j=1}^p |\theta_j| \Big\}$. Some past work: Tibshirani, 1996; Chen et al., 1998; Donoho & Huo, 2001; Tropp, 2004; Fuchs, 2004; Efron et al., 2004; Meinshausen & Bühlmann, 2005; Candes & Tao, 2005; Donoho, 2005; Haupt & Nowak, 2005; Zhao & Yu, 2006; Zou, 2006; Koltchinskii, 2007; van …
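A minimal sketch of the Lasso program above, using scikit-learn (whose Lasso minimizes $\frac{1}{2n}\|y - X\theta\|_2^2 + \alpha\|\theta\|_1$, so α corresponds to $\lambda_n$ up to a constant factor). The dimensions, sparsity level, noise level, and choice of $\lambda_n$ are assumptions for the example.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Sketch: sparse linear regression y = X theta* + w with p > n, solved by the Lasso.
rng = np.random.default_rng(2)
n, p, s = 100, 256, 5                      # illustrative sizes: s-sparse theta* in R^p

theta_star = np.zeros(p)
theta_star[rng.choice(p, s, replace=False)] = rng.choice([-1.0, 1.0], s)

X = rng.standard_normal((n, p))
y = X @ theta_star + 0.1 * rng.standard_normal(n)

lam = 0.1 * np.sqrt(np.log(p) / n)         # lambda_n ~ sqrt(log p / n), a common scaling
theta_hat = Lasso(alpha=lam, fit_intercept=False).fit(X, y).coef_

print("true support:     ", np.flatnonzero(theta_star))
print("estimated support:", np.flatnonzero(np.abs(theta_hat) > 1e-3))
```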

  18. Application A: Compressed sensing (Donoho, 2005; Candes & Tao, 2005). [Figure: measurement model y = Xβ*, with X ∈ R^{n×p} and β* ∈ R^p.] (a) Image: vectorize to β* ∈ R^p. (b) Compute n random projections.

  19. Application A: Compressed sensing (Donoho, 2005; Candes & Tao, 2005). In practice, signals are sparse in a transform domain: θ* := Ψβ* is a sparse signal, where Ψ is an orthonormal matrix. [Figure: y = XΨ^T θ*, with the s-sparse vector θ* ∈ R^p.] Reconstruct θ* (and hence the image β* = Ψ^T θ*) by finding a sparse solution to the under-constrained linear system $y = \widetilde{X}\theta$, where $\widetilde{X} = X\Psi^T$ is another random matrix.
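A rough sketch of this transform-domain setup, using scipy's orthonormal DCT as a stand-in for Ψ and the Lasso with a tiny penalty in place of exact basis pursuit; all sizes below are assumed for illustration. The signal β* is dense, but θ* = Ψβ* is sparse, and recovery is run on the compound matrix X̃ = XΨᵀ.

```python
import numpy as np
from scipy.fft import dct, idct
from sklearn.linear_model import Lasso

# Sketch: a signal beta* that is sparse in an orthonormal transform domain (the DCT,
# standing in for Psi), measured through n random projections and recovered by l1.
rng = np.random.default_rng(3)
p, n, s = 256, 80, 5

theta_star = np.zeros(p)                      # sparse transform coefficients
theta_star[rng.choice(p, s, replace=False)] = 1.0
beta_star = idct(theta_star, norm='ortho')    # dense signal: beta* = Psi^T theta*

X = rng.standard_normal((n, p)) / np.sqrt(n)  # n random projections
y = X @ beta_star                             # noiseless measurements y = X beta*

Psi = dct(np.eye(p), axis=0, norm='ortho')    # explicit orthonormal DCT matrix
X_tilde = X @ Psi.T                           # compound design: y = X_tilde theta*

theta_hat = Lasso(alpha=1e-4, fit_intercept=False,
                  max_iter=50_000).fit(X_tilde, y).coef_
beta_hat = idct(theta_hat, norm='ortho')      # back to the signal domain

print("recovered support:", np.flatnonzero(np.abs(theta_hat) > 1e-2))
print("signal error:", np.linalg.norm(beta_hat - beta_star))
```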

  20. Noiseless ℓ1 recovery: unrescaled sample size. [Figure: probability of exact recovery versus raw sample size n (µ = 0), for p = 128, 256, and 512.] Probability of recovery versus sample size n.
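The experiment behind this plot can be sketched as follows (problem sizes and trial counts are assumed; Lasso with a very small penalty is used as a stand-in for exact ℓ1 minimization): for each raw sample size n, draw random designs and sparse signals, solve the noiseless ℓ1 program, and record how often the true support is recovered exactly.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Sketch: empirical probability of exact (noiseless) l1 support recovery versus n.
rng = np.random.default_rng(4)
p, s, trials = 128, 8, 20                     # illustrative problem sizes

def exact_recovery_rate(n):
    hits = 0
    for _ in range(trials):
        theta_star = np.zeros(p)
        theta_star[rng.choice(p, s, replace=False)] = rng.choice([-1.0, 1.0], s)
        X = rng.standard_normal((n, p)) / np.sqrt(n)
        y = X @ theta_star                    # noiseless observations
        theta_hat = Lasso(alpha=1e-5, fit_intercept=False,
                          max_iter=100_000).fit(X, y).coef_
        if np.array_equal(np.flatnonzero(np.abs(theta_hat) > 1e-2),
                          np.flatnonzero(theta_star)):
            hits += 1
    return hits / trials

for n in (25, 50, 75, 100):
    print(f"n = {n:3d}: empirical recovery probability = {exact_recovery_rate(n):.2f}")
```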

  21. Application B: Graph structure estimation. Let G = (V, E) be an undirected graph on p = |V| vertices; a pairwise graphical model factorizes over the edges of the graph: $P(x_1, \ldots, x_p; \theta) \propto \exp\Big( \sum_{(s,t) \in E} \theta_{st}(x_s, x_t) \Big)$. Given n independent and identically distributed (i.i.d.) samples of X = (X_1, . . . , X_p), identify the underlying graph structure.

  22. Pseudolikelihood and neighborhood regression. Markov properties encode neighborhood structure: $(X_s \mid X_{V \setminus s}) \stackrel{d}{=} (X_s \mid X_{N(s)})$, i.e., conditioning on the full graph is equivalent to conditioning on the Markov blanket N(s). [Figure: node X_s with neighborhood N(s) = {t, u, v, w}.] Basis of the pseudolikelihood method (Besag, 1974) and of many graph learning algorithms (Friedman et al., 1999; Csiszár & Talata, 2005; Abbeel et al., 2006; Meinshausen & Bühlmann, 2006).

  23. Graph selection via neighborhood regression. [Figure: binary data matrix split into the column for X_s and the remaining columns $X_{\setminus s}$.] Predict X_s based on $X_{\setminus s} := \{X_t,\; t \neq s\}$.

  24. Graph selection via neighborhood regression. [Figure: binary data matrix split into the column for X_s and the remaining columns $X_{\setminus s}$.] Predict X_s based on $X_{\setminus s} := \{X_t,\; t \neq s\}$. (1) For each node s ∈ V, compute the (regularized) maximum likelihood estimate $\widehat{\theta}[s] := \arg\min_{\theta \in \mathbb{R}^{p-1}} \Big\{ \underbrace{-\tfrac{1}{n}\sum_{i=1}^n L(\theta; X_{i \setminus s})}_{\text{local log-likelihood}} + \underbrace{\lambda_n \|\theta\|_1}_{\text{regularization}} \Big\}$.

  25. Graph selection via neighborhood regression. [Figure: binary data matrix split into the column for X_s and the remaining columns $X_{\setminus s}$.] Predict X_s based on $X_{\setminus s} := \{X_t,\; t \neq s\}$. (1) For each node s ∈ V, compute the (regularized) maximum likelihood estimate $\widehat{\theta}[s] := \arg\min_{\theta \in \mathbb{R}^{p-1}} \Big\{ \underbrace{-\tfrac{1}{n}\sum_{i=1}^n L(\theta; X_{i \setminus s})}_{\text{local log-likelihood}} + \underbrace{\lambda_n \|\theta\|_1}_{\text{regularization}} \Big\}$. (2) Estimate the local neighborhood $\widehat{N}(s)$ as the support of the regression vector $\widehat{\theta}[s] \in \mathbb{R}^{p-1}$.
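A rough end-to-end sketch of this two-step procedure for binary (Ising-type) data, using scikit-learn's ℓ1-penalized logistic regression as the per-node estimator; the chain graph, coupling strength, regularization level, and support threshold are all assumptions made for the example, not values from the slides.

```python
import numpy as np
from itertools import product
from sklearn.linear_model import LogisticRegression

# Sketch: graph selection by per-node l1-regularized logistic regression
# (neighborhood regression) for a small binary pairwise model on a chain graph.
rng = np.random.default_rng(5)
p, n, coupling = 5, 2000, 0.6
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]      # true chain graph

# Exact sampling by enumerating all 2^p configurations (feasible for small p)
configs = np.array(list(product([-1, 1], repeat=p)), dtype=float)
log_weights = np.array([sum(coupling * x[a] * x[b] for a, b in edges) for x in configs])
probs = np.exp(log_weights)
probs /= probs.sum()
X = configs[rng.choice(len(configs), size=n, p=probs)]

# Steps (1)-(2): regress each node on all the others with an l1 penalty; the
# estimated neighborhood is the support of the fitted coefficient vector.
for s in range(p):
    others = [t for t in range(p) if t != s]
    clf = LogisticRegression(penalty='l1', solver='liblinear', C=0.02)
    clf.fit(X[:, others], X[:, s])
    neighborhood = [others[j] for j in np.flatnonzero(np.abs(clf.coef_[0]) > 0.1)]
    print(f"node {s}: estimated neighborhood {neighborhood}")
```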

  26. US Senate network (2004–2006 voting)
