
High-dimensional statistics: Some progress and challenges ahead



  1. High-dimensional statistics: Some progress and challenges ahead. Martin Wainwright, UC Berkeley, Departments of Statistics and EECS. University College London, Master Class: Lecture 3. Joint work with: Alekh Agarwal, Arash Amini, Po-Ling Loh, Sahand Negahban, Garvesh Raskutti, Pradeep Ravikumar, Bin Yu.

  2. Non-parametric regression. Goal: how to predict an output from covariates? Given covariates $(x_1, x_2, x_3, \ldots, x_p)$ and an output variable $y$, we want to predict $y$ based on $(x_1, \ldots, x_p)$.

  3. Non-parametric regression. Goal: how to predict an output from covariates? Given covariates $(x_1, x_2, x_3, \ldots, x_p)$ and an output variable $y$, we want to predict $y$ based on $(x_1, \ldots, x_p)$. Different models, ordered in terms of complexity/richness: linear; non-linear but still parametric; semi-parametric; non-parametric.

  4. Non-parametric regression. Goal: how to predict an output from covariates? Given covariates $(x_1, x_2, x_3, \ldots, x_p)$ and an output variable $y$, we want to predict $y$ based on $(x_1, \ldots, x_p)$. Different models, ordered in terms of complexity/richness: linear; non-linear but still parametric; semi-parametric; non-parametric. Challenge: how to control statistical and computational complexity for a large number of predictors $p$?

  5. High dimensions and sample complexity. Possible models: ordinary linear regression, $y = \sum_{j=1}^{p} \theta_j x_j + w = \langle \theta, x \rangle + w$; general non-parametric model, $y = f(x_1, x_2, \ldots, x_p) + w$.

  6. High dimensions and sample complexity. Possible models: ordinary linear regression, $y = \sum_{j=1}^{p} \theta_j x_j + w = \langle \theta, x \rangle + w$; general non-parametric model, $y = f(x_1, x_2, \ldots, x_p) + w$. Sample complexity: how many samples $n$ are needed for reliable prediction? Linear models without any structure: sample size $n \asymp p/\epsilon^2$ is necessary and sufficient (linear in $p$).

  7. High dimensions and sample complexity. Possible models: ordinary linear regression, $y = \sum_{j=1}^{p} \theta_j x_j + w = \langle \theta, x \rangle + w$; general non-parametric model, $y = f(x_1, x_2, \ldots, x_p) + w$. Sample complexity: how many samples $n$ are needed for reliable prediction? Linear models without any structure: sample size $n \asymp p/\epsilon^2$ is necessary and sufficient (linear in $p$). Linear models with sparsity $s \ll p$: sample size $n \asymp (s \log p)/\epsilon^2$ is necessary and sufficient (logarithmic in $p$).

  8. High dimensions and sample complexity. Possible models: ordinary linear regression, $y = \sum_{j=1}^{p} \theta_j x_j + w = \langle \theta, x \rangle + w$; general non-parametric model, $y = f(x_1, x_2, \ldots, x_p) + w$. Sample complexity: how many samples $n$ are needed for reliable prediction? Linear models without any structure: sample size $n \asymp p/\epsilon^2$ is necessary and sufficient (linear in $p$). Linear models with sparsity $s \ll p$: sample size $n \asymp (s \log p)/\epsilon^2$ is necessary and sufficient (logarithmic in $p$). Non-parametric models, $p$-dimensional with smoothness $\alpha$: curse of dimensionality, $n \asymp (1/\epsilon)^{2 + p/\alpha}$ (exponential in $p$).
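
To make these scalings concrete, here is a small arithmetic sketch with made-up values $p = 1000$, $s = 10$, $\epsilon = 0.1$, and smoothness $\alpha = 2$ (the hidden constants in each $n \asymp \cdot$ are ignored, so only orders of magnitude are meaningful):

    import math

    # Illustrative problem sizes (made-up numbers; the relations n ~ ... hide constants,
    # so these values only convey orders of magnitude).
    p, s, eps, alpha = 1000, 10, 0.1, 2.0

    n_linear = p / eps**2                          # unstructured linear model: n ~ p / eps^2
    n_sparse = s * math.log(p) / eps**2            # s-sparse linear model: n ~ (s log p) / eps^2
    log10_n_nonpar = (2 + p / alpha) * math.log10(1 / eps)   # nonparametric: n ~ (1/eps)^(2 + p/alpha)

    print(f"linear, no structure : n ~ {n_linear:.3g}")           # about 1e5
    print(f"linear, s-sparse     : n ~ {n_sparse:.3g}")           # about 7e3
    print(f"nonparametric        : n ~ 10^{log10_n_nonpar:.0f}")  # 10^502: the curse of dimensionality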

  9. Structure in non-parametric regression. Upshot: it is essential to impose structural constraints for high-dimensional non-parametric models. Reduced-dimension models: a dimension-reducing function $\varphi: \mathbb{R}^p \to \mathbb{R}^k$, where $k \ll p$; a lower-dimensional function $g: \mathbb{R}^k \to \mathbb{R}$; and the composite function $f: \mathbb{R}^p \to \mathbb{R}$ given by $f(x_1, x_2, \ldots, x_p) = g\big( \varphi(x_1, x_2, \ldots, x_p) \big)$.

  10. Structure in non-parametric regression. Reduced-dimension models: a dimension-reducing function $\varphi: \mathbb{R}^p \to \mathbb{R}^k$, where $k \ll p$; a lower-dimensional function $g: \mathbb{R}^k \to \mathbb{R}$; and the composite function $f: \mathbb{R}^p \to \mathbb{R}$ given by $f(x_1, x_2, \ldots, x_p) = g\big( \varphi(x_1, x_2, \ldots, x_p) \big)$. Example: regression on a $k$-dimensional manifold, where $\varphi$ is the co-ordinate mapping. [Figure: illustration of data on a low-dimensional manifold.]

  11. Structure in non-parametric regression. Reduced-dimension models: a dimension-reducing function $\varphi: \mathbb{R}^p \to \mathbb{R}^k$, where $k \ll p$; a lower-dimensional function $g: \mathbb{R}^k \to \mathbb{R}$; and the composite function $f: \mathbb{R}^p \to \mathbb{R}$ given by $f(x_1, x_2, \ldots, x_p) = g\big( \varphi(x_1, x_2, \ldots, x_p) \big)$. Example: ridge functions. Form of model: $f(x_1, x_2, \ldots, x_p) = \sum_{j=1}^{k} g_j\big( \langle a_j, x \rangle \big)$, with dimension-reducing mapping $\varphi(x_1, \ldots, x_p) = A x$ for some $A \in \mathbb{R}^{k \times p}$ (with rows $a_j$).
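
As a minimal illustration of the composite structure $f(x) = g\big(\varphi(x)\big)$ with $\varphi(x) = Ax$, the sketch below fixes arbitrary choices of $k$, $p$, the matrix $A$, and the univariate functions $g_j$ (all illustrative assumptions, not taken from the slides):

    import numpy as np

    rng = np.random.default_rng(0)
    p, k = 50, 3                      # ambient dimension p, reduced dimension k << p

    A = rng.standard_normal((k, p))   # rows a_1, ..., a_k define the projections <a_j, x>
    g = [np.sin, np.tanh, np.square]  # illustrative univariate ridge functions g_j

    def f(x):
        """Ridge-function model f(x) = sum_j g_j(<a_j, x>), i.e. g composed with phi(x) = A x."""
        z = A @ x                     # dimension reduction: phi(x) = A x in R^k
        return sum(g_j(z_j) for g_j, z_j in zip(g, z))

    x = rng.standard_normal(p)
    print(f(x))                       # scalar prediction from the composite model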

  12. Remainder of lecture: (1) Sparse additive models: formulation and applications; families of estimators; efficient implementation as a second-order cone program (SOCP). (2) Statistical rates: kernel complexity; subset selection plus univariate function estimation. (3) Minimax lower bounds: statistics as channel coding; metric entropy and lower bounds.

  13. Sparse additive models. Additive models: $f(x_1, x_2, \ldots, x_p) = \sum_{j=1}^{p} f_j(x_j)$ (Stone, 1985; Hastie & Tibshirani, 1990).

  14. Sparse additive models. Additive models: $f(x_1, x_2, \ldots, x_p) = \sum_{j=1}^{p} f_j(x_j)$ (Stone, 1985; Hastie & Tibshirani, 1990). Additivity with sparsity: $f(x_1, x_2, \ldots, x_p) = \sum_{j \in S} f_j(x_j)$ for an unknown subset $S$ of cardinality $|S| = s$.

  15. Sparse additive models. Additive models: $f(x_1, x_2, \ldots, x_p) = \sum_{j=1}^{p} f_j(x_j)$ (Stone, 1985; Hastie & Tibshirani, 1990). Additivity with sparsity: $f(x_1, x_2, \ldots, x_p) = \sum_{j \in S} f_j(x_j)$ for an unknown subset $S$ of cardinality $|S| = s$. Studied by previous authors: Lin & Zhang, 2006 (COSSO relaxation); Ravikumar et al., 2007 (SpAM back-fitting procedure, consistency); Bach et al., 2008 (multiple kernel learning (MKL), consistency in the classical setting); Meier et al., 2007 ($L^2(P_n)$ regularization); Koltchinskii & Yuan, 2008, 2010; Raskutti, Wainwright & Yu, 2009 (minimax lower bounds).
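
For concreteness, the sketch below generates data from one made-up sparse additive model with $p = 100$ coordinates, an active set $S$ of size $s = 4$, and arbitrary smooth components $f^*_j$; only the additive-plus-sparse structure comes from the slides.

    import numpy as np

    rng = np.random.default_rng(1)
    n, p, s = 200, 100, 4

    S = [3, 17, 42, 88]                                   # active set, unknown to the statistician, |S| = s
    f_j = {3: np.sin,                                     # illustrative smooth univariate components
           17: lambda t: t**2 - 1.0,
           42: np.tanh,
           88: lambda t: np.exp(-t**2)}

    X = rng.uniform(-1, 1, size=(n, p))                   # design: n samples of p covariates
    f_star = sum(f_j[j](X[:, j]) for j in S)              # f*(x_i) = sum_{j in S} f*_j(x_ij)
    y = f_star + 0.1 * rng.standard_normal(n)             # noisy samples y_i = f*(x_i) + w_i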

  16. Application: copula methods and graphical models. Transform $X_j \mapsto Z_j = f_j(X_j)$ and model $(Z_1, \ldots, Z_p)$ as a jointly Gaussian Markov random field, $\mathbb{P}(z_1, z_2, \ldots, z_p) \propto \exp\big( \sum_{(s,t) \in E} \theta_{st} z_s z_t \big)$. [Figure: undirected graphical model over nodes $X_s, X_{t_1}, \ldots, X_{t_5}$.] Exploit Markov properties: neighborhood-based selection for learning graphs (Besag, 1974; Meinshausen & Buhlmann, 2006), combined with the copula method: a semi-parametric approach to graphical model learning (Liu, Lafferty & Wasserman, 2009).
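
One standard way to realize the transform $X_j \mapsto Z_j = f_j(X_j)$, in the spirit of the nonparanormal of Liu, Lafferty & Wasserman, is the rank-based Gaussianization $\hat{f}_j(x) = \Phi^{-1}\big(\hat{F}_j(x)\big)$. The sketch below is an illustrative stand-in under that assumption, not the authors' exact procedure:

    import numpy as np
    from scipy.stats import norm, rankdata

    def gaussianize(X):
        """Rank-based marginal transform Z_j = Phi^{-1}(F_j(X_j)), one column at a time.

        A simple stand-in for the copula transform; the rank/(n+1) estimate of F_j
        keeps Phi^{-1} away from +/- infinity.
        """
        n, p = X.shape
        Z = np.empty_like(X, dtype=float)
        for j in range(p):
            F_hat = rankdata(X[:, j]) / (n + 1.0)   # empirical CDF evaluated at the sample points
            Z[:, j] = norm.ppf(F_hat)               # map to standard Gaussian marginals
        return Z

    # usage: Z = gaussianize(X); then fit neighborhood regressions / a Gaussian
    # graphical model to the columns of Z (Meinshausen & Buhlmann style).
    rng = np.random.default_rng(2)
    X = rng.exponential(size=(500, 5))              # heavily non-Gaussian marginals
    Z = gaussianize(X)
    print(Z.mean(axis=0).round(2), Z.std(axis=0).round(2))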

  17. Sparse and smooth. Noisy samples $y_i = f^*(x_{i1}, x_{i2}, \ldots, x_{ip}) + w_i$ for $i = 1, 2, \ldots, n$ of an unknown function $f^*$ with: a sparse representation, $f^* = \sum_{j \in S} f^*_j$; and smooth univariate functions, $f^*_j \in \mathcal{H}_j$.

  18. Sparse and smooth. Noisy samples $y_i = f^*(x_{i1}, x_{i2}, \ldots, x_{ip}) + w_i$ for $i = 1, 2, \ldots, n$ of an unknown function $f^*$ with: a sparse representation, $f^* = \sum_{j \in S} f^*_j$; and smooth univariate functions, $f^*_j \in \mathcal{H}_j$. Disregarding computational cost: $\min_{|S| \le s} \; \min_{\substack{f = \sum_{j \in S} f_j \\ f_j \in \mathcal{H}_j}} \; \underbrace{\frac{1}{n} \sum_{i=1}^{n} \big( y_i - f(x_i) \big)^2}_{\|y - f\|_n^2}$.
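
Purely to illustrate why this combinatorial formulation is intractable, here is a brute-force sketch: it enumerates every subset of size $s$ and fits the additive model on each subset by ridge regression on the summed kernel $\sum_{j \in S} K_j$. The Gaussian kernel, the toy problem sizes, and the small ridge level added for numerical stability are all assumptions, not part of the slides:

    import itertools
    import numpy as np

    def gauss_kernel(x, bandwidth=0.5):
        """Univariate Gaussian kernel matrix K[i, l] = exp(-(x_i - x_l)^2 / (2 h^2))."""
        d = x[:, None] - x[None, :]
        return np.exp(-d**2 / (2 * bandwidth**2))

    def best_subset_fit(X, y, s, lam=1e-3):
        """Enumerate |S| = s and fit f = sum_{j in S} f_j by ridge on the summed kernel."""
        n, p = X.shape
        K = [gauss_kernel(X[:, j]) for j in range(p)]    # one kernel matrix per coordinate
        best = (np.inf, None)
        for S in itertools.combinations(range(p), s):    # exponentially many subsets!
            K_S = sum(K[j] for j in S)                   # kernel of the additive function class over S
            alpha = np.linalg.solve(K_S + n * lam * np.eye(n), y)
            err = np.mean((y - K_S @ alpha) ** 2)        # empirical loss ||y - f||_n^2
            if err < best[0]:
                best = (err, S)
        return best

    rng = np.random.default_rng(3)
    n, p, s = 60, 8, 2                                   # tiny p: already C(8, 2) = 28 fits
    X = rng.uniform(-1, 1, size=(n, p))
    y = np.sin(3 * X[:, 1]) + X[:, 5] ** 2 + 0.1 * rng.standard_normal(n)
    print(best_subset_fit(X, y, s))                      # hopefully selects coordinates {1, 5}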

  19. Sparse and smooth. Disregarding computational cost: $\min_{|S| \le s} \; \min_{\substack{f = \sum_{j \in S} f_j \\ f_j \in \mathcal{H}_j}} \; \underbrace{\frac{1}{n} \sum_{i=1}^{n} \big( y_i - f(x_i) \big)^2}_{\|y - f\|_n^2}$. The 1-Hilbert-norm as a convex surrogate: $\|f\|_{1,\mathcal{H}} := \sum_{j=1}^{p} \|f_j\|_{\mathcal{H}_j}$.

  20. Sparse and smooth. Disregarding computational cost: $\min_{|S| \le s} \; \min_{\substack{f = \sum_{j \in S} f_j \\ f_j \in \mathcal{H}_j}} \; \underbrace{\frac{1}{n} \sum_{i=1}^{n} \big( y_i - f(x_i) \big)^2}_{\|y - f\|_n^2}$. The 1-Hilbert-norm as a convex surrogate: $\|f\|_{1,\mathcal{H}} := \sum_{j=1}^{p} \|f_j\|_{\mathcal{H}_j}$. The 1-$L^2(P_n)$-norm as a convex surrogate: $\|f\|_{1,n} := \sum_{j=1}^{p} \|f_j\|_{L^2(P_n)}$, where $\|f_j\|^2_{L^2(P_n)} := \frac{1}{n} \sum_{i=1}^{n} f_j^2(x_{ij})$.
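
If each component is represented in the usual representer-theorem form $f_j(\cdot) = \sum_{i=1}^{n} \alpha_{ji}\, k_j(\cdot, x_{ij})$, the two surrogate norms reduce to $\|f_j\|_{\mathcal{H}_j} = \sqrt{\alpha_j^\top K_j \alpha_j}$ and $\|f_j\|_{L^2(P_n)} = \|K_j \alpha_j\|_2 / \sqrt{n}$. A minimal sketch under that representation (the random positive semi-definite matrices stand in for actual kernel matrices):

    import numpy as np

    def surrogate_norms(K_list, alpha_list):
        """Compute ||f||_{1,H} and ||f||_{1,n} for components f_j represented as K_j @ alpha_j.

        K_list[j]     : n x n kernel matrix of coordinate j, K_j[i, l] = k_j(x_ij, x_lj)
        alpha_list[j] : length-n coefficient vector of f_j in the kernel expansion
        """
        n = K_list[0].shape[0]
        hilbert = sum(np.sqrt(max(a @ K @ a, 0.0))            # ||f_j||_{H_j} = sqrt(alpha' K alpha)
                      for K, a in zip(K_list, alpha_list))
        empirical = sum(np.linalg.norm(K @ a) / np.sqrt(n)    # ||f_j||_{L2(Pn)} = ||K alpha|| / sqrt(n)
                        for K, a in zip(K_list, alpha_list))
        return hilbert, empirical

    rng = np.random.default_rng(4)
    n, p = 50, 3
    K_list = []
    for _ in range(p):
        B = rng.standard_normal((n, n))
        K_list.append(B @ B.T / n)                            # arbitrary PSD stand-ins for kernel matrices
    alpha_list = [rng.standard_normal(n) for _ in range(p)]
    print(surrogate_norms(K_list, alpha_list))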

  21. A family of estimators. Noisy samples $y_i = f^*(x_{i1}, x_{i2}, \ldots, x_{ip}) + w_i$ for $i = 1, 2, \ldots, n$ of an unknown function $f^* = \sum_{j \in S} f^*_j$.

  22. A family of estimators. Noisy samples $y_i = f^*(x_{i1}, x_{i2}, \ldots, x_{ip}) + w_i$ for $i = 1, 2, \ldots, n$ of an unknown function $f^* = \sum_{j \in S} f^*_j$. Estimator: $\hat{f} \in \arg\min_{f = \sum_{j=1}^{p} f_j} \Big\{ \frac{1}{n} \sum_{i=1}^{n} \big( y_i - \sum_{j=1}^{p} f_j(x_{ij}) \big)^2 + \rho_n \|f\|_{1,\mathcal{H}} + \mu_n \|f\|_{1,n} \Big\}$.

  23. A family of estimators. Noisy samples $y_i = f^*(x_{i1}, x_{i2}, \ldots, x_{ip}) + w_i$ for $i = 1, 2, \ldots, n$ of an unknown function $f^* = \sum_{j \in S} f^*_j$. Estimator: $\hat{f} \in \arg\min_{f = \sum_{j=1}^{p} f_j} \Big\{ \frac{1}{n} \sum_{i=1}^{n} \big( y_i - \sum_{j=1}^{p} f_j(x_{ij}) \big)^2 + \rho_n \|f\|_{1,\mathcal{H}} + \mu_n \|f\|_{1,n} \Big\}$. Two kinds of regularization: $\|f\|_{1,n} = \sum_{j=1}^{p} \|f_j\|_{L^2(P_n)} = \sum_{j=1}^{p} \sqrt{\frac{1}{n} \sum_{i=1}^{n} f_j^2(x_{ij})}$, and $\|f\|_{1,\mathcal{H}} = \sum_{j=1}^{p} \|f_j\|_{\mathcal{H}_j}$.
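
In the kernel-expansion representation $f_j = \sum_i \alpha_{ji}\, k_j(\cdot, x_{ij})$, the doubly regularized program becomes a second-order cone program in the coefficient vectors $\alpha_j$, consistent with the "efficient implementation as SOCP" item in the outline. The sketch below solves it with the generic modelling tool cvxpy; the kernel matrices, regularization levels, and jitter term are placeholders rather than anything specified in the slides:

    import numpy as np
    import cvxpy as cp

    def fit_doubly_regularized(K_list, y, rho_n, mu_n):
        """Kernel formulation of the estimator
            min_f (1/n) sum_i (y_i - sum_j f_j(x_ij))^2 + rho_n ||f||_{1,H} + mu_n ||f||_{1,n},
        with each f_j = K_j @ alpha_j, solved as an SOCP via cvxpy.
        """
        n, p = len(y), len(K_list)
        # Cholesky factors L_j with K_j = L_j L_j^T, so that ||f_j||_{H_j} = ||L_j^T alpha_j||_2.
        L_list = [np.linalg.cholesky(K + 1e-8 * np.eye(n)) for K in K_list]
        alphas = [cp.Variable(n) for _ in range(p)]
        fitted = sum(K_list[j] @ alphas[j] for j in range(p))      # vector of sum_j f_j(x_i), i = 1..n
        loss = cp.sum_squares(fitted - y) / n
        hilbert_pen = sum(cp.norm(L_list[j].T @ alphas[j], 2) for j in range(p))                  # ||f||_{1,H}
        empirical_pen = sum(cp.norm(K_list[j] @ alphas[j], 2) / np.sqrt(n) for j in range(p))     # ||f||_{1,n}
        prob = cp.Problem(cp.Minimize(loss + rho_n * hilbert_pen + mu_n * empirical_pen))
        prob.solve()
        return [a.value for a in alphas]

    # usage (hypothetical): alphas = fit_doubly_regularized([K_1, ..., K_p], y, rho_n=0.1, mu_n=0.1)

Coordinates $j$ whose fitted component norms are driven exactly to zero are estimated as inactive, which is how the two group-type penalties induce sparsity across components.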
