SLIDE 1

Nonparametric Sparsity

John Lafferty Larry Wasserman

John Lafferty: Computer Science Dept. and Machine Learning Dept., Carnegie Mellon University
Larry Wasserman: Department of Statistics and Machine Learning Dept., Carnegie Mellon University

SLIDE 2

Motivation

  • “Modern” data are very high dimensional
  • In order to be “learnable,” there must be lower-dimensional structure
  • Developing practical algorithms with theoretical guarantees for beating the curse of (apparent) dimensionality is a main scientific challenge for our field

SLIDE 3

Motivation

  • Sparsity is emerging as a key concept in statistics and machine learning
  • Dramatic progress in recent years on understanding sparsity in parametric settings
  • Nonparametric sparsity: Wide open

SLIDE 4

Outline

  • High dimensional learning: Parametric and nonparametric
  • Rodeo: Greedy, sparse nonparametric regression
  • Extensions of the Rodeo


SLIDE 5

Parametric Case: Variable Selection in Linear Models

$$Y = \sum_{j=1}^{d} \beta_j X_j + \epsilon = X^T \beta + \epsilon$$

where $d$ might be larger than $n$. Predictive risk:
$$R = \mathbb{E}\big(Y_{\mathrm{new}} - X_{\mathrm{new}}^T \beta\big)^2.$$

Want to choose a subset $(X_j : j \in S)$, $S \subset \{1, \ldots, d\}$, to make $R$ small.

Bias-variance tradeoff:
  • small $S \Rightarrow$ Bias ↑, Variance ↓
  • large $S \Rightarrow$ Bias ↓, Variance ↑
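A minimal numpy sketch (not from the slides; the truth, sample sizes, and subset sizes are all illustrative) makes the tradeoff concrete: the true $\beta$ is sparse, and both under- and over-sized subsets $S$ inflate the estimated predictive risk.

```python
import numpy as np

# Illustrative only: sparse truth, OLS fits on nested subsets of
# increasing size, predictive risk estimated on a held-out sample.
rng = np.random.default_rng(0)
n, d, n_new = 100, 50, 10_000
beta = np.zeros(d)
beta[:5] = 2.0                        # only the first 5 variables matter

X = rng.normal(size=(n, d))
y = X @ beta + rng.normal(size=n)
X_new = rng.normal(size=(n_new, d))
y_new = X_new @ beta + rng.normal(size=n_new)

for k in [2, 5, 20, 50]:              # candidate subset sizes |S|
    S = np.arange(k)
    beta_hat, *_ = np.linalg.lstsq(X[:, S], y, rcond=None)
    risk = np.mean((y_new - X_new[:, S] @ beta_hat) ** 2)
    print(f"|S| = {k:2d}: estimated predictive risk = {risk:.3f}")
# |S| = 2 underfits (bias up); |S| = 50 overfits (variance up).
```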

SLIDE 6

Lasso/Basis Pursuit

(Chen & Donoho, 1994; Tibshirani, 1996)

$$\sum_{j=1}^{d} |\beta_j| \le t$$

[Figure: level sets of the squared error meeting the $\ell_1$ ball.]

For orthogonal designs, the solution is given by soft thresholding the least squares coefficients:
$$\hat{\beta}_j = \mathrm{sign}\big(\hat{\beta}_j^{\,\mathrm{ls}}\big)\big(|\hat{\beta}_j^{\,\mathrm{ls}}| - \lambda\big)_+$$
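The soft thresholding rule is one line of code; a small sketch with hypothetical coefficient values:

```python
import numpy as np

def soft_threshold(b: np.ndarray, lam: float) -> np.ndarray:
    """Coordinatewise soft thresholding: sign(b) * (|b| - lam)_+."""
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

# Applied to (hypothetical) least squares coefficients with lambda = 1:
print(soft_threshold(np.array([3.0, -0.4, 1.2, -2.5]), lam=1.0))
# [ 2.  -0.   0.2 -1.5] -- small coefficients are zeroed, large ones shrunk
```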

SLIDE 7

Convex Relaxations for Sparse Signal Recovery

Desired problem: $\min \|\beta\|_0$ such that $\|X\beta - y\|_2 \le \epsilon$. Requires intractable combinatorial optimization.

Convex optimization surrogate: $\min \|\beta\|_1$ such that $\|X\beta - y\|_2 \le \epsilon$.

Substantial progress recently on theoretical justification
(Candès and Tao, Donoho, Tropp, Meinshausen and Bühlmann, Wainwright, Zhao and Yu, Fan and Peng, ...)
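As a sketch, the constrained $\ell_1$ surrogate can be written directly in cvxpy (an assumed dependency, not mentioned on the slides; the data here are synthetic):

```python
import numpy as np
import cvxpy as cp

# Synthetic sparse recovery instance: n = 50 measurements of a
# 200-dimensional vector with 5 nonzero entries.
rng = np.random.default_rng(0)
n, d = 50, 200
X = rng.normal(size=(n, d))
beta_true = np.zeros(d)
beta_true[:5] = 1.0
y = X @ beta_true + 0.01 * rng.normal(size=n)

# min ||beta||_1  subject to  ||X beta - y||_2 <= eps
beta = cp.Variable(d)
problem = cp.Problem(cp.Minimize(cp.norm1(beta)),
                     [cp.norm2(X @ beta - y) <= 0.5])
problem.solve()
print("nonzero coefficients recovered:", int(np.sum(np.abs(beta.value) > 1e-3)))
```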

SLIDE 8

Nonparametric Regression

Given $(X_1, Y_1), \ldots, (X_n, Y_n)$ where $Y_i \in \mathbb{R}$, $X_i = (X_{1i}, \ldots, X_{di})^T \in \mathbb{R}^d$,
$$Y_i = m(X_{1i}, \ldots, X_{di}) + \epsilon_i, \quad \mathbb{E}(\epsilon_i) = 0$$

Risk: $R(m, \hat{m}) = \int \mathbb{E}\big(\hat{m}(x) - m(x)\big)^2 \, dx$

Minimax theorem:
$$\inf_{\hat{m}} \sup_{m \in \mathcal{F}} R(m, \hat{m}) \asymp \left(\frac{1}{n}\right)^{4/(4+d)}$$
where $\mathcal{F}$ is a class of functions with 2 smooth derivatives. Note the curse of dimensionality.

SLIDE 9

The Curse of Dimensionality

(Sobolev space of order 2)

[Figures: (left) risk versus sample size; (right) sample size required to achieve risk 0.01 as the dimension grows from 10 to 20, reaching roughly $10^{12}$ at $d = 20$.]
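Inverting the minimax rate gives the sample size required for a target risk, $n \approx \mathrm{risk}^{-(4+d)/4}$; a short check reproduces the scale of the figure:

```python
# Sample size needed so that n^{-4/(4+d)} = 0.01, i.e. n = 0.01^{-(4+d)/4}.
for d in [1, 5, 10, 20]:
    n = 0.01 ** (-(4 + d) / 4)
    print(f"d = {d:2d}: n ~ {n:.1e}")
# d = 20 requires on the order of 1e12 samples, matching the figure.
```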

SLIDE 10

Nonparametric Sparsity

  • In many applications, it is reasonable to expect the true function depends on only a small number of variables
  • Assume $m(x) = m(x_R)$, where $x_R = (x_j)_{j \in R}$ are the relevant variables, with $|R| = r \ll d$
  • Can hope to achieve the better minimax rate $n^{-4/(4+r)}$
  • Challenge: Variable selection in nonparametric regression
SLIDE 11

Rodeo: Regularization of derivative expectation operator

  • A general strategy for nonparametric estimation: regularize derivatives of the estimator with respect to smoothing parameters
  • A simple new algorithm for simultaneous bandwidth and variable selection in nonparametric regression
  • Theoretical analysis: the algorithm correctly determines the relevant variables, with high probability, and achieves the (near) optimal minimax rate of convergence
  • Examples showing performance consistent with theory

SLIDE 12

Key Idea in Rodeo: Change of Representation

$$F(h) = F(0) + \int_0^h F'(x) \, dx$$

SLIDE 13

Rodeo: The Main Idea

  • Use a nonparametric estimator based on a kernel
  • Start with large bandwidths in each dimension, for an estimate having small variance but high bias
  • Choosing a large bandwidth is like ignoring a variable
  • Compute the derivatives of the estimate with respect to bandwidth
  • Threshold the derivatives to get a sparse estimate
  • Intuition: If a variable is irrelevant, then changing the bandwidth in that dimension should result in only a small change in the estimator

SLIDE 14

Rodeo: The Main Idea

[Figure: bandwidth paths in the $(h_1, h_2)$ plane, from the starting bandwidths to the optimal bandwidth; the Rodeo path tracks the ideal path.]

SLIDE 15

Using Local Linear Smoothing

The estimator can be written as
$$\hat{m}_h(x) = \sum_{i=1}^{n} G(X_i, x, h)\, Y_i$$

Our method is based on the statistic
$$Z_j = \frac{\partial \hat{m}_h(x)}{\partial h_j} = \sum_{i=1}^{n} G_j(X_i, x, h)\, Y_i$$

The estimated variance is
$$s_j^2 = \mathrm{Var}(Z_j \mid X_1, \ldots, X_n) = \sigma^2 \sum_{i=1}^{n} G_j^2(X_i, x, h)$$

SLIDE 16

Rodeo: Hard Thresholding Version

1. Select a parameter $0 < \beta < 1$ and an initial bandwidth $h_0$.

2. Initialize the bandwidths and activate all covariates:
   (a) $h_j = h_0$, $j = 1, 2, \ldots, d$
   (b) $A = \{1, 2, \ldots, d\}$

3. While $A$ is nonempty, do for each $j \in A$:
   (a) Compute the estimated derivative expectation: $Z_j$ and $s_j$
   (b) Compute the threshold $\lambda_j = s_j \sqrt{2 \log n}$
   (c) If $|Z_j| > \lambda_j$, set $h_j \leftarrow \beta h_j$; otherwise remove $j$ from $A$

4. Output bandwidths $h = (h_1, \ldots, h_d)$ and estimator $\tilde{m}(x) = \hat{m}_h(x)$
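A minimal Python sketch of these steps, with deliberate simplifications relative to the slides: a Nadaraya-Watson estimator stands in for local linear smoothing, the derivative weights $G_j$ are approximated by finite differences rather than computed in closed form, and the noise level $\sigma$ is assumed known.

```python
import numpy as np

def nw_weights(x, X, h):
    """Effective kernel weights G(X_i, x, h) of a Nadaraya-Watson
    estimator with a Gaussian product kernel and bandwidths h."""
    w = np.exp(-0.5 * np.sum(((X - x) / h) ** 2, axis=1))
    return w / np.sum(w)

def rodeo(x, X, Y, beta=0.9, h0=1.0, sigma=1.0, delta=1e-4, max_passes=100):
    """Hard-thresholding rodeo at a single point x (sketch)."""
    n, d = X.shape
    h = np.full(d, h0)                  # step 2(a): h_j = h0
    active = list(range(d))             # step 2(b): A = {1, ..., d}
    for _ in range(max_passes):         # step 3 (with a safety cap)
        if not active:
            break
        for j in list(active):
            G = nw_weights(x, X, h)
            h_pert = h.copy()
            h_pert[j] += delta
            Gj = (nw_weights(x, X, h_pert) - G) / delta  # ~ dG/dh_j
            Zj = Gj @ Y                 # step 3(a): derivative statistic Z_j
            sj = sigma * np.sqrt(np.sum(Gj ** 2))        # conditional sd of Z_j
            if abs(Zj) > sj * np.sqrt(2 * np.log(n)):    # steps 3(b)-(c)
                h[j] *= beta            # shrink: variable looks relevant
            else:
                active.remove(j)        # freeze: variable looks irrelevant
    return h, nw_weights(x, X, h) @ Y   # step 4: bandwidths and estimate

# Example in the spirit of the next slide: two relevant variables out of five.
rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 5))
Y = 2 * (X[:, 0] + 1) ** 3 + 2 * np.sin(10 * X[:, 1]) + rng.normal(size=500)
h, m_hat = rodeo(np.full(5, 0.5), X, Y)
print(h)  # bandwidths for the first two coordinates should come out smaller
```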

SLIDE 17

Example: $m(x) = 2(x_1 + 1)^3 + 2\sin(10 x_2)$, $d = 20$

[Figures: bandwidth versus Rodeo step for variables 1-20, shown as an average over 50 runs and as a typical run.]

SLIDE 18

Loss with $r = 2$, Increasing Dimension

[Figure: loss as the dimension increases from 5 to 30, comparing leave-one-out cross-validation with the Rodeo.]

SLIDE 19

Main Result: Near Optimal Rates

Theorem. Suppose that $d = O(\log n / \log\log n)$, $h_0 = 1/\log\log n$, and $|m_{jj}(x)| > 0$. Then the rodeo outputs bandwidths $h$ that satisfy
$$P\big(h_j = h_0 \text{ for all } j > r\big) \to 1$$
and for every $\epsilon > 0$,
$$P\big(n^{-1/(4+r)-\epsilon} \le h_j \le n^{-1/(4+r)+\epsilon} \text{ for all } j \le r\big) \to 1.$$

Let $T_n$ be the stopping time of the algorithm. Then $P(t_L \le T_n \le t_U) \to 1$ where
$$t_L = \frac{1}{(r+4)\log(1/\beta)} \log\!\left(\frac{n A_{\min}}{\log n \,(\log\log n)^d}\right), \qquad
t_U = \frac{1}{(r+4)\log(1/\beta)} \log\!\left(\frac{n A_{\max}}{\log n \,(\log\log n)^d}\right)$$

SLIDE 20

Greedy Rodeo and LARS

  • Rodeo can be viewed as a nonparametric version of least angle regression (LARS) (Efron et al., 2004)
  • In forward stagewise, variable selection is incremental. LARS adds the variable most correlated with the residuals of the current fit.
  • For the Rodeo, the derivative is essentially the correlation between the output and the derivative of the effective kernel
  • Reducing the bandwidth is like adding more of that variable

SLIDE 21

LARS Regularization Paths

[Figure: standardized coefficients versus $|\beta|/\max|\beta|$ along the LARS path.]

SLIDE 22

Greedy Rodeo Bandwidth Paths

[Figure: bandwidth versus greedy Rodeo step; paths shown for variables 1, 2, 3, 4, 7, 8, 9.]

Rodeo order: 3 (body mass index), 9 (serum), 7 (serum), 4 (blood pressure), 1 (age), 2 (sex), 8 (serum), 5 (serum), 10 (serum), 6 (serum). LARS order: 3, 9, 4, 7, 2, 10, 5, 8, 6, 1.

SLIDE 23

Extensions

  • Sparse density estimation
  • Local polynomial estimation
  • Classification using Rodeo with generalized linear models
  • Other nonparametric estimators
  • Data-adaptive basis pursuit


SLIDE 24

Combining Rodeo and Lasso: Data-Adaptive Basis Pursuit

(with Han Liu)

[Figures: the true regression line and the fitted curve from a data-adaptive basis with J = 36, both plotted over $x \in [0, 1]$.]

SLIDE 25

Data-Adaptive Basis Pursuit

  • Recall the idea of the Rodeo:
$$\tilde{m}(x) = m_1(x) - \int_0^1 \big\langle Z(x, h(s)),\, \dot{h}(s) \big\rangle \, ds$$
  • Let $\Phi(X_i) = \mathrm{vec}\big(Z(X_i, h(s_k)) \cdot dh(s_k)\big)$ over a grid of bandwidths
  • Run the Lasso:
$$\min_\beta \; \|Y - \Phi(X)\beta\|^2 \text{ such that } \|\beta\|_1 \le t$$
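A hypothetical sketch of this pipeline in Python: increments of a one-dimensional kernel smoother along a geometric bandwidth grid stand in for the $Z(X_i, h(s_k)) \cdot dh(s_k)$ terms, and the lasso (in penalized rather than constrained form) selects among the resulting basis columns. The `kernel_smooth` helper and all settings are illustrative, not from the slides.

```python
import numpy as np
from sklearn.linear_model import Lasso

def kernel_smooth(x_eval, X, Y, h):
    """One-dimensional Nadaraya-Watson smoother with bandwidth h."""
    w = np.exp(-0.5 * ((X[None, :] - x_eval[:, None]) / h) ** 2)
    return (w @ Y) / w.sum(axis=1)

rng = np.random.default_rng(0)
n = 200
X = rng.uniform(size=n)
Y = 0.3 * np.sin(8 * X) + 0.05 * rng.normal(size=n)

# Basis: increments of the smoother across a geometric bandwidth grid,
# giving J = 36 columns (as on the earlier slide).
hs = 0.5 * 0.9 ** np.arange(37)
fits = np.stack([kernel_smooth(X, X, Y, h) for h in hs], axis=1)
Phi = np.diff(fits, axis=1)

# Lasso step: min ||Y - Phi beta||^2 plus an l1 penalty.
model = Lasso(alpha=1e-4, max_iter=50_000).fit(Phi, Y)
print("bases selected:", np.flatnonzero(model.coef_ != 0))
```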

SLIDE 26

Data-Adaptive Basis Pursuit

[Figure: six data-adaptive basis functions (bases 6, 9, 15, 18, 27, 30) plotted against $x$.]
SLIDE 27

Summary

  • Sparsity is playing an increasingly important role in statistics and machine learning
  • In order to be “learnable,” there must be lower-dimensional structure
  • Nonparametric sparsity: many open problems
  • Rodeo: conceptually simple and practical, with theoretically nice properties