Estimator selection. Christophe Giraud, Université Paris-Sud et Paris-Saclay.



SLIDE 1

Estimator selection

Christophe Giraud

Université Paris-Sud et Paris-Saclay

M2 MSV and MDA

1/22. Christophe Giraud (Paris Sud), High-dimensional statistics, M2 MSV & MDA.

SLIDE 2

What shall I do with these data?

Classical steps

1 Elucidate the question(s) you want to answer, and check your data. This requires some

◮ deep discussions with specialists (biologists, physicians, etc.),
◮ low-level analyses (PCA, LDA, etc.) to detect key features, outliers, etc.,
◮ and ... experience!

2 Choose and apply an estimation procedure.

3 Check your results (residuals, possible bias, stability, etc.).

SLIDE 3

Setting

Gaussian regression with unknown variance:

Y_i = f*_i + ε_i,  with ε_i i.i.d. ~ N(0, σ²),

where f* = (f*_1, ..., f*_n)^T and σ² are unknown, and we want to estimate f*.

Ex 1: sparse linear regression

f* = Xβ* with β* "sparse" in some sense, and X ∈ R^(n×p) with possibly p > n.
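This setting is easy to simulate. Below is a minimal sketch (Python with NumPy; all names, dimensions, and coefficient values are illustrative, not from the slides) of a coordinate-sparse instance of Ex 1 with p > n:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k, sigma = 100, 200, 5, 1.0        # p > n: high-dimensional design

X = rng.standard_normal((n, p))          # design matrix in R^(n x p)
beta_star = np.zeros(p)
beta_star[:k] = 3.0                      # "sparse" beta*: only k nonzero coordinates
f_star = X @ beta_star                   # f* = X beta*
Y = f_star + sigma * rng.standard_normal(n)   # Y_i = f*_i + eps_i, eps_i ~ N(0, sigma^2)
```

In practice both σ and the support of β* are unknown; only (Y, X) is observed.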

SLIDE 4

A plethora of estimators

Sparse linear regression

Coordinate sparsity: Lasso, Dantzig, Elastic-Net, Exponential-Weighting, projections on subspaces {V_λ : λ ∈ Λ} given by PCA, Random Forest, PLS, etc.

Structured sparsity: Group-Lasso, Fused-Lasso, Bayesian estimators, etc.

SLIDE 5

Important practical issues

Which estimator shall I use?

Lasso? Group-Lasso? Random-Forest? Exponential-Weighting? Forward–Backward?

With which tuning parameter?

Which penalty level λ for the Lasso? Which β for exponential weighting? etc.

SLIDE 6

Difficulties

No procedure is universally better than the others.

A sensible choice of the tuning parameters depends on

◮ some unknown characteristics of f (sparsity, smoothness, etc.),
◮ the unknown variance σ².

Even if you are a pure Lasso enthusiast, you are missing some key information needed to apply the Lasso procedure properly!

SLIDE 7

The objective

Formalization

We have a collection of estimation schemes (Lasso, Group-Lasso, etc.), and for each scheme we have a grid of different values for the tuning parameters. Putting all the estimators together, we end up with a collection {f̂_λ : λ ∈ Λ} of estimators.

Ideal objective

Select the "best" estimator among the collection {f̂_λ : λ ∈ Λ}.

SLIDE 8

Cross-Validation

The most popular technique for choosing tuning parameters

SLIDE 9

Principle

Split the data into a training set and a validation set: the estimators are built on the training set, and the validation set is used to estimate their prediction risk.

Most popular cross-validation scheme

Hold-out: a single split of the data into training and validation.

V-fold CV: the data is split into V subsamples. Each subsample is successively removed for validation, the remaining data being used for training.

Leave-one-out: corresponds to n-fold CV.

Leave-q-out: every possible subset of cardinality q of the data is removed for validation, the remaining data being used for training.

Classical choice of V: between 5 and 10 (remains tractable).
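As a concrete illustration of this principle, here is a minimal V-fold CV loop in Python/NumPy. To keep it self-contained it tunes a closed-form ridge estimator rather than the Lasso; the function names, data, and grid of tuning parameters are illustrative, not from the slides.

```python
import numpy as np

def ridge_fit(X, Y, lam):
    """Closed-form ridge estimator, used here as a stand-in estimation procedure."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

def vfold_cv(X, Y, lambdas, V=5, seed=0):
    """Estimate the prediction risk of each tuning parameter by V-fold CV."""
    n = X.shape[0]
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, V)
    risks = np.zeros(len(lambdas))
    for test in folds:                       # each fold is removed in turn
        train = np.setdiff1d(idx, test)
        for j, lam in enumerate(lambdas):
            beta = ridge_fit(X[train], Y[train], lam)   # built on the training set
            risks[j] += np.sum((Y[test] - X[test] @ beta) ** 2)
    return risks / n                         # averaged squared prediction error

rng = np.random.default_rng(1)
X = rng.standard_normal((60, 10))
Y = 2.0 * X[:, 0] + 0.5 * rng.standard_normal(60)
lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]
risks = vfold_cv(X, Y, lambdas)
lam_best = lambdas[int(np.argmin(risks))]    # selected tuning parameter
```

The same loop applies verbatim to any estimator: only `ridge_fit` needs to be replaced by the procedure being tuned.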

SLIDE 10

V -fold CV

Fold 1:  train  train  train  train  test
Fold 2:  train  train  train  test   train
Fold 3:  train  train  test   train  train
Fold 4:  train  test   train  train  train
Fold 5:  test   train  train  train  train

Recursive data splitting for 5-fold Cross-Validation

Pros and Cons

Universality: Cross-Validation can be implemented in most statistical frameworks and for most estimation procedures. It usually (but not always!) gives good results in practice. But it comes with limited theoretical guarantees in high-dimensional settings.

SLIDE 11

Complexity selection (LinSelect)

SLIDE 12

Principle

To adapt the ideas of model selection to estimator selection.

Pros and Cons

Strong theoretical guarantees, computationally feasible, good performance in the Gaussian setting, but relies on the Gaussian assumption.

SLIDE 13

Reminder on BM model selection

m̂ ∈ argmin_{m ∈ M} { ‖Y − f̂_m‖² + σ² pen_BM(m) },  with pen_BM(m) ≈ 2 log(1/π_m).

3 difficulties

1 We cannot explore a huge collection of models: we restrict to a subcollection {S_m : m ∈ M̂}.

2 A good model S_m for f̂_λ must achieve a good balance between the approximation error ‖f̂_λ − Proj_{S_m} f̂_λ‖² and the complexity log(1/π_m).

3 The criterion must not depend on the unknown variance σ²: we replace σ² in front of the penalty term by the estimator

σ̂²_m = ‖Y − Proj_{S_m} Y‖² / (n − dim(S_m)).   (1)
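The variance estimator (1) is straightforward to compute. A small NumPy sketch (the model matrix, coefficients, and dimensions below are illustrative):

```python
import numpy as np

def sigma2_hat(Y, S):
    """Estimator (1): ||Y - Proj_{S_m} Y||^2 / (n - dim(S_m)).

    S is an n x d matrix whose columns span the model S_m
    (assumed of full column rank, so dim(S_m) = d)."""
    n = Y.shape[0]
    Q, _ = np.linalg.qr(S)             # orthonormal basis of S_m
    resid = Y - Q @ (Q.T @ Y)          # Y - Proj_{S_m} Y
    return np.sum(resid ** 2) / (n - S.shape[1])

rng = np.random.default_rng(2)
n, sigma = 200, 1.5
S = rng.standard_normal((n, 3))        # a 3-dimensional model containing f*
Y = S @ np.array([1.0, -2.0, 0.5]) + sigma * rng.standard_normal(n)
est = sigma2_hat(Y, S)                 # unbiased for sigma^2 when f* lies in S_m
```

When f* does not lie in S_m the estimator is biased upward, which is exactly why a good model must balance approximation error and complexity.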

SLIDE 14

LinSelect procedure

Selection procedure

We select f̂_λ̂, with λ̂ = argmin_λ crit(f̂_λ), where

crit(f̂_λ) = inf_{m ∈ M̂} { ‖Y − Proj_{S_m} f̂_λ‖² + (1/2) ‖f̂_λ − Proj_{S_m} f̂_λ‖² + pen_π(m) σ̂²_m },

where σ̂²_m is given by (1) and pen_π(m) ≈ pen_BM(m).

SLIDE 15

Example: tuning the Lasso

Collection of estimators: the Lasso estimators { f̂_λ = X β̂_λ : λ > 0 }.

Collection of models {S_m, m ∈ M} and probability π: those for coordinate-sparse regression.

Subcollection: M̂ = { m̂(λ) : λ > 0 and 1 ≤ |m̂(λ)| ≤ n/(3 log p) },  with m̂(λ) = supp(β̂_λ).

Theoretical guarantee: under some suitable assumptions,

‖X(β̂_λ̂ − β*)‖² ≤ C inf_{β ≠ 0} { ‖X(β* − β)‖² + |β|₀ log(p) σ² / κ²(β) }

with probability at least 1 − C₁ p^(−C₂).

SLIDE 16

Scaled-Lasso

Automatic tuning of the Lasso

SLIDE 17

Scale invariance

The estimator β̂(Y, X) of β* is scale-invariant if β̂(sY, X) = s β̂(Y, X) for any s > 0.

Example: the estimator

β̂(Y, X) ∈ argmin_β { ‖Y − Xβ‖² + λ Ω(β) },

where Ω is homogeneous of degree 1, is not scale-invariant unless λ is proportional to σ. In particular, the Lasso estimator is not scale-invariant when λ is not proportional to σ.
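This failure of scale invariance can be checked numerically. When the columns of X are orthonormal (X^T X = I_p), the Lasso has the closed-form soft-thresholding solution, which makes the check a few lines (the design and numbers below are illustrative):

```python
import numpy as np

def soft(z, t):
    """Soft-thresholding, applied coordinatewise."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_orthonormal(Y, X, lam):
    """argmin_b ||Y - Xb||^2 + lam*|b|_1 when X^T X = I_p."""
    return soft(X.T @ Y, lam / 2.0)

rng = np.random.default_rng(3)
n, p = 50, 5
X, _ = np.linalg.qr(rng.standard_normal((n, p)))     # orthonormal columns
Y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + 0.3 * rng.standard_normal(n)

lam = 1.0
b1 = lasso_orthonormal(Y, X, lam)
b2 = lasso_orthonormal(10 * Y, X, lam)        # fixed lam: b2 differs from 10 * b1
b3 = lasso_orthonormal(10 * Y, X, 10 * lam)   # lam rescaled with the data: b3 = 10 * b1
```

Rescaling λ together with the data (equivalently, taking λ proportional to the noise level) restores the invariance β̂(sY, X) = s β̂(Y, X).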

SLIDE 18

Rescaling

Idea:

◮ estimate σ with σ̂ = ‖Y − Xβ‖/√n,
◮ set λ = µ σ̂,
◮ divide the criterion by σ̂ to get a convex problem.

Scale-invariant criterion

β̂(Y, X) ∈ argmin_β { √n ‖Y − Xβ‖ + µ Ω(β) }.

Example: the scaled Lasso

β̂ ∈ argmin_{β ∈ R^p} { √n ‖Y − Xβ‖ + µ |β|₁ }.
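One way to solve the scaled-Lasso criterion is to alternate between a Lasso step at noise level σ̂ and an update of σ̂, in the spirit of Sun & Zhang's algorithm. The sketch below is illustrative: it is restricted to an orthonormal design (so the Lasso step is a closed-form soft-thresholding), and µ is an arbitrary small value, not the theoretical choice. It also makes the scale invariance explicit.

```python
import numpy as np

def soft(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def scaled_lasso_orthonormal(Y, X, mu, n_iter=50):
    """Alternating scheme for argmin_b sqrt(n)*||Y - Xb|| + mu*|b|_1, X^T X = I_p.

    Each pass solves the Lasso at penalty lam = 2*mu*sigma
    (a soft-threshold at level mu*sigma), then refits sigma."""
    n = Y.shape[0]
    z = X.T @ Y
    sigma = np.linalg.norm(Y) / np.sqrt(n)        # crude initial noise level
    beta = np.zeros_like(z)
    for _ in range(n_iter):
        beta = soft(z, mu * sigma)                # Lasso step at current sigma
        sigma = np.linalg.norm(Y - X @ beta) / np.sqrt(n)   # sigma update
    return beta, sigma

rng = np.random.default_rng(5)
n, p = 50, 5
X, _ = np.linalg.qr(rng.standard_normal((n, p)))   # orthonormal columns
Y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + 0.3 * rng.standard_normal(n)

beta1, sig1 = scaled_lasso_orthonormal(Y, X, mu=2.0)
beta2, sig2 = scaled_lasso_orthonormal(10 * Y, X, mu=2.0)
# scale invariance: beta2 is approximately 10 * beta1 and sig2 approximately 10 * sig1
```

Every operation in the scheme is homogeneous of degree 1 in Y, so the returned estimator inherits the scale invariance β̂(sY, X) = s β̂(Y, X) of the criterion.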

SLIDE 19

Pros and Cons

Universal choice: µ = 5√(log p).

Strong theoretical guarantees (Corollary 5.5), computationally feasible, but poor performance in practice.

SLIDE 20

Numerical experiments (1/2)

Tuning the Lasso

165 examples extracted from the literature; each example e is evaluated on the basis of 400 runs.

Comparison to the oracle β̂_λ*:

procedure            0%     50%    75%    90%
Lasso 10-fold CV     1.03   1.11   1.15   1.19
Lasso LinSelect      0.97   1.03   1.06   1.19
Square-Root Lasso    1.32   2.61   3.37   11.2

For each procedure ℓ, quantiles of R(β̂_λ̂ℓ; β₀) / R(β̂_λ*; β₀) over the examples e = 1, ..., 165.

SLIDE 21

Numerical experiments (2/2)

Computation time

n     p     10-fold CV   LinSelect   Square-Root
100   100   4 s          0.21 s      0.18 s
100   500   4.8 s        0.43 s      0.4 s
500   500   300 s        11 s        6.3 s

Packages: enet for 10-fold CV and LinSelect; lars for the Square-Root Lasso (procedure of Sun & Zhang).

SLIDE 22

Impact of the unknown variance?

Case of coordinate-sparse linear regression

[Figure: minimax prediction risk over k-sparse signals, as a function of the sparsity k. Two curves: "σ unknown and k unknown" versus "σ known or k known"; the two risks separate in the ultra-high-dimensional regime 2k log(p/k) ≥ n.]
