

SLIDE 1

Gaussian model selection with an unknown variance

Yannick Baraud

Laboratoire J.A. Dieudonné, Université de Nice Sophia Antipolis, baraud@unice.fr. Joint work with C. Giraud and S. Huet.

SLIDE 2

The statistical framework

We observe $Y \sim \mathcal{N}(\mu, \sigma^2 I_n)$, where both parameters $\mu \in \mathbb{R}^n$ and $\sigma > 0$ are unknown.

Our aim: Estimate µ from the observation of Y.

SLIDE 3

Example: Variable selection

$Y \sim \mathcal{N}(\mu, \sigma^2 I_n)$ with $\mu = \sum_{j=1}^{p} \theta_j X_j$, where $p$ is possibly larger than $n$, but we expect that $|\{j,\ \theta_j \neq 0\}| \ll n$.

Our aim: Estimate $\mu$ and $\{j,\ \theta_j \neq 0\}$.
SLIDE 4

The estimation strategy: model selection

We start with a collection $\{S_m,\, m \in \mathcal{M}\}$ of linear subspaces (models) of $\mathbb{R}^n$, each giving a projection estimator:
$$S_m \longrightarrow \hat\mu_m = \Pi_{S_m} Y.$$

Our aim: select $\hat m = \hat m(Y)$ among $\mathcal{M}$ in such a way that $\mathbb{E}\left[|\mu - \hat\mu_{\hat m}|^2\right]$ is close to $\inf_{m \in \mathcal{M}} \mathbb{E}\left[|\mu - \hat\mu_m|^2\right]$.
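To make this concrete, here is a minimal Python sketch (not the authors' code; the function name is hypothetical) of the projection estimator $\hat\mu_m = \Pi_{S_m} Y$ for a model spanned by a subset $m$ of the columns of a design matrix $X$:

```python
import numpy as np

def proj_estimator(Y, X, m):
    """Projection estimator mu_hat_m = Pi_{S_m} Y for the model
    S_m = Span{X[:, j], j in m}, computed by least squares."""
    if len(m) == 0:                # the null model S_emptyset = {0}
        return np.zeros_like(Y)
    Xm = X[:, list(m)]
    coef, *_ = np.linalg.lstsq(Xm, Y, rcond=None)
    return Xm @ coef
```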

SLIDE 5

Variable selection (continued)

$Y \sim \mathcal{N}(\mu, \sigma^2 I_n)$ with $\mu = \sum_{j=1}^{p} \theta_j X_j$. For $m \subset \{1, \dots, p\}$ such that $|m| \le D_{\max} < n$, we set $S_m = \mathrm{Span}\{X_j,\, j \in m\}$.

Ordered variable selection. Take $\mathcal{M}_o = \{\{1, \dots, D\},\ D \le D_{\max}\} \cup \{\emptyset\}$.

(Almost) complete variable selection. Take $\mathcal{M}_c = \{m \subset \{1, \dots, p\},\ |m| \le D_{\max}\}$.

SLIDE 6

Some selection criteria

$$\hat m = \operatorname*{argmin}_{m \in \mathcal{M}} \left[\, |Y - \hat\mu_m|^2 + \mathrm{pen}(m) \,\right]$$

  • Mallows' $C_p$ (1973): $\mathrm{pen}(m) = 2 D_m \sigma^2$, where $D_m = \dim(S_m)$.
  • Birgé & Massart (2001): $\mathrm{pen}(m) = \mathrm{pen}(m, \sigma^2)$.
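A minimal sketch of selection by such a penalized criterion, assuming $\sigma^2$ is known as in Mallows' $C_p$ (it reuses the hypothetical `proj_estimator` above, and takes $D_m = |m|$, i.e. assumes the selected columns are linearly independent):

```python
def select_Cp(Y, X, models, sigma2):
    """Mallows' Cp: minimize |Y - mu_hat_m|^2 + 2*D_m*sigma2 over the
    collection; each model m is a tuple of column indices, D_m = |m|."""
    def crit(m):
        res = Y - proj_estimator(Y, X, m)
        return res @ res + 2 * len(m) * sigma2
    return min(models, key=crit)
```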

SLIDE 7

Advantages:

  • Non-asymptotic theory.
  • Variable selection: no assumption on the predictors $X_j$.
  • Bayesian flavor: allows (to some extent) taking knowledge/intuition into account.

Drawbacks:

  • The computation of $\hat m$ may not be feasible if $\mathcal{M}$ is too large.

SLIDE 8

For the problem of variable selection:

Tibshirani (1996), Lasso:
$$\hat\theta_\lambda = \operatorname*{argmin}_{\theta \in \mathbb{R}^p} \left\{ \Big| Y - \sum_{j=1}^{p} \theta_j X_j \Big|^2 + \lambda\, |\theta|_1 \right\}.$$

Candès & Tao (2007), Dantzig selector:
$$\hat\theta_\lambda = \operatorname*{argmin} \left\{ |\theta|_1,\ \max_{j=1,\dots,p} \Big| \big\langle X_j,\ Y - \sum_{j'=1}^{p} \theta_{j'} X_{j'} \big\rangle \Big| \le \lambda \right\}$$

$$\longrightarrow \quad \hat m_\lambda = \big\{ j,\ \hat\theta_{\lambda,j} \neq 0 \big\} \quad \text{and} \quad \hat\mu_{\hat m_\lambda} = \sum_{j \in \hat m_\lambda} \hat\theta_{\lambda,j}\, X_j.$$
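A minimal sketch of the Lasso route using scikit-learn (note that sklearn's `Lasso` minimizes $\frac{1}{2n}|Y - X\theta|^2 + \alpha |\theta|_1$, so its `alpha` corresponds to $\lambda$ only up to rescaling):

```python
import numpy as np
from sklearn.linear_model import Lasso

def lasso_support(Y, X, alpha):
    """Fit the lasso, return the selected support m_hat_lambda and the
    corresponding fitted mean vector mu_hat."""
    theta = Lasso(alpha=alpha, fit_intercept=False).fit(X, Y).coef_
    m_hat = np.flatnonzero(theta)            # {j : theta_hat_j != 0}
    return m_hat, X[:, m_hat] @ theta[m_hat]
```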

SLIDE 9

Advantages:

  • The computation is feasible even if $p$ is very large.
  • Non-asymptotic theory.

Drawbacks:

  • The procedures work under suitable assumptions on the predictors $X_j$.
  • There is no way to check these assumptions if $p$ is very large.
  • Blind to knowledge/intuition.
SLIDE 10

For all these procedures, there remains the problem of estimating $\sigma^2$ or choosing $\lambda$. These parameters depend on the data distribution and must be estimated. In general there is no natural estimator of $\sigma^2$ (e.g., complete variable selection with $p > n$); one resorts to cross-validation... The performance of the procedures crucially depends upon these parameters.

SLIDE 11

Other selection criteria

$$\mathrm{Crit}(m) = |Y - \hat\mu_m|^2 \left( 1 + \frac{\mathrm{pen}(m)}{n - D_m} \right)$$

$$\mathrm{Crit}'(m) = n \log\left( \frac{|Y - \hat\mu_m|^2}{n} \right) + \mathrm{pen}'(m)$$

Both criteria select the same model if one takes
$$\mathrm{pen}'(m) = n \log\left( 1 + \frac{\mathrm{pen}(m)}{n - D_m} \right) \approx \mathrm{pen}(m),$$
since then $\mathrm{Crit}'(m) = n \log(\mathrm{Crit}(m)/n)$, an increasing transformation of $\mathrm{Crit}(m)$.
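A short numeric illustration of this equivalence (a sketch under the reconstruction above; function names hypothetical):

```python
import numpy as np

def crit(Y, mu_hat_m, D_m, pen_m):
    n = len(Y)
    rss = np.sum((Y - mu_hat_m) ** 2)
    return rss * (1 + pen_m / (n - D_m))

def crit_prime(Y, mu_hat_m, D_m, pen_m):
    n = len(Y)
    rss = np.sum((Y - mu_hat_m) ** 2)
    pen_prime = n * np.log(1 + pen_m / (n - D_m))
    # equals n * log(crit(...) / n): same minimizer over m
    return n * np.log(rss / n) + pen_prime
```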
SLIDE 12

With the same two criteria
$$\mathrm{Crit}(m) = |Y - \hat\mu_m|^2 \left( 1 + \frac{\mathrm{pen}(m)}{n - D_m} \right), \qquad \mathrm{Crit}'(m) = n \log\left( \frac{|Y - \hat\mu_m|^2}{n} \right) + \mathrm{pen}'(m),$$
the classical choices are:

  • Akaike (1969), FPE: $\mathrm{pen}(m) = 2 D_m$
  • Akaike (1973), AIC: $\mathrm{pen}'(m) = 2 D_m$
  • Schwarz/Akaike (1978), BIC/SIC: $\mathrm{pen}'(m) = D_m \log(n)$
  • Saito (1994), AMDL: $\mathrm{pen}'(m) = 3 D_m \log(n)$

SLIDE 13

Two questions

1. What can be said about these selection criteria from a non-asymptotic point of view?

2. Is it possible to propose other penalties that would take into account the complexity of the collection $\{S_m,\, m \in \mathcal{M}\}$?

SLIDE 14

What do we mean by complexity?

We shall say that the collection $\{S_m,\, m \in \mathcal{M}\}$ is $a$-complex (with $a \ge 0$) if
$$|\{m \in \mathcal{M},\ D_m = D\}| \le e^{aD} \quad \forall D \ge 1.$$

For the collection $\{S_m,\, m \in \mathcal{M}_o\}$: $\ |\{m \in \mathcal{M}_o,\ D_m = D\}| \le 1 \implies a = 0$.

For the collection $\{S_m,\, m \in \mathcal{M}_c\}$: $\ |\{m \in \mathcal{M}_c,\ D_m = D\}| \le \binom{p}{D} \le p^D \implies a = \log(p)$.

SLIDE 15

Penalty choice with regard to complexity

Let $\varphi(x) = (x - 1 - \log(x))/2$ for $x \ge 1$, and consider an $a$-complex collection $\{S_m,\, m \in \mathcal{M}\}$. If for some $K, K' > 1$
$$K \le \frac{\mathrm{pen}(m)}{\varphi^{-1}(a)\, D_m} \le K', \quad \forall m \in \mathcal{M}^*,$$
and one selects
$$\hat m = \operatorname*{argmin}_{m \in \mathcal{M}}\ |Y - \hat\mu_m|^2 \left( 1 + \frac{\mathrm{pen}(m)}{n - D_m} \right),$$
then
$$\frac{\mathbb{E}\left[ |\mu - \hat\mu_{\hat m}|^2 \right] / \sigma^2}{\left( \inf_{m \in \mathcal{M}} \mathbb{E}\left[ |\mu - \hat\mu_m|^2 \right] / \sigma^2 \right) \vee 1} \le C(K)\, K'\, \varphi^{-1}(a).$$
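A minimal numeric sketch of $\varphi^{-1}$ (assumptions: $\varphi$ is increasing on $[1, \infty)$ with $\varphi(1) = 0$, so the inverse is found by root-finding on a bracketing interval; the bracket is an arbitrary safe choice):

```python
import numpy as np
from scipy.optimize import brentq

def phi(x):
    return (x - 1 - np.log(x)) / 2   # increasing on [1, inf), phi(1) = 0

def phi_inv(a):
    hi = 2 * a + 10                  # phi grows like x/2, so this brackets the root
    return brentq(lambda x: phi(x) - a, 1.0, hi)

print(phi_inv(0.0))          # 1.0  (ordered selection, a = 0)
print(phi_inv(np.log(32)))   # ~10.3; of order 2*log(n) for large n
```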

SLIDE 16

Case of ordered variable selection

Here $a = 0$ and $\varphi^{-1}(a) = 1$. If for all $m \in \mathcal{M}$ such that $D_m \neq 0$
$$1 < K \le \frac{\mathrm{pen}(m)}{D_m} \le K',$$
one has
$$\frac{\mathbb{E}\left[ |\mu - \hat\mu_{\hat m}|^2 \right] / \sigma^2}{\left( \inf_{m \in \mathcal{M}} \mathbb{E}\left[ |\mu - \hat\mu_m|^2 \right] / \sigma^2 \right) \vee 1} \le C(K)\, K' \quad \longrightarrow \quad \text{FPE and AIC (for } n \text{ large enough).}$$

SLIDE 17

Case of the complete variable selection with p = n

Here $a = \log(n)$ and $\varphi^{-1}(a) \approx 2 \log(n)$. If for all $m \in \mathcal{M}$ such that $D_m \neq 0$
$$1 < K \le \frac{\mathrm{pen}(m)}{2 D_m \log(n)} \le K',$$
then
$$\frac{\mathbb{E}\left[ |\mu - \hat\mu_{\hat m}|^2 \right] / \sigma^2}{\left( \inf_{m \in \mathcal{M}} \mathbb{E}\left[ |\mu - \hat\mu_m|^2 \right] / \sigma^2 \right) \vee 1} \le C(K)\, K' \log(n) \quad \longrightarrow \quad \text{AMDL (but not AIC, FPE, BIC).}$$

SLIDE 18

New penalties

Definition. Let $X_D \sim \chi^2(D)$ and $X_N \sim \chi^2(N)$ be two independent $\chi^2$ random variables. Define
$$H_{D,N}(x) = \frac{1}{\mathbb{E}(X_D)} \times \mathbb{E}\left[ \left( X_D - x\, \frac{X_N}{N} \right)_+ \right], \quad x \ge 0.$$

Definition. To each $S_m$ with $D_m < n - 1$, we associate a weight $L_m \ge 0$ and the penalty
$$\mathrm{pen}(m) = 1.1\, \frac{N_m}{N_m - 1}\, H^{-1}_{D_m+1,\, N_m-1}\left( e^{-L_m} \right),$$
where $N_m = n - D_m$.
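A minimal Monte Carlo sketch of this penalty (assumptions: $\mathbb{E}(X_D) = D$, and $H_{D,N}$ decreases from $H_{D,N}(0) = 1$ to $0$, so its inverse can be bracketed and bisected; the sample size, bracket and seed are arbitrary choices, not part of the authors' procedure):

```python
import numpy as np
from scipy.optimize import brentq

def H_inv(D, N, y, n_mc=200_000, seed=0):
    """Invert H_{D,N} by Monte Carlo: draw the two chi-square samples once,
    so x -> H(x) is deterministic and decreasing, then bisect H(x) = y.
    Requires 0 < y < H(0) ~ 1, i.e. L_m > 0 up to Monte Carlo error."""
    rng = np.random.default_rng(seed)
    xd = rng.chisquare(D, n_mc)
    xn = rng.chisquare(N, n_mc) / N
    H = lambda x: np.maximum(xd - x * xn, 0).mean() / D   # E(X_D) = D
    return brentq(lambda x: H(x) - y, 0.0, 1e6)

def pen(Dm, n, Lm):
    Nm = n - Dm
    return 1.1 * Nm / (Nm - 1) * H_inv(Dm + 1, Nm - 1, np.exp(-Lm))

print(pen(3, 20, 2.0))   # example: D_m = 3, n = 20, L_m = 2
```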

SLIDE 19

Theorem. Let $\{S_m,\, m \in \mathcal{M}\}$ be a collection of models and $\{L_m,\, m \in \mathcal{M}\}$ a family of weights. Assume that $N_m \ge 7$ and $D_m \vee L_m \le n/2$ for all $m \in \mathcal{M}$. Define
$$\hat m = \operatorname*{argmin}_{m \in \mathcal{M}}\ |Y - \hat\mu_m|^2 \left( 1 + \frac{\mathrm{pen}(m)}{n - D_m} \right).$$
The estimator $\hat\mu_{\hat m}$ satisfies, up to a multiplicative constant,
$$\mathbb{E}\left[ \frac{|\mu - \hat\mu_{\hat m}|^2}{\sigma^2} \right] \lesssim \inf_{m \in \mathcal{M}} \left( \mathbb{E}\left[ \frac{|\mu - \hat\mu_m|^2}{\sigma^2} \right] + L_m \right) + \sum_{m \in \mathcal{M}} (D_m + 1)\, e^{-L_m}.$$

SLIDE 20

Ordered variable selection

For $m \in \mathcal{M}_o$, $m = \{1, \dots, D\}$, take $L_m = |m|$:
$$\longrightarrow \quad \sum_{m \in \mathcal{M}_o} (D_m + 1)\, e^{-L_m} \le 2.51.$$
If $|m| \le D_{\max} \le [n/2] \wedge p$,
$$\mathbb{E}\left[ \frac{|\mu - \hat\mu_{\hat m}|^2}{\sigma^2} \right] \lesssim \inf_{m \in \mathcal{M}_o} \left( \mathbb{E}\left[ \frac{|\mu - \hat\mu_m|^2}{\sigma^2} \right] \vee 1 \right).$$
slide-21
SLIDE 21

logo

Complete Variable selection

For $m \in \mathcal{M}_c$, take $L_m = \log\binom{p}{|m|} + 2 \log(|m| + 1)$:
$$\longrightarrow \quad \sum_{m \in \mathcal{M}_c} (D_m + 1)\, e^{-L_m} \le \log(p).$$
If $|m| \le D_{\max} \le [n/(2 \log(p))] \wedge p$,
$$\mathbb{E}\left[ \frac{|\mu - \hat\mu_{\hat m}|^2}{\sigma^2} \right] \lesssim \log(p)\ \inf_{m \in \mathcal{M}_c} \left( \mathbb{E}\left[ \frac{|\mu - \hat\mu_m|^2}{\sigma^2} \right] \vee 1 \right).$$
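A small numeric check of these two bounds on $\sum_m (D_m + 1)\, e^{-L_m}$ (a sketch; $p = 100$ and $D_{\max} = 20$ are arbitrary choices):

```python
from math import comb, exp, log

p, Dmax = 100, 20

# ordered collection: one model {1, ..., D} per dimension, L_m = D
s_ordered = sum((D + 1) * exp(-D) for D in range(Dmax + 1))

# complete collection: comb(p, D) models per dimension; with
# L_m = log(comb(p, D)) + 2*log(D + 1), each term collapses to 1/(D + 1)
s_complete = sum(comb(p, D) * (D + 1)
                 * exp(-(log(comb(p, D)) + 2 * log(D + 1)))
                 for D in range(Dmax + 1))

print(s_ordered)            # ~2.51
print(s_complete, log(p))   # ~3.6 <= log(100) ~ 4.6
```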
SLIDE 22

Complete Variable selection: order of magnitude of the penalty

[Figure: the penalty above (with K = 1.1) compared with the AMDL penalty, plotted as a function of the dimension D, for n = 32 (left panel) and n = 512 (right panel).]

SLIDE 23

Comparison with Lasso/Adaptive Lasso

The "Adaptive Lasso", proposed by Zou (2006):
$$\hat\theta_\lambda = \operatorname*{argmin}_{\theta \in \mathbb{R}^p} \left\{ \Big| Y - \sum_{j=1}^{p} \theta_j X_j \Big|^2 + \lambda \sum_{j=1}^{p} \frac{|\theta_j|}{|\tilde\theta_j|^\gamma} \right\}$$
$$\longrightarrow \quad \lambda, \gamma \text{ obtained by cross-validation.}$$
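A minimal sketch of the adaptive lasso via rescaling (a standard reduction, not the authors' procedure): substituting $\beta_j = \theta_j\, |\tilde\theta_j|^{-\gamma}$ turns the weighted penalty into a plain $\ell_1$ penalty. The OLS pilot estimate and $\gamma = 1$ are illustrative choices, and sklearn's `Lasso` uses a $1/(2n)$ factor on the quadratic term, so its `alpha` matches $\lambda$ only up to rescaling.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

def adaptive_lasso(X, Y, lam, gamma=1.0):
    """Adaptive lasso via rescaling: lasso in beta_j = theta_j / w_j with
    w_j = |theta_tilde_j|^gamma, then map back to theta."""
    theta_tilde = LinearRegression(fit_intercept=False).fit(X, Y).coef_
    w = np.abs(theta_tilde) ** gamma   # assumes no exact zeros in the pilot fit
    beta = Lasso(alpha=lam, fit_intercept=False).fit(X * w, Y).coef_
    return beta * w                    # back to the theta parametrization
```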

SLIDE 24

Simulation 1

Consider the predictors $X_1, \dots, X_8 \in \mathbb{R}^{20}$ such that for all $i = 1, \dots, 20$ the rows $X_i^T = (X_{1,i}, \dots, X_{8,i})$ are i.i.d. $\mathcal{N}(0, \Gamma)$ with $\Gamma_{j,k} = 0.5^{|j-k|}$, and $\mu = 3 X_1 + 1.5 X_2 + 2 X_5$.
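A sketch of this design (variable names and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 8
Gamma = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
X = rng.multivariate_normal(np.zeros(p), Gamma, size=n)  # rows i.i.d. N(0, Gamma)
mu = 3 * X[:, 0] + 1.5 * X[:, 1] + 2 * X[:, 4]           # 3*X1 + 1.5*X2 + 2*X5
Y = mu + 1.0 * rng.standard_normal(n)                    # sigma = 1 case
```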

SLIDE 25

σ = 1:

                  r      E(|m̂|)   %{m̂ = m0}   %{m̂ ⊇ m0}
  Our procedure   1.57   3.34      72%          97.8%
  Lasso           2.09   5.21      10.8%        100%
  A. Lasso        1.99   4.56      16.8%        99%

σ = 3:

                  r      E(|m̂|)   %{m̂ = m0}   %{m̂ ⊇ m0}
  Our procedure   3.08   2.01      10.3%        15.7%
  Lasso           2.06   4.56      10.5%        100%
  A. Lasso        2.44   3.81      13.2%        52%

SLIDE 26

Simulation 2

Let $X_1, X_2, X_3$ be three vectors of $\mathbb{R}^n$ defined by
$$X_1 = (1, -1, 0, \dots, 0)/\sqrt{2}$$
$$X_2 = (-1, 1.001, 0, \dots, 0)/\sqrt{1 + 1.001^2}$$
$$X_3 = \left( 1/\sqrt{2},\ 1/\sqrt{2},\ 1/n, \dots, 1/n \right) \Big/ \sqrt{1 + (n-2)/n^2}$$
and $X_j = e_j$ for all $j = 4, \dots, n$. We take $p = n = 20$, $D_{\max} = 8$ and $\mu = (n, n, 0, \dots, 0) \in \mathrm{Span}\{X_1, X_2\}$.
$$\longrightarrow \quad \mu \text{ almost} \perp X_1, X_2 \text{ and very correlated with } X_3.$$
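A sketch of this design:

```python
import numpy as np

n = 20
X = np.eye(n)   # X_j = e_j for j = 4, ..., n (columns 3..n-1, 0-indexed)
X[:, 0] = np.r_[1.0, -1.0, np.zeros(n - 2)] / np.sqrt(2)
X[:, 1] = np.r_[-1.0, 1.001, np.zeros(n - 2)] / np.sqrt(1 + 1.001**2)
X[:, 2] = (np.r_[1/np.sqrt(2), 1/np.sqrt(2), np.full(n - 2, 1/n)]
           / np.sqrt(1 + (n - 2) / n**2))
mu = np.r_[float(n), float(n), np.zeros(n - 2)]  # in Span{X1, X2}
```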

SLIDE 27

The result

                  r      E(|m̂|)   %{m̂ = m0}   %{m̂ ⊇ m0}
  Our procedure   2.24   2.19      83.4%        96.2%
  Lasso           285    6         0%           30%
  A. Lasso        298    5         0%           25%

SLIDE 28

Mixed strategy

Let $m \in \mathcal{M}_c$ and take
$$L_m = \begin{cases} |m| & \text{if } m \in \mathcal{M}_o, \\ \log\binom{p}{|m|} + \log\big( p\, (|m| + 1) \big) & \text{if } m \in \mathcal{M}_c \setminus \mathcal{M}_o \end{cases}$$
$$\longrightarrow \quad \sum_{m \in \mathcal{M}} (D_m + 1)\, e^{-L_m} \le 3.51$$
and the risk bound becomes the best of the two previous ones:
$$\mathbb{E}\left[ \frac{|\mu - \hat\mu_{\hat m}|^2}{\sigma^2} \right] \lesssim \left[ \inf_{m \in \mathcal{M}_o} \left( \mathbb{E}\left[ \frac{|\mu - \hat\mu_m|^2}{\sigma^2} \right] \vee 1 \right) \right] \wedge \left[ \log(p)\ \inf_{m \in \mathcal{M}_c} \left( \mathbb{E}\left[ \frac{|\mu - \hat\mu_m|^2}{\sigma^2} \right] \vee 1 \right) \right].$$