High-dimensional regression with unknown variance
Christophe Giraud
Ecole Polytechnique
March 2012

Setting: Gaussian regression with unknown variance

◮ Yi = fi + εi with εi i.i.d. N(0, σ²)
◮ f = (f1, . . . , fn)∗ and σ² are unknown
◮ we want to estimate f

Examples:

◮ f = Xβ with β ”sparse” in some sense and X ∈ Rn×p, possibly with p ≫ n (simulated in the sketch below)
◮ fi = F(xi) with F : X → R
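As a concrete anchor for the notation, here is a minimal simulation sketch of the first example. All sizes, the sparsity pattern, and the noise level are illustrative choices, not from the talk.

```python
import numpy as np

# Simulate Y_i = f_i + eps_i with eps_i i.i.d. N(0, sigma^2),
# where f = X beta0 for a k-sparse beta0 (coordinate sparsity, p > n).
rng = np.random.default_rng(0)
n, p, k, sigma = 100, 500, 5, 2.0          # sigma is unknown to the statistician

X = rng.standard_normal((n, p))
beta0 = np.zeros(p)
beta0[:k] = rng.uniform(1.0, 3.0, size=k)  # k non-zero coordinates
f = X @ beta0
Y = f + sigma * rng.standard_normal(n)     # only X and Y are observed
```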
Estimators:

◮ Coordinate sparsity: Lasso, Dantzig, Elastic-Net, etc.
◮ Structured sparsity: Group-Lasso, Fused-Lasso, Bayesian approaches, etc.
◮ Smooth regression: spline smoothing, Nadaraya kernel smoothing, kernel ridge regression, etc.
Which procedure, which tuning parameters?

◮ Sparse regression: Lasso? Random Forest?
◮ Non-parametric regression: kernel regression? (which kernel?)
◮ which penalty level for the Lasso?
◮ which bandwidth for kernel regression?
◮ etc.

◮ No procedure is universally better than the others
◮ A sensible choice of the tuning parameters depends on
◮ some unknown characteristics of f (sparsity, smoothness, etc.)
◮ the unknown variance σ².
◮ Goal: select the ”best” estimator among a collection {f̂k : k ∈ K}
Minimax risk. For β0 k-sparse and σ² > 0, consider the normalized minimax risk

  inf_β̂ sup_{β0 k-sparse, σ²>0} E||Xβ̂ − Xβ0||² / (||Xβ0||² + |β0|_0 log(p) σ²).

◮ σ known or k known: this minimax risk stays bounded
◮ σ unknown and k unknown: it stays bounded as long as 2k log(p/k) < n; in the ultra-high-dimensional regime 2k log(p/k) ≥ n it explodes, so no procedure can adapt simultaneously to k and σ².
Selection procedures:

◮ Hold-out
◮ V-fold CV (sketched below)
◮ Leave-q-out
◮ Penalized log-likelihood (AIC, BIC, etc.)
◮ Plug-in criteria (with Mallows’ Cp, etc.)
◮ Slope heuristic
◮ LinSelect
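Among these, V-fold CV is the easiest to sketch for the Lasso penalty. A minimal example with scikit-learn's LassoCV, reusing X and Y from the simulation sketch above; the fold count and grid size are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import LassoCV

# 10-fold CV over an automatic grid of penalty levels; note that CV never
# needs sigma^2 explicitly, since it targets prediction error directly.
cv_lasso = LassoCV(cv=10, n_alphas=100).fit(X, Y)
print("CV-selected penalty level:", cv_lasso.alpha_)
print("selected support size:", int(np.sum(cv_lasso.coef_ != 0)))
```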
LinSelect (Baraud, Giraud, Huet):

◮ A collection S of linear spaces (for approximation)
◮ A weight function ∆ : S → R+ (measure of complexity), normalized so that Σ_{S∈S} e^{−∆(S)} ≤ 1
◮ Select

  Ŝ ∈ argmin_{S∈S} { ||Y − ΠS Y||² + pen∆(S) σ̂²_S },

  where
◮ ΠS is the orthogonal projector onto S,
◮ σ̂²_S = ||Y − ΠS Y||² / (n − dim(S)),
◮ pen∆(S) ≍ dim(S) ∨ 2∆(S) when dim(S) ∨ 2∆(S) ≤ 2n/3 (see the schematic sketch below).
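The criterion is cheap to evaluate once the spaces are fixed. A schematic sketch, assuming each space is given by an orthonormal basis; the penalty used here, c·(dim(S) ∨ 2∆(S)), is an illustrative stand-in for the exact pen∆ of the paper.

```python
import numpy as np

def linselect_crit(Y, bases, deltas, c=2.0):
    """Schematic LinSelect-type criterion over a list of linear spaces.

    bases[i]: (n, d_i) matrix with orthonormal columns spanning S_i
    deltas[i]: complexity weight Delta(S_i); pen(S) = c * max(dim S, 2*Delta(S))
    is an illustrative stand-in for the exact penalty pen_Delta.
    Returns the index of the selected space.
    """
    n = len(Y)
    crits = []
    for B, delta in zip(bases, deltas):
        d = B.shape[1]
        resid2 = np.sum((Y - B @ (B.T @ Y)) ** 2)   # ||Y - Pi_S Y||^2
        sigma2_S = resid2 / (n - d)                 # variance estimate on S
        crits.append(resid2 + c * max(d, 2 * delta) * sigma2_S)
    return int(np.argmin(crits))
```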
LinSelect applied to the Lasso path {β̂λ : λ ∈ Λ}:

◮ Sλ = span{Xj : (β̂λ)j ≠ 0}, with dim(Sλ) = |supp(β̂λ)|
◮ Ŝ = {Sλ : λ ∈ Λ}
◮ select λ̂ ∈ argmin_{λ∈Λ} min_{S∈Ŝ} { ||Y − ΠS Y||² + pen∆(S) σ̂²_S }
Alternatives for the Lasso with unknown variance:

◮ V-fold CV
◮ BIC criterion
◮ ℓ1-penalized log-likelihood (Städler, Bühlmann, van de Geer):

  (β̂λ, σ̂λ) := argmin_{β∈Rp, σ′>0} { n log(σ′) + ||Y − Xβ||²/(2σ′²) + λ ||β||1/σ′ }
◮ ℓ1-penalized Huber’s loss (Belloni et al., Antoniadis):

  (β̂λ, σ̂λ) := argmin_{β∈Rp, σ′>0} { ||Y − Xβ||²/(2σ′) + nσ′/2 + λ ||β||1 }

  Minimizing first in σ′ gives σ̂λ = ||Y − Xβ̂λ||2/√n, so β̂λ is the Square-Root Lasso:

  β̂λ = argmin_{β∈Rp} { √n ||Y − Xβ||2 + λ ||β||1 }   (an alternating-minimization sketch follows)
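Profiling out σ′ also suggests the alternating scheme of Sun & Zhang's Scaled Lasso: alternate a closed-form variance update with a Lasso step. A minimal sketch with scikit-learn, reusing X, Y, n, p from the first sketch; the λ-to-alpha conversion is derived in the comments, and the initialization, stopping rule, and the constant 1.1 are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

def scaled_lasso(X, Y, lam, n_iter=20):
    """Alternating minimization of ||Y - Xb||^2/(2s) + n*s/2 + lam*||b||_1.

    For fixed s, the b-step is a Lasso with sklearn penalty alpha = lam*s/n
    (sklearn minimizes ||Y - Xb||^2/(2n) + alpha*||b||_1); for fixed b,
    the s-step is closed form: s = ||Y - Xb|| / sqrt(n).
    """
    n = len(Y)
    sigma = np.std(Y)                      # crude initialization
    for _ in range(n_iter):
        lasso = Lasso(alpha=lam * sigma / n).fit(X, Y)
        resid = Y - lasso.predict(X)
        sigma_new = np.linalg.norm(resid) / np.sqrt(n)
        if abs(sigma_new - sigma) < 1e-8:
            break
        sigma = sigma_new
    return lasso.coef_, sigma

# A classical penalty level for the Square-Root Lasso is of order sqrt(n*log(p)).
coef, sigma_hat = scaled_lasso(X, Y, lam=1.1 * np.sqrt(n * np.log(p)))
```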
Risk bound. Define the cone C(ξ, T) = {u ∈ Rp : ||u_{T^c}||1 ≤ ξ ||u_T||1} and the restricted eigenvalue κ[ξ, T] = min_{u∈C(ξ,T)} ||Xu||2 / ||u_T||2.

◮ If |β0|_0 ≤ C1 κ²[4, supp(β0)] × n/log(p), then with high probability

  ||Xβ̂λ − Xβ0||² ≤ C2 |β0|_0 log(p) σ² / κ²[4, supp(β0)].

◮ Variant: if |β0|_0 ≤ C1 κ²[4, supp(β0)] × n/(φ∗ log(p)), with φ∗ a sparse-eigenvalue term, a similar high-probability bound holds (a Monte-Carlo sketch for κ follows).
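The restricted eigenvalue κ[ξ, T] has no closed form, but random search over the cone yields a cheap upper bound, since a minimum over sampled directions can only overestimate the true minimum. A small sketch; the number of draws is an arbitrary choice.

```python
import numpy as np

def re_upper_bound(X, T, xi=4.0, n_draws=5000, rng=None):
    """Monte-Carlo upper bound on the restricted eigenvalue
    kappa[xi, T] = min over C(xi, T) of ||X u||_2 / ||u_T||_2,
    where C(xi, T) = {u : ||u_{T^c}||_1 <= xi * ||u_T||_1}.
    """
    rng = rng or np.random.default_rng(0)
    p = X.shape[1]
    Tc = np.setdiff1d(np.arange(p), T)
    best = np.inf
    for _ in range(n_draws):
        u = np.zeros(p)
        u[T] = rng.standard_normal(len(T))
        v = rng.standard_normal(len(Tc))
        # rescale the off-support part so the cone constraint holds exactly
        v *= xi * np.sum(np.abs(u[T])) / np.sum(np.abs(v))
        u[Tc] = v
        best = min(best, np.linalg.norm(X @ u) / np.linalg.norm(u[T]))
    return best

# e.g. kappa_hat = re_upper_bound(X, T=np.arange(5))  # T = supp(beta0) above
```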
Numerical comparisons:

◮ 165 examples extracted from the literature
◮ each example e is evaluated on the basis of 400 runs
◮ for each run, the selected estimator β̂λ̂ is scored against the true β0
◮ enet implementation for 10-fold CV and LinSelect
◮ lars implementation for the Square-Root Lasso (procedure of Sun & Zhang)
Selection for non-parametric estimators:

◮ spline smoothing or kernel ridge estimators with smoothing parameter λ ∈ R+
◮ Nadaraya estimators Aλ with smoothing parameter λ ∈ R+
◮ λ-nearest neighbors, λ ∈ {1, . . . , k}
◮ L2-basis projection (on the λ first elements)
◮ etc.

Tuning-parameter selection by:

◮ Cross-Validation schemes (including GCV; see the sketch below)
◮ Mallows’ CL + plug-in / slope heuristic
◮ LinSelect
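For linear smoothers, GCV needs only the smoothing matrix Aλ. A minimal Nadaraya-Watson sketch with a Gaussian kernel, as referenced in the list above; the kernel choice and bandwidth grid are illustrative.

```python
import numpy as np

def nadaraya_matrix(x, bandwidth):
    """Row-normalized Gaussian-kernel smoothing matrix A (f_hat = A @ Y)."""
    K = np.exp(-0.5 * ((x[:, None] - x[None, :]) / bandwidth) ** 2)
    return K / K.sum(axis=1, keepdims=True)

def gcv_bandwidth(x, Y, grid):
    """Select the bandwidth minimizing GCV(h) = n*||Y - A_h Y||^2 / (n - Tr(A_h))^2."""
    n = len(Y)
    scores = []
    for h in grid:
        A = nadaraya_matrix(x, h)
        resid2 = np.sum((Y - A @ Y) ** 2)
        scores.append(n * resid2 / (n - np.trace(A)) ** 2)
    return grid[int(np.argmin(scores))]

# Example: x = design points, Y = responses
# h_star = gcv_bandwidth(x, Y, grid=np.geomspace(0.05, 2.0, 30))
```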
Linear estimators f̂λ = AλY:

◮ Aλ ≈ shrinkage or ”averaging” matrix (covers all classics)
◮ Bias assumption: there exists λ0(σ̂) with Tr(A_{λ0(σ̂)}) ∈ [n/10, n/3] such that

  ||A_{λ0(σ̂)} f − f||² ≤ (1 + ε) inf_λ ||Aλ f − f||².
LinSelect for linear estimators:

◮ for each λ, nested spaces S^1_λ, . . . , S^{n/2}_λ, where S^k_λ is spanned by ”the k last” singular vectors of Aλ (see the SVD sketch below)
◮ f̄λ = ΠS(A+λ f̂λ), where
◮ A+λ is the inverse of the restriction of Aλ viewed as a map range(A∗λ) → range(Aλ)
◮ weights ∆ normalized so that Σ_S e−∆(S) ≤ 1
◮ select λ̂ ∈ argmin_{λ∈Λ} min_{S∈Ŝ} { ||f̂λ − f̄λ||² + pen∆(S) σ̂²_S }
◮ this yields an oracle-type bound on E||f̂λ̂ − f||² in terms of inf_{λ∈Λ} E||f̂λ − f||².
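One way to realize the spaces S^k_λ numerically is through an SVD of Aλ. The sketch below reads ”the k last” as the k singular vectors carrying the largest singular values, which is an assumption about the slides' shorthand.

```python
import numpy as np

def linselect_spaces(A, k_max=None):
    """Schematic construction of nested spaces S^1 c ... c S^{n/2} from A_lambda.

    S^k is spanned here by the k left singular vectors of A with the largest
    singular values (one reading of 'the k last' in the slides).
    Returns a list of (n, k) orthonormal bases.
    """
    n = A.shape[0]
    k_max = k_max or n // 2
    # np.linalg.svd returns singular values in decreasing order,
    # so the first k columns of U span S^k.
    U, s, _ = np.linalg.svd(A)
    return [U[:, :k] for k in range(1, k_max + 1)]
```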