Regularization and shrinkage for model selection in sparse GLM models. Challenging problems in Statistical Learning Workshop
- A. Antoniadis
LJK-Université Joseph Fourier. Grenoble, March 17 & 18, 2011
Thresholding and regularization
Any f ∈ L²(ℝ) admits the homogeneous wavelet expansion
f = ∑_{j,k∈Z} d_{jk} ψ_{jk},
or, fixing a coarse resolution level j0, the inhomogeneous expansion
f = ∑_{k∈Z} c_{j0,k} φ_{j0,k} + ∑_{k∈Z, j≥j0} d_{jk} ψ_{jk}.
The classical thresholding rules applied to the empirical wavelet coefficients d̂jk are:
Hard: δλH(d̂jk) = d̂jk I{|d̂jk| > λ};
Soft: δλS(d̂jk) = sign(d̂jk)(|d̂jk| − λ)₊;
NNG (nonnegative garrote): δλNNG(d̂jk) = (d̂jk − λ²/d̂jk) I{|d̂jk| > λ};
SCAD: δλSCAD(d̂jk) = sign(d̂jk)(|d̂jk| − λ)₊ if |d̂jk| ≤ 2λ, ((a−1)d̂jk − aλ sign(d̂jk))/(a−2) if 2λ < |d̂jk| ≤ aλ, and d̂jk if |d̂jk| > aλ.
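These rules are easy to state in code. A minimal sketch, assuming the standard hard, soft and SCAD formulas with the usual default a = 3.7:

```python
import numpy as np

def hard(d, lam):
    # Hard thresholding: keep coefficients whose magnitude exceeds lam.
    return d * (np.abs(d) > lam)

def soft(d, lam):
    # Soft thresholding: shrink toward zero by lam, kill small coefficients.
    return np.sign(d) * np.maximum(np.abs(d) - lam, 0.0)

def scad(d, lam, a=3.7):
    # SCAD thresholding: soft near zero, linear interpolation in the
    # mid-range, identity (no bias) for large coefficients.
    ad = np.abs(d)
    return np.where(ad <= 2 * lam,
                    np.sign(d) * np.maximum(ad - lam, 0.0),
                    np.where(ad <= a * lam,
                             ((a - 1) * d - a * lam * np.sign(d)) / (a - 2),
                             d))
```

Note how SCAD interpolates between soft thresholding for small inputs and the identity for large ones, which is the source of its low bias on large coefficients.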
Figure: the thresholding rules Hard (1994), Soft (1994), NNG (1998) and SCAD (2001).
A single shift-invariant Haar shrinkage step can be written as
f̃k = (1/4)(f_{k−1} + 2f_k + f_{k+1}) + (1/(2√2)) [ Sλ((f_k − f_{k−1})/√2) + Sλ((f_k − f_{k+1})/√2) ],
where Sλ denotes the soft shrinkage operator with threshold λ.
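This shift-averaged single-level Haar soft-shrinkage step can be sketched as follows (a minimal illustration assuming periodic boundary conditions, which the slides do not specify):

```python
import numpy as np

def soft(d, lam):
    return np.sign(d) * np.maximum(np.abs(d) - lam, 0.0)

def ti_haar_step(f, lam):
    # One translation-invariant single-level Haar soft-shrinkage step:
    # average of the two shifted two-pixel Haar decompositions.
    left = np.roll(f, 1)    # f_{k-1} (periodic boundary, an assumption)
    right = np.roll(f, -1)  # f_{k+1}
    smooth = 0.25 * (left + 2 * f + right)
    detail = (soft((f - left) / np.sqrt(2), lam)
              + soft((f - right) / np.sqrt(2), lam)) / (2 * np.sqrt(2))
    return smooth + detail
```

With lam = 0 the step reproduces the input exactly, since the unshrunk details undo the local averaging.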
Every shrinkage rule considered here satisfies δλ(0) = 0.
Linear shrinkage corresponds to δλ(x) = x/(1+λ); through the relation g(|x|) = 1 − δλ(x)/x its diffusivity is constant, g(|x|) = λ/(1+λ), and soft shrinkage δλ(x) = sign(x)(|x| − √2λ)₊ yields the total-variation-type diffusivity
g(|x|) = I{|x|≤√2λ}(|x|) + (√2λ/|x|) I{|x|>√2λ}(|x|).
Hard shrinkage gives g(|x|) = I{|x|≤√2λ}(|x|), while the nonnegative garrote gives
g(|x|) = I{|x|≤√2λ}(|x|) + (2λ²/x²) I{|x|>√2λ}(|x|).
For firm shrinkage with thresholds √2λ1 < √2λ2, the diffusivity on the intermediate range √2λ1 < |x| ≤ √2λ2 is
g(|x|) = (λ1/(λ2−λ1)) (√2λ2/|x| − 1),
and for SCAD, on the range 2√2λ < |x| ≤ a√2λ,
g(|x|) = a√2λ/((a−2)|x|) − 1/(a−2).
Shrinkage functions (top) and corresponding diffusivities (bottom). Plotted for λ = 1, λ1 = 1, λ2 = 2 (Firm) and a = 3.7 (Scad). The dashed line is the diagonal.
Charbonnier diffusivity. The Charbonnier diffusivity (Charbonnier et al. (1994)) is given by g(|x|) = (1 + 2x²/λ²)^{−1/2} and corresponds to the shrinkage function δλ(x) = x (1 − λ/√(λ² + 2x²)).
Perona-Malik diffusivity. The Perona-Malik diffusivity (Perona and Malik (1990)) is defined by g(|x|) = (1 + 2x²/λ²)^{−1} and leads to the shrinkage function δλ(x) = 2x³/(2x² + λ²).
Weickert diffusivity. Weickert (1998) introduced the diffusivity g(|x|) = I{|x|>0}(x) (1 − exp(−c/(|x|/λ)⁸)), which leads to the shrinkage function δλ(x) = x exp(−c λ⁸/x⁸).
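These diffusivity-induced shrinkage functions are straightforward to evaluate. A small sketch: the Perona-Malik formula is the one stated above, while the Charbonnier form is written here under the same convention and is an assumption to be checked against the slides:

```python
import numpy as np

def delta_perona_malik(x, lam):
    # Shrinkage induced by the Perona-Malik diffusivity:
    # delta(x) = 2 x^3 / (2 x^2 + lam^2); smooth, no exact zero region.
    return 2 * x**3 / (2 * x**2 + lam**2)

def delta_charbonnier(x, lam):
    # Shrinkage induced by the Charbonnier diffusivity (assumed form:
    # delta(x) = x * (1 - lam / sqrt(lam^2 + 2 x^2))).
    return x * (1 - lam / np.sqrt(lam**2 + 2 * x**2))
```

Both rules shrink small arguments strongly but approach the identity for large |x|, so they never set coefficients exactly to zero.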
∑_{i=1}^{n} (zi − θi)² + 2λ ∑_{i>i0} p(|θi|),
With the choice of an additive penalty ∑_{i>i0} p(|θi|), the minimization problem becomes separable, i.e. it is equivalent to minimizing
(zi − θi)² + 2λ p(|θi|)
for each coordinate i larger than i0. Therefore the estimate of any coordinate θi depends solely on the empirical wavelet coefficient zi. The performance of the resulting wavelet estimator depends on the penalty and the regularization parameter λ.
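Separability can be checked directly: minimizing each coordinate on a grid reproduces, for the ℓ1 penalty p(|θ|) = |θ|, the soft-thresholding rule. A brute-force sketch; the grid and the penalty choice are illustrative:

```python
import numpy as np

def coordinatewise_min(z, lam, pen):
    # Minimize (z_i - t)^2 / 2 + lam * pen(|t|) separately in each coordinate
    # by brute-force search on a grid: the problem decouples coordinatewise.
    grid = np.linspace(-10.0, 10.0, 200001)
    return np.array([grid[np.argmin(0.5 * (zi - grid) ** 2
                                    + lam * pen(np.abs(grid)))]
                     for zi in z])

# With the l1 penalty, each coordinatewise solution is a soft threshold.
theta = coordinatewise_min(np.array([3.0, -0.5, 1.2]), 1.0, lambda a: a)
```

Swapping in another penalty function changes the coordinatewise rule but not the decoupling.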
Usually, p is chosen to be symmetric and increasing on [0, +∞). Antoniadis and Fan (AF) provide some insights into how to choose a penalty function. A good penalty function should result in:
unbiasedness (no over-penalization of large coefficients, to avoid unnecessary modeling bias);
sparsity (insignificant coefficients should be set to zero, to reduce model complexity);
stability (continuity of the penalty, to avoid instability and large variability in model prediction).
We will now show how to derive the penalties corresponding to the thresholding rules defined previously, and check that almost all of them satisfy these conditions.
Pλ(θ) = ∫₀^θ (rλ(u) − u) du, where rλ denotes the generalized inverse of the thresholding function.
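For soft shrinkage the penalty formula Pλ(θ) = ∫₀^θ (rλ(u) − u) du can be verified numerically: rλ(u) = u + λ for u > 0, so Pλ(θ) = λθ, the ℓ1 penalty. A sketch; the grid resolutions are arbitrary:

```python
import numpy as np

def soft(t, lam):
    return np.sign(t) * np.maximum(np.abs(t) - lam, 0.0)

def penalty_from_shrinkage(theta, lam, delta, tmax=50.0):
    # P(theta) = int_0^theta (r(u) - u) du with r(u) = sup{t : delta(t) <= u},
    # the generalized inverse of the shrinkage rule, evaluated on a grid.
    ts = np.linspace(0.0, tmax, 20001)
    vals = delta(ts, lam)
    us = np.linspace(0.0, theta, 501)
    r = np.array([ts[vals <= u].max() for u in us])
    integrand = r - us
    # Trapezoidal rule for the integral.
    return float(np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(us)))
```

Plugging in other shrinkage rules recovers their penalties in the same way.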
Penalties corresponding to the shrinkage and thresholding functions with the same name
The quadratic penalty, while continuous, is not singular at zero, and the resulting estimator is not thresholded. All other penalties are singular at zero, thus resulting in thresholding rules that enforce sparseness of the solution. The hard-thresholding penalty is not continuous at the threshold, so it may induce oscillation of the reconstructed signal (lack of stability). For soft thresholding, the resulting estimator of large coefficients is shifted by an amount λ (unnecessary bias when the coefficients are large); the same holds for the Charbonnier and Perona-Malik penalties. All other penalties are singular at zero (encouraging sparse solutions), continuous (stable), and do not create excessive bias when the wavelet coefficients are large.
∑_{j=1}^{p} min(|Zj|², λ²),
The MM iteration: choose a starting point x0; construct a majorizing function of f(x) at x0; minimize the majorizer (attained at x1); repeat.
The surface θ → Q(θ|θ(m)) lies above the surface S(θ) and is tangent to it at the point θ = θ(m). Ordinarily, θ(m) represents the current iterate in the search for the minimum of the surface S(θ). In a majorize-minimize (MM) algorithm, one minimizes the majorizing function Q(θ|θ(m)) rather than the actual function S(θ).
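A toy instance of the MM principle (a hypothetical example, not from the slides): minimize f(b) = (y − b)²/2 + λ|b| by majorizing |b| at the current iterate g with the quadratic b²/(2|g|) + |g|/2, which touches |b| at b = ±g:

```python
def mm_l1(y, lam, b0, iters=200):
    # Each MM step minimizes the quadratic majorizer
    #   0.5 * (y - b)**2 + lam * b**2 / (2 * abs(b_cur)) + lam * abs(b_cur) / 2,
    # whose minimizer in closed form is b = y * abs(b_cur) / (abs(b_cur) + lam).
    b = b0
    for _ in range(iters):
        b = y * abs(b) / (abs(b) + lam)
    return b
```

The fixed point b = y|b|/(|b| + λ) is the soft-thresholded value sign(y)(|y| − λ)₊ whenever |y| > λ, and each step provably decreases f, the hallmark of MM.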
Consider minimizing over β ∈ ℝp the penalized least-squares criterion
Sλ(β) = (1/2)‖y − Xβ‖² + λT(β).
Assuming ‖X‖ < 1, define the surrogate functional
Sλsur(β|γ) = (1/2)‖y − Xβ‖² + λT(β) + Ξ(β|γ),
where Ξ(β|γ) = (1/2)‖β − γ‖² − (1/2)‖Xβ − Xγ‖² ≥ 0, with equality when β = γ, so that Sλsur(·|γ) majorizes Sλ. Starting from an initial guess β(0), one first minimizes Sλsur(β|γ) for γ = β(0); each successive iterate β(m) minimizes Sλsur(β|β(m−1)).
Expanding the quadratic terms and collecting everything that does not depend on β into a constant C(γ, y) gives, with Σ = XᵀX,
S0sur(β|γ) = (1/2)‖β‖² − ⟨β, (I − Σ)γ + Xᵀy⟩ + C(γ, y)
= (1/2)‖β − ((I − Σ)γ + Xᵀy)‖² + C′(γ, y).
Hence minimizing Sλsur(β|γ) with anchor γ is equivalent to minimizing
(1/2)‖β − ((I − Σ)γ + Xᵀy)‖² + λT(β),
a separable problem solved coordinatewise by thresholding the Landweber update (I − Σ)γ + Xᵀy.
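The resulting iteration is the classical thresholded Landweber, or iterative soft-thresholding (IST), scheme. A minimal sketch for the ℓ1 penalty T(β) = ‖β‖₁, assuming ‖X‖ < 1:

```python
import numpy as np

def soft(v, lam):
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def ist(X, y, lam, iters=500):
    # Iterative soft thresholding for 0.5*||y - X b||^2 + lam*||b||_1.
    # Requires the spectral norm of X to be below 1, so that the added
    # surrogate term Xi(b|g) is nonnegative.
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        b = soft(b + X.T @ (y - X @ b), lam)  # Landweber step + shrinkage
    return b
```

For a general design one can rescale X and y so that the norm condition holds; the minimizer is unchanged up to the corresponding rescaling of λ.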
For particular inverse problems such algorithms have been studied in the recent literature by many authors, especially in connection with sparse regularization and compressed sensing. For convex penalties T(β), convergence of these iterations was established by Daubechies, Defrise and De Mol (2004) and Figueiredo, Nowak and Bioucas-Dias (2005, 2007). Other authors independently proposed IST-like schemes for signal/image recovery: Starck, Nguyen and Murtagh (2003); Starck, Candès and Donoho (2003); Bect, Blanc-Féraud, Aubert and Chambolle (2004); Tropp, Donoho and others (2005); Candès (2006); Elad, Matalon and Zibulevsky (2006); Hale, Yin and Zhang (2007), ...
Assume the thresholding function δλ satisfies: shrinkage (δλ(t) ≤ t, ∀t ≥ 0); δλ is nondecreasing and coercive; and δλ(t) = 0 for 0 ≤ t ≤ τ for some τ > 0. Define its generalized inverse by δλ⁻¹(u) = sup{t; δλ(t) ≤ u}, extended to negative arguments by δλ⁻¹(−u) = −δλ⁻¹(u). The corresponding penalty then has derivative Pλ′(u) = δλ⁻¹(u) − u, ∀u ≥ 0.
δλ(t) = argmin_θ { (t − θ)²/2 + Pλ(θ) }.
In a GLM, the mean satisfies µi = b′(θi) and the second derivative of the log-likelihood in θi is ℓ″(θi) = −b′′(θi); these relations hold in general for the exponential family. The covariates enter through the linear predictor xiᵀβ = g(µi), with the canonical link determined by g(µi) = θi. Obviously g = (b′)⁻¹.
For Bernoulli responses, the log-likelihood term is yi log(πi/(1−πi)) + log(1 − πi), so that θi = log(πi/(1−πi)), µi = πi, b(t) = log(1 + eᵗ), and g(t) = log(t/(1−t)) (the logit link).
For Poisson responses, f(yi; ωi) = (1/yi!) e^{−ωi} ωi^{yi} = exp{yi log ωi − ωi − log(yi!)}, so that θi = log ωi, b(t) = eᵗ, and g(t) = log t (the log link).
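The Bernoulli and Poisson ingredients can be written down directly (a small sketch; the function names are illustrative):

```python
import numpy as np

# Canonical pairs (b, link g = (b')^{-1}) for Bernoulli and Poisson GLMs.

def b_bernoulli(t):
    return np.log1p(np.exp(t))       # b(t) = log(1 + e^t)

def g_bernoulli(mu):
    return np.log(mu / (1 - mu))     # logit link

def b_poisson(t):
    return np.exp(t)                 # b(t) = e^t

def g_poisson(mu):
    return np.log(mu)                # log link
```

One can check numerically that g inverts b′: for Bernoulli, b′(t) is the sigmoid, and the logit of a sigmoid returns the original linear predictor.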
The penalized likelihood estimator minimizes over β the criterion
F(β) = −L(β) + Pλ(β),
where L(β) = ∑_{i=1}^n { yi xiᵀβ − b(xiᵀβ) } and Pλ(β) = ∑_{i=1}^p pλ(|βi|).
To majorize the negative log-likelihood, note that
−L(γ) + L(β) = ∑_{i=1}^n (b(xiᵀγ) − b(xiᵀβ)) − ∑_{i=1}^n yi(xiᵀγ − xiᵀβ),
and that the gradient of −L involves µi(β) = b′(xiᵀβ). Bounding the curvature of b by a constant ω, the Bregman-type remainder
(ω/2)‖γ − β‖² − ∑_{i=1}^n [ b(xiᵀγ) − b(xiᵀβ) − b′(xiᵀβ)(xiᵀγ − xiᵀβ) ]
is nonnegative, so that minimizing the surrogate reduces, at each step, to a penalized least-squares problem of the form
min over γ of (1/2)‖γ − z(β)‖² + P(γ; λ)
for a suitable working response z(β), which is again solved coordinatewise by thresholding.
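Putting the pieces together for the Bernoulli case gives an MM scheme for ℓ1-penalized logistic regression. This is a sketch under stated assumptions (penalty Pλ(β) = λ‖β‖₁ and the curvature bound b″ ≤ 1/4), not necessarily the slides' exact algorithm:

```python
import numpy as np

def soft(v, lam):
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def penalized_logistic_mm(X, y, lam, iters=2000):
    # MM for l1-penalized logistic regression: the Bernoulli log-partition
    # b(t) = log(1 + e^t) has curvature b'' <= 1/4, so -L(beta) is majorized
    # by a quadratic with constant omega = 0.25 * ||X||^2; each MM step is
    # then a soft-thresholded gradient update.
    omega = 0.25 * np.linalg.norm(X, 2) ** 2
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))  # mu_i = b'(x_i^T beta)
        grad = X.T @ (mu - y)                 # gradient of -L(beta)
        beta = soft(beta - grad / omega, lam / omega)
    return beta
```

Each iteration decreases the penalized negative log-likelihood, and the per-step cost is one pass over the data plus a componentwise threshold.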