Alexandre Gramfort alexandre.gramfort@inria.fr Inria, Parietal Team Université Paris-Saclay
A practical tour of optimization algorithms for the Lasso
Huawei - Apr. 2017
Outline:
- What is the Lasso
- Lasso with an orthogonal design
The Lasso is used for regression / inverse problems. It is defined as:

\[ x^\star = \operatorname*{argmin}_{x \in \mathbb{R}^p} \frac{1}{2}\|b - Ax\|^2 + \lambda \sum_{i=1}^{p} |x_i|, \]

with \(\lambda > 0\) the regularization parameter, \(A \in \mathbb{R}^{n \times p}\) the design matrix and \(b \in \mathbb{R}^n\) the observations.
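As a minimal sketch (assuming placeholder arrays A, b and a regularization strength lambd, none of which come from the slides), this objective can be evaluated in NumPy as:

import numpy as np

def lasso_objective(x, A, b, lambd):
    # 0.5 * ||b - A x||^2 + lambd * ||x||_1
    residual = b - A.dot(x)
    return 0.5 * residual.dot(residual) + lambd * np.abs(x).sum()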
Using CVX Toolbox (MATLAB):

n = 10;
A = randn(n/2, n);
b = randn(n/2, 1);
gamma = 1;  % regularization parameter
cvx_begin
    variable x(n)
    minimize(0.5 * sum_square(A*x - b) + gamma * norm(x, 1))
cvx_end
Rewrite each coordinate with its positive and negative parts:

\[ x_i = x_i^+ - x_i^-, \qquad |x_i| = x_i^+ + x_i^-, \qquad x_i^+ = \max(x_i, 0), \ x_i^- = \max(-x_i, 0), \]

so that \(\|x\|_1 = \sum_i (x_i^+ + x_i^-)\). Leads to a smooth problem with only positivity constraints (convex constraints):

\[ \min_{z \in \mathbb{R}^{2p}_+} \frac{1}{2}\|b - [A, -A]\,z\|^2 + \lambda \sum_{i=1}^{2p} z_i . \]
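A quick numerical check of this equivalence (a sketch on random data; all names are placeholders):

import numpy as np

rng = np.random.RandomState(0)
A, b, lambd = rng.randn(5, 10), rng.randn(5), 1.0
x = rng.randn(10)

# split x into positive and negative parts: z lives in R^{2p}_+
z = np.concatenate([np.maximum(x, 0), np.maximum(-x, 0)])
obj_x = 0.5 * np.sum((b - A.dot(x)) ** 2) + lambd * np.abs(x).sum()
obj_z = 0.5 * np.sum((b - np.hstack([A, -A]).dot(z)) ** 2) + lambd * z.sum()
assert np.isclose(obj_x, obj_z)  # identical objective values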
To solve such problems, first consider the unconstrained case:

\[ \min_{x \in \mathbb{R}^p} f(x), \]

with f smooth with L-Lipschitz gradient:

\[ \|\nabla f(x) - \nabla f(y)\| \le L\,\|x - y\| . \]

Gradient descent reads:

\[ x_{k+1} = x_k - \frac{1}{L}\,\nabla f(x_k) . \]
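A generic sketch of this iteration (grad_f and the iteration count are assumed names, not from the slides):

import numpy as np

def gradient_descent(grad_f, x0, L, n_iter=100):
    # x <- x - grad_f(x) / L, the classical 1/L step size
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(n_iter):
        x -= grad_f(x) / L
    return x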
For the constrained case:

\[ \min_{x \in \mathcal{C} \subset \mathbb{R}^p} f(x), \]

with \(\mathcal{C}\) a convex set and f smooth with L-Lipschitz gradient. Denoting \(\pi_{\mathcal{C}}(x)\) the projection of x onto \(\mathcal{C}\), projected gradient reads:

\[ x_{k+1} = \pi_{\mathcal{C}}\Big(x_k - \frac{1}{L}\,\nabla f(x_k)\Big) . \]
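A sketch, with the projection onto the positive orthant needed for the \(\mathbb{R}^{2p}_+\) reformulation above (names are placeholders):

import numpy as np

def projected_gradient(grad_f, project, x0, L, n_iter=100):
    # gradient step with 1/L step size, then projection onto C
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(n_iter):
        x = project(x - grad_f(x) / L)
    return x

project_positive = lambda z: np.maximum(z, 0.)  # projection onto R^{2p}_+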
Assume now that the design is orthogonal, i.e. \(A^\top A = \mathrm{Id}\). One has:

\[ \frac{1}{2}\|b - Ax\|^2 = \frac{1}{2}\|A^\top b - x\|^2 + \mathrm{const.} \]

So the Lasso boils down to minimizing:

\[ x^\star = \operatorname*{argmin}_{x \in \mathbb{R}^p} \frac{1}{2}\|A^\top b - x\|^2 + \lambda\|x\|_1 = \operatorname*{argmin}_{x \in \mathbb{R}^p} \sum_{i=1}^{p} \Big( \frac{1}{2}\big((A^\top b)_i - x_i\big)^2 + \lambda |x_i| \Big) \quad (p \text{ one-dimensional problems}), \]

that is, by definition of the proximal operator:

\[ x^\star \triangleq \operatorname{prox}_{\lambda\|\cdot\|_1}(A^\top b) . \]
Each one-dimensional problem is solved by soft-thresholding:

\[ \operatorname{prox}_{\lambda|\cdot|}(x) = \operatorname{sign}(x)\,\max(|x| - \lambda, 0) . \]

[Figure: plot of the soft-thresholding operator, equal to zero on \([-\lambda, \lambda]\).]
In NumPy:

import numpy as np

c = A.T.dot(b)  # correlations A^T b
x_star = np.sign(c) * np.maximum(np.abs(c) - lambd, 0.)  # soft-thresholding
Back to a general design A. Let us define:

\[ f(x) = \frac{1}{2}\|b - Ax\|^2 . \]

Leads to:

\[ \nabla f(x) = A^\top (Ax - b), \]

and the Lipschitz constant of the gradient is \(L = \|A^\top A\|_2\). Minimizing the quadratic upper bound of f at the previous iterate gives:

\[ x_{k+1} = \operatorname*{argmin}_{x \in \mathbb{R}^p} f(x_k) + (x - x_k)^\top \nabla f(x_k) + \frac{L}{2}\|x - x_k\|^2 + \lambda\|x\|_1 \]

\[ \Leftrightarrow \ x_{k+1} = \operatorname*{argmin}_{x \in \mathbb{R}^p} \frac{1}{2}\Big\|x - \Big(x_k - \frac{1}{L}\nabla f(x_k)\Big)\Big\|^2 + \frac{\lambda}{L}\|x\|_1 . \]
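A numerical sanity check of the quadratic upper bound (a sketch on random data; all names are placeholders):

import numpy as np

rng = np.random.RandomState(0)
A, b = rng.randn(5, 10), rng.randn(5)
L = np.linalg.norm(A, ord=2) ** 2  # ||A^T A||_2 = ||A||_2^2

f = lambda x: 0.5 * np.sum((b - A.dot(x)) ** 2)
grad_f = lambda x: A.T.dot(A.dot(x) - b)

xk, x = rng.randn(10), rng.randn(10)
bound = f(xk) + (x - xk).dot(grad_f(xk)) + 0.5 * L * np.sum((x - xk) ** 2)
assert f(x) <= bound  # the quadratic surrogate majorizes f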
That we can rewrite:

\[ x_{k+1} = \operatorname{prox}_{\frac{\lambda}{L}\|\cdot\|_1}\Big(x_k - \frac{1}{L}\nabla f(x_k)\Big) . \]

This is proximal gradient descent, also known as ISTA (iterative soft-thresholding) [Daubechies et al. 2004, Combettes et al. 2005].
Remark: there exist so-called "accelerated" methods known as FISTA, Nesterov acceleration… Remark: if f is not strongly convex, \(f(x_k) - f(x^\star) = O(1/k)\), very far from the exponential rate of gradient descent with strong convexity.
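Since the slides only name FISTA, here is a minimal sketch of its standard recursion (following Beck and Teboulle; prox_l1, fista and all parameters are assumed names, not code from the talk):

import numpy as np

def prox_l1(x, threshold):
    # soft-thresholding, the proximal operator of threshold * ||.||_1
    return np.sign(x) * np.maximum(np.abs(x) - threshold, 0.)

def fista(A, b, lambd, n_iter=200):
    # accelerated proximal gradient (FISTA) for the Lasso
    L = np.linalg.norm(A, ord=2) ** 2
    x = np.zeros(A.shape[1])
    y, t = x.copy(), 1.
    for _ in range(n_iter):
        x_new = prox_l1(y + A.T.dot(b - A.dot(y)) / L, lambd / L)
        t_new = (1. + np.sqrt(1. + 4. * t ** 2)) / 2.
        y = x_new + ((t - 1.) / t_new) * (x_new - x)  # momentum step
        x, t = x_new, t_new
    return x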
ISTA in NumPy:

import numpy as np
from scipy import linalg

alpha = 0.1  # lambda, the regularization parameter
L = linalg.norm(A, ord=2) ** 2  # Lipschitz constant ||A^T A||_2
x = np.zeros(A.shape[1])
for i in range(max_iter):
    x += (1. / L) * np.dot(A.T, b - np.dot(A, x))  # gradient step
    x = np.sign(x) * np.maximum(np.abs(x) - alpha / L, 0.)  # soft-thresholding
Benefits: A only needs to be applied to vectors (no need to store it explicitly in memory), and fast implicit operators can be used (FFT, wavelets, etc.) as dot products have some logarithmic complexities.
The subdifferential of f at \(x_0\) is the set:

\[ \partial f(x_0) = \{ g \in \mathbb{R}^n \,/\, f(x) - f(x_0) \ge g^\top (x - x_0), \ \forall x \in \mathbb{R}^n \} . \]

What is \(\partial |\cdot|(0)\)? Since \(|x| \ge g\,x\) holds for all x if and only if \(|g| \le 1\), we get \(\partial |\cdot|(0) = [-1, 1]\).
Let \(x^\star\) be a Lasso solution and \(I = \{i \ \mathrm{s.t.}\ x^\star_i \neq 0\}\) its support. The first-order optimality conditions read:

\[ A_I^\top (A x^\star - b) + \lambda\,\operatorname{sign}(x^\star_I) = 0, \qquad \|A_{I^c}^\top (A x^\star - b)\|_\infty \le \lambda . \]

Hence, on the support:

\[ x^\star_I = (A_I^\top A_I)^{-1}\big(A_I^\top b - \lambda\,\operatorname{sign}(x^\star_I)\big) . \]
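These conditions can be verified numerically; a sketch (assuming a candidate solution x_star from any of the solvers above; names are placeholders):

import numpy as np

def check_kkt(A, b, x_star, lambd, tol=1e-6):
    # check the Lasso first-order optimality conditions
    grad = A.T.dot(A.dot(x_star) - b)  # A^T (A x* - b)
    support = np.abs(x_star) > tol     # I = {i : x*_i != 0}
    on_support = np.allclose(grad[support], -lambd * np.sign(x_star[support]))
    off_support = np.max(np.abs(grad[~support]), initial=0.) <= lambd + tol
    return on_support and off_support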
Limitation of proximal gradient descent: if L is big we make tiny steps!

\[ x_{k+1} = \operatorname{prox}_{\frac{\lambda}{L}\|\cdot\|_1}\Big(x_k - \frac{1}{L}\nabla f(x_k)\Big) \]

The idea of coordinate descent (CD) is to update one coefficient at a time (also known as univariate relaxation in the optimization literature). Hope: make bigger steps. Spoiler: it is the state of the art for machine learning problems (cf. the GLMNET R package and scikit-learn) [Friedman et al. 2009].
[Figure: contour plot with coordinate descent iterates x0, x1 = x2: on a nonsmooth objective the iterates can get stuck.]
Since the regularization \(\sum_{i=1}^{p} |x_i|\) is a separable function, CD works for the Lasso [Tseng 2001]. The proximal coordinate descent algorithm works, and we make bigger steps since \(L_i \ll L\):

for k = 1 … K:
    i = (k mod p) + 1
    \( x_i^{k+1} = \operatorname{prox}_{\frac{\lambda}{L_i}|\cdot|}\Big( x_i^k - \frac{1}{L_i}\big(\nabla f(x^k)\big)_i \Big) \)

with \(L_i = \|a_i\|^2\) the coordinate-wise Lipschitz constant (\(a_i\) the i-th column of A).
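A minimal NumPy sketch of this cyclic proximal coordinate descent (an illustration of the update rule above; variable names are placeholders):

import numpy as np

def prox_cd_lasso(A, b, lambd, n_iter=100):
    # cyclic proximal coordinate descent for the Lasso
    n_samples, n_features = A.shape
    Li = np.sum(A ** 2, axis=0)   # coordinate-wise Lipschitz constants ||a_i||^2
    x = np.zeros(n_features)
    residual = b - A.dot(x)       # kept up to date for cheap partial gradients
    for k in range(n_iter * n_features):
        i = k % n_features
        grad_i = -A[:, i].dot(residual)  # (grad f(x))_i
        x_i = x[i] - grad_i / Li[i]      # coordinate gradient step
        x_i = np.sign(x_i) * max(abs(x_i) - lambd / Li[i], 0.)  # scalar prox
        residual += A[:, i] * (x[i] - x_i)  # maintain residual b - Ax
        x[i] = x_i
    return x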
GitHub: @agramfort / Twitter: @agramfort