

SLIDE 1

A practical tour of optimization algorithms for the Lasso

Alexandre Gramfort
alexandre.gramfort@inria.fr
Inria, Parietal Team, Université Paris-Saclay

Huawei - Apr. 2017

SLIDE 2

Outline

  • What is the Lasso
  • Lasso with an orthogonal design
  • From projected gradient to proximal gradient
  • Optimality conditions and subgradients (LARS algo.)
  • Coordinate descent algorithm

… with some demos

SLIDE 3

Lasso

x* ∈ argmin_x  ½ ‖b − Ax‖² + λ ‖x‖₁

with A ∈ ℝ^{n×p}, λ > 0, and ‖x‖₁ = Σ_{i=1}^p |x_i|.

  • Commonly attributed to [Tibshirani 96] (> 19000 citations)
  • Also known as Basis Pursuit Denoising [Chen 95] (> 9000 c.)
  • Convex way of promoting sparsity in high dimensional regression / inverse problems
  • Can lead to statistical guarantees even if n ≈ log(p)

SLIDE 4

Algorithm 0

Using the CVX Toolbox (http://cvxr.com/cvx/):

n = 10;
A = randn(n/2, n);
b = randn(n/2, 1);
gamma = 1;
cvx_begin
    variable x(n)
    % squared loss, to match 1/2 ||b - Ax||^2 + lambda ||x||_1
    minimize(0.5 * sum_square(A*x - b) + gamma * norm(x, 1))
cvx_end
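For readers working in Python, a minimal CVXPY sketch of the same problem (an added illustration, not part of the original slides; CVXPY is assumed to be installed):

# Hypothetical CVXPY equivalent of the CVX snippet above.
import numpy as np
import cvxpy as cp

n = 10
A = np.random.randn(n // 2, n)
b = np.random.randn(n // 2)
lam = 1.0

x = cp.Variable(n)
objective = cp.Minimize(0.5 * cp.sum_squares(A @ x - b) + lam * cp.norm1(x))
cp.Problem(objective).solve()
x_star = x.value   # Lasso solution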

SLIDE 5

Algorithm 1

Rewrite each coordinate with its positive and negative parts:

x_i = x_i^+ − x_i^-,  |x_i| = x_i^+ + x_i^-,  with x_i^+ = max(x_i, 0), x_i^- = max(−x_i, 0)

so that ‖x‖₁ = Σ_i (x_i^+ + x_i^-). With z = (x^+, x^-) this leads to:

z* ∈ argmin_{z ∈ ℝ^{2p}_+}  ½ ‖b − [A, −A] z‖² + λ Σ_i z_i

  • This is a simple smooth convex optimization problem with positivity constraints (convex constraints)

SLIDE 6

Gradient Descent

min_{x ∈ ℝ^p} f(x),  with f smooth with L-Lipschitz gradient:

‖∇f(x) − ∇f(y)‖ ≤ L ‖x − y‖

Gradient descent reads:

x^{k+1} = x^k − (1/L) ∇f(x^k)
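As a minimal illustration (added here, not from the slides): gradient descent on the smooth least-squares term f(x) = ½ ‖b − Ax‖² alone, with L = ‖AᵀA‖₂ and synthetic data.

# Gradient descent on f(x) = 0.5 * ||b - Ax||^2 (smooth part only).
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 20
A = rng.standard_normal((n, p))
b = rng.standard_normal(n)

L = np.linalg.norm(A, ord=2) ** 2        # Lipschitz constant of the gradient, ||A^T A||_2
x = np.zeros(p)
for _ in range(500):
    grad = A.T @ (A @ x - b)             # grad f(x)
    x -= grad / L                        # x_{k+1} = x_k - (1/L) grad f(x_k)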

SLIDE 7

Projected Gradient Descent

min_{x ∈ C ⊂ ℝ^p} f(x),  with C a convex set and f smooth with L-Lipschitz gradient.

Projected gradient reads:

x^{k+1} = π_C(x^k − (1/L) ∇f(x^k))

where π_C(x) is the orthogonal projection of x onto C.

SLIDE 8

demo_grad_proj.ipynb
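The notebook itself is not included in this transcript; below is a minimal sketch in its spirit (an assumption about its content): projected gradient applied to the nonnegative reformulation of slide 5, with z ∈ ℝ^{2p}_+ and the Lasso estimate recovered as x = z[:p] − z[p:].

# Projected gradient for the Lasso via the nonnegative reformulation
#   min_{z >= 0}  0.5 * ||b - [A, -A] z||^2 + lam * sum(z)
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 20
A = rng.standard_normal((n, p))
b = rng.standard_normal(n)
lam = 0.5

B = np.hstack([A, -A])                   # design matrix for z = (x^+, x^-)
L = np.linalg.norm(B, ord=2) ** 2        # Lipschitz constant of the smooth part
z = np.zeros(2 * p)
for _ in range(1000):
    grad = B.T @ (B @ z - b) + lam       # gradient of the (smooth) objective in z
    z = np.maximum(z - grad / L, 0.0)    # projection onto the nonnegative orthant

x_hat = z[:p] - z[p:]                    # Lasso estimate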

SLIDE 9

What if A is orthogonal?

  • Let's assume we have a square orthogonal design matrix: AᵀA = AAᵀ = I_p

One has:  ‖b − Ax‖² = ‖Aᵀb − x‖²

So the Lasso boils down to minimizing:

x* = argmin_{x ∈ ℝ^p}  ½ ‖Aᵀb − x‖² + λ ‖x‖₁
   = argmin_{x ∈ ℝ^p}  Σ_{i=1}^p ( ½ ((Aᵀb)_i − x_i)² + λ |x_i| )    (p 1d problems)

x* ≜ prox_{λ‖·‖₁}(Aᵀb)    (definition of the proximal operator)

SLIDE 10

Proximal operator of the L1 norm

  • The solution of  min_x ½ (c − x)² + λ |x|  is the soft-thresholding:

S_λ(c) = sign(c) (|c| − λ)₊

[Plot: the soft-thresholding function S_λ(c) as a function of c, equal to 0 on [−λ, λ]]

SLIDE 11

Algorithm with A orthogonal

import numpy as np

# A, b and lambd (the regularization parameter) are assumed defined.
# Closed-form Lasso solution for an orthogonal design: soft-threshold A^T b.
c = A.T.dot(b)
x_star = np.sign(c) * np.maximum(np.abs(c) - lambd, 0.)

SLIDE 12

What if A is NOT orthogonal?

Let us define:  f(x) = ½ ‖b − Ax‖²

Leads to:  ∇f(x) = Aᵀ(Ax − b)

The Lipschitz constant of the gradient:  L = ‖AᵀA‖₂

Minimizing a quadratic upper bound of f at the previous iterate:

x^{k+1} = argmin_{x ∈ ℝ^p}  f(x^k) + (x − x^k)ᵀ ∇f(x^k) + (L/2) ‖x − x^k‖² + λ ‖x‖₁

⇔  x^{k+1} = argmin_{x ∈ ℝ^p}  ½ ‖x − (x^k − (1/L) ∇f(x^k))‖² + (λ/L) ‖x‖₁

SLIDE 13

Algorithm 2: Proximal gradient descent

That we can rewrite:

x^{k+1} = argmin_{x ∈ ℝ^p}  ½ ‖x − (x^k − (1/L) ∇f(x^k))‖² + (λ/L) ‖x‖₁
        = prox_{(λ/L)‖·‖₁}(x^k − (1/L) ∇f(x^k))

[Daubechies et al. 2004, Combettes et al. 2005]

Remark: There exist so-called "accelerated" methods known as FISTA, Nesterov acceleration…

Remark: If f is not strongly convex, f(x^k) − f(x*) = O(1/k), very far from the exponential rate of GD with strong convexity.

SLIDE 14

Proximal gradient

import numpy as np

# A, b and max_iter are assumed defined.
alpha = 0.1  # lambda parameter
L = np.linalg.norm(A, ord=2) ** 2  # Lipschitz constant ||A^T A||_2
x = np.zeros(A.shape[1])
for i in range(max_iter):
    x += (1. / L) * np.dot(A.T, b - np.dot(A, x))              # gradient step
    x = np.sign(x) * np.maximum(np.abs(x) - (alpha / L), 0)    # soft-thresholding
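The "accelerated" variant mentioned on the previous slide (FISTA) only adds an extrapolation step. A minimal sketch is given below (added illustration, not from the slides), assuming A, b, alpha, L and max_iter are defined as in the snippet above.

# FISTA: the proximal gradient step above plus Nesterov-style extrapolation.
import numpy as np

x = np.zeros(A.shape[1])
y = x.copy()
t = 1.0
for _ in range(max_iter):
    x_old = x.copy()
    v = y + (1. / L) * np.dot(A.T, b - np.dot(A, y))            # gradient step at y
    x = np.sign(v) * np.maximum(np.abs(v) - (alpha / L), 0)     # soft-thresholding
    t_old = t
    t = (1. + np.sqrt(1. + 4. * t_old ** 2)) / 2.
    y = x + ((t_old - 1.) / t) * (x - x_old)                    # extrapolation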

SLIDE 15

demo_grad_proximal.ipynb

SLIDE 16

Pros of proximal gradient

  • First order method (only requires computing gradients)
  • Scalable even if p is large (but A needs to be stored in memory)
  • Great if A is an implicit linear operator (Fourier, Wavelet, MDCT, etc.), as matrix-vector products then have quasi-linear complexity, e.g. O(p log p) (see the sketch after this list)
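To illustrate the last point, here is a hypothetical sketch (added, not from the slides) of proximal gradient where A is an orthonormal DCT applied implicitly, so A is never stored and each product costs O(p log p); scipy.fft is assumed available. With an orthogonal transform the closed form of slide 11 applies directly, so the point here is only the matrix-free pattern, which carries over to more general implicit operators.

# Proximal gradient with an implicit operator: A x = idct(x), A^T r = dct(r)
# (orthonormal DCT), so A is never formed explicitly.
import numpy as np
from scipy.fft import dct, idct

rng = np.random.default_rng(0)
p = 1024
x_true = np.zeros(p)
x_true[rng.choice(p, size=10, replace=False)] = rng.standard_normal(10)
b = idct(x_true, norm="ortho") + 0.01 * rng.standard_normal(p)   # b = A x_true + noise

lam = 0.05
L = 1.0                                  # orthonormal transform, so ||A^T A||_2 = 1
x = np.zeros(p)
for _ in range(200):
    grad = dct(idct(x, norm="ortho") - b, norm="ortho")          # A^T (A x - b)
    v = x - grad / L
    x = np.sign(v) * np.maximum(np.abs(v) - lam / L, 0.0)        # soft-thresholding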

SLIDE 17

Subgradient and subdifferential

The subdifferential of f at x₀ is:

∂f(x₀) = {g ∈ ℝⁿ : f(x) − f(x₀) ≥ gᵀ(x − x₀), ∀x ∈ ℝⁿ}

Properties

  • The subdifferential is a convex set
  • x₀ is a minimizer of f if 0 ∈ ∂f(x₀)

Exercise: what is ∂|·|(0)?
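To make the definition concrete, here is a small numerical check (added, not from the slides) of the subgradient inequality for f = |·| at x₀ = 0 with a single candidate g; the exercise itself is left open.

# Check that g = 0.5 satisfies f(x) - f(x0) >= g * (x - x0) for f = |.| and x0 = 0.
import numpy as np

x0, g = 0.0, 0.5
xs = np.linspace(-3.0, 3.0, 1001)
assert np.all(np.abs(xs) - np.abs(x0) >= g * (xs - x0) - 1e-12)  # g is a subgradient at 0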

SLIDE 18

Path of solutions

Lemma [Fuchs 97]: Let x* be a solution of the Lasso

x* ∈ argmin_x  ½ ‖b − Ax‖² + λ ‖x‖₁

and let I = {i s.t. x*_i ≠ 0} be its support. Then:

A_Iᵀ (Ax* − b) + λ sign(x*_I) = 0

‖A_{I^c}ᵀ (Ax* − b)‖_∞ ≤ λ

And also:

x*_I = (A_Iᵀ A_I)⁻¹ (A_Iᵀ b − λ sign(x*_I))
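A minimal sketch (added, not from the slides) that checks these conditions numerically: run the proximal gradient iteration of slide 14 until (approximate) convergence, then test the equality on the support I and the bound on I^c.

# Numerical check of the optimality conditions above on an ISTA solution.
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 30
A = rng.standard_normal((n, p))
b = rng.standard_normal(n)
lam = 10.0

L = np.linalg.norm(A, ord=2) ** 2
x = np.zeros(p)
for _ in range(10000):
    v = x - (1. / L) * (A.T @ (A @ x - b))
    x = np.sign(v) * np.maximum(np.abs(v) - lam / L, 0.0)

corr = A.T @ (A @ x - b)                          # A^T (A x* - b)
I = np.abs(x) > 1e-10                             # support of x*
print(np.max(np.abs(corr[I] + lam * np.sign(x[I])), initial=0.0))  # approx. 0 on the support
print(np.max(np.abs(corr[~I]), initial=0.0) <= lam + 1e-6)         # True: <= lambda off the support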

SLIDE 19

Algorithm 3: Homotopy and LARS

x*_I = (A_Iᵀ A_I)⁻¹ (A_Iᵀ b − λ sign(x*_I))

The idea is to compute the full path of solutions, noticing that for a given sparsity / sign pattern the solution is an affine function of λ. The LARS algorithm [Osborne 2000, Efron et al. 2004] consists in finding the breakpoints along the path.

SLIDE 20

Lasso path with LARS algorithm

SLIDE 21

Pros/Cons of LARS

Pros:
  • Gives the full path of solutions
  • Fast when the support is small and one can compute the Gram matrix
  • Scales with the size of the support

Cons:
  • Hard to make it numerically stable
  • One can have many, many breakpoints [Mairal et al. 2012]

SLIDE 22

demo_lasso_lars.ipynb
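The notebook is not reproduced here; below is a minimal sketch of what such a demo might contain (an assumption on my part), using scikit-learn's lars_path. Note that scikit-learn applies its own scaling conventions to the regularization parameter.

# Full Lasso path with the LARS algorithm via scikit-learn.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)
n, p = 50, 20
A = rng.standard_normal((n, p))
b = rng.standard_normal(n)

alphas, active, coefs = lars_path(A, b, method="lasso")  # breakpoints and coefficient path
plt.plot(alphas, coefs.T)   # each coefficient path is piecewise linear in the regularization
plt.xlabel("alpha")
plt.ylabel("coefficients")
plt.show()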

SLIDE 23

Coordinate descent (CD)

Limitation of proximal gradient descent: if L is big we make tiny steps!

x^{k+1} = prox_{(λ/L)‖·‖₁}(x^k − (1/L) ∇f(x^k))

The idea of coordinate descent (CD) is to update one coefficient at a time (also known as univariate relaxation in optimization, or Gauss-Seidel's method).

Hope: make bigger steps. Spoiler: it is the state of the art for machine learning problems (cf. the GLMNET R package, scikit-learn) [Friedman et al. 2009]

SLIDE 24

Coordinate descent (CD)

SLIDE 25

Coordinate descent (CD)

[Figure: 2D example with coordinate descent iterates x⁰ and x¹ = x², where the iterates stop moving]

Warning: it does not always work!

SLIDE 26

Algorithm 4: Coordinate descent (CD)

Since the regularization ‖x‖₁ = Σ_{i=1}^p |x_i| is a separable function, CD works for the Lasso [Tseng 2001].

The proximal coordinate descent algorithm works: we make bigger steps since L_i ≪ L.

for k = 1 … K:
    i = (k mod p) + 1
    x_i^{k+1} = prox_{(λ/L_i)|·|}(x_i^k − (1/L_i) (∇f(x^k))_i)
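A minimal proximal coordinate descent sketch (added illustration, not the slides' exact code), keeping the residual r = b − Ax up to date so each coordinate update costs a single column dot product:

# Proximal (cyclic) coordinate descent for the Lasso, with residual updates.
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 30
A = rng.standard_normal((n, p))
b = rng.standard_normal(n)
lam = 1.0

Li = np.sum(A ** 2, axis=0)            # per-coordinate Lipschitz constants ||a_i||^2
x = np.zeros(p)
r = b - A @ x                          # residual, kept up to date
for k in range(100 * p):
    i = k % p
    grad_i = -A[:, i] @ r              # i-th coordinate of grad f(x) = A^T (A x - b)
    v = x[i] - grad_i / Li[i]
    x_new = np.sign(v) * max(abs(v) - lam / Li[i], 0.0)   # soft-thresholding
    r += A[:, i] * (x[i] - x_new)      # update the residual for the changed coordinate
    x[i] = x_new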

SLIDE 27

Algorithm 4: Coordinate descent (CD)

  • There exist many "tricks" to make CD fast for the Lasso
  • Lazy update of the residuals
  • Pre-computation of certain dot products
  • Active set methods
  • Screening rules
  • More in the next talk…

SLIDE 28

Conclusion

  • What is the Lasso
  • Lasso with an orthogonal design
  • From projected gradient to proximal gradient
  • Optimality conditions and subgradients (LARS algo.)
  • Coordinate descent algorithm

SLIDE 29

Contact

GitHub: @agramfort
Twitter: @agramfort
http://alexandre.gramfort.net