A practical tour of optimization algorithms for the Lasso


  1. A practical tour of optimization algorithms for the Lasso
     Alexandre Gramfort, alexandre.gramfort@inria.fr
     Inria, Parietal Team, Université Paris-Saclay
     Huawei - Apr. 2017

  2. Outline
     • What is the Lasso
     • Lasso with an orthogonal design
     • From projected gradient to proximal gradient
     • Optimality conditions and subgradients (LARS algo.)
     • Coordinate descent algorithm
     … with some demos

  3. Lasso

     x^* \in \operatorname*{argmin}_{x \in \mathbb{R}^p} \frac{1}{2} \|b - Ax\|_2^2 + \lambda \|x\|_1,
     \quad \|x\|_1 = \sum_{i=1}^p |x_i|, \quad A \in \mathbb{R}^{n \times p}, \; \lambda > 0

     • Commonly attributed to [Tibshirani 96] (> 19000 citations)
     • Also known as Basis Pursuit Denoising [Chen 95] (> 9000 citations)
     • A convex way of promoting sparsity in high-dimensional regression / inverse problems
     • Can lead to statistical guarantees even if n ≈ log(p)

  4. Algorithm 0: using the CVX toolbox

     % Solve the Lasso with CVX (squared loss, matching the objective on slide 3)
     n = 10;
     A = randn(n/2, n);
     b = randn(n/2, 1);
     gamma = 1;
     cvx_begin
         variable x(n)
         minimize(0.5 * sum_square(A * x - b) + gamma * norm(x, 1))
     cvx_end

     http://cvxr.com/cvx/
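For Python users, here is a minimal equivalent sketch with cvxpy (not part of the talk; it assumes cvxpy and numpy are installed, and the variable names simply mirror the MATLAB snippet above):

    import numpy as np
    import cvxpy as cp

    n = 10
    rng = np.random.default_rng(0)
    A = rng.standard_normal((n // 2, n))
    b = rng.standard_normal(n // 2)
    gamma = 1.0

    x = cp.Variable(n)
    objective = cp.Minimize(0.5 * cp.sum_squares(A @ x - b) + gamma * cp.norm1(x))
    cp.Problem(objective).solve()
    print(x.value)  # the Lasso solution, typically with several exact zeros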

  5. Algorithm 1

     Rewrite each coordinate as x_i = x_i^+ - x_i^-, with
     x_i^+ = \max(x_i, 0), \quad x_i^- = \max(-x_i, 0),
     so that |x_i| = x_i^+ + x_i^- and \|x\|_1 = \sum_i (x_i^+ + x_i^-).

     Leads to:

     z^* \in \operatorname*{argmin}_{z \in \mathbb{R}_+^{2p}} \frac{1}{2} \|b - [A, -A] z\|_2^2 + \lambda \sum_i z_i

     • This is a simple smooth convex optimization problem with positivity constraints (convex constraints)

  6. Gradient descent

     \min_{x \in \mathbb{R}^p} f(x)

     With f smooth with L-Lipschitz gradient: \|\nabla f(x) - \nabla f(y)\| \le L \|x - y\|

     Gradient descent reads: x_{k+1} = x_k - \frac{1}{L} \nabla f(x_k)

  7. Projected gradient descent

     \min_{x \in C \subset \mathbb{R}^p} f(x)

     With C a convex set and f smooth with L-Lipschitz gradient, projected gradient reads:

     x_{k+1} = \pi_C\left(x_k - \frac{1}{L} \nabla f(x_k)\right)

     where \pi_C is the orthogonal projection onto C.
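Putting slides 5-7 together, here is a minimal numpy sketch (my own, not the talk's code) of projected gradient applied to the nonnegative reformulation on z ∈ R_+^{2p}; the function name and step-size choice are assumptions for illustration:

    import numpy as np

    def lasso_projected_gradient(A, b, lam, max_iter=500):
        """Solve the Lasso via projected gradient on the [A, -A] split."""
        n, p = A.shape
        B = np.hstack([A, -A])                 # design of the nonnegative problem
        L = np.linalg.norm(B, 2) ** 2          # Lipschitz constant of the gradient
        z = np.zeros(2 * p)
        for _ in range(max_iter):
            grad = -B.T @ (b - B @ z) + lam    # smooth-part gradient + lam * ones
            z = np.maximum(z - grad / L, 0.0)  # projection onto the positive orthant
        return z[:p] - z[p:]                   # recover x = x^+ - x^-

Note that the penalty term is linear in z, so the whole objective is smooth and the positive orthant carries all the nonsmoothness; the projection is just a componentwise max with 0.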

  8. demo_grad_proj.ipynb

  9. What if A is orthogonal?

     • Let's assume we have a square orthogonal design matrix: A^T A = A A^T = I_p

     One has: \|b - Ax\|^2 = \|A^T b - x\|^2

     So the Lasso boils down to minimizing:

     x^* = \operatorname*{argmin}_{x \in \mathbb{R}^p} \frac{1}{2} \|A^T b - x\|^2 + \lambda \|x\|_1

     x^* = \operatorname*{argmin}_{x \in \mathbb{R}^p} \sum_{i=1}^p \left( \frac{1}{2} ((A^T b)_i - x_i)^2 + \lambda |x_i| \right)   (p 1-d problems)

     x^* = \operatorname{prox}_{\lambda \|\cdot\|_1}(A^T b)   (definition of the proximal operator)

  10. Proximal operator of the L1 norm

     The soft-thresholding: S_\lambda(c) = \operatorname{sign}(c) (|c| - \lambda)_+

     • S_\lambda(c) is the solution of \min_x \frac{1}{2} (c - x)^2 + \lambda |x|
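A quick sanity check of this claim (a sketch, assuming numpy; the grid search is only for illustration): soft-thresholding should match a brute-force minimization of the 1-d objective.

    import numpy as np

    def soft_threshold(c, lam):
        """prox of lam * |.| : sign(c) * max(|c| - lam, 0)."""
        return np.sign(c) * np.maximum(np.abs(c) - lam, 0.0)

    c, lam = 1.3, 0.5
    xs = np.linspace(-3, 3, 100001)                      # dense grid
    objective = 0.5 * (c - xs) ** 2 + lam * np.abs(xs)   # the 1-d prox objective
    print(soft_threshold(c, lam))     # 0.8
    print(xs[np.argmin(objective)])   # ~0.8, agreeing up to grid resolution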

  11. Algorithm with A orthogonal

     import numpy as np

     c = A.T.dot(b)
     x_star = np.sign(c) * np.maximum(np.abs(c) - lambd, 0.)  # soft-thresholding at level lambd

  12. What if A is NOT orthogonal?

     Let us define: f(x) = \frac{1}{2} \|b - Ax\|^2

     Leads to: \nabla f(x) = -A^T (b - Ax)

     The Lipschitz constant of the gradient: L = \|A^T A\|_2

     Quadratic upper bound of f at the previous iterate:

     x_{k+1} = \operatorname*{argmin}_{x \in \mathbb{R}^p} f(x_k) + (x - x_k)^T \nabla f(x_k) + \frac{L}{2} \|x - x_k\|^2 + \lambda \|x\|_1

     \Leftrightarrow \quad x_{k+1} = \operatorname*{argmin}_{x \in \mathbb{R}^p} \frac{1}{2} \left\|x - \left(x_k - \frac{1}{L} \nabla f(x_k)\right)\right\|^2 + \frac{\lambda}{L} \|x\|_1

  13. Algorithm 2: proximal gradient descent

     That we can rewrite:

     x_{k+1} = \operatorname*{argmin}_{x \in \mathbb{R}^p} \frac{1}{2} \left\|x - \left(x_k - \frac{1}{L} \nabla f(x_k)\right)\right\|^2 + \frac{\lambda}{L} \|x\|_1
             = \operatorname{prox}_{\frac{\lambda}{L} \|\cdot\|_1}\left(x_k - \frac{1}{L} \nabla f(x_k)\right)

     [Daubechies et al. 2004, Combettes et al. 2005]

     Remark: if f is not strongly convex, f(x_k) - f(x^*) = O(1/k), very far from the exponential rate of GD under strong convexity.

     Remark: there exist so-called "accelerated" methods known as FISTA, Nesterov acceleration…

  14. Proximal gradient

     import numpy as np
     from scipy import linalg

     alpha = 0.1                      # the regularization parameter lambda
     L = linalg.norm(A, 2) ** 2       # Lipschitz constant: squared spectral norm of A
     x = np.zeros(A.shape[1])
     for i in range(max_iter):
         x += (1. / L) * np.dot(A.T, b - np.dot(A, x))            # gradient step
         x = np.sign(x) * np.maximum(np.abs(x) - alpha / L, 0.)   # prox (soft-thresholding) step
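The accelerated variant mentioned on the previous slide can be sketched as follows (a minimal FISTA implementation; the function and variable names are mine, not from the talk):

    import numpy as np

    def fista_lasso(A, b, lam, max_iter=500):
        """FISTA (accelerated proximal gradient) for the Lasso: O(1/k^2) rate."""
        p = A.shape[1]
        L = np.linalg.norm(A, 2) ** 2     # Lipschitz constant of the gradient
        x = np.zeros(p)
        y = x.copy()                      # extrapolated point
        t = 1.0
        for _ in range(max_iter):
            grad = -A.T @ (b - A @ y)
            x_new = y - grad / L          # gradient step taken at y, not x
            x_new = np.sign(x_new) * np.maximum(np.abs(x_new) - lam / L, 0.0)  # prox
            t_new = (1.0 + np.sqrt(1.0 + 4.0 * t ** 2)) / 2.0
            y = x_new + ((t - 1.0) / t_new) * (x_new - x)  # Nesterov extrapolation
            x, t = x_new, t_new
        return x

The only change from the loop above is the extrapolation step on y, which is what improves the O(1/k) rate to O(1/k^2).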

  15. demo_grad_proximal.ipynb

  16. Pros of proximal gradient

     • First-order method (only requires computing gradients)
     • Scales even when p is large (though it needs to store A in memory)
     • Great if A is an implicit linear operator (Fourier, Wavelet, MDCT, etc.), as matrix-vector products then have quasi-linear (e.g. O(n log n)) complexity

  17. Subgradient and subdifferential

     The subdifferential of f at x_0 is:

     \partial f(x_0) = \{ g \in \mathbb{R}^n : f(x) - f(x_0) \ge g^T (x - x_0), \ \forall x \in \mathbb{R}^n \}

     Properties
     • The subdifferential is a convex set
     • x_0 is a minimizer of f if and only if 0 \in \partial f(x_0)

     Exercise: what is \partial |\cdot|(0)?
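For reference (spoiler for the exercise), the definition above gives a one-line answer:

    \partial |\cdot|(0) = \{ g \in \mathbb{R} : |x| \ge g \, x, \ \forall x \in \mathbb{R} \} = [-1, 1]

This interval is exactly what makes 0 a minimizer of the prox objective whenever |c| ≤ λ, i.e., why soft-thresholding sets small coefficients to zero.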

  18. Path of solutions

     Lemma [Fuchs 97]: let x^* be a solution of the Lasso

     x^* \in \operatorname*{argmin}_{x} \frac{1}{2} \|b - Ax\|^2 + \lambda \|x\|_1

     and let I = \{ i \ \text{s.t.} \ x_i^* \ne 0 \} be its support. Then:

     A_I^T (A x^* - b) + \lambda \operatorname{sign}(x_I^*) = 0
     \|A_{I^c}^T (A x^* - b)\|_\infty \le \lambda

     And also: x_I^* = (A_I^T A_I)^{-1} (A_I^T b - \lambda \operatorname{sign}(x_I^*))
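These optimality conditions are easy to verify numerically. A sketch with scikit-learn (an assumption of mine, not the talk's code; note that sklearn's Lasso minimizes (1/(2n))||b - Ax||^2 + alpha ||x||_1, so the talk's lambda equals n * alpha):

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    n, p = 50, 100
    A = rng.standard_normal((n, p))
    b = rng.standard_normal(n)

    alpha = 0.1
    lam = n * alpha                  # lambda in this talk's scaling
    x = Lasso(alpha=alpha, fit_intercept=False,
              tol=1e-12, max_iter=100000).fit(A, b).coef_

    grad = A.T @ (A @ x - b)
    I = x != 0
    # first condition, up to solver tolerance
    print(np.allclose(grad[I], -lam * np.sign(x[I]), atol=1e-5))
    # second condition on the complement of the support
    print(np.max(np.abs(grad[~I])) <= lam + 1e-8)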

  19. Algorithm 3: homotopy and LARS

     The idea is to compute the full path of solutions, noticing that for a given sparsity / sign pattern the solution is affine in λ:

     x_I^* = (A_I^T A_I)^{-1} (A_I^T b - \lambda \operatorname{sign}(x_I^*))

     The LARS algorithm [Osborne 2000, Efron et al. 2004] consists in finding the breakpoints along the path.

  20. Lasso path with LARS algorithm
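A path plot like the one on this slide can be reproduced with scikit-learn's lars_path (a sketch; the synthetic data and all variable names here are my own):

    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn.linear_model import lars_path

    rng = np.random.default_rng(0)
    n, p = 50, 10
    A = rng.standard_normal((n, p))
    w_true = rng.standard_normal(p) * (rng.random(p) < 0.3)  # sparse ground truth
    b = A @ w_true + 0.1 * rng.standard_normal(n)

    alphas, active, coefs = lars_path(A, b, method='lasso')  # breakpoints + coefs
    plt.plot(alphas, coefs.T)   # one piecewise-linear path per coefficient
    plt.xlabel('lambda')
    plt.ylabel('coefficients')
    plt.show()

Between two consecutive breakpoints each coefficient path is a straight line, exactly as the affine formula on slide 19 predicts.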

  21. Pros/cons of LARS

     Pros:
     • Gives the full path of solutions
     • Fast when the support is small and one can compute the Gram matrix

     Cons:
     • Scales with the size of the support
     • Hard to make numerically stable
     • One can have many, many breakpoints [Mairal et al. 2012]

  22. demo_lasso_lars.ipynb

  23. Coordinate descent (CD)

     Limitation of proximal gradient descent:

     x_{k+1} = \operatorname{prox}_{\frac{\lambda}{L} \|\cdot\|_1}\left(x_k - \frac{1}{L} \nabla f(x_k)\right)

     If L is big we make tiny steps!

     The idea of coordinate descent (CD) is to update one coefficient at a time (also known as univariate relaxation in optimization, or the Gauss-Seidel method). Hope: make bigger steps.

     Spoiler: it is the state of the art for machine learning problems (cf. the GLMNET R package, scikit-learn) [Friedman et al. 2009]

  24. Coordinate descent (CD)

  25. Coordinate descent (CD)

     [Figure: CD iterates on a nonsmooth objective stall at a non-optimal point x_0 on the line x_1 = x_2]

     Warning: it does not always work!

  26. Algorithm 4: coordinate descent (CD)

     Since the regularization \|x\|_1 = \sum_{i=1}^p |x_i| is a separable function, CD works for the Lasso [Tseng 2001].

     The proximal coordinate descent algorithm works:

     for k = 1, …, K:
         i = (k \bmod p) + 1
         x_i^{k+1} = \operatorname{prox}_{\frac{\lambda}{L_i} |\cdot|}\left( x_i^k - \frac{1}{L_i} (\nabla f(x^k))_i \right)

     Since L_i \ll L, we make bigger steps! (A concrete sketch follows below.)
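A minimal numpy sketch of this loop (my own implementation, not the talk's; it uses the residual-update trick previewed on the next slide, and L_i = ||a_i||^2 is the coordinate-wise Lipschitz constant):

    import numpy as np

    def lasso_cd(A, b, lam, n_sweeps=100):
        """Cyclic proximal coordinate descent for the Lasso."""
        n, p = A.shape
        L = (A ** 2).sum(axis=0)     # L_i = ||a_i||^2 for each column a_i
        x = np.zeros(p)
        r = b - A @ x                # residual, kept up to date lazily
        for _ in range(n_sweeps):
            for i in range(p):
                if L[i] == 0.0:
                    continue
                old = x[i]
                c = old + A[:, i] @ r / L[i]   # 1-d gradient step on coordinate i
                x[i] = np.sign(c) * max(abs(c) - lam / L[i], 0.0)  # prox of (lam/L_i)|.|
                if x[i] != old:
                    r -= A[:, i] * (x[i] - old)  # O(n) residual update, no full recompute
        return x

Each update costs O(n) thanks to the residual bookkeeping, instead of the O(np) a naive recomputation of the gradient would require.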

  27. Algorithm 4: coordinate descent (CD)

     • There exist many "tricks" to make CD fast for the Lasso:
       • Lazy updates of the residuals
       • Pre-computation of certain dot products
       • Active set methods
       • Screening rules
     • More in the next talk…

  28. Conclusion
     • What is the Lasso
     • Lasso with an orthogonal design
     • From projected gradient to proximal gradient
     • Optimality conditions and subgradients (LARS algo.)
     • Coordinate descent algorithm

  29. Contact
     http://alexandre.gramfort.net
     GitHub: @agramfort
     Twitter: @agramfort
