Learning step sizes for unfolded sparse coding


  1. Learning step sizes for unfolded sparse coding. Thomas Moreau, INRIA Saclay. Joint work with Pierre Ablin, Mathurin Massias, and Alexandre Gramfort.

  2. Electrophysiology: magnetoencephalography and electroencephalography.

  3. Inverse problems. Maxwell's equations relate the electrical activity $z$ to the observed signal $x$ through the forward model $x = Dz$.

  4. Inverse problems. Forward model: $x = Dz$. Inverse problem: recover $z = f(x)$, which is ill-posed.

  5. Inverse problems. Forward model: $x = Dz$; inverse problem: $z = f(x)$ (ill-posed). This is handled by optimization with a regularization $R$ encoding prior knowledge:
$$z^* \in \arg\min_z \tfrac{1}{2}\|x - Dz\|^2 + R(z).$$
Example: sparsity with $R = \lambda\|\cdot\|_1$.

  6. Other inverse problems: ultrasound imaging, fMRI (compressed sensing), astrophysics.

  7. Some challenges for inverse problems.
  • Evaluation: often there is no ground truth. In neuroscience we cannot access the brain's electrical activity, so how do we evaluate how well it is reconstructed? This is an open problem in unsupervised learning.
  • Modeling: how to better account for the signal structure. An $\ell_2$ reconstruction metric does not account for localization; optimal transport could help in this case.
  • Computation: solving these problems can take too long. Many problems share the same forward operator $D$; can we exploit the structure of the problem? This is today's talk topic.

  8. Better step sizes for the Iterative Shrinkage-Thresholding Algorithm (ISTA).

  9. The Lasso. For a fixed design matrix $D \in \mathbb{R}^{n \times m}$ and $\lambda > 0$, the Lasso for $x \in \mathbb{R}^n$ is
$$z^* = \arg\min_z F_x(z) = \underbrace{\tfrac{1}{2}\|x - Dz\|_2^2}_{f_x(z)} + \lambda\|z\|_1,$$
a.k.a. sparse coding or sparse linear regression. We are interested in the over-complete case where $m > n$.

  10. The Lasso: properties. The problem is convex in $z$ but not strongly convex in general, and $z = 0$ is a solution if and only if $\lambda \ge \lambda_{\max} := \|D^\top x\|_\infty$.
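
The objective and the $\lambda_{\max}$ threshold are easy to check numerically. A minimal sketch in Python (the sizes and the `lasso_objective` helper are illustrative, not from the slides):

```python
import numpy as np

def lasso_objective(D, x, z, lam):
    """F_x(z) = 0.5 * ||x - D z||_2^2 + lam * ||z||_1."""
    residual = x - D @ z
    return 0.5 * (residual @ residual) + lam * np.abs(z).sum()

rng = np.random.default_rng(0)
n, m = 20, 50                      # over-complete case: m > n
D = rng.standard_normal((n, m))
x = rng.standard_normal(n)

lam_max = np.abs(D.T @ x).max()    # lambda_max = ||D^T x||_inf
# For lam >= lam_max the zero vector is a solution; below lam_max it is not.
print(lasso_objective(D, x, np.zeros(m), lam_max))
```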

  11. ISTA [Daubechies et al. 2004]: Iterative Shrinkage-Thresholding Algorithm. $f_x$ is an $L$-smooth function with $L = \|D\|_2^2$ and $\nabla f_x(z^{(t)}) = D^\top(Dz^{(t)} - x)$. The $\ell_1$-norm is proximable, with a separable proximal operator
$$\mathrm{prox}_{\mu\|\cdot\|_1}(x) = \mathrm{sign}(x)\max(0, |x| - \mu) = \mathrm{ST}(x, \mu).$$
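
A direct transcription of the soft-thresholding operator, applied element-wise (a small sketch; the example values are arbitrary):

```python
import numpy as np

def soft_thresholding(x, mu):
    """ST(x, mu) = sign(x) * max(|x| - mu, 0), the prox of mu * ||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - mu, 0.0)

print(soft_thresholding(np.array([-1.5, -0.2, 0.0, 0.4, 2.0]), 0.5))
# -> [-1. -0.  0.  0.  1.5]
```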

  12. ISTA (continued). We can use the proximal gradient descent algorithm (ISTA):
$$z^{(t+1)} = \mathrm{ST}\big(z^{(t)} - \rho\,\nabla f_x(z^{(t)}),\ \rho\lambda\big), \qquad \nabla f_x(z^{(t)}) = D^\top(Dz^{(t)} - x).$$
Here, $\rho$ plays the role of a step size (in $[0, 2/L[$).
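
A minimal ISTA loop following this update, with the usual choice $\rho = 1/L$ (a sketch; the function name and defaults are mine):

```python
import numpy as np

def ista(D, x, lam, n_iter=100):
    """Proximal gradient descent on the Lasso with step size 1/L, L = ||D||_2^2."""
    L = np.linalg.norm(D, ord=2) ** 2          # squared largest singular value of D
    z = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ z - x)               # gradient of the data-fit term f_x
        u = z - grad / L
        z = np.sign(u) * np.maximum(np.abs(u) - lam / L, 0.0)   # ST(u, lam / L)
    return z
```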

  13. ISTA: majorization-minimization. Taylor expansion of $f_x$ at $z^{(t)}$:
$$F_x(z) = f_x(z^{(t)}) + \nabla f_x(z^{(t)})^\top(z - z^{(t)}) + \tfrac{1}{2}\|D(z - z^{(t)})\|_2^2 + \lambda\|z\|_1 \le f_x(z^{(t)}) + \nabla f_x(z^{(t)})^\top(z - z^{(t)}) + \tfrac{L}{2}\|z - z^{(t)}\|_2^2 + \lambda\|z\|_1.$$
This amounts to replacing the Hessian $D^\top D$ by $L\,\mathrm{Id}$. The majorant is a separable function that can be minimized in closed form:
$$\arg\min_z \tfrac{L}{2}\Big\|z^{(t)} - \tfrac{1}{L}\nabla f_x(z^{(t)}) - z\Big\|_2^2 + \lambda\|z\|_1 = \mathrm{ST}\Big(z^{(t)} - \tfrac{1}{L}\nabla f_x(z^{(t)}),\ \tfrac{\lambda}{L}\Big) = \mathrm{prox}_{\frac{\lambda}{L}\|\cdot\|_1}\Big(z^{(t)} - \tfrac{1}{L}\nabla f_x(z^{(t)})\Big).$$

  14. ISTA: majorization for the data-fit. ◮ Level lines of the quadratic form $z^\top D^\top D z$ (figure).

  15. ISTA: majorization for the data-fit. ◮ The quadratic form is majorized: $z^\top D^\top D z \le L\|z\|_2^2$ (figure).

  16. ISTA: majorization for the data-fit. ◮ A tighter majorization: $z^\top D^\top D z \le z^\top A^\top \Lambda A z$ [Moreau and Bruna 2017] (figure).

  17. ISTA: majorization for the data-fit. ◮ On a support: $z^\top D^\top D z \le L_S\|z\|_2^2$ for $\mathrm{Supp}(z) \subset S$ (figure).

  18. Oracle ISTA: majorization-minimization. For all $z$ such that $\mathrm{Supp}(z) \subset S := \mathrm{Supp}(z^{(t)})$,
$$F_x(z) \le f_x(z^{(t)}) + \nabla f_x(z^{(t)})^\top(z - z^{(t)}) + \tfrac{L_S}{2}\|z - z^{(t)}\|_2^2 + \lambda\|z\|_1,$$
with $L_S = \|D_{\cdot, S}\|_2^2$. (Figure: the cost of the majorants $Q_{x,L}(\cdot, z^{(t)})$ and $Q_{x,L_S}(\cdot, z^{(t)})$ and of $F_x$ as a function of the step size, with $1/L \le 1/L_S$.)
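
Computing $L_S$ for a given support is a standard operation; a small sketch (the empty-support convention is mine):

```python
import numpy as np

def lipschitz_on_support(D, support):
    """L_S = ||D_{., S}||_2^2: squared largest singular value of the columns in S."""
    if len(support) == 0:
        return 0.0
    return np.linalg.norm(D[:, support], ord=2) ** 2
```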

  19. Better step sizes for ISTA. Oracle ISTA (OISTA):
1. Get the Lipschitz constant $L_S$ associated with the support $S = \mathrm{Supp}(z^{(t)})$.
2. Compute $y^{(t+1)}$ as a step of ISTA with a step size of $1/L_S$:
$$y^{(t+1)} = \mathrm{ST}\Big(z^{(t)} - \tfrac{1}{L_S} D^\top(Dz^{(t)} - x),\ \tfrac{\lambda}{L_S}\Big).$$
3. If $\mathrm{Supp}(y^{(t+1)}) \subset S$, accept the update $z^{(t+1)} = y^{(t+1)}$.
4. Else, compute $z^{(t+1)}$ with step size $1/L$.
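
One OISTA iteration could be sketched as follows (an illustrative transcription of the four steps above, not the authors' code; `st` is the soft-thresholding operator):

```python
import numpy as np

def st(u, mu):
    return np.sign(u) * np.maximum(np.abs(u) - mu, 0.0)

def oista_step(D, x, z, lam, L):
    """Try a step of size 1/L_S on the current support; fall back to 1/L otherwise."""
    grad = D.T @ (D @ z - x)
    support = np.flatnonzero(z)                              # S = Supp(z^(t))
    if support.size > 0:
        L_S = np.linalg.norm(D[:, support], ord=2) ** 2      # step 1
        y = st(z - grad / L_S, lam / L_S)                    # step 2
        if np.all(np.isin(np.flatnonzero(y), support)):      # step 3: Supp(y) in S
            return y
    return st(z - grad / L, lam / L)                         # step 4: safe 1/L step
```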

  20. OISTA: performance. (Figure: $F_x - F_x^*$ and the step size as functions of the number of iterations, for ISTA, FISTA, and the proposed OISTA.)

  21. OISTA: step size. (Figure: the oracle step size used by OISTA along the iterations, compared with the fixed step $1/L$.)

  22. OISTA: improved convergence rates. Let $S^* = \mathrm{Supp}(z^*)$ and $\mu^* = \min \|Dz\|_2^2$ over $\|z\|_2 = 1$ with $\mathrm{Supp}(z) \subset S^*$. If $\mu^* > 0$, OISTA converges with a linear rate:
$$F_x(z^{(t)}) - F_x(z^*) \le \Big(1 - \frac{\mu^*}{L_{S^*}}\Big)^{t - T^*}\big(F_x(z^{(T^*)}) - F_x(z^*)\big).$$

  23. OISTA: Gaussian setting. The acceleration can be quantified with Marchenko-Pastur asymptotics. Entries of $D \in \mathbb{R}^{n \times m}$ are sampled from $\mathcal{N}(0, 1)$ and $S$ is sampled uniformly with $|S| = k$. Denote $m/n \to \gamma$ and $k/m \to \zeta$, with $k, m, n \to +\infty$. Then
$$\frac{L_S}{L} \to \Big(\frac{1 + \sqrt{\zeta\gamma}}{1 + \sqrt{\gamma}}\Big)^2. \qquad (1)$$
(Figure: the empirical law of $L_S/L$ as a function of $\zeta$, for $n = 200$, $m = 600$, matches the limit above.)
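
The limit (1) can be probed numerically; a rough sanity check (a sketch with arbitrarily chosen sizes, here $\gamma = 3$ and $\zeta = 0.25$):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 200, 600, 150                      # gamma = m/n = 3, zeta = k/m = 0.25
D = rng.standard_normal((n, m))
support = rng.choice(m, size=k, replace=False)

L = np.linalg.norm(D, ord=2) ** 2
L_S = np.linalg.norm(D[:, support], ord=2) ** 2

gamma, zeta = m / n, k / m
predicted = ((1 + np.sqrt(zeta * gamma)) / (1 + np.sqrt(gamma))) ** 2
print(L_S / L, predicted)                    # the two ratios should be close
```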

  24. OISTA: limitations.
◮ OISTA is not practical: computing $L_S$ at each iteration is costly.
◮ No precomputation is possible either, since there is an exponential number of possible supports $S$.

  25. Using deep learning to approximate OISTA.

  26. Solving the Lasso many times. Assume that we want to solve the Lasso for many observations $\{x_1, \ldots, x_N\}$ with a fixed forward operator $D$, i.e. for each $x$ compute
$$\mathcal{I}_D(x) = \arg\min_z \tfrac{1}{2}\|x - Dz\|^2 + \lambda\|z\|_1.$$
The goal is thus not to solve one problem but many. Can we leverage the problems' shared structure?
◮ ISTA: worst-case algorithm; the second-order information is $L$.
◮ OISTA: adaptive algorithm; the second-order information is $L_S$ (NP-hard).
◮ LISTA: adaptive algorithm; use deep learning to adapt to the second-order information?

  27. ISTA is a neural network. ISTA:
$$z^{(t+1)} = \mathrm{ST}\Big(z^{(t)} - \tfrac{1}{L} D^\top(Dz^{(t)} - x),\ \tfrac{\lambda}{L}\Big).$$
Let $W_z = I_m - \tfrac{1}{L} D^\top D$ and $W_x = \tfrac{1}{L} D^\top$. Then
$$z^{(t+1)} = \mathrm{ST}(W_z z^{(t)} + W_x x,\ \tfrac{\lambda}{L}).$$
(Diagram: one step of ISTA as a layer taking $z^{(t)}$ and $x$ through $W_z$ and $W_x$, followed by $\mathrm{ST}(\cdot, \lambda/L)$, to produce $z^{(t+1)}$.)

  28. ISTA is a neural network (continued). Iterating this layer with shared weights $W_z$ and $W_x$ gives a recurrent network equivalent to ISTA, whose output converges to $z^*$. (Diagram: the same layer drawn as an RNN with input $x$ and output $z^*$.)
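
The reparametrization is a one-line check: the ISTA update and the "layer" form $\mathrm{ST}(W_z z + W_x x, \lambda/L)$ coincide. A small sketch (sizes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, lam = 20, 50, 0.1
D = rng.standard_normal((n, m))
x, z = rng.standard_normal(n), rng.standard_normal(m)
L = np.linalg.norm(D, ord=2) ** 2

st = lambda u, mu: np.sign(u) * np.maximum(np.abs(u) - mu, 0.0)

z_ista = st(z - D.T @ (D @ z - x) / L, lam / L)    # plain ISTA step

W_z = np.eye(m) - D.T @ D / L                      # I_m - D^T D / L
W_x = D.T / L                                      # D^T / L
z_layer = st(W_z @ z + W_x @ x, lam / L)           # same step written as a layer

print(np.allclose(z_ista, z_layer))                # True
```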

  29. Learned ISTA [Gregor and Le Cun 2010]. The recurrence relation of ISTA defines an RNN:
$$z^{(t+1)} = \mathrm{ST}\Big(z^{(t)} - \tfrac{1}{L} D^\top(Dz^{(t)} - x),\ \tfrac{\lambda}{L}\Big).$$
This RNN can be unfolded as a feed-forward network. (Diagram: three layers with weights $W_x^{(0)}, W_x^{(1)}, W_x^{(2)}$ and $W_z^{(1)}, W_z^{(2)}$, thresholds $\theta^{(0)}, \theta^{(1)}, \theta^{(2)}$, producing $z^{(2)}$ from $x$.)
Let $\Phi_{\Theta^{(T)}}$ denote a network with $T$ layers parametrized by $\Theta^{(T)}$. If $W_x^{(t)} = W_x$, $W_z^{(t)} = W_z$ and $\theta^{(t)} = \lambda/L$, then $\Phi_{\Theta^{(T)}}(x) = z^{(T)}$.
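
A possible unfolded implementation, written here with PyTorch (a sketch only; the exact parametrization, weight sharing and initialization vary between LISTA variants, and the class below is not the authors' code):

```python
import torch
import torch.nn as nn

def soft_threshold(u, theta):
    """Element-wise ST(u, theta) = sign(u) * max(|u| - theta, 0)."""
    return torch.sign(u) * torch.clamp(torch.abs(u) - theta, min=0.0)

class LISTA(nn.Module):
    """T unfolded ISTA layers with learnable weights W_x, W_z and thresholds theta."""

    def __init__(self, D, n_layers, lam):
        super().__init__()
        n, m = D.shape
        self.m = m
        L = torch.linalg.matrix_norm(D, ord=2).item() ** 2
        # Initialize every layer at the ISTA weights, so that the untrained
        # network reproduces n_layers iterations of ISTA.
        self.W_x = nn.ParameterList([nn.Parameter(D.t() / L) for _ in range(n_layers)])
        self.W_z = nn.ParameterList(
            [nn.Parameter(torch.eye(m) - D.t() @ D / L) for _ in range(n_layers)])
        self.theta = nn.ParameterList(
            [nn.Parameter(torch.tensor(lam / L)) for _ in range(n_layers)])

    def forward(self, x):                              # x has shape (batch, n)
        z = x.new_zeros(x.shape[0], self.m)
        for W_x, W_z, theta in zip(self.W_x, self.W_z, self.theta):
            z = soft_threshold(z @ W_z.t() + x @ W_x.t(), theta)
        return z
```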

  30. LISTA: training. Empirical risk minimization: we need a training set $\{x_1, \ldots, x_N\}$ of samples, and our goal is to accelerate ISTA on unseen data $x \sim p$. The training solves
$$\tilde\Theta^{(T)} \in \arg\min_{\Theta^{(T)}} \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}_x\big(\Phi_{\Theta^{(T)}}(x_i)\big)$$
for a loss $\mathcal{L}_x$. Which loss $\mathcal{L}_x$ should we choose?
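
In code, training is a standard optimization of this empirical risk; a generic sketch, assuming `loss_fn(x, z)` returns the per-sample loss $\mathcal{L}_x(z)$ (the concrete loss choices are listed on the next slides):

```python
import torch

def train(network, loss_fn, X_train, n_epochs=100, lr=1e-3):
    """Minimize (1/N) sum_i L_{x_i}(Phi_Theta(x_i)) over the network parameters."""
    optimizer = torch.optim.Adam(network.parameters(), lr=lr)
    for _ in range(n_epochs):
        optimizer.zero_grad()
        risk = loss_fn(X_train, network(X_train)).mean()   # empirical risk
        risk.backward()
        optimizer.step()
    return network
```

For losses that take extra arguments (e.g. $D$ and $\lambda$ in the unsupervised case below), those would be bound beforehand, for instance with `functools.partial`.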

  31. LISTA: training losses. Supervised: a ground truth $z^*(x)$ is known, and $\mathcal{L}_x(z) = \tfrac{1}{2}\|z - z^*(x)\|^2$. This corresponds to solving the inverse problem.

  32. LISTA: training losses (continued). Semi-supervised: the solution of the Lasso $z^*(x)$ is known, and $\mathcal{L}_x(z) = \tfrac{1}{2}\|z - z^*(x)\|^2$. This corresponds to accelerating the resolution of the Lasso.

  33. LISTA: training losses (continued). Unsupervised: there is no ground truth, and $\mathcal{L}_x(z) = \tfrac{1}{2}\|x - Dz\|_2^2 + \lambda\|z\|_1$. This corresponds to solving the Lasso directly.
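
The three losses could be written as follows (a sketch; `z_true` and `z_lasso` denote a known ground truth and a precomputed Lasso solution, which only the first two losses require):

```python
import torch

def supervised_loss(x, z, z_true):
    """0.5 * ||z - z*(x)||^2 against the ground-truth codes."""
    return 0.5 * ((z - z_true) ** 2).sum(dim=1)

def semi_supervised_loss(x, z, z_lasso):
    """0.5 * ||z - z*(x)||^2 against precomputed Lasso solutions."""
    return 0.5 * ((z - z_lasso) ** 2).sum(dim=1)

def unsupervised_loss(x, z, D, lam):
    """The Lasso objective itself: 0.5 * ||x - D z||^2 + lam * ||z||_1."""
    residual = x - z @ D.t()
    return 0.5 * (residual ** 2).sum(dim=1) + lam * z.abs().sum(dim=1)
```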
