

SLIDE 1

Learning step sizes for unfolded sparse coding

Thomas Moreau, INRIA Saclay. Joint work with Pierre Ablin, Mathurin Massias, and Alexandre Gramfort.

1/32

SLIDE 2

Electrophysiology

Magnetoencephalography, Electroencephalography

2/32

SLIDE 3

Inverse problems

$x$: observed signal, $z$: electrical activity, $D$: forward operator (Maxwell's equations). Forward model: $x = Dz$.

3/32

SLIDE 4

Inverse problems

$x$: observed signal, $z$: electrical activity, $D$: forward operator (Maxwell's equations). Forward model: $x = Dz$. Inverse problem: $z = f(x)$ (ill-posed).

3/32

SLIDE 5

Inverse problems

$x$: observed signal, $z$: electrical activity, $D$: forward operator (Maxwell's equations). Forward model: $x = Dz$. Inverse problem: $z = f(x)$ (ill-posed). Optimization with a regularization $R$ encoding prior knowledge:

$$\arg\min_z \frac{1}{2}\|x - Dz\|_2^2 + R(z)$$

Example: sparsity with $R = \lambda\|\cdot\|_1$.

3/32

SLIDE 6

Other inverse problems

Ultrasound, fMRI (compressed sensing), astrophysics.

4/32

SLIDE 7

Some challenges for inverse problems

Evaluation: often there is no ground truth,

  • In neuroscience, we cannot access the brain electrical activity.
  • How to evaluate how well it is reconstructed?

⇒ Open problem in unsupervised learning.

Modeling: how to better account for the signal structure,

  • ℓ2 reconstruction evaluation does not account for localization.
  • Could optimal transport help here?

Computation: solving these problems can be too slow,

  • Many problems share the same forward operator $D$.
  • Can we use the structure of the problem?

Today’s talk topic!

5/32

SLIDE 8

Better step sizes for Iterative Shrinkage-Thresholding Algorithm (ISTA)

6/32

SLIDE 9

The Lasso

For a fixed design matrix $D \in \mathbb{R}^{n \times m}$ and $\lambda > 0$, the Lasso for $x \in \mathbb{R}^n$ is

$$z^* = \arg\min_z\ F_x(z) = \underbrace{\tfrac{1}{2}\|x - Dz\|_2^2}_{f_x(z)} + \lambda\|z\|_1$$

a.k.a. sparse coding, sparse linear regression, ... We are interested in the over-complete case where $m > n$.

7/32

SLIDE 10

The Lasso

For a fixed design matrix $D \in \mathbb{R}^{n \times m}$ and $\lambda > 0$, the Lasso for $x \in \mathbb{R}^n$ is

$$z^* = \arg\min_z\ F_x(z) = \underbrace{\tfrac{1}{2}\|x - Dz\|_2^2}_{f_x(z)} + \lambda\|z\|_1$$

a.k.a. sparse coding, sparse linear regression, ... We are interested in the over-complete case where $m > n$.

Properties

◮ The problem is convex in $z$ but not strongly convex in general. ◮ $z = 0$ is a solution if and only if $\lambda \ge \lambda_{\max} := \|D^\top x\|_\infty$.

7/32
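As a quick sanity check of the $\lambda_{\max}$ property (an illustrative sketch, not from the talk; note that scikit-learn's Lasso scales the data-fit by $1/n$, hence the `alpha` conversion):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, m = 10, 30                          # over-complete case: m > n
D = rng.standard_normal((n, m))
x = rng.standard_normal(n)

lam_max = np.abs(D.T @ x).max()        # lambda_max = ||D^T x||_inf

# scikit-learn's Lasso minimizes 1/(2n) ||x - Dz||^2 + alpha ||z||_1,
# so alpha = lambda / n matches F_x up to a constant factor.
for lam in (0.9 * lam_max, 1.1 * lam_max):
    z = Lasso(alpha=lam / n, fit_intercept=False).fit(D, x).coef_
    print(f"lambda = {lam / lam_max:.1f} * lambda_max: "
          f"{np.count_nonzero(z)} nonzero coefficients")
# below lambda_max the solution is nonzero; above it, z* = 0
```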

SLIDE 11

ISTA: Iterative Shrinkage-Thresholding Algorithm [Daubechies et al. 2004]

$f_x$ is an $L$-smooth function with $L = \|D\|_2^2$ and

$$\nabla f_x(z^{(t)}) = D^\top(Dz^{(t)} - x)$$

The $\ell_1$-norm is proximable, with a separable proximal operator:

$$\mathrm{prox}_{\mu\|\cdot\|_1}(x) = \mathrm{sign}(x)\max(0, |x| - \mu) = \mathrm{ST}(x, \mu)$$

8/32

SLIDE 12

ISTA: Iterative Shrinkage-Thresholding Algorithm [Daubechies et al. 2004]

$f_x$ is an $L$-smooth function with $L = \|D\|_2^2$ and

$$\nabla f_x(z^{(t)}) = D^\top(Dz^{(t)} - x)$$

The $\ell_1$-norm is proximable, with a separable proximal operator:

$$\mathrm{prox}_{\mu\|\cdot\|_1}(x) = \mathrm{sign}(x)\max(0, |x| - \mu) = \mathrm{ST}(x, \mu)$$

We can use the proximal gradient descent algorithm (ISTA):

$$z^{(t+1)} = \mathrm{ST}\big(z^{(t)} - \rho \underbrace{\nabla f_x(z^{(t)})}_{D^\top(Dz^{(t)} - x)},\ \rho\lambda\big)$$

Here, $\rho$ plays the role of a step size (in $[0, \tfrac{2}{L})$).

8/32
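A minimal numpy sketch of this update (illustrative only; the reference implementation for the talk lives in the adopty package linked at the end):

```python
import numpy as np

def st(u, mu):
    # soft-thresholding: prox of mu * ||.||_1
    return np.sign(u) * np.maximum(np.abs(u) - mu, 0.0)

def ista(D, x, lam, n_iter=100):
    # minimizes F_x(z) = 0.5 ||x - Dz||_2^2 + lam ||z||_1
    L = np.linalg.norm(D, ord=2) ** 2       # L = ||D||_2^2
    z = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ z - x)            # gradient of the data-fit f_x
        z = st(z - grad / L, lam / L)       # proximal gradient step, rho = 1/L
    return z
```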

SLIDE 13

ISTA: Majorization-Minimization

Taylor expansion of $f_x$ at $z^{(t)}$:

$$F_x(z) = f_x(z^{(t)}) + \nabla f_x(z^{(t)})^\top (z - z^{(t)}) + \frac{1}{2}\|D(z - z^{(t)})\|_2^2 + \lambda\|z\|_1 \le f_x(z^{(t)}) + \nabla f_x(z^{(t)})^\top (z - z^{(t)}) + \frac{L}{2}\|z - z^{(t)}\|_2^2 + \lambda\|z\|_1$$

⇒ Replace the Hessian $D^\top D$ by $L \cdot \mathrm{Id}$. The surrogate is a separable function that can be minimized in closed form:

$$\arg\min_z \frac{L}{2}\Big\|z^{(t)} - \frac{1}{L}\nabla f_x(z^{(t)}) - z\Big\|_2^2 + \lambda\|z\|_1 = \mathrm{ST}\Big(z^{(t)} - \frac{1}{L}\nabla f_x(z^{(t)}),\ \frac{\lambda}{L}\Big) = \mathrm{prox}_{\frac{\lambda}{L}\|\cdot\|_1}\Big(z^{(t)} - \frac{1}{L}\nabla f_x(z^{(t)})\Big)$$

9/32
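The key inequality behind this surrogate, $\|Dv\|_2^2 \le L\|v\|_2^2$, is easy to check numerically (a minimal sketch with arbitrary sizes):

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.standard_normal((10, 30))
L = np.linalg.norm(D, ord=2) ** 2      # largest eigenvalue of D^T D

# replacing the curvature ||D v||^2 by L ||v||^2 always majorizes it
v = rng.standard_normal(30)
print(np.linalg.norm(D @ v) ** 2 <= L * np.linalg.norm(v) ** 2)   # True
```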
SLIDE 14

ISTA: Majorization for the data-fit

◮ Level sets of $z^\top D^\top D z$

10/32

SLIDE 15

ISTA: Majorization for the data-fit

◮ Level sets of $z^\top D^\top D z$, majorized by $L\|z\|_2^2$

10/32

SLIDE 16

ISTA: Majorization for the data-fit

◮ Level sets of $z^\top D^\top D z$, majorized by $z^\top A^\top \Lambda A z$ [Moreau and Bruna 2017]

10/32

SLIDE 17

ISTA: Majorization for the data-fit

◮ Level sets of $z^\top D^\top D z$, majorized by $L_S\|z\|_2^2$ for $\mathrm{Supp}(z) \subset S$

10/32

SLIDE 18

Oracle ISTA: Majorization-Minimization

For all $z$ such that $\mathrm{Supp}(z) \subset S := \mathrm{Supp}(z^{(t)})$,

$$F_x(z) \le f_x(z^{(t)}) + \nabla f_x(z^{(t)})^\top (z - z^{(t)}) + \frac{L_S}{2}\|z - z^{(t)}\|_2^2 + \lambda\|z\|_1$$

with $L_S = \|D_{\cdot,S}\|_2^2$.

[Figure: the cost function $F_x$ with the two majorizers $Q_{x,L}(\cdot, z^{(t)})$ and $Q_{x,L_S}(\cdot, z^{(t)})$, and the corresponding step sizes $1/L$ and $1/L_S$]

11/32

SLIDE 19

Better step sizes for ISTA: Oracle ISTA (OISTA)

  1. Get the Lipschitz constant $L_S$ associated with the support $S = \mathrm{Supp}(z^{(t)})$.
  2. Compute $y^{(t+1)}$ as a step of ISTA with a step size of $1/L_S$:

$$y^{(t+1)} = \mathrm{ST}\Big(z^{(t)} - \frac{1}{L_S} D^\top(Dz^{(t)} - x),\ \frac{\lambda}{L_S}\Big)$$

  3. If $\mathrm{Supp}(y^{(t+1)}) \subset S$, accept the update: $z^{(t+1)} = y^{(t+1)}$.
  4. Else, $z^{(t+1)}$ is computed with step size $1/L$.

12/32
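A direct numpy transcription of these four steps (a sketch under the same conventions as the ISTA code above, not the authors' implementation):

```python
import numpy as np

def st(u, mu):   # soft-thresholding, as in the ISTA sketch
    return np.sign(u) * np.maximum(np.abs(u) - mu, 0.0)

def oista(D, x, lam, n_iter=100):
    L = np.linalg.norm(D, ord=2) ** 2
    z = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ z - x)
        S = np.flatnonzero(z)                        # 1. current support S
        if S.size > 0:
            L_S = np.linalg.norm(D[:, S], ord=2) ** 2
            y = st(z - grad / L_S, lam / L_S)        # 2. step with 1/L_S
            if np.all(np.isin(np.flatnonzero(y), S)):
                z = y                                # 3. accept if Supp(y) in S
                continue
        z = st(z - grad / L, lam / L)                # 4. else, plain ISTA step
    return z
```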

SLIDE 20

OISTA: Performance

[Figure: $F_x - F_x^*$ (log scale) vs. number of iterations for ISTA, FISTA, and OISTA (proposed)]

13/32

SLIDE 21

OISTA – Step size

[Figure: oracle step size $1/L_S$ across iterations, on the order of 1 to 3 times the default $1/L$]

14/32

SLIDE 22

OISTA – Improved convergence rates

Let $S^* = \mathrm{Supp}(z^*)$ and $\mu^* = \min\{\|Dz\|_2^2 : \|z\|_2 = 1,\ \mathrm{Supp}(z) \subset S^*\}$.

If $\mu^* > 0$, OISTA converges with a linear rate:

$$F_x(z^{(t)}) - F_x(z^*) \le \Big(1 - \frac{\mu^*}{L_{S^*}}\Big)^{t - T^*} \big(F_x(z^{(T^*)}) - F_x(z^*)\big)$$

15/32

SLIDE 23

OISTA – Gaussian setting: quantifying the acceleration with Marchenko-Pastur

Entries of $D \in \mathbb{R}^{n \times m}$ are sampled from $\mathcal{N}(0, 1)$ and $S$ is sampled uniformly with $|S| = k$. Denote $m/n \to \gamma$ and $k/m \to \zeta$, with $k, m, n \to +\infty$. Then

$$\frac{L_S}{L} \to \left(\frac{1 + \sqrt{\zeta\gamma}}{1 + \sqrt{\gamma}}\right)^2 \qquad (1)$$

[Figure: $L_S/L$ as a function of $\zeta$; the empirical law for $n = 200$, $m = 600$ matches the limit (1)]

16/32
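A small simulation comparing a single draw of $L_S/L$ with the limit (1) (an illustrative sketch; averaging over several random supports would tighten the match):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 200, 600, 150                      # gamma = m/n = 3, zeta = k/m = 0.25
D = rng.standard_normal((n, m))
S = rng.choice(m, size=k, replace=False)     # uniform support with |S| = k

L = np.linalg.norm(D, ord=2) ** 2
L_S = np.linalg.norm(D[:, S], ord=2) ** 2

gamma, zeta = m / n, k / m
limit = ((1 + np.sqrt(zeta * gamma)) / (1 + np.sqrt(gamma))) ** 2
print(f"empirical L_S/L = {L_S / L:.3f}, limit (1) = {limit:.3f}")
```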

SLIDE 24

OISTA – Limitations

◮ OISTA is not practical: computing $L_S$ at each iteration is costly. ◮ No precomputation is possible: there are exponentially many supports $S$.

17/32

SLIDE 25

Using deep learning to approximate OISTA

18/32

SLIDE 26

Solving the Lasso many times

Assume that we want to solve the Lasso for many observations $\{x^1, \ldots, x^N\}$ with a fixed forward operator $D$, i.e. for each $x$ compute

$$I_D(x) = \arg\min_z \frac{1}{2}\|x - Dz\|_2^2 + \lambda\|z\|_1$$

Thus, the goal is not to solve one problem but multiple problems.

⇒ Can we leverage the problem's structure?

◮ ISTA: worst-case algorithm; its second-order information is $L$. ◮ OISTA: adaptive algorithm; its second-order information is $L_S$ (NP-hard). ◮ LISTA: adaptive algorithm; use DL to adapt to second-order information?

19/32

SLIDE 27

ISTA is a Neural Network

ISTA:

$$z^{(t+1)} = \mathrm{ST}\Big(z^{(t)} - \frac{1}{L}D^\top(Dz^{(t)} - x),\ \frac{\lambda}{L}\Big)$$

Let $W_z = I_m - \frac{1}{L}D^\top D$ and $W_x = \frac{1}{L}D^\top$. Then

$$z^{(t+1)} = \mathrm{ST}\Big(W_z z^{(t)} + W_x x,\ \frac{\lambda}{L}\Big)$$

[Diagram: one step of ISTA as a network layer mapping $(z^{(t)}, x)$ to $z^{(t+1)}$]

20/32

SLIDE 28

ISTA is a Neural Network

ISTA:

$$z^{(t+1)} = \mathrm{ST}\Big(z^{(t)} - \frac{1}{L}D^\top(Dz^{(t)} - x),\ \frac{\lambda}{L}\Big)$$

Let $W_z = I_m - \frac{1}{L}D^\top D$ and $W_x = \frac{1}{L}D^\top$. Then

$$z^{(t+1)} = \mathrm{ST}\Big(W_z z^{(t)} + W_x x,\ \frac{\lambda}{L}\Big)$$

[Diagram: the same step seen as a recurrent neural network equivalent to ISTA, iterating the layer until $z^*$]

20/32
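A quick numpy check that the $(W_z, W_x)$ rewriting reproduces one plain ISTA step (sizes here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, lam = 10, 30, 0.1
D = rng.standard_normal((n, m))
x = rng.standard_normal(n)
L = np.linalg.norm(D, ord=2) ** 2

Wz = np.eye(m) - D.T @ D / L           # W_z = I_m - D^T D / L
Wx = D.T / L                           # W_x = D^T / L

def st(u, mu):
    return np.sign(u) * np.maximum(np.abs(u) - mu, 0.0)

z = rng.standard_normal(m)
ista_step = st(z - D.T @ (D @ z - x) / L, lam / L)
layer_step = st(Wz @ z + Wx @ x, lam / L)
print(np.allclose(ista_step, layer_step))   # True: one ISTA step = one layer
```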

SLIDE 29

Learned ISTA [Gregor and Le Cun 2010]

The recurrence relation of ISTA defines an RNN:

$$z^{(t+1)} = \mathrm{ST}\Big(z^{(t)} - \frac{1}{L}D^\top(Dz^{(t)} - x),\ \frac{\lambda}{L}\Big)$$

This RNN can be unfolded as a feed-forward network:

[Diagram: unfolded network with per-layer weights $W_x^{(0)}, \mathrm{ST}(\cdot, \theta^{(0)}), W_z^{(1)}, W_x^{(1)}, \mathrm{ST}(\cdot, \theta^{(1)}), W_z^{(2)}, W_x^{(2)}, \mathrm{ST}(\cdot, \theta^{(2)})$, producing $z^{(2)}$]

Let $\Phi_{\Theta^{(T)}}$ denote a network with $T$ layers parametrized by $\Theta^{(T)}$. If $W_x^{(i)} = W_x$ and $W_z^{(i)} = W_z$ for all $i$, then $\Phi_{\Theta^{(T)}}(x) = z^{(T)}$.

21/32
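A minimal PyTorch sketch of such an unfolded network (our own illustration: the class name `LISTA`, the batched inputs, and the initialization at the ISTA weights are assumptions, not the talk's code):

```python
import torch
import torch.nn as nn

def soft_threshold(u, theta):
    # ST(u, theta), written explicitly so theta can be a learned tensor
    return torch.sign(u) * torch.relu(torch.abs(u) - theta)

class LISTA(nn.Module):
    # unfolded network with T layers, each with its own W_x, W_z, theta
    def __init__(self, D, lam, n_layers):
        super().__init__()
        n, m = D.shape
        L = float(torch.linalg.matrix_norm(D, ord=2) ** 2)
        self.m = m
        self.Wx = nn.ParameterList([nn.Parameter(D.T.clone() / L)
                                    for _ in range(n_layers)])
        self.Wz = nn.ParameterList([nn.Parameter(torch.eye(m) - D.T @ D / L)
                                    for _ in range(n_layers)])
        self.theta = nn.ParameterList([nn.Parameter(torch.tensor(lam / L))
                                       for _ in range(n_layers)])

    def forward(self, x):                    # x: (batch, n) observations
        z = x.new_zeros(x.shape[0], self.m)
        for Wx, Wz, theta in zip(self.Wx, self.Wz, self.theta):
            z = soft_threshold(z @ Wz.T + x @ Wx.T, theta)
        return z
```

Before training, all layers equal the ISTA weights, so the forward pass reproduces $T$ iterations of ISTA, matching the unfolding above.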

SLIDE 30

LISTA – Training

Empirical risk minimization: we need a training set $\{x^1, \ldots, x^N\}$, and the goal is to accelerate ISTA on unseen data $x \sim p$. The training solves

$$\tilde{\Theta}^{(T)} \in \arg\min_{\Theta^{(T)}} \frac{1}{N} \sum_{i=1}^N \mathcal{L}_{x^i}\big(\Phi_{\Theta^{(T)}}(x^i)\big)$$

for a loss $\mathcal{L}_x$.

⇒ Choice of the loss $\mathcal{L}_x$?

22/32
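A hypothetical training loop for this empirical risk minimization, reusing the `LISTA` sketch above (the loss anticipates the unsupervised choice discussed on the next slides; the data and sizes are made up):

```python
import torch

torch.manual_seed(0)
n, m, N, lam = 64, 256, 1000, 0.1
D = torch.randn(n, m)
X = torch.randn(N, n)
X = X / (X @ D).abs().max(dim=1, keepdim=True).values   # samples in B_inf

model = LISTA(D, lam, n_layers=20)          # module sketched above
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for epoch in range(100):
    z = model(X)                            # (N, m) codes
    # unsupervised loss: the Lasso objective F_x itself
    loss = (0.5 * ((X - z @ D.T) ** 2).sum(1) + lam * z.abs().sum(1)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```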

SLIDE 31

LISTA – Training

Supervised: a ground truth $z^*(x)$ is known, $\mathcal{L}_x(z) = \frac{1}{2}\|z - z^*(x)\|_2^2$. Solving the inverse problem.

23/32

SLIDE 32

LISTA – Training

Supervised: a ground truth $z^*(x)$ is known, $\mathcal{L}_x(z) = \frac{1}{2}\|z - z^*(x)\|_2^2$. Solving the inverse problem. Semi-supervised: the solution of the Lasso $z^*(x)$ is known, $\mathcal{L}_x(z) = \frac{1}{2}\|z - z^*(x)\|_2^2$. Accelerating the resolution of the Lasso.

23/32

SLIDE 33

LISTA – Training

Supervised: a ground truth $z^*(x)$ is known, $\mathcal{L}_x(z) = \frac{1}{2}\|z - z^*(x)\|_2^2$. Solving the inverse problem. Semi-supervised: the solution of the Lasso $z^*(x)$ is known, $\mathcal{L}_x(z) = \frac{1}{2}\|z - z^*(x)\|_2^2$. Accelerating the resolution of the Lasso. Unsupervised: there is no ground truth, $\mathcal{L}_x(z) = \frac{1}{2}\|x - Dz\|_2^2 + \lambda\|z\|_1$. Solving the Lasso.

23/32


SLIDE 35

LISTA – Parametrizations

General LISTA model [Gregor and Le Cun 2010]:

$$z^{(t+1)} = \mathrm{ST}\big(W_e^{(t)} z^{(t)} + W_x^{(t)} x,\ \theta^{(t)}\big)$$

The structure of $D$ is lost in the linear transform.

Coupled LISTA [Chen et al. 2018]:

$$z^{(t+1)} = \mathrm{ST}\big(z^{(t)} - \alpha^{(t)} W^{(t)}(Dz^{(t)} - x),\ \beta^{(t)}\big)$$

Can be seen as learning ◮ a preconditioner $W^{(t)} \in \mathbb{R}^{m \times n}$ ◮ a step size $\alpha^{(t)} \in \mathbb{R}_+$ ◮ a threshold $\beta^{(t)} \in \mathbb{R}_+$

24/32

SLIDE 36

LISTA – Parametrizations

General LISTA model [Gregor and Le Cun 2010]:

$$z^{(t+1)} = \mathrm{ST}\big(W_e^{(t)} z^{(t)} + W_x^{(t)} x,\ \theta^{(t)}\big)$$

The structure of $D$ is lost in the linear transform.

Coupled LISTA [Chen et al. 2018]:

$$z^{(t+1)} = \mathrm{ST}\big(z^{(t)} - \alpha^{(t)} W^{(t)}(Dz^{(t)} - x),\ \beta^{(t)}\big)$$

Can be seen as learning ◮ a preconditioner $W^{(t)} \in \mathbb{R}^{m \times n}$ ◮ a step size $\alpha^{(t)} \in \mathbb{R}_+$ ◮ a threshold $\beta^{(t)} \in \mathbb{R}_+$

⇒ Justified theoretically for (un)supervised convergence

24/32

SLIDE 37

What does LISTA learn? [Ablin et al. 2019]

Theorem – Asymptotic convergence of the weights

Consider a sequence of nested networks $\Phi_{\Theta^{(T)}}$ such that $\Phi_{\Theta^{(t+1)}}(x) = \phi_{\theta^{(t+1)}}\big(\Phi_{\Theta^{(t)}}(x), x\big)$. Assume that

  1. the sequence of parameters converges, i.e. $\theta^{(t)} \to \theta^* = (W^*, \alpha^*, \beta^*)$ as $t \to \infty$,
  2. the output of the network converges toward a solution $z^*(x)$ of the Lasso, uniformly over the equiregularization set $\mathcal{B}_\infty$, i.e. $\sup_{x \in \mathcal{B}_\infty} \|\Phi_{\Theta^{(T)}}(x) - z^*(x)\| \to 0$ as $T \to \infty$.

Then $\frac{\alpha^*}{\beta^*} W^* = D$.

Sad result: "The deep layers of LISTA only learn a better step size."

25/32

SLIDE 38

Numerical verification

[Figure: $\|\alpha^{(t)} W^{(t)} - \beta^{(t)} D\|_F$ across the layers $t = 1, \ldots, 40$; the quantity decreases with depth]

A 40-layer LISTA network trained on a $10 \times 20$ problem with $\lambda = 0.1$: the weights $W^{(t)}$ align with $D$, and $\alpha$ and $\beta$ get coupled.

26/32

SLIDE 39

Step LISTA [Ablin et al. 2019]

Restricted parametrization: only learn a step size $\alpha^{(t)}$,

$$z^{(t+1)} = \mathrm{ST}\big(z^{(t)} - \alpha^{(t)} D^\top(Dz^{(t)} - x),\ \lambda\alpha^{(t)}\big)$$

Fewer parameters: $T$ instead of $(2 + mn)T$. ⇒ Easier to learn. ⇒ Reduced performance? Goal: learn adapted step sizes for ISTA.

27/32
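A PyTorch sketch of this restricted parametrization (one learned scalar per layer, initialized at $1/L$; the class name `SLISTA` and the batching convention are our assumptions):

```python
import torch
import torch.nn as nn

class SLISTA(nn.Module):
    # Step-LISTA: D is fixed, only one step size alpha per layer is learned
    def __init__(self, D, lam, n_layers):
        super().__init__()
        L = float(torch.linalg.matrix_norm(D, ord=2) ** 2)
        self.register_buffer("D", D)
        self.lam = lam
        self.alpha = nn.Parameter(torch.full((n_layers,), 1.0 / L))

    def forward(self, x):                        # x: (batch, n)
        z = x.new_zeros(x.shape[0], self.D.shape[1])
        for a in self.alpha:
            grad = (z @ self.D.T - x) @ self.D   # D^T (D z - x), batched
            u = z - a * grad
            z = torch.sign(u) * torch.relu(torch.abs(u) - self.lam * a)
        return z
```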

SLIDE 40

Performance

Simulated data: $m = 256$, $n = 64$, columns $D_k \sim \mathcal{U}(\mathcal{S}^{n-1})$, and $x = \tilde{x} / \|D^\top \tilde{x}\|_\infty$ with $\tilde{x}_i \sim \mathcal{N}(0, 1)$.

[Figure: $F_x - F_x^*$ vs. number of layers for ISTA, LISTA, and SLISTA (proposed), on simulated data with $\lambda = 0.1$ and $\lambda = 0.8$]

28/32

SLIDE 41

Performance on semi-real datasets

Digits: $8 \times 8$ images [Pedregosa et al. 2011]. $D_k$ and $x$ are sampled uniformly from the digits, with $x = \tilde{x} / \|D^\top \tilde{x}\|_\infty$.

[Figure: $F_x - F_x^*$ vs. number of layers for ISTA, LISTA, and SLISTA (proposed), on digits data with $\lambda = 0.1$ and $\lambda = 0.8$]

29/32

SLIDE 42

Link with OISTA

[Figure: learned step sizes per layer, ranging between $1/L$ and $4/L$, compared with the oracle values $1/L_S$]

The learned step sizes are linked to the distribution of $1/L_S$.

30/32

SLIDE 43

Conclusion

◮ Using $1/L$ as a step size is not always the fastest option. ◮ The structure of the sparsity can help choose a better step size. ◮ This structure can be accessed with DL. Take-home message: first-order structure is needed in optimization. No hope of learning an algorithm better than ISTA (except for step sizes!)

31/32

SLIDE 44

Conclusion

Related work [Moreau and Bruna 2017]: ◮ It is possible to find a better starting point for ISTA. ◮ There exist some adversarial cases. ◮ It gets harder and harder as you get closer to the solution. Code to reproduce the figures is available online: adopty, github.com/tommoral/adopty. Slides are on my web page: tommoral.github.io (@tomamoral)

32/32

SLIDE 45

ISTA – Convergence rates

If $f_x$ is $\mu$-strongly convex, i.e. $\sigma_{\min}(D^\top D) \ge \mu > 0$, then

$$F_x(z^{(t)}) - F_x(z^*) \le \Big(1 - \frac{\mu}{L}\Big)^t \big(F_x(0) - F_x(z^*)\big)$$

In the general case,

$$F_x(z^{(t)}) - F_x(z^*) \le \frac{L\|z^*\|_2^2}{2t}$$

1/3

SLIDE 46

OISTA – Convergence

Proposition 3.1 (Convergence): when $D$ is such that the solution is unique for all $x$ and $\lambda > 0$, the sequence $(z^{(t)})$ generated by the algorithm converges to $z^* = \arg\min F_x$. Further, there exists an iteration $T^*$ such that for $t \ge T^*$, $\mathrm{Supp}(z^{(t)}) = \mathrm{Supp}(z^*) =: S^*$.

Proposition 3.2 (Convergence rate): for $t > T^*$,

$$F_x(z^{(t)}) - F_x(z^*) \le \frac{L_{S^*}\|z^* - z^{(T^*)}\|_2^2}{2(t - T^*)}$$

If moreover $\lambda_{\min}(D_{S^*}^\top D_{S^*}) = \mu^* > 0$, then

$$F_x(z^{(t)}) - F_x(z^*) \le \Big(1 - \frac{\mu^*}{L_{S^*}}\Big)^{t - T^*} \big(F_x(z^{(T^*)}) - F_x(z^*)\big)$$

2/3

SLIDE 47

Interlude – regularization λ

Importance of the parameter $\lambda$:

$$\mathcal{L}_x(z) = \frac{1}{2}\|x - Dz\|_2^2 + \lambda\|z\|_1, \qquad z^{(t+1)} = \mathrm{ST}\big(z^{(t)} - \alpha^{(t)} D^\top(Dz^{(t)} - x),\ \lambda\alpha^{(t)}\big)$$

It controls the sparsity of $z^*(x)$.

Maximal value: $\lambda_{\max} = \|D^\top x\|_\infty$ is the minimal value of $\lambda$ for which $z^*(x) = 0$.

Equiregularization set: the set of $x \in \mathbb{R}^n$ for which $\lambda_{\max} = 1$,

$$\mathcal{B}_\infty = \{x \in \mathbb{R}^n \; ; \; \|D^\top x\|_\infty = 1\}$$

⇒ Training is performed with points sampled in $\mathcal{B}_\infty$.

3/3
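A small numpy sketch of this sampling scheme (rescaling Gaussian draws so that $\lambda_{\max} = 1$):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 64, 256
D = rng.standard_normal((n, m))

x = rng.standard_normal(n)
x /= np.abs(D.T @ x).max()       # rescale so that ||D^T x||_inf = 1
print(np.abs(D.T @ x).max())     # 1.0 -> x lies in B_inf
```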