SLIDE 1

The smoothed multivariate square-root Lasso:
an optimization lens on concomitant estimation

Joseph Salmon, http://josephsalmon.eu, IMAG, Univ. Montpellier, CNRS

Series of works with: Quentin Bertrand (INRIA), Mathurin Massias (University of Genova), Olivier Fercoq (Institut Polytechnique de Paris), Alexandre Gramfort (INRIA)

1 / 40

SLIDE 2

Table of Contents

◮ Neuroimaging
    The M/EEG problem
    Statistical model
    Estimation procedures
◮ Sparsity and Multi-task approaches
◮ Smoothing interpretation of concomitant and √Lasso
◮ Optimization algorithm

2 / 40

SLIDE 3

The M/EEG inverse problem

◮ observe the electric and magnetic fields outside the scalp (≈ 100 sensors)
◮ reconstruct the cerebral activity inside the brain (≈ 10,000 locations)

n = 100 sensors, p = 10,000 locations

n ≪ p: ill-posed problem

◮ Motivation: identify the brain regions responsible for the signals
◮ Applications: epilepsy treatment, brain aging, anesthesia risks

3 / 40

SLIDE 4

M/EEG inverse problem for brain imaging

◮ sensors: electric and magnetic fields during a cognitive task

4 / 40

SLIDE 5

MEG elements: magnetometers and gradiometers

(Figure: device, sensor array, detail of a single sensor)

5 / 40

SLIDE 6

M/EEG = MEG + EEG

Photo Credit: Stephen Whitmarsh

6 / 40

SLIDE 7

Table of Contents

◮ Neuroimaging
    The M/EEG problem
    Statistical model
    Estimation procedures
◮ Sparsity and Multi-task approaches
◮ Smoothing interpretation of concomitant and √Lasso
◮ Optimization algorithm

7 / 40

SLIDE 8

Source modeling

(Figure: candidate sources over the brain, in time and space)

Position a few thousand candidate sources over the brain (e.g., every 5 mm)

8 / 40

SLIDE 9

Design matrix - Forward operator

9 / 40

SLIDE 10

Mathematical model: linear regression

10 / 40

SLIDE 11

Experiments repeated r times

(Figure: stimuli, repetitions, stimulated patient, observed M/EEG signals)

11 / 40

SLIDE 12

M/EEG specificity #1: combined measurements

(Figure: device, sensors, sensor detail)

Structure of Y and X:

12 / 40

SLIDE 13

Sensor types & noise structure

(Figure: evoked responses (Nave = 55) and the corresponding noise covariance matrix for each sensor type)

◮ EEG (59 channels) and its covariance
◮ Gradiometers (203 channels) and their covariance
◮ Magnetometers (102 channels) and their covariance

13 / 40

SLIDE 14

M/EEG specificity #2: averaging repetitions of experiment

(Figure: stimuli, repetitions, stimulated patient, observed M/EEG signals)

14 / 40

SLIDE 15

M/EEG specificity #2: averaging repetitions of experiment

(Figure: stimuli, repetitions, stimulated patient, observed M/EEG signals, averaged signal)

14 / 40

SLIDE 16

M/EEG specificity #2: averaged signals

(Figure: averaged signals, EEG only)

Limit on the number of repetitions: subject/patient fatigue

15 / 40

SLIDE 17

A multi-task framework

Multi-task regression notation:
◮ n observations (number of sensors)
◮ T tasks (temporal information)
◮ p features (spatial description)
◮ r repetitions of the experiment
◮ Y^(1), ..., Y^(r) ∈ R^{n×T}: observation matrices; Ȳ = (1/r) Σ_l Y^(l)
◮ X ∈ R^{n×p}: forward matrix

Model: Y^(l) = X B* + S* E^(l), where
◮ B* ∈ R^{p×T}: true source activity matrix (unknown)
◮ S* ∈ S^n_{++}: co-standard deviation matrix(1) (unknown)
◮ E^(1), ..., E^(r) ∈ R^{n×T}: white noise (i.i.d. standard Gaussian entries)

(1) S ⪰ σ means that S − σ Id_n is positive semi-definite.

16 / 40
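For concreteness, here is a minimal NumPy simulation of this generative model. It is only an illustrative sketch: the sizes and the names B_true, S_true, Y_bar are introduced here and are not from the talk.

    import numpy as np

    rng = np.random.default_rng(0)
    n, p, T, r = 100, 1000, 50, 20            # sensors, sources, time points, repetitions

    X = rng.standard_normal((n, p))           # forward (gain) matrix
    B_true = np.zeros((p, T))                 # row-sparse source activity
    B_true[:5] = rng.standard_normal((5, T))  # only 5 active sources

    A = rng.standard_normal((n, n)) / np.sqrt(n)
    S_true = A @ A.T + 0.1 * np.eye(n)        # a positive-definite co-standard deviation

    # r noisy repetitions of the same experiment, plus their average
    Y = [X @ B_true + S_true @ rng.standard_normal((n, T)) for _ in range(r)]
    Y_bar = sum(Y) / r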

SLIDE 18

Table of Contents

◮ Neuroimaging
    The M/EEG problem
    Statistical model
    Estimation procedures
◮ Sparsity and Multi-task approaches
◮ Smoothing interpretation of concomitant and √Lasso
◮ Optimization algorithm

17 / 40

SLIDE 19

Sparsity everywhere

Signals can often be represented by combining a few atoms/features:
◮ Fourier decomposition for sounds

(2) I. Daubechies. Ten lectures on wavelets. SIAM, 1992.
(3) B. A. Olshausen and D. J. Field. “Sparse coding with an overcomplete basis set: A strategy employed by V1?”. In: Vision Research (1997).

18 / 40

SLIDE 20

Sparsity everywhere

Signals can often be represented by combining a few atoms/features:
◮ Fourier decomposition for sounds
◮ Wavelets for images (1990’s)(2)

(2) I. Daubechies. Ten lectures on wavelets. SIAM, 1992.
(3) B. A. Olshausen and D. J. Field. “Sparse coding with an overcomplete basis set: A strategy employed by V1?”. In: Vision Research (1997).

18 / 40

SLIDE 21

Sparsity everywhere

Signals can often be represented by combining a few atoms/features:
◮ Fourier decomposition for sounds
◮ Wavelets for images (1990’s)(2)
◮ Dictionary learning for images (2000’s)(3)

(2) I. Daubechies. Ten lectures on wavelets. SIAM, 1992.
(3) B. A. Olshausen and D. J. Field. “Sparse coding with an overcomplete basis set: A strategy employed by V1?”. In: Vision Research (1997).

18 / 40

SLIDE 22

Sparsity everywhere

Signals can often be represented by combining a few atoms/features:
◮ Fourier decomposition for sounds
◮ Wavelets for images (1990’s)(2)
◮ Dictionary learning for images (2000’s)(3)
◮ Neuroimaging: measurements assumed to be explained by a few active brain sources

(2) I. Daubechies. Ten lectures on wavelets. SIAM, 1992.
(3) B. A. Olshausen and D. J. Field. “Sparse coding with an overcomplete basis set: A strategy employed by V1?”. In: Vision Research (1997).

18 / 40

SLIDE 23

Justification for dipolarity assumption

Sparsity holds: dipolar patterns are equivalent to focal sources
◮ short duration
◮ simple cognitive task
◮ repetitions of the experiment average out other sources
◮ ICA recovers dipolar patterns,(4) well modeled by focal sources:

(4) A. Delorme et al. “Independent EEG sources are dipolar”. In: PloS ONE 7.2 (2012), e30135.

19 / 40

SLIDE 24

(Structured) Sparsity inducing penalties(5)

    B̂ ∈ arg min_{B ∈ R^{p×T}}  (1/(2nT)) ‖Y − XB‖²_F + λ ‖B‖_1

(Figure: sources × time support pattern)

Sparse support: no structure ✗

Lasso penalty: ‖B‖_1 = Σ_{j=1}^{p} Σ_{t=1}^{T} |B_{jt}|

(5) G. Obozinski, B. Taskar, and M. I. Jordan. “Joint covariate selection and joint subspace selection for multiple classification problems”. In: Statistics and Computing 20.2 (2010), pp. 231–252.

20 / 40

SLIDE 25

(Structured) Sparsity inducing penalties(5)

    B̂ ∈ arg min_{B ∈ R^{p×T}}  (1/(2nT)) ‖Y − XB‖²_F + λ ‖B‖_{2,1}

(Figure: sources × time support pattern)

Sparse support: group (row) structure ✓

Group-Lasso penalty: ‖B‖_{2,1} = Σ_{j=1}^{p} ‖B_{j:}‖_2, with B_{j:} the j-th row of B

(5) G. Obozinski, B. Taskar, and M. I. Jordan. “Joint covariate selection and joint subspace selection for multiple classification problems”. In: Statistics and Computing 20.2 (2010), pp. 231–252.

20 / 40
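A quick illustration of the two penalties on a coefficient matrix. This is a minimal NumPy sketch; lasso_penalty and group_lasso_penalty are names introduced here for the example.

    import numpy as np

    def lasso_penalty(B):
        """Entrywise l1 norm: sum_j sum_t |B_jt| (no structure)."""
        return np.abs(B).sum()

    def group_lasso_penalty(B):
        """l2,1 norm: sum over rows j of ||B_j:||_2 (row-sparse structure)."""
        return np.linalg.norm(B, axis=1).sum()

    B = np.zeros((6, 4))
    B[1] = [1.0, -2.0, 0.5, 0.0]   # a single active row (source)
    print(lasso_penalty(B), group_lasso_penalty(B))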

SLIDE 26

Data-fitting term and experiment repetitions

◮ Classical estimator: use the averaged(6) signal Ȳ

    B̂ ∈ arg min_{B ∈ R^{p×T}}  (1/(2nT)) ‖Ȳ − XB‖²_F + λ ‖B‖_{2,1}

◮ How to take advantage of the number of repetitions?

Intuitive estimator:

    B̂_repet ∈ arg min_{B ∈ R^{p×T}}  (1/(2nTr)) Σ_{l=1}^{r} ‖Y^(l) − XB‖²_F + λ ‖B‖_{2,1}

(6) & whitened, say using baseline data

21 / 40

SLIDE 27

Data-fitting term and experiment repetitions

◮ Classical estimator: use the averaged(6) signal Ȳ

    B̂ ∈ arg min_{B ∈ R^{p×T}}  (1/(2nT)) ‖Ȳ − XB‖²_F + λ ‖B‖_{2,1}

◮ How to take advantage of the number of repetitions?

Intuitive estimator:

    B̂_repet ∈ arg min_{B ∈ R^{p×T}}  (1/(2nTr)) Σ_{l=1}^{r} ‖Y^(l) − XB‖²_F + λ ‖B‖_{2,1}

◮ Fail: B̂_repet = B̂ (because of the squared Frobenius datafit ‖·‖²_F)

(6) & whitened, say using baseline data

21 / 40

SLIDE 28

Data-fitting term and experiment repetitions

◮ Classical estimator: use the averaged(6) signal Ȳ

    B̂ ∈ arg min_{B ∈ R^{p×T}}  (1/(2nT)) ‖Ȳ − XB‖²_F + λ ‖B‖_{2,1}

◮ How to take advantage of the number of repetitions?

Intuitive estimator:

    B̂_repet ∈ arg min_{B ∈ R^{p×T}}  (1/(2nTr)) Σ_{l=1}^{r} ‖Y^(l) − XB‖²_F + λ ‖B‖_{2,1}

◮ Fail: B̂_repet = B̂ (because of the squared Frobenius datafit ‖·‖²_F)

↪ investigate other datafits

(6) & whitened, say using baseline data

21 / 40
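A short check of why the two estimators coincide, expanding the squared Frobenius datafit around the average Ȳ:

    \frac{1}{2nTr}\sum_{l=1}^{r}\|Y^{(l)} - XB\|_F^2
      = \frac{1}{2nT}\|\bar{Y} - XB\|_F^2
        + \underbrace{\frac{1}{2nTr}\sum_{l=1}^{r}\|Y^{(l)} - \bar{Y}\|_F^2}_{\text{constant in } B}

since the cross terms \sum_l \langle Y^{(l)} - \bar{Y},\ \bar{Y} - XB \rangle vanish by definition of \bar{Y}. The two objectives thus differ by a constant, so they have the same minimizers.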

SLIDE 29

Table of Contents

◮ Neuroimaging
    The M/EEG problem
    Statistical model
    Estimation procedures
◮ Sparsity and Multi-task approaches
◮ Smoothing interpretation of concomitant and √Lasso
◮ Optimization algorithm

22 / 40

SLIDE 30

Lasso(7),(8): the “modern least-squares”(9)

    β̂ ∈ arg min_{β ∈ R^p}  (1/(2n)) ‖y − Xβ‖²₂ + λ ‖β‖_1

◮ y ∈ R^n: observations
◮ X ∈ R^{n×p}: design matrix
◮ sparsity: for λ large enough, ‖β̂‖_0 ≪ p

(7) R. Tibshirani. “Regression Shrinkage and Selection via the Lasso”. In: J. R. Stat. Soc. Ser. B Stat. Methodol. 58.1 (1996), pp. 267–288.
(8) S. S. Chen and D. L. Donoho. “Atomic decomposition by basis pursuit”. In: SPIE. 1995.
(9) E. J. Candès, M. B. Wakin, and S. P. Boyd. “Enhancing Sparsity by Reweighted l1 Minimization”. In: J. Fourier Anal. Applicat. 14.5-6 (2008), pp. 877–905.

23 / 40
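As a quick illustration, this objective matches the Lasso of scikit-learn, whose documented objective is (1/(2n))‖y − Xw‖²₂ + α‖w‖₁, so α plays the role of λ. A minimal sketch on synthetic data, assuming scikit-learn is installed (the data are made up here):

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    n, p = 100, 500
    X = rng.standard_normal((n, p))
    beta_true = np.zeros(p)
    beta_true[:5] = 1.0
    y = X @ beta_true + 0.5 * rng.standard_normal(n)

    # scikit-learn minimizes (1/(2n)) ||y - X w||_2^2 + alpha * ||w||_1
    lasso = Lasso(alpha=0.1).fit(X, y)
    print((lasso.coef_ != 0).sum(), "nonzero coefficients out of", p)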

SLIDE 31

Lasso and optimal λ(10),(11)

Theorem
For y = Xβ* + σ*ε, with ε ~ N(0, Id_n) and X satisfying the “Restricted Eigenvalue” property, if

    λ = 2σ* √( 2 log(p/δ) / n ),

then

    (1/n) ‖Xβ* − Xβ̂‖² ≤ (18 / κ²_{s*}) · (σ*² s* / n) · log(p/δ)

with probability 1 − δ, where β̂ is a Lasso solution.

Rem: optimal rate in the minimax sense (up to constant/log terms), BUT σ* is unknown in practice!

(10) P. J. Bickel, Y. Ritov, and A. B. Tsybakov. “Simultaneous analysis of Lasso and Dantzig selector”. In: Ann. Statist. 37.4 (2009), pp. 1705–1732.
(11) A. S. Dalalyan, M. Hebiri, and J. Lederer. “On the Prediction Performance of the Lasso”. In: Bernoulli 23.1 (2017), pp. 552–581.

24 / 40

SLIDE 32

Other datafit: the √ Lasso(12)

    β̂_Lasso ∈ arg min_{β ∈ R^p}  (1/(2n)) ‖y − Xβ‖²₂ + λ ‖β‖_1

Optimal λ ∝ σ*

Confirmed in practice (Figure: Lasso)

(12) A. Belloni, V. Chernozhukov, and L. Wang. “Square-root Lasso: pivotal recovery of sparse signals via conic programming”. In: Biometrika 98.4 (2011), pp. 791–806.

25 / 40

SLIDE 33

Other datafit: the √ Lasso(12)

    β̂_√Lasso ∈ arg min_{β ∈ R^p}  (1/√n) ‖y − Xβ‖₂ + λ ‖β‖_1

Optimal λ adaptive to σ*

Confirmed in practice (Figure: Square-root Lasso)

(12) A. Belloni, V. Chernozhukov, and L. Wang. “Square-root Lasso: pivotal recovery of sparse signals via conic programming”. In: Biometrika 98.4 (2011), pp. 791–806.

25 / 40

SLIDE 34

Unhappy optimizer

√Lasso: non-smooth + non-smooth
↪ use the Concomitant Lasso(13):

    (β̂, σ̂) ∈ arg min_{β ∈ R^p, σ > 0}  ‖y − Xβ‖² / (2nσ) + σ/2 + λ ‖β‖_1

Same solutions when ‖y − Xβ̂_√Lasso‖ ≠ 0, but jointly convex,
non-smooth + separable: solvable by alternate minimization(14) in β and σ.

(Figure: graph of f(a, b) = a²/b)

(13) A. B. Owen. “A robust hybrid of lasso and ridge regression”. In: Contemporary Mathematics 443 (2007), pp. 59–72.
(14) T. Sun and C.-H. Zhang. “Scaled sparse linear regression”. In: Biometrika 99.4 (2012), pp. 879–898.

26 / 40

SLIDE 35

Unhappy optimizer

√Lasso: non-smooth + non-smooth
↪ use the Concomitant Lasso(13):

    (β̂, σ̂) ∈ arg min_{β ∈ R^p, σ ≥ σ̲}  ‖y − Xβ‖² / (2nσ) + σ/2 + λ ‖β‖_1

Same solutions when ‖y − Xβ̂_√Lasso‖ ≠ 0, but jointly convex,
smooth + separable: solvable by alternate minimization(14) in β and σ.

(Figure: graph of f(a, b) = a²/b)

(13) A. B. Owen. “A robust hybrid of lasso and ridge regression”. In: Contemporary Mathematics 443 (2007), pp. 59–72.
(14) T. Sun and C.-H. Zhang. “Scaled sparse linear regression”. In: Biometrika 99.4 (2012), pp. 879–898.

26 / 40

SLIDE 36

“Concomitant”: smoothing the √ Lasso(17)

“Huberization”: replace ‖·‖/√n by a smooth approximation:

    huber_σ̲(z) =  ‖z‖² / (2nσ̲) + σ̲/2     if ‖z‖/√n ≤ σ̲
                  ‖z‖/√n                   if ‖z‖/√n > σ̲

               =  min_{σ ≥ σ̲}  ‖z‖² / (2nσ) + σ/2

               =  ( (1/√n)‖·‖  □  ( (1/(2nσ̲)) ‖·‖² + σ̲/2 ) )(z)

Leads to the Smoothed(15),(16) Concomitant Lasso formulation:

    (β̂, σ̂) ∈ arg min_{β ∈ R^p, σ ≥ σ̲}  ‖y − Xβ‖² / (2nσ) + σ/2 + λ ‖β‖_1

(15) A. Beck and M. Teboulle. “Smoothing and first order methods: A unified framework”. In: SIAM J. Optim. 22.2 (2012), pp. 557–580.
(16) Y. Nesterov. “Smooth minimization of non-smooth functions”. In: Math. Program. 103.1 (2005), pp. 127–152.
(17) E. Ndiaye et al. “Efficient Smoothed Concomitant Lasso Estimation for High Dimensional Regression”. In: Journal of Physics: Conference Series 904.1 (2017), p. 012006.

27 / 40
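A small numerical sketch of this smoothed datafit (illustrative names, not the authors' code). It checks that the piecewise form coincides with the minimization over σ ≥ σ̲, whose minimizer is σ = max(‖z‖/√n, σ̲):

    import numpy as np

    def huber_datafit(z, sigma_min):
        """Piecewise form of huber_{sigma_min}(z) for a residual vector z of length n."""
        n = z.size
        norm = np.linalg.norm(z)
        if norm / np.sqrt(n) <= sigma_min:
            return norm ** 2 / (2 * n * sigma_min) + sigma_min / 2
        return norm / np.sqrt(n)

    def huber_via_min(z, sigma_min):
        """Same quantity, obtained by minimizing over sigma >= sigma_min."""
        n = z.size
        sigma = max(np.linalg.norm(z) / np.sqrt(n), sigma_min)  # closed-form minimizer
        return np.linalg.norm(z) ** 2 / (2 * n * sigma) + sigma / 2

    z = np.random.default_rng(0).standard_normal(100)
    for s in (0.01, 0.5, 5.0):
        assert np.isclose(huber_datafit(z, s), huber_via_min(z, s))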

SLIDE 37

Smoothing aparté(18),(19)

Smoothing: for σ > 0, a “smoothed” version of f is

    f_σ = σ ω(·/σ) □ f,   where (f □ g)(x) = inf_u { f(u) + g(x − u) }

◮ ω is a predefined smooth function (s.t. ∇ω is Lipschitz)

Kernel smoothing analogy:

    Kernel smoothing                     Inf-convolution smoothing
    Fourier transform: F(f)              Fenchel/Legendre transform: f*
    convolution: ⋆                       inf-convolution: □
    F(f ⋆ g) = F(f) · F(g)               (f □ g)* = f* + g*
    Gaussian: F(g) = g                   ω = ‖·‖²/2: ω* = ω
    f_h = (1/h) g(·/h) ⋆ f               f_σ = σ ω(·/σ) □ f

(18) Y. Nesterov. “Smooth minimization of non-smooth functions”. In: Math. Program. 103.1 (2005), pp. 127–152.
(19) A. Beck and M. Teboulle. “Smoothing and first order methods: A unified framework”. In: SIAM J. Optim. 22.2 (2012), pp. 557–580.

28 / 40
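A worked instance of this definition, the case plotted on the next slides, taking f = |·| and ω(t) = t²/2:

    f_\sigma(x) = \inf_{u} \Big\{ |u| + \frac{(x - u)^2}{2\sigma} \Big\}
                = \begin{cases} \dfrac{x^2}{2\sigma} & \text{if } |x| \le \sigma,\\[4pt]
                                |x| - \dfrac{\sigma}{2} & \text{if } |x| > \sigma, \end{cases}

i.e. the Huber function. With ω(t) = t²/2 + 1/2 (the “bis” version below), σ ω(x/σ) = x²/(2σ) + σ/2, so the same computation gives x²/(2σ) + σ/2 for |x| ≤ σ and |x| for |x| > σ, which is exactly the huber_σ̲ datafit used above (up to the 1/√n scaling).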

SLIDE 38

Huber function: ω(t) = t²/2

(Figure: | · |)

29 / 40

SLIDE 39

Huber function: ω(t) = t²/2

(Figure: | · | and f_σ for σ = 2.5)

29 / 40

SLIDE 40

Huber function: ω(t) = t²/2

(Figure: | · | and f_σ for σ = 2.5, 1.0)

29 / 40

SLIDE 41

Huber function: ω(t) = t²/2

(Figure: | · | and f_σ for σ = 2.5, 1.0, 0.2)

29 / 40

SLIDE 42

Huber function (bis): ω(t) = t²/2 + 1/2

(Figure: | · |)

30 / 40

SLIDE 43

Huber function (bis): ω(t) = t²/2 + 1/2

(Figure: | · | and f_σ for σ = 2.5)

30 / 40

SLIDE 44

Huber function (bis): ω(t) = t²/2 + 1/2

(Figure: | · | and f_σ for σ = 2.5, 1.0)

30 / 40

SLIDE 45

Huber function (bis): ω(t) = t²/2 + 1/2

(Figure: | · | and f_σ for σ = 2.5, 1.0, 0.2)

30 / 40

SLIDE 46

Smoothing other norms

◮ Smoothing the Frobenius norm yields a trivial generalization of the concomitant Lasso
◮ More interesting: S. van de Geer introduced the pivotal multivariate √Lasso,(20) using the trace/nuclear norm for data-fitting:

    arg min_{B ∈ R^{p×T}}  (1/(n√T)) ‖Y − XB‖_* + λ ‖B‖_{2,1}

  hard to solve, and the statistical analysis makes stringent assumptions
◮ Smoothing the datafit makes optimization and statistics easier!

(20) S. van de Geer. Estimation and testing under sparsity. École d’Été de Probabilités de Saint-Flour. 2016.

31 / 40

SLIDE 47

Smoothing the nuclear norm(21)

Nuclear norm (Schatten-1 norm, or trace norm): for Z ∈ R^{n×T},

    ‖Z‖_* = Σ_{i=1}^{n∧T} γ_i,   where the γ_i’s are the singular values of Z

Smoothing it as above:

    ( ‖·‖_*  □  ( (1/(2σ)) ‖·‖²_F + nσ/2 ) )(Z) = Σ_i huber_σ(γ_i)
                                                = min_{S ⪰ σ Id_n}  (1/2) ‖Z‖²_{S⁻¹} + (1/2) Tr(S),

    where ‖Z‖²_{S⁻¹} := Tr(Z⊤ S⁻¹ Z)   (and huber_σ is the scalar Huber above, i.e. the n = 1 case)

(21) Q. Bertrand et al. “Handling correlated and repeated measurements with the smoothed multivariate square-root Lasso”. In: NeurIPS. 2019.

32 / 40
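A numerical check of this identity (an illustrative NumPy sketch, not the authors' code): the left-hand side applies the scalar Huber to the singular values, padded with zeros up to n; the right-hand side evaluates the S-objective at its closed-form minimizer, whose spectrum is the singular values clipped from below at σ.

    import numpy as np

    def smoothed_nuclear_lhs(Z, sigma):
        """sum_i huber_sigma(gamma_i), with singular values padded by zeros up to n."""
        gammas = np.zeros(Z.shape[0])
        sv = np.linalg.svd(Z, compute_uv=False)
        gammas[:sv.size] = sv
        small = gammas <= sigma
        return np.sum(np.where(small, gammas**2 / (2 * sigma) + sigma / 2, gammas))

    def smoothed_nuclear_rhs(Z, sigma):
        """0.5 * Tr(Z^T S^{-1} Z) + 0.5 * Tr(S) evaluated at the clipped-spectrum minimizer S."""
        U, sv, _ = np.linalg.svd(Z, full_matrices=True)
        gammas = np.zeros(Z.shape[0])
        gammas[:sv.size] = sv
        d = np.maximum(gammas, sigma)            # clipped spectrum of the optimal S
        S = (U * d) @ U.T
        return 0.5 * np.trace(Z.T @ np.linalg.inv(S) @ Z) + 0.5 * np.trace(S)

    Z = np.random.default_rng(0).standard_normal((8, 5))
    for sigma in (0.1, 1.0, 3.0):
        assert np.isclose(smoothed_nuclear_lhs(Z, sigma), smoothed_nuclear_rhs(Z, sigma))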

SLIDE 48

Smoothing of the multivariate √ Lasso

Smoothed Generalized Concomitant Lasso (SGCL)(22):

    (B̂_SGCL, Ŝ_SGCL) ∈ arg min_{B ∈ R^{p×T}, S ∈ S^n_{++}, S ⪰ σ̲ Id}
        ‖Ȳ − XB‖²_{S⁻¹} / (2nT) + Tr(S) / (2n) + λ ‖B‖_{2,1}

Concomitant Lasso with Repetitions (CLaR)(23):

    (B̂_CLaR, Ŝ_CLaR) ∈ arg min_{B ∈ R^{p×T}, S ∈ S^n_{++}, S ⪰ σ̲ Id}
        Σ_{l=1}^{r} ‖Y^(l) − XB‖²_{S⁻¹} / (2nTr) + Tr(S) / (2n) + λ ‖B‖_{2,1}

(22) M. Massias et al. “Generalized concomitant multi-task Lasso for sparse multimodal regression”. In: AISTATS. Vol. 84. 2018, pp. 998–1007.
(23) Q. Bertrand et al. “Handling correlated and repeated measurements with the smoothed multivariate square-root Lasso”. In: NeurIPS. 2019.

33 / 40

SLIDE 49

Simulations: row support identification

◮ n = 150, p = 500, T = 100
◮ X Toeplitz-correlated
◮ S* Toeplitz matrix: S*_{i,j} = ρ_{S*}^{|i−j|}, with ρ_{S*} ∈ ]0, 1[

(Figure: support identification performance of CLaR, SGCL, ℓ2,1-MLER, ℓ2,1-MLE, ℓ2,1-MRCER, MTL)

34 / 40

SLIDE 50

Table of Contents

◮ Neuroimaging
    The M/EEG problem
    Statistical model
    Estimation procedures
◮ Sparsity and Multi-task approaches
◮ Smoothing interpretation of concomitant and √Lasso
◮ Optimization algorithm

35 / 40

SLIDE 51

SGCL and CLaR: alternate updates

Alternate minimization converges.

B update (S fixed): standard multi-task Lasso optimization,
off-the-shelf techniques and lots of refinements.

S update (B fixed):

    arg min_{S ⪰ σ̲ Id}  (1/(2nT)) Tr(Z⊤ S⁻¹ Z) + (1/(2n)) Tr(S)   (with T replaced by rT for CLaR)

closed-form solution: clipped square root of the eigenvalue decomposition of

    (1/T) (Ȳ − XB)(Ȳ − XB)⊤    or    (1/(rT)) Σ_{l=1}^{r} (Y^(l) − XB)(Y^(l) − XB)⊤

Rem: see the online Python code https://github.com/QB3/CLaR

36 / 40
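A sketch of this closed-form S update (illustrative NumPy, not the reference implementation): Z is the residual matrix Ȳ − XB, or the r repetition residuals stacked horizontally, and T_total the corresponding number of columns.

    import numpy as np

    def update_S(Z, T_total, sigma_min):
        """Closed-form S update: clipped square root of the EVD of Z Z^T / T_total."""
        M = Z @ Z.T / T_total                      # empirical residual covariance
        eigvals, U = np.linalg.eigh(M)             # EVD of a symmetric PSD matrix
        d = np.maximum(np.sqrt(np.clip(eigvals, 0.0, None)), sigma_min)
        return (U * d) @ U.T                       # S = U diag(max(sqrt(eigval), sigma_min)) U^T

For the repetition version, pass Z = np.hstack([Y_l - X @ B for Y_l in Y]) and T_total = r * T.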

SLIDE 52

Algorithm: Concomitant Lasso with Repetitions (CLaR)

input : X ∈ R^{n×p}, Y^(1), ..., Y^(r) ∈ R^{n×T}, σ̲ > 0, λ > 0
init  : B = 0_{p,T}, R = Ȳ

for iter = 1, ... do
    S ← SpectralClipping( (1/(Tr)) Σ_l (Y^(l) − XB)(Y^(l) − XB)⊤, σ̲ )
        // closed-form solution of the minimization in S:
        // EVD, then clip the square roots of the eigenvalues at level σ̲
    for j = 1, ..., p do
        L_j ← X_{:j}⊤ S⁻¹ X_{:j}                         // Lipschitz constants
    for j = 1, ..., p do
        R ← R + X_{:j} B_{j:}                            // partial residual update
        B_{j:} ← BST( X_{:j}⊤ S⁻¹ R / L_j , λnT / L_j )  // coefficient update
        R ← R − X_{:j} B_{j:}                            // residual update
return B, S

Complexity? Fine, if we store S⁻¹X and S⁻¹R instead of R. An eigenvalue decomposition is still needed, O(n³) (here n ≈ 100).

37 / 40
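The only non-standard primitive above is BST, block soft-thresholding of a row at a given level. A minimal sketch of it together with one pass of the inner coordinate-descent loop (illustrative names, assuming S_inv = S⁻¹ and R = Ȳ − XB are maintained as in the pseudocode, and update_S from the previous snippet plays the role of SpectralClipping):

    import numpy as np

    def BST(x, tau):
        """Block soft-thresholding of a row vector x at level tau."""
        norm = np.linalg.norm(x)
        if norm <= tau:
            return np.zeros_like(x)
        return (1.0 - tau / norm) * x

    def cd_pass(X, B, R, S_inv, lam, n, T):
        """One pass of coordinate descent over the rows of B, updating R in place."""
        p = X.shape[1]
        for j in range(p):
            L_j = X[:, j] @ S_inv @ X[:, j]        # Lipschitz constant of coordinate j
            R += np.outer(X[:, j], B[j])           # add back column j's contribution
            B[j] = BST(X[:, j] @ S_inv @ R / L_j, lam * n * T / L_j)
            R -= np.outer(X[:, j], B[j])           # update the residual
        return B, R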

SLIDE 53

Statistical properties for i.i.d. case(24)

    B̂ ∈ arg min_{B ∈ R^{p×T}, S ∈ S^n_{++}, σ̲ Id ⪯ S ⪯ σ̄ Id}
        ‖Y − XB‖²_{S⁻¹} / (2nT) + Tr(S) / (2n) + λ ‖B‖_{2,1}

Proposition
◮ i.i.d. Gaussian noise
◮ X satisfying the “mutual incoherence” property
◮ λ ∝ √(log p) / (T √n)   (independent of σ*)
◮ c1 σ̲ ≤ σ* ≤ c2 σ̄

⟹ with probability at least 1 − n e^{−cT/n},

    (1/T) ‖B* − B̂‖_{2,∞} ≤ C σ* (1/T) √( log(p) / n )

(24) M. Massias et al. “Support recovery and sup-norm convergence rates for sparse pivotal regression”. In: AISTATS. 2020.

38 / 40

SLIDE 54

Real data experiments

(Figure: source estimates for CLaR (ours), ℓ2,1-MLER, ℓ2,1-MLE, ℓ2,1-MRCER, MTL)

◮ expected: 2 sources (one in each auditory cortex)
◮ λ chosen such that ‖B̂‖_{2,0} = 2
◮ deep sources for ℓ2,1-MRCER (not visible)

39 / 40

SLIDE 55

Links

“All models are wrong but some come with good open source implementation and good documentation to use these.”
    (A. Gramfort)

◮ Papers: arXiv / personal webpage(25),(26),(27)
◮ CLaR Python code: https://github.com/QB3/CLaR

(25) M. Massias et al. “Generalized concomitant multi-task Lasso for sparse multimodal regression”. In: AISTATS. Vol. 84. 2018, pp. 998–1007.
(26) Q. Bertrand et al. “Handling correlated and repeated measurements with the smoothed multivariate square-root Lasso”. In: NeurIPS. 2019.
(27) M. Massias et al. “Support recovery and sup-norm convergence rates for sparse pivotal regression”. In: AISTATS. 2020.

40 / 40

SLIDE 56

References I

Beck, A. and M. Teboulle. “Smoothing and first order methods: A unified framework”. In: SIAM J. Optim. 22.2 (2012), pp. 557–580.

Belloni, A., V. Chernozhukov, and L. Wang. “Square-root Lasso: pivotal recovery of sparse signals via conic programming”. In: Biometrika 98.4 (2011), pp. 791–806.

Bertrand, Q. et al. “Handling correlated and repeated measurements with the smoothed multivariate square-root Lasso”. In: NeurIPS. 2019.

Bickel, P. J., Y. Ritov, and A. B. Tsybakov. “Simultaneous analysis of Lasso and Dantzig selector”. In: Ann. Statist. 37.4 (2009), pp. 1705–1732.

Candès, E. J., M. B. Wakin, and S. P. Boyd. “Enhancing Sparsity by Reweighted l1 Minimization”. In: J. Fourier Anal. Applicat. 14.5-6 (2008), pp. 877–905.

41 / 40

SLIDE 57

References II

Chen, S. S. and D. L. Donoho. “Atomic decomposition by basis pursuit”. In: SPIE. 1995.

Dalalyan, A. S., M. Hebiri, and J. Lederer. “On the Prediction Performance of the Lasso”. In: Bernoulli 23.1 (2017), pp. 552–581.

Daubechies, I. Ten lectures on wavelets. SIAM, 1992.

Delorme, A. et al. “Independent EEG sources are dipolar”. In: PloS ONE 7.2 (2012), e30135.

Massias, M. et al. “Generalized concomitant multi-task Lasso for sparse multimodal regression”. In: AISTATS. Vol. 84. 2018, pp. 998–1007.

Massias, M. et al. “Support recovery and sup-norm convergence rates for sparse pivotal regression”. In: AISTATS. 2020.

Ndiaye, E. et al. “Efficient Smoothed Concomitant Lasso Estimation for High Dimensional Regression”. In: Journal of Physics: Conference Series 904.1 (2017), p. 012006.

42 / 40

SLIDE 58

References III

Nesterov, Y. “Smooth minimization of non-smooth functions”. In: Math. Program. 103.1 (2005), pp. 127–152.

Obozinski, G., B. Taskar, and M. I. Jordan. “Joint covariate selection and joint subspace selection for multiple classification problems”. In: Statistics and Computing 20.2 (2010), pp. 231–252.

Olshausen, B. A. and D. J. Field. “Sparse coding with an overcomplete basis set: A strategy employed by V1?”. In: Vision Research (1997).

Owen, A. B. “A robust hybrid of lasso and ridge regression”. In: Contemporary Mathematics 443 (2007), pp. 59–72.

Sun, T. and C.-H. Zhang. “Scaled sparse linear regression”. In: Biometrika 99.4 (2012), pp. 879–898.

43 / 40

SLIDE 59

References IV

Tibshirani, R. “Regression Shrinkage and Selection via the Lasso”. In: J. R. Stat. Soc. Ser. B Stat. Methodol. 58.1 (1996), pp. 267–288.

van de Geer, S. Estimation and testing under sparsity. École d’Été de Probabilités de Saint-Flour. 2016.

44 / 40

SLIDE 60

Statistical assumptions

Gaussian noise: the entries E_{i,j} are i.i.d. N(0, σ*²) random variables.

Mutual incoherence: the Gram matrix Ψ := (1/n) X⊤X satisfies

    Ψ_{jj} = 1,  and  max_{j′ ≠ j} |Ψ_{jj′}| ≤ 1/(7αs),  ∀ j ∈ [p],

for some integer s ≥ 1 and some constant α > 1.

Residuals bound: for the multivariate square-root Lasso, Ê⊤Ê is invertible, and there exists η such that ‖((1/T) Ê⊤Ê)^{1/2}‖_2 ≤ C σ*.

Smoothing parameter value: σ̲, σ̄ and η verify σ̲ ≤ σ*/√2 and σ̄ = (2 + η)σ*, with η ≥ 1.

45 / 40

SLIDE 61

Competitors

◮ (smoothed) ℓ2,1-MLE:

    (B̂, Σ̂) ∈ arg min_{B ∈ R^{p×T}, Σ ⪰ (σ̲²/r) Id}
        ‖Ȳ − XB‖²_{Σ⁻¹} − log det(Σ⁻¹) + λ ‖B‖_{2,1} ,

◮ and its repetitions version (ℓ2,1-MLER):

    (B̂, Σ̂) ∈ arg min_{B ∈ R^{p×T}, Σ ⪰ σ̲² Id}
        (1/r) Σ_{l=1}^{r} ‖Y^(l) − XB‖²_{Σ⁻¹} − log det(Σ⁻¹) + λ ‖B‖_{2,1} .

Rem: ℓ2,1-MLE and ℓ2,1-MLER are bi-convex but not jointly convex

◮ MRCER has an additional penalty term µ ‖Σ⁻¹‖₁ w.r.t. ℓ2,1-MLER

46 / 40