SLIDE 1

Proximal Identification and Applications

Jérôme Malick
CNRS, Lab. J. Kuntzmann, Grenoble (France)

Workshop Optimization for Machine Learning – Luminy – March 2020

Talk based on material from joint work with:
  • G. Peyré
  • J. Fadili
  • G. Garrigos
  • F. Iutzeler
  • D. Grishchenko
SLIDE 2–5

Example of stability

min_{x∈R^d} ½‖Ax − y‖² + λ‖x‖₁   (LASSO)

Stability: the support of optimal solutions is stable under small perturbations.

Illustration (on an instance with d = 2): [figure: level sets of the LASSO objective for the original and slightly perturbed data; the optimal solutions keep the same support]

More generally: [Lewis '02] sensitivity analysis of partly-smooth functions (recall Clarice's talk this morning).
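A quick numerical check of this stability claim (a minimal sketch on a made-up instance, not the code behind the talk's figures):

```python
# Minimal numerical check of LASSO support stability (illustrative instance).
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t*||.||_1 (entrywise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_prox_grad(A, y, lam, iters=5000):
    """Solve min_x 0.5*||Ax - y||^2 + lam*||x||_1 by proximal gradient."""
    gamma = 1.0 / np.linalg.norm(A, 2) ** 2   # step < 1/L with L = ||A||^2
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        x = soft_threshold(x - gamma * A.T @ (A @ x - y), gamma * lam)
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((10, 2))
y = A @ np.array([1.5, 0.0]) + 0.01 * rng.standard_normal(10)

x_star = lasso_prox_grad(A, y, lam=1.0)
# Perturb the data slightly: the support should stay the same.
for _ in range(5):
    x_pert = lasso_prox_grad(A, y + 0.05 * rng.standard_normal(10), lam=1.0)
    print(np.nonzero(x_star)[0], np.nonzero(x_pert)[0])
```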

SLIDE 6–7

Example of identification

min_{x∈R^d} ½‖Ax − y‖² + λ‖x‖₁   (LASSO)

Identification: (proximal-gradient) algorithms produce iterates... that eventually have the same support as the optimal solution.

[figure: runs of two proximal-gradient algorithms (Proximal Gradient and Accelerated Proximal Gradient) on the same instance with d = 2; both iterate sequences reach the support of x⋆ after finitely many steps]

Well-studied, see e.g. [Bertsekas '76], [Wright '96], [Lewis Drusvyatskiy '13]...
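The finite-time behavior is easy to observe numerically (a minimal sketch; the instance is made up):

```python
# Finite-time support identification of proximal gradient on a small LASSO.
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

rng = np.random.default_rng(1)
A = rng.standard_normal((20, 5))
y = A @ np.array([2.0, -1.0, 0.0, 0.0, 0.0]) + 0.01 * rng.standard_normal(20)
lam, gamma = 1.0, 1.0 / np.linalg.norm(A, 2) ** 2

x = rng.standard_normal(5)                 # dense initialization
support_history = []
for k in range(200):
    x = soft_threshold(x - gamma * A.T @ (A @ x - y), gamma * lam)
    support_history.append(tuple(np.nonzero(x)[0]))

# After finitely many iterations the support freezes (identification):
print(support_history[:10])    # may still change during early iterations
print(support_history[-1])     # e.g. (0, 1), the support of x*
```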

SLIDE 8–9

Outline

1. General stability of regularized problems
2. Enlarged identification of proximal algorithms
3. Application: communication-efficient federated learning
4. Application: model consistency for regularized least-squares

SLIDE 10–11

General stability of regularized problems

Stability or sensitivity analysis

Parameterized composite optimization problem (smooth + nonsmooth):

min_{x∈R^d} F(x, p) + R(x)

Typically, the nonsmooth R traps solutions in low-dimensional manifolds.

Stability: optimal solutions lie on a manifold: x⋆(p) ∈ M for p ∼ p0.

Studied in e.g. [Hare Lewis '10], [Vaiter et al '15], [Liang et al '16]...

Example 1: R = ‖·‖₁, supp(x⋆(p)) = supp(x⋆(p0))

Example 2: R = ι_{B∞} (indicator function of the ℓ∞ ball), so x⋆(p) is the projection of p onto the ℓ∞ ball. [figure: p0 and a nearby p project onto the same face of the ball]

Many examples in machine learning...
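Example 2 in code (a minimal sketch with a made-up point): the face on which the projection lands, i.e. the set of saturated coordinates, does not move under small perturbations of p.

```python
# Projection onto the l_inf ball: the active face is stable.
import numpy as np

def proj_linf_ball(p, radius=1.0):
    """Projection onto {x : ||x||_inf <= radius} is a coordinatewise clip."""
    return np.clip(p, -radius, radius)

p0 = np.array([2.0, 0.3])            # projects onto the face {x_1 = 1}
x0 = proj_linf_ball(p0)
for eps in [1e-3, 1e-2, 1e-1]:
    p = p0 + eps * np.array([-1.0, 1.0])
    x = proj_linf_ball(p)
    print(x, np.isclose(np.abs(x), 1.0))   # saturated coordinates unchanged
```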

SLIDE 12–14

General stability of regularized problems

Structure of nonsmooth regularizers

Many of the regularizers used in machine learning or image processing have a strong primal-dual structure ("mirror-stratifiable" [Fadili, M., Peyré '18])... that can be exploited to get (enlarged) stability/identification results.

Examples (with the associated unit ball and the low-dimensional manifold M_x where x belongs):

  • R = ‖·‖₁ (and ‖·‖∞ or other polyhedral gauges): M_x = {z : supp(z) = supp(x)}
  • nuclear norm (aka trace norm), R(X) = Σ_i |σ_i(X)| = ‖σ(X)‖₁: M_X = {Z : rank(Z) = rank(X)}
  • group-ℓ1, R(x) = Σ_{b∈B} ‖x_b‖₂ (e.g. R(x) = ‖x_{1,2}‖₂ + |x₃|): for x = (0, 0, x₃), M_x = {0} × {0} × R (see the prox sketch just below)
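A blockwise prox computation makes the "trapping" in M_x concrete (a minimal sketch; the blocks and values are made up):

```python
# The prox of the group-l1 norm zeroes whole blocks, which is how R lands
# points on the low-dimensional manifolds M_x listed above.
import numpy as np

def prox_group_l1(x, t, blocks):
    """Blockwise soft-thresholding: prox of t * sum_b ||x_b||_2."""
    out = x.copy()
    for b in blocks:
        nrm = np.linalg.norm(x[b])
        out[b] = 0.0 if nrm <= t else (1 - t / nrm) * x[b]
    return out

x = np.array([0.1, -0.2, 3.0])
print(prox_group_l1(x, t=0.5, blocks=[[0, 1], [2]]))
# -> [0., 0., 2.5] : the small block is annihilated, landing on {0}x{0}xR
```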

SLIDE 15–16

General stability of regularized problems

Recall on stratifications

A stratification of a set D ⊂ R^d is a (finite) partition M = {M_i}_{i∈I}, D = ⋃_{i∈I} M_i, with so-called "strata" (e.g. smooth/affine manifolds) which fit nicely:

M ∩ cl(M′) ≠ ∅ ⇒ M ⊂ cl(M′)

This relation induces a (partial) ordering M ≼ M′.

Example: B∞, the unit ℓ∞-ball in R², admits a stratification with 9 (affine) strata. [figure: the interior, open edges, and vertices of the square, labeled M1, M2, M3, M4, with orderings M1 ≼ M2 ≼ M4 and M1 ≼ M3 ≼ M4]

Other examples: "tame" sets (recall Edouard's talk).
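Written out explicitly (a worked version of the example above):

```latex
% The 9 strata of the unit l_inf-ball in R^2: interior, 4 open edges, 4 vertices.
\[
  B_\infty \;=\;
  \underbrace{(-1,1)^2}_{1 \text{ stratum (interior)}}
  \;\cup\;
  \underbrace{\bigcup_{s=\pm 1}\Big(\big(\{s\}\times(-1,1)\big)\cup\big((-1,1)\times\{s\}\big)\Big)}_{4 \text{ strata (open edges)}}
  \;\cup\;
  \underbrace{\{-1,1\}^2}_{4 \text{ strata (vertices)}}
\]
% Each vertex lies in the closure of two edges, and each edge in the closure
% of the interior: this is the partial ordering of strata used in the sequel.
```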

SLIDE 17

General stability of regularized problems

Mirror-stratifiable regularizations

A (primal) stratification M = {M_i}_{i∈I} and a (dual) stratification M* = {M*_i}_{i∈I} in one-to-one decreasing correspondence through the transfer operator

J_R(S) = ⋃_{x∈S} ri(∂R(x))

Simple example: R = ι_{B∞}, R* = ‖·‖₁. [figure: the strata M1, ..., M4 of B∞ and the dual strata M*_1, ..., M*_4, exchanged by J_R and J_{R*}]

J_R(M_i) = ⋃_{x∈M_i} ri ∂R(x) = ⋃_{x∈M_i} ri N_{B∞}(x) = M*_i

M_i = ⋃_{u∈M*_i} ri ∂‖u‖₁ = ⋃_{u∈M*_i} ri ∂R*(u) = J_{R*}(M*_i)
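Behind the picture, the transfer operator for R = ι_{B∞} in R² can be computed by hand; a sketch of this standard computation:

```latex
% Normal cones of B_infty in R^2, whose relative interiors give J_R:
\[
\begin{aligned}
  &x \in (-1,1)^2 \ (\text{interior}): & N_{B_\infty}(x) &= \{0\}, \\
  &x = (1, x_2),\ |x_2| < 1 \ (\text{edge}): & N_{B_\infty}(x) &= \mathbb{R}_+ e_1, \\
  &x = (1,1) \ (\text{vertex}): & N_{B_\infty}(x) &= \mathbb{R}_+^2.
\end{aligned}
\]
% Relative interiors: {0}, {t e_1 : t > 0}, (0,infty)^2. Taking unions over
% each stratum sends interior -> {0}, edges -> open half-axes, vertices ->
% open quadrants: exactly the strata of the dual R* = ||.||_1, with
% dimensions exchanged (0 <-> 2, 1 <-> 1), hence a decreasing correspondence.
```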

SLIDE 18–19

General stability of regularized problems

Enlarged stability result

Theorem (Fadili, M., Peyré '18)
For the composite optimization problem (smooth + nonsmooth)

min_{x∈R^d} F(x, p) + R(x),

satisfying mild assumptions (unique minimizer x⋆(p0) at p0 and objective uniformly level-bounded in x), if R is mirror-stratifiable, then for p ∼ p0,

M_{x⋆(p0)} ≼ M_{x⋆(p)} ≼ J_{R*}(M*_{u⋆(p0)})

If R = ‖·‖₁, then supp(x⋆(p0)) ⊆ supp(x⋆(p)) ⊆ {i : |u⋆(p0)_i| = 1}.

Remark: optimality conditions for a primal-dual solution (x⋆(p), u⋆(p)):

u⋆(p) = −∇F(x⋆(p), p) ∈ ∂R(x⋆(p))

In the non-degenerate case u⋆(p0) ∈ ri ∂R(x⋆(p0)), we have M_{x⋆(p0)} = M_{x⋆(p)} = J_{R*}(M*_{u⋆(p0)}): exact stability, as expected from [Lewis '02].
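To ground the ℓ₁ corollary, the optimality condition can be specialized to the LASSO (a standard computation, spelled out for reference):

```latex
% LASSO case: F(x, p) = 0.5*||Ax - y||^2 with p = (A, y), R = lambda*||.||_1.
\[
  u^\star = -\nabla F(x^\star) = A^\top (y - A x^\star) \in \lambda\, \partial \|x^\star\|_1
  \quad\Longleftrightarrow\quad
  \begin{cases}
    u^\star_i = \lambda \operatorname{sign}(x^\star_i) & \text{if } x^\star_i \neq 0,\\
    |u^\star_i| \le \lambda & \text{if } x^\star_i = 0.
  \end{cases}
\]
% After rescaling u* by lambda, the enlarged-stability bound reads
% supp(x*(p0)) \subseteq supp(x*(p)) \subseteq {i : |u*(p0)_i| = 1}: only
% coordinates where the dual certificate saturates can enter the support.
```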

SLIDE 20–22

General stability of regularized problems

Enlarged stability illustrated

Primal problem: min ½‖x − p‖² s.t. ‖x‖∞ ≤ 1   (so x⋆(p) is the projection of p onto B∞)
Dual problem: min_{u∈R^n} ½‖u − p‖² + ‖u‖₁

Non-degenerate case: u⋆(p0) = p0 − x⋆(p0) ∈ ri N_{B∞}(x⋆(p0)) ⇒ M1 = M_{x⋆(p0)} = M_{x⋆(p)} (in this case x⋆(p) = x⋆(p0)). [figure: p0 and a nearby p project onto the same stratum; u⋆(p0) and u⋆(p) stay in the same dual stratum M*_1]

General case: u⋆(p0) = p0 − x⋆(p0) ∉ ri N_{B∞}(x⋆(p0)) ⇒ M1 = M_{x⋆(p0)} ≼ M_{x⋆(p)} ≼ J_{R*}(M*_{u⋆(p0)}) = M2. [figure: u⋆(p0) sits on the boundary between the dual strata M*_1 and M*_2; nearby solutions x⋆(p) may move to the larger stratum M2]
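The primal-dual pair on this slide is a Moreau decomposition, which is easy to verify numerically (a minimal sketch with a made-up point p):

```python
# x*(p): projection onto B_inf; u*(p): prox of ||.||_1 (soft-thresholding).
# Moreau decomposition for R = indicator of B_inf, R* = ||.||_1: x + u = p.
import numpy as np

def proj_linf(p):                       # x*(p)
    return np.clip(p, -1.0, 1.0)

def prox_l1(p, t=1.0):                  # u*(p)
    return np.sign(p) * np.maximum(np.abs(p) - t, 0.0)

p = np.array([1.7, -0.4, 2.2])
x, u = proj_linf(p), prox_l1(p)
print(x + u, p)                         # identical: Moreau decomposition
print(u != 0, np.abs(x) == 1.0)         # u is nonzero exactly where x saturates
```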

SLIDE 23

Outline — 2. Enlarged identification of proximal algorithms

SLIDE 24–25

Enlarged identification of proximal algorithms

Activity identification

Composite optimization problem (smooth + nonsmooth):

min_{x∈R^d} F(x) + R(x)

Basic proximal-gradient algorithm:

x_{k+1} = prox_{γR}(x_k − γ∇F(x_k)),   prox_{γR}(x) = argmin_y R(y) + (1/2γ)‖y − x‖²

prox_{γR}(x) is easy to compute in some important cases, e.g. an explicit expression for R = ‖·‖₁ (soft-thresholding, written out below).

Identification: beyond convergence, after a finite moment K, all iterates x_k (k ≥ K) lie in an active set M
– used in e.g. safe screening [El Ghaoui '12], [Salmon et al '19], [Sun et al '20]
– we even have bounds on K [Sun et al '19]
– when the problem is well-posed, e.g. [Wright '96], [Lewis Drusvyatskiy '13]
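The explicit expression mentioned above is the entrywise soft-thresholding operator (a standard formula, for R = λ‖·‖₁):

```latex
\[
  \big(\operatorname{prox}_{\gamma\lambda\|\cdot\|_1}(x)\big)_i
  \;=\; \operatorname{sign}(x_i)\,\max\big(|x_i| - \gamma\lambda,\; 0\big),
  \qquad i = 1,\dots,d.
\]
% Coordinates with |x_i| <= gamma*lambda are mapped exactly to 0, which is
% the mechanism behind support identification.
```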

SLIDE 26–27

Enlarged identification of proximal algorithms

Enlarged activity identification

Theorem (Fadili, M., Peyré '18)
Under convergence assumptions, if R is mirror-stratifiable, then for k ≥ K,

M_{x⋆} ≼ M_{x_k} ≼ J_{R*}(M*_{−∇F(x⋆)})

Optimality condition: −∇F(x⋆) ∈ ∂R(x⋆).

In the non-degenerate case −∇F(x⋆) ∈ ri ∂R(x⋆), we have exact identification: M_{x⋆} = M_{x_k} = J_{R*}(M*_{−∇F(x⋆)}) [Liang et al '15].

In the general case, δ quantifies the degeneracy of the problem:

δ = dim(J_{R*}(M*_{−∇F(x⋆)})) − dim(M_{x⋆})

δ = 0: well-posedness (fast convergence and identification); δ large: strong degeneracy (slow convergence and identification). Note: δ and K are not computable beforehand in general...
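For R = ‖·‖₁ the gap δ can be read off a computed solution (a minimal sketch on a made-up instance; in line with the slide's caveat, x⋆ is only known here through many prox-gradient iterations):

```python
# Degeneracy gap delta for a LASSO solution: compare the support of x* with
# the saturated set of the dual certificate u* = A^T(y - A x*)/lambda.
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

rng = np.random.default_rng(2)
A = rng.standard_normal((30, 8))
y = A @ np.concatenate([np.array([3.0, -2.0]), np.zeros(6)])
lam, gamma = 1.0, 1.0 / np.linalg.norm(A, 2) ** 2

x = np.zeros(8)
for _ in range(20000):
    x = soft_threshold(x - gamma * A.T @ (A @ x - y), gamma * lam)

u = A.T @ (y - A @ x) / lam
support = np.flatnonzero(x)
saturated = np.flatnonzero(np.abs(np.abs(u) - 1.0) < 1e-6)
delta = len(saturated) - len(support)
print(support, saturated, delta)   # generically delta = 0 on a random instance
```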

SLIDE 28

Enlarged identification of proximal algorithms

Illustration with nuclear norm

Matrix least-squares regularized by the nuclear norm ‖X‖* = ‖σ(X)‖₁:

min_{X∈R^{m×m}} ½‖A(X) − y‖² + λ‖X‖*

Generate many random problems (with m = 20 and n = 300) and solve them; select those with rank(X⋆) = 4 and δ = 0 or 3, where δ = #{i : |σ_i(U⋆)| = 1} − rank(X⋆).

Plot the decrease of rank(X_k) along the iterations X_{k+1} = prox_{γ‖·‖*}(X_k − γ A*(A(X_k) − y)). [figure: rank(X_k) vs iterations; δ = 0 (well-posed) identifies rank 4 quickly, vs δ = 3 (degenerate) much later]
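The prox used in this experiment is soft-thresholding of the singular values; a minimal sketch (the operator A and the instance sizes from the slide are not reproduced):

```python
# prox of t*||.||_* : soft-threshold the singular values. Iterates of the
# matrix proximal-gradient method therefore have exactly low rank, which is
# what the rank(X_k) curves on this slide track.
import numpy as np

def prox_nuclear(X, t):
    """Singular value soft-thresholding = prox of t * nuclear norm."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - t, 0.0)) @ Vt

rng = np.random.default_rng(3)
s_true = np.array([10.0, 5.0, 1.0, 0.5])
Q1, _ = np.linalg.qr(rng.standard_normal((20, 4)))
Q2, _ = np.linalg.qr(rng.standard_normal((20, 4)))
X = Q1 @ np.diag(s_true) @ Q2.T        # rank-4 matrix with known spectrum
Y = prox_nuclear(X, t=2.0)             # annihilates singular values <= 2
print(np.linalg.matrix_rank(X), np.linalg.matrix_rank(Y))   # 4 -> 2
```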

SLIDE 29

Outline — 3. Application: communication-efficient federated learning
SLIDE 30

Application: communication-efficient federated learning

Basic distributed learning set-up

(Standard) centralized learning

Data Data Data Data Data

Data (aj, yj)j=1,...,n, prediction function h(·, x), model parameters x ∈ Rd

12

slide-31
SLIDE 31

Application: communication-efficient federated learning

Basic distributed learning set-up

(Standard) centralized learning

Data Data Data Data Model Data

Data (aj, yj)j=1,...,n, prediction function h(·, x), model parameters x ∈ Rd

12

slide-32
SLIDE 32

Application: communication-efficient federated learning

Basic distributed learning set-up

(Standard) centralized learning

Data Model Data Data Data Data

Data (aj, yj)j=1,...,n, prediction function h(·, x), model parameters x ∈ Rd

12

slide-33
SLIDE 33

Application: communication-efficient federated learning

Basic distributed learning set-up

(Standard) centralized learning

Data Model Data Data Data Data

Data (aj, yj)j=1,...,n, prediction function h(·, x), model parameters x ∈ Rd Empirical risk minimization min

x∈Rd

1 n

n

  • j=1

  • yj, h(aj, x)
  • +

λ R(x)

12

slide-34
SLIDE 34

Application: communication-efficient federated learning

Basic distributed learning set-up

(Standard) centralized learning needs of lot of storage is highly privacy invasive

Data Model Data Data Data Data

Data (aj, yj)j=1,...,n, prediction function h(·, x), model parameters x ∈ Rd Empirical risk minimization min

x∈Rd

1 n

n

  • j=1

  • yj, h(aj, x)
  • +

λ R(x)

12

SLIDE 35–41

Application: communication-efficient federated learning

Move the model, not the data!

Collaborative/Federated learning (cf. the introduction of Aurélien's talk this morning). [figure: the model travels between the central server and the devices while the data stays local; updates are compressed on the way up, the model on the way down]

Communication is the bottleneck. We need compression! (Mikael's talk, yesterday morning?)

Many compression techniques exist... recall Martin's talk yesterday afternoon. Let's discuss another one, complementary to existing ones.

SLIDE 42–44

Application: communication-efficient federated learning

Application of identification to federated learning

With nonsmooth regularizers, identification comes into play. [figure: devices hold regularized models; identification gives automatic compression of the model sent down and adaptive compression of the updates sent up]

Observation: identification gives automatic model compression; e.g. for R = ‖·‖₁ the model becomes sparse... just communicate the nonzero entries!

[Grishchenko, Iutzeler, M. '19] uses identification again for update compression: project the update onto M_{x_k} plus a randomly selected M; e.g. for R = ‖·‖₁, select the current support plus random entries (see the sketch below).

The algorithm comes with an intricate convergence analysis, due to the non-uniform selection...
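A sketch of the selection rule for R = ‖·‖₁ (helper names and proportions are illustrative, not the paper's implementation; the actual algorithm and analysis are in [Grishchenko, Iutzeler, M. '19]):

```python
# Adaptive update compression: communicate the update only on the current
# support of x_k plus a few randomly selected extra coordinates.
import numpy as np

def select_subspace(x_k, extra_frac, rng):
    """Current support of x_k, plus a random fraction of other coordinates."""
    d = x_k.size
    support = np.flatnonzero(x_k)
    others = np.setdiff1d(np.arange(d), support)
    extra = rng.choice(others, size=int(extra_frac * d), replace=False)
    return np.union1d(support, extra)

def compressed_update(x_k, full_update, extra_frac, rng):
    mask = select_subspace(x_k, extra_frac, rng)
    sparse = np.zeros_like(full_update)
    sparse[mask] = full_update[mask]    # only these entries are communicated
    return sparse

rng = np.random.default_rng(4)
x_k = np.array([0.0, 1.2, 0.0, 0.0, -0.7, 0.0])
g = rng.standard_normal(6)
print(compressed_update(x_k, g, extra_frac=0.2, rng=rng))
```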

SLIDE 45–47

Application: communication-efficient federated learning

Illustration of communication-efficient proximal method

On an instance of TV-regularized logistic regression (a1a dataset on 10 machines):

min_{x∈R^d} (1/n) Σ_{j=1}^n log(1 + exp(−y_j ⟨a_j, x⟩)) + λ TV(x)

with the total variation TV(x) = Σ_{i=1}^{d−1} |x_{i+1} − x_i|.

Comparison of the usual distributed proximal gradient (black) and the adaptive distributed proximal-subspace descent (red), for different selections M_{x_k} + random others (10%, 20%, 50%).

[figure: suboptimality vs iterations and suboptimality vs total communication, for Standard Prox-Grad and for M_{x_k}+10%, +20%, +50%; the adaptive variants match the standard method per iteration and beat it per communicated bit]

Acceleration... with respect to the size of communication. Tradeoff between compression (less communication) and identification (faster convergence).
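For reference, the objective of this experiment in code (a minimal sketch; loading the a1a dataset and the distributed implementation over 10 machines are omitted):

```python
# TV-regularized logistic regression objective from this slide (illustrative
# random data standing in for a1a; labels y_j are in {-1, +1}).
import numpy as np

def tv(x):
    """1D total variation: sum_i |x_{i+1} - x_i|."""
    return np.sum(np.abs(np.diff(x)))

def objective(x, A, y, lam):
    """(1/n) sum_j log(1 + exp(-y_j <a_j, x>)) + lam * TV(x)."""
    margins = -y * (A @ x)
    return np.mean(np.logaddexp(0.0, margins)) + lam * tv(x)

rng = np.random.default_rng(5)
A, y = rng.standard_normal((50, 10)), rng.choice([-1.0, 1.0], size=50)
print(objective(rng.standard_normal(10), A, y, lam=0.1))
```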

SLIDE 48

Outline — 4. Application: model consistency for regularized least-squares
SLIDE 49

Application: model consistency for regularized least-squares

Supervised learning: model consistency ?

Assume data (ai, yi)i=1,...,n are sampled from linear model y = a, ¯ x + ν with random (a, ν) Structural assumption: ¯ x has a low-complexity for R

¯ x = argminx∈Rd

  • R(x) : x ∈ argminz∈Rd E
  • (a, z − y)2

Regularized least-squares

(if R= · 1, this is LASSO)

min

x∈Rd

1 2n

n

  • i=1

(ai, x − yi)2 + λn R(x) Stochastic (proximal-)gradient algorithms (at iteration k, pick randomly i(k)) xk+1 = proxγkλnR

  • xk − γk
  • (ai(k), xk − yi(k)) ai(k) + εk
  • E.g. SGD, SAGA [Delfazio et al ’14], SVRG [Xiao-Zhang ’14]

Do we have model recovery/consistency i.e. xk ∈ M¯

x ?

(if we have enough observations, i.e. when n → +∞)

16
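One iteration of the displayed recursion in code, for R = ‖·‖₁ (a minimal sketch on a made-up instance; per the next slide, plain prox-SGD does not identify exactly, unlike variance-reduced methods such as SAGA/SVRG):

```python
# One step of the stochastic proximal-gradient recursion displayed above.
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_sgd_step(x, A, y, lam_n, gamma_k, rng):
    i = rng.integers(A.shape[0])                 # pick i(k) at random
    grad = (A[i] @ x - y[i]) * A[i]              # stochastic LS gradient
    return soft_threshold(x - gamma_k * grad, gamma_k * lam_n)

rng = np.random.default_rng(6)
A = rng.standard_normal((100, 5))
y = A @ np.array([1.0, -1.0, 0.0, 0.0, 0.0]) + 0.1 * rng.standard_normal(100)
x = np.zeros(5)
for k in range(1, 5001):
    x = prox_sgd_step(x, A, y, lam_n=0.1, gamma_k=1.0 / (10 + k), rng=rng)
print(x)   # approximately supported on the first two coordinates
```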

SLIDE 50–51

Application: model consistency for regularized least-squares

Enlarged identification of stochastic algorithms

Theorem (Garrigos, Fadili, M., Peyré '19)
Take λ_n → 0 with λ_n √(n / log log n) → +∞. If n is large enough, then for

x_{k+1} = prox_{γ_k λ_n R}( x_k − γ_k [(⟨a_{i(k)}, x_k⟩ − y_{i(k)}) a_{i(k)} + ε_k] ),

under mild assumptions on the errors ε_k and stepsizes γ_k, for k large, almost surely,

M_x̄ ≼ M_{x_k} ≼ J_{R*}(M*_η̄)

with η̄ = argmin { η⊤ C† η : η ∈ ∂R(x̄) ∩ Im C } and C = E[aa⊤].

Comments:
– the key dual object is η̄ ∈ ∂R(x̄) [Vaiter et al '16]
– λ_n decreases to 0, but not too fast
– SAGA and SVRG satisfy the "mild" assumption [Poon et al '18]
– (prox-)SGD does not, and does not identify (e.g. [Lee Wright '12])

[figure: support identification along iterations on a LASSO instance]

SLIDE 52–54

Conclusion

Take-home message: identification often holds... and can be used.
– Enlarged identification results (explaining observed phenomena)
– Better understanding of optimization algorithms (beyond convergence)
– Sparsified communications by adaptive dimension reduction

Extensions, on-going work:
– Many possible refinements of sensitivity results: other data-fidelity terms, a priori control on strata dimension, explaining transition curves...
– Use identification to accelerate convergence: interplay between identification and acceleration (PhD of Gilles Bareilles)
– Subspace descent algorithms generalizing coordinate descent: "coordinate" descent for nonseparable functions → Franck's talk tomorrow

Thanks!!