Convex Optimization
(EE227A: UC Berkeley)
Lecture 18
(Proximal methods; Incremental methods – I)
21 March, 2013
Suvrit Sra
Douglas-Rachford (DR) method

Goal: find x such that

    0 ∈ ∂f(x) + ∂g(x)

DR method: given z^0, iterate for k ≥ 0

    x^k = prox_g(z^k)
    v^k = prox_f(2x^k − z^k)
    z^{k+1} = z^k + γ_k(v^k − x^k)

For γ_k = 1, we have

    z^{k+1} = z^k + prox_f(2 prox_g(z^k) − z^k) − prox_g(z^k)
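To make the iteration concrete, here is a minimal Python sketch of DR for the toy problem min_x ½‖x − y‖₂² + λ‖x‖₁, where both prox operators have closed forms. The particular f, g, the data, and the fixed γ_k = 1 are illustrative choices, not part of the lecture.

    import numpy as np

    def douglas_rachford(prox_f, prox_g, z0, gamma=1.0, iters=200):
        # x^k = prox_g(z^k); v^k = prox_f(2x^k - z^k); z^{k+1} = z^k + gamma*(v^k - x^k)
        z = z0
        for _ in range(iters):
            x = prox_g(z)
            v = prox_f(2 * x - z)
            z = z + gamma * (v - x)
        return prox_g(z)  # x^k converges to a minimizer of f + g

    # Illustrative instance: g(x) = 0.5*||x - y||^2, f(x) = lam*||x||_1
    y, lam = np.array([3.0, -0.2, 1.0]), 1.0
    prox_g = lambda u: (u + y) / 2.0                                # prox of 0.5*||. - y||^2
    prox_f = lambda u: np.sign(u) * np.maximum(np.abs(u) - lam, 0)  # soft-thresholding
    print(douglas_rachford(prox_f, prox_g, z0=np.zeros(3)))         # approx. [2.0, 0.0, 0.0]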
Dropping superscripts, we have the fixed-point iteration

    z ← Tz,    T = I + P_f(2P_g − I) − P_g,

where P_f := prox_f and P_g := prox_g.

Lemma. DR can be written as z ← ½(R_f R_g + I)z, where R_f denotes the reflection operator 2P_f − I (similarly R_g).

Exercise: Prove this claim.
Optimizing sums of functions

    min_x f(x) := ½‖x − y‖₂² + Σ_{i=1}^m f_i(x),    or more generally    min_x f(x) := Σ_{i=1}^m f_i(x)

DR does not work immediately: it splits an objective into only two pieces, while here we have a sum of many.
The product-space trick

◮ Original problem over H = R^n:  min_x Σ_{i=1}^m f_i(x)
◮ Introduce new variables (x_1, . . . , x_m)
◮ Now the problem is over the domain H^m := H × H × · · · × H (m times)
◮ New constraint: x_1 = x_2 = · · · = x_m

    min_{(x_1,...,x_m)}  Σ_{i=1}^m f_i(x_i)    s.t.  x_1 = x_2 = · · · = x_m.
Writing the constraint via an indicator function:

    min_x  f(x) + I_B(x)

where x ∈ H^m, f(x) := Σ_{i=1}^m f_i(x_i), and B = {z ∈ H^m | z = (x, x, . . . , x)}.

◮ Let y = (y_1, . . . , y_m)
◮ prox_f(y) = (prox_{f_1}(y_1), . . . , prox_{f_m}(y_m)), since f is separable across the blocks
◮ P_B(y) has a closed form:

    min_{z∈B} ½‖z − y‖₂²  ≡  min_{x∈H} Σ_{i=1}^m ½‖x − y_i‖₂²  ⟹  x = (1/m) Σ_{i=1}^m y_i

Exercise: Work out the details of DR with the above ideas. Note: this trick works in many other situations too!
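In code, the two ingredients are exactly these: a blockwise prox and a projection that averages. A small sketch, where stacking the blocks as an (m, n) NumPy array is my implementation choice; plugging these two operators into the DR sketch above yields a parallel proximal method for m functions.

    import numpy as np

    def prox_f_product(prox_list, Y):
        # prox of f(x_1,...,x_m) = sum_i f_i(x_i): apply each prox_{f_i} to its own block
        return np.stack([prox(y) for prox, y in zip(prox_list, Y)])

    def project_B(Y):
        # P_B: replace every block by the average x = (1/m) * sum_i y_i
        xbar = Y.mean(axis=0)
        return np.tile(xbar, (Y.shape[0], 1))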
Proximal-Dykstra method

    min_x  ½‖x − y‖₂² + g(x) + h(x)

Usually prox_{g+h} ≠ prox_g ∘ prox_h, so we cannot simply compose the two prox operators.

Proximal-Dykstra method:
1. Let x^0 = y; u^0 = 0, z^0 = 0
2. k-th iteration (k ≥ 0):

    w^k = prox_g(x^k + u^k)
    u^{k+1} = x^k + u^k − w^k
    x^{k+1} = prox_h(w^k + z^k)
    z^{k+1} = w^k + z^k − x^{k+1}

Why does it work? After the break...!

Exercise: Use the product-space trick to extend this to a parallel Dykstra-like method for m ≥ 3 functions.
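Here is a direct transcription of the basic two-function method into Python; the prox operators are left abstract. As a sanity check, taking g and h to be indicators of convex sets recovers Dykstra's classical alternating projection method.

    import numpy as np

    def proximal_dykstra(prox_g, prox_h, y, iters=100):
        # solves min_x 0.5*||x - y||^2 + g(x) + h(x)
        x, u, z = y.copy(), np.zeros_like(y), np.zeros_like(y)
        for _ in range(iters):
            w = prox_g(x + u)        # w^k     = prox_g(x^k + u^k)
            u = x + u - w            # u^{k+1} = x^k + u^k - w^k
            x_new = prox_h(w + z)    # x^{k+1} = prox_h(w^k + z^k)
            z = w + z - x_new        # z^{k+1} = w^k + z^k - x^{k+1}
            x = x_new
        return x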
Incremental methods
    min f(x) = Σ_{i=1}^m f_i(x) + λ r(x)

Gradient / subgradient methods:

    x^{k+1} = x^k − α_k ∇f(x^k)    (for λ = 0, smooth f)
    x^{k+1} = x^k − α_k g(x^k),    g(x^k) ∈ ∂f(x^k) + λ ∂r(x^k)
    x^{k+1} = prox_{α_k r}(x^k − α_k ∇f(x^k))

How much computation does one iteration take? Every variant needs the full gradient (or a full subgradient) of f, i.e., gradients of all m components f_i.
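For instance, with λr(x) = λ‖x‖₁ (my choice for illustration), the prox in the last update is the soft-thresholding operator, and one proximal-gradient step looks like:

    import numpy as np

    def prox_grad_step(x, grad_f, alpha, lam):
        # x^{k+1} = prox_{alpha*lam*||.||_1}(x^k - alpha * grad f(x^k))
        v = x - alpha * grad_f(x)                                   # gradient step
        return np.sign(v) * np.maximum(np.abs(v) - alpha * lam, 0)  # soft-threshold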
What if at iteration k, we randomly pick an integer i(k) ∈ {1, 2, . . . , m} and instead just perform the update

    x^{k+1} = x^k − α_k ∇f_{i(k)}(x^k)?

◮ The update requires only the gradient of f_{i(k)}
◮ One iteration is now m times faster than an iteration using ∇f(x)

But does this make sense?
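A minimal sketch of this incremental (stochastic) gradient update; the uniform random selection, the 1/(k+1) stepsize, and the quadratic components in the demo are illustrative assumptions.

    import numpy as np

    def incremental_gradient(grad_fi, m, x0, steps=2000):
        # x^{k+1} = x^k - alpha_k * grad f_{i(k)}(x^k), i(k) uniform on {0,...,m-1}
        rng = np.random.default_rng(0)
        x = x0
        for k in range(steps):
            i = rng.integers(m)
            x = x - (1.0 / (k + 1)) * grad_fi(i, x)
        return x

    # Demo with f_i(x) = 0.5*(a_i*x - b_i)^2, so grad f_i(x) = a_i*(a_i*x - b_i)
    a, b = np.array([1.0, 2.0, 3.0]), np.array([2.0, 1.0, 3.0])
    print(incremental_gradient(lambda i, x: a[i] * (a[i] * x - b[i]), 3, 0.0))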
♥ Old idea; has been used extensively as backpropagation in neural networks, Widrow-Hoff least mean squares, gradient methods with errors, stochastic gradient, etc.

♥ Can effectively be used to “stream” through data — go through the components one by one, say cyclically instead of randomly

♥ If m is very large, many of the f_i(x) may have similar minimizers; by using the f_i individually we hope to take advantage of this fact and greatly speed up

♥ Incremental methods are usually effective far from the eventual limit (solution) — they become very slow close to the solution
◮ Assume all variables involved are scalars:

    min f(x) = ½ Σ_{i=1}^m (a_i x − b_i)²

◮ Solving f′(x) = Σ_i a_i(a_i x − b_i) = 0, we obtain x* = (Σ_i a_i b_i) / (Σ_i a_i²)
◮ The minimum of a single f_i(x) = ½(a_i x − b_i)² is x*_i = b_i / a_i
◮ Notice now that x* ∈ [min_i x*_i, max_i x*_i] =: R
◮ What if we have a scalar x that lies outside R?
◮ We see that ∇f_i(x) = a_i(a_i x − b_i) and ∇f(x) = Σ_i a_i(a_i x − b_i)
◮ Outside R, ∇f_i(x) has the same sign as ∇f(x): indeed ∇f_i(x) = a_i²(x − x*_i), and x lies on the same side of every x*_i. So using ∇f_i(x) instead of ∇f(x) also ensures progress.
◮ But once inside the region R, there is no guarantee that the incremental method makes progress towards the optimum.
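A quick numerical check of the sign claim, on data of my choosing:

    import numpy as np

    a, b = np.array([1.0, 2.0, 4.0]), np.array([2.0, 1.0, 1.0])  # x*_i = 2, 0.5, 0.25
    def signs_agree(x):
        gi = a * (a * x - b)                            # component gradients grad f_i(x)
        return bool(np.all(np.sign(gi) == np.sign(gi.sum())))

    print(signs_agree(3.0))   # True:  x = 3 is outside R = [0.25, 2], all signs agree
    print(signs_agree(1.0))   # False: x = 1 is inside R, the signs disagree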
    min f(x) = Σ_i f_i(x)

What if the f_i are nonsmooth? Replace gradient steps by proximal steps:

    x^{k+1} = prox_{α_k f}(x^k)

or, incrementally,

    x^{k+1} = prox_{α_k f_{i(k)}}(x^k) = argmin_x ½‖x − x^k‖₂² + α_k f_{i(k)}(x)
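A sketch for the nonsmooth components f_i(x) = |a_i x − b_i| (my choice for illustration; after the change of variable t = a x − b, the prox reduces to scalar soft-thresholding):

    import numpy as np

    def prox_abs_affine(v, a, b, alpha):
        # prox of alpha*|a*x - b| at v: soft-threshold t = a*x - b at level alpha*a^2
        c = a * v - b
        t = np.sign(c) * max(abs(c) - alpha * a * a, 0.0)
        return (t + b) / a

    def incremental_prox(a, b, x0, steps=500):
        # x^{k+1} = prox_{alpha_k f_{i(k)}}(x^k), with alpha_k = 1/(k+1)
        rng = np.random.default_rng(0)
        x = x0
        for k in range(steps):
            i = rng.integers(len(a))
            x = prox_abs_affine(x, a[i], b[i], 1.0 / (k + 1))
        return x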
Now add a regularizer:

    min f(x) + r(x),    f(x) = Σ_{i=1}^m f_i(x)

Incremental proximal-gradient: for k = 0, 1, . . .,

    z_1 = x^k
    z_{i+1} = z_i − η_k ∇f_i(z_i),    i = 1, . . . , m − 1
    x^{k+1} = prox_{η_k r}( x^k − η_k Σ_{i=1}^m ∇f_i(z_i) )

We can choose η_k = 1/L, where L is the Lipschitz constant of ∇f(x).

Does this work?
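The sweep written out in Python; the component gradients and the prox of r are left abstract here.

    def incremental_prox_grad(grad_fs, prox_r, x0, eta, outer=100):
        # one outer iteration: z_1 = x^k; z_{i+1} = z_i - eta * grad f_i(z_i);
        # after the loop, z has telescoped to x^k - eta * sum_{i=1}^m grad f_i(z_i),
        # so x^{k+1} = prox_{eta r}(z) matches the displayed update
        x = x0
        for _ in range(outer):
            z = x
            for g in grad_fs:        # cycle through components i = 1, ..., m
                z = z - eta * g(z)
            x = prox_r(z, eta)
        return x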
    min (f(x) = Σ_i f_i(x)) + r(x)

Gradient with error: view ∇f_{i(k)}(x) = ∇f(x) + e(x), an exact gradient plus an error term, so that

    x^{k+1} = prox_{α_k r}[x^k − α_k(∇f(x^k) + e(x^k))]

So if in the limit the error α_k e(x^k) disappears, we should be OK!
Incremental gradient methods may be viewed as gradient methods with error in the gradient computation.

◮ If we can control this error, we can control convergence
◮ The error makes even the smooth case behave more like the nonsmooth case
◮ So convergence crucially depends on the stepsize α_k

Some stepsize choices:
♠ α_k = c, a small enough constant
♠ α_k → 0 with Σ_k α_k = ∞ (diminishing scalar)
♠ Constant for some iterations, then diminishing, then constant again; repeat
♠ α_k = min(c, a/(b + k)), where a, b, c > 0 (user chosen)
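These schedules as code, with placeholder constants of my choosing:

    def constant(k, c=1e-3):
        return c                        # alpha_k = c, a small constant

    def diminishing(k):
        return 1.0 / (k + 1)            # alpha_k -> 0, sum_k alpha_k = inf

    def capped_diminishing(k, a=1.0, b=10.0, c=0.1):
        return min(c, a / (b + k))      # alpha_k = min(c, a/(b + k))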
Practical notes on incremental methods:

♠ Usually much faster (for large m) when far from convergence
♠ Slow progress near the optimum (because α_k is then often too small)
♠ A constant step α_k = α doesn’t always yield convergence
♠ A diminishing step α_k = O(1/k) leads to convergence
♠ Slow, sublinear rate of convergence
♠ An optimal incremental method seems not to be known
♠ The idea extends to subgradient and proximal setups
♠ Some extensions also apply to nonconvex problems
♠ Some extend to parallel and distributed computation

Read (omit proofs): “Incremental methods survey” by D. P. Bertsekas (2010) – see bSpace.
References
1. P. L. Combettes and J.-C. Pesquet. “Proximal splitting methods in signal processing.” In: Fixed-Point Algorithms for Inverse Problems in Science and Engineering, Springer (2011).
2. D. P. Bertsekas. Nonlinear Programming. Athena Scientific (1999).