
Convex Optimization
(EE227A: UC Berkeley)

Lecture 18
(Proximal methods; Incremental methods – I)

21 March, 2013

Suvrit Sra
Douglas-Rachford method

0 ∈ ∂f(x) + ∂g(x)

DR method: given z^0, iterate for k ≥ 0
  x^k = prox_g(z^k)
  v^k = prox_f(2x^k − z^k)
  z^{k+1} = z^k + γ_k(v^k − x^k)

For γ_k = 1, we have
  z^{k+1} = z^k + v^k − x^k
  z^{k+1} = z^k + prox_f(2 prox_g(z^k) − z^k) − prox_g(z^k)

2 / 19
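To make the update concrete, here is a minimal NumPy sketch of the DR iteration above. The function name `douglas_rachford`, the toy objective, and the stopping tolerance are illustrative choices, not part of the lecture.

```python
import numpy as np

def douglas_rachford(prox_f, prox_g, z0, gamma=1.0, max_iter=500, tol=1e-10):
    """DR iteration from the slide: x = prox_g(z), v = prox_f(2x - z), z += gamma*(v - x)."""
    z = z0.copy()
    for _ in range(max_iter):
        x = prox_g(z)
        v = prox_f(2 * x - z)
        z_next = z + gamma * (v - x)
        if np.linalg.norm(z_next - z) <= tol:
            z = z_next
            break
        z = z_next
    return prox_g(z)   # the iterate x = prox_g(z) converges to a minimizer of f + g

# Toy check: minimize 1/2||x - b||^2 + I_{x >= 0}(x); the answer is max(b, 0).
b = np.array([1.0, -2.0, 0.5])
prox_f = lambda v: (v + b) / 2.0        # prox of 1/2||x - b||^2
prox_g = lambda v: np.maximum(v, 0.0)   # projection onto the nonnegative orthant
print(douglas_rachford(prox_f, prox_g, np.zeros(3)))   # ~ [1.0, 0.0, 0.5]
```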

Douglas-Rachford method

z^{k+1} = z^k + prox_f(2 prox_g(z^k) − z^k) − prox_g(z^k)

Dropping superscripts, we have the fixed-point iteration
  z ← Tz,   T = I + P_f(2P_g − I) − P_g

Lemma. DR can be written as z ← ½(R_f R_g + I)z, where R_f denotes the reflection operator 2P_f − I (similarly R_g).
Exercise: Prove this claim.

3 / 19
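For reference, the claim in the Lemma follows by expanding the reflections (essentially the solution to the exercise):

  ½(R_f R_g + I) = ½((2P_f − I)(2P_g − I) + I)
                 = ½(2P_f(2P_g − I) − 2P_g + I + I)
                 = I + P_f(2P_g − I) − P_g = T.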

Proximity for several functions

Optimizing sums of functions:
  f(x) := ½‖x − y‖₂² + Σ_i f_i(x)
  f(x) := Σ_i f_i(x)

DR does not work immediately

4 / 19

Product space trick

◮ Original problem over H = R^n
◮ Suppose we have min_x Σ_{i=1}^m f_i(x)
◮ Introduce new variables (x_1, . . . , x_m)
◮ Now the problem is over the domain H^m := H × H × · · · × H (m times)
◮ New constraint: x_1 = x_2 = · · · = x_m

  min_{(x_1,...,x_m)}  Σ_i f_i(x_i)   s.t.  x_1 = x_2 = · · · = x_m.

5 / 19

Product space trick

  min_x  f(x) + I_B(x)

where x ∈ H^m and B = {z ∈ H^m | z = (x, x, . . . , x)}

◮ Let y = (y_1, . . . , y_m)
◮ prox_f(y) = (prox_{f_1}(y_1), . . . , prox_{f_m}(y_m))
◮ P_B(y) can be computed as follows:

  min_{z∈B} ½‖z − y‖₂²  ⟺  min_{x∈H} Σ_i ½‖x − y_i‖₂²  ⟹  x = (1/m) Σ_i y_i

Exercise: Work out the details of DR with the above ideas.
Note: this trick works for all other situations!

6 / 19
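A minimal NumPy sketch of DR after the product-space lift, taking the prox operators of the f_i as callables. The function name `dr_product_space`, the averaging-based P_B, and the toy data are illustrative assumptions, not from the lecture.

```python
import numpy as np

def dr_product_space(prox_list, z0, n_iter=300):
    """DR on H^m with g = I_B (prox_g = averaging, i.e. P_B) and
    f(x_1, ..., x_m) = sum_i f_i(x_i) (prox_f applied block-wise)."""
    m = len(prox_list)
    Z = np.tile(z0, (m, 1))                  # one copy of the variable per f_i
    for _ in range(n_iter):
        xbar = Z.mean(axis=0)                # prox_g(z) = P_B(z): average the copies
        X = np.tile(xbar, (m, 1))
        V = np.array([prox_list[i](2 * X[i] - Z[i]) for i in range(m)])  # block-wise prox_f
        Z = Z + V - X                        # gamma_k = 1
    return Z.mean(axis=0)

# Toy usage: minimize sum_i 1/2||x - b_i||^2; the solution is the mean of the b_i.
bs = [np.array([0.0, 1.0]), np.array([2.0, 3.0]), np.array([4.0, -1.0])]
prox_list = [lambda v, b=b: (v + b) / 2.0 for b in bs]
print(dr_product_space(prox_list, np.zeros(2)))   # ~ [2.0, 1.0]
```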

Proximity operator for sums

  min_x  ½‖x − y‖₂² + g(x) + h(x)

Usually prox_{g+h} ≠ prox_g ∘ prox_h

Proximal-Dykstra method
1. Let x^0 = y; u^0 = 0, z^0 = 0
2. k-th iteration (k ≥ 0):
   w^k = prox_g(x^k + u^k)
   u^{k+1} = x^k + u^k − w^k
   x^{k+1} = prox_h(w^k + z^k)
   z^{k+1} = w^k + z^k − x^{k+1}

Why does it work? After the break...!
Exercise: Use the product-space trick to extend this to a parallel Dykstra-like method for m ≥ 3 functions.

7 / 19
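A minimal NumPy sketch of the Proximal-Dykstra iteration, with prox_g and prox_h passed in as callables; the function name, the iteration count, and the toy intervals are illustrative assumptions.

```python
import numpy as np

def proximal_dykstra(y, prox_g, prox_h, n_iter=200):
    """Computes prox_{g+h}(y) = argmin_x 1/2||x - y||^2 + g(x) + h(x)
    via the Proximal-Dykstra iteration from the slide."""
    x = y.copy()
    u = np.zeros_like(y)
    z = np.zeros_like(y)
    for _ in range(n_iter):
        w = prox_g(x + u)
        u = x + u - w
        x = prox_h(w + z)
        z = w + z - x
    return x

# Toy usage: g, h indicators of two intervals; prox_{g+h}(y) is then the projection
# of y onto their intersection [0, 0.5].
prox_g = lambda v: np.clip(v, -1.0, 0.5)
prox_h = lambda v: np.clip(v, 0.0, 2.0)
print(proximal_dykstra(np.array([3.0, -2.0]), prox_g, prox_h))   # ~ [0.5, 0.0]
```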


Incremental methods

8 / 19

Separable objectives

  min_x  f(x) = Σ_{i=1}^m f_i(x) + λ r(x)

Gradient / subgradient methods
  x^{k+1} = x^k − α_k ∇f(x^k)   (λ = 0)
  x^{k+1} = x^k − α_k g(x^k),   g(x^k) ∈ ∂f(x^k) + λ ∂r(x^k)
  x^{k+1} = prox_{α_k r}(x^k − α_k ∇f(x^k))

How much computation does one iteration take?

9 / 19

Incremental gradient methods

What if at iteration k, we randomly pick an integer i(k) ∈ {1, 2, . . . , m} and instead just perform the update
  x^{k+1} = x^k − α_k ∇f_{i(k)}(x^k)?

◮ The update requires only the gradient of f_{i(k)}
◮ One iteration is now m times faster than an iteration with ∇f(x)

But does this make sense?

10 / 19
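A minimal NumPy sketch of this update with a uniformly random component choice. The function name `incremental_gradient`, the diminishing stepsize, and the tiny scalar least-squares data (which anticipate the Bertsekas example below) are illustrative assumptions.

```python
import numpy as np

def incremental_gradient(grads, x0, alpha=lambda k: 1.0 / (10.0 + k), n_iter=5000, seed=0):
    """x^{k+1} = x^k - alpha_k * grad f_{i(k)}(x^k), with i(k) uniform over the components."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for k in range(n_iter):
        i = rng.integers(len(grads))        # pick one component at random
        x = x - alpha(k) * grads[i](x)      # step using only grad f_i
    return x

# Toy usage: f_i(x) = 1/2 (a_i x - b_i)^2 in 1-D, so grad f_i(x) = a_i (a_i x - b_i).
a = np.array([1.0, 2.0, 3.0]); b = np.array([1.0, 1.0, 6.0])
grads = [lambda x, i=i: a[i] * (a[i] * x - b[i]) for i in range(len(a))]
x_star = (a @ b) / (a @ a)                  # closed-form minimizer, here 1.5
print(incremental_gradient(grads, 0.0), x_star)   # the iterate ends up near x_star
```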

Incremental gradient methods

♥ Old idea; used extensively as backpropagation in neural networks, Widrow-Hoff least mean squares, gradient methods with errors, stochastic gradient, etc.
♥ Can be used effectively to "stream" through data: go through the components one by one, say cyclically instead of randomly
♥ If m is very large, many of the f_i(x) may have similar minimizers; by using the f_i only individually we hope to take advantage of this fact and greatly speed up
♥ Incremental methods are usually effective far from the eventual limit (solution); they become very slow close to the solution

Example!

11 / 19

Example (Bertsekas)

◮ Assume all variables involved are scalars.
  min f(x) = ½ Σ_{i=1}^m (a_i x − b_i)²
◮ Solving f′(x) = 0 we obtain x^* = (Σ_i a_i b_i) / (Σ_i a_i²)
◮ The minimum of a single f_i(x) = ½(a_i x − b_i)² is x_i^* = b_i/a_i
◮ Notice now that x^* ∈ [min_i x_i^*, max_i x_i^*] =: R

12 / 19

Example (Bertsekas)

◮ Assume all variables involved are scalars.
  min f(x) = ½ Σ_{i=1}^m (a_i x − b_i)²
◮ Notice: x^* ∈ [min_i x_i^*, max_i x_i^*] =: R
◮ What if we have a scalar x that lies outside R?
◮ We see that ∇f_i(x) = a_i(a_i x − b_i) and ∇f(x) = Σ_i a_i(a_i x − b_i)
◮ ∇f_i(x) has the same sign as ∇f(x), so using ∇f_i(x) instead of ∇f(x) also ensures progress (see the numerical check after this slide).
◮ But once inside the region R, there is no guarantee that the incremental method will make progress towards the optimum.

13 / 19
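A quick numerical check of the sign claim; the arrays `a` and `b` are illustrative data, not from the lecture.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0]); b = np.array([1.0, 1.0, 6.0])
x_i = b / a                          # per-component minimizers x_i^* = b_i / a_i
R = (x_i.min(), x_i.max())           # region containing x^*
x_out = R[1] + 1.0                   # a point outside R
g_i = a * (a * x_out - b)            # component gradients grad f_i(x_out)
print(R, np.sign(g_i), np.sign(g_i.sum()))   # outside R, every sign matches grad f(x_out)
```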

Incremental proximal method

  min f(x) = Σ_i f_i(x)

What if the f_i are nonsmooth?

  x^{k+1} = prox_{α_k f}(x^k)
  x^{k+1} = prox_{α_k f_{i(k)}}(x^k) = argmin_x ½‖x − x^k‖₂² + α_k f_{i(k)}(x)

  • i(k) ∈ {1, 2, . . . , m} picked uniformly at random.

14 / 19
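A minimal NumPy sketch of the incremental proximal update, assuming each prox_{t f_i} is available as a callable (v, t) ↦ prox_{t f_i}(v). The function name, the stepsize rule, and the toy absolute-value objective are illustrative.

```python
import numpy as np

def incremental_proximal(prox_list, x0, alpha=lambda k: 1.0 / (1.0 + k), n_iter=2000, seed=0):
    """x^{k+1} = prox_{alpha_k f_{i(k)}}(x^k) with i(k) picked uniformly at random."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for k in range(n_iter):
        i = rng.integers(len(prox_list))
        x = prox_list[i](x, alpha(k))
    return x

# Toy usage: f_i(x) = |x - b_i| (nonsmooth); prox_{t f_i} is soft-thresholding around b_i.
bs = [0.0, 1.0, 5.0]
prox_list = [lambda v, t, b=b: b + np.sign(v - b) * np.maximum(np.abs(v - b) - t, 0.0)
             for b in bs]
print(incremental_proximal(prox_list, 2.0))   # ~ 1.0, the median of the b_i
```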

Incremental proximal-gradients

  min  Σ_i f_i(x) + r(x)

  x^{k+1} = prox_{η_k r}(x^k − η_k Σ_{i=1}^m ∇f_i(z_i)),   k = 0, 1, . . . ,
  where z_1 = x^k and z_{i+1} = z_i − η_k ∇f_i(z_i), i = 1, . . . , m − 1.

We can choose η_k = 1/L, where L is the Lipschitz constant of ∇f(x).

Does this work?

15 / 19
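A minimal NumPy sketch of this scheme: each outer step sweeps through the ∇f_i at the intermediate points z_i, then applies prox_{η r} once. The function name, the toy quadratics, and the ℓ1 regularizer are illustrative; with a constant η the answer is accurate only up to a small stepsize-dependent bias.

```python
import numpy as np

def incremental_prox_gradient(grads, prox_r, x0, eta=0.1, n_outer=200):
    """One outer step: z_1 = x^k, z_{i+1} = z_i - eta * grad f_i(z_i),
    then x^{k+1} = prox_{eta r}(x^k - eta * sum_i grad f_i(z_i))."""
    x = np.asarray(x0, dtype=float)
    m = len(grads)
    for _ in range(n_outer):
        z = x.copy()
        total = np.zeros_like(x)
        for i in range(m):
            g = grads[i](z)
            total += g
            if i < m - 1:
                z = z - eta * g          # advance the intermediate point
        x = prox_r(x - eta * total, eta)  # single prox per outer step
    return x

# Toy usage: f_i(x) = 1/2||x - b_i||^2 and r(x) = ||x||_1 (prox = soft-thresholding).
bs = [np.array([1.0, -0.2]), np.array([3.0, 0.1])]
grads = [lambda x, b=b: x - b for b in bs]
soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)
print(incremental_prox_gradient(grads, soft, np.zeros(2)))   # ~ [1.5, 0.0] up to O(eta) bias
```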

Incremental methods: key realization

  min  (f(x) = Σ_i f_i(x)) + r(x)

Gradient with error:
  ∇f_{i(k)}(x) = ∇f(x) + e(x)
  x^{k+1} = prox_{α_k r}[x^k − α_k(∇f(x^k) + e(x^k))]

So if in the limit the error α_k e(x^k) disappears, we should be OK!

16 / 19

Incremental gradient methods

Incremental gradient methods may be viewed as gradient methods with error in the gradient computation
◮ If we can control this error, we can control convergence
◮ Error makes even the smooth case more like the nonsmooth case
◮ So, convergence crucially depends on the stepsize α_k

Some stepsize choices (sketched in code after this slide)
♠ α_k = c, a small enough constant
♠ α_k → 0, Σ_k α_k = ∞ (diminishing scalar)
♠ Constant for some iterations, then diminishing, then constant again, repeat
♠ α_k = min(c, a/(b + k)), where a, b, c > 0 (user chosen)

17 / 19
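A small sketch of the listed schedules as Python callables; the function name `make_stepsize` and the default constants are ad hoc choices, not from the lecture.

```python
def make_stepsize(rule="hybrid", c=0.1, a=1.0, b=10.0):
    """Return alpha(k) implementing one of the stepsize choices listed above."""
    if rule == "constant":
        return lambda k: c                    # alpha_k = c, a small constant
    if rule == "diminishing":
        return lambda k: a / (b + k)          # alpha_k -> 0 with sum_k alpha_k = inf
    if rule == "hybrid":
        return lambda k: min(c, a / (b + k))  # alpha_k = min(c, a/(b + k))
    raise ValueError(f"unknown rule: {rule}")
```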

Incremental gradient – summary

♠ Usually much faster (large m) when far from convergence
♠ Slow progress near the optimum (because α_k is often too small)
♠ A constant step α_k = α doesn't always yield convergence
♠ A diminishing step α_k = O(1/k) leads to convergence
♠ Slow, sublinear rate of convergence
♠ An optimal incremental method seems not to be known
♠ The idea extends to subgradient and proximal setups
♠ Some extensions also apply to nonconvex problems
♠ Some extend to parallel and distributed computation

Read (omit proofs): "Incremental methods survey" by D. P. Bertsekas (2010); see bSpace.

18 / 19


References

1. Combettes and Pesquet. Proximal splitting methods in signal processing. (2010).
2. Bertsekas. Nonlinear Programming. (1999).

19 / 19