Adaptive primal-dual stochastic gradient methods, Yangyang Xu (PowerPoint PPT presentation)


SLIDE 1

Adaptive primal-dual stochastic gradient methods

Yangyang Xu Mathematical Sciences, Rensselaer Polytechnic Institute October 26, 2019

1 / 22

SLIDE 2

Stochastic gradient method

stochastic program: $\min_{x\in X} f(x) = \mathbb{E}_\xi\big[F(x;\xi)\big]$

  • if $\xi$ is uniform on $\{\xi_1,\dots,\xi_N\}$, then $f(x) = \frac{1}{N}\sum_{i=1}^N F(x;\xi_i)$

  • stochastic gradient method (requires samples of $\xi$):
$x^{k+1} = \mathrm{Proj}_X\big(x^k - \alpha_k g^k\big)$
  • where $g^k$ is a stochastic approximation of $\nabla f(x^k)$
  • low per-update complexity compared to deterministic gradient descent
  • Literature: tons of works (e.g., [Robbins-Monro'51, Polyak-Juditsky'92, Nemirovski et al.'09, Ghadimi-Lan'13, Davis et al.'18])
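The update above can be sketched as follows; the box constraint set, the toy least-squares objective, and the $\alpha_0/\sqrt{k+1}$ step-size schedule are illustrative assumptions, not from the talk.

```python
import numpy as np

def proj_box(x, lo, hi):
    # projection onto a simple box set X = [lo, hi]^n
    return np.clip(x, lo, hi)

def sgd(grad_sample, x0, steps, alpha0, lo=-1.0, hi=1.0, seed=0):
    # projected stochastic gradient: x^{k+1} = Proj_X(x^k - alpha_k g^k)
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for k in range(steps):
        g = grad_sample(x, rng)                       # stochastic gradient g^k
        x = proj_box(x - alpha0 / np.sqrt(k + 1) * g, lo, hi)
    return x

# toy problem: f(x) = E ||x - xi||^2 / 2 with xi ~ N(0.3, 0.01 I); minimizer 0.3
grad = lambda x, rng: x - (0.3 + 0.1 * rng.standard_normal(x.shape))
x = sgd(grad, np.zeros(3), 2000, 0.5)
```

Each update touches one sample, which is the "low per-update complexity" point of the next bullet.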

2 / 22

SLIDE 3

adaptive learning

  • adaptive gradient (AdaGrad) [Duchi-Hazan-Singer'11]:
$x^{k+1} = \mathrm{Proj}_X^{v^k}\big(x^k - \alpha_k\, g^k \oslash \sqrt{v^k}\big)$
  • where $v^k = \sum_{t=0}^{k} (g^t)^2$

  • many other adaptive variants: Adam [Kingma-Ba'14], AMSGrad [Reddi-Kale-Kumar'18], and so on

  • extremely popular in training deep neural networks
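A sketch of the AdaGrad update with $X = \mathbb{R}^n$ (so the projection is trivial); the toy objective and step size are illustrative assumptions.

```python
import numpy as np

def adagrad(grad_sample, x0, steps, alpha=0.5, eps=1e-8, seed=0):
    # AdaGrad: each coordinate of g^k is divided by the square root of the
    # running sum of squared past gradients (coordinate-wise step sizes)
    rng = np.random.default_rng(seed)
    x = x0.copy()
    v = np.zeros_like(x)
    for _ in range(steps):
        g = grad_sample(x, rng)
        v += g * g                        # v^k = sum_{t<=k} (g^t)^2
        x -= alpha * g / (np.sqrt(v) + eps)
    return x

# same toy objective as before: minimizer at 0.3
grad = lambda x, rng: x - (0.3 + 0.1 * rng.standard_normal(x.shape))
x = adagrad(grad, np.zeros(3), 2000)
```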

3 / 22

SLIDE 4

Adaptiveness improves convergence speed

[Figure: objective value vs. passes of data for AdaGrad, Adam, and tuned SGD]

  • test on training a neural network with one hidden layer

Observation: adaptive methods much faster, and all methods have similar per-update cost

4 / 22

SLIDE 5

Take a close look: $x^{k+1} = \mathrm{Proj}_X^{v^k}\big(x^k - \alpha_k\, g^k \oslash \sqrt{v^k}\big)$

  • $\mathrm{Proj}_X^{v^k}$ is assumed simple (holds if X is simple)
  • not (easily) implementable if X is complicated

This talk: adaptive primal-dual stochastic gradient for problems with complicated constraints

5 / 22

SLIDE 6

Outline

  • 1. Problem formulation and motivating examples
  • 2. Review of existing methods
  • 3. Proposed primal-dual stochastic gradient method
  • 4. Numerical and convergence results and conclusions

6 / 22

SLIDE 7

Functional constrained stochastic program

$\min_{x\in X}\; f_0(x) = \mathbb{E}_{\xi_0}\big[F_0(x;\xi_0)\big]$, s.t. $f_j(x) = \mathbb{E}_{\xi_j}\big[F_j(x;\xi_j)\big] \le 0$, $j = 1,\dots,m$ (P)

  • X is a simple closed convex set (but the feasible set is complicated)
  • fj is convex and possibly nondifferentiable
  • m could be very big: expensive to access all fj’s at every update

Goal: design an efficient stochastic method, free of complicated projections, that guarantees (near) optimality and feasibility

7 / 22

SLIDE 8

Example I: linear programming of Markov decision process

discounted Markov decision process: (S, A, P, r, γ)

  • state space S = {s1, . . . , sm}, action space A = {a1, . . . , an}
  • transition probability P = [Pa(s, s′)], reward r = [ra(s, s′)]
  • discount factor: γ ∈ (0, 1]

Bellman optimality equation:
$v(s) = \max_{a\in A} \sum_{s'\in S} P_a(s,s')\big[r_a(s,s') + \gamma v(s')\big], \quad \forall s \in S$

equivalent to a linear program [Puterman'14]:
$\min_v\; e^\top v$, s.t. $(I - \gamma P_a)v - r_a \ge 0, \ \forall a \in A$

  • $r_a(s) = \sum_{s'\in S} P_a(s,s')\, r_a(s,s')$

  • huge number of constraints if m and/or n is big
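The LP constraint structure can be made concrete on a tiny random MDP; the sizes, the Dirichlet transition model, and the use of value iteration to produce the optimal value function are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
m_s, n_a, gamma = 4, 3, 0.9                        # 4 states, 3 actions (toy sizes)
P = rng.dirichlet(np.ones(m_s), size=(n_a, m_s))   # P[a, s, :] = P_a(s, .)
R = rng.random((n_a, m_s, m_s))                    # r_a(s, s')
r = np.einsum('ast,ast->as', P, R)                 # r_a(s) = sum_s' P_a(s,s') r_a(s,s')

# value iteration gives v*, which satisfies the Bellman equation and is
# feasible (and optimal) for the LP  min e'v  s.t. (I - gamma P_a) v - r_a >= 0
v = np.zeros(m_s)
for _ in range(1000):
    v = np.max(r + gamma * P @ v, axis=0)

# each action a contributes m_s linear constraints: (I - gamma P_a) v - r_a >= 0
slacks = [(np.eye(m_s) - gamma * P[a]) @ v - r[a] for a in range(n_a)]
```

With `m_s * n_a` constraints in total, even moderate state/action spaces make touching every constraint per iteration expensive, which is the point of the bullet above.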

8 / 22

SLIDE 9

Example II: robust optimization by sampling

Robust optimization: $\min_{x\in X} f_0(x)$, s.t. $g(x;\xi) \le 0, \ \forall \xi \in \Xi$

Sampled approximation [Calafiore-Campi'05]: $\min_{x\in X} f_0(x)$, s.t. $g(x;\xi_i) \le 0, \ \forall i = 1,\dots,m$

  • $\{\xi_1,\dots,\xi_m\}$: m independently extracted samples
  • a solution of the sampled approximation problem is a $(1-\tau)$-level robustly feasible solution with probability at least $1-\varepsilon$ if $m \ge \frac{n}{\tau\varepsilon} - 1$, where $\tau \in (0,1)$ and $\varepsilon \in (0,1)$

9 / 22

SLIDE 10

Literature

Few for problems with functional constraints

  • penalty method with stochastic approximation [Wang-Ma-Yuan'17]
  • uses exact function/gradient information of all constraint functions
  • stochastic mirror-prox descent for saddle-point problems [Baes-Bürgisser-Nemirovski'13]
  • cooperative stochastic approximation (CSA) for problems with expectation constraint [Lan-Zhou'16]
  • level-set methods [Lin et al.'18]

10 / 22

SLIDE 11

Stochastic mirror-prox method [Baes-Bürgisser-Nemirovski'13]

For a saddle-point problem: $\min_{x\in X} \max_{z\in Z} L(x,z)$

Iterative update scheme:
$(\hat x^k, \hat z^k) = \mathrm{Proj}_{X\times Z}\big(x^k - \alpha_k g_x^k,\; z^k + \alpha_k g_z^k\big),$
$(x^{k+1}, z^{k+1}) = \mathrm{Proj}_{X\times Z}\big(x^k - \alpha_k \hat g_x^k,\; z^k + \alpha_k \hat g_z^k\big)$

  • $(g_x^k; g_z^k)$: a stochastic approximation of $\nabla L(x^k, z^k)$
  • $(\hat g_x^k; \hat g_z^k)$: a stochastic approximation of $\nabla L(\hat x^k, \hat z^k)$
  • $O(1/\sqrt{k})$ rate in terms of primal-dual gap
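A minimal sketch of the two-step (extrapolate, then update) scheme, assuming Euclidean prox, exact gradients, and a toy quadratic saddle function $L(x,z) = x^2/2 + xz - z^2/2$; the stochastic version would replace the gradients with sampled estimates.

```python
def grad_x(x, z): return x + z        # dL/dx for L(x,z) = x^2/2 + x z - z^2/2
def grad_z(x, z): return x - z        # dL/dz

def extragradient(steps=500, alpha=0.2):
    # mirror-prox with Euclidean prox: extrapolate to (x_hat, z_hat), then
    # take the real step using gradients evaluated at the extrapolated point
    x, z = 1.0, 1.0
    for _ in range(steps):
        xh = x - alpha * grad_x(x, z)     # extrapolation step
        zh = z + alpha * grad_z(x, z)
        x = x - alpha * grad_x(xh, zh)    # update step
        z = z + alpha * grad_z(xh, zh)
    return x, z

x, z = extragradient()   # the saddle point of this toy L is (0, 0)
```

The extrapolation is what stabilizes the rotation that plain gradient descent-ascent exhibits on saddle problems.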

11 / 22

SLIDE 12

Cooperative stochastic approximation [Lan-Zhou’16]

For the problem with expectation constraint:
$\min_{x\in X}\; f(x) = \mathbb{E}_\xi[F(x,\xi)]$, s.t. $\mathbb{E}_\xi[G(x,\xi)] \le 0$

For k = 0, 1, ..., do
  • 1. sample $\xi^k$;
  • 2. if $G(x^k, \xi^k) \le \eta_k$, set $g^k = \tilde\nabla F(x^k, \xi^k)$; otherwise, $g^k = \tilde\nabla G(x^k, \xi^k)$;
  • 3. update x by $x^{k+1} = \arg\min_{x\in X}\; \langle g^k, x\rangle + \frac{1}{2\gamma_k}\|x - x^k\|^2$

  • purely primal method
  • $O(1/\sqrt{k})$ rate for convex problems
  • $O(1/k)$ if both objective and constraint functions are strongly convex
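A sketch of the CSA switching rule on a 1-d toy with a noisy linear expectation constraint; the problem data, tolerance $\eta_k$, and step size $\gamma_k$ are illustrative choices, not from [Lan-Zhou'16].

```python
import numpy as np

def csa(K=4000, seed=0):
    # CSA sketch on a 1-d toy:
    #   min E[(x - xi)^2]  s.t.  E[1 - x + xi] <= 0,  x in X = [-5, 5],
    # with xi ~ N(0, 0.1^2); the expected constraint is x >= 1, so x* = 1
    rng = np.random.default_rng(seed)
    x = 0.0
    eta = 2.0 / np.sqrt(K)               # tolerance eta_k (constant here)
    gamma = 1.0 / np.sqrt(K)             # step size gamma_k (constant here)
    late = []
    for k in range(K):
        xi = 0.1 * rng.standard_normal()
        if 1.0 - x + xi <= eta:          # sampled constraint nearly satisfied:
            g = 2.0 * (x - xi)           #   take an objective gradient step
        else:                            # otherwise:
            g = -1.0                     #   take a constraint gradient step
        x = float(np.clip(x - gamma * g, -5.0, 5.0))
        if k >= K // 2:
            late.append(x)
    return sum(late) / len(late)         # average of the later iterates

xbar = csa()
```

The method alternates between reducing the objective and restoring feasibility, which is why it stays purely primal.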

12 / 22

SLIDE 13

Proposed method based on the augmented Lagrangian function

13 / 22

SLIDE 14

Augmented Lagrangian function

With slack variables $s \ge 0$, (P) is equivalent to
$\min_{x\in X,\, s\ge 0}\; f_0(x)$, s.t. $f_i(x) + s_i = 0$, $i = 1,\dots,m$.

Adding a quadratic penalty gives the augmented Lagrangian function
$\tilde L_\beta(x, s, z) = f_0(x) + \sum_{i=1}^m z_i\big(f_i(x) + s_i\big) + \frac{\beta}{2}\sum_{i=1}^m \big(f_i(x) + s_i\big)^2.$

Fix $(x, z)$ and minimize $\tilde L_\beta$ over $s \ge 0$ (through solving $\nabla_s \tilde L_\beta = 0$):
$s_i = \Big[-\frac{z_i}{\beta} - f_i(x)\Big]_+, \quad i = 1,\dots,m.$
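The closed-form slack above can be sanity-checked numerically; the specific values of $\beta$, $f_i$, $z_i$, and the search grid below are arbitrary test choices, not from the talk.

```python
import numpy as np

# for fixed (x, z), the s-part of tilde L_beta separates per index i into
#   phi(s) = z_i (f_i + s) + (beta/2) (f_i + s)^2,
# minimized over s >= 0 by s_i = max(-z_i/beta - f_i, 0)
def s_opt(f_i, z_i, beta):
    return max(-z_i / beta - f_i, 0.0)

def phi(s, f_i, z_i, beta):
    return z_i * (f_i + s) + 0.5 * beta * (f_i + s) ** 2

beta = 2.0                               # beta = 2 is an arbitrary test value
grid = np.linspace(0.0, 5.0, 200001)     # brute-force search over s >= 0
checks = [
    phi(s_opt(f, z, beta), f, z, beta) <= phi(grid, f, z, beta).min() + 1e-9
    for f in (-1.0, 0.3) for z in (-0.5, 1.0)
]
```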

14 / 22

SLIDE 15

Augmented Lagrangian function

Eliminating s yields the classic augmented Lagrangian function of (P):
$L_\beta(x,z) = f_0(x) + \sum_{i=1}^m \psi_\beta(f_i(x), z_i)$, where
$\psi_\beta(u,v) = \begin{cases} uv + \frac{\beta}{2}u^2, & \text{if } \beta u + v \ge 0,\\[2pt] -\frac{v^2}{2\beta}, & \text{if } \beta u + v < 0.\end{cases}$

  • $\psi_\beta(f_i(x), z_i)$ is convex in x and concave in $z_i$ for each i
  • thus $L_\beta$ is convex in x and concave in z
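A small sketch of $\psi_\beta$, together with a brute-force check that it agrees with the slack-eliminated minimum $\min_{s\ge 0}\big[v(u+s) + \frac{\beta}{2}(u+s)^2\big]$; $\beta = 2$ and the test grid are arbitrary choices.

```python
import numpy as np

def psi(u, v, beta):
    # psi_beta(u, v) = u v + (beta/2) u^2  if beta*u + v >= 0
    #                = -v^2 / (2 beta)     otherwise
    return np.where(beta * u + v >= 0.0,
                    u * v + 0.5 * beta * u ** 2,
                    -v ** 2 / (2.0 * beta))

def psi_bruteforce(u, v, beta):
    # psi_beta should equal min over s >= 0 of v (u + s) + (beta/2)(u + s)^2,
    # which is exactly what eliminating the slack variable computes
    s = np.linspace(0.0, 10.0, 100001)
    t = u + s
    return float(np.min(v * t + 0.5 * beta * t * t))
```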

15 / 22

SLIDE 16

Augmented Lagrangian method

Choose $(x^1, z^1)$. For k = 1, 2, ..., iteratively do:
$x^{k+1} \in \operatorname{Arg\,min}_{x\in X}\; L_\beta(x, z^k), \quad z^{k+1} = z^k + \rho \nabla_z L_\beta(x^{k+1}, z^k)$

  • if $\rho < 2\beta$, globally convergent with rate $O(1/k)$
  • bigger ρ and β give faster convergence in terms of iteration number but yield a harder x-subproblem
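A minimal sketch of the ALM iteration on a 1-d toy problem; the toy constraint, the inner gradient-descent solver, and the step sizes are illustrative assumptions (the x-subproblem need not be solved this way).

```python
def alm(beta=2.0, rho=2.0, outer=50, inner=200):
    # ALM sketch on the 1-d toy  min x^2  s.t.  f1(x) = 1 - x <= 0,
    # whose KKT point is x* = 1, z* = 2; here L_beta(x,z) = x^2 + psi_beta(1-x, z)
    x, z = 0.0, 0.0
    for _ in range(outer):
        for _ in range(inner):          # inner gradient descent on L_beta(., z)
            u = 1.0 - x                 # f1(x)
            # d/dx psi_beta(f1(x), z) = -(z + beta*u) if beta*u + z >= 0, else 0
            gpsi = -(z + beta * u) if beta * u + z >= 0.0 else 0.0
            x -= 0.1 * (2.0 * x + gpsi)
        # dual ascent: grad_z L_beta(x, z) = max(f1(x), -z/beta)
        z += rho * max(1.0 - x, -z / beta)
    return x, z

x, z = alm()
```

The trade-off on the slide shows up directly: larger `beta` makes the dual iterates converge in fewer outer loops but makes the inner minimization stiffer.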

16 / 22

SLIDE 17

Proposed primal-dual stochastic gradient method

Consider the case:

  • exact $f_j$ and $\tilde\nabla f_j$ can be obtained for each $j = 1,\dots,m$
  • m is big: expensive to access all $f_j$'s at every update

Examples: MDP, robust optimization by sampling, multi-class SVM

Remarks:
  • if $f_j$ is stochastic, the AL function takes a compositional expectation form
  • it is then difficult to obtain an unbiased stochastic estimate of $\tilde\nabla_x L_\beta$
  • the ordinary Lagrangian function can be used to handle the most general case
17 / 22

SLIDE 18

Proposed primal-dual stochastic gradient method

For k = 0, 1, ..., do
  • 1. sample $\xi^k$ and pick $j_k \in [m]$ uniformly at random;
  • 2. let $g^k = \tilde\nabla F_0(x^k, \xi^k) + \tilde\nabla_x \psi_\beta\big(f_{j_k}(x^k), z_{j_k}^k\big)$;
  • 3. update the primal variable x by $x^{k+1} = \mathrm{Proj}_X\big(x^k - D_k^{-1} g^k\big)$;
  • 4. let $z_j^{k+1} = z_j^k$ for $j \ne j_k$, and update $z_{j_k}$ by
$z_{j_k}^{k+1} = z_{j_k}^k + \rho_k \cdot \max\Big(-\frac{z_{j_k}^k}{\beta},\; f_{j_k}(x^k)\Big).$

  • $g^k$ is an unbiased stochastic estimate of $\tilde\nabla_x L_\beta$ at $x^k$
  • only $\tilde\nabla f_{j_k}(x^k)$ and $f_{j_k}(x^k)$ are needed for the updates
  • $D_k = I/\alpha_k + \eta \cdot \mathrm{diag}\Big(\sqrt{\sum_{t=0}^k |\tilde g^t|^2}\Big)$, with $\tilde g^t$ a scaled version of $g^t$
18 / 22

SLIDE 19

How the proposed method performs

Test on a convex quadratically constrained quadratic program:
$\min_{x\in X}\; \frac{1}{2N}\sum_{i=1}^N \|H_i x - c_i\|^2$, s.t. $\frac{1}{2}x^\top Q_j x + a_j^\top x \le b_j$, $j = 1,\dots,m$,
where $N = m = 10{,}000$.

[Figure: objective distance to optimality and average feasibility residual vs. number of epochs, for PDSG-nonadp, PDSG-adp, CSA, and mirror-prox]

Observations:

  • proposed methods better than mirror-prox and CSA
  • adaptiveness significantly improves convergence speed
  • all methods have roughly the same asymptotic convergence rate

19 / 22

SLIDE 20

Sublinear convergence result

Assumptions:

  • 1. existence of a primal-dual solution (x∗, z∗)
  • 2. unbiased estimate and bounded variance
  • 3. bounded constraint function and subgradient

Theorem: Given K, let $\alpha_k = \frac{\alpha}{\sqrt{K}}$, $\rho_k = \frac{\rho}{\sqrt{K}}$, $\beta \ge \rho$. Then
$\max\Big\{\mathbb{E}\big[f_0(\bar x^K) - f_0(x^*)\big],\; \mathbb{E}\big\|[f(\bar x^K)]_+\big\|\Big\} = O(1/\sqrt{K}).$

If $f_0$ is strongly convex, let $\alpha_k = \frac{\alpha}{k}$, $\rho_k = \frac{\rho}{\log(K+1)}$, $\beta \ge 2\rho\log 2$. Then
$\mathbb{E}\|x^K - x^*\|^2 = O\Big(\frac{\log(K+1)}{K}\Big).$

  • $\bar x^K$: weighted average of $\{x^k\}_{k=1}^{K+1}$

Remark: CSA [Lan-Zhou'16] requires strong convexity of both objective and constraint functions to achieve $O(\frac{1}{K})$

20 / 22

SLIDE 21

Conclusions

  • proposed an adaptive primal-dual stochastic gradient method for stochastic programs with many functional constraints
  • based on the classic augmented Lagrangian function
  • $O(1/\sqrt{k})$ convergence for convex problems
  • $O\big((\log k)/k\big)$ convergence if the objective is strongly convex
  • numerical experiment on a convex quadratically constrained quadratic program
  • better than two state-of-the-art methods
  • adaptiveness can significantly improve the convergence speed

21 / 22

SLIDE 22

References

  • Y. Xu. Primal-dual stochastic gradient method for convex programs with many functional constraints. arXiv:1802.02724, 2018.

Thank you!!!

22 / 22