SLIDE 1

A Composite Randomized Incremental Gradient Method

Junyu Zhang (University of Minnesota) and Lin Xiao (Microsoft Research)
International Conference on Machine Learning (ICML), Long Beach, California, June 11, 2019

SLIDE 2

Composite finite-sum optimization

  • problem of focus (a toy sketch of this objective follows at the end of this slide)

$$\min_{x \in \mathbb{R}^d} \quad f\!\left(\frac{1}{n}\sum_{i=1}^n g_i(x)\right) + r(x)$$

– $f : \mathbb{R}^p \to \mathbb{R}$ smooth and possibly nonconvex
– $g_i : \mathbb{R}^d \to \mathbb{R}^p$ smooth vector mapping, $i = 1, \dots, n$
– $r : \mathbb{R}^d \to \mathbb{R} \cup \{\infty\}$ convex but possibly nonsmooth

  • extension to the two-level finite-sum problem

$$\min_{x \in \mathbb{R}^d} \quad \frac{1}{m}\sum_{j=1}^m f_j\!\left(\frac{1}{n}\sum_{i=1}^n g_i(x)\right) + r(x)$$
  • applications beyond ERM

– reinforcement learning (policy evaluation)
– risk-averse optimization, financial mathematics
– ...
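To fix ideas, here is a minimal NumPy sketch of the one-level objective above and its chain-rule gradient $F'(x) = [g'(x)]^T f'(g(x))$. Every concrete choice (affine $g_i$, quadratic $f$, an $\ell_1$ regularizer $r$) is a hypothetical instance for illustration, not the problem from the talk:

```python
import numpy as np

# Hypothetical instance: g_i(x) = A_i x + b_i (so g_i' = A_i), f(u) = 0.5*||u||^2,
# r(x) = lam * ||x||_1. Dimensions: x in R^d, g_i(x) in R^p.
rng = np.random.default_rng(0)
n, d, p, lam = 10, 5, 3, 0.1
A = rng.standard_normal((n, p, d))
b = rng.standard_normal((n, p))

def g(x):                       # g(x) = (1/n) * sum_i g_i(x)
    return np.mean(A @ x + b, axis=0)

def g_jac(x):                   # g'(x) = (1/n) * sum_i A_i, a p-by-d Jacobian
    return np.mean(A, axis=0)

def f(u):                       # outer smooth function
    return 0.5 * u @ u

def f_grad(u):                  # f'(u)
    return u

def objective(x):               # f(g(x)) + r(x)
    return f(g(x)) + lam * np.linalg.norm(x, 1)

def F_grad(x):                  # chain rule: F'(x) = g'(x)^T f'(g(x))
    return g_jac(x).T @ f_grad(g(x))

x = rng.standard_normal(d)
print(objective(x), F_grad(x))
```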

SLIDE 3

Examples

  • policy evaluation with linear function approximation

$$\min_{x \in \mathbb{R}^d} \quad \big\|\mathbb{E}[A]\,x - \mathbb{E}[b]\big\|^2$$

where $A$, $b$ are random, generated by an MDP under a fixed policy

  • risk-averse optimization

$$\max_{x \in \mathbb{R}^d} \quad \underbrace{\frac{1}{n}\sum_{j=1}^n h_j(x)}_{\text{average reward}} \;-\; \lambda\,\underbrace{\frac{1}{n}\sum_{j=1}^n \Big(h_j(x) - \frac{1}{n}\sum_{i=1}^n h_i(x)\Big)^2}_{\text{variance of rewards (risk)}}$$

– often treated as two-level composite finite-sum optimization
– simple transformation using $\mathrm{Var}(a) = \mathbb{E}[a^2] - (\mathbb{E}[a])^2$ (a numerical check follows below):

$$\max_{x \in \mathbb{R}^d} \quad \frac{1}{n}\sum_{j=1}^n h_j(x) \;-\; \lambda\left(\frac{1}{n}\sum_{j=1}^n h_j^2(x) - \Big(\frac{1}{n}\sum_{i=1}^n h_i(x)\Big)^2\right)$$

– actually a one-level composite finite-sum problem
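A quick numerical check that the two risk-averse objectives coincide after the variance identity. The rewards $h_j(x) = c_j^T x$ are a hypothetical choice made only for this check:

```python
import numpy as np

# Verify: mean - lam*Var equals the reformulation via Var(a) = E[a^2] - (E[a])^2.
rng = np.random.default_rng(1)
n, d, lam = 100, 4, 0.5
C = rng.standard_normal((n, d))
x = rng.standard_normal(d)

h = C @ x                                                   # h_j(x), j = 1..n
mean_h = h.mean()

two_level = mean_h - lam * np.mean((h - mean_h) ** 2)       # mean - lam*variance
one_level = mean_h - lam * (np.mean(h ** 2) - mean_h ** 2)  # after Var identity

assert np.isclose(two_level, one_level)
print(two_level, one_level)
```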

SLIDE 4

Technical challenge and related work

  • challenge: biased gradient estimator (illustrated in the sketch below)

– denote $F(x) := f(g(x))$ where $g(x) := \frac{1}{n}\sum_{i=1}^n g_i(x)$, so that
$$F'(x) = [g'(x)]^T f'(g(x))$$
– subsampled estimators, with $S \subset \{1, \dots, n\}$:
$$y = \frac{1}{|S|}\sum_{i \in S} g_i(x), \qquad z = \frac{1}{|S|}\sum_{i \in S} g_i'(x)$$
– $\mathbb{E}[y] = g(x)$ and $\mathbb{E}[z] = g'(x)$, but $\mathbb{E}\big[z^T f'(y)\big] \neq F'(x)$
  • related work

– more general composite stochastic optimization
(Wang, Fang & Liu 2017; Wang, Liu & Fang 2017; ...)
– two-level composite finite-sum: extending SVRG
(Lian, Wang & Liu 2017; Huo, Gu, Liu & Huang 2018; Lin, Fan, Wang & Jordan 2018; ...)
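The bias is easy to see numerically. Below is a Monte Carlo sketch on a hypothetical scalar instance (my choice, not from the talk): $g_i(x) = a_i x$ and $f(u) = u^2$, so $F'(x) = 2\bar{a}^2 x$; the plug-in estimator $z\,f'(y)$ overshoots by roughly $2\,\mathrm{Var}(z)$:

```python
import numpy as np

# Even though E[y] = g(x) and E[z] = g'(x), z*f'(y) is biased for nonlinear f.
rng = np.random.default_rng(2)
n, s, trials = 50, 2, 100_000
a = rng.standard_normal(n)
x = 1.0

a_bar = a.mean()
true_grad = 2 * a_bar**2 * x                # F'(x) = g'(x) * f'(g(x))

S = rng.integers(0, n, size=(trials, s))    # minibatches, sampled with replacement
y = (a[S] * x).mean(axis=1)                 # subsampled estimate of g(x)
z = a[S].mean(axis=1)                       # subsampled estimate of g'(x)
est = (z * 2 * y).mean()                    # Monte Carlo average of z * f'(y)

# est exceeds true_grad by about 2*Var(z): the estimator is biased.
print(f"F'(x) = {true_grad:.4f},  E[z f'(y)] ~ {est:.4f}")
```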

SLIDE 5

Main results

  • composite-SAGA: a single loop, versus the double loop of composite-SVRG
  • sample complexity for $\mathbb{E}\|G(x_t)\|^2 \le \epsilon$ (with $G = F'$ if $r \equiv 0$):

– nonconvex smooth $f$ and $g_i$: $O\big(n + n^{2/3}\epsilon^{-1}\big)$
– in addition gradient dominant or strongly convex: $O\big((n + \kappa n^{2/3})\log \epsilon^{-1}\big)$

same as SVRG/SAGA for nonconvex finite-sum problems
(Allen-Zhu & Hazan 2016; Reddi et al. 2016; Lei et al. 2017)

  • extensions to the two-level problem

– nonconvex smooth $f_j$ and $g_i$: $O\big(m + n + (m+n)^{2/3}\epsilon^{-1}\big)$
(same as composite-SVRG, Huo et al. 2018)
– in addition gradient dominant or optimally strongly convex: $O\big((m + n + \kappa(m+n)^{2/3})\log \epsilon^{-1}\big)$
(better than composite-SVRG, Lian et al. 2017)

SLIDE 6

Composite SAGA algorithm (C-SAGA)

  • input: $x_0 \in \mathbb{R}^d$, $\alpha_i^0$ for $i = 1, \dots, n$, and step size $\eta > 0$
  • initialize
$$Y_0 = \frac{1}{n}\sum_{i=1}^n g_i(\alpha_i^0), \qquad Z_0 = \frac{1}{n}\sum_{i=1}^n g_i'(\alpha_i^0)$$
  • for $t = 0, \dots, T-1$ (a runnable sketch follows this slide)

– sample with replacement $S_t \subset \{1, \dots, n\}$ with $|S_t| = s$
– compute
$$y_t = Y_t + \frac{1}{s}\sum_{j \in S_t}\big(g_j(x_t) - g_j(\alpha_j^t)\big), \qquad z_t = Z_t + \frac{1}{s}\sum_{j \in S_t}\big(g_j'(x_t) - g_j'(\alpha_j^t)\big)$$
– $x_{t+1} = \mathrm{prox}_r^{\eta}\big(x_t - \eta\, z_t^T f'(y_t)\big)$
– update $\alpha_j^{t+1} = x_t$ if $j \in S_t$ and $\alpha_j^{t+1} = \alpha_j^t$ otherwise
– update
$$Y_{t+1} = Y_t + \frac{1}{n}\sum_{j \in S_t}\big(g_j(x_t) - g_j(\alpha_j^t)\big), \qquad Z_{t+1} = Z_t + \frac{1}{n}\sum_{j \in S_t}\big(g_j'(x_t) - g_j'(\alpha_j^t)\big)$$

  • output: randomly choose $t^* \in \{1, \dots, T\}$ and output $x_{t^*}$
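A minimal runnable sketch of the C-SAGA loop above. The problem data is entirely hypothetical (affine $g_i$, quadratic $f$, $r = \lambda\|\cdot\|_1$ with a soft-thresholding prox); only the update rules mirror the slide. One small practical choice: the table and running-average updates deduplicate repeated sampled indices so that $Y_t$, $Z_t$ stay exactly equal to the averages over the $\alpha$ table:

```python
import numpy as np

# Hypothetical instance: g_i(x) = A_i x + b_i with Jacobian g_i'(x) = A_i,
# f(u) = 0.5*||u||^2 so f'(u) = u, and r(x) = lam*||x||_1.
rng = np.random.default_rng(3)
n, d, p = 50, 10, 4
A = rng.standard_normal((n, p, d))
b = rng.standard_normal((n, p))
lam, eta, T = 0.01, 0.05, 500
s = int(round(n ** (2 / 3)))                # batch size from the theory slides

g = lambda i, x: A[i] @ x + b[i]            # g_i(x)
gJ = lambda i, x: A[i]                      # g_i'(x), a p-by-d Jacobian
f_grad = lambda u: u                        # f'(u)

def prox_r(v, step):                        # prox of step*lam*||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - step * lam, 0.0)

x = np.zeros(d)
alpha = np.zeros((n, d))                    # table of reference points alpha_i
Y = np.mean([g(i, alpha[i]) for i in range(n)], axis=0)    # avg of g_i(alpha_i)
Z = np.mean([gJ(i, alpha[i]) for i in range(n)], axis=0)   # avg of g_i'(alpha_i)

iterates = []
for t in range(T):
    S = rng.integers(0, n, size=s)          # sample with replacement, |S_t| = s
    y = Y + sum(g(j, x) - g(j, alpha[j]) for j in S) / s    # estimate of g(x_t)
    z = Z + sum(gJ(j, x) - gJ(j, alpha[j]) for j in S) / s  # estimate of g'(x_t)
    x_new = prox_r(x - eta * (z.T @ f_grad(y)), eta)
    U = np.unique(S)                        # distinct sampled indices
    Y = Y + sum(g(j, x) - g(j, alpha[j]) for j in U) / n
    Z = Z + sum(gJ(j, x) - gJ(j, alpha[j]) for j in U) / n
    alpha[U] = x                            # alpha_j^{t+1} = x_t for j in S_t
    x = x_new
    iterates.append(x)

x_out = iterates[rng.integers(0, T)]        # output a uniformly random iterate
print(np.round(x_out, 3))
```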

SLIDE 7

Convergence analysis

$$\min_{x \in \mathbb{R}^d} \quad \underbrace{f\!\left(\frac{1}{n}\sum_{i=1}^n g_i(x)\right)}_{F(x)} + \; r(x)$$

  • assumptions

– $f$ is $\ell_f$-Lipschitz and $f'$ is $L_f$-Lipschitz
– $g_i$ is $\ell_g$-Lipschitz and $g_i'$ is $L_g$-Lipschitz, $i = 1, \dots, n$
– $r$ convex but can be nonsmooth

implication: $F'$ is $L_F$-Lipschitz with $L_F = \ell_g^2 L_f + \ell_f L_g$

  • sample complexity for $\mathbb{E}\|G(x_t)\|^2 \le \epsilon$, where (see the sketch below)
$$G(x) = \frac{1}{\eta}\Big(x - \mathrm{prox}_r^{\eta}\big(x - \eta F'(x)\big)\Big) = F'(x) \ \text{if } r \equiv 0$$

– if $s = 1$ and $\eta = O(1/(nL_F))$, then complexity $O(n/\epsilon)$
– if $s = n^{2/3}$ and $\eta = O(1/L_F)$, then complexity $O(n + n^{2/3}/\epsilon)$
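A small sketch of the proximal gradient mapping $G$ defined above, again with the hypothetical choice $r = \lambda\|\cdot\|_1$ so the prox is soft-thresholding; the assert confirms that $G$ reduces to $F'$ when $r \equiv 0$:

```python
import numpy as np

# Proximal gradient mapping G(x) for r(x) = lam*||x||_1.
def G(x, F_grad, eta, lam):
    v = x - eta * F_grad(x)                         # forward gradient step
    prox = np.sign(v) * np.maximum(np.abs(v) - eta * lam, 0.0)
    return (x - prox) / eta

F_grad = lambda x: 2 * x                            # e.g. F(x) = ||x||^2
x = np.array([0.3, -1.2, 0.05])

# Sanity check: with r identically zero (lam = 0), G(x) equals F'(x).
assert np.allclose(G(x, F_grad, eta=0.1, lam=0.0), F_grad(x))
print(G(x, F_grad, eta=0.1, lam=0.01))
```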
SLIDE 8

Linear convergence results

  • gradient-dominant functions (a small numerical check follows below)

– assumption: $r \equiv 0$ and $F(x) := f\big(\frac{1}{n}\sum_{i=1}^n g_i(x)\big)$ satisfies
$$F(x) - \inf_y F(y) \le \frac{\nu}{2}\,\|F'(x)\|^2, \quad \forall\, x \in \mathbb{R}^d$$
– if $s = n^{2/3}$ and $\eta = O(1/L_F)$, complexity $O\big((n + \nu n^{2/3})\log\epsilon^{-1}\big)$

  • optimally strongly convex functions

– assumption: $\Phi(x) := F(x) + r(x)$ satisfies
$$\Phi(x) - \Phi(x^\star) \ge \frac{\mu}{2}\,\|x - x^\star\|^2, \quad \forall\, x \in \mathbb{R}^d$$
– if $s = n^{2/3}$ and $\eta = O(1/L_F)$, complexity $O\big((n + \mu^{-1} n^{2/3})\log\epsilon^{-1}\big)$

  • extension to the two-level case: $O\big((m + n + \kappa(m+n)^{2/3})\log\epsilon^{-1}\big)$
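To make the gradient-dominance condition concrete, here is a tiny check on a toy function of my choosing (not from the talk): $F(x) = \|x\|^2$ satisfies the inequality with $\nu = 1/2$, since $\|F'(x)\|^2 = 4\|x\|^2$:

```python
import numpy as np

# Check F(x) - inf F <= (nu/2)*||F'(x)||^2 for F(x) = ||x||^2, nu = 1/2.
rng = np.random.default_rng(4)
nu = 0.5
for _ in range(1000):
    x = rng.standard_normal(6)
    F_gap = x @ x                       # F(x) - inf_y F(y), and inf F = 0
    grad_sq = np.dot(2 * x, 2 * x)      # ||F'(x)||^2
    assert F_gap <= nu / 2 * grad_sq + 1e-12
print("gradient dominance holds with nu =", nu)
```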

SLIDE 9

Experiments

  • risk-averse optimization ($n = 5000$, $d = 500$)

[Two plots versus number of samples: the optimality measure $\|G(x_k)\|$ for $F(x_k)+r(x_k)$, and the objective gap $(F(x_k)+r(x_k)) - (F(x^\star)+r(x^\star))$; methods compared: C-SAGA, ASC-PG, VRSC-PG]

  • policy evaluation for MDP ($S = 100$)

[Two plots versus number of samples: $\|F'(w_k)\|$ and the objective gap $F(w_k) - F(w^\star)$; methods compared: SCGD, ASCGD, ASC-PG, C-SAGA, VRSC-PG]