A Composite Randomized Incremental Gradient Method
Junyu Zhang (University of Minnesota) and Lin Xiao (Microsoft Research)
International Conference on Machine Learning (ICML), Long Beach, California, June 11, 2019
Composite finite-sum optimization

- problem of focus

  $\min_{x\in\mathbb{R}^d}\; f\Big(\frac{1}{n}\sum_{i=1}^n g_i(x)\Big) + r(x)$

  – $f : \mathbb{R}^p \to \mathbb{R}$ smooth and possibly nonconvex
  – $g_i : \mathbb{R}^d \to \mathbb{R}^p$ smooth vector mappings, $i = 1, \dots, n$
  – $r : \mathbb{R}^d \to \mathbb{R}\cup\{\infty\}$ convex but possibly nonsmooth

- extension to the two-level finite-sum problem

  $\min_{x\in\mathbb{R}^d}\; \frac{1}{m}\sum_{j=1}^m f_j\Big(\frac{1}{n}\sum_{i=1}^n g_i(x)\Big) + r(x)$

- applications beyond ERM
  – reinforcement learning (policy evaluation)
  – risk-averse optimization, financial mathematics
  – . . .
Examples

- policy evaluation with linear function approximation

  $\min_{x\in\mathbb{R}^d}\; \|E[A]\,x - E[b]\|^2$

  where $A$, $b$ are random, generated by an MDP under a fixed policy

- risk-averse optimization

  $\max_{x\in\mathbb{R}^d}\; \underbrace{\frac{1}{n}\sum_{j=1}^n h_j(x)}_{\text{average reward}} \;-\; \lambda\,\underbrace{\frac{1}{n}\sum_{j=1}^n\Big(h_j(x) - \frac{1}{n}\sum_{i=1}^n h_i(x)\Big)^2}_{\text{variance of rewards (risk)}}$

  – often treated as a two-level composite finite-sum problem
  – a simple transformation using $\mathrm{Var}(a) = E[a^2] - (E[a])^2$ gives

    $\max_{x\in\mathbb{R}^d}\; \frac{1}{n}\sum_{j=1}^n h_j(x) \;-\; \lambda\Big(\frac{1}{n}\sum_{j=1}^n h_j^2(x) - \Big(\frac{1}{n}\sum_{i=1}^n h_i(x)\Big)^2\Big)$

    which is actually a one-level composite finite-sum problem
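The variance identity behind this flattening is easy to check numerically; a minimal sketch in plain Python, where the sample rewards are hypothetical values standing in for the $h_j(x)$:

```python
# Verify Var(a) = E[a^2] - (E[a])^2 on a small sample of rewards,
# the identity that turns the two-level risk-averse objective
# into a one-level composite finite-sum.
def mean(v):
    return sum(v) / len(v)

def variance_direct(v):
    m = mean(v)
    return mean([(x - m) ** 2 for x in v])            # E[(a - E[a])^2]

def variance_moments(v):
    return mean([x * x for x in v]) - mean(v) ** 2    # E[a^2] - (E[a])^2

rewards = [1.0, 2.5, -0.5, 4.0, 0.0]  # hypothetical per-scenario rewards h_j(x)
assert abs(variance_direct(rewards) - variance_moments(rewards)) < 1e-12
```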
Technical challenge and related work

- challenge: biased gradient estimator
  – denote $F(x) := f(g(x))$ where $g(x) := \frac{1}{n}\sum_{i=1}^n g_i(x)$, so that $F'(x) = [g'(x)]^T f'(g(x))$
  – subsampled estimators, with $S \subset \{1, \dots, n\}$:
    $y = \frac{1}{|S|}\sum_{i\in S} g_i(x), \qquad z = \frac{1}{|S|}\sum_{i\in S} g_i'(x)$
  – $E[y] = g(x)$ and $E[z] = g'(x)$, but $E\big[z^T f'(y)\big] \neq F'(x)$
- related work
  – more general composite stochastic optimization
    (Wang, Fang & Liu 2017; Wang, Liu & Fang 2017; . . . )
  – two-level composite finite-sum: extending SVRG
    (Lian, Wang & Liu 2017; Huo, Gu, Liu & Huang 2018; Lin, Fan, Wang & Jordan 2018; . . . )
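The bias can be seen with an exact toy computation: take $n = 2$, $s = 1$, $f(u) = u^2$, $g_1(x) = x$, $g_2(x) = x^2$ (illustrative choices, not from the slides), and enumerate both equiprobable size-1 subsamples. Even though $y$ and $z$ are individually unbiased, $E[z^T f'(y)] \neq F'(x)$:

```python
# Exact bias check for the subsampled estimator z^T f'(y) with
# f(u) = u^2, g1(x) = x, g2(x) = x^2 (toy functions for illustration).
def f_prime(u):
    return 2.0 * u

def true_grad(x):
    g = (x + x * x) / 2.0             # g(x) = (g1(x) + g2(x)) / 2
    g_prime = (1.0 + 2.0 * x) / 2.0   # g'(x)
    return g_prime * f_prime(g)       # F'(x) = g'(x) f'(g(x))

def expected_estimate(x):
    # Enumerate the two equiprobable size-1 subsamples S = {1}, S = {2}.
    est1 = 1.0 * f_prime(x)           # y = g1(x), z = g1'(x) = 1
    est2 = 2.0 * x * f_prime(x * x)   # y = g2(x), z = g2'(x) = 2x
    return 0.5 * (est1 + est2)

# at x = 2: E[z^T f'(y)] = 18 while F'(x) = 15, so the estimator is biased
assert abs(expected_estimate(2.0) - true_grad(2.0)) > 1.0
```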
Main results

- composite-SAGA: a single loop, vs. the double loops of composite-SVRG
- sample complexity for $E[\|G(x_t)\|^2] \le \epsilon$ (with $G = F'$ if $r \equiv 0$)
  – nonconvex smooth $f$ and $g_i$: $O(n + n^{2/3}\epsilon^{-1})$
  – additionally gradient dominant or strongly convex: $O((n + \kappa n^{2/3})\log \epsilon^{-1})$
  same as SVRG/SAGA for nonconvex finite-sum problems
  (Allen-Zhu & Hazan 2016; Reddi et al. 2016; Lei et al. 2017)
- extensions to the two-level problem
  – nonconvex smooth $f_j$ and $g_i$: $O(m + n + (m + n)^{2/3}\epsilon^{-1})$
    (same as composite-SVRG (Huo et al. 2018))
  – additionally gradient dominant or optimally strongly convex: $O((m + n + \kappa(m + n)^{2/3})\log \epsilon^{-1})$
    (better than composite-SVRG (Lian et al. 2017))
Composite SAGA algorithm (C-SAGA)

- input: $x_0 \in \mathbb{R}^d$, reference points $\alpha_i^0$ for $i = 1, \dots, n$, and step size $\eta > 0$
- initialize $Y_0 = \frac{1}{n}\sum_{i=1}^n g_i(\alpha_i^0)$ and $Z_0 = \frac{1}{n}\sum_{i=1}^n g_i'(\alpha_i^0)$
- for $t = 0, \dots, T-1$
  – sample with replacement $S_t \subset \{1, \dots, n\}$ with $|S_t| = s$
  – compute
    $y_t = Y_t + \frac{1}{s}\sum_{j\in S_t}\big(g_j(x_t) - g_j(\alpha_j^t)\big)$
    $z_t = Z_t + \frac{1}{s}\sum_{j\in S_t}\big(g_j'(x_t) - g_j'(\alpha_j^t)\big)$
  – $x_{t+1} = \mathrm{prox}_r^\eta\big(x_t - \eta\, z_t^T f'(y_t)\big)$
  – update $\alpha_j^{t+1} = x_t$ if $j \in S_t$, and $\alpha_j^{t+1} = \alpha_j^t$ otherwise
  – update
    $Y_{t+1} = Y_t + \frac{1}{n}\sum_{j\in S_t}\big(g_j(x_t) - g_j(\alpha_j^t)\big)$
    $Z_{t+1} = Z_t + \frac{1}{n}\sum_{j\in S_t}\big(g_j'(x_t) - g_j'(\alpha_j^t)\big)$
- output: randomly choose $t^* \in \{1, \dots, T\}$ and output $x_{t^*}$
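The steps above can be sketched in NumPy for the special case $r \equiv 0$, where the proximal step reduces to a plain gradient step. All names and the toy least-squares problem at the end are illustrative, not from the paper; duplicate indices in $S_t$ update the table once so that $Y$, $Z$ remain exact averages over the $\alpha$ table:

```python
import numpy as np

# Minimal C-SAGA sketch for min_x f((1/n) * sum_i g_i(x)) with r == 0.
# gs[i] maps R^d -> R^p, g_primes[i] returns the p x d Jacobian,
# f_prime returns the gradient of f.
def c_saga(gs, g_primes, f_prime, x0, eta, s, T, seed=0):
    rng = np.random.default_rng(seed)
    n = len(gs)
    x = x0.copy()
    alpha = [x0.copy() for _ in range(n)]                  # reference points
    Y = sum(g(a) for g, a in zip(gs, alpha)) / n           # avg of g_i(alpha_i)
    Z = sum(gp(a) for gp, a in zip(g_primes, alpha)) / n   # avg Jacobian
    for _ in range(T):
        S = rng.integers(0, n, size=s)                     # with replacement
        g_x = {j: gs[j](x) for j in set(S)}
        gp_x = {j: g_primes[j](x) for j in set(S)}
        # corrected estimates of g(x_t) and g'(x_t)
        y = Y + sum(g_x[j] - gs[j](alpha[j]) for j in S) / s
        z = Z + sum(gp_x[j] - g_primes[j](alpha[j]) for j in S) / s
        x_next = x - eta * (z.T @ f_prime(y))              # prox step, r == 0
        # refresh running averages and table at x_t (duplicates counted once)
        for j in set(S):
            Y = Y + (g_x[j] - gs[j](alpha[j])) / n
            Z = Z + (gp_x[j] - g_primes[j](alpha[j])) / n
            alpha[j] = x.copy()
        x = x_next
    return x

# toy check: g_i(x) = x - b_i and f(u) = 0.5 * ||u||^2, so
# F(x) = 0.5 * ||x - mean(b)||^2 is minimized at mean(b)
bs = [np.array([1.0, 0.0]), np.array([0.0, 1.0]),
      np.array([2.0, 2.0]), np.array([-1.0, 1.0])]
gs = [lambda x, b=b: x - b for b in bs]
g_primes = [(lambda x: np.eye(2)) for _ in range(len(bs))]
f_prime = lambda u: u
x_hat = c_saga(gs, g_primes, f_prime, np.zeros(2), eta=0.5, s=2, T=600)
b_bar = sum(bs) / len(bs)
```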
Convergence analysis

$\min_{x\in\mathbb{R}^d}\; \underbrace{f\Big(\frac{1}{n}\sum_{i=1}^n g_i(x)\Big)}_{F(x)} + r(x)$

- assumptions
  – $f$ is $\ell_f$-Lipschitz and $f'$ is $L_f$-Lipschitz
  – $g_i$ is $\ell_g$-Lipschitz and $g_i'$ is $L_g$-Lipschitz, $i = 1, \dots, n$
  – $r$ convex but can be nonsmooth
  – implication: $F'$ is $L_F$-Lipschitz with $L_F = \ell_g^2 L_f + \ell_f L_g$
- sample complexity for $E[\|G(x_t)\|^2] \le \epsilon$, where
  $G(x) = \frac{1}{\eta}\big(x - \mathrm{prox}_r^\eta(x - \eta F'(x))\big)$, which equals $F'(x)$ if $r \equiv 0$
  – if $s = 1$ and $\eta = O(1/(n L_F))$, then complexity $O(n/\epsilon)$
  – if $s = n^{2/3}$ and $\eta = O(1/L_F)$, then complexity $O(n + n^{2/3}/\epsilon)$
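The gradient mapping $G$ in which the bounds are stated is cheap to compute once the prox of $r$ is known. A small sketch, taking $r(x) = \lambda\|x\|_1$ as an illustrative choice (its prox is soft-thresholding; this choice is not from the slides), with $\lambda = 0$ recovering $r \equiv 0$:

```python
import numpy as np

# Gradient mapping G(x) = (x - prox_{eta r}(x - eta F'(x))) / eta,
# illustrated with r(x) = lam * ||x||_1, whose prox is soft-thresholding.
def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def grad_mapping(x, grad_F, eta, lam=0.0):
    prox = soft_threshold(x - eta * grad_F(x), eta * lam)  # lam == 0: identity
    return (x - prox) / eta

grad_F = lambda x: 2.0 * x        # toy smooth part F(x) = ||x||^2
x = np.array([1.0, -0.5])
G0 = grad_mapping(x, grad_F, eta=0.1, lam=0.0)
# with r == 0 the mapping reduces to the plain gradient F'(x)
assert np.allclose(G0, grad_F(x))
```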
Linear convergence results

- gradient-dominant functions
  – assumption: $r \equiv 0$ and $F(x) := f\big(\frac{1}{n}\sum_{i=1}^n g_i(x)\big)$ satisfies
    $F(x) - \inf_y F(y) \le \frac{\nu}{2}\|F'(x)\|^2, \quad \forall x \in \mathbb{R}^d$
  – if $s = n^{2/3}$ and $\eta = O(1/L_F)$, complexity $O((n + \nu n^{2/3})\log \epsilon^{-1})$
- optimally strongly convex functions
  – assumption: $\Phi(x) := F(x) + r(x)$ satisfies
    $\Phi(x) - \Phi(x^\star) \ge \frac{\mu}{2}\|x - x^\star\|^2, \quad \forall x \in \mathbb{R}^d$
  – if $s = n^{2/3}$ and $\eta = O(1/L_F)$, complexity $O((n + \mu^{-1} n^{2/3})\log \epsilon^{-1})$
- extension to the two-level case: $O((m + n + \kappa(m + n)^{2/3})\log \epsilon^{-1})$
Experiments

- risk-averse optimization (n = 5000, d = 500)
  [figure: two panels vs. # of samples — gradient-mapping norm for $F + r$, and the optimality gap $(F(x_k)+r(x_k)) - (F(x^*)+r(x^*))$; methods compared: C-SAGA, ASC-PG, VRSC-PG]
- policy evaluation for MDP (|S| = 100)
  [figure: two panels vs. # of samples — gradient norm $\|F'(w_k)\|$ and optimality gap $F(w_k) - F(w^*)$; methods compared: SCGD, ASCGD, ASC-PG, C-SAGA, VRSC-PG]