SLIDE 1

Efficient Algorithms for Smooth Minimax Optimization

NeurIPS 2019 Kiran Koshy Thekumparampil†, Prateek Jain‡, Praneeth Netrapalli‡, Sewoong Oh±

†University of Illinois at Urbana-Champaign, ‡Microsoft Research, India, ±University of Washington, Seattle

Oct 27, 2019

SLIDE 2

Outline

Minimax optimization problem
Efficient algorithm for the nonconvex–concave minimax problem
Optimal algorithm for the strongly-convex–concave minimax problem

SLIDE 3

Minimax problem

Consider the general minimax problem

min_{x∈X} max_{y∈Y} g(x, y)

Two-player game: y tries to maximize and x tries to minimize. The order of min and max, i.e. who plays first (x above), is important:

max_{y∈Y} min_{x∈X} g(x, y) ≤ min_{x∈X} max_{y∈Y} g(x, y)
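As a quick illustration (not from the original slides; the game g(x, y) = |x − y| and the grid are illustrative choices), the sketch below evaluates both orderings on a grid and exhibits a strict gap between max-min and min-max:

```python
import numpy as np

# Toy game g(x, y) = |x - y| on X = Y = [-1, 1], evaluated on a grid.
xs = np.linspace(-1.0, 1.0, 201)
ys = np.linspace(-1.0, 1.0, 201)
G = np.abs(xs[:, None] - ys[None, :])   # G[i, j] = g(xs[i], ys[j])

max_min = G.min(axis=0).max()   # max over y of (min over x of g)  -> 0.0
min_max = G.max(axis=1).min()   # min over x of (max over y of g)  -> 1.0
print(max_min, min_max)         # 0.0 <= 1.0: committing first (min_x max_y) costs x more
```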

SLIDE 4

Examples of Minimax problem

1. GAN: min_G max_D V(G, D):

   min_G max_D  E_{x∼P_X}[log D(x)] + E_{z∼Q_Z}[log(1 − D(G(z)))]
   (the inner maximum over D corresponds to JS(P_X ‖ Q_X))

2. Constrained optimization: min_x f(x), s.t. f_i(x) ≤ 0, ∀ i ∈ [m]:

   min_x max_{y≥0}  L(x, y) = f(x) + Σ_{i=1}^{m} y_i f_i(x)

3. Robust estimation/optimization:

   min_x Σ_i max_{ẑ_i : ∆(ẑ_i, z_i) ≤ ε}  f(x, ẑ_i), ∀ i ∈ [m].
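A minimal sketch of example 2 (my addition; the toy objective f(x) = x² and single constraint x ≤ 1 are placeholders): maximizing the Lagrangian over y ≥ 0 recovers f(x) on feasible points and blows up on infeasible ones.

```python
import numpy as np

# Example 2 as a max over y >= 0: L(x, y) = f(x) + sum_i y_i * f_i(x).
# If some constraint f_i(x) > 0, taking y_i large drives L to +infinity,
# so max_{y>=0} L(x, y) equals f(x) on the feasible set and +inf otherwise.
f  = lambda x: x**2                       # toy objective
fi = [lambda x: x - 1.0]                  # single toy constraint: x <= 1

def inner_max(x, y_cap=1e6):
    # maximize L over y in [0, y_cap]^m; the cap stands in for +infinity
    return f(x) + sum(max(0.0, y_cap * c(x)) for c in fi)  # optimal y_i is 0 or y_cap

print(inner_max(0.5))   # feasible: f(0.5) = 0.25
print(inner_max(2.0))   # infeasible: huge value (diverges as y_cap -> inf)
```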

SLIDE 5

Nonconvex minimax

In general g(x, y) is non-convex in both x and y, e.g. neural-network-based GANs. Very few works address nonconvex minimax.

We focus on the smooth nonconvex–concave minimax problem, i.e. g(x, ·) is concave and g is L-smooth:

max_{a∈{x,y}} ‖∇_a g(x, y) − ∇_a g(x′, y′)‖ ≤ L (‖x − x′‖ + ‖y − y′‖),

e.g. smooth constrained optimization. In general, max_{y∈Y} min_{x∈X} g(x, y) < min_{x∈X} max_{y∈Y} g(x, y), so we focus on the (non-smooth, nonconvex) primal problem: f(x) = max_y g(x, y).

SLIDE 6

f(x) = max_{y∈Y} g(x, y) is non-smooth and weakly convex

f is non-smooth due to maximization over y

ρ-weakly convex function

We say that f is ρ-weakly convex if f + (ρ/2)‖·‖² is convex, i.e.,

f(x) + ⟨u_x, x′ − x⟩ − (ρ/2)‖x′ − x‖² ≤ f(x′),

for all Fréchet subgradients u_x ∈ ∂f(x) and all x, x′ ∈ X.

[Figure: plots of f(x) = max{|x|, (1 − x²)/2} and of f(x) + x²/2 on [−1, 1].]

f is 1-weakly convex, since f + ‖·‖²/2 is convex.
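A quick numerical sanity check of this example (not from the slides; the midpoint-convexity test on a grid is an illustrative choice):

```python
import numpy as np

f = lambda x: np.maximum(np.abs(x), (1 - x**2) / 2)   # the slide's example, 1-weakly convex
g = lambda x: f(x) + x**2 / 2                          # should be convex

# Midpoint-convexity check on a grid: g((a+b)/2) <= (g(a) + g(b)) / 2 for all pairs.
xs = np.linspace(-1, 1, 201)
A, B = np.meshgrid(xs, xs)
viol_f = np.max(f((A + B) / 2) - (f(A) + f(B)) / 2)    # > 0: f itself is not convex
viol_g = np.max(g((A + B) / 2) - (g(A) + g(B)) / 2)    # <= ~0: f + x^2/2 is convex
print(viol_f, viol_g)
```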

SLIDE 7

f(x) = max_{y∈Y} g(x, y) is non-smooth and weakly convex

f is non-smooth due to maximization over y

ρ-weakly convex function

We say that f is ρ-weakly convex if f + (ρ/2)‖·‖² is convex, i.e.,

f(x) + ⟨u_x, x′ − x⟩ − (ρ/2)‖x′ − x‖² ≤ f(x′),

for all Fréchet subgradients u_x ∈ ∂f(x) and all x, x′ ∈ X.

Any L-smooth function is L-weakly convex:

f(x) + ⟨∇f(x), x′ − x⟩ − (L/2)‖x′ − x‖² ≤ f(x′).

−‖x‖ is not weakly convex (due to the upward-pointing cusp).

SLIDE 8

f(x) = max_{y∈Y} g(x, y) is non-smooth and weakly convex

f is non-smooth due to maximization over y

ρ-weakly convex function

We say that f is ρ-weakly convex if f + (ρ/2)‖·‖² is convex, i.e.,

f(x) + ⟨u_x, x′ − x⟩ − (ρ/2)‖x′ − x‖² ≤ f(x′),

for all Fréchet subgradients u_x ∈ ∂f(x) and all x, x′ ∈ X.

f(x) = max_{y∈Y} g(x, y) is L-weakly convex if g is L-smooth:

g(x, y) + ⟨∇_x g(x, y), x′ − x⟩ − (L/2)‖x′ − x‖² ≤ g(x′, y)
⟹ f(x) + ⟨u_x, x′ − x⟩ − (L/2)‖x′ − x‖² ≤ f(x′).

Cannot define an approximate stationary point directly using subgradients.

SLIDE 9

First order stationary point of weakly-convex function

Moreau envelope f_λ of an L-weakly convex function (L < 1/λ):

f_λ(x) = min_{x′} f(x′) + (1/(2λ))‖x − x′‖².

f_λ is a smooth lower bound of f: ∇f_λ(x) = 0 ⟹ 0 ∈ ∂f(x)

[Figure: f(x) = max{|x|, (1 − x²)/2} and its Moreau envelope f_{0.5}(x) on [−1, 1].]
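A minimal numerical sketch (my addition; the grids are illustrative) that computes the Moreau envelope f_{0.5} of the example above and checks that it lower-bounds f:

```python
import numpy as np

f = lambda x: np.maximum(np.abs(x), (1 - x**2) / 2)   # 1-weakly convex example
lam = 0.5                                              # lambda < 1/L = 1

# Moreau envelope on a grid: f_lam(x) = min_{x'} f(x') + ||x - x'||^2 / (2*lam)
xs = np.linspace(-1.5, 1.5, 601)                       # inner minimization grid
x  = np.linspace(-1.0, 1.0, 201)                       # evaluation points
f_lam = np.min(f(xs)[None, :] + (x[:, None] - xs[None, :])**2 / (2 * lam), axis=1)

print(np.all(f_lam <= f(x) + 1e-12))                   # True: the envelope lower-bounds f
```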

SLIDE 10

First order stationary point of weakly-convex function

Moreau envelope f_λ of an L-weakly convex function (L < 1/λ):

f_λ(x) = min_{x′} f(x′) + (1/(2λ))‖x − x′‖².

f_λ is a smooth lower bound of f: ∇f_λ(x) = 0 ⟹ 0 ∈ ∂f(x)

ε-first order stationary point (ε-FOSP)

We say that x is an ε-first-order stationary point of an L-weakly convex f if ‖∇f_{1/(2L)}(x)‖ ≤ ε. This implies that there exists x̂ such that ‖x̂ − x‖ ≤ ε/(2L) and min_{u∈∂f(x̂)} ‖u‖ ≤ ε.

Algorithm complexity is the number of first-order oracle calls needed to obtain an ε-FOSP. The convergence rate is ε_k if after k oracle calls we get an ε_k-FOSP.

SLIDE 11

Smooth nonconvex–concave minimax results

Setting                                              Previous state-of-the-art   Our result
max_y g(x, y)                                        O(ε⁻⁵) [1]                  O(ε⁻³)
max_i f_i(x) = max_{y∈∆_m} Σ_{i=1}^{m} y_i f_i(x)    O(ε⁻⁴) [2]                  O(ε⁻³)

∆_m is the simplex of dimension m.

[1] C. Jin, P. Netrapalli, and M. I. Jordan. “Minmax optimization: Stable limit points of gradient descent ascent are locally optimal”. arXiv preprint arXiv:1902.00618 (2019).
[2] D. Davis and D. Drusvyatskiy. “Stochastic subgradient method converges at the rate O(k−1/4) on weakly convex functions”. arXiv preprint arXiv:1802.02988 (2018).

SLIDE 12

Baseline: Subgradient method, O(ε⁻⁵) [1, 2]

Apply the (inexact) subgradient method:

u_{x_k} = ∇_x g(x_k, y_k), where y_k ≈ y*(x_k) = arg max_{y∈Y} g(x_k, y)
x_{k+1} = P_X(x_k − η u_{x_k})

Sufficient condition: max_y g(x_k, y) − g(x_k, y_k) ≤ O(ε²) [1]

Setting          Per-step (AGD)   # iterations (subgrad. method)   Total complexity
max_y g(x, y)    O(ε⁻¹)           O(ε⁻⁴)                           O(ε⁻⁵)
max_i f_i(x)     O(1)             O(ε⁻⁴)                           O(ε⁻⁴)

Does not utilize the smooth minimax structure of f(x) = max_y g(x, y).
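A minimal sketch of this baseline (my addition; the toy g(x, y) = y·sin(3x) − y²/2, the step sizes, and the plain gradient-ascent inner loop standing in for AGD are all assumptions):

```python
import numpy as np

# Toy smooth nonconvex-concave g on X = Y = [-1, 1]:
# g(x, y) = y * sin(3x) - y^2 / 2   (concave in y, nonconvex in x).
gx = lambda x, y: 3 * y * np.cos(3 * x)      # d g / d x
gy = lambda x, y: np.sin(3 * x) - y          # d g / d y
proj = lambda v: np.clip(v, -1.0, 1.0)       # projection onto [-1, 1]

x, eta = 0.3, 0.05
for k in range(200):
    # inner loop: y_k approx. arg max_y g(x, y) (a few ascent steps stand in for AGD)
    y = 0.0
    for _ in range(50):
        y = proj(y + 0.5 * gy(x, y))
    # outer (inexact) subgradient step on f(x) = max_y g(x, y)
    x = proj(x - eta * gx(x, y))
print(x)   # x approaches a stationary point of f (here x ~ 0)
```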

SLIDE 13

Proximal Point method (PPM)

(Inexact) proximal point method:

x_{k+1} ≈ arg min_{x∈X} f(x) + L‖x − x_k‖²
⟺ x_{k+1} ≈ x_k − (1/(2L)) u_{x_{k+1}},  u_{x_{k+1}} ∈ ∂f(x_{k+1})

Iteration complexity to get an ε-FOSP is O(1/ε²).

Proof sketch.

L-weak convexity implies

f(x_{k+1}) + ⟨u_{x_{k+1}}, x_k − x_{k+1}⟩ − (L/2)‖x_k − x_{k+1}‖² ≤ f(x_k)

Using the update x_{k+1} = x_k − (1/(2L)) u_{x_{k+1}} we get a descent lemma:

f(x_{k+1}) − f(x_k) ≤ −(3L/2)‖x_{k+1} − x_k‖² = −(3/(8L))‖u_{x_{k+1}}‖²

After O((f(x_0) − min_x f(x))/ε²) steps, min_k ‖u_{x_{k+1}}‖ = O(ε). This generalizes to ‖∇f_{1/(2L)}(x_k)‖ for the inexact update and non-smooth f.
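A minimal sketch of the inexact PPM outer loop (my addition; the same toy g as above, with an alternating gradient inner solver standing in for the strongly-convex–concave subproblem solver, and a rough L are assumptions):

```python
import numpy as np

# Inexact PPM on f(x) = max_y g(x, y) with the toy g(x, y) = y*sin(3x) - y^2/2.
gx = lambda x, y: 3 * y * np.cos(3 * x)
gy = lambda x, y: np.sin(3 * x) - y
proj = lambda v: np.clip(v, -1.0, 1.0)
L = 9.0                                       # rough smoothness constant for this toy g

x = 0.3
for k in range(50):
    # approximately solve the L-strongly-convex-concave subproblem
    #   min_x max_y  g(x, y) + L * (x - x_k)^2
    xk, xi, y = x, x, 0.0
    for _ in range(200):                      # inner solver: alternating gradient steps
        y  = proj(y + 0.2 * gy(xi, y))
        xi = proj(xi - 0.02 * (gx(xi, y) + 2 * L * (xi - xk)))
    x = xi
print(x)   # x approaches a stationary point of f (here x ~ 0)
```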

SLIDE 14

Per-step complexity of PPM

L-weakly convex + 2L-strongly convex = L-strongly convex: f(x) + L‖x − x_k‖²

Each iteration solves an L-strongly-convex–concave problem:

x_{k+1} = arg min_{x∈X} max_{y∈Y} [ g̃_k(x, y) = g(x, y) + L‖x − x_k‖² ]

A primal-dual gap of O(ε²) is sufficient:

max_{y∈Y} g̃_k(x_{k+1}, y) − min_{x∈X} g̃_k(x, y_{k+1}) = O(ε²)

Algorithm for min_x max_y g̃_k(x, y)     Per-step complexity   Total complexity
O(k⁻¹) Cvx–Cve [Mirror-Prox, 3]          O(ε⁻²)                O(ε⁻⁴)
O(k⁻²) Strongly-Cvx–Cve [ours]           O(ε⁻¹)                O(ε⁻³)

[3] A. Nemirovski. “Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex–concave saddle point problems”. In: SIAM Journal on Optimization 15.1 (2004), pp. 229–251.

SLIDE 15

Nonconvex–concave experiment

min_{x∈R²} [ f(x) = max_{1≤i≤m=9} f_i(x) ], where f_i(x) = ‖a_i x − b_i‖²/2 + c_i.

[Figure: ‖∇f_{1/(2L)}(x_k)‖ vs. number of gradient oracle accesses k, comparing the subgradient method, PPM (ours), and adaptive PPM (ours).]
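A minimal setup sketch for this experiment (my addition; the random a_i, b_i, c_i are placeholders, not the instance used in the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 9, 2
a = rng.standard_normal(m)           # placeholder scalars a_i
b = rng.standard_normal((m, d))      # placeholder vectors b_i
c = rng.standard_normal(m)           # placeholder offsets c_i

def fi(x):                           # f_i(x) = ||a_i x - b_i||^2 / 2 + c_i
    return 0.5 * np.sum((a[:, None] * x[None, :] - b) ** 2, axis=1) + c

def f(x):                            # f(x) = max_i f_i(x) = max_{y in simplex} sum_i y_i f_i(x)
    return fi(x).max()

print(f(np.zeros(d)))
```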

SLIDE 16

Smooth Convex–Concave minimax problem

g(·, y) is convex, g(x, ·) is concave, and g is L-smooth:

max_{a∈{x,y}} ‖∇_a g(x, y) − ∇_a g(x′, y′)‖ ≤ L (‖x − x′‖ + ‖y − y′‖).

Primal: f(x) = max_{y∈Y} g(x, y). Dual: h(y) = min_{x∈X} g(x, y).

If X, Y are compact, then there is a saddle point (x*, y*) (Sion's minimax theorem):

min_{x∈X} f(x) = min_{x∈X} max_{y∈Y} g(x, y) = g(x*, y*) = max_{y∈Y} min_{x∈X} g(x, y) = max_{y∈Y} h(y)

ε-primal-dual pair (ε-PD pair)

(x̂, ŷ) is an ε-primal-dual pair if the primal-dual gap is at most ε:

f(x̂) − h(ŷ) = max_{y∈Y} g(x̂, y) − min_{x∈X} g(x, ŷ) ≤ ε

SLIDE 17

Optimal algorithms for smooth Convex–Concave minimax

Setting           Previous state-of-the-art   Our results   Lower bound
Strongly convex   O(k⁻¹) [3]                  O(k⁻²)        Ω(k⁻²) [4]

[3] A. Nemirovski. “Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex–concave saddle point problems”. In: SIAM Journal on Optimization 15.1 (2004), pp. 229–251.
[4] Y. Ouyang and Y. Xu. “Lower complexity bounds of first-order methods for convex–concave bilinear saddle-point problems”. arXiv preprint arXiv:1808.02901 (2018).

SLIDE 18

Mirror-Descent (MD) algorithm [5]

For the Euclidean norm, MD has the following iteration:

x_{k+1} = P_X(x_k − η ∇_x g(x_k, y_k))
y_{k+1} = P_Y(y_k + η ∇_y g(x_k, y_k))

Iterates and function values do not converge. Let z = (x, y).

g(x_k, y) − g(x, y_k) ≤ (1/(2η)) [ (‖z − z_k‖² − ‖z − z_{k+1}‖²) + ‖z_k − z_{k+1}‖² ]
(the first term telescopes; the second is a residual)

g((1/k) Σ_{i=0}^{k−1} x_i, y) − g(x, (1/k) Σ_{i=0}^{k−1} y_i) ≤ (1/(2kη)) [ ‖z − z_0‖² + Σ_{i=0}^{k−1} ‖z_i − z_{i+1}‖² ]

η = O(1/√k) ⟹ g((1/k) Σ_{i=0}^{k−1} x_i, y) − g(x, (1/k) Σ_{i=0}^{k−1} y_i) = O(1/√k)
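A minimal sketch of Euclidean MD, i.e. projected gradient descent-ascent (my addition; the bilinear game g(x, y) = x·y and the step size are illustrative):

```python
import numpy as np

# Euclidean mirror descent (gradient descent-ascent) on the bilinear game g(x, y) = x * y
# over X = Y = [-1, 1]; the iterates orbit the saddle (0, 0), while averages behave better.
proj = lambda v: np.clip(v, -1.0, 1.0)
K = 1000
eta = 1.0 / np.sqrt(K)
x, y = 0.8, -0.6
xs, ys = [], []
for k in range(K):
    xs.append(x); ys.append(y)
    x, y = proj(x - eta * y), proj(y + eta * x)      # simultaneous update at (x_k, y_k)
print(abs(x), abs(y))                                # last iterate: stays far from (0, 0)
print(abs(np.mean(xs)), abs(np.mean(ys)))            # averaged iterates: much closer to (0, 0)
```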

[5] A. Nemirovski and D. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley, New York, 1983.

SLIDE 19

(Conceptual) Mirror-Prox (MP) algorithm [3]

For the Euclidean norm, (conceptual) MP has the following implicit iteration:

x_{k+1} = P_X(x_k − η ∇_x g(x_{k+1}, y_{k+1}))
y_{k+1} = P_Y(y_k + η ∇_y g(x_{k+1}, y_{k+1}))

Iterates and function values converge (z = (x, y)):

g(x_{k+1}, y) − g(x, y_{k+1}) ≤ (1/(2η)) [ (‖z − z_k‖² − ‖z − z_{k+1}‖²) − ‖z_k − z_{k+1}‖² ]
(the first term telescopes; the residual now enters with a negative sign)

g((1/k) Σ_{i=0}^{k−1} x_i, y) − g(x, (1/k) Σ_{i=0}^{k−1} y_i) ≤ ‖z − z_0‖² / (2kη)

Implementable since P_{X×Y}((x_k, y_k) − η ∇g(x, y)) is a contraction (as a map of (x, y)) when ηL < 1.
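A minimal sketch of the implementable (extragradient) form of MP (my addition; the bilinear toy game and the step size are illustrative):

```python
import numpy as np

# Extragradient (implementable Mirror-Prox) on g(x, y) = x * y over X = Y = [-1, 1].
proj = lambda v: np.clip(v, -1.0, 1.0)
eta = 0.5                                   # needs eta * L < 1; here L = 1
x, y = 0.8, -0.6
for k in range(200):
    # extrapolation ("leading") step using gradients at (x_k, y_k)
    xe, ye = proj(x - eta * y), proj(y + eta * x)
    # update step using gradients at the extrapolated point
    x, y = proj(x - eta * ye), proj(y + eta * xe)
print(x, y)                                 # the last iterate converges to the saddle (0, 0)
```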

SLIDE 20

Smooth Strongly-Convex–Concave minimax problem

g(·, ·) is L-smooth and g(x, ·) is concave. Additionally, assume that g(·, y) is σ-strongly convex (σ < L):

g(x, y) + ⟨∇_x g(x, y), x′ − x⟩ + (σ/2)‖x′ − x‖² ≤ g(x′, y)

Then, by the duality of strong convexity and smoothness, the dual function h(y) = min_{x∈X} g(x, y) is (2L²/σ)-smooth and hence differentiable.

Further, by Danskin's theorem [6, Section 6.11], ∇h(y) = ∇_y g(x*(y), y), where x*(y) = arg min_{x∈X} g(x, y). The dual problem max_{y∈Y} h(y) is a smooth concave maximization problem.
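A quick numerical check of Danskin's formula (my addition; the toy g(x, y) = x²/2 + xy, also used on a later slide, is an illustrative choice):

```python
import numpy as np

# Toy strongly-convex-concave g(x, y) = x^2/2 + x*y on X = [-1, 1] (1-strongly convex in x).
def xstar(y):                          # x*(y) = arg min_x g(x, y) = clip(-y) on X
    return np.clip(-y, -1.0, 1.0)

def h(y):                              # dual h(y) = min_x g(x, y)
    x = xstar(y)
    return x**2 / 2 + x * y

y, d = 0.3, 1e-6
grad_fd = (h(y + d) - h(y - d)) / (2 * d)     # finite-difference gradient of h
grad_danskin = xstar(y)                       # Danskin: grad h(y) = d/dy g(x*(y), y) = x*(y)
print(grad_fd, grad_danskin)                  # both approx. -0.3
```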

[6] D. P. Bertsekas. Convex Optimization Theory. Athena Scientific, Belmont, 2009.

SLIDE 21

Dual Accelerated Gradient Ascent (AGA) method [ours]

O(k⁻²) AGA [7] on h(y) with η < σ/(2L²):

τ_k = 2/(k + 2),  η_k = (k + 1)η/2
w_k = (1 − τ_k) y_k + τ_k v_k
x_k = arg min_{x∈X} g(x, w_k),  and  y_{k+1} = P_Y(w_k + η ∇_y g(x_k, w_k))
v_{k+1} = P_Y(v_k + η_k ∇_y g(x_k, w_k))

AGA on g(x_k, ·) at y_k, where x_k = arg min_{x∈X} g(x, w_k).
Accelerated rate on the dual: h(y*) − h(y_k) = O(k⁻²).
Still a slow rate for the primal-dual gap: f(x_k) − h(y_k) = O(k⁻¹).
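A minimal sketch of the dual AGA loop (my addition; the toy g(x, y) = x²/2 + xy, the constants σ = L = 1, and the step size are assumptions):

```python
import numpy as np

# Dual AGA on h(y) = min_x g(x, y) for the toy g(x, y) = x^2/2 + x*y, X = Y = [-1, 1].
proj = lambda v: np.clip(v, -1.0, 1.0)
xstar = lambda w: proj(-w)                 # arg min_x g(x, w) on X = [-1, 1]
gy = lambda x, w: x                        # d g / d y = x
sigma, L = 1.0, 1.0                        # toy constants (assumed)
eta = 0.4 * sigma / (2 * L**2)             # step size eta < sigma / (2 L^2)

y, v = 0.9, 0.9
for k in range(100):
    tau, eta_k = 2.0 / (k + 2), (k + 1) * eta / 2
    w = (1 - tau) * y + tau * v
    x = xstar(w)                           # x_k = arg min_x g(x, w_k)
    y = proj(w + eta * gy(x, w))           # y_{k+1}
    v = proj(v + eta_k * gy(x, w))         # v_{k+1}
print(y, x)   # y_k and x_k approach the saddle (0, 0); dual gap O(1/k^2), primal gap O(1/k)
```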

[7] Y. E. Nesterov. “A method for solving the convex programming problem with convergence rate O(1/k²)”. In: Dokl. Akad. Nauk SSSR 269 (1983), pp. 543–547.

SLIDE 22

Dual Accelerated Gradient Ascent (AGA) method is slow

Consider min_{x∈[−1,1]} max_{y∈[−1,1]} g(x, y) = x²/2 + xy. Then h(y) = −y²/2, f(x) = x²/2 + |x|, and (x*, y*) = (0, 0).

Suppose h(y*) − h(y_k) = Θ(k⁻²) ⟹ |y_k| = Θ(k⁻¹). Let x_k = arg min_{x∈X} g(x, y_k) = −y_k ⟹ |x_k| = |y_k| = Θ(k⁻¹).

Thus f(x_k) − f(x*) = x_k²/2 + |x_k| = Θ(k⁻¹).

SLIDE 23

Dual Implicit Accelerated Gradient (DIAG) method [ours]

For each k, apply an AGA step on g(x_{k+1}, ·):

τ_k = 2/(k + 2),  η_k = (k + 1)η/2
w_k = (1 − τ_k) y_k + τ_k v_k
x_{k+1} = arg min_{x∈X} g(x, y_{k+1}),  and  y_{k+1} = P_Y(w_k + η ∇_y g(x_{k+1}, w_k))
v_{k+1} = P_Y(v_k + η_k ∇_y g(x_{k+1}, w_k))

This is AGA on g(x_{k+1}, ·) at y_k, where x_{k+1} = arg min_{x∈X} g(x, y_{k+1}).
The primal-dual gap inherits the accelerated O(k⁻²) convergence of the dual h(y_k) = min_{x∈X} g(x, y_k):

g( (1/(k(k+1))) Σ_{i=1}^{k} 2i · x_i, y) − g(x, y_k) ≤ 2‖y − y_0‖² / (k(k+1)η)

SLIDE 24

Implementable DIAG

x_{k+1} = arg min_{x∈X} g(x, y_{k+1}),  and  y_{k+1} = P_Y(w_k + η ∇_y g(x_{k+1}, w_k))

Since η < σ/(2L²), the following operator (·)⁺ : Y → Y is a 1/2-contraction:

x*(y) = arg min_{x∈X} g(x, y)
(y)⁺ = P_Y(w_k + η ∇_y g(x*(y), w_k)).

Thus (x_k^(i), y_k^(i)) converges approximately to (x_{k+1}, y_{k+1}) in O(log(1/ε)) steps:

x_k^(i) = arg min_{x∈X} g(x, y_k^(i))
y_k^(i+1) = P_Y(w_k + η ∇_y g(x_k^(i), w_k)).
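A minimal sketch of one implementable-DIAG outer iteration with the inner fixed-point loop (my addition; the toy g(x, y) = x²/2 + xy, the constants, and the iteration counts are assumptions):

```python
import numpy as np

# Implementable DIAG on the toy g(x, y) = x^2/2 + x*y over X = Y = [-1, 1].
proj = lambda v: np.clip(v, -1.0, 1.0)
xstar = lambda y: proj(-y)                 # arg min_x g(x, y) on X (closed form for the toy)
gy = lambda x, w: x                        # d g / d y = x
sigma = L = 1.0                            # toy constants (assumed)
eta = 0.4 * sigma / (2 * L**2)             # eta < sigma / (2 L^2): inner map is a contraction

y, v = 0.9, 0.9
for k in range(50):
    tau, eta_k = 2.0 / (k + 2), (k + 1) * eta / 2
    w = (1 - tau) * y + tau * v
    # inner fixed-point loop: resolve the implicit pair (x_{k+1}, y_{k+1})
    yi = y
    for i in range(30):                    # O(log(1/eps)) inner iterations suffice
        xi = xstar(yi)
        yi = proj(w + eta * gy(xi, w))
    x, y = xi, yi
    v = proj(v + eta_k * gy(x, w))
print(x, y)                                # converges to the saddle (0, 0)
```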

SLIDE 25

Summary and my contributions

We studied the smooth minimax optimization problem:
Improved O(ε⁻³) algorithm for the smooth nonconvex–concave problem
Optimal O(k⁻²) algorithm for the smooth strongly-convex–concave problem
