SLIDE 1

Efficient Algorithms for Smooth Minimax Optimization

NeurIPS 2019 Kiran Koshy Thekumparampil†, Prateek Jain‡, Praneeth Netrapalli‡, Sewoong Oh±

†University of Illinois at Urbana-Champaign, ‡Microsoft Research, India, ±University of Washington, Seattle

Oct 27, 2019

SLIDE 2

Outline

Minimax optimization problem
Efficient algorithm for the nonconvex–concave minimax problem
Optimal algorithm for the strongly-convex–concave minimax problem

SLIDE 3

Minimax problem

Consider the general minimax problem

min_{x∈X} max_{y∈Y} g(x, y)

Two-player game: y tries to maximize and x tries to minimize. The order of min and max, i.e. who plays first (x above), is important:

max_{y∈Y} min_{x∈X} g(x, y) ≤ min_{x∈X} max_{y∈Y} g(x, y)
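As a quick illustration (not from the original slides; the game g(x, y) = |x − y| and the grid are illustrative choices), the sketch below evaluates both orderings on a grid and exhibits a strict gap between max-min and min-max:

```python
import numpy as np

# Toy game g(x, y) = |x - y| on X = Y = [-1, 1], evaluated on a grid.
xs = np.linspace(-1.0, 1.0, 201)
ys = np.linspace(-1.0, 1.0, 201)
G = np.abs(xs[:, None] - ys[None, :])   # G[i, j] = g(xs[i], ys[j])

max_min = G.min(axis=0).max()   # max over y of (min over x of g)  -> 0.0
min_max = G.max(axis=1).min()   # min over x of (max over y of g)  -> 1.0
print(max_min, min_max)         # 0.0 <= 1.0: committing first (min_x max_y) costs x more
```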

SLIDE 4

Examples of Minimax problem

1. GAN: min_G max_D V(G, D):

   min_G max_D  E_{x∼P_X}[log D(x)] + E_{z∼Q_Z}[log(1 − D(G(z)))]
   (the inner maximum over D corresponds to JS(P_X ‖ Q_X))

2. Constrained optimization: min_x f(x), s.t. f_i(x) ≤ 0, ∀ i ∈ [m]:

   min_x max_{y≥0}  L(x, y) = f(x) + Σ_{i=1}^{m} y_i f_i(x)

3. Robust estimation/optimization:

   min_x Σ_i max_{ẑ_i : ∆(ẑ_i, z_i) ≤ ε}  f(x, ẑ_i), ∀ i ∈ [m].
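A minimal sketch of example 2 (my addition; the toy objective f(x) = x² and single constraint x ≤ 1 are placeholders): maximizing the Lagrangian over y ≥ 0 recovers f(x) on feasible points and blows up on infeasible ones.

```python
import numpy as np

# Example 2 as a max over y >= 0: L(x, y) = f(x) + sum_i y_i * f_i(x).
# If some constraint f_i(x) > 0, taking y_i large drives L to +infinity,
# so max_{y>=0} L(x, y) equals f(x) on the feasible set and +inf otherwise.
f  = lambda x: x**2                       # toy objective
fi = [lambda x: x - 1.0]                  # single toy constraint: x <= 1

def inner_max(x, y_cap=1e6):
    # maximize L over y in [0, y_cap]^m; the cap stands in for +infinity
    return f(x) + sum(max(0.0, y_cap * c(x)) for c in fi)  # optimal y_i is 0 or y_cap

print(inner_max(0.5))   # feasible: f(0.5) = 0.25
print(inner_max(2.0))   # infeasible: huge value (diverges as y_cap -> inf)
```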

SLIDE 5

Nonconvex minimax

In general g(x, y) is non-convex in both x and y, e.g. neural-network-based GANs. Very few works address nonconvex minimax.

We focus on the smooth nonconvex–concave minimax problem, i.e. g(x, ·) is concave and g is L-smooth:

max_{a∈{x,y}} ‖∇_a g(x, y) − ∇_a g(x′, y′)‖ ≤ L (‖x − x′‖ + ‖y − y′‖),

e.g. smooth constrained optimization. In general, max_{y∈Y} min_{x∈X} g(x, y) < min_{x∈X} max_{y∈Y} g(x, y), so we focus on the (non-smooth, nonconvex) primal problem: f(x) = max_y g(x, y).

SLIDE 6

f(x) = max_{y∈Y} g(x, y) is non-smooth and weakly convex

f is non-smooth due to maximization over y

ρ-weakly convex function

We say that f is ρ-weakly convex if f + (ρ/2)‖·‖² is convex, i.e.,

f(x) + ⟨u_x, x′ − x⟩ − (ρ/2)‖x′ − x‖² ≤ f(x′),

for all Fréchet subgradients u_x ∈ ∂f(x) and all x, x′ ∈ X.

[Figure: plots of f(x) = max{|x|, (1 − x²)/2} and of f(x) + x²/2 on [−1, 1].]

f is 1-weakly convex, since f + ‖·‖²/2 is convex.
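A quick numerical sanity check of this example (not from the slides; the midpoint-convexity test on a grid is an illustrative choice):

```python
import numpy as np

f = lambda x: np.maximum(np.abs(x), (1 - x**2) / 2)   # the slide's example, 1-weakly convex
g = lambda x: f(x) + x**2 / 2                          # should be convex

# Midpoint-convexity check on a grid: g((a+b)/2) <= (g(a) + g(b)) / 2 for all pairs.
xs = np.linspace(-1, 1, 201)
A, B = np.meshgrid(xs, xs)
viol_f = np.max(f((A + B) / 2) - (f(A) + f(B)) / 2)    # > 0: f itself is not convex
viol_g = np.max(g((A + B) / 2) - (g(A) + g(B)) / 2)    # <= ~0: f + x^2/2 is convex
print(viol_f, viol_g)
```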

SLIDE 7

f(x) = max_{y∈Y} g(x, y) is non-smooth and weakly convex

f is non-smooth due to maximization over y

ρ-weakly convex function

We say that f is ρ-weakly convex if f + (ρ/2)‖·‖² is convex, i.e.,

f(x) + ⟨u_x, x′ − x⟩ − (ρ/2)‖x′ − x‖² ≤ f(x′),

for all Fréchet subgradients u_x ∈ ∂f(x) and all x, x′ ∈ X.

Any L-smooth function is L-weakly convex:

f(x) + ⟨∇f(x), x′ − x⟩ − (L/2)‖x′ − x‖² ≤ f(x′).

−‖x‖ is not weakly convex (due to the upward-pointing cusp).

SLIDE 8

f(x) = max_{y∈Y} g(x, y) is non-smooth and weakly convex

f is non-smooth due to maximization over y

ρ-weakly convex function

We say that f is ρ-weakly convex if f + (ρ/2)‖·‖² is convex, i.e.,

f(x) + ⟨u_x, x′ − x⟩ − (ρ/2)‖x′ − x‖² ≤ f(x′),

for all Fréchet subgradients u_x ∈ ∂f(x) and all x, x′ ∈ X.

f(x) = max_{y∈Y} g(x, y) is L-weakly convex if g is L-smooth:

g(x, y) + ⟨∇_x g(x, y), x′ − x⟩ − (L/2)‖x′ − x‖² ≤ g(x′, y)
⟹ f(x) + ⟨u_x, x′ − x⟩ − (L/2)‖x′ − x‖² ≤ f(x′).

Cannot define an approximate stationary point directly using subgradients.

SLIDE 9

First order stationary point of weakly-convex function

Moreau envelope f_λ of an L-weakly convex function (L < 1/λ):

f_λ(x) = min_{x′} f(x′) + (1/(2λ))‖x − x′‖².

f_λ is a smooth lower bound of f: ∇f_λ(x) = 0 ⟹ 0 ∈ ∂f(x)

[Figure: f(x) = max{|x|, (1 − x²)/2} and its Moreau envelope f_{0.5}(x) on [−1, 1].]
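A minimal numerical sketch (my addition; the grids are illustrative) that computes the Moreau envelope f_{0.5} of the example above and checks that it lower-bounds f:

```python
import numpy as np

f = lambda x: np.maximum(np.abs(x), (1 - x**2) / 2)   # 1-weakly convex example
lam = 0.5                                              # lambda < 1/L = 1

# Moreau envelope on a grid: f_lam(x) = min_{x'} f(x') + ||x - x'||^2 / (2*lam)
xs = np.linspace(-1.5, 1.5, 601)                       # inner minimization grid
x  = np.linspace(-1.0, 1.0, 201)                       # evaluation points
f_lam = np.min(f(xs)[None, :] + (x[:, None] - xs[None, :])**2 / (2 * lam), axis=1)

print(np.all(f_lam <= f(x) + 1e-12))                   # True: the envelope lower-bounds f
```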

SLIDE 10

First order stationary point of weakly-convex function

Moreau envelope f_λ of an L-weakly convex function (L < 1/λ):

f_λ(x) = min_{x′} f(x′) + (1/(2λ))‖x − x′‖².

f_λ is a smooth lower bound of f: ∇f_λ(x) = 0 ⟹ 0 ∈ ∂f(x)

ε-first order stationary point (ε-FOSP)

We say that x is an ε-first-order stationary point of an L-weakly convex f if ‖∇f_{1/(2L)}(x)‖ ≤ ε. This implies that there exists x̂ such that ‖x̂ − x‖ ≤ ε/(2L) and min_{u∈∂f(x̂)} ‖u‖ ≤ ε.

Algorithm complexity is the number of first-order oracle calls needed to obtain an ε-FOSP. The convergence rate is ε_k if after k oracle calls we get an ε_k-FOSP.

SLIDE 11

Smooth nonconvex–concave minimax results

Setting                                              Previous state-of-the-art   Our result
max_y g(x, y)                                        O(ε⁻⁵) [1]                  O(ε⁻³)
max_i f_i(x) = max_{y∈∆_m} Σ_{i=1}^{m} y_i f_i(x)    O(ε⁻⁴) [2]                  O(ε⁻³)

∆_m is the simplex of dimension m.

[1] C. Jin, P. Netrapalli, and M. I. Jordan. “Minmax optimization: Stable limit points of gradient descent ascent are locally optimal”. arXiv preprint arXiv:1902.00618 (2019).
[2] D. Davis and D. Drusvyatskiy. “Stochastic subgradient method converges at the rate O(k−1/4) on weakly convex functions”. arXiv preprint arXiv:1802.02988 (2018).

SLIDE 12

Baseline: Subgradient method, O(ε⁻⁵) [1, 2]

Apply the (inexact) subgradient method:

u_{x_k} = ∇_x g(x_k, y_k), where y_k ≈ y*(x_k) = arg max_{y∈Y} g(x_k, y)
x_{k+1} = P_X(x_k − η u_{x_k})

Sufficient condition: max_y g(x_k, y) − g(x_k, y_k) ≤ O(ε²) [1]

Setting          Per-step (AGD)   # iterations (subgrad. method)   Total complexity
max_y g(x, y)    O(ε⁻¹)           O(ε⁻⁴)                           O(ε⁻⁵)
max_i f_i(x)     O(1)             O(ε⁻⁴)                           O(ε⁻⁴)

Does not utilize the smooth minimax structure of f(x) = max_y g(x, y).
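A minimal sketch of this baseline (my addition; the toy g(x, y) = y·sin(3x) − y²/2, the step sizes, and the plain gradient-ascent inner loop standing in for AGD are all assumptions):

```python
import numpy as np

# Toy smooth nonconvex-concave g on X = Y = [-1, 1]:
# g(x, y) = y * sin(3x) - y^2 / 2   (concave in y, nonconvex in x).
gx = lambda x, y: 3 * y * np.cos(3 * x)      # d g / d x
gy = lambda x, y: np.sin(3 * x) - y          # d g / d y
proj = lambda v: np.clip(v, -1.0, 1.0)       # projection onto [-1, 1]

x, eta = 0.3, 0.05
for k in range(200):
    # inner loop: y_k approx. arg max_y g(x, y) (a few ascent steps stand in for AGD)
    y = 0.0
    for _ in range(50):
        y = proj(y + 0.5 * gy(x, y))
    # outer (inexact) subgradient step on f(x) = max_y g(x, y)
    x = proj(x - eta * gx(x, y))
print(x)   # x approaches a stationary point of f (here x ~ 0)
```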

SLIDE 13

Proximal Point method (PPM)

(Inexact) proximal point method:

x_{k+1} ≈ arg min_{x∈X} f(x) + L‖x − x_k‖²
⟺ x_{k+1} ≈ x_k − (1/(2L)) u_{x_{k+1}},  u_{x_{k+1}} ∈ ∂f(x_{k+1})

Iteration complexity to get an ε-FOSP is O(1/ε²).

Proof sketch.

L-weak convexity implies

f(x_{k+1}) + ⟨u_{x_{k+1}}, x_k − x_{k+1}⟩ − (L/2)‖x_k − x_{k+1}‖² ≤ f(x_k)

Using the update x_{k+1} = x_k − (1/(2L)) u_{x_{k+1}} we get a descent lemma:

f(x_{k+1}) − f(x_k) ≤ −(3L/2)‖x_{k+1} − x_k‖² = −(3/(8L))‖u_{x_{k+1}}‖²

After O((f(x_0) − min_x f(x))/ε²) steps, min_k ‖u_{x_{k+1}}‖ = O(ε). This generalizes to ‖∇f_{1/(2L)}(x_k)‖ for the inexact update and non-smooth f.
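A minimal sketch of the inexact PPM outer loop (my addition; the same toy g as above, with an alternating gradient inner solver standing in for the strongly-convex–concave subproblem solver, and a rough L are assumptions):

```python
import numpy as np

# Inexact PPM on f(x) = max_y g(x, y) with the toy g(x, y) = y*sin(3x) - y^2/2.
gx = lambda x, y: 3 * y * np.cos(3 * x)
gy = lambda x, y: np.sin(3 * x) - y
proj = lambda v: np.clip(v, -1.0, 1.0)
L = 9.0                                       # rough smoothness constant for this toy g

x = 0.3
for k in range(50):
    # approximately solve the L-strongly-convex-concave subproblem
    #   min_x max_y  g(x, y) + L * (x - x_k)^2
    xk, xi, y = x, x, 0.0
    for _ in range(200):                      # inner solver: alternating gradient steps
        y  = proj(y + 0.2 * gy(xi, y))
        xi = proj(xi - 0.02 * (gx(xi, y) + 2 * L * (xi - xk)))
    x = xi
print(x)   # x approaches a stationary point of f (here x ~ 0)
```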

SLIDE 14

Per-step complexity of PPM

L-weakly convex + 2L-strongly convex = L-strongly convex: f(x) + L‖x − x_k‖²

Each iteration solves an L-strongly-convex–concave problem:

x_{k+1} = arg min_{x∈X} max_{y∈Y} [ g̃_k(x, y) = g(x, y) + L‖x − x_k‖² ]

A primal-dual gap of O(ε²) is sufficient:

max_{y∈Y} g̃_k(x_{k+1}, y) − min_{x∈X} g̃_k(x, y_{k+1}) = O(ε²)

Algorithm for min_x max_y g̃_k(x, y)     Per-step complexity   Total complexity
O(k⁻¹) Cvx–Cve [Mirror-Prox, 3]          O(ε⁻²)                O(ε⁻⁴)
O(k⁻²) Strongly-Cvx–Cve [ours]           O(ε⁻¹)                O(ε⁻³)

[3] A. Nemirovski. “Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex–concave saddle point problems”. In: SIAM Journal on Optimization 15.1 (2004), pp. 229–251.

SLIDE 15

Nonconvex–concave experiment

min_{x∈R²} [ f(x) = max_{1≤i≤m=9} f_i(x) ], where f_i(x) = ‖a_i x − b_i‖²/2 + c_i.

[Figure: ‖∇f_{1/(2L)}(x_k)‖ vs. number of gradient oracle accesses k, comparing the subgradient method, PPM (ours), and adaptive PPM (ours).]
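A minimal setup sketch for this experiment (my addition; the random a_i, b_i, c_i are placeholders, not the instance used in the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 9, 2
a = rng.standard_normal(m)           # placeholder scalars a_i
b = rng.standard_normal((m, d))      # placeholder vectors b_i
c = rng.standard_normal(m)           # placeholder offsets c_i

def fi(x):                           # f_i(x) = ||a_i x - b_i||^2 / 2 + c_i
    return 0.5 * np.sum((a[:, None] * x[None, :] - b) ** 2, axis=1) + c

def f(x):                            # f(x) = max_i f_i(x) = max_{y in simplex} sum_i y_i f_i(x)
    return fi(x).max()

print(f(np.zeros(d)))
```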

SLIDE 16

Smooth Convex–Concave minimax problem

g(·, y) is convex, g(x, ·) is concave, and g is L-smooth:

max_{a∈{x,y}} ‖∇_a g(x, y) − ∇_a g(x′, y′)‖ ≤ L (‖x − x′‖ + ‖y − y′‖).

Primal: f(x) = max_{y∈Y} g(x, y). Dual: h(y) = min_{x∈X} g(x, y).

If X, Y are compact, then there is a saddle point (x*, y*) (Sion's minimax theorem):

min_{x∈X} f(x) = min_{x∈X} max_{y∈Y} g(x, y) = g(x*, y*) = max_{y∈Y} min_{x∈X} g(x, y) = max_{y∈Y} h(y)

ε-primal-dual pair (ε-PD pair)

(x̂, ŷ) is an ε-primal-dual pair if the primal-dual gap is at most ε:

f(x̂) − h(ŷ) = max_{y∈Y} g(x̂, y) − min_{x∈X} g(x, ŷ) ≤ ε

SLIDE 17

Optimal algorithms for smooth Convex–Concave minimax

Setting           Previous state-of-the-art   Our results   Lower bound
Strongly convex   O(k⁻¹) [3]                  O(k⁻²)        Ω(k⁻²) [4]

[3] A. Nemirovski. “Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex–concave saddle point problems”. In: SIAM Journal on Optimization 15.1 (2004), pp. 229–251.
[4] Y. Ouyang and Y. Xu. “Lower complexity bounds of first-order methods for convex–concave bilinear saddle-point problems”. arXiv preprint arXiv:1808.02901 (2018).

SLIDE 18

Mirror-Descent (MD) algorithm [5]

For the Euclidean norm, MD has the following iteration:

x_{k+1} = P_X(x_k − η ∇_x g(x_k, y_k))
y_{k+1} = P_Y(y_k + η ∇_y g(x_k, y_k))

Iterates and function values do not converge. Let z = (x, y).

g(x_k, y) − g(x, y_k) ≤ (1/(2η)) [ (‖z − z_k‖² − ‖z − z_{k+1}‖²) + ‖z_k − z_{k+1}‖² ]
(the first term telescopes; the second is a residual)

g((1/k) Σ_{i=0}^{k−1} x_i, y) − g(x, (1/k) Σ_{i=0}^{k−1} y_i) ≤ (1/(2kη)) [ ‖z − z_0‖² + Σ_{i=0}^{k−1} ‖z_i − z_{i+1}‖² ]

η = O(1/√k) ⟹ g((1/k) Σ_{i=0}^{k−1} x_i, y) − g(x, (1/k) Σ_{i=0}^{k−1} y_i) = O(1/√k)
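A minimal sketch of Euclidean MD, i.e. projected gradient descent-ascent (my addition; the bilinear game g(x, y) = x·y and the step size are illustrative):

```python
import numpy as np

# Euclidean mirror descent (gradient descent-ascent) on the bilinear game g(x, y) = x * y
# over X = Y = [-1, 1]; the iterates orbit the saddle (0, 0), while averages behave better.
proj = lambda v: np.clip(v, -1.0, 1.0)
K = 1000
eta = 1.0 / np.sqrt(K)
x, y = 0.8, -0.6
xs, ys = [], []
for k in range(K):
    xs.append(x); ys.append(y)
    x, y = proj(x - eta * y), proj(y + eta * x)      # simultaneous update at (x_k, y_k)
print(abs(x), abs(y))                                # last iterate: stays far from (0, 0)
print(abs(np.mean(xs)), abs(np.mean(ys)))            # averaged iterates: much closer to (0, 0)
```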

[5] A. Nemirovski and D. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley, New York, 1983.

SLIDE 19

(Conceptual) Mirror-Prox (MP) algorithm [3]

For the Euclidean norm, (conceptual) MP has the following implicit iteration:

x_{k+1} = P_X(x_k − η ∇_x g(x_{k+1}, y_{k+1}))
y_{k+1} = P_Y(y_k + η ∇_y g(x_{k+1}, y_{k+1}))

Iterates and function values converge (z = (x, y)):

g(x_{k+1}, y) − g(x, y_{k+1}) ≤ (1/(2η)) [ (‖z − z_k‖² − ‖z − z_{k+1}‖²) − ‖z_k − z_{k+1}‖² ]
(the first term telescopes; the residual now enters with a negative sign)

g((1/k) Σ_{i=0}^{k−1} x_i, y) − g(x, (1/k) Σ_{i=0}^{k−1} y_i) ≤ ‖z − z_0‖² / (2kη)

Implementable since P_{X×Y}((x_k, y_k) − η ∇g(x, y)) is a contraction (as a map of (x, y)) when ηL < 1.
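A minimal sketch of the implementable (extragradient) form of MP (my addition; the bilinear toy game and the step size are illustrative):

```python
import numpy as np

# Extragradient (implementable Mirror-Prox) on g(x, y) = x * y over X = Y = [-1, 1].
proj = lambda v: np.clip(v, -1.0, 1.0)
eta = 0.5                                   # needs eta * L < 1; here L = 1
x, y = 0.8, -0.6
for k in range(200):
    # extrapolation ("leading") step using gradients at (x_k, y_k)
    xe, ye = proj(x - eta * y), proj(y + eta * x)
    # update step using gradients at the extrapolated point
    x, y = proj(x - eta * ye), proj(y + eta * xe)
print(x, y)                                 # the last iterate converges to the saddle (0, 0)
```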

SLIDE 20

Smooth Strongly-Convex–Concave minimax problem

g(·, ·) is L-smooth and g(x, ·) is concave. Additionally, assume that g(·, y) is σ-strongly convex (σ < L):

g(x, y) + ⟨∇_x g(x, y), x′ − x⟩ + (σ/2)‖x′ − x‖² ≤ g(x′, y)

Then, by the duality of strong convexity and smoothness, the dual function h(y) = min_{x∈X} g(x, y) is (2L²/σ)-smooth and hence differentiable.

Further, by Danskin's theorem [6, Section 6.11], ∇h(y) = ∇_y g(x*(y), y), where x*(y) = arg min_{x∈X} g(x, y). The dual problem max_{y∈Y} h(y) is a smooth concave maximization problem.
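A quick numerical check of Danskin's formula (my addition; the toy g(x, y) = x²/2 + xy, also used on a later slide, is an illustrative choice):

```python
import numpy as np

# Toy strongly-convex-concave g(x, y) = x^2/2 + x*y on X = [-1, 1] (1-strongly convex in x).
def xstar(y):                          # x*(y) = arg min_x g(x, y) = clip(-y) on X
    return np.clip(-y, -1.0, 1.0)

def h(y):                              # dual h(y) = min_x g(x, y)
    x = xstar(y)
    return x**2 / 2 + x * y

y, d = 0.3, 1e-6
grad_fd = (h(y + d) - h(y - d)) / (2 * d)     # finite-difference gradient of h
grad_danskin = xstar(y)                       # Danskin: grad h(y) = d/dy g(x*(y), y) = x*(y)
print(grad_fd, grad_danskin)                  # both approx. -0.3
```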

[6] D. P. Bertsekas. Convex Optimization Theory. Athena Scientific, Belmont, 2009.

SLIDE 21

Dual Accelerated Gradient Ascent (AGA) method [ours]

O(k⁻²) AGA [7] on h(y) with η < σ/(2L²):

τ_k = 2/(k + 2),  η_k = (k + 1)η/2
w_k = (1 − τ_k) y_k + τ_k v_k
x_k = arg min_{x∈X} g(x, w_k),  and  y_{k+1} = P_Y(w_k + η ∇_y g(x_k, w_k))
v_{k+1} = P_Y(v_k + η_k ∇_y g(x_k, w_k))

AGA on g(x_k, ·) at y_k, where x_k = arg min_{x∈X} g(x, w_k).
Accelerated rate on the dual: h(y*) − h(y_k) = O(k⁻²).
Still a slow rate for the primal-dual gap: f(x_k) − h(y_k) = O(k⁻¹).
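A minimal sketch of the dual AGA loop (my addition; the toy g(x, y) = x²/2 + xy, the constants σ = L = 1, and the step size are assumptions):

```python
import numpy as np

# Dual AGA on h(y) = min_x g(x, y) for the toy g(x, y) = x^2/2 + x*y, X = Y = [-1, 1].
proj = lambda v: np.clip(v, -1.0, 1.0)
xstar = lambda w: proj(-w)                 # arg min_x g(x, w) on X = [-1, 1]
gy = lambda x, w: x                        # d g / d y = x
sigma, L = 1.0, 1.0                        # toy constants (assumed)
eta = 0.4 * sigma / (2 * L**2)             # step size eta < sigma / (2 L^2)

y, v = 0.9, 0.9
for k in range(100):
    tau, eta_k = 2.0 / (k + 2), (k + 1) * eta / 2
    w = (1 - tau) * y + tau * v
    x = xstar(w)                           # x_k = arg min_x g(x, w_k)
    y = proj(w + eta * gy(x, w))           # y_{k+1}
    v = proj(v + eta_k * gy(x, w))         # v_{k+1}
print(y, x)   # y_k and x_k approach the saddle (0, 0); dual gap O(1/k^2), primal gap O(1/k)
```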

[7] Y. E. Nesterov. “A method for solving the convex programming problem with convergence rate O(1/k²)”. In: Dokl. Akad. Nauk SSSR 269 (1983), pp. 543–547.

SLIDE 22

Dual Accelerated Gradient Ascent (AGA) method is slow

Consider min_{x∈[−1,1]} max_{y∈[−1,1]} g(x, y) = x²/2 + xy. Then h(y) = −y²/2, f(x) = x²/2 + |x|, and (x*, y*) = (0, 0).

Suppose h(y*) − h(y_k) = Θ(k⁻²) ⟹ |y_k| = Θ(k⁻¹). Let x_k = arg min_{x∈X} g(x, y_k) = −y_k ⟹ |x_k| = |y_k| = Θ(k⁻¹).

Thus f(x_k) − f(x*) = x_k²/2 + |x_k| = Θ(k⁻¹).

SLIDE 23

Dual Implicit Accelerated Gradient (DIAG) method [ours]

For each k, apply an AGA step on g(x_{k+1}, ·):

τ_k = 2/(k + 2),  η_k = (k + 1)η/2
w_k = (1 − τ_k) y_k + τ_k v_k
x_{k+1} = arg min_{x∈X} g(x, y_{k+1}),  and  y_{k+1} = P_Y(w_k + η ∇_y g(x_{k+1}, w_k))
v_{k+1} = P_Y(v_k + η_k ∇_y g(x_{k+1}, w_k))

This is AGA on g(x_{k+1}, ·) at y_k, where x_{k+1} = arg min_{x∈X} g(x, y_{k+1}).
The primal-dual gap inherits the accelerated O(k⁻²) convergence of the dual h(y_k) = min_{x∈X} g(x, y_k):

g( (1/(k(k+1))) Σ_{i=1}^{k} 2i · x_i, y) − g(x, y_k) ≤ 2‖y − y_0‖² / (k(k+1)η)

SLIDE 24

Implementable DIAG

x_{k+1} = arg min_{x∈X} g(x, y_{k+1}),  and  y_{k+1} = P_Y(w_k + η ∇_y g(x_{k+1}, w_k))

Since η < σ/(2L²), the following operator (·)⁺ : Y → Y is a 1/2-contraction:

x*(y) = arg min_{x∈X} g(x, y)
(y)⁺ = P_Y(w_k + η ∇_y g(x*(y), w_k)).

Thus (x_k^(i), y_k^(i)) converges approximately to (x_{k+1}, y_{k+1}) in O(log(1/ε)) steps:

x_k^(i) = arg min_{x∈X} g(x, y_k^(i))
y_k^(i+1) = P_Y(w_k + η ∇_y g(x_k^(i), w_k)).
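A minimal sketch of one implementable-DIAG outer iteration with the inner fixed-point loop (my addition; the toy g(x, y) = x²/2 + xy, the constants, and the iteration counts are assumptions):

```python
import numpy as np

# Implementable DIAG on the toy g(x, y) = x^2/2 + x*y over X = Y = [-1, 1].
proj = lambda v: np.clip(v, -1.0, 1.0)
xstar = lambda y: proj(-y)                 # arg min_x g(x, y) on X (closed form for the toy)
gy = lambda x, w: x                        # d g / d y = x
sigma = L = 1.0                            # toy constants (assumed)
eta = 0.4 * sigma / (2 * L**2)             # eta < sigma / (2 L^2): inner map is a contraction

y, v = 0.9, 0.9
for k in range(50):
    tau, eta_k = 2.0 / (k + 2), (k + 1) * eta / 2
    w = (1 - tau) * y + tau * v
    # inner fixed-point loop: resolve the implicit pair (x_{k+1}, y_{k+1})
    yi = y
    for i in range(30):                    # O(log(1/eps)) inner iterations suffice
        xi = xstar(yi)
        yi = proj(w + eta * gy(xi, w))
    x, y = xi, yi
    v = proj(v + eta_k * gy(x, w))
print(x, y)                                # converges to the saddle (0, 0)
```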

SLIDE 25

Summary and my contributions

We studied the smooth minimax optimization problem:
Improved O(ε⁻³) algorithm for the smooth nonconvex–concave problem
Optimal O(k⁻²) algorithm for the smooth strongly-convex–concave problem
