
SLIDE 1

Poster #212

Variance Reduction for Matrix Games

Yair Carmon, Yujia Jin, Aaron Sidford, Kevin Tian (presenting)

SLIDES 2-4

Zero-sum games

    min_{x∈𝒳} max_{y∈𝒴} f(x, y)

Super useful!

๏ Constraints: check feasibility (e.g. GANs)
๏ Robustness: represent uncertainty (e.g. adversarial training)

Ideal (approximate) solution: ϵ-Nash equilibrium

    the x player is happy:  f(x, y) ≤ min_{x′∈𝒳} f(x′, y) + ϵ
    the y player is happy:  f(x, y) ≥ max_{y′∈𝒴} f(x, y′) − ϵ

We assume f is convex-concave ⟹ a Nash equilibrium exists.
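For the bilinear case treated below (f(x, y) = yᵀAx over probability simplices), both conditions can be checked through a single quantity, the Nash (duality) gap max_i (Ax)_i − min_j (Aᵀy)_j, which equals the sum of the two players' regrets. A minimal numpy sketch (the function name duality_gap and the toy matrix are ours, not from the slides):

```python
import numpy as np

def duality_gap(A, x, y):
    """Nash gap of (x, y) for min_x max_y y^T A x over probability simplices.

    gap = max_{y'} y'^T A x - min_{x'} y^T A x' = max_i (A x)_i - min_j (A^T y)_j.
    It is the sum of both players' regrets, so gap <= eps certifies an
    eps-Nash equilibrium.
    """
    return float(np.max(A @ x) - np.min(A.T @ y))

# toy example: matching pennies; the uniform strategies form an exact equilibrium
A = np.array([[1.0, -1.0], [-1.0, 1.0]])
x = y = np.array([0.5, 0.5])
print(duality_gap(A, x, y))  # 0.0
```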

SLIDES 5-10

Our contributions

1. A variance reduction framework for general (convex-concave) min_{x∈𝒳} max_{y∈𝒴} f(x, y):
   a centered gradient estimator yields a fast algorithm (geometry matters).

2. Concrete centered gradient estimators for f(x, y) = yᵀAx, obtained by "sampling from the difference".

Together, 1 + 2 give new runtimes for min_{x∈𝒳} max_{y∈𝒴} yᵀAx.

SLIDES 11-17

Bilinear games

    min_{x∈𝒳} max_{y∈𝒴} yᵀAx,   A ∈ ℝ^{m×n}

๏ Simplest case
๏ Local model for smooth zero-sum games
๏ Important by themselves

Geometry matters:

    𝒳 = 𝒴 = simplex                  →  matrix games / LP
    𝒳 = Euclidean ball, 𝒴 = simplex  →  hard-margin SVM
    𝒳 = 𝒴 = Euclidean ball           →  linear regression

Our work covers the simplex-simplex and Euclidean-simplex settings; the Euclidean-Euclidean setting is handled by Balamurugan & Bach '16.
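To see where the last two rows come from, here are the standard reformulations (our notation, not taken from the slides: a_iᵀ are the rows of A and b_i ∈ {±1} are labels):

```latex
% Linear regression as a bilinear saddle point in Euclidean geometry (dual-norm identity):
\min_{x} \|Ax - b\|_2 \;=\; \min_{x} \; \max_{\|y\|_2 \le 1} \; y^\top (Ax - b)

% Hard-margin SVM (maximize the worst-case margin) as a ball-simplex game:
\max_{\|x\|_2 \le 1} \; \min_{i} \; b_i \, a_i^\top x
  \;=\; \max_{\|x\|_2 \le 1} \; \min_{y \in \Delta^m} \; y^\top \mathrm{diag}(b)\, A x
```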

SLIDES 18-29

Algorithms and rates

    min_{x∈𝒳} max_{y∈𝒴} yᵀAx,   A ∈ ℝ^{m×n}

For simplicity take m ≍ n, so computing x ↦ Ax takes n² time.

    Method                                              Runtime
    Exact gradient (Nemirovski '04, Nesterov '07)       n² · L/ϵ
    Stochastic gradient (GK '95, NJLS '09, CHW '10)     n · L²/ϵ²
    Variance reduction (our approach)                   n² + n^{3/2} · L/ϵ

The Lipschitz constant L depends on the geometry (geometry matters):

    L = max_{ij} |A_{ij}|     (simplex-simplex)
    L = max_i ∥A_{i:}∥₂       (simplex-ball)

Variance reduction is always better than the exact-gradient rate, and better than the stochastic-gradient rate once the latter needs Ω(1) passes over the data.

(Image credit: Chawit Waewsawangwong)
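The arithmetic behind those two comparisons, using the rates in the table (our working, up to constant factors):

```latex
% vs. exact gradient: better whenever L/\epsilon \ge 1 (any nontrivial accuracy)
n^2 + n^{3/2}\,\tfrac{L}{\epsilon} \;\le\; 2\, n^2\,\tfrac{L}{\epsilon}
  \qquad \text{whenever } \tfrac{L}{\epsilon} \ge 1,

% vs. stochastic gradient: better once SGD itself costs at least one pass over
% the data, i.e. n\,L^2/\epsilon^2 \ge n^2, equivalently L/\epsilon \ge \sqrt{n}
n^2 + n^{3/2}\,\tfrac{L}{\epsilon} \;\le\; 2\, n\,\tfrac{L^2}{\epsilon^2}
  \qquad \text{whenever } \tfrac{L}{\epsilon} \ge \sqrt{n}.
```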

SLIDES 30-37

It's all in the gradient estimator

Fix a reference point x₀. A centered gradient estimator g_{x₀}(·) satisfies

    𝔼 g_{x₀}(x) = ∇f(x)    and    𝔼 ∥g_{x₀}(x) − ∇f(x)∥²_* ≤ L² ∥x − x₀∥²

i.e. its variance is bounded by L² times the squared distance from the reference point.

This concept is also used in the Euclidean setting: VR for non-convex optimization (AH '16, RHSPS '16, FLLZ '18, ZXG '18) and for bilinear saddle-point problems (BB '16).

Typical cost:
๏ Preprocessing (exact gradient computation): T_exact ∝ n²
๏ Per stochastic gradient: T_stoch ∝ n

SLIDES 38-47

Constructing a centered estimator

    min_{x∈simplex} max_{y∈simplex} yᵀAx,    ∇f(x, y) = [Aᵀy, Ax]

Split the gradient into the gradient at the reference point plus a difference term:

    ∇f(x, y) = [Aᵀy₀, Ax₀] + [Aᵀ(y − y₀), A(x − x₀)]

and estimate the difference term by sampling from the difference:

    g_{x₀,y₀}(x, y) = [Aᵀy₀, Ax₀] + [A_{i:} ∥y − y₀∥₁ sign([y − y₀]_i),  A_{:j} ∥x − x₀∥₁ sign([x − x₀]_j)]

    with i ∼ |y − y₀| / ∥y − y₀∥₁  and  j ∼ |x − x₀| / ∥x − x₀∥₁.

Costs: preprocessing (gradient at the reference point) T_exact = O(n²); per estimate T_stoch = O(n).

Variance bound (geometry matters):

    𝔼 ∥g_{x₀,y₀}(x, y) − ∇f(x, y)∥²_∞ ≤ L² ∥[x, y] − [x₀, y₀]∥²₁
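A minimal numpy sketch of this estimator (names are ours; the precomputed reference gradient [Aᵀy₀, Ax₀] is passed in so that each call costs O(m + n)):

```python
import numpy as np

def centered_estimator(A, x, y, x0, y0, g0, rng):
    """Sampling-from-the-difference estimator for grad f(x, y) = [A^T y, A x],
    centered at (x0, y0). g0 = (A.T @ y0, A @ x0) is the exact gradient at the
    reference point, precomputed once at cost O(n^2); each call costs O(m + n)."""
    dy, dx = y - y0, x - x0
    if np.abs(dy).sum() > 0:
        i = rng.choice(len(dy), p=np.abs(dy) / np.abs(dy).sum())   # i ~ |y - y0| / ||y - y0||_1
        gx = g0[0] + A[i, :] * np.abs(dy).sum() * np.sign(dy[i])   # unbiased for A^T y
    else:
        gx = g0[0]                          # y = y0: the difference term vanishes
    if np.abs(dx).sum() > 0:
        j = rng.choice(len(dx), p=np.abs(dx) / np.abs(dx).sum())   # j ~ |x - x0| / ||x - x0||_1
        gy = g0[1] + A[:, j] * np.abs(dx).sum() * np.sign(dx[j])   # unbiased for A x
    else:
        gy = g0[1]
    return gx, gy

# Monte-Carlo sanity check of unbiasedness on a small random instance
rng = np.random.default_rng(0)
m, n = 5, 4
A = rng.standard_normal((m, n))
x0, x = rng.dirichlet(np.ones(n)), rng.dirichlet(np.ones(n))
y0, y = rng.dirichlet(np.ones(m)), rng.dirichlet(np.ones(m))
g0 = (A.T @ y0, A @ x0)
est = [centered_estimator(A, x, y, x0, y0, g0, rng) for _ in range(20000)]
print(np.mean([e[0] for e in est], axis=0) - A.T @ y)  # ~ 0
print(np.mean([e[1] for e in est], axis=0) - A @ x)    # ~ 0
```

The 1/p_i importance weight works out to ∥y − y₀∥₁ sign([y − y₀]_i), exactly as in the formula above, which is what keeps the variance proportional to ∥[x, y] − [x₀, y₀]∥²₁.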

SLIDES 48-66

Variance reduction framework

Basic proximal method (with parameter α): iterate

    (x_{k+1}, y_{k+1}) ← argmin_{x∈𝒳} max_{y∈𝒴} { f(x, y) + (α/2)∥x − x_k∥² − (α/2)∥y − y_k∥² }

    # of iterations: α/ϵ         cost per iteration: cost of the prox step

Nemirovski's "conceptual prox-method": replace the exact prox with a rough solution plus an extra-gradient correction

    (x_{k+1/2}, y_{k+1/2}) ← rough solution to the proximal problem
    (x_{k+1}, y_{k+1})     ← extra-gradient step (exact gradient at (x_{k+1/2}, y_{k+1/2}))

    # of iterations: α/ϵ         cost per iteration: cost of rough prox + T_exact

Mirror-prox: rough solution = a single gradient step, with α = L

    # of iterations: L/ϵ         cost per iteration: T_exact
    total runtime: (L/ϵ) · T_exact   (= n² · L/ϵ, since T_exact = n²)

Our approach: rough solution = T_exact/T_stoch centered stochastic gradient steps, each taking T_stoch time, re-centering the estimator at (x_k, y_k); set α = L √(T_stoch/T_exact)

    cost per iteration: T_stoch · L²/α² + T_exact = 2 T_exact   (main technical development)
    # of iterations: α/ϵ = (L/ϵ) √(T_stoch/T_exact)
    total runtime: (L/ϵ) · √(T_exact · T_stoch)   (= n^{3/2} · L/ϵ, since T_stoch = n)
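To make the loop structure concrete, here is a schematic Euclidean sketch of our approach for the simplex-simplex case (function names project_simplex and vr_prox_method are ours; it reuses the centered_estimator sketch above, and the step sizes and Euclidean projections are simplifications, not the tuned mirror/entropic setup analyzed in the paper):

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u - css / np.arange(1, len(v) + 1) > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

def vr_prox_method(A, eps, rng=None):
    """Schematic variance-reduced prox-method for min_x max_y y^T A x on simplices:
    re-center, take T_exact/T_stoch cheap centered-gradient steps as a rough prox
    solution, then one exact extra-gradient step. Step sizes are placeholders."""
    if rng is None:
        rng = np.random.default_rng(0)
    m, n = A.shape
    L = np.abs(A).max()                          # simplex-simplex Lipschitz constant
    T_exact, T_stoch = m * n, m + n
    alpha = L * np.sqrt(T_stoch / T_exact)
    inner = max(1, T_exact // T_stoch)
    outer = max(1, int(np.ceil(alpha / eps)))
    x, y = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    x_avg, y_avg = np.zeros(n), np.zeros(m)
    for _ in range(outer):
        x0, y0 = x, y
        g0 = (A.T @ y0, A @ x0)                  # re-center: exact gradient, O(n^2)
        xs, ys = x0.copy(), y0.copy()
        for _ in range(inner):                   # rough prox via centered stochastic steps
            gx, gy = centered_estimator(A, xs, ys, x0, y0, g0, rng)
            xs = project_simplex(xs - (gx + alpha * (xs - x0)) / (2.0 * alpha))
            ys = project_simplex(ys + (gy - alpha * (ys - y0)) / (2.0 * alpha))
        x = project_simplex(x0 - (A.T @ ys) / alpha)   # extra-gradient step with the
        y = project_simplex(y0 + (A @ xs) / alpha)     # exact gradient at (xs, ys)
        x_avg += xs; y_avg += ys
    return x_avg / outer, y_avg / outer          # averaged half-iterates
```

With T_exact = n² and T_stoch ≈ n, each outer iteration spends about n² time on inner steps plus one exact gradient, matching the n² + n^{3/2} · L/ϵ budget above.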
SLIDES 67-76

Summary

    min_{x∈𝒳} max_{y∈𝒴} f(x, y)

๏ Centered gradient estimator:  Var g_{z₀}(z) ≤ L² ∥z − z₀∥²   (geometry matters)

๏ Prox-method outer loop with re-centering:  (x₀, y₀) → (x_{1/2}, y_{1/2}) → (x₁, y₁) → …  [re-center]

๏ For yᵀAx: sampling from the difference, with i ∼ |y − y₀| / ∥y − y₀∥₁ and j ∼ |x − x₀| / ∥x − x₀∥₁

New runtimes:

    Exact gradient (Nemirovski '04, Nesterov '07):   n² · L/ϵ
    Stochastic gradient (GK '95, CHW '10):           n · L²/ϵ²
    Variance reduction (our approach):               n² + n^{3/2} · L/ϵ

VR is always better than the exact-gradient rate, and better than the stochastic-gradient rate once Ω(1) passes over the data are needed.

(Image credit: Chawit Waewsawangwong)

Poster #212