
SLIDE 1

Poster #212

Variance Reduction for Matrix Games

Yair Carmon, Yujia Jin, Aaron Sidford, Kevin Tian (presenting)

SLIDES 2-4

Zero-sum games

    min_{x∈𝒳} max_{y∈𝒴} f(x, y)

Super useful!

๏ Constraints: check feasibility (e.g. GANs)
๏ Robustness: represent uncertainty (e.g. adversarial training)

Ideal (approximate) solution: ϵ-Nash equilibrium

    the x player is happy:  f(x, y) ≤ min_{x′∈𝒳} f(x′, y) + ϵ
    the y player is happy:  f(x, y) ≥ max_{y′∈𝒴} f(x, y′) − ϵ

We assume f is convex-concave ⟹ a Nash equilibrium exists.
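For the bilinear case treated below (f(x, y) = yᵀAx over probability simplices), both conditions can be checked through a single quantity, the Nash (duality) gap max_i (Ax)_i − min_j (Aᵀy)_j, which equals the sum of the two players' regrets. A minimal numpy sketch (the function name duality_gap and the toy matrix are ours, not from the slides):

```python
import numpy as np

def duality_gap(A, x, y):
    """Nash gap of (x, y) for min_x max_y y^T A x over probability simplices.

    gap = max_{y'} y'^T A x - min_{x'} y^T A x' = max_i (A x)_i - min_j (A^T y)_j.
    It is the sum of both players' regrets, so gap <= eps certifies an
    eps-Nash equilibrium.
    """
    return float(np.max(A @ x) - np.min(A.T @ y))

# toy example: matching pennies; the uniform strategies form an exact equilibrium
A = np.array([[1.0, -1.0], [-1.0, 1.0]])
x = y = np.array([0.5, 0.5])
print(duality_gap(A, x, y))  # 0.0
```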

SLIDES 5-10

Our contributions

1. A variance reduction framework for general (convex-concave) min_{x∈𝒳} max_{y∈𝒴} f(x, y):
   a centered gradient estimator yields a fast algorithm (geometry matters).

2. Concrete centered gradient estimators for f(x, y) = yᵀAx, obtained by "sampling from the difference".

Together, 1 + 2 give new runtimes for min_{x∈𝒳} max_{y∈𝒴} yᵀAx.

SLIDES 11-17

Bilinear games

    min_{x∈𝒳} max_{y∈𝒴} yᵀAx,   A ∈ ℝ^{m×n}

๏ Simplest case
๏ Local model for smooth zero-sum games
๏ Important by themselves

Geometry matters:

    𝒳 = 𝒴 = simplex                  →  matrix games / LP
    𝒳 = Euclidean ball, 𝒴 = simplex  →  hard-margin SVM
    𝒳 = 𝒴 = Euclidean ball           →  linear regression

Our work covers the simplex-simplex and Euclidean-simplex settings; the Euclidean-Euclidean setting is handled by Balamurugan & Bach '16.
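To see where the last two rows come from, here are the standard reformulations (our notation, not taken from the slides: a_iᵀ are the rows of A and b_i ∈ {±1} are labels):

```latex
% Linear regression as a bilinear saddle point in Euclidean geometry (dual-norm identity):
\min_{x} \|Ax - b\|_2 \;=\; \min_{x} \; \max_{\|y\|_2 \le 1} \; y^\top (Ax - b)

% Hard-margin SVM (maximize the worst-case margin) as a ball-simplex game:
\max_{\|x\|_2 \le 1} \; \min_{i} \; b_i \, a_i^\top x
  \;=\; \max_{\|x\|_2 \le 1} \; \min_{y \in \Delta^m} \; y^\top \mathrm{diag}(b)\, A x
```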

SLIDES 18-29

Algorithms and rates

    min_{x∈𝒳} max_{y∈𝒴} yᵀAx,   A ∈ ℝ^{m×n}

For simplicity take m ≍ n, so computing x ↦ Ax takes n² time.

    Method                                              Runtime
    Exact gradient (Nemirovski '04, Nesterov '07)       n² · L/ϵ
    Stochastic gradient (GK '95, NJLS '09, CHW '10)     n · L²/ϵ²
    Variance reduction (our approach)                   n² + n^{3/2} · L/ϵ

The Lipschitz constant L depends on the geometry (geometry matters):

    L = max_{ij} |A_{ij}|     (simplex-simplex)
    L = max_i ∥A_{i:}∥₂       (simplex-ball)

Variance reduction is always better than the exact-gradient rate, and better than the stochastic-gradient rate once the latter needs Ω(1) passes over the data.

(Image credit: Chawit Waewsawangwong)
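The arithmetic behind those two comparisons, using the rates in the table (our working, up to constant factors):

```latex
% vs. exact gradient: better whenever L/\epsilon \ge 1 (any nontrivial accuracy)
n^2 + n^{3/2}\,\tfrac{L}{\epsilon} \;\le\; 2\, n^2\,\tfrac{L}{\epsilon}
  \qquad \text{whenever } \tfrac{L}{\epsilon} \ge 1,

% vs. stochastic gradient: better once SGD itself costs at least one pass over
% the data, i.e. n\,L^2/\epsilon^2 \ge n^2, equivalently L/\epsilon \ge \sqrt{n}
n^2 + n^{3/2}\,\tfrac{L}{\epsilon} \;\le\; 2\, n\,\tfrac{L^2}{\epsilon^2}
  \qquad \text{whenever } \tfrac{L}{\epsilon} \ge \sqrt{n}.
```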

SLIDES 30-37

It's all in the gradient estimator

Fix a reference point x₀. A centered gradient estimator g_{x₀}(·) satisfies

    𝔼 g_{x₀}(x) = ∇f(x)    and    𝔼 ∥g_{x₀}(x) − ∇f(x)∥²_* ≤ L² ∥x − x₀∥²

i.e. its variance is bounded by L² times the squared distance from the reference point.

This concept is also used in the Euclidean setting: VR for non-convex optimization (AH '16, RHSPS '16, FLLZ '18, ZXG '18) and for bilinear saddle-point problems (BB '16).

Typical cost:
๏ Preprocessing (exact gradient computation): T_exact ∝ n²
๏ Per stochastic gradient: T_stoch ∝ n

SLIDES 38-47

Constructing a centered estimator

    min_{x∈simplex} max_{y∈simplex} yᵀAx,    ∇f(x, y) = [Aᵀy, Ax]

Split the gradient into the gradient at the reference point plus a difference term:

    ∇f(x, y) = [Aᵀy₀, Ax₀] + [Aᵀ(y − y₀), A(x − x₀)]

and estimate the difference term by sampling from the difference:

    g_{x₀,y₀}(x, y) = [Aᵀy₀, Ax₀] + [A_{i:} ∥y − y₀∥₁ sign([y − y₀]_i),  A_{:j} ∥x − x₀∥₁ sign([x − x₀]_j)]

    with i ∼ |y − y₀| / ∥y − y₀∥₁  and  j ∼ |x − x₀| / ∥x − x₀∥₁.

Costs: preprocessing (gradient at the reference point) T_exact = O(n²); per estimate T_stoch = O(n).

Variance bound (geometry matters):

    𝔼 ∥g_{x₀,y₀}(x, y) − ∇f(x, y)∥²_∞ ≤ L² ∥[x, y] − [x₀, y₀]∥²₁
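A minimal numpy sketch of this estimator (names are ours; the precomputed reference gradient [Aᵀy₀, Ax₀] is passed in so that each call costs O(m + n)):

```python
import numpy as np

def centered_estimator(A, x, y, x0, y0, g0, rng):
    """Sampling-from-the-difference estimator for grad f(x, y) = [A^T y, A x],
    centered at (x0, y0). g0 = (A.T @ y0, A @ x0) is the exact gradient at the
    reference point, precomputed once at cost O(n^2); each call costs O(m + n)."""
    dy, dx = y - y0, x - x0
    if np.abs(dy).sum() > 0:
        i = rng.choice(len(dy), p=np.abs(dy) / np.abs(dy).sum())   # i ~ |y - y0| / ||y - y0||_1
        gx = g0[0] + A[i, :] * np.abs(dy).sum() * np.sign(dy[i])   # unbiased for A^T y
    else:
        gx = g0[0]                          # y = y0: the difference term vanishes
    if np.abs(dx).sum() > 0:
        j = rng.choice(len(dx), p=np.abs(dx) / np.abs(dx).sum())   # j ~ |x - x0| / ||x - x0||_1
        gy = g0[1] + A[:, j] * np.abs(dx).sum() * np.sign(dx[j])   # unbiased for A x
    else:
        gy = g0[1]
    return gx, gy

# Monte-Carlo sanity check of unbiasedness on a small random instance
rng = np.random.default_rng(0)
m, n = 5, 4
A = rng.standard_normal((m, n))
x0, x = rng.dirichlet(np.ones(n)), rng.dirichlet(np.ones(n))
y0, y = rng.dirichlet(np.ones(m)), rng.dirichlet(np.ones(m))
g0 = (A.T @ y0, A @ x0)
est = [centered_estimator(A, x, y, x0, y0, g0, rng) for _ in range(20000)]
print(np.mean([e[0] for e in est], axis=0) - A.T @ y)  # ~ 0
print(np.mean([e[1] for e in est], axis=0) - A @ x)    # ~ 0
```

The 1/p_i importance weight works out to ∥y − y₀∥₁ sign([y − y₀]_i), exactly as in the formula above, which is what keeps the variance proportional to ∥[x, y] − [x₀, y₀]∥²₁.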

SLIDES 48-66

Variance reduction framework

Basic proximal method (with parameter α): iterate

    (x_{k+1}, y_{k+1}) ← argmin_{x∈𝒳} max_{y∈𝒴} { f(x, y) + (α/2)∥x − x_k∥² − (α/2)∥y − y_k∥² }

    # of iterations: α/ϵ         cost per iteration: cost of the prox step

Nemirovski's "conceptual prox-method": replace the exact prox with a rough solution plus an extra-gradient correction

    (x_{k+1/2}, y_{k+1/2}) ← rough solution to the proximal problem
    (x_{k+1}, y_{k+1})     ← extra-gradient step (exact gradient at (x_{k+1/2}, y_{k+1/2}))

    # of iterations: α/ϵ         cost per iteration: cost of rough prox + T_exact

Mirror-prox: rough solution = a single gradient step, with α = L

    # of iterations: L/ϵ         cost per iteration: T_exact
    total runtime: (L/ϵ) · T_exact   (= n² · L/ϵ, since T_exact = n²)

Our approach: rough solution = T_exact/T_stoch centered stochastic gradient steps, each taking T_stoch time, re-centering the estimator at (x_k, y_k); set α = L √(T_stoch/T_exact)

    cost per iteration: T_stoch · L²/α² + T_exact = 2 T_exact   (main technical development)
    # of iterations: α/ϵ = (L/ϵ) √(T_stoch/T_exact)
    total runtime: (L/ϵ) · √(T_exact · T_stoch)   (= n^{3/2} · L/ϵ, since T_stoch = n)
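To make the loop structure concrete, here is a schematic Euclidean sketch of our approach for the simplex-simplex case (function names project_simplex and vr_prox_method are ours; it reuses the centered_estimator sketch above, and the step sizes and Euclidean projections are simplifications, not the tuned mirror/entropic setup analyzed in the paper):

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u - css / np.arange(1, len(v) + 1) > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

def vr_prox_method(A, eps, rng=None):
    """Schematic variance-reduced prox-method for min_x max_y y^T A x on simplices:
    re-center, take T_exact/T_stoch cheap centered-gradient steps as a rough prox
    solution, then one exact extra-gradient step. Step sizes are placeholders."""
    if rng is None:
        rng = np.random.default_rng(0)
    m, n = A.shape
    L = np.abs(A).max()                          # simplex-simplex Lipschitz constant
    T_exact, T_stoch = m * n, m + n
    alpha = L * np.sqrt(T_stoch / T_exact)
    inner = max(1, T_exact // T_stoch)
    outer = max(1, int(np.ceil(alpha / eps)))
    x, y = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    x_avg, y_avg = np.zeros(n), np.zeros(m)
    for _ in range(outer):
        x0, y0 = x, y
        g0 = (A.T @ y0, A @ x0)                  # re-center: exact gradient, O(n^2)
        xs, ys = x0.copy(), y0.copy()
        for _ in range(inner):                   # rough prox via centered stochastic steps
            gx, gy = centered_estimator(A, xs, ys, x0, y0, g0, rng)
            xs = project_simplex(xs - (gx + alpha * (xs - x0)) / (2.0 * alpha))
            ys = project_simplex(ys + (gy - alpha * (ys - y0)) / (2.0 * alpha))
        x = project_simplex(x0 - (A.T @ ys) / alpha)   # extra-gradient step with the
        y = project_simplex(y0 + (A @ xs) / alpha)     # exact gradient at (xs, ys)
        x_avg += xs; y_avg += ys
    return x_avg / outer, y_avg / outer          # averaged half-iterates
```

With T_exact = n² and T_stoch ≈ n, each outer iteration spends about n² time on inner steps plus one exact gradient, matching the n² + n^{3/2} · L/ϵ budget above.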
SLIDES 67-76

Summary

    min_{x∈𝒳} max_{y∈𝒴} f(x, y)

๏ Centered gradient estimator:  Var g_{z₀}(z) ≤ L² ∥z − z₀∥²   (geometry matters)

๏ Prox-method outer loop with re-centering:  (x₀, y₀) → (x_{1/2}, y_{1/2}) → (x₁, y₁) → …  [re-center]

๏ For yᵀAx: sampling from the difference, with i ∼ |y − y₀| / ∥y − y₀∥₁ and j ∼ |x − x₀| / ∥x − x₀∥₁

New runtimes:

    Exact gradient (Nemirovski '04, Nesterov '07):   n² · L/ϵ
    Stochastic gradient (GK '95, CHW '10):           n · L²/ϵ²
    Variance reduction (our approach):               n² + n^{3/2} · L/ϵ

VR is always better than the exact-gradient rate, and better than the stochastic-gradient rate once Ω(1) passes over the data are needed.

(Image credit: Chawit Waewsawangwong)

Poster #212