Poster #212
Variance Reduction for Matrix Games
Yair Carmon, Yujia Jin, Aaron Sidford, Kevin Tian (presenting)
Zero-sum games:  min_{x∈𝒳} max_{y∈𝒴} f(x, y)

Super useful!
๏ Constraints: check feasibility (e.g. GANs)
๏ Robustness: represent uncertainty (e.g. adversarial training)

Ideal (approximate) solution: an ε-Nash equilibrium, i.e. a pair (x, y) where
๏ the x player is happy:  f(x, y) ≤ min_{x′∈𝒳} f(x′, y) + ε
๏ the y player is happy:  f(x, y) ≥ max_{y′∈𝒴} f(x, y′) − ε

We assume f is convex-concave ⟹ a Nash equilibrium exists.
How do we solve min_{x∈𝒳} max_{y∈𝒴} f(x, y)? Through a gradient estimator.

We focus on the bilinear case:  min_{x∈𝒳} max_{y∈𝒴} y⊤Ax,  A ∈ ℝ^{m×n}
๏ Simplest case
๏ Local model for smooth zero-sum games
๏ Important by themselves:
    𝒳 = 𝒴 = simplex            →  Matrix games / LP      (our work)
    𝒳 = Euclidean, 𝒴 = simplex →  Hard-margin SVM        (our work)
    𝒳 = 𝒴 = Euclidean          →  Linear regression      (Balamurugan & Bach '16)
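As a concrete illustration (a minimal sketch of my own, not from the poster): for the simplex-simplex matrix game, the ε-Nash condition above reduces to checking that the duality gap max_i (Ax)_i − min_j (A⊤y)_j is at most ε.

```python
import numpy as np

def duality_gap(A, x, y):
    """Gap(x, y) = max_{y'∈Δ} y'⊤Ax − min_{x'∈Δ} y⊤Ax'.
    Over the simplex the best responses are attained at vertices,
    so the gap is max_i (Ax)_i − min_j (A⊤y)_j."""
    return np.max(A @ x) - np.min(A.T @ y)

def is_eps_nash(A, x, y, eps):
    # Gap ≤ ε implies both players are within ε of their best responses,
    # i.e. (x, y) is an ε-Nash equilibrium.
    return duality_gap(A, x, y) <= eps

# tiny usage example
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 4))
x = np.full(4, 1 / 4)   # uniform strategies: feasible, generally not optimal
y = np.full(5, 1 / 5)
print(duality_gap(A, x, y), is_eps_nash(A, x, y, eps=0.1))
```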
Runtime comparison for  min_{x∈𝒳} max_{y∈𝒴} y⊤Ax,  A ∈ ℝ^{m×n}
(for simplicity m ≍ n, so computing x ↦ Ax takes n² time)

L = max_{ij} |A_ij|     (simplex-simplex)
    max_i ∥A_{i:}∥₂     (simplex-ball)

๏ Exact gradient (Nemirovski '04, Nesterov '07)
๏ Stochastic gradient (GK '95, NJLS '09, CHW '10)
๏ Variance reduction (our approach)

[Figure: runtime-vs-accuracy plots for the two geometries; geometry matters. Compared with exact gradient methods, variance reduction is always better; compared with stochastic gradient methods, it is better whenever the target accuracy requires Ω(1) passes over A. Image credit: Chawit Waewsawangwong]
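For reference, my reading of what these plots compare (up to constants and logarithmic factors, dense A with m ≍ n; the precise statements are in the paper):

```latex
% Approximate total runtimes to reach an \epsilon-Nash equilibrium (dense A, m \asymp n):
\begin{align*}
\text{exact gradient (mirror-prox):} \quad & n^{2}\,\frac{L}{\epsilon},\\
\text{stochastic gradient:}          \quad & n\,\frac{L^{2}}{\epsilon^{2}},\\
\text{variance reduction (ours):}    \quad & n^{2} + n^{3/2}\,\frac{L}{\epsilon}.
\end{align*}
```

Comparing rows one and three, variance reduction saves a factor of √n on the ε-dependent term (the extra n² is just the cost of reading A once); comparing rows two and three, it wins once L²/ε² ≳ n, i.e. once the stochastic method needs Ω(1) passes over A.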
Centered gradient estimator:  g_{x₀}(⋅)  with

    𝔼 g_{x₀}(x) = ∇f(x)    and    𝔼 ∥g_{x₀}(x) − ∇f(x)∥²_* ≤ L² ∥x − x₀∥²
                                        (variance)             (distance from reference point)

Typical cost:
๏ Preprocessing (exact gradient computation):  Texact ∝ n²
๏ Per stochastic gradient:  Tstoch ∝ n

Also using this concept in the Euclidean setting: VR for non-convex optimization (AH '16, RHSPS '16, FLLZ '18, ZXG '18) & bilinear saddle-point problems (BB '16).
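To make the definition concrete, here is the standard variance-reduction template such estimators follow (my paraphrase, not notation from the poster): compute the exact gradient once at the reference point, then cheaply estimate only the difference.

```latex
% Generic centered estimator: exact gradient at x_0 plus a cheap unbiased estimate of the difference.
g_{x_0}(x) = \nabla f(x_0) + \tilde{\Delta}(x),
\qquad \mathbb{E}\,\tilde{\Delta}(x) = \nabla f(x) - \nabla f(x_0).
% When the difference estimate additionally satisfies
\mathbb{E}\,\bigl\|\tilde{\Delta}(x) - \bigl(\nabla f(x) - \nabla f(x_0)\bigr)\bigr\|_*^2 \le L^2 \|x - x_0\|^2,
% the centered-variance bound above follows, and the variance vanishes as x \to x_0.
```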
The estimator for  min_{x∈simplex} max_{y∈simplex} y⊤Ax:

๏ Gradient at the reference point (x₀, y₀): compute (A⊤y₀, Ax₀) exactly — preprocessing cost Texact = O(n²).
๏ Sampling from the difference:  i ∼ |y − y₀| / ∥y − y₀∥₁ ,   j ∼ |x − x₀| / ∥x − x₀∥₁

      gˣ(x, y) = A⊤y₀ + A_{i:}⊤ ∥y − y₀∥₁ sign([y − y₀]_i)
      gʸ(x, y) = Ax₀ + A_{:j} ∥x − x₀∥₁ sign([x − x₀]_j)

๏ Per-estimation cost: Tstoch = O(n) — one row and one column of A.
๏ The variance bound holds in ∥⋅∥_∞ (the dual of ∥⋅∥₁): geometry matters.
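A minimal NumPy sketch of this estimator (illustrative function and variable names, assuming the formulas above; one sample touches one row and one column of A):

```python
import numpy as np

def centered_bilinear_estimator(A, x, y, x0, y0, gAx0=None, gATy0=None, rng=None):
    """One sample of the centered gradient estimator for min_x max_y y^T A x on
    simplex domains: exact gradient at (x0, y0) plus a one-row / one-column
    correction sampled from the difference; cost O(m + n) per call."""
    rng = np.random.default_rng() if rng is None else rng
    # Gradients at the reference point; in the algorithm these are precomputed
    # once per re-centering at cost T_exact = O(n^2).
    gATy0 = A.T @ y0 if gATy0 is None else gATy0   # ∇_x f(x0, y0) = A^T y0
    gAx0 = A @ x0 if gAx0 is None else gAx0        # ∇_y f(x0, y0) = A x0

    # x-gradient estimate: sample a row index i with probability ∝ |y - y0|.
    dy = y - y0
    gx = gATy0.copy()
    if np.any(dy):
        i = rng.choice(len(dy), p=np.abs(dy) / np.abs(dy).sum())
        gx += A[i, :] * np.abs(dy).sum() * np.sign(dy[i])

    # y-gradient estimate: sample a column index j with probability ∝ |x - x0|.
    dx = x - x0
    gy = gAx0.copy()
    if np.any(dx):
        j = rng.choice(len(dx), p=np.abs(dx) / np.abs(dx).sum())
        gy += A[:, j] * np.abs(dx).sum() * np.sign(dx[j])

    return gx, gy  # unbiased estimates of (A^T y, A x)
```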
Basic proximal method (with parameter α):

    (x_{k+1}, y_{k+1}) ← argmin_{x∈𝒳} max_{y∈𝒴} { f(x, y) + (α/2)∥x − x_k∥² − (α/2)∥y − y_k∥² }

Nemirovski's "conceptual prox-method" replaces the exact prox by two steps:

    (x_{k+1/2}, y_{k+1/2}) ← rough solution to the proximal problem
    (x_{k+1},   y_{k+1})   ← extra-gradient step from (x_k, y_k), using the exact gradient at (x_{k+1/2}, y_{k+1/2})

Method                                        # of iterations            cost per iteration
Basic proximal method                         α/ε                        cost of prox
Conceptual prox-method                        α/ε                        cost of rough prox + Texact
Mirror-prox:                                  L/ε                        Texact  (= n²)
  rough solution = a gradient step, α = L
Our approach:                                 (L/ε)·√(Tstoch/Texact)     Tstoch·L²/α² + Texact
  rough solution = centered stochastic
  gradient steps, α = L·√(Tstoch/Texact)

Bounding the cost of the rough prox is the main technical development: it takes L²/α² = Texact/Tstoch centered stochastic gradient steps, each costing Tstoch, plus one re-centering (an exact gradient, cost Texact) per outer iteration. The cost per iteration therefore stays ≈ Texact while the number of iterations shrinks by √(Texact/Tstoch), giving total runtime ∝ (L/ε)·√(Tstoch·Texact) = (L/ε)·n^{3/2} instead of mirror-prox's (L/ε)·n².
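To show how the pieces fit together, here is a rough sketch of the outer loop (my paraphrase of the scheme above, not the paper's exact algorithm: step sizes, iterate averaging, and the proximal terms are simplified, and plain entropic mirror-descent steps stand in for the prox steps; it reuses centered_bilinear_estimator from the sketch above):

```python
import numpy as np

def md_step(z, g, eta):
    """Entropic mirror-descent step on the simplex (multiplicative weights)."""
    w = z * np.exp(-eta * g)
    return w / w.sum()

def vr_extragradient(A, T_outer=50, rng=None):
    """Sketch of the outer loop: each iteration re-centers the estimator
    (one exact gradient, cost ~n^2), runs ~T_exact/T_stoch cheap centered
    stochastic steps to roughly solve the proximal subproblem, then takes
    one exact extra-gradient step. Constants and step sizes are illustrative."""
    rng = np.random.default_rng() if rng is None else rng
    m, n = A.shape
    x, y = np.full(n, 1 / n), np.full(m, 1 / m)
    L = np.abs(A).max()
    eta = 1.0 / (2 * L)                  # illustrative step size
    for _ in range(T_outer):
        x0, y0 = x.copy(), y.copy()      # [re-center]
        gATy0, gAx0 = A.T @ y0, A @ x0   # exact gradient at the center, O(n^2)
        xk, yk = x0.copy(), y0.copy()
        for _ in range(n):               # ≈ T_exact / T_stoch inner steps, O(n) each
            gx, gy = centered_bilinear_estimator(A, xk, yk, x0, y0, gAx0, gATy0, rng)
            xk = md_step(xk, gx, eta)    # x minimizes
            yk = md_step(yk, -gy, eta)   # y maximizes
        # extra-gradient step from (x0, y0) with the exact gradient at (xk, yk)
        x = md_step(x0, A.T @ yk, eta)
        y = md_step(y0, -(A @ xk), eta)
    return x, y
```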
In summary:  min_{x∈𝒳} max_{y∈𝒴} f(x, y)

๏ Centered gradient estimator:  Var g_{z₀}(z) ≤ L²∥z − z₀∥²
๏ Outer loop:  (x₀, y₀) → (x_{1/2}, y_{1/2}) → (x₁, y₁) → …  [re-center at each step]
๏ Sampling from the difference:
      gˣ = A⊤y₀ + A_{i:}⊤ ∥y − y₀∥₁ sign([y − y₀]_i),   i ∼ |y − y₀| / ∥y − y₀∥₁
      gʸ = Ax₀ + A_{:j} ∥x − x₀∥₁ sign([x − x₀]_j),     j ∼ |x − x₀| / ∥x − x₀∥₁
๏ Geometry matters.
๏ Runtimes: variance reduction (our approach) vs. exact gradient (Nemirovski '04, Nesterov '07) vs. stochastic gradient (GK '95, CHW '10) — always better than exact gradient, and better than stochastic gradient whenever Ω(1) passes over A are needed.

Image credit: Chawit Waewsawangwong

Poster #212