Robust ML Training with Conditional Gradients
Sebastian Pokutta
Technische Universität Berlin and Zuse Institute Berlin
pokutta@math.tu-berlin.de @spokutta
CO@Work 2020 Summer School, September 2020
Berlin Mathematics Research Center
Opportunities in Berlin
Shameless plug
Postdoc and PhD positions in optimization/ML. At Zuse Institute Berlin and TU Berlin.
What is this talk about?
Introduction
Can we train, e.g., Neural Networks so that they are (more) robust to noise and adversarial attacks?
Outline
A simple example
The basic setup of supervised Machine Learning
Stochastic Gradient Descent
Stochastic Conditional Gradient Descent
(Hyperlinked) References are not exhaustive; check the references contained therein. Statements are simplified for the sake of exposition.
Supervised Machine Learning and ERM
A simple example
Consider the following simple learning problem, a.k.a. linear regression:
Given: Set of points X = {x1, . . . , xk} ⊆ Rn and a vector y = (y1, . . . , yk) ∈ Rk.
Find: Linear function θ ∈ Rn such that xiθ ≈ yi for all i ∈ [k], i.e., Xθ ≈ y.
[Wikipedia]
The search for the best θ can be naturally cast as an optimization problem:
min_θ ∑_{i∈[k]} |xiθ − yi|² = min_θ ‖Xθ − y‖₂²  (linReg)
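To make (linReg) concrete, here is a minimal numpy sketch that generates a small synthetic data set and solves the least-squares problem in closed form; the data, dimensions, and variable names are illustrative only, not taken from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

k, n = 100, 5                        # number of points and dimension (illustrative)
X = rng.normal(size=(k, n))          # data matrix with rows x_1, ..., x_k
theta_true = rng.normal(size=n)      # ground-truth linear function
y = X @ theta_true + 0.1 * rng.normal(size=k)   # noisy targets

# Solve (linReg): min_theta ||X theta - y||_2^2 via least squares
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

print("objective:", np.sum((X @ theta_hat - y) ** 2))
```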
Supervised Machine Learning and ERM
A simple example
Consider the following simple learning problem, a.k.a. linear regression: Given: Set of points X {x1,. . .,xk} ⊆ Rn Vector y (y1,. . .,yk) ∈ Rk Find: Linear function θ ∈ Rn, such that xiθ ≈ yi ∀i ∈ [k],
Xθ ≈ y.
[Wikipedia]
The search for the best θ can be naturally cast as an optimization problem: min
θ
|xiθ − yi|2 = min
θ
Xθ − y2
2
(linReg)
Sebastian Pokutta · Training with Conditional Gradients 3 / 14
Supervised Machine Learning and ERM
Empirical Risk Minimization
More generally, we are interested in the Empirical Risk Minimization problem:
min_θ L(θ) = min_θ (1/|D|) ∑_{(x,y)∈D} ℓ(f(x,θ), y).  (ERM)
The ERM approximates the General Risk Minimization problem:
min_θ E_{(x,y)} ℓ(f(x,θ), y),  (GRM)
where the expectation is over the underlying data distribution.
Note: If D is chosen large enough, under relatively mild assumptions, a solution to (ERM) is a good approximation to a solution to (GRM):
L_GRM(θ) ≤ L_ERM(θ) + O(√(log(1/δ)/|D|)),
with probability 1 − δ. This bound is typically very loose.
[see, e.g., Suriya Gunasekar's lecture notes] [The Elements of Statistical Learning, Hastie et al.]
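As a generic illustration of (ERM), the sketch below evaluates the empirical risk of a model on a finite data set; the helper names (empirical_risk, model, loss, data) are hypothetical placeholders, not part of the talk.

```python
import numpy as np

def empirical_risk(theta, data, model, loss):
    """Empirical risk L(theta) = (1/|D|) * sum over (x, y) in D of loss(model(x, theta), y)."""
    return np.mean([loss(model(x, theta), y) for (x, y) in data])

# Example with the linear-regression setup from the previous slide:
model = lambda x, theta: x @ theta        # f(x, theta) = x theta
loss = lambda z, y: (z - y) ** 2          # squared loss
data = [(np.array([1.0, 2.0]), 3.0), (np.array([0.0, 1.0]), 1.0)]
print(empirical_risk(np.array([1.0, 1.0]), data, model, loss))
```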
Supervised Machine Learning and ERM
Empirical Risk Minimization: Examples
ℓ(zi, yi) = |zi − yi|² and zi = f(θ, xi) = xiθ  (squared loss)
ℓ(zi, yi) = −∑_{c∈[C]} yi,c log zi,c and, e.g., zi = f(θ, xi) = xiθ (or a neural network)  (cross-entropy loss)
ℓ(zi, yi) = yi max(0, 1 − zi) + (1 − yi) max(0, 1 + zi) and zi = f(θ, xi) = xiθ  (hinge-type loss)
ℓ(zi, yi) some loss function and zi = f(θ, xi) a neural network with weights θ
...and many more choices and combinations possible.
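For concreteness, here are plain numpy versions of the three losses above for a single example; this is one possible implementation, with illustrative inputs only.

```python
import numpy as np

def squared_loss(z, y):
    # l(z, y) = |z - y|^2
    return (z - y) ** 2

def cross_entropy_loss(z, y):
    # l(z, y) = - sum_c y_c log z_c, with z a probability vector and y one-hot
    return -np.sum(y * np.log(z))

def hinge_loss(z, y):
    # l(z, y) = y * max(0, 1 - z) + (1 - y) * max(0, 1 + z), with y in {0, 1}
    return y * max(0.0, 1.0 - z) + (1 - y) * max(0.0, 1.0 + z)

print(squared_loss(0.8, 1.0))
print(cross_entropy_loss(np.array([0.7, 0.2, 0.1]), np.array([1.0, 0.0, 0.0])))
print(hinge_loss(0.5, 1))
```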
Optimizing the ERM Problem
Stochastic Gradient Descent
How to solve Problem (ERM)? Simple idea: Gradient Descent
[see blog for background on convex optimization]
θt+1 ← θt − η∇L(θt)  (GD)
Unfortunately, this might be too expensive if (ERM) has a lot of summands. However, reexamine:
∇L(θ) = ∇ (1/|D|) ∑_{(x,y)∈D} ℓ(f(x,θ), y) = (1/|D|) ∑_{(x,y)∈D} ∇ℓ(f(x,θ), y).  (ERMgrad)
Thus if we sample (x,y) ∈ D uniformly at random, then
∇L(θ) = E_{(x,y)∈D} ∇ℓ(f(x,θ), y).  (gradEst)
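A minimal sketch of the resulting single-sample update for the linear-regression ERM; the step size, iteration count, and data are illustrative choices, not values from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
k, n = 100, 5
X = rng.normal(size=(k, n))
y = X @ rng.normal(size=n) + 0.1 * rng.normal(size=k)

theta = np.zeros(n)
eta = 0.01                                     # fixed learning rate (illustrative)
for t in range(10000):
    i = rng.integers(k)                        # sample (x_i, y_i) uniformly at random
    grad = 2 * (X[i] @ theta - y[i]) * X[i]    # gradient of |x_i theta - y_i|^2 w.r.t. theta
    theta -= eta * grad                        # theta_{t+1} = theta_t - eta * grad

print("final objective:", np.sum((X @ theta - y) ** 2))
```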
Optimizing the ERM Problem
Stochastic Gradient Descent
This leads to Stochastic Gradient Descent:
θt+1 ← θt − η∇ℓ(f(x,θt), y) with (x,y) ∼ D.  (SGD)
Typical variants include:
Batch versions. Rather than just taking one stochastic gradient, sample and average a mini-batch. This also reduces the variance of the gradient estimator (see the sketch after this list).
Learning rate schedules. To ensure convergence, the learning rate is dynamically managed.
Adaptive Variants and Momentum. RMSProp, Adagrad, Adadelta, Adam, ...
Variance Reduction. Compute the exact gradient once in a while as a reference point, e.g., SVRG.
[for an overview of variants: blog of Sebastian Ruder]
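Building on the single-sample loop sketched earlier, this snippet combines two of the variants just listed, mini-batching and a decaying learning-rate schedule; the batch size and the 1/√t schedule are example choices, not prescriptions from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
k, n = 100, 5
X = rng.normal(size=(k, n))
y = X @ rng.normal(size=n) + 0.1 * rng.normal(size=k)

theta = np.zeros(n)
batch_size = 16
for t in range(1, 2001):
    eta = 0.05 / np.sqrt(t)                          # decaying learning-rate schedule
    idx = rng.choice(k, size=batch_size, replace=False)
    residual = X[idx] @ theta - y[idx]
    grad = 2 * X[idx].T @ residual / batch_size      # averaged mini-batch gradient
    theta -= eta * grad

print("final objective:", np.sum((X @ theta - y) ** 2))
```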
A comparison between different variants
Stochastic Gradient Descent
[Graphics from the blog of Sebastian Ruder; see also for animations]
(More) robust ERM training
Stochastic Conditional Gradients
Recall Problem (ERM):
min_θ (1/|D|) ∑_{(x,y)∈D} ℓ(f(x,θ), y).
In the standard formulation θ is unbounded and can get quite large. Large parameters for, e.g., Neural Networks lead to large Lipschitz constants, and the trained network becomes sensitive to input noise and perturbations.
[Tsuzuku, Sato, Sugiyama, 2018]
[Figure: test set accuracy over 100 epochs, without noise and with input noise (σ = 0.3 and σ = 0.6).]
Performance for Neural Network trained on MNIST.
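To run this kind of robustness check oneself, one can evaluate the trained model on perturbed inputs; the sketch below assumes a PyTorch model and additive Gaussian input noise with standard deviation sigma, an assumption suggested by the σ in the figure rather than a detail stated on the slide.

```python
import torch

@torch.no_grad()
def noisy_accuracy(model, loader, sigma=0.3):
    """Test accuracy when Gaussian noise with std sigma is added to the inputs."""
    correct, total = 0, 0
    for inputs, labels in loader:
        noisy = inputs + sigma * torch.randn_like(inputs)   # perturb the inputs
        preds = model(noisy).argmax(dim=1)                   # predicted class per example
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / total
```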
(More) robust ERM training
Stochastic Conditional Gradients
(Partial) Solution. Constrained ERM training:
min_{θ∈P} (1/|D|) ∑_{(x,y)∈D} ℓ(f(x,θ), y),  (cERM)
where P is a compact convex set.
The Frank-Wolfe Algorithm a.k.a. Conditional Gradients
Stochastic Conditional Gradients
[Frank, Wolfe, 1956] [Levitin, Polyak, 1966]
Algorithm: Frank-Wolfe Algorithm (FW)
1: x0 ∈ V
2: for t = 0 to T − 1 do
3:   vt ← arg min_{v∈V} ⟨∇f(xt), v⟩
4:   xt+1 ← xt + γt(vt − xt)
5: end for
FW minimizes f over P = conv(V) by sequentially picking up vertices.
The final iterate xT has cardinality at most T + 1.
Very easy implementation.
Algorithm is robust and depends on few parameters.
[Figure: FW step for f(x) = ‖x − x∗‖₂², showing xt, vt, −∇f(xt), x∗, and xt+1.]
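A minimal numpy sketch of the FW loop above, using the ℓ1 ball of radius τ as feasible region so that the arg min over vertices (the linear minimization oracle) has a closed form; the quadratic objective, the radius, and the step-size rule γt = 2/(t+2) are standard illustrative choices, not specifics from the slide.

```python
import numpy as np

def frank_wolfe_l1(grad, x0, tau=1.0, T=200):
    """Frank-Wolfe over the l1 ball {x : ||x||_1 <= tau}."""
    x = x0.copy()
    for t in range(T):
        g = grad(x)
        # LMO: the vertex of the l1 ball minimizing <g, v> is -tau * sign(g_i) * e_i
        i = np.argmax(np.abs(g))
        v = np.zeros_like(x)
        v[i] = -tau * np.sign(g[i])
        gamma = 2.0 / (t + 2.0)              # standard FW step size
        x = x + gamma * (v - x)              # x_{t+1} = x_t + gamma_t (v_t - x_t)
    return x

# Example: f(x) = ||x - x_star||_2^2 with gradient 2 (x - x_star)
x_star = np.array([0.6, -0.2, 0.1])
x = frank_wolfe_l1(lambda x: 2 * (x - x_star), x0=np.zeros(3))
print(x)
```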
Does it work?
Stochastic Conditional Gradients
As before, choose an unbiased gradient estimator ∇̃f(xt) with E[∇̃f(xt)] = ∇f(xt).
Algorithm: Stochastic Frank-Wolfe Algorithm (SFW)
1: x0 ∈ V
2: for t = 0 to T − 1 do
3:   vt ← arg min_{v∈V} ⟨∇̃f(xt), v⟩
4:   xt+1 ← xt + γt(vt − xt)
5: end for
Similarly, many variants are available:
Batch versions. Rather than just taking one stochastic gradient, sample and average a mini-batch. This also reduces the variance of the gradient estimator.
Learning rate schedules. To ensure convergence, the learning rate is dynamically managed.
Variance Reduction. Compute the exact gradient once in a while as a reference point, e.g., SVRF, SVRCGS, ...
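For the constrained training setting of (cERM), a stochastic Frank-Wolfe step over the network parameters might look as follows in PyTorch; the ℓ∞ ball (hypercube) as feasible region P, its radius, and the helper name sfw_step are assumptions made for illustration, not choices fixed by the talk.

```python
import torch

def sfw_step(parameters, lr, radius=1.0):
    """One Stochastic Frank-Wolfe step over P = {theta : ||theta||_inf <= radius},
    using the mini-batch gradients already stored in .grad (call loss.backward() first)."""
    with torch.no_grad():
        for p in parameters:
            if p.grad is None:
                continue
            # LMO over the l_inf ball: the vertex minimizing <grad, v> is -radius * sign(grad)
            v = -radius * torch.sign(p.grad)
            # theta_{t+1} = theta_t + gamma_t (v_t - theta_t)
            p.add_(lr * (v - p))

# Usage inside a training loop (model, inputs, labels, loss_fn assumed to exist):
#   loss = loss_fn(model(inputs), labels)
#   model.zero_grad(); loss.backward()
#   sfw_step(model.parameters(), lr=2.0 / (t + 2.0))
```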
Does it work?
Stochastic Conditional Gradients
Same setup as before. SGD and SFW as solvers.
[Figure: test set accuracy over 100 epochs for Frank-Wolfe vs. Gradient Descent, without noise and with input noise (σ = 0.3 and σ = 0.6).]
Performance for Neural Network trained on MNIST.
More details and experiments in the exercise...
Thank you!