
Robust ML Training with Conditional Gradients

Sebastian Pokutta

Technische Universität Berlin and Zuse Institute Berlin

pokutta@math.tu-berlin.de @spokutta

CO@Work 2020 Summer School, September 2020

Berlin Mathematics Research Center MATH+


Opportunities in Berlin

Shameless plug

Postdoc and PhD positions in optimization/ML at Zuse Institute Berlin and TU Berlin.

Sebastian Pokutta · Training with Conditional Gradients 1 / 14


What is this talk about?

Introduction

Can we train, e.g., Neural Networks so that they are (more) robust to noise and adversarial attacks?

Outline

  • A simple example
  • The basic setup of supervised Machine Learning
  • Stochastic Gradient Descent
  • Stochastic Conditional Gradient Descent

(Hyperlinked) References are not exhaustive; check the references contained therein. Statements are simplified for the sake of exposition.

Sebastian Pokutta · Training with Conditional Gradients 2 / 14


Supervised Machine Learning and ERM

A simple example

Consider the following simple learning problem, a.k.a. linear regression:

Given: a set of points X = {x1, . . . , xk} ⊆ Rⁿ and a vector y = (y1, . . . , yk) ∈ Rᵏ.

Find: a linear function θ ∈ Rⁿ such that xiθ ≈ yi for all i ∈ [k],

or, in matrix form:

Xθ ≈ y.

[Wikipedia]

The search for the best θ can be naturally cast as an optimization problem:

min_θ ∑_{i∈[k]} |xiθ − yi|² = min_θ ‖Xθ − y‖₂²   (linReg)
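To make (linReg) concrete, a minimal numpy sketch on synthetic data (illustrative only; the data and names are made up), solving the least-squares problem directly:

```python
import numpy as np

# Synthetic instance of (linReg): k = 100 points in R^3 with noisy linear labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # rows are the points x_1, ..., x_k
theta_true = np.array([2.0, -1.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=100)

# Solve min_theta ||X theta - y||_2^2 by least squares.
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta_hat)                          # close to theta_true
```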

Sebastian Pokutta · Training with Conditional Gradients 3 / 14


Supervised Machine Learning and ERM

Empirical Risk Minimization

More generally, we are interested in the Empirical Risk Minimization problem:

min_θ L̂(θ) := min_θ (1/|D|) ∑_{(x,y)∈D} ℓ(f(x,θ), y).   (ERM)

The ERM approximates the General Risk Minimization problem:

min_θ L(θ) := min_θ E_{(x,y)∼D} ℓ(f(x,θ), y).   (GRM)

Note: If D is chosen large enough, under relatively mild assumptions, a solution to (ERM) is a good approximation to a solution to (GRM):

L(θ) ≤ L̂(θ) + √((log|Θ| + log(1/δ)) / |D|),

with probability 1 − δ. This bound is typically very loose.
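In display form, the reconstruction above reads as follows (LaTeX; the exact constants inside the square root may differ on the original slide):

```latex
% Empirical vs. general risk and the (loose) generalization bound
\begin{align*}
  \min_{\theta}\; \hat{L}(\theta) &:= \min_{\theta}\; \frac{1}{|D|}\sum_{(x,y)\in D} \ell\bigl(f(x,\theta),y\bigr) \tag{ERM}\\
  \min_{\theta}\; L(\theta) &:= \min_{\theta}\; \mathbb{E}_{(x,y)\sim D}\, \ell\bigl(f(x,\theta),y\bigr) \tag{GRM}\\
  L(\theta) &\leq \hat{L}(\theta) + \sqrt{\frac{\log|\Theta| + \log\tfrac{1}{\delta}}{|D|}}
  \quad\text{with probability } 1-\delta.
\end{align*}
```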

[e.g., Suriya Gunasekar's lecture notes] [The Elements of Statistical Learning, Hastie et al.] Sebastian Pokutta · Training with Conditional Gradients 4 / 14


Supervised Machine Learning and ERM

Empirical Risk Minimization: Examples

  • 1. Linear Regression

ℓ(zi, yi) = |zi − yi|² and zi = f(θ, xi) = xiθ

  • 2. Classification / Logistic Regression over classes C

ℓ(zi, yi) = −∑_{c∈[C]} yi,c log zi,c and, e.g., zi = f(θ, xi) = xiθ (or a neural network)

  • 3. Support Vector Machines

ℓ(zi, yi) = yi max(0, 1 − zi) + (1 − yi) max(0, 1 + zi) and zi = f(θ, xi) = xiθ

  • 4. Neural Networks

ℓ(zi, yi) = any suitable loss function and zi = f(θ, xi) = a neural network with weights θ. ...and many more choices and combinations are possible (a small sketch of the first three losses follows below).
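As a rough illustration of the first three losses, a small numpy sketch (function names are made up; applying a softmax to raw scores in the classification loss is an assumption):

```python
import numpy as np

# Illustrative per-example losses from the list above (names are made up).

def squared_loss(z, y):
    # 1. Linear regression: |z - y|^2
    return (z - y) ** 2

def cross_entropy_loss(z, y_onehot):
    # 2. Classification / logistic regression: -sum_c y_c log z_c,
    #    here with a softmax applied to raw scores z (an assumption).
    p = np.exp(z - np.max(z))
    p = p / np.sum(p)
    return -np.sum(y_onehot * np.log(p + 1e-12))

def svm_loss(z, y):
    # 3. SVM-style loss with y in {0, 1}, as written on the slide.
    return y * max(0.0, 1.0 - z) + (1 - y) * max(0.0, 1.0 + z)

print(squared_loss(0.8, 1.0),
      svm_loss(0.8, 1),
      cross_entropy_loss(np.array([2.0, -1.0]), np.array([1.0, 0.0])))
```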

Sebastian Pokutta · Training with Conditional Gradients 5 / 14


Optimizing the ERM Problem

Stochastic Gradient Descent

How to solve Problem (ERM)? Simple idea: Gradient Descent

[see blog for background on conv opt]

θt+1 ← θt − η ∇L(θt)   (GD)

Unfortunately, this might be too expensive if (ERM) has a lot of summands. However, reexamine:

∇L(θ) = ∇ (1/|D|) ∑_{(x,y)∈D} ℓ(f(x,θ), y) = (1/|D|) ∑_{(x,y)∈D} ∇ℓ(f(x,θ), y).   (ERMgrad)

Thus if we sample (x,y) ∈ D uniformly at random, then

∇L(θ) = E_{(x,y)∼D} [∇ℓ(f(x,θ), y)].   (gradEst)
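The unbiasedness in (gradEst) is easy to check numerically; a small sketch for the linear-regression loss on synthetic data (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))
theta_true = rng.normal(size=5)
y = X @ theta_true + 0.1 * rng.normal(size=1000)
theta = np.zeros(5)

def grad_single(i, theta):
    # gradient of |x_i theta - y_i|^2 with respect to theta
    return 2.0 * (X[i] @ theta - y[i]) * X[i]

full_grad = 2.0 * X.T @ (X @ theta - y) / len(y)        # (ERMgrad)
est = np.mean([grad_single(i, theta)
               for i in rng.integers(0, len(y), 20000)], axis=0)
print(np.allclose(full_grad, est, atol=1e-1))           # estimator matches in expectation
```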

Sebastian Pokutta · Training with Conditional Gradients 6 / 14


Optimizing the ERM Problem

Stochastic Gradient Descent

This leads to Stochastic Gradient Descent:

θt+1 ← θt − η ∇ℓ(f(x,θt), y) with (x,y) ∼ D,   (SGD)

  • one of the most-used algorithms for ML training (together with its many variants).

Typical variants include

  • Batch versions. Rather than just taking one stochastic gradient, sample and average a mini batch. This also reduces the variance of the gradient estimator.

  • Learning rate schedules. To ensure convergence the learning rate η is dynamically managed.

  • Adaptive Variants and Momentum. RMSProp, Adagrad, Adadelta, Adam, ...

  • Variance Reduction. Compute the exact gradient once in a while as a reference point, e.g., SVRG.
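Putting (SGD) together for the linear-regression ERM, a bare-bones sketch with a constant learning rate and none of the variants above (synthetic data, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 5))
theta_true = rng.normal(size=5)
y = X @ theta_true + 0.1 * rng.normal(size=1000)

theta = np.zeros(5)
eta = 0.01                                      # fixed learning rate (no schedule)
for t in range(20000):
    i = rng.integers(len(y))                    # sample (x, y) uniformly from D
    grad = 2.0 * (X[i] @ theta - y[i]) * X[i]   # stochastic gradient of the squared loss
    theta -= eta * grad                         # SGD update
print(np.linalg.norm(theta - theta_true))       # should be small
```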

[for an overview of variants: blog of Sebastian Ruder] Sebastian Pokutta · Training with Conditional Gradients 7 / 14


A comparison between different variants

Stochastic Gradient Descent

[Graphics from the blog of Sebastian Ruder; see there also for animations] Sebastian Pokutta · Training with Conditional Gradients 8 / 14


(More) robust ERM training

Stochastic Conditional Gradients

Recall Problem (ERM):

min_θ (1/|D|) ∑_{(x,y)∈D} ℓ(f(x,θ), y).

In the standard formulation θ is unbounded and can get quite large.

  • Problem. Large θ for, e.g., Neural Networks leads to large Lipschitz constants. The trained network becomes sensitive to input noise and perturbations.

[Tsuzuku, Sato, Sugiyama, 2018]

[Figure: test set accuracy over 100 epochs, on the clean test set and with input noise σ = 0.3 and σ = 0.6.]

Performance for Neural Network trained on MNIST. Sebastian Pokutta · Training with Conditional Gradients 9 / 14


(More) robust ERM training

Stochastic Conditional Gradients

(Partial) Solution. Constrained ERM training:

min_{θ∈P} (1/|D|) ∑_{(x,y)∈D} ℓ(f(x,θ), y),   (cERM)

where P is a compact convex set.

  • Rationale. Find "better conditioned" local minima θ.

Sebastian Pokutta · Training with Conditional Gradients 10 / 14


The Frank-Wolfe Algorithm a.k.a. Conditional Gradients

Stochastic Conditional Gradients

[Frank, Wolfe, 1956] [Levitin, Polyak, 1966]

Algorithm Frank-Wolfe Algorithm (FW)

1: x0 ∈ V
2: for t = 0 to T − 1 do
3:   vt ← arg min_{v∈V} ⟨∇f(xt), v⟩
4:   xt+1 ← xt + γt(vt − xt)
5: end for

  • FW minimizes f over conv(V) by sequentially picking up vertices
  • The final iterate xT has cardinality at most T + 1
  • Very easy implementation
  • Algorithm is robust and depends on few parameters

[Figure: one FW step for f(x) = ‖x − x∗‖₂² over a polytope, showing the iterate xt, the vertex vt, the direction −∇f(xt), the optimum x∗, and the next iterate xt+1.]
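A minimal sketch of FW matching this picture, with f(x) = ‖x − x∗‖₂², the ℓ1 ball as conv(V) (its vertices are ±radius·eᵢ), and the standard step size γt = 2/(t+2); illustrative only, the function and oracle names are made up:

```python
import numpy as np

def frank_wolfe_l1(grad_f, x0, radius=1.0, T=200):
    """Frank-Wolfe over the l1 ball of given radius (vertices are +-radius * e_i)."""
    x = x0.copy()
    for t in range(T):
        g = grad_f(x)
        # Linear minimization oracle over the l1 ball: pick the coordinate with
        # largest |gradient| and move to the corresponding signed vertex.
        i = np.argmax(np.abs(g))
        v = np.zeros_like(x)
        v[i] = -radius * np.sign(g[i])
        gamma = 2.0 / (t + 2.0)          # standard agnostic step size
        x = x + gamma * (v - x)          # convex combination, stays feasible
    return x

# Example: f(x) = ||x - x_star||_2^2 with x_star inside the l1 ball.
x_star = np.array([0.3, -0.2, 0.1])
x_T = frank_wolfe_l1(lambda x: 2.0 * (x - x_star), np.zeros(3), radius=1.0, T=500)
print(x_T)                               # close to x_star
```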

Sebastian Pokutta · Training with Conditional Gradients 11 / 14


Does it work?

Stochastic Conditional Gradients

As before, choose an unbiased gradient estimator ∇̃f(xt) with E[∇̃f(xt)] = ∇f(xt).

Algorithm Stochastic Frank-Wolfe Algorithm (SFW)

1: x0 ∈ V
2: for t = 0 to T − 1 do
3:   vt ← arg min_{v∈V} ⟨∇̃f(xt), v⟩
4:   xt+1 ← xt + γt(vt − xt)
5: end for

Similarly, many variants are available:

  • Batch versions. Rather than just taking one stochastic gradient, sample and average a mini batch. This also reduces the variance of the gradient estimator.

  • Learning rate schedules. To ensure convergence the step size γt is dynamically managed.

  • Variance Reduction. Compute the exact gradient once in a while as a reference point, e.g., SVRF, SVRCGS, ...
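Combining the two ideas, a bare-bones SFW sketch for the linear-regression ERM constrained to an ℓ1 ball, using a mini-batch gradient estimator (synthetic data, illustrative only; a plain fixed-batch estimator like this typically only reaches a neighborhood of the optimum without variance reduction or growing batch sizes):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 5))
theta_true = np.array([0.3, -0.2, 0.1, 0.0, 0.2])   # inside the l1 ball used below
y = X @ theta_true + 0.05 * rng.normal(size=1000)

def stochastic_grad(theta, batch=32):
    # mini-batch estimator of the ERM gradient for the squared loss
    idx = rng.integers(0, len(y), batch)
    return 2.0 * X[idx].T @ (X[idx] @ theta - y[idx]) / batch

theta, radius = np.zeros(5), 1.0
for t in range(2000):
    g = stochastic_grad(theta)
    i = np.argmax(np.abs(g))                         # LMO over the l1 ball
    v = np.zeros(5)
    v[i] = -radius * np.sign(g[i])
    theta += (2.0 / (t + 2.0)) * (v - theta)         # SFW step
print(np.linalg.norm(theta - theta_true))            # close to theta_true, up to stochastic error
```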

Sebastian Pokutta · Training with Conditional Gradients 12 / 14


Does it work?

Stochastic Conditional Gradients

Same setup as before. SGD and SFW as solvers.

[Figure: test set accuracy over 100 epochs for Frank-Wolfe and Gradient Descent, on the clean test set and with input noise σ = 0.3 and σ = 0.6.]

Performance for Neural Network trained on MNIST.

More details and experiments in the exercise...

Sebastian Pokutta · Training with Conditional Gradients 13 / 14


Thank you!

Sebastian Pokutta · Training with Conditional Gradients 14 / 14