RandomShuffle Beats SGD after Finite Epochs - PowerPoint PPT Presentation


slide-1
SLIDE 1

RandomShuffle Beats SGD after Finite Epochs

Jeff HaoChen, Tsinghua University
Suvrit Sra, Massachusetts Institute of Technology

slide-2
SLIDE 2

Introduction

  • Goal: to minimize the finite-sum function f(x) = (1/n) Σ_{i=1}^n f_i(x)
slide-3
SLIDE 3
  • SGD with replacement (often appears in algorithm analysis):
      x_t = x_{t−1} − η ∇f_{r(t)}(x_{t−1})
    where r(t) is drawn uniformly at random from [n], for 1 ≤ t ≤ T
  • SGD without replacement (often appears in practice):
      x_t^k = x_{t−1}^k − η ∇f_{σ_k(t)}(x_{t−1}^k)
    where σ_k is a uniformly random permutation of [n] and 1 ≤ t ≤ n within epoch k
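The two update rules above can be sketched in a few lines of Python (a minimal illustration, not code from the talk; `grads` stands for a list of the component gradients ∇f_i):

```python
import numpy as np

def sgd_with_replacement(x, grads, eta, T, rng):
    """x_t = x_{t-1} - eta * grad f_{r(t)}(x_{t-1}), with r(t) ~ Uniform([n])."""
    n = len(grads)
    for _ in range(T):
        i = rng.integers(n)            # may pick the same component repeatedly
        x = x - eta * grads[i](x)
    return x

def random_shuffle(x, grads, eta, epochs, rng):
    """Each epoch visits every component exactly once, in a fresh random order."""
    n = len(grads)
    for _ in range(epochs):
        for i in rng.permutation(n):   # sigma_k: uniformly random permutation of [n]
            x = x - eta * grads[i](x)
    return x
```

For example, for components f_i(x) = ½(x − b_i)² one would pass `grads = [lambda x, b=b: x - b for b in bs]`; both methods then approach the mean of the b_i.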

Introduction

We call the with-replacement method SGD, and the without-replacement method RandomShuffle.

slide-6
SLIDE 6
  • So a natural question: which one is better?
  • A Numerical Comparison: (Bottou, 2009)

(plot comparing the convergence of SGD and RandomShuffle)
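The comparison plot does not survive this export, but an experiment in the same spirit (a random least-squares problem with a fixed seed; an illustrative setup, not Bottou's exact one) is easy to run:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5
A = rng.standard_normal((n, d))
x_star = rng.standard_normal(d)
b = A @ x_star                       # consistent system, so x_star is optimal

def run(order_fn, eta=0.01, epochs=100):
    x = np.zeros(d)
    for _ in range(epochs):
        for i in order_fn():
            x = x - eta * (A[i] @ x - b[i]) * A[i]   # gradient of 0.5*(a_i^T x - b_i)^2
    return np.linalg.norm(x - x_star)

err_sgd = run(lambda: rng.integers(n, size=n))       # with replacement
err_rs  = run(lambda: rng.permutation(n))            # RandomShuffle
print(err_sgd, err_rs)
```

Note this instance happens to have zero gradient variance at the optimum (the system is consistent), so both methods converge; on runs like this the shuffled error is typically the smaller of the two.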

slide-8
SLIDE 8

Introduction

  • Why?
  • Intuitively, we should prefer RandomShuffle for two reasons:
  • It uses more "information" in one epoch (by visiting every component)
  • Its stochastic gradients have smaller variance over one epoch
  • However, what is a rigorous proof?
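The variance intuition can be made concrete: hold the iterate fixed and compare the sum of sampled gradients over one epoch. A RandomShuffle epoch sums every component exactly once, so its epoch sum is always the full gradient; with-replacement sampling only matches it in expectation (toy numbers below):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10
g = rng.standard_normal(n)   # the n component gradients, evaluated at one fixed point

# Epoch sums under the two sampling schemes:
shuffle_sums = [g[rng.permutation(n)].sum() for _ in range(1000)]
iid_sums = [g[rng.integers(n, size=n)].sum() for _ in range(1000)]

print(np.var(shuffle_sums))  # essentially zero: every component appears exactly once
print(np.var(iid_sums))      # genuinely random: components are hit 0, 1, or more times
```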
slide-11
SLIDE 11

A Brief History

  • Under strong structural assumptions, the problem can be converted into a matrix inequality (Recht and Ré, 2012):
  • Assume the problem is quadratic: f_i(x) = (a_i^T x − b_i)²
  • Then "RandomShuffle is better than SGD after one epoch" is true under their conjectured noncommutative arithmetic-geometric mean inequality
  • Which we still don't know how to prove yet :(
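In one dimension the matrix conjecture collapses to the classical arithmetic-geometric mean inequality, which is easy to check numerically (a scalar sketch with arbitrary curvatures a_i; the matrix version is the part that remains open):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8
a = rng.uniform(0.5, 2.0, size=n)     # scalar "quadratic" curvatures a_i > 0
eta = 0.1                             # small enough that every 1 - eta*a_i > 0

factors = 1.0 - eta * a
one_epoch_shuffle = np.prod(factors)      # without replacement: every factor once
one_epoch_iid = np.mean(factors) ** n     # with replacement: expected contraction

# AM-GM: a product of positive numbers is at most their mean to the n-th power,
# so one RandomShuffle epoch contracts at least as much as SGD does in expectation.
assert 0 < one_epoch_shuffle <= one_epoch_iid
```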
slide-15
SLIDE 15

A Brief History

  • What about the more general situation?
  • We can try to prove a better convergence bound!
  • The hope: prove a faster worst-case convergence rate for RandomShuffle
  • A well-known fact: SGD converges at rate O(1/T):
      E‖x_T − x*‖² ≤ O(1/T)

slide-18
SLIDE 18

A Brief History

  • One recent breakthrough (Gürbüzbalaban et al., 2015):
  • Asymptotically, RandomShuffle has convergence rate O(1/T²)
  • But it is unclear what happens after finitely many epochs
  • In contrast, there is a non-asymptotic result (Shamir, 2016):
  • RandomShuffle is no worse than SGD, with a provable O(1/T) convergence rate
  • But this cannot show that RandomShuffle is strictly faster

What happens in between?

slide-21
SLIDE 21

Summary of Results

We analyze RandomShuffle in the following settings:

  • Strongly convex, Lipschitz Hessian
  • Sparse data
  • Vanishing variance
  • Nonconvex, under PL condition
  • Smooth convex

(Dheeraj Nagaraj et al. later got rid of the Lipschitz-Hessian constraint.)

(This talk focuses on the strongly convex, Lipschitz-Hessian setting.)

slide-25
SLIDE 25

First attempt: try to prove a tighter bound!

  • Can we show a non-asymptotic bound better than O(1/T)? E.g., O(1/T^{1+δ})?
  • If we can, then everything is solved :)
  • ...unless we cannot :(
  • Spoiler: we cannot. Without n entering the bound, O(1/T) is unimprovable, and the next slides prove it.
slide-28
SLIDE 28

Proof of the theorem

  • We only consider the case T = n, i.e., we run one epoch of the algorithm
  • We prove the theorem with a counterexample:
  • Recall the function f(x) = (1/n) Σ_{i=1}^n f_i(x)
  • We set
      f_i(x) = (1/2)(x − b)^T A (x − b)  if i is odd,
      f_i(x) = (1/2)(x + b)^T A (x + b)  if i is even.
  • A and b to be determined later…
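In one dimension the construction reads f_i(x) = (a/2)(x − b)² for odd i and (a/2)(x + b)² for even i, and one RandomShuffle epoch admits the simple closed form that Step 1 of the proof rests on. A scalar sketch (the constants a, b, η, x0 below are arbitrary, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
n, a, b, eta, x0 = 10, 1.0, 1.0, 0.05, 0.7

# A random interleaving of the odd (+b) and even (-b) components:
signs = rng.permutation([+1] * (n // 2) + [-1] * (n // 2))

# Simulate one epoch: the gradient of (a/2)(x - s*b)^2 is a*(x - s*b)
x = x0
for s in signs:
    x = x - eta * a * (x - s * b)

# Closed form: x_n = r^n x_0 + eta*a*b * sum_t s_t r^(n-1-t), with r = 1 - eta*a
r = 1.0 - eta * a
closed = r**n * x0 + eta * a * b * sum(s * r**(n - 1 - t) for t, s in enumerate(signs))
assert abs(x - closed) < 1e-12
```

The sign-weighted geometric sum is exactly the term whose expected square the proof lower-bounds.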
slide-30
SLIDE 30

Proof of the theorem

  • Step 1: Calculate the error:
      E‖x_n − x*‖² = ‖(I − ηA)^n (x_0 − x*)‖² + E‖ Σ_{t=1}^n (−1)^{σ(t)} η (I − ηA)^{n−t} A b ‖²
  • Step 2: Simplify via an eigenvector-basis decomposition:
      first term = Σ_{i=1}^d (1 − ηλ_i)^{2n} u_i²,
      second term = η² Σ_{i=1}^d b_i² λ_i² E[ Σ_{t=1}^n (−1)^{σ(t)} (1 − ηλ_i)^{n−t} ]²
  • Step 3: Construct a contradiction:
      assume, for contradiction, that some η depending on n achieves convergence o(1/n); then ηn(2 − ηλ_i) = 1/λ_i + o(1) must hold


This cannot hold simultaneously for different λ_i!

slide-35
SLIDE 35

What to do next?

  • This means the best non-asymptotic rate we can hope for is O(1/T)
  • Key step: introduce n into the bound
  • The hope: if we can show a bound like O(C(n)/T²), RandomShuffle behaves better :)

Long time: O(1/T²)

Short time: O(1/T)

What happens in between?


slide-37
SLIDE 37

Bounds dependent on n

For general twice-differentiable functions with a Lipschitz Hessian:

slide-38
SLIDE 38

Bounds dependent on n

  • On one hand, RandomShuffle converges with O(1/T² + n³/T³)
  • On the other hand, SGD converges with O(1/T)
  • So the takeaway is: RandomShuffle is provably better than SGD after O(√n) epochs!
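Taking the two bounds at face value (RandomShuffle O(1/T² + n³/T³), SGD O(1/T), all constants dropped), the crossover can be checked directly: with T = K·n iterations, the shuffled bound wins once the epoch count K is on the order of √n:

```python
def rs_bound(T, n):
    return 1.0 / T**2 + n**3 / T**3   # RandomShuffle bound, constants dropped

def sgd_bound(T):
    return 1.0 / T                    # SGD bound, constants dropped

n = 10_000                            # sqrt(n) = 100 is the rough epoch threshold
for K in (1, 10, 1_000):
    T = K * n
    print(K, rs_bound(T, n) < sgd_bound(T))
```

Only the last setting reports True here; at exactly K = √n the constant-free bounds tie, so the switch happens a little above √n epochs.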


slide-40
SLIDE 40

Summary of Results

We analyze RandomShuffle in the following settings:

  • Strongly convex, Lipschitz Hessian
  • Sparse data
  • Vanishing variance
  • Nonconvex, under PL condition
  • Smooth convex
slide-41
SLIDE 41

Sparse setting

  • A sparse problem can be written as: f(x) = Σ_{i=1}^n f_i(x_{S_i})
  • where each S_i is a subset of the dimensions [d]
  • Consider a conflict graph with n nodes, with an edge (i, j) whenever S_i ∩ S_j ≠ ∅
  • Define the sparsity level of the problem:
      δ = max_{1≤i≤n} |{ S_j : S_i ∩ S_j ≠ ∅ }| / n
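The sparsity level δ is straightforward to compute from the supports S_i (the toy supports below are made up for illustration):

```python
def sparsity_level(supports):
    """delta = max_i |{ S_j : S_i intersects S_j }| / n (self-intersection counts)."""
    sets = [set(s) for s in supports]
    n = len(sets)
    return max(sum(1 for Sj in sets if Si & Sj) for Si in sets) / n

# Four components over the dimensions {0,...,5}:
supports = [{0, 1}, {1, 2}, {0, 5}, {3, 4}]
print(sparsity_level(supports))   # -> 0.75 (component 0 conflicts with itself, 1, and 2)
```

Fully separable supports give the minimum δ = 1/n; a coordinate shared by all components forces δ = 1.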


slide-43
SLIDE 43

Sparse setting

  • A fact about sparsity: 1/n ≤ δ ≤ 1
  • We have the following improved bound for sparse problems:
  • As a corollary, when δ = O(1/n), there is an O(1/T²) convergence rate!


slide-46
SLIDE 46

Summary of Results

We analyze RandomShuffle in the following settings:

  • Strongly convex, Lipschitz Hessian
  • Sparse data
  • Vanishing variance
  • Nonconvex, under PL condition
  • Smooth convex
slide-47
SLIDE 47

When Variance Vanishes

  • Suppose the variance vanishes at the optimum: ∇f_i(x*) = 0 for all i
  • Given n pairs of numbers 0 ≤ μ_i ≤ L_i, an optimal solution x* ∈ ℝ^d, and an initial upper bound D on the distance to it
  • A valid problem consists of n functions and an initial point x_0 such that:
  • f_i is μ_i-strongly convex with an L_i-Lipschitz gradient
  • ∇f_i(x*) = 0
  • ‖x_0 − x*‖₂ ≤ D
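A quick way to build a valid problem in this setting is to give every component the same minimizer, e.g. quadratics f_i(x) = (c_i/2)‖x − x*‖², so that μ_i = L_i = c_i. A sketch with made-up dimensions and curvatures:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, D = 3, 5, 2.0
x_star = rng.standard_normal(d)
c = rng.uniform(0.5, 2.0, size=n)            # mu_i = L_i = c_i for these quadratics

grads = [lambda x, ci=ci: ci * (x - x_star) for ci in c]

# Every component gradient vanishes at the shared optimum:
for grad in grads:
    assert np.allclose(grad(x_star), 0.0)

# An initial point within distance D of the optimum:
u = rng.standard_normal(d)
x0 = x_star + 0.5 * D * u / np.linalg.norm(u)
assert np.linalg.norm(x0 - x_star) <= D
```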
slide-50
SLIDE 50

When Variance Vanishes

slide-51
SLIDE 51

When Variance Vanishes

RandomShuffle is provably better than SGD after ANY number of iterations!

slide-52
SLIDE 52

Thanks!