RandomShuffle Beats SGD after Finite Epochs - PowerPoint PPT Presentation


slide-1
SLIDE 1

RandomShuffle Beats SGD after Finite Epochs

Jeff HaoChen, Tsinghua University
Suvrit Sra, Massachusetts Institute of Technology

slide-2
SLIDE 2

Introduction

  • Goal: to minimize the finite-sum function f(x) = (1/n) Σ_{i=1}^n f_i(x)
slide-3
SLIDE 3
  • SGD with replacement (often appears in algorithm analysis):
      x_t = x_{t−1} − η ∇f_{r(t)}(x_{t−1})
    where r(t) is drawn uniformly at random from [n], for 1 ≤ t ≤ T
  • SGD without replacement (often appears in practice):
      x_t^k = x_{t−1}^k − η ∇f_{σ_k(t)}(x_{t−1}^k)
    where σ_k is a uniformly random permutation of [n] and 1 ≤ t ≤ n within epoch k
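The two update rules above can be sketched in a few lines of Python (a minimal illustration, not code from the talk; `grads` stands for a list of the component gradients ∇f_i):

```python
import numpy as np

def sgd_with_replacement(x, grads, eta, T, rng):
    """x_t = x_{t-1} - eta * grad f_{r(t)}(x_{t-1}), with r(t) ~ Uniform([n])."""
    n = len(grads)
    for _ in range(T):
        i = rng.integers(n)            # may pick the same component repeatedly
        x = x - eta * grads[i](x)
    return x

def random_shuffle(x, grads, eta, epochs, rng):
    """Each epoch visits every component exactly once, in a fresh random order."""
    n = len(grads)
    for _ in range(epochs):
        for i in rng.permutation(n):   # sigma_k: uniformly random permutation of [n]
            x = x - eta * grads[i](x)
    return x
```

For example, for components f_i(x) = ½(x − b_i)² one would pass `grads = [lambda x, b=b: x - b for b in bs]`; both methods then approach the mean of the b_i.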

Introduction

We call the with-replacement method SGD, and the without-replacement method RandomShuffle.

slide-6
SLIDE 6
  • So a natural question: which one is better?
  • A Numerical Comparison: (Bottou, 2009)

(plot comparing the convergence of SGD and RandomShuffle)
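The comparison plot does not survive this export, but an experiment in the same spirit (a random least-squares problem with a fixed seed; an illustrative setup, not Bottou's exact one) is easy to run:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5
A = rng.standard_normal((n, d))
x_star = rng.standard_normal(d)
b = A @ x_star                       # consistent system, so x_star is optimal

def run(order_fn, eta=0.01, epochs=100):
    x = np.zeros(d)
    for _ in range(epochs):
        for i in order_fn():
            x = x - eta * (A[i] @ x - b[i]) * A[i]   # gradient of 0.5*(a_i^T x - b_i)^2
    return np.linalg.norm(x - x_star)

err_sgd = run(lambda: rng.integers(n, size=n))       # with replacement
err_rs  = run(lambda: rng.permutation(n))            # RandomShuffle
print(err_sgd, err_rs)
```

Note this instance happens to have zero gradient variance at the optimum (the system is consistent), so both methods converge; on runs like this the shuffled error is typically the smaller of the two.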

slide-8
SLIDE 8

Introduction

  • Why?
  • Intuitively, we should prefer RandomShuffle for two reasons:
  • It uses more "information" in one epoch (by visiting every component)
  • Its stochastic gradients have smaller variance over one epoch
  • However, what is a rigorous proof?
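The variance intuition can be made concrete: hold the iterate fixed and compare the sum of sampled gradients over one epoch. A RandomShuffle epoch sums every component exactly once, so its epoch sum is always the full gradient; with-replacement sampling only matches it in expectation (toy numbers below):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10
g = rng.standard_normal(n)   # the n component gradients, evaluated at one fixed point

# Epoch sums under the two sampling schemes:
shuffle_sums = [g[rng.permutation(n)].sum() for _ in range(1000)]
iid_sums = [g[rng.integers(n, size=n)].sum() for _ in range(1000)]

print(np.var(shuffle_sums))  # essentially zero: every component appears exactly once
print(np.var(iid_sums))      # genuinely random: components are hit 0, 1, or more times
```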
slide-11
SLIDE 11

A Brief History

  • Under strong structural assumptions, the problem can be converted into a matrix inequality (Recht and Ré, 2012):
  • Assume the problem is quadratic: f_i(x) = (a_i^T x − b_i)²
  • Then "RandomShuffle is better than SGD after one epoch" is true under their conjectured noncommutative arithmetic-geometric mean inequality
  • Which we still don't know how to prove yet :(
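In one dimension the matrix conjecture collapses to the classical arithmetic-geometric mean inequality, which is easy to check numerically (a scalar sketch with arbitrary curvatures a_i; the matrix version is the part that remains open):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8
a = rng.uniform(0.5, 2.0, size=n)     # scalar "quadratic" curvatures a_i > 0
eta = 0.1                             # small enough that every 1 - eta*a_i > 0

factors = 1.0 - eta * a
one_epoch_shuffle = np.prod(factors)      # without replacement: every factor once
one_epoch_iid = np.mean(factors) ** n     # with replacement: expected contraction

# AM-GM: a product of positive numbers is at most their mean to the n-th power,
# so one RandomShuffle epoch contracts at least as much as SGD does in expectation.
assert 0 < one_epoch_shuffle <= one_epoch_iid
```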
slide-15
SLIDE 15

A Brief History

  • What about the more general situation?
  • We can try to prove a better convergence bound!
  • The hope: prove a faster worst-case convergence rate for RandomShuffle
  • A well-known fact: SGD converges at rate O(1/T):
      E‖x_T − x*‖² ≤ O(1/T)

slide-18
SLIDE 18

A Brief History

  • One recent breakthrough (Gürbüzbalaban et al., 2015):
  • Asymptotically, RandomShuffle has convergence rate O(1/T²)
  • But it is unclear what happens after finitely many epochs
  • In contrast, there is a non-asymptotic result (Shamir, 2016):
  • RandomShuffle is no worse than SGD, with a provable O(1/T) convergence rate
  • But this cannot show that RandomShuffle is strictly faster

What happens in between?

slide-21
SLIDE 21

Summary of Results

We analyze RandomShuffle in the following settings:

  • Strongly convex, Lipschitz Hessian
  • Sparse data
  • Vanishing variance
  • Nonconvex, under PL condition
  • Smooth convex

(Dheeraj Nagaraj et al. later got rid of the Lipschitz-Hessian constraint.)

(This talk focuses on the strongly convex, Lipschitz-Hessian setting.)

slide-25
SLIDE 25

First attempt: try to prove a tighter bound!

  • Can we show a non-asymptotic bound better than O(1/T)? E.g., O(1/T^{1+δ})?
  • If we can, then everything is solved :)
  • ...unless we cannot :(
  • Spoiler: we cannot. Without n entering the bound, O(1/T) is unimprovable, and the next slides prove it.
slide-28
SLIDE 28

Proof of the theorem

  • We only consider the case T = n, i.e., we run one epoch of the algorithm
  • We prove the theorem with a counterexample:
  • Recall the function f(x) = (1/n) Σ_{i=1}^n f_i(x)
  • We set
      f_i(x) = (1/2)(x − b)^T A (x − b)  if i is odd,
      f_i(x) = (1/2)(x + b)^T A (x + b)  if i is even.
  • A and b to be determined later…
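In one dimension the construction reads f_i(x) = (a/2)(x − b)² for odd i and (a/2)(x + b)² for even i, and one RandomShuffle epoch admits the simple closed form that Step 1 of the proof rests on. A scalar sketch (the constants a, b, η, x0 below are arbitrary, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
n, a, b, eta, x0 = 10, 1.0, 1.0, 0.05, 0.7

# A random interleaving of the odd (+b) and even (-b) components:
signs = rng.permutation([+1] * (n // 2) + [-1] * (n // 2))

# Simulate one epoch: the gradient of (a/2)(x - s*b)^2 is a*(x - s*b)
x = x0
for s in signs:
    x = x - eta * a * (x - s * b)

# Closed form: x_n = r^n x_0 + eta*a*b * sum_t s_t r^(n-1-t), with r = 1 - eta*a
r = 1.0 - eta * a
closed = r**n * x0 + eta * a * b * sum(s * r**(n - 1 - t) for t, s in enumerate(signs))
assert abs(x - closed) < 1e-12
```

The sign-weighted geometric sum is exactly the term whose expected square the proof lower-bounds.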
slide-30
SLIDE 30

Proof of the theorem

  • Step 1: Calculate the error:
      E‖x_n − x*‖² = ‖(I − ηA)^n (x_0 − x*)‖² + E‖ Σ_{t=1}^n (−1)^{σ(t)} η (I − ηA)^{n−t} A b ‖²
  • Step 2: Simplify via an eigenvector-basis decomposition:
      first term = Σ_{i=1}^d (1 − ηλ_i)^{2n} u_i²,
      second term = η² Σ_{i=1}^d b_i² λ_i² E[ Σ_{t=1}^n (−1)^{σ(t)} (1 − ηλ_i)^{n−t} ]²
  • Step 3: Construct a contradiction:
      assume, for contradiction, that some η depending on n achieves convergence o(1/n); then ηn(2 − ηλ_i) = 1/λ_i + o(1) must hold


This cannot hold simultaneously for different λ_i!

slide-35
SLIDE 35

What to do next?

  • This means the best non-asymptotic rate we can hope for is O(1/T)
  • Key step: introduce n into the bound
  • The hope: if we can show a bound like O(C(n)/T²), RandomShuffle behaves better :)

Long time: O(1/T²)

Short time: O(1/T)

What happens in between?


slide-37
SLIDE 37

Bounds dependent on n

For general twice-differentiable functions with a Lipschitz Hessian:

slide-38
SLIDE 38

Bounds dependent on n

  • On one hand, RandomShuffle converges with O(1/T² + n³/T³)
  • On the other hand, SGD converges with O(1/T)
  • So the takeaway is: RandomShuffle is provably better than SGD after O(√n) epochs!
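Taking the two bounds at face value (RandomShuffle O(1/T² + n³/T³), SGD O(1/T), all constants dropped), the crossover can be checked directly: with T = K·n iterations, the shuffled bound wins once the epoch count K is on the order of √n:

```python
def rs_bound(T, n):
    return 1.0 / T**2 + n**3 / T**3   # RandomShuffle bound, constants dropped

def sgd_bound(T):
    return 1.0 / T                    # SGD bound, constants dropped

n = 10_000                            # sqrt(n) = 100 is the rough epoch threshold
for K in (1, 10, 1_000):
    T = K * n
    print(K, rs_bound(T, n) < sgd_bound(T))
```

Only the last setting reports True here; at exactly K = √n the constant-free bounds tie, so the switch happens a little above √n epochs.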


slide-40
SLIDE 40

Summary of Results

We analyze RandomShuffle in the following settings:

  • Strongly convex, Lipschitz Hessian
  • Sparse data
  • Vanishing variance
  • Nonconvex, under PL condition
  • Smooth convex
slide-41
SLIDE 41

Sparse setting

  • A sparse problem can be written as: f(x) = Σ_{i=1}^n f_i(x_{S_i})
  • where each S_i is a subset of the dimensions [d]
  • Consider a conflict graph with n nodes, with an edge (i, j) whenever S_i ∩ S_j ≠ ∅
  • Define the sparsity level of the problem:
      δ = max_{1≤i≤n} |{ S_j : S_i ∩ S_j ≠ ∅ }| / n
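The sparsity level δ is straightforward to compute from the supports S_i (the toy supports below are made up for illustration):

```python
def sparsity_level(supports):
    """delta = max_i |{ S_j : S_i intersects S_j }| / n (self-intersection counts)."""
    sets = [set(s) for s in supports]
    n = len(sets)
    return max(sum(1 for Sj in sets if Si & Sj) for Si in sets) / n

# Four components over the dimensions {0,...,5}:
supports = [{0, 1}, {1, 2}, {0, 5}, {3, 4}]
print(sparsity_level(supports))   # -> 0.75 (component 0 conflicts with itself, 1, and 2)
```

Fully separable supports give the minimum δ = 1/n; a coordinate shared by all components forces δ = 1.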


slide-43
SLIDE 43

Sparse setting

  • A fact about sparsity: 1/n ≤ δ ≤ 1
  • We have the following improved bound for sparse problems:
  • As a corollary, when δ = O(1/n), there is an O(1/T²) convergence rate!


slide-46
SLIDE 46

Summary of Results

We analyze RandomShuffle in the following settings:

  • Strongly convex, Lipschitz Hessian
  • Sparse data
  • Vanishing variance
  • Nonconvex, under PL condition
  • Smooth convex
slide-47
SLIDE 47

When Variance Vanishes

  • Suppose the variance vanishes at the optimum: ∇f_i(x*) = 0 for all i
  • Given n pairs of numbers 0 ≤ μ_i ≤ L_i, an optimal solution x* ∈ ℝ^d, and an initial upper bound D on the distance to it
  • A valid problem consists of n functions and an initial point x_0 such that:
  • f_i is μ_i-strongly convex with an L_i-Lipschitz gradient
  • ∇f_i(x*) = 0
  • ‖x_0 − x*‖₂ ≤ D
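A quick way to build a valid problem in this setting is to give every component the same minimizer, e.g. quadratics f_i(x) = (c_i/2)‖x − x*‖², so that μ_i = L_i = c_i. A sketch with made-up dimensions and curvatures:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, D = 3, 5, 2.0
x_star = rng.standard_normal(d)
c = rng.uniform(0.5, 2.0, size=n)            # mu_i = L_i = c_i for these quadratics

grads = [lambda x, ci=ci: ci * (x - x_star) for ci in c]

# Every component gradient vanishes at the shared optimum:
for grad in grads:
    assert np.allclose(grad(x_star), 0.0)

# An initial point within distance D of the optimum:
u = rng.standard_normal(d)
x0 = x_star + 0.5 * D * u / np.linalg.norm(u)
assert np.linalg.norm(x0 - x_star) <= D
```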
slide-50
SLIDE 50

When Variance Vanishes

slide-51
SLIDE 51

When Variance Vanishes

RandomShuffle is provably better than SGD after ANY number of iterations!

slide-52
SLIDE 52

Thanks!