SLIDE 1

Outline:

  • Introduction
  • Stochastic Gradient Methods and their properties
  • A numerical experiment: the test problem
  • Future developments

On the steplength selection in Stochastic Gradient Methods

Giorgia Franchini giorgia.franchini@unimore.it

Università degli studi di Modena e Reggio Emilia

Como, 16-18 July, 2018

SLIDE 2

Optimization problem in machine learning

The following optimization problem, which minimizes the average of cost functions over samples from a finite training set composed of sample data $a_i \in \mathbb{R}^d$ and class labels $b_i \in \{\pm 1\}$ for $i \in \{1, \dots, n\}$, appears frequently in machine learning:

$$\min_x F(x) \equiv \frac{1}{n} \sum_{i=1}^{n} f_i(x), \qquad (1)$$

where $n$ is the sample size and each $f_i : \mathbb{R}^d \to \mathbb{R}$ is the cost function corresponding to a training set element. For example, in the logistic regression case we have

$$f_i(x) = \log\left(1 + \exp(-b_i a_i^T x)\right).$$

We are interested in finding $x$ that minimizes (1).
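As a concrete illustration (ours, not from the slides), the objective (1) and its full gradient for the logistic loss take only a few lines of NumPy; the array names A (an n × d sample matrix) and b (a vector of ±1 labels) are assumptions of this sketch:

import numpy as np

def F(x, A, b):
    """Average logistic loss: F(x) = (1/n) * sum_i log(1 + exp(-b_i a_i^T x))."""
    z = -b * (A @ x)                      # z_i = -b_i a_i^T x
    return np.mean(np.logaddexp(0.0, z))  # log(1 + e^z), computed stably

def grad_F(x, A, b):
    """Full gradient: (1/n) * sum_i -b_i * sigma(z_i) * a_i."""
    z = -b * (A @ x)
    sigma = 1.0 / (1.0 + np.exp(-z))      # logistic function of z_i
    return -(A.T @ (b * sigma)) / len(b)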

SLIDE 3

Stochastic Gradient Descent (SGD)

For a given $x$, computing $F(x)$ and $\nabla F(x)$ exactly is prohibitively expensive, due to the large size of the training set. When $n$ is large, the Stochastic Gradient Descent (SGD) method and its variants have been the main approaches for solving (1). In the $t$-th iteration of SGD, a random index $i_t$ of a training sample is chosen from $\{1, 2, \dots, n\}$ and the iterate $x_t$ is updated by

$$x_{t+1} = x_t - \eta_t \nabla f_{i_t}(x_t),$$

where $\nabla f_{i_t}(x_t)$ denotes the gradient of the $i_t$-th component function at $x_t$, and $\eta_t > 0$ is the step size, or learning rate.
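A minimal sketch of this update loop, assuming a user-supplied oracle grad_fi(x, i) that returns the gradient of the i-th component function:

import numpy as np

def sgd(grad_fi, x0, n, eta=0.01, max_iter=1000, seed=0):
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for t in range(max_iter):
        i_t = rng.integers(n)          # draw a random sample index i_t
        x = x - eta * grad_fi(x, i_t)  # x_{t+1} = x_t - eta_t * grad f_{i_t}(x_t)
    return x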

SLIDE 4

SGD properties

Theorem (Strongly Convex Objective, Fixed Step Size)
Suppose that the SGD method is run with a fixed step size, $\eta_t = \bar\eta$ for all $t \in \mathbb{N}$, satisfying $0 < \bar\eta \le \frac{\mu}{L M_G}$. Then the expected optimality gap satisfies

$$\mathbb{E}[F(x_t) - F_*] \xrightarrow{\; t \to \infty \;} \frac{\bar\eta L M}{2 c \mu}.$$

Here $L > 0$ is the Lipschitz constant of the gradient of $F(x)$, $c > 0$ is the strong convexity constant of $F(x)$, and $\mu$ and $M$ are related to the first and second moments of the stochastic gradient.
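To see the theorem at work, the following toy simulation (an illustration of ours, not from the talk) runs fixed-step SGD on $F(x) = (c/2)x^2$ with additive gradient noise of variance M; the averaged gap settles on a plateau proportional to the step size, below the theoretical limit:

import numpy as np

c, mu, L = 1.0, 1.0, 1.0   # strong convexity constant; here mu = 1 and L = c
M, eta = 1.0, 0.1          # gradient-noise variance and fixed step size
rng = np.random.default_rng(0)
x, gaps = 5.0, []
for t in range(20000):
    g = c * x + rng.normal(scale=np.sqrt(M))  # unbiased stochastic gradient
    x -= eta * g
    gaps.append(0.5 * c * x**2)               # F(x_t) - F*, since F* = 0
print(np.mean(gaps[-5000:]), "<= limit", eta * L * M / (2 * c * mu))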

SLIDE 5

Notation and assumptions

  • $L > 0$: Lipschitz constant of the gradient;
  • $c > 0$: strong convexity constant;
  • there exist scalars $\mu_G \ge \mu > 0$ such that, for all $t \in \mathbb{N}$,

$$\nabla F(x_t)^T \, \mathbb{E}_{\xi_t}[g(x_t, \xi_t)] \ge \mu \, \|\nabla F(x_t)\|_2^2,$$
$$\|\mathbb{E}_{\xi_t}[g(x_t, \xi_t)]\|_2 \le \mu_G \, \|\nabla F(x_t)\|_2;$$

  • there exist scalars $M \ge 0$ and $M_V \ge 0$ such that, for all $t \in \mathbb{N}$,

$$\mathbb{V}_{\xi_t}[g(x_t, \xi_t)] \le M + M_V \, \|\nabla F(x_t)\|_2^2;$$

  • $M_G := M_V + \mu_G^2 \ge \mu^2 > 0$.

SLIDE 6

SGD properties, Diminishing Step sizes

Theorem (Strongly Convex Objective, Diminishing Step Sizes)
Suppose that the SGD method is run with a step size sequence such that

$$\sum_{t=1}^{\infty} \eta_t = \infty \quad \text{and} \quad \sum_{t=1}^{\infty} \eta_t^2 < \infty.$$

Then the expected optimality gap satisfies

$$\mathbb{E}[F(x_t) - F_*] \le \frac{\nu}{\gamma + t},$$

where $\nu$ and $\gamma$ are constants.

SLIDE 7

Notation

$$\eta_t = \frac{\beta}{\gamma + t} \quad \text{for some } \beta > \frac{1}{c \mu} \text{ and } \gamma > 0 \text{ such that } \eta_1 \le \frac{\mu}{L M_G};$$

$$\nu := \max\left\{ \frac{\beta^2 L M}{2(\beta c \mu - 1)}, \; (\gamma + 1)\,(F(x_1) - F_*) \right\}.$$
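A small sketch of this schedule; the constants c, mu, L, M_G and the choices of beta and gamma below are placeholder values of ours, picked only to satisfy the two stated conditions:

c, mu, L, M_G = 1.0, 1.0, 1.0, 1.0
beta = 2.0 / (c * mu)              # satisfies beta > 1/(c*mu)
gamma = beta * L * M_G / mu - 1.0  # makes eta(1) = beta/(gamma+1) = mu/(L*M_G)

def eta(t):
    """Diminishing step size eta_t = beta / (gamma + t)."""
    return beta / (gamma + t)

assert beta > 1.0 / (c * mu) and eta(1) <= mu / (L * M_G)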

SLIDE 8

A numerical experiment: the test problem

Logistic regression with $\ell_2$-norm regularization:

$$\min_x F(x) = \frac{1}{n} \sum_{i=1}^{n} \log\left(1 + \exp(-b_i a_i^T x)\right) + \frac{\lambda}{2} \|x\|_2^2,$$

where $a_i \in \mathbb{R}^d$ and $b_i \in \{\pm 1\}$ are the feature vector and class label of the $i$-th sample, respectively, and $\lambda > 0$ is a regularization parameter.

Database: MNIST, digits 8 and 9; dimension: 11800 × 784.
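For concreteness, a sketch of the regularized objective; since the MNIST extraction itself is not shown in the slides, random data of the same 11800 × 784 shape stands in for the real feature matrix:

import numpy as np

def F_reg(x, A, b, lam):
    """(1/n) sum_i log(1 + exp(-b_i a_i^T x)) + (lam/2) ||x||_2^2."""
    z = -b * (A @ x)
    return np.mean(np.logaddexp(0.0, z)) + 0.5 * lam * np.dot(x, x)

rng = np.random.default_rng(0)
A = rng.normal(size=(11800, 784))        # stand-in for the MNIST 8-vs-9 features
b = rng.choice([-1.0, 1.0], size=11800)  # stand-in for the 8-vs-9 labels
print(F_reg(np.zeros(784), A, b, lam=1e-4))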

SLIDE 9

MNIST

[Figure: sample images from the MNIST handwritten digit database.]

SLIDE 10

A numerical experiment: the algorithms

The two full-gradient BB rules:

  • BB1 full gradient: a nonmonotone gradient method with the first Barzilai-Borwein step size rule;
  • Adaptive BB (ABB): a nonmonotone gradient method with a full-gradient step size rule that alternates between the two BB rules.

Stochastic methods:

  • ADAM: a stochastic gradient method based on adaptive moment estimation;
  • ADAM ABB: ADAM combined with the ABB step size rule.

Behaviour is measured with respect to the epochs: one epoch = 11800 SGD steps.

SLIDE 11

Deterministic cases

$$\eta_t^{BB1} = \frac{s_{t-1}^T s_{t-1}}{s_{t-1}^T v_{t-1}}, \qquad \eta_t^{BB2} = \frac{s_{t-1}^T v_{t-1}}{v_{t-1}^T v_{t-1}},$$

where $s_{t-1} = x_t - x_{t-1}$ and $v_{t-1} = \nabla F(x_t) - \nabla F(x_{t-1})$, with $\nabla F(x) = \frac{1}{n} \sum_{i=1}^{n} \nabla f_i(x)$.

$$\eta_t^{ABBmin} = \begin{cases} \min\{\eta_j^{BB2} : j = \max\{1, t - m_a\}, \dots, t\}, & \text{if } \eta_t^{BB2} / \eta_t^{BB1} < \tau, \\ \eta_t^{BB1}, & \text{otherwise,} \end{cases}$$

where $m_a$ is a nonnegative integer and $\tau \in (0, 1)$.

[Di Serafino, Ruggiero, Toraldo, Zanni, On the steplength selection in gradient methods for unconstrained optimization, AMC 318, 2018]
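A sketch, with invented helper names, of how the two BB rules and the ABBmin selection above might be computed:

import numpy as np

def bb_steps(x_prev, x_curr, g_prev, g_curr):
    """Return (eta^BB1, eta^BB2) from consecutive iterates and gradients."""
    s = x_curr - x_prev                # s_{t-1}
    v = g_curr - g_prev                # v_{t-1}
    return np.dot(s, s) / np.dot(s, v), np.dot(s, v) / np.dot(v, v)

def abb_min(bb1_t, bb2_history, m_a=2, tau=0.8):
    """ABBmin rule; bb2_history holds eta_j^{BB2} for j = 1..t, newest last."""
    bb2_t = bb2_history[-1]
    if bb2_t / bb1_t < tau:
        return min(bb2_history[-(m_a + 1):])  # min over j = max{1, t-m_a}..t
    return bb1_t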

SLIDE 12

[Kingma, Lei Ba, Adam: a method for stochastic optimization, arXiv, 2017]

Algorithm 1 Adam

1: Choose maxit, η, ε, β1 and β2 ∈ [0, 1), x0;
2: initialize m0 ← 0, v0 ← 0, t ← 0;
3: for t ∈ {0, . . . , maxit} do
4:   t ← t + 1
5:   g_t ← ∇f_{i_t}(x_{t−1})
6:   m_t ← β1 · m_{t−1} + (1 − β1) · g_t
7:   v_t ← β2 · v_{t−1} + (1 − β2) · g_t^2
8:   η_t ← η · sqrt(1 − β2^t) / (1 − β1^t)
9:   x_t ← x_{t−1} − η_t · m_t / (sqrt(v_t) + ε)
10: end for
11: Result: x_t
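A direct NumPy transcription of Algorithm 1 might look as follows; the stochastic-gradient oracle grad_fi and the uniform sampling of i_t are assumptions about the surrounding code, not part of the slide:

import numpy as np

def adam(grad_fi, x0, n, eta=0.001, eps=1e-8, beta1=0.9, beta2=0.999,
         maxit=1000, seed=0):
    rng = np.random.default_rng(seed)
    x = x0.copy()
    m = np.zeros_like(x)  # first moment estimate m_t
    v = np.zeros_like(x)  # second moment estimate v_t
    for t in range(1, maxit + 1):
        g = grad_fi(x, rng.integers(n))         # stochastic gradient g_t
        m = beta1 * m + (1 - beta1) * g         # line 6 of Algorithm 1
        v = beta2 * v + (1 - beta2) * g**2      # line 7
        eta_t = eta * np.sqrt(1 - beta2**t) / (1 - beta1**t)  # line 8
        x = x - eta_t * m / (np.sqrt(v) + eps)  # line 9
    return x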

SLIDE 13

Behaviour of the deterministic and the stochastic methods

[Figure: optimality gap F − F* versus epoch (1-100, log scale); curves: ADAM, BB1 FULL GRADIENT, ABB FULL GRADIENT.]

SLIDE 14

Comparison between different SGD types

[Figure: optimality gap F − F* versus epoch (1-20, log scale); curves: ADAM, SGD MOMENTUM.]

SLIDE 15

Are BB rules useful in the stochastic framework?

$$\eta_t^{BB1} = \frac{s_{t-1}^T s_{t-1}}{s_{t-1}^T v_{t-1}}, \qquad \eta_t^{BB2} = \frac{s_{t-1}^T v_{t-1}}{v_{t-1}^T v_{t-1}},$$

where $s_{t-1} = x_t - x_{t-1}$ and $v_{t-1} = \nabla f_{i_t}(x_t) - \nabla f_{i_{t-1}}(x_{t-1})$. There is not just one way to calculate them.

[Sopyla, Drozda, SGD with BB update step for SVM, Inf. Sci., 2015]

[Tan, Ma, Dai, Qian, BB Step Size for SGD, Adv. NIPS, 2016]: comparable performance with popular SGD approaches using a best-tuned step size, but without expensive parameter tuning.
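A sketch of the stochastic variant: the only change from the deterministic rules is that v_{t−1} is built from stochastic gradients evaluated on (generally different) sampled components; grad_fi is assumed to return the gradient of the i-th component function:

import numpy as np

def stochastic_bb1(x_prev, x_curr, i_prev, i_curr, grad_fi):
    s = x_curr - x_prev
    v = grad_fi(x_curr, i_curr) - grad_fi(x_prev, i_prev)  # stochastic v_{t-1}
    return np.dot(s, s) / np.dot(s, v)  # eta_t^{BB1}; may be noisy or negative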

SLIDE 16

ABB in Adam

$$\eta_t^{ABBmin} = \begin{cases} \min\{\eta_j^{BB2} : j = \max\{1, t - m_a\}, \dots, t\}, & \text{if } \eta_t^{BB2} / \eta_t^{BB1} < \tau, \\ \eta_t^{BB1}, & \text{otherwise,} \end{cases}$$

where $m_a$ is a nonnegative integer and $\tau \in (0, 1)$;

$$\eta_t = \max\{\eta, \eta_t^{ABBmin}\}.$$
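A tiny sketch of this safeguard; flooring at the base step size η keeps an unreliable (e.g. tiny or negative) BB estimate from stalling the iteration:

def safeguarded_step(eta, eta_abbmin):
    """eta_t = max(eta, eta_t^{ABBmin})."""
    return max(eta, eta_abbmin)

print(safeguarded_step(0.001, 2e-5))  # falls back to the base step 0.001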

SLIDE 17

Comparison between Adam and Adam with ABB

[Figure: optimality gap F − F* versus epoch (1-20, log scale); curves: ADAM, ADAM ABB.]

SLIDE 18

Future developments

  • Further study to make the adaptive step size rules useful also in the stochastic case;
  • validation of the Adam-ABB version: experiments on other databases and other loss functions;
  • exploiting minibatches of adaptive size;
  • analysing the sensitivity of the step size rules to the minibatch size.

SLIDE 19

Thanks for your attention!

giorgia.franchini@unimore.it
