Improving L-BFGS Initialization for Trust-Region Methods in Deep Learning

SLIDE 1

Improving L-BFGS Initialization for Trust-Region Methods in Deep Learning

Jacob Rafati
Ph.D. Candidate, Electrical Engineering and Computer Science, University of California, Merced
http://rafati.net | jrafatiheravi@ucmerced.edu

SLIDE 2

Agenda

  • Introduction, Problem Statement and Motivations
  • Overview of Quasi-Newton Optimization Methods
  • L-BFGS Trust Region Optimization Method
  • Proposed Methods for Initialization of L-BFGS
  • Application in Deep Learning (Image Classification Task)
SLIDE 3

Introduction, Problem Statement and Motivations

SLIDE 4

Unconstrained Optimization Problem

$$\min_{w \in \mathbb{R}^n} \mathcal{L}(w) \triangleq \frac{1}{N} \sum_{i=1}^{N} \ell_i(w), \qquad \mathcal{L} : \mathbb{R}^n \to \mathbb{R}$$

SLIDE 5

Optimization Algorithms

Bottou et al. (2016). Optimization Methods for Large-Scale Machine Learning. arXiv:1606.04838.

SLIDE 6

Optimization Algorithms

  • 1. Start from a random point $w_0$.
  • 2. Repeat for each iteration $k = 0, 1, 2, \dots$:
  • 3. Choose a search direction $p_k$.
  • 4. Choose a step size $\alpha_k$.
  • 5. Update parameters: $w_{k+1} \leftarrow w_k + \alpha_k p_k$.
  • 6. Until $\|\nabla \mathcal{L}\| < \epsilon$ (see the sketch after this list).
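As a minimal sketch, this template might look like the following in Python; `search_direction` and `step_size` are hypothetical callables standing in for the strategies discussed in the rest of the talk (steepest descent, Newton, quasi-Newton; fixed rates, line search, trust region).

```python
import numpy as np

def minimize(grad, w0, search_direction, step_size, eps=1e-6, max_iter=1000):
    """Generic iterative minimizer: w_{k+1} <- w_k + alpha_k * p_k."""
    w = w0
    for _ in range(max_iter):
        g = grad(w)
        if np.linalg.norm(g) < eps:    # stop when ||grad L|| < eps
            break
        p = search_direction(w, g)     # e.g., p = -g for steepest descent
        alpha = step_size(w, p)        # e.g., a fixed learning rate
        w = w + alpha * p
    return w

# Example: minimize 0.5*||w||^2 with steepest descent and a fixed step.
w_star = minimize(lambda w: w, np.array([3.0, -2.0]),
                  search_direction=lambda w, g: -g,
                  step_size=lambda w, p: 0.5)
```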

SLIDE 7

Properties of Objective Function

  • $n$ and $N$ are both large in modern applications.
  • $\mathcal{L}(w)$ is a non-convex and nonlinear function.
  • $\nabla^2 \mathcal{L}(w)$ is ill-conditioned.
  • Computing the full gradient, $\nabla \mathcal{L}$, is expensive.
  • Computing the Hessian, $\nabla^2 \mathcal{L}$, is not practical.

$$\min_{w \in \mathbb{R}^n} \mathcal{L}(w) \triangleq \frac{1}{N} \sum_{i=1}^{N} \ell_i(w)$$

SLIDE 8

Stochastic Gradient Descent

  • 1. Sample indices $S_k \subset \{1, 2, \dots, N\}$.
  • 2. Compute the stochastic (subsampled) gradient $\nabla \mathcal{L}^{(S_k)}(w_k) \triangleq \frac{1}{|S_k|} \sum_{i \in S_k} \nabla \ell_i(w_k) \approx \nabla \mathcal{L}(w_k)$.
  • 3. Assign a learning rate $\alpha_k$.
  • 4. Update parameters using $p_k = -\nabla \mathcal{L}^{(S_k)}(w_k)$: $w_{k+1} \leftarrow w_k - \alpha_k \nabla \mathcal{L}^{(S_k)}(w_k)$ (see the sketch after this list).
  • H. Robbins and D. Siegmund (1971). "A convergence theorem for non-negative almost supermartingales and some applications." Optimizing Methods in Statistics.
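A minimal NumPy sketch of these four steps, assuming a per-example gradient function `grad_i(i, w)` (a hypothetical helper, not from the slides):

```python
import numpy as np

def sgd_step(w, grad_i, N, batch_size, lr, rng):
    """One SGD iteration: sample a mini-batch, average its gradients, step."""
    S_k = rng.choice(N, size=batch_size, replace=False)  # 1. sample indices
    g = np.mean([grad_i(i, w) for i in S_k], axis=0)     # 2. subsampled gradient
    return w - lr * g                                    # 3.-4. update with rate lr
```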

SLIDE 9

Advantages of SGD

  • SGD algorithms are very easy to implement.
  • SGD requires only computing the gradient.
  • SGD has a low cost-per-iteration.

Bottou et al. (2016). Optimization Methods for Large-Scale Machine Learning. arXiv:1606.04838.

SLIDE 10

Disadvantages of SGD

  • Very sensitive to the ill-conditioning problem and to scaling.
  • Requires fine-tuning many hyper-parameters.
  • Unlikely to exhibit acceptable performance on the first try; requires many trials and errors.
  • Can get stuck at a saddle point instead of a local minimum.
  • Sublinear and slow rate of convergence.

Bottou et al. (2016). Optimization Methods for Large-Scale Machine Learning. arXiv:1606.04838.

  • J. Nocedal and S. J. Wright. (2006). Numerical Optimization. 2nd ed. New York. Springer.
SLIDE 11

Second-Order Methods

  • 1. Sample indices $S_k \subset \{1, 2, \dots, N\}$.
  • 2. Compute the stochastic (subsampled) gradient $\nabla \mathcal{L}^{(S_k)}(w_k) \triangleq \frac{1}{|S_k|} \sum_{i \in S_k} \nabla \ell_i(w_k) \approx \nabla \mathcal{L}(w_k)$.
  • 3. Compute the subsampled Hessian $\nabla^2 \mathcal{L}^{(S_k)}(w_k) \triangleq \frac{1}{|S_k|} \sum_{i \in S_k} \nabla^2 \ell_i(w_k) \approx \nabla^2 \mathcal{L}(w_k)$.

SLIDE 12

Second-Order Methods

  • 4. Compute Newton’s direction $p_k = -\nabla^2 \mathcal{L}(w_k)^{-1} \nabla \mathcal{L}(w_k)$.
  • 5. Find a proper step length $\alpha_k = \arg\min_{\alpha} \mathcal{L}(w_k + \alpha p_k)$.
  • 6. Update parameters: $w_{k+1} \leftarrow w_k + \alpha_k p_k$ (see the sketch after this list).
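A sketch of one such iteration in NumPy, assuming callables for the loss, gradient, and Hessian (which, as the next slides note, is rarely practical at scale); the backtracking loop is a crude stand-in for step 5:

```python
import numpy as np

def newton_step(w, loss, grad, hess):
    """One Newton iteration with simple backtracking on the step length."""
    g = grad(w)
    p = np.linalg.solve(hess(w), -g)          # Newton direction: solve H p = -g
    alpha = 1.0
    while loss(w + alpha * p) > loss(w) and alpha > 1e-8:
        alpha *= 0.5                          # halve the step until L decreases
    return w + alpha * p
```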

SLIDE 13

Second-Order Methods Advantages

  • The rate of convergence is super-linear (quadratic for Newton's method).
  • They are resilient to problem ill-conditioning.
  • They involve less parameter tuning.
  • They are less sensitive to the choice of hyper-parameters.

  • J. Nocedal and S. J. Wright. (2006). Numerical Optimization. 2nd ed. New York. Springer.
SLIDE 14

Second-Order Methods Disadvantages

  • Computing the Hessian matrix is very expensive and requires massive storage.
  • Computing the inverse of the Hessian is not practical.

  • J. Nocedal and S. J. Wright. (2006). Numerical Optimization. 2nd ed. New York. Springer.

Bottou et al. (2016). Optimization Methods for Large-Scale Machine Learning. arXiv:1606.04838.

SLIDE 15

Quasi-Newton Methods

  • 1. Construct a low-rank approximation of the Hessian: $B_k \approx \nabla^2 \mathcal{L}(w_k)$.
  • 2. Find the search direction by minimizing the quadratic model of the objective function:

$$p_k = \arg\min_{p \in \mathbb{R}^n} Q_k(p) \triangleq g_k^T p + \frac{1}{2} p^T B_k p, \qquad g_k = \nabla \mathcal{L}(w_k)$$

SLIDE 16

Quasi-Newton Matrices

  • Symmetric
  • Easy and fast computation
  • Satisfies the secant condition: $B_{k+1} s_k = y_k$

$$s_k \triangleq w_{k+1} - w_k, \qquad y_k \triangleq \nabla \mathcal{L}(w_{k+1}) - \nabla \mathcal{L}(w_k)$$

SLIDE 17

Broyden-Fletcher-Goldfarb-Shanno (BFGS)

  • J. Nocedal and S. J. Wright. (2006). Numerical Optimization. 2nd ed. New York. Springer.
SLIDE 18

Broyden-Fletcher-Goldfarb-Shanno (BFGS)

$$B_{k+1} = B_k - \frac{B_k s_k s_k^T B_k}{s_k^T B_k s_k} + \frac{y_k y_k^T}{y_k^T s_k}$$

$$s_k \triangleq w_{k+1} - w_k, \qquad y_k \triangleq \nabla \mathcal{L}(w_{k+1}) - \nabla \mathcal{L}(w_k), \qquad B_0 = \gamma_k I$$

  • J. Nocedal and S. J. Wright. (2006). Numerical Optimization. 2nd ed. New York. Springer.
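A direct NumPy transcription of the update above; forming the dense $B_k$ is for illustration only (practical codes use the limited-memory form on the next slides), and the curvature guard is an added assumption:

```python
import numpy as np

def bfgs_update(B, s, y):
    """BFGS update of the Hessian approximation B from a curvature pair (s, y)."""
    if s @ y <= 1e-12:        # skip the update if curvature is not positive
        return B
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)
```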
SLIDE 19

Quasi-Newton Methods Advantages

  • The rate of convergence is super-linear.
  • They are resilient to problem ill-conditioning.
  • The second derivative is not required.
  • They only use gradient information to construct the quasi-Newton matrices.

  • J. Nocedal and S. J. Wright. (2006). Numerical Optimization. 2nd ed. New York. Springer.
SLIDE 20

Quasi-Newton Methods Disadvantages

  • The cost of storing the gradient information can be expensive.
  • The quasi-Newton matrix can be dense.
  • The quasi-Newton matrix grows in size and rank in large-scale problems.

  • J. Nocedal and S. J. Wright. (2006). Numerical Optimization. 2nd ed. New York. Springer.
SLIDE 21

Limited-Memory BFGS

Limited-memory storage: $S_k = [\, s_{k-m} \ \dots \ s_{k-1} \,]$, $Y_k = [\, y_{k-m} \ \dots \ y_{k-1} \,]$.

L-BFGS compact representation:

$$B_k = B_0 + \Psi_k M_k \Psi_k^T, \qquad B_0 = \gamma_k I,$$

where

$$\Psi_k = [\, B_0 S_k \ \ Y_k \,], \qquad M_k = \begin{bmatrix} -S_k^T B_0 S_k & -L_k \\ -L_k^T & D_k \end{bmatrix}^{-1}, \qquad S_k^T Y_k = L_k + D_k + U_k,$$

with $L_k$, $D_k$, and $U_k$ the strictly lower triangular, diagonal, and strictly upper triangular parts of $S_k^T Y_k$.
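A sketch of assembling $B_k$ from the stored pairs, assuming the columns of `S` and `Y` hold the $m$ most recent $(s_j, y_j)$; materializing the $n \times n$ matrix is for illustration only, since real implementations work with $\Psi_k$ and $M_k$ implicitly:

```python
import numpy as np

def compact_bfgs(S, Y, gamma):
    """Form B = gamma*I + Psi M Psi^T from limited-memory pairs (columns of S, Y)."""
    n = S.shape[0]
    SY = S.T @ Y
    L = np.tril(SY, k=-1)                    # strictly lower triangular part of S^T Y
    D = np.diag(np.diag(SY))                 # diagonal part of S^T Y
    Psi = np.hstack([gamma * S, Y])          # [B0*S  Y], with B0 = gamma*I
    M_inv = np.block([[-gamma * S.T @ S, -L],
                      [-L.T,              D]])  # small 2m x 2m middle matrix (M^{-1})
    return gamma * np.eye(n) + Psi @ np.linalg.solve(M_inv, Psi.T)
```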

SLIDE 22

Limited-Memory Quasi-Newton Methods

  • Low-rank approximation.
  • Small memory of recent gradients.
  • Low cost of computing the search direction.
  • Linear or superlinear convergence rate can be achieved.

  • J. Nocedal and S. J. Wright. (2006). Numerical Optimization. 2nd ed. New York. Springer.
SLIDE 23

Objectives

$$B_k = B_0 + \Psi_k M_k \Psi_k^T, \qquad B_0 = \gamma_k I$$

What is the best choice for the initialization $\gamma_k$?

SLIDE 24

Overview of Quasi-Newton Optimization Strategies

SLIDE 25

Line Search Method

[Figure: iterate $w_k$ and search direction $p_k$.]

Quadratic model, if $B_k$ is positive definite:

$$p_k = \arg\min_{p \in \mathbb{R}^n} Q_k(p) \triangleq g_k^T p + \frac{1}{2} p^T B_k p \;\Longrightarrow\; p_k = -B_k^{-1} g_k, \qquad g_k = \nabla \mathcal{L}(w_k)$$

  • J. Nocedal and S. J. Wright. (2006). Numerical Optimization. 2nd ed. New York. Springer.
SLIDE 26

Line Search Method

[Figure: step $\alpha_k p_k$ from $w_k$ to $w_{k+1}$.]

Wolfe conditions (see the sketch below):

$$\mathcal{L}(w_k + \alpha_k p_k) \le \mathcal{L}(w_k) + c_1 \alpha_k \nabla \mathcal{L}(w_k)^T p_k$$

$$\nabla \mathcal{L}(w_k + \alpha_k p_k)^T p_k \ge c_2 \nabla \mathcal{L}(w_k)^T p_k$$

  • J. Nocedal and S. J. Wright. (2006). Numerical Optimization. 2nd ed. New York. Springer.
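A backtracking sketch that enforces the first (sufficient-decrease) condition; a full Wolfe search would also check the curvature condition, as done by, e.g., `scipy.optimize.line_search`. The constants are conventional defaults, not taken from the slides:

```python
def backtracking(loss, grad, w, p, c1=1e-4, rho=0.5, alpha=1.0):
    """Shrink alpha until L(w + alpha*p) <= L(w) + c1*alpha*grad(w)'p."""
    g_dot_p = grad(w) @ p      # should be negative for a descent direction
    while alpha > 1e-10 and loss(w + alpha * p) > loss(w) + c1 * alpha * g_dot_p:
        alpha *= rho
    return alpha
```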
SLIDE 27

Trust Region Method

[Figure: trust region of radius $\delta_k$ around $w_k$ with step $p_k$.]

$$p_k = \arg\min_{p \in \mathbb{R}^n} Q(p) \quad \text{s.t.} \quad \|p\|_2 \le \delta_k$$

Conn, Gould, and Toint (2000). Trust-Region Methods. SIAM.

SLIDE 28

Trust Region Method

[Figure: contour plots of the quadratic model comparing the Newton step with the global and local minima of the trust-region subproblem.]

  • J. J. Moré and D. C. Sorensen (1984). "Newton's Method." In Studies in Numerical Analysis, MAA Studies in Mathematics, Vol. 24, pp. 29-82.
SLIDE 29

L-BFGS Trust Region Optimization Method

SLIDE 30

L-BFGS in Trust Region

  • L. Adhikari et al. (2017). "Limited-memory trust-region methods for sparse relaxation." In Proc. SPIE, vol. 10394.
  • Brust et al. (2017). "On solving L-SR1 trust-region subproblems." Computational Optimization and Applications, vol. 66, pp. 245-266.

$$B_k = B_0 + \Psi_k M_k \Psi_k^T$$

Eigendecomposition:

$$B_k = P \begin{bmatrix} \Lambda + \gamma_k I & 0 \\ 0 & \gamma_k I \end{bmatrix} P^T$$

Sherman-Morrison-Woodbury formula:

$$p_k^* = -\frac{1}{\tau^*} \left[ I - \Psi_k \left( \tau^* M_k^{-1} + \Psi_k^T \Psi_k \right)^{-1} \Psi_k^T \right] g_k, \qquad \tau^* = \gamma_k + \sigma^*$$
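A sketch of this solve in NumPy, assuming `Psi` ($n \times 2m$), the small un-inverted middle matrix `M_inv` ($= M_k^{-1}$, $2m \times 2m$), the gradient `g`, and the scalar `tau` are already in hand:

```python
import numpy as np

def smw_direction(Psi, M_inv, g, tau):
    """p* = -(1/tau) [I - Psi (tau*M_inv + Psi'Psi)^{-1} Psi'] g."""
    inner = tau * M_inv + Psi.T @ Psi                    # small 2m x 2m system
    return -(g - Psi @ np.linalg.solve(inner, Psi.T @ g)) / tau
```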

SLIDE 31

L-BFGS in Trust Region vs. Line-Search

26th European Signal Processing Conference, Rome, Italy. September 2018.

SLIDE 32

Proposed Methods for Initialization of L-BFGS

SLIDE 33

Initialization Method I

$$B_k = B_0 + \Psi_k M_k \Psi_k^T, \qquad B_0 = \gamma_k I$$

Spectral estimate of the Hessian:

$$\gamma_k = \frac{y_{k-1}^T y_{k-1}}{s_{k-1}^T y_{k-1}}$$

which solves

$$\gamma_k = \arg\min_{\gamma} \| B_0^{-1} y_{k-1} - s_{k-1} \|_2^2, \qquad B_0 = \gamma I.$$
  • J. Nocedal and S. J. Wright. (2006). Numerical Optimization. 2nd ed. New York. Springer.
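In code this is a one-liner; the guard against non-positive curvature and the fallback value are added assumptions, not from the slides:

```python
def gamma_spectral(s, y, fallback=1.0):
    """Spectral initial scaling gamma_k = (y'y)/(s'y) from the latest pair."""
    sy = s @ y
    return (y @ y) / sy if sy > 1e-12 else fallback
```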
SLIDE 34

Initialization Method II

Consider a quadratic function

$$\mathcal{L}(w) = \frac{1}{2} w^T H w + g^T w, \qquad \nabla^2 \mathcal{L}(w) = H.$$

We have $H S_k = Y_k$, and therefore

$$S_k^T H S_k = S_k^T Y_k.$$

Erway et al. (2018). “Trust-Region Algorithms for Training Responses: Machine Learning Methods Using Indefinite Hessian Approximations,” ArXiv e-prints.

SLIDE 35

Initialization Method II

Since

$$B_k = B_0 + \Psi_k M_k \Psi_k^T, \qquad B_0 = \gamma_k I, \qquad \Psi_k = [\, B_0 S_k \ \ Y_k \,], \qquad M_k = \begin{bmatrix} -S_k^T B_0 S_k & -L_k \\ -L_k^T & D_k \end{bmatrix}^{-1},$$

and the secant condition gives $B_k S_k = Y_k$, we have

$$S_k^T H S_k - \gamma_k S_k^T S_k = S_k^T \Psi_k M_k \Psi_k^T S_k.$$

SLIDE 36

Initialization Method II

$$S_k^T H S_k - \gamma_k S_k^T S_k = S_k^T \Psi_k M_k \Psi_k^T S_k$$

General eigenvalue problem:

$$(L_k + D_k + L_k^T)\, z = \lambda\, S_k^T S_k\, z$$

Upper bound on the initial value to avoid false curvature information (see the sketch below):

$$\gamma_k \in (0, \lambda_{\min})$$
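A sketch using SciPy's generalized symmetric eigensolver to find $\lambda_{\min}$ and then place $\gamma_k$ just below it; the 0.9 safety factor, the fallback, and the positive-definiteness of $S_k^T S_k$ are assumptions for illustration:

```python
import numpy as np
from scipy.linalg import eigh

def gamma_method2(S, Y, safety=0.9):
    """Pick gamma_k in (0, lambda_min) of (L + D + L^T) z = lambda (S^T S) z."""
    SY = S.T @ Y
    L = np.tril(SY, k=-1)
    A = L + np.diag(np.diag(SY)) + L.T                  # L_k + D_k + L_k^T
    lam_min = eigh(A, S.T @ S, eigvals_only=True)[0]    # smallest generalized eigenvalue
    return safety * lam_min if lam_min > 0 else 1.0     # fallback is an assumption
```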

SLIDE 37

Initialization Method III

$$B_k = B_0 + \Psi_k M_k \Psi_k^T, \qquad B_0 = \gamma_k I$$

$$S_k^T H S_k - \gamma_k S_k^T S_k = S_k^T \Psi_k M_k \Psi_k^T S_k, \qquad \Psi_k = [\, B_0 S_k \ \ Y_k \,], \qquad M_k = \begin{bmatrix} -S_k^T B_0 S_k & -L_k \\ -L_k^T & D_k \end{bmatrix}^{-1}$$

Note that the compact-representation matrices contain $\gamma_k$.

SLIDE 38

Initialization Method III

General eigenvalue problem:

$$A^* z = \lambda B^* z,$$

$$A^* = L_k + D_k + L_k^T - S_k^T Y_k \tilde{D}\, Y_k^T S_k - \gamma_{k-1}^2 \left( S_k^T S_k\, \tilde{A}\, S_k^T S_k \right),$$

$$B^* = S_k^T S_k + S_k^T S_k\, \tilde{B}\, Y_k^T S_k + S_k^T Y_k\, \tilde{B}^T S_k^T S_k.$$

Upper bound on the initial values to avoid false curvature information:

$$\gamma_k \in (0, \lambda_{\min})$$

SLIDE 39

Applications in Deep Learning

SLIDE 40

Supervised Learning

Features: $X = \{x_1, x_2, \dots, x_i, \dots, x_N\}$

Labels: $T = \{t_1, t_2, \dots, t_i, \dots, t_N\}$

$$\Phi : X \to T$$

SLIDE 41

Supervised Learning

Input $x$ → Model (Predictor) $\phi(x; w)$ → Output $y$

[Figure: a handwritten digit classified over labels 1, 2, 3, ..., 9, with output scores such as 0.97 and 0.03.]

SLIDE 42

Convolutional Neural Network

[Figure: network architecture $\phi(x; w)$: Input → conv1 → pool1 → conv2 → pool2 → hidden4 → Output.]

Number of parameters: $n = 413{,}080$.

LeCun et al. (1998). "Gradient-based learning applied to document recognition." Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324.

SLIDE 43

Loss Function

Target: $t = [\, 0, 0, 1, 0, \dots, 0 \,]$

Output: $y = [\, 0, 0, 0.97, 0.03, \dots \,]$

Cross-entropy:

$$\ell(t, y) = -t \cdot \log(y) - (1 - t) \cdot \log(1 - y)$$

Empirical risk:

$$\mathcal{L}(w) = \frac{1}{N} \sum_{i=1}^{N} \ell(t_i, \phi(x_i; w))$$
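The two formulas above as a NumPy sketch; the clipping epsilon is an added numerical-stability assumption, not on the slide:

```python
import numpy as np

def cross_entropy(t, y, eps=1e-12):
    """Elementwise cross-entropy between target t and prediction y."""
    y = np.clip(y, eps, 1.0 - eps)            # avoid log(0)
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

def empirical_risk(targets, outputs):
    """Average cross-entropy over the dataset."""
    return np.mean([cross_entropy(t, y) for t, y in zip(targets, outputs)])
```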

SLIDE 44

Multi-Batch L-BFGS

[Figure: successive overlapping batches $S_k$, $S_{k+1}$, $S_{k+2}$ drawn from the shuffled data, with overlaps $O_{k-1}$, $O_k$, $O_{k+1}$, $O_{k+2}$.]

$$O_k = S_k \cap S_{k+1}$$

Berahas et al. (2016). "A multi-batch L-BFGS method for machine learning." In Advances in Neural Information Processing Systems 29, pp. 1055-1063.

SLIDE 45

Computing gradients

[Figure: batches $S_k$, $S_{k+1}$ with overlap $O_k = S_k \cap S_{k+1}$.]

$$g_k = \nabla \mathcal{L}^{(S_k)}(w_k) = \frac{1}{|S_k|} \sum_{i \in S_k} \nabla \ell_i(w_k)$$

The curvature pair is computed on the overlap (see the sketch below):

$$y_k = \nabla \mathcal{L}^{(O_k)}(w_{k+1}) - \nabla \mathcal{L}^{(O_k)}(w_k)$$

Berahas et al. (2016). "A multi-batch L-BFGS method for machine learning." In Advances in Neural Information Processing Systems 29, pp. 1055-1063.
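A sketch of forming the multi-batch curvature pair, assuming a `subsampled_grad(w, indices)` helper (hypothetical name):

```python
import numpy as np

def curvature_pair(w_old, w_new, S_old, S_new, subsampled_grad):
    """(s_k, y_k) with y_k evaluated on the overlap O_k = S_k ∩ S_{k+1}."""
    O_k = np.intersect1d(S_old, S_new)        # overlapping sample indices
    s_k = w_new - w_old
    y_k = subsampled_grad(w_new, O_k) - subsampled_grad(w_old, O_k)
    return s_k, y_k
```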

SLIDE 46

Experiment

SLIDE 47

Trust-Region Algorithm
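The algorithm on this slide was shown as a figure; below is a hedged sketch of a standard trust-region iteration in the same spirit. The acceptance threshold and radius factors (0.75, 0.25, etc.) are conventional textbook choices, not necessarily the values used in the talk, and `solve_subproblem` stands in for the SMW-based L-BFGS subproblem solver described earlier:

```python
import numpy as np

def trust_region(loss, grad, solve_subproblem, w0, delta=1.0, eps=1e-6):
    """Generic trust-region loop; solve_subproblem returns (p, predicted decrease)."""
    w = w0
    while np.linalg.norm(grad(w)) > eps:
        g = grad(w)
        p, pred = solve_subproblem(w, g, delta)            # step with ||p|| <= delta
        rho = (loss(w) - loss(w + p)) / max(pred, 1e-16)   # actual/predicted ratio
        if rho < 0.25:
            delta *= 0.25                  # poor model: shrink the region
        elif rho > 0.75 and np.isclose(np.linalg.norm(p), delta):
            delta *= 2.0                   # good model at the boundary: expand
        if rho > 1e-4:                     # accept only on sufficient decrease
            w = w + p
    return w
```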

SLIDE 48

Results - Loss

[Figure: training loss curves, full-batch and stochastic settings.]

SLIDE 49

Results - Accuracy

[Figure: classification accuracy curves, full-batch and stochastic settings.]

SLIDE 50

Results - Training Time

[Figure: training time, full-batch and stochastic settings.]

SLIDE 51

Acknowledgement

This research is supported by NSF grants CMMI 1333326, IIS 1741490, and ACI 1429783.

SLIDE 52

Paper, Code and Slides: http://rafati.net

jrafatiheravi@ucmerced.edu