

SLIDE 1

Calculating Hypergradient

Jingchang Liu, November 13, 2019

HKUST

SLIDE 2

Table of Contents

  • Background
  • Bilevel optimization
  • Forward and Reverse Gradient-Based Hyperparameter Optimization
  • Conclusion
  • Q & A

SLIDE 3

Background

SLIDE 4

Hyperparameter Optimization

Tradeoff parameter

  • The dataset is split in two: $S_{\text{train}}$ and $S_{\text{test}}$.
  • Suppose we add the $\ell_2$ norm as the regularization term; then

$$\arg\min_{\lambda \in D} \; \mathrm{loss}(S_{\text{test}}, X(\lambda)) \qquad (1)$$
$$\text{s.t. } X(\lambda) \in \arg\min_{x \in \mathbb{R}^p} \; \mathrm{loss}(S_{\text{train}}, x) + e^{\lambda}\|x\|^2.$$

Stepsize

For gradient descent with momentum:
$$v_t = \mu v_{t-1} + \nabla J_t(w_{t-1}), \qquad w_t = w_{t-1} - \eta\,(\mu v_{t-1} + \nabla J_t(w_{t-1})).$$
The hyperparameters are $\mu$ and $\eta$.

SLIDE 5

Group Lasso

Traditional Group Lasso

To induce a group-sparsity effect on the parameter $w$, we solve
$$\hat{w} \in \arg\min_{w \in \mathbb{R}^p} \; \tfrac{1}{2}\|y - Xw\|^2 + \lambda \sum_{l=1}^{L} \|w_{G_l}\|_2, \qquad (2)$$
where the features are partitioned into $L$ groups $\{G_1, G_2, \ldots, G_L\}$.

  • But we need to specify the partition ourselves beforehand.
  • How can we learn the partition?

SLIDE 6

Group Lasso

  • Encapsulate the group structure in a hyperparameter
$$\theta = [\theta_1, \theta_2, \ldots, \theta_L] \in \{0,1\}^{P \times L},$$
where $L$ is the maximum number of groups and $P$ is the number of features.
  • $\theta_{p,l} = 1$ if the $p$-th feature belongs to the $l$-th group, and $0$ otherwise.

Formulation for learning $\theta$:
$$\hat{\theta} \in \arg\min_{\theta \in \{0,1\}^{P \times L}} C(\hat{w}(\theta)), \qquad (3)$$
where $C(\hat{w}(\theta))$ can be the validation error $C(\hat{w}(\theta)) = \tfrac{1}{2}\|y' - X'\hat{w}(\theta)\|^2$, and
$$\hat{w}(\theta) = \arg\min_{w \in \mathbb{R}^{P}} \; \tfrac{1}{2}\|y - Xw\|^2 + \lambda \sum_{l=1}^{L} \|\theta_l \odot w\|_2. \qquad (4)$$
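As a sanity check on this encoding, here is a small NumPy sketch (toy sizes and values are made up) confirming that when $\theta$ encodes a disjoint partition, the penalty in (4) reduces to the classical group-lasso term in (2):

```python
import numpy as np

# Hypothetical toy sizes: P = 4 features, L = 2 groups.
P, L = 4, 2

# theta[p, l] = 1 iff the p-th feature belongs to the l-th group.
theta = np.zeros((P, L))
theta[[0, 1], 0] = 1.0   # features 0, 1 in group G1
theta[[2, 3], 1] = 1.0   # features 2, 3 in group G2

w = np.array([1.0, -2.0, 0.0, 3.0])
lam = 0.5

# Penalty from (4): lam * sum_l || theta_l ⊙ w ||_2
penalty = lam * sum(np.linalg.norm(theta[:, l] * w) for l in range(L))

# With this theta it equals the classical group-lasso term
# lam * (||w_{G1}|| + ||w_{G2}||) from (2).
classical = lam * (np.linalg.norm(w[:2]) + np.linalg.norm(w[2:]))
assert np.isclose(penalty, classical)
```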

SLIDE 7

Bilevel optimization

SLIDE 8

Bilevel Optimization

We arrive at the following optimization problem:
$$\min_{x} \; f^U(x, y) \quad \text{s.t. } y \in \arg\min_{y'} f^L(x, y'), \qquad (5)$$

  • $f^U$ is the upper-level objective, over the two variables $x$ and $y$.
  • $f^L$ is the lower-level objective, which binds $y$ as a function of $x$.
  • (5) can be viewed as a special case of constrained optimization.
  • If we can obtain the analytic solution $y^*(x)$ of $y$, then we only need to solve the single-level problem $\min_x f^U(x, y^*(x))$.

SLIDE 9

Gradient

Compute the gradient of the solution of the lower-level problem with respect to the upper-level variable:
$$x = x - \eta \left( \frac{\partial f^U}{\partial x} + \frac{\partial f^U}{\partial y} \frac{\partial y}{\partial x} \right)\Big|_{(x,\,y^*)}. \qquad (6)$$
How do we calculate $\partial y / \partial x$?

Theorem. Let $f : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ be a continuous function with first and second derivatives. Let $g(x) = \arg\min_y f(x, y)$. Then the derivative of $g$ with respect to $x$ is
$$\frac{dg(x)}{dx} = -\frac{f_{XY}(x, g(x))}{f_{YY}(x, g(x))}, \qquad (7)$$
where $f_{XY} = \frac{\partial^2 f}{\partial x \partial y}$ and $f_{YY} = \frac{\partial^2 f}{\partial y^2}$.

SLIDE 10

Proof

  • 1. Since $g(x) = \arg\min_y f(x, y)$, we have $\frac{\partial f(x,y)}{\partial y}\big|_{y = g(x)} = 0$.
  • 2. Differentiating both sides with respect to $x$, we get $\frac{d}{dx} \frac{\partial f(x, g(x))}{\partial y} = 0$.
  • 3. By the chain rule,
$$\frac{d}{dx} \frac{\partial f(x, g(x))}{\partial y} = \frac{\partial^2 f(x, g(x))}{\partial x \partial y} + \frac{\partial^2 f(x, g(x))}{\partial y^2} \frac{dg(x)}{dx}. \qquad (8)$$
Equating to zero and rearranging gives
$$\frac{dg(x)}{dx} = -\left( \frac{\partial^2 f(x, g(x))}{\partial y^2} \right)^{-1} \frac{\partial^2 f(x, g(x))}{\partial x \partial y} \qquad (9)$$
$$= -\frac{f_{XY}(x, g(x))}{f_{YY}(x, g(x))}. \qquad (10)$$
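The theorem is easy to check numerically. The sketch below uses a made-up test function $f(x, y) = (y - \sin x)^2$, whose minimizer $g(x) = \sin x$ is known in closed form, and compares finite-difference estimates of (7) against $g'(x) = \cos x$:

```python
import numpy as np

# Hypothetical test function: f(x, y) = (y - sin(x))**2,
# so g(x) = argmin_y f(x, y) = sin(x) and g'(x) = cos(x).
def f_y(x, y):   # ∂f/∂y
    return 2.0 * (y - np.sin(x))

x0 = 0.7
g = np.sin(x0)            # the minimizer g(x0)
eps = 1e-5

# Finite-difference second derivatives at (x0, g(x0)).
f_yy = (f_y(x0, g + eps) - f_y(x0, g - eps)) / (2 * eps)   # = 2
f_xy = (f_y(x0 + eps, g) - f_y(x0 - eps, g)) / (2 * eps)   # = -2 cos(x0)

dg_dx = -f_xy / f_yy      # formula (7)
assert np.isclose(dg_dx, np.cos(x0), atol=1e-6)
```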

SLIDE 11

Lemma

Lemma 1. Let $f : \mathbb{R} \times \mathbb{R}^n \to \mathbb{R}$ be a continuous function with first and second derivatives. Let $g(x) = \arg\min_{y \in \mathbb{R}^n} f(x, y)$. Then the derivative of $g$ with respect to $x$ is
$$g'(x) = -f_{YY}(x, g(x))^{-1} f_{XY}(x, g(x)), \qquad (11)$$
where $f_{YY} = \nabla^2_{yy} f(x, y) \in \mathbb{R}^{n \times n}$ and $f_{XY} = \frac{\partial}{\partial x} \nabla_y f(x, y) \in \mathbb{R}^n$.

SLIDE 12

Application to hyperparameter optimization (ICML 16)

Hyperparameter optimization:
$$\arg\min_{\lambda \in D} \; \mathrm{loss}(S_{\text{test}}, X(\lambda)) \qquad (12)$$
$$\text{s.t. } X(\lambda) \in \arg\min_{x \in \mathbb{R}^p} \; \mathrm{loss}(S_{\text{train}}, x) + e^{\lambda}\|x\|^2.$$

Gradient descent for the bilevel problem:
$$x = x - \eta \left( \frac{\partial f^U}{\partial x} + \frac{\partial f^U}{\partial y} \frac{\partial y}{\partial x} \right)\Big|_{(x,\,y^*)} \qquad (13)$$
$$= x - \eta \left( \frac{\partial f^U}{\partial x} - \frac{\partial f^U}{\partial y} \left( \frac{\partial^2 f(x, g(x))}{\partial y^2} \right)^{-1} \frac{\partial^2 f(x, g(x))}{\partial x \partial y} \right). \qquad (14)$$

Gradient (in HOAG's notation, where $g$ is the outer objective and $h$ the inner one):
$$\nabla f = \nabla_2 g - (\nabla^2_{1,2} h)^T (\nabla^2_1 h)^{-1} \nabla_1 g.$$
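A minimal sketch of (14), assuming a ridge-regularized least-squares inner problem with the $e^{\lambda}\|x\|^2$ penalty from (12) (all data and sizes are invented): the inner solution has a closed form, so the hypergradient reduces to one linear solve against the inner Hessian, and we can check it by finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical train/validation least-squares data.
A, b = rng.normal(size=(20, 5)), rng.normal(size=20)    # train
Av, bv = rng.normal(size=(10, 5)), rng.normal(size=10)  # validation

def x_of(lam):
    # Inner solution of min_x 0.5||Ax - b||^2 + e^lam ||x||^2 (closed form).
    return np.linalg.solve(A.T @ A + 2 * np.exp(lam) * np.eye(5), A.T @ b)

def f(lam):
    # Outer (validation) objective evaluated at the inner solution.
    r = Av @ x_of(lam) - bv
    return 0.5 * r @ r

lam = 0.3
x = x_of(lam)
H = A.T @ A + 2 * np.exp(lam) * np.eye(5)     # inner Hessian ∇²₁h
cross = 2 * np.exp(lam) * x                   # ∇²_{1,2}h (a vector; m = 1)
g1 = Av.T @ (Av @ x - bv)                     # ∇₁g
hypergrad = -cross @ np.linalg.solve(H, g1)   # (14); here ∇₂g = 0

# Finite-difference check.
eps = 1e-6
fd = (f(lam + eps) - f(lam - eps)) / (2 * eps)
assert np.isclose(hypergrad, fd, rtol=1e-4)
```

Note that one linear solve with $H$ replaces the explicit inverse in (14); this is the design choice HOAG exploits by solving that system only approximately.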

SLIDE 13

HOAG


SLIDE 14

Analysis

Conclusion

  • If the sequence $\{\epsilon_i\}_{i=1}^{\infty}$ is summable, then this implies convergence to a stationary point of $f$.

Theorem. If the sequence $\{\epsilon_i\}_{i=1}^{\infty}$ is positive and satisfies $\sum_{i=1}^{\infty} \epsilon_i < \infty$, then the sequence $\lambda_k$ of iterates in the HOAG algorithm has a limit $\lambda^* \in D$. In particular, if $\lambda^*$ belongs to the interior of $D$, then $\nabla f(\lambda^*) = 0$.

SLIDE 15

Forward and Reverse Gradient-Based Hyperparameter Optimization

SLIDE 16

Formulation I

  • Focus on the training procedure of an objective function $J(w)$ with respect to $w$.
  • The training procedure of SGD or its variants (momentum, RMSProp, Adam) can be regarded as a dynamical system with state $s_t \in \mathbb{R}^d$:
$$s_t = \Phi_t(s_{t-1}, \lambda), \quad t = 1, \ldots, T.$$
  • For gradient descent with momentum:
$$v_t = \mu v_{t-1} + \nabla J_t(w_{t-1}), \qquad w_t = w_{t-1} - \eta\,(\mu v_{t-1} + \nabla J_t(w_{t-1})).$$
  • 1. $s_t = (w_t, v_t)$, $s_t \in \mathbb{R}^d$;
  • 2. $\lambda = (\mu, \eta)$, $\lambda \in \mathbb{R}^m$;
  • 3. $\Phi_t : \mathbb{R}^d \times \mathbb{R}^m \to \mathbb{R}^d$.
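The momentum recurrence above can be sketched as a state map $\Phi$. In this sketch a toy quadratic loss $J(w) = \tfrac{1}{2}\|w\|^2$ stands in for the training objective, and the constants are made up:

```python
import numpy as np

def grad_J(w):
    # Hypothetical quadratic training loss J(w) = 0.5 * ||w||^2.
    return w

def Phi(s, lam):
    # One step of momentum GD as a state map s_t = Phi_t(s_{t-1}, lambda):
    # state s = (w, v), hyperparameters lambda = (mu, eta).
    w, v = s
    mu, eta = lam
    v_new = mu * v + grad_J(w)
    w_new = w - eta * v_new
    return (w_new, v_new)

s = (np.array([1.0, -2.0]), np.zeros(2))   # s_0 = (w_0, v_0)
lam = (0.5, 0.1)                           # lambda = (mu, eta)
for t in range(50):
    s = Phi(s, lam)

# The dynamical system converges to the minimizer w* = 0 of J.
assert np.linalg.norm(s[0]) < 1e-3
```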

SLIDE 17

Formulation II

  • The iterates $s_1, \ldots, s_T$ implicitly depend on the vector of hyperparameters $\lambda$.
  • Goal: optimize the hyperparameters according to a certain error function $E$ evaluated at the last iterate $s_T$.
  • We wish to solve the problem
$$\min_{\lambda \in \Lambda} f(\lambda),$$
where the set $\Lambda \subset \mathbb{R}^m$ incorporates constraints on the hyperparameters.
  • The response function $f : \mathbb{R}^m \to \mathbb{R}$ is defined at $\lambda \in \mathbb{R}^m$ by
$$f(\lambda) = E(s_T(\lambda)).$$

SLIDE 18

Diagram

Figure 1: The iterates $s_1, \ldots, s_T$ depend on the hyperparameters $\lambda$.

  • Change the bilevel program to use the parameters at the last iterate $s_T$ rather than $\hat{w}$:
$$\min_{\lambda \in \Lambda} f(\lambda), \quad \text{where } f(\lambda) = E(s_T(\lambda)).$$
  • The hypergradient:
$$\nabla f(\lambda) = \nabla E(s_T) \frac{d s_T}{d \lambda}.$$

SLIDE 19

Forward-Mode to calculate hypergradient

  • Chain rule:
$$\nabla f(\lambda) = \nabla E(s_T) \frac{d s_T}{d \lambda},$$
where $\frac{d s_T}{d \lambda}$ is a $d \times m$ matrix.
  • Since $s_t = \Phi_t(s_{t-1}, \lambda)$, $\Phi_t$ depends on $\lambda$ both directly and indirectly through the state $s_{t-1}$:
$$\frac{d s_t}{d \lambda} = \frac{\partial \Phi_t(s_{t-1}, \lambda)}{\partial s_{t-1}} \frac{d s_{t-1}}{d \lambda} + \frac{\partial \Phi_t(s_{t-1}, \lambda)}{\partial \lambda}.$$
  • Defining $Z_t = \frac{d s_t}{d \lambda}$, we rewrite this as
$$Z_t = A_t Z_{t-1} + B_t, \quad t \in \{1, \ldots, T\}.$$
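A minimal sketch of the forward-mode recurrence, assuming plain gradient descent on a toy quadratic so that $A_t$ and $B_t$ have simple closed forms; the resulting hypergradient with respect to the stepsize $\eta$ is checked by finite differences:

```python
import numpy as np

# Hypothetical setup: plain GD on J(w) = 0.5 * ||w||^2,
# so Phi(w, eta) = w - eta * w, with a single hyperparameter eta (m = 1).
w0 = np.array([1.0, -2.0])
T, eta = 10, 0.3

def E(w):                               # validation error at the last iterate
    return 0.5 * w @ w

# Forward mode: propagate Z_t = A_t Z_{t-1} + B_t alongside the iterates.
w, Z = w0.copy(), np.zeros_like(w0)     # Z_0 = d s_0 / d eta = 0
for t in range(T):
    A = (1.0 - eta) * np.eye(2)         # A_t = dPhi/dw at w_{t-1}
    B = -w                              # B_t = dPhi/deta = -grad J(w_{t-1})
    Z = A @ Z + B
    w = w - eta * w

hypergrad = w @ Z                       # ∇E(s_T) Z_T

# Finite-difference check of d/deta E(s_T(eta)).
def f(eta):
    w = w0.copy()
    for _ in range(T):
        w = w - eta * w
    return E(w)
eps = 1e-6
assert np.isclose(hypergrad, (f(eta + eps) - f(eta - eps)) / (2 * eps), rtol=1e-5)
```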

SLIDE 20

Forward-mode Recurrence

Figure 2: Recurrence


SLIDE 21

Forward-HG algorithm

Figure 3: Forward-HG algorithm


SLIDE 22

Reverse-Mode to calculate hypergradient

Reformulate the original problem as the constrained optimization problem
$$\min_{\lambda, s_1, \ldots, s_T} E(s_T), \quad \text{s.t. } s_t = \Phi_t(s_{t-1}, \lambda), \; t \in \{1, \ldots, T\}.$$
Lagrangian:
$$\mathcal{L}(s, \lambda, \alpha) = E(s_T) + \sum_{t=1}^{T} \alpha_t \left( \Phi_t(s_{t-1}, \lambda) - s_t \right).$$

SLIDE 23

Partial derivatives of the Lagrangian

SLIDE 24

Derivations

Notation:
$$A_t = \frac{\partial \Phi_t(s_{t-1}, \lambda)}{\partial s_{t-1}}, \qquad B_t = \frac{\partial \Phi_t(s_{t-1}, \lambda)}{\partial \lambda},$$
noting that $A_t \in \mathbb{R}^{d \times d}$ and $B_t \in \mathbb{R}^{d \times m}$.

Setting $\frac{\partial \mathcal{L}}{\partial s_t} = 0$ for $t < T$ and $\frac{\partial \mathcal{L}}{\partial s_T} = 0$ gives
$$\alpha_t = \begin{cases} \nabla E(s_T) & \text{if } t = T, \\ \nabla E(s_T)\, A_T \cdots A_{t+1} & \text{if } t \in \{1, \ldots, T-1\}. \end{cases} \qquad (15)$$
Since $\frac{\partial \mathcal{L}}{\partial \lambda} = \sum_{t=1}^{T} \alpha_t B_t$,
$$\frac{\partial \mathcal{L}}{\partial \lambda} = \nabla E(s_T) \sum_{t=1}^{T} \left( \prod_{s=t+1}^{T} A_s \right) B_t.$$
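The same toy problem as in the forward-mode sketch, now in reverse mode: one forward pass stores the trajectory, then the backward recursion (15) accumulates $\sum_t \alpha_t B_t$:

```python
import numpy as np

# Same hypothetical setup as forward mode: Phi(w, eta) = (1 - eta) * w,
# E(w) = 0.5 * ||w||^2, single hyperparameter eta.
w0 = np.array([1.0, -2.0])
T, eta = 10, 0.3

# Forward pass: store the whole trajectory (the memory cost of reverse mode).
traj = [w0.copy()]
for t in range(T):
    traj.append((1.0 - eta) * traj[-1])

# Backward pass: alpha_T = ∇E(s_T), alpha_{t-1} = alpha_t A_t,
# accumulating dL/deta = sum_t alpha_t B_t.
alpha = traj[-1].copy()          # ∇E(w_T) for E(w) = 0.5 * ||w||^2
hypergrad = 0.0
for t in range(T, 0, -1):
    B_t = -traj[t - 1]           # dPhi/deta at step t
    hypergrad += alpha @ B_t
    alpha = alpha @ ((1.0 - eta) * np.eye(2))   # alpha_{t-1} = alpha_t A_t

# Matches the closed form -T * (1 - eta)^(2T - 1) * ||w0||^2,
# i.e. the same value forward mode produces.
assert np.isclose(hypergrad, -T * (1 - eta) ** (2 * T - 1) * (w0 @ w0))
```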

SLIDE 25

Reverse-HG algorithm

Figure 4: Reverse-HG algorithm

Truncated back-propagation: run the backward pass only from $t = T-1$ down to $t = T-k$.

SLIDE 26

Real-Time HO

  • For $t \in \{1, \ldots, T\}$, define
$$f_t(\lambda) = E(s_t(\lambda)).$$
  • Partial hypergradients are available in forward mode:
$$\nabla f_t(\lambda) = \frac{d E(s_t)}{d \lambda} = \nabla E(s_t) Z_t.$$
  • Significance: we can update the hyperparameters within a single epoch, without having to wait until time $T$.

Figure 5: The iterates $s_1, \ldots, s_T$ depend on the hyperparameters $\lambda$.
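A sketch of the real-time idea on the same toy problem as before: the partial hypergradient $\nabla f_t(\lambda) = \nabla E(s_t) Z_t$ is used to update $\eta$ on the fly. The hyper-learning-rate is a made-up constant, and changing $\eta$ mid-run makes the $Z$ recurrence an approximation, which is exactly the trade-off real-time HO accepts:

```python
import numpy as np

# Toy problem: Phi(w, eta) = w - eta * w, E(w) = 0.5 * ||w||^2.
w = np.array([1.0, -2.0])
Z = np.zeros(2)              # Z_t = d s_t / d eta
eta, hyper_lr = 0.3, 0.01    # hyper_lr is a hypothetical constant

etas = [eta]
for t in range(10):
    # Forward-mode recurrence for the current eta.
    A, B = (1.0 - eta) * np.eye(2), -w
    Z = A @ Z + B
    w = w - eta * w
    # Partial hypergradient ∇f_t = ∇E(s_t) Z_t, used immediately.
    grad_t = w @ Z
    eta = eta - hyper_lr * grad_t
    etas.append(eta)

# On this problem ∇f_t < 0, so eta increases within the single run.
assert etas[0] < eta < 1.0
```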

SLIDE 27

Real-Time HO algorithm

Figure 6: Real-Time HO algorithm


SLIDE 28

Analysis

  • Forward and reverse mode have different time/space tradeoffs.
  • Reverse mode needs to store the whole history of parameters.
  • Forward mode needs a matrix-matrix multiplication at each step.

SLIDE 29

Conclusion

SLIDE 30

Conclusions

  • Calculating hypergradients, the gradients with respect to hyperparameters, is very important for selecting good hyperparameters.
  • We discussed two ways of calculating hypergradients: bilevel optimization and forward/reverse mode.
  • In bilevel optimization, we assume an optimal solution set of the lower-level function; in forward/reverse mode, we consider the whole process of the lower-level iterations.
  • Calculating hypergradients in bilevel optimization involves solving the lower-level problem and two second-order derivatives, both of which are very costly.
  • Forward/reverse mode uses the chain rule, just as in deep network training.

SLIDE 31

Q & A