Advanced Stochastic Gradient with Variance Reduction - Jingchang Liu - PowerPoint PPT Presentation


SLIDE 1

Advanced Stochastic Gradient with Variance Reduction

Jingchang Liu December 7, 2017

University of Science and Technology of China

SLIDE 2

Table of Contents

Introduction
Control Variates
Antithetic Sampling
Stratified Sampling
Importance Sampling
Experiments
Conclusions
Q & A

SLIDE 3

Introduction

SLIDE 4

Formulations

Optimization problem: $\min_w f(w)$, where $f(w) := \frac{1}{n}\sum_{i=1}^n f_i(w)$.

Stochastic gradient descent: at each iteration $t = 1, 2, \cdots$, draw $i_t$ randomly from $\{1, \cdots, n\}$ and update

$w^{t+1} = w^t - \eta_t \nabla f_{i_t}(w^t)$

Unified formulation: with $\zeta_t$ a random variable,

$w^{t+1} = w^t - \eta_t g(w^t, \zeta_t)$
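A minimal sketch of this update in Python (illustrative only; the least-squares $f_i$ and all names here are my assumptions, not from the slides):

```python
import numpy as np

def sgd(grad_fi, w0, n, steps=1000, eta=0.01):
    """Plain SGD on f(w) = (1/n) * sum_i f_i(w).

    grad_fi(w, i) returns the gradient of the i-th component f_i at w.
    """
    rng = np.random.default_rng(0)
    w = w0.copy()
    for t in range(steps):
        i = rng.integers(n)           # draw i_t uniformly from {0, ..., n-1}
        w -= eta * grad_fi(w, i)      # w^{t+1} = w^t - eta_t * grad f_{i_t}(w^t)
    return w

# Hypothetical example: least squares, f_i(w) = 0.5 * (x_i' w - y_i)^2
rng = np.random.default_rng(1)
X, y = rng.normal(size=(100, 5)), rng.normal(size=100)
w = sgd(lambda w, i: (X[i] @ w - y[i]) * X[i], np.zeros(5), n=100)
```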

SLIDE 5

Estimation

Stochastic gradient: $\nabla f_{i_t}(w^t)$ estimates the full gradient $\frac{1}{n}\sum_{i=1}^n \nabla f_i(w^t)$.

Unbiasedness: $\mathbb{E}\left[\nabla f_{i_t}(w^t)\right] = \frac{1}{n}\sum_{i=1}^n \nabla f_i(w^t)$

Variance reduction (VR) techniques:

• control variates
• antithetic variates
• importance sampling
• stratified sampling

SLIDE 6

Control Variates

SLIDE 7

Control variates

Introduction: to estimate an unknown parameter $\mu$, assume we have a statistic $X$ with $\mathbb{E}X = \mu$ and another r.v. $Y$ whose mean $\mathbb{E}Y = \tau$ is known. Define a new r.v. $\bar{X} = X + c(Y - \tau)$.

Properties (a numerical sketch follows the list)

• Unbiased: $\mathbb{E}\bar{X} = \mathbb{E}X = \mu$
• Variance: $\mathrm{Var}(\bar{X}) = \mathrm{Var}(X) + c^2\,\mathrm{Var}(Y) + 2c\,\mathrm{Cov}(X, Y)$; optimal coefficient $c^* = -\frac{\mathrm{Cov}(X, Y)}{\mathrm{Var}(Y)}$
• Simple choices: $\bar{X} = X - Y + \tau$ if $\mathrm{Cov}(X, Y) > 0$; $\bar{X} = X + Y - \tau$ if $\mathrm{Cov}(X, Y) < 0$
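A small Monte Carlo sketch of the idea (illustrative; the target $\mu = \mathbb{E}[e^U]$ for $U \sim \mathrm{Unif}(0,1)$ and the control variate $Y = U$ with known mean $\tau = 1/2$ are my choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.uniform(size=100_000)

x = np.exp(u)     # X: plain estimator samples, E[X] = e - 1
y = u             # Y: control variate with known mean tau = 0.5
tau = 0.5

# Optimal coefficient c* = -Cov(X, Y) / Var(Y)
c = -np.cov(x, y)[0, 1] / np.var(y)
x_cv = x + c * (y - tau)   # controlled estimator, still unbiased

print(x.mean(), x_cv.mean())   # both approximately e - 1 = 1.71828...
print(x.var(), x_cv.var())     # the controlled variance is far smaller
```

Here $\mathrm{Cov}(X, Y) > 0$ (both increase with $U$), so $c^*$ comes out negative, matching the "subtract $Y$" rule above.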

SLIDE 8

Control variates for stochastic gradient

VR gradient

• Former (plain SGD): $v_k = \nabla f_{i_k}(w_{k-1})$
• Case 1: $v_k = \nabla f_{i_k}(w_{k-1}) - \nabla h_{i_k}(w_{k-1}) + \mathbb{E}\nabla h_{i_k}(w_{k-1})$
• Case 2: $v_k = \nabla f_{i_k}(w_{k-1}) - \nabla f_{i_k}(\tilde{w}) + \tilde{v}$

Methods (a sketch of SVRG follows the list)

• SAGA: $\nabla f_{i_k}(\tilde{w})$ is stored in a table.
• SVRG: $\nabla f_{i_k}(\tilde{w})$ is recalculated after a fixed number of iterations.
• $\lim_{k \to \infty} \mathbb{E}\|v_k\|^2 = 0$, so SAGA and SVRG converge under a fixed stepsize.
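A minimal SVRG sketch (illustrative; the function names and the least-squares example are my assumptions, but the inner update is the Case 2 gradient with $\tilde{v}$ the full gradient at the snapshot $\tilde{w}$):

```python
import numpy as np

def svrg(grad_fi, full_grad, w0, n, epochs=20, m=100, eta=0.1):
    """SVRG: refresh the snapshot w_tilde and its full gradient each epoch."""
    rng = np.random.default_rng(0)
    w = w0.copy()
    for _ in range(epochs):
        w_tilde = w.copy()
        v_tilde = full_grad(w_tilde)       # full gradient at the snapshot
        for _ in range(m):
            i = rng.integers(n)
            # Case 2 gradient: unbiased, with E||v||^2 -> 0 near the optimum
            v = grad_fi(w, i) - grad_fi(w_tilde, i) + v_tilde
            w -= eta * v                   # a fixed stepsize suffices
    return w

# Hypothetical example: least squares, f_i(w) = 0.5 * (x_i' w - y_i)^2
rng = np.random.default_rng(1)
X, y = rng.normal(size=(100, 5)), rng.normal(size=100)
w = svrg(lambda w, i: (X[i] @ w - y[i]) * X[i],
         lambda w: X.T @ (X @ w - y) / len(y),
         np.zeros(5), n=100)
```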

SLIDE 9

Antithetic Sampling

SLIDE 10

Antithetic variates

Two identically distributed r.v. $X_i, X_j$ with $\mathbb{E}X_i = \mathbb{E}X_j = \mu$. Since $\mathbb{E}\left[\frac{1}{2}(X_i + X_j)\right] = \mu$, use $\frac{1}{2}(X_i + X_j)$ to estimate $\mu$.

Formulations (a numerical sketch follows the list)

• If $X_i$ and $X_j$ are independent:
  $\mathrm{Var}\left(\frac{1}{2}(X_i + X_j)\right) = \frac{1}{4}\mathrm{Var}(X_i + X_j) = \frac{1}{4}\{\mathrm{Var}(X_i) + \mathrm{Var}(X_j)\} = \frac{1}{4} \times 2\,\mathrm{Var}(X_i) = \frac{1}{2}\mathrm{Var}(X_i)$
• If $X_i$ and $X_j$ are negatively correlated:
  $\mathrm{Var}\left(\frac{1}{2}(X_i + X_j)\right) = \frac{1}{4}\{\mathrm{Var}(X_i) + \mathrm{Var}(X_j) + 2\,\mathrm{Cov}(X_i, X_j)\} \le \frac{1}{2}\mathrm{Var}(X_i)$
• If $X_j = 2\mu - X_i$, then $\mathrm{Var}\left(\frac{1}{2}(X_i + X_j)\right) = \mathrm{Var}(\mu) = 0$.
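A quick numerical sketch (illustrative; the target $\mathbb{E}[e^U]$ for $U \sim \mathrm{Unif}(0,1)$ and the antithetic pair $(U, 1-U)$ are my choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.uniform(size=100_000)

# Independent pairing: average two i.i.d. draws
x_ind = 0.5 * (np.exp(u[::2]) + np.exp(u[1::2]))

# Antithetic pairing: U and 1-U are negatively correlated, and exp is
# monotone, so exp(U) and exp(1-U) are negatively correlated too
x_ant = 0.5 * (np.exp(u) + np.exp(1.0 - u))

print(x_ind.mean(), x_ant.mean())   # both approximately e - 1
print(x_ind.var(), x_ant.var())     # the antithetic variance is much smaller
```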

SLIDE 11

Antithetic variates for stochastic gradient

Logistic regression: $\nabla f_i(w) = \frac{e^{-y_i x_i' w}}{1 + e^{-y_i x_i' w}}\, y_i x_i$

Formulations

$\mathbb{E}\|\nabla f_i(w) + \nabla f_j(w)\|^2 = \mathbb{E}\|\nabla f_i(w)\|^2 + \mathbb{E}\|\nabla f_j(w)\|^2 + 2\,\mathbb{E}\langle \nabla f_i(w), \nabla f_j(w)\rangle$

$\mathbb{E}\langle \nabla f_i(w), \nabla f_j(w)\rangle = \mathbb{E}\left\langle \frac{e^{-y_i x_i' w}}{1 + e^{-y_i x_i' w}}\, y_i x_i,\ \frac{e^{-y_j x_j' w}}{1 + e^{-y_j x_j' w}}\, y_j x_j \right\rangle \ge -\,\mathbb{E}\left[\left\|\frac{e^{-y_i x_i' w}}{1 + e^{-y_i x_i' w}}\, y_i x_i\right\| \cdot \left\|\frac{e^{-y_j x_j' w}}{1 + e^{-y_j x_j' w}}\, y_j x_j\right\|\right]$

• Equality holds if and only if $y_i x_i$ and $y_j x_j$ are anti-parallel, so antithetic pairs should have opposing directions of $y_i x_i$.

SLIDE 12

SDCA

Derivation: $f(w) = \frac{1}{n}\sum_{i=1}^n f_i(w) + \frac{\lambda}{2}\|w\|^2$ equals

$P(y, z) = \frac{1}{n}\sum_{i=1}^n f_i(z_i) + \frac{\lambda}{2}\|y\|^2 \quad \text{s.t.}\ y = z_i,\ i = 1, 2, \cdots, n$

$L(y, z, \alpha) = P(y, z) + \frac{1}{n}\sum_{i=1}^n \alpha_i (y - z_i)$

$D(\alpha) = \inf_{y, z} L(y, z, \alpha) = \frac{1}{n}\sum_{i=1}^n \inf_{z_i}\{f_i(z_i) - \alpha_i z_i\} + \inf_y \left\{\frac{\lambda}{2}\|y\|^2 + \frac{1}{n}\sum_{i=1}^n \alpha_i y\right\} = \frac{1}{n}\sum_{i=1}^n -f_i^*(-\alpha_i) - \frac{\lambda}{2}\left\|\frac{1}{\lambda n}\sum_{i=1}^n \alpha_i\right\|^2$

SLIDE 13

SDCA

Formulation and relationships:

$\min_w f(w) = \frac{1}{n}\sum_{i=1}^n f_i(w) + 0.5\,\lambda\, w'w, \qquad \alpha_i^* = -\frac{1}{\lambda n}\nabla f_i(w^*), \qquad w^t = \sum_{i=1}^n \alpha_i^t$

Update (draw $i$; all other coordinates are unchanged):

$\alpha_l^t = \begin{cases} \alpha_l^{t-1} - \eta_t\left(\nabla f_i(w^{t-1}) + \lambda n\,\alpha_l^{t-1}\right) & l = i \\ \alpha_l^{t-1} & l \ne i \end{cases}$

$w^t = w^{t-1} + \left(\alpha_i^t - \alpha_i^{t-1}\right) = w^{t-1} - \eta_t\left(\nabla f_i(w^{t-1}) + \lambda n\,\alpha_i^{t-1}\right)$

$\lambda n\,\alpha_i^{t-1}$ is antithetic to $\nabla f_i(w^{t-1})$, and $\nabla f_i(w^{t-1}) + \lambda n\,\alpha_i^{t-1} \to 0$ as $t \to \infty$ (a sketch of this update follows).
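A minimal sketch of this update (illustrative names; one $\alpha_i$ per example is stored so that $w = \sum_i \alpha_i$, following the slide's formulation):

```python
import numpy as np

def sdca_style(grad_fi, n, d, lam, steps=10_000, eta=0.01):
    """Keep one alpha_i per example; maintain w = sum_i alpha_i."""
    rng = np.random.default_rng(0)
    alpha = np.zeros((n, d))
    w = alpha.sum(axis=0)
    for t in range(steps):
        i = rng.integers(n)
        # lam * n * alpha_i is antithetic to grad f_i(w); their sum -> 0
        delta = -eta * (grad_fi(w, i) + lam * n * alpha[i])
        alpha[i] += delta
        w += delta          # w^t = w^{t-1} + (alpha_i^t - alpha_i^{t-1})
    return w

# Hypothetical example: regularized least squares
rng = np.random.default_rng(1)
X, y = rng.normal(size=(100, 5)), rng.normal(size=100)
w = sdca_style(lambda w, i: (X[i] @ w - y[i]) * X[i], n=100, d=5, lam=0.1)
```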

SLIDE 14

Stratified Sampling

SLIDE 15

Stratified sampling

Figure 1: Stratified sampling

Group 1: $\nabla f_{11}, \cdots, \nabla f_{1 n_1}$
Group 2: $\nabla f_{21}, \cdots, \nabla f_{2 n_2}$
$\cdots$
Group L: $\nabla f_{L1}, \cdots, \nabla f_{L n_L}$

SLIDE 16

Stratified sampling

Principles

• homogeneous within groups
• heterogeneous between groups

Stratified sample size

• Proportional: $\frac{b_h}{b} = \frac{n_h}{n} = W_h$
• Neyman: $b_h = b\,\frac{W_h S_h}{\sum_{h=1}^L W_h S_h} = b\,\frac{N_h S_h}{\sum_{h=1}^L N_h S_h}$

Apply to stochastic gradient (a sketch follows the list)

• for each label $y$, cluster the $x$'s to stratify
• $(x_i, y_i) \to \nabla f_i(w;\, x_i, y_i)$
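A small sketch of stratified minibatch sampling (illustrative; the strata would come from clustering $x$ within each label, here they are hard-coded index groups, and the allocation is proportional):

```python
import numpy as np

def stratified_batch(strata, b, rng):
    """Draw a size-b minibatch with proportional allocation over strata.

    strata: list of index arrays, one per group (e.g. built once by
    clustering x within each class label).
    """
    n = sum(len(s) for s in strata)
    batch = []
    for s in strata:
        b_h = max(1, round(b * len(s) / n))   # proportional: b_h/b = n_h/n
        batch.append(rng.choice(s, size=b_h, replace=False))
    return np.concatenate(batch)

rng = np.random.default_rng(0)
strata = [np.arange(0, 60), np.arange(60, 90), np.arange(90, 100)]
print(stratified_batch(strata, b=10, rng=rng))   # about 6, 3, 1 draws per group
```

Averaging gradients over such a batch keeps the estimate unbiased while the within-group homogeneity lowers its variance.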

SLIDE 17

Importance Sampling

SLIDE 18

Importance Sampling

• Uniform sampling: $\nabla f(w^t) = \sum_{i=1}^n \frac{1}{n}\nabla f_i(w^t)$
• Importance sampling: $\nabla f(w^t) = \sum_{i=1}^n \frac{\nabla f_i(w^t)}{n p_i^t}\, p_i^t$, with $\sum_{i=1}^n p_i^t = 1$, $t = 1, 2, \cdots$

Figure 2: Importance sampling

SLIDE 19

Importance Sampling for Stochastic Gradient

$\min_{p^t} \mathbb{E}\left\|\frac{\nabla f_{i_t}(w^t)}{n p_{i_t}^t}\right\|^2 = \min_{p^t} \frac{1}{n^2}\sum_{i=1}^n \frac{\|\nabla f_i(w^t)\|^2}{p_i^t} \ge \frac{1}{n^2}\left(\sum_{i=1}^n \|\nabla f_i(w^t)\|\right)^2$

with the minimum attained at

$p_i^t = \frac{\|\nabla f_i(w^t)\|}{\sum_{j=1}^n \|\nabla f_j(w^t)\|}$

If $f_i(w)$ is $L_i$-Lipschitz, then $\|\nabla f_i(w)\| \le L_i$, giving the practical choice

$p_i^t = \frac{L_i}{\sum_{j=1}^n L_j}$
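A minimal sketch of Lipschitz-based importance sampling (illustrative; draws $i$ with $p_i \propto L_i$ and reweights by $1/(n p_i)$ to stay unbiased; the least-squares example and its per-example constants are my assumptions):

```python
import numpy as np

def importance_sgd(grad_fi, L, w0, steps=10_000, eta=0.01):
    """SGD with sampling probabilities p_i = L_i / sum_j L_j."""
    rng = np.random.default_rng(0)
    n = len(L)
    p = np.asarray(L, dtype=float)
    p /= p.sum()
    w = w0.copy()
    for t in range(steps):
        i = rng.choice(n, p=p)
        # reweight by 1/(n p_i) so the expectation is the full gradient
        w -= eta * grad_fi(w, i) / (n * p[i])
    return w

# Hypothetical example: least squares with per-example constant ||x_i||^2
rng = np.random.default_rng(1)
X, y = rng.normal(size=(100, 5)), rng.normal(size=100)
L = np.linalg.norm(X, axis=1) ** 2
w = importance_sgd(lambda w, i: (X[i] @ w - y[i]) * X[i], L, np.zeros(5))
```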

SLIDE 20

Experiments

SLIDE 21

Stratified Sampling

Figure 3: Multi-class logistic regression (convex) on the letter, mnist, pendigits, and usps datasets.

SLIDE 22

Important sampling

Figure 4: SVM on several datasets

SLIDE 23

Conclusions

SLIDE 24

Conclusions

• VR based on optimization variables, such as SDCA and SVRG, can make the variance converge to 0.
• VR based on samples can significantly reduce the variance.
• Constructing correlated variates is crucial.
• Different VR methods can be combined, but doing so well still needs further effort.

SLIDE 25

Q & A