SLIDE 1

Training Neural Networks Using Features Replay

Zhouyuan Huo1, Bin Gu1,2, Heng Huang1,2

1 Department of Electrical and Computer Engineering, University of Pittsburgh; 2 JD.com

November 28, 2018


SLIDE 2

Motivation

Backpropagation algorithm: step 1, forward pass; step 2, backward pass. Problem: the backward pass takes roughly twice as long as the forward pass, and it suffers from backward locking: no module can compute its gradients until every module above it has finished, so the backward pass cannot be parallelized across modules.
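As a concrete illustration of backward locking, here is a minimal numpy sketch (hypothetical setup: one tanh layer per "module" and a toy squared loss, not the paper's network): module k's update cannot start until the error gradient arrives from module k+1, so the backward loop is inherently sequential.

```python
import numpy as np

# Toy 3-module network, one tanh layer per module (hypothetical setup).
rng = np.random.default_rng(0)
Ws = [rng.normal(scale=0.1, size=(8, 8)) for _ in range(3)]

x = rng.normal(size=(4, 8))
hs = [x]                                  # forward pass: cache activations
for W in Ws:
    hs.append(np.tanh(hs[-1] @ W))

g = 2 * hs[-1]                            # gradient of toy loss f = ||h_L||^2

# Backward locking: module k cannot start until module k+1 has
# delivered its error gradient, so this loop is inherently sequential.
for k in reversed(range(3)):
    delta = g * (1.0 - hs[k + 1] ** 2)    # chain rule through tanh
    grad_W = hs[k].T @ delta              # gradient for module k's weights
    g = delta @ Ws[k].T                   # error gradient sent to module k-1
    Ws[k] -= 0.1 * grad_W                 # update only after gradient arrives
```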


SLIDE 3

Problem Reformulation

Original formulation:

\[
\min_{w} \; f(h_L, y) \quad \text{s.t.} \quad h_l = F_l(h_{l-1}; w_l), \quad l = 1, \dots, L.
\]

New formulation (the L layers are split into K modules; G(k) denotes the layers of module k and h_{L_k} the output of module k at iteration t):

\[
\min_{w,\,\delta} \; \sum_{k=1}^{K-1} \left\| \delta_k^t - \frac{\partial f_{h_{L_k}^t}(w^t)}{\partial h_{L_k}^t} \right\|_2^2 + f\!\left(h_{L_K}^t, y^t\right) \quad \text{s.t.} \quad h_{L_k}^t = F_{G(k)}\!\left(h_{L_{k-1}}^t; w_{G(k)}^t\right).
\]
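As a worked instance (assuming the four-module split used on the next slide, K = 4, which is not stated in this formula itself), the new objective expands into three penalty terms plus the loss:

\[
\min_{w,\,\delta} \; \sum_{k=1}^{3} \left\| \delta_k^t - \frac{\partial f_{h_{L_k}^t}(w^t)}{\partial h_{L_k}^t} \right\|_2^2 + f\!\left(h_{L_4}^t, y^t\right),
\]

so each of modules 1-3 owns exactly one penalty term and module 4 owns the loss, which is how the problem decouples on the next slide.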


SLIDE 4

Problem Reformulation (Continued)

Module 1:

\[
\min_{w,\,\delta} \; \left\| \delta_1^t - \frac{\partial f_{h_{L_1}^t}(w^t)}{\partial h_{L_1}^t} \right\|_2^2 \quad \text{s.t.} \quad h_{L_1}^t = F_{G(1)}\!\left(h_{L_0}^t; w_{G(1)}^t\right).
\]

We approximate

\[
\delta_1^t = \frac{\partial f_{h_{L_1}^{t-3}}(w^t)}{\partial h_{L_1}^{t-3}},
\]

i.e., module 1 reuses the error gradient evaluated at the features from iteration t − 3: with K = 4 modules, the gradient for module k arrives with a delay of K − k iterations.

Module 4:

\[
\min_{w,\,\delta} \; f\!\left(h_{L_4}^t, y^t\right) \quad \text{s.t.} \quad h_{L_4}^t = F_{G(4)}\!\left(h_{L_3}^t; w_{G(4)}^t\right).
\]


SLIDE 5

Features Replay

[Figure: a 12-layer network split into four modules (layers 1-3, 4-6, 7-9, 10-12). Each module k keeps the activations it received in recent iterations (module 1: h^{t-3}, h^{t-2}, h^{t-1}; module 2: h^{t-2}_3, h^{t-1}_3, h^t_3; ...), replays one of them to recompute features h̃^t, and combines the result with the error gradient δ^t_k from the module above; module 4 computes the loss. Legend: forward pass, backward pass, activation h, error gradient δ.]

Forward pass (Play):

\[
h_{L_k}^t = F_{G(k)}\!\left(h_{L_{k-1}}^t; w_{G(k)}^t\right).
\]

Backward pass (Replay):

\[
\tilde{h}_{L_k}^t = F_{G(k)}\!\left(h_{L_{k-1}}^{t+k-K}; w_{G(k)}^t\right).
\]

Apply the chain rule using $\tilde{h}_{L_k}^t$ and $\delta_k^t$ in each module.
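Below is a minimal numpy sketch of the play/replay mechanism, simulated sequentially: one tanh layer per module, a toy squared loss, and 0-based module indices are all hypothetical choices, a conceptual sketch rather than the authors' implementation. Each module stores its recent inputs; during replay it recomputes features from its oldest stored input with the current weights and consumes the error gradient its upstream neighbor produced one iteration earlier, so no module waits on the current iteration's backward pass.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 3                                     # number of modules (toy)
Ws = [rng.normal(scale=0.1, size=(8, 8)) for _ in range(K)]
inputs = [[] for _ in range(K)]           # per-module history of inputs
deltas = [None] * K                       # deltas[k]: gradient from module k+1
lr = 0.05

for t in range(300):
    x = rng.normal(size=(4, 8))

    # ---- Play: ordinary forward pass; module k records its input.
    h = x
    for k in range(K):
        inputs[k].append(h)
        h = np.tanh(h @ Ws[k])

    # ---- Replay: module k recomputes its features from a stale input
    # (K - 1 - k iterations old) with the CURRENT weights, then applies
    # the chain rule with the stale error gradient from module k+1.
    new_deltas = [None] * K
    for k in reversed(range(K)):
        if len(inputs[k]) < K - k:        # pipeline is still filling up
            continue
        h_old = inputs[k].pop(0)          # oldest stored input (replay)
        h_tilde = np.tanh(h_old @ Ws[k])  # replayed features
        if k == K - 1:                    # last module owns the loss
            g = 2 * h_tilde               # gradient of toy f = ||h||^2
        else:
            g = deltas[k]                 # stale gradient from above
        dpre = g * (1.0 - h_tilde ** 2)   # chain rule through tanh
        if k > 0:
            new_deltas[k - 1] = dpre @ Ws[k].T  # message for module k-1
        Ws[k] -= lr * (h_old.T @ dpre)    # immediate, unlocked update
    deltas = new_deltas
```

In a real pipelined implementation the K replay steps would run in parallel, one per device; the sequential loop above only mimics the message schedule.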


SLIDE 6

Convergence Guarantee

Convergence guarantee:

\[
\frac{1}{\sum_{t=0}^{T-1} \gamma_t} \sum_{t=0}^{T-1} \gamma_t \, \mathbb{E}\left\| \nabla f(w^t) \right\|_2^2 \le \frac{f(w^0) - f(w^*)}{\sigma \sum_{t=0}^{T-1} \gamma_t} + \frac{LM}{2\sigma} \cdot \frac{\sum_{t=0}^{T-1} \gamma_t^2}{\sum_{t=0}^{T-1} \gamma_t}. \tag{1}
\]
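To read bound (1), consider a constant step size $\gamma_t = \gamma$ (a standard instantiation, assumed here rather than stated on the slide); the weighted average on the left becomes a plain average and the bound reduces to

\[
\frac{1}{T} \sum_{t=0}^{T-1} \mathbb{E}\left\| \nabla f(w^t) \right\|_2^2 \le \frac{f(w^0) - f(w^*)}{\sigma \gamma T} + \frac{LM\gamma}{2\sigma},
\]

so choosing $\gamma \propto 1/\sqrt{T}$ drives both terms to zero at rate $O(1/\sqrt{T})$, the usual rate for SGD on nonconvex objectives.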


SLIDE 7

Experimental Results

Faster Convergence. Lower Memory Consumption. Better Generalization Error.


SLIDE 8

Thanks!

Welcome to poster #12, Room 210 & 230 AB.
