

SLIDE 1

Towards Binary-Valued Gates for Robust LSTM Training

Zhuohan Li, Di He, Fei Tian, Wei Chen, Tao Qin, Liwei Wang, Tie-Yan Liu

Peking University & Microsoft Research Asia

ICML | 2018

2018/07/12 Towards Binary-Valued Gates for Robust LSTM Training 1

SLIDE 2

Long Short-Term Memory (LSTM) RNN

  • h_t, c_t = LSTM(h_{t-1}, c_{t-1}, x_t)
  • f_t = σ(W_f x_t + U_f h_{t-1} + b_f)
  • i_t = σ(W_i x_t + U_i h_{t-1} + b_i)
  • g_t = tanh(W_g x_t + U_g h_{t-1} + b_g)
  • o_t = σ(W_o x_t + U_o h_{t-1} + b_o)
  • c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t
  • h_t = o_t ⊙ tanh(c_t)

Figure credit: Christopher Olah, "Understanding LSTM Networks"

[Figure: LSTM cell diagram — σ gates f_t, i_t, o_t; tanh blocks for the candidate g_t and the cell output]
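The cell update above can be sketched in NumPy. This is a minimal illustration; the parameter layout (dicts `W`, `U`, `b` keyed by gate name) is a hypothetical choice, not the authors' implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(h_prev, c_prev, x, params):
    """One LSTM step following the slide's equations."""
    W, U, b = params  # dicts keyed by gate name: 'f', 'i', 'g', 'o'
    f = sigmoid(W['f'] @ x + U['f'] @ h_prev + b['f'])  # forget gate
    i = sigmoid(W['i'] @ x + U['i'] @ h_prev + b['i'])  # input gate
    g = np.tanh(W['g'] @ x + U['g'] @ h_prev + b['g'])  # candidate cell
    o = sigmoid(W['o'] @ x + U['o'] @ h_prev + b['o'])  # output gate
    c = f * c_prev + i * g   # new cell state (elementwise product)
    h = o * np.tanh(c)       # new hidden state
    return h, c

# Tiny usage example with random parameters
rng = np.random.default_rng(0)
d_h, d_x = 4, 3
mk = lambda shape: rng.standard_normal(shape) * 0.1
params = ({k: mk((d_h, d_x)) for k in 'figo'},
          {k: mk((d_h, d_h)) for k in 'figo'},
          {k: np.zeros(d_h) for k in 'figo'})
h, c = lstm_step(np.zeros(d_h), np.zeros(d_h), rng.standard_normal(d_x), params)
```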

SLIDE 3

Long Short-Term Memory (LSTM) RNN

(Same LSTM equations as Slide 2; the cell diagram here highlights the Forget Gates f_t.)

SLIDE 4

Long Short-Term Memory (LSTM) RNN

(Same LSTM equations as Slide 2; the cell diagram here highlights the Input Gates i_t.)

SLIDE 5

Long Short-Term Memory (LSTM) RNN

(Same LSTM equations as Slide 2; the cell diagram here highlights the Output Gates o_t.)
SLIDE 6

Example: Input Gates & Forget Gates

  • When the LSTM sees "France", the input gate will open and the LSTM will remember the information
  • At the subsequent timesteps, the forget gates will also be open (take value 1) to keep the information. Finally, the LSTM will use this information to predict the word "French"


[Figure: the sentence "I grew up in France … I speak fluent French" processed token by token by a chain of LSTM cells]

SLIDE 7

Example: Input Gates & Forget Gates

  • When the LSTM sees "but" and "back", the forget gates should be closed (take value 0) to forget the information of "left"


[Figure: the sentence "I once left, but now I am back" processed token by token by a chain of LSTM cells]

SLIDE 8

Histograms of Gate Distributions in LSTM


[Histograms: LSTM input gates (left) and LSTM forget gates (right); the gate values are spread over the whole range (0, 1)]

Based on the gate outputs of the first-layer LSTM in the decoder, computed over 10,000 sentence pairs from the IWSLT'14 German→English training set

SLIDE 9

Training LSTM Gates Towards Binary Values


Push the gate values to the boundary of the range (0, 1). This aligns well with the original purpose of gates: to let information in, or skip it, by "opening" or "closing".

  • Ready for further compression, by pushing the activation function to be binarized
  • Enables better generalization

SLIDE 10

Ready for Further Compression & Better Generalization


[Figure: the sigmoid function with its saturation areas marked]

If the gate output falls in the saturation area of the sigmoid, then when the parameters in the gates are perturbed, the change to the output of the gates will be small, and the change to the final loss will also be small.

  • Robust to model compression
  • Better test performance: a flat region of the loss generalizes better
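A quick numerical check of this argument (a sketch, not from the paper): the same perturbation of a pre-activation moves the gate output far less in the sigmoid's saturation area than near zero.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

eps = 0.1  # size of the parameter perturbation
# Deep in the saturation area (z = 6): the gate output barely moves
delta_sat = abs(sigmoid(6.0 + eps) - sigmoid(6.0))
# Near the middle (z = 0): the same perturbation moves the output much more
delta_mid = abs(sigmoid(0.0 + eps) - sigmoid(0.0))
print(delta_sat, delta_mid)  # delta_sat is orders of magnitude smaller
```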

SLIDE 11

Sharpened Sigmoid

  • Straightforward idea: sharpen the sigmoid function by using a smaller temperature τ < 1:

      σ_{W,b}(x) = σ((Wx + b)/τ) = σ((W/τ)x + b/τ)

  • This is equivalent to rescaling the weight initialization and the gradients
  • Harms the optimization process
  • Cannot guarantee that the outputs are close to the boundary
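The rescaling equivalence is easy to verify numerically (the concrete values for W, b, x, and τ below are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

tau, W, b, x = 0.25, 0.8, -0.3, 1.5
lhs = sigmoid((W * x + b) / tau)          # sharpened sigmoid
rhs = sigmoid((W / tau) * x + (b / tau))  # plain sigmoid with rescaled W, b
# The two are identical functions, which is why the derivative w.r.t. x
# (and hence the gradient scale) is simply multiplied by 1/tau.
print(lhs, rhs)
```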

SLIDE 12

Gumbel-Softmax Estimator

  • In our special case, we leverage the Gumbel-Softmax estimator to estimate the Bernoulli variable B_x ~ B(σ(x)), which takes value 1 with probability σ(x)
  • Define G(x, τ) = σ((x + log U − log(1 − U))/τ), where U ~ Uniform(0, 1). Then the following holds for any ε ∈ (0, 1/2):

      P(B_x = 1) − (τ/4) log(1/ε) ≤ P(G(x, τ) ≥ 1 − ε) ≤ P(B_x = 1)
      P(B_x = 0) − (τ/4) log(1/ε) ≤ P(G(x, τ) ≤ ε) ≤ P(B_x = 0)
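A minimal simulation of the estimator (assuming the definition of G above; function and variable names are mine): with a small temperature the samples concentrate near {0, 1}, and the fraction landing near 1 matches the Bernoulli probability σ(x).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gumbel_output(x, tau, rng, n=100_000):
    """Sample G(x, tau) = sigma((x + log U - log(1 - U)) / tau)."""
    u = rng.uniform(size=n)
    return sigmoid((x + np.log(u) - np.log(1.0 - u)) / tau)

rng = np.random.default_rng(0)
x = 0.7
g = gumbel_output(x, tau=0.1, rng=rng)
# Fraction of samples on the "1" side vs. the target Bernoulli probability
print(np.mean(g > 0.5), sigmoid(x))
```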


SLIDE 13

Gumbel-Softmax Estimator


[Figure: probability density functions of Gumbel-Softmax estimators with different temperatures τ ∈ {0, 1/2, 1, 2}; smaller temperatures concentrate the density near 0 and 1]

SLIDE 14

Gumbel-Gate LSTM (G2-LSTM)

  • h_t, c_t = LSTM(h_{t-1}, c_{t-1}, x_t)
  • f_t = G(W_f x_t + U_f h_{t-1} + b_f, τ)
  • i_t = G(W_i x_t + U_i h_{t-1} + b_i, τ)
  • g_t = tanh(W_g x_t + U_g h_{t-1} + b_g)
  • o_t = σ(W_o x_t + U_o h_{t-1} + b_o)
  • c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t
  • h_t = o_t ⊙ tanh(c_t)


  • In the forward pass during training, we independently sample all forget and input gates at each timestep and run the G2-LSTM with the sampled gate values
  • In the backward pass, we use standard gradient-based methods to update the model parameters, since all components are differentiable
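The forward pass can be sketched in NumPy, mirroring the equations above (the parameter layout is a hypothetical choice, not the authors' code). The sampled gates remain differentiable in their pre-activations, so the backward pass is ordinary backpropagation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gumbel_gate(z, tau, rng):
    """G(z, tau): stochastically pushes the gate toward {0, 1} while
    remaining differentiable in z (gradients flow through the sigmoid)."""
    u = rng.uniform(size=np.shape(z))
    return sigmoid((z + np.log(u) - np.log(1.0 - u)) / tau)

def g2_lstm_step(h_prev, c_prev, x, params, tau, rng):
    """One G2-LSTM forward step. Input/forget gates are sampled;
    the output gate keeps the plain sigmoid, as on the slide."""
    W, U, b = params  # dicts keyed by gate name: 'f', 'i', 'g', 'o'
    f = gumbel_gate(W['f'] @ x + U['f'] @ h_prev + b['f'], tau, rng)
    i = gumbel_gate(W['i'] @ x + U['i'] @ h_prev + b['i'], tau, rng)
    g = np.tanh(W['g'] @ x + U['g'] @ h_prev + b['g'])
    o = sigmoid(W['o'] @ x + U['o'] @ h_prev + b['o'])
    c = f * c_prev + i * g
    h = o * np.tanh(c)
    return h, c
```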

SLIDE 15

Experiments

  • Language Modeling
  • Penn Treebank
  • Machine Translation
  • IWSLT'14 German→English
  • WMT'14 English→German


SLIDE 16

Sensitivity Analysis

  • Compress the gate-related parameters to show the robustness of our learned models


  • Low-precision compression
    • Reduce the support set of the parameters by Ŵ = round(W/r) × r
    • Further clip the rounded values to a fixed range using Ŵ_clip = clip(Ŵ, −c, c)
  • Low-rank compression
    • Compress the parameter matrices by singular value decomposition (SVD)
    • Reduces the model size and leads to faster matrix multiplication
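Both compression schemes can be sketched as follows (function names are mine; `r`, `c`, and `k` denote the rounding step, clipping bound, and retained rank):

```python
import numpy as np

def round_compress(W, r):
    """Low-precision compression: snap each parameter to a grid of step r."""
    return np.round(W / r) * r

def round_clip_compress(W, r, c):
    """Round, then clip to [-c, c], so only a few distinct values remain."""
    return np.clip(round_compress(W, r), -c, c)

def svd_compress(W, k):
    """Low-rank compression: keep the top-k singular components of W."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]

# Usage: a rank-2 approximation of a random 8x8 matrix
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))
W2 = svd_compress(W, k=2)
assert np.linalg.matrix_rank(W2) <= 2
```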

SLIDE 17

Experimental Results


Penn Treebank (Perplexity, lower is better)

  Model               Original   Round          Round & Clip   SVD            SVD+
  Baseline            52.8       53.2 (+0.4)    53.6 (+0.8)    56.6 (+3.8)    65.5 (+12.7)
  Sharpened Sigmoid   53.2       53.5 (+0.3)    53.6 (+0.4)    54.6 (+1.4)    60.0 (+6.8)
  G2-LSTM             52.1       52.2 (+0.1)    52.8 (+0.7)    53.3 (+1.2)    56.0 (+3.9)

IWSLT'14 German→English (BLEU, higher is better)

  Model               Original   Round          Round & Clip   SVD            SVD+
  Baseline            31.00      28.65 (-2.35)  21.97 (-9.03)  30.52 (-0.48)  29.56 (-1.44)
  Sharpened Sigmoid   29.73      27.08 (-2.65)  25.14 (-4.59)  29.17 (-0.53)  28.82 (-0.91)
  G2-LSTM             31.95      31.44 (-0.51)  31.44 (-0.51)  31.62 (-0.33)  31.28 (-0.67)

WMT'14 English→German (BLEU, higher is better)

  Model               Original   Round          Round & Clip   SVD            SVD+
  Baseline            21.89      16.22 (-5.67)  16.03 (-5.86)  21.15 (-0.74)  19.99 (-1.90)
  Sharpened Sigmoid   21.64      16.85 (-4.79)  16.72 (-4.92)  20.98 (-0.66)  19.87 (-1.77)
  G2-LSTM             22.43      20.15 (-2.28)  20.29 (-2.14)  22.16 (-0.27)  21.84 (-0.51)

SLIDE 18

Histograms of Gate Distributions in G2-LSTM


[Histograms: G2-LSTM input gates (left) and G2-LSTM forget gates (right); the gate values now concentrate near 0 and 1]

Based on the gate outputs of the first-layer G2-LSTM in the decoder, computed over the same 10,000 sentence pairs from the IWSLT'14 German→English training set

SLIDE 19

Visualization of Average Gate Values


SLIDE 20

Summary

  • A new training algorithm for LSTM, leveraging the recently developed Gumbel-Softmax estimator
  • Pushes the values of the input and forget gates to 0 or 1, leading to robust LSTM models
  • Experiments on language modeling and machine translation demonstrate the effectiveness of the proposed training algorithm


SLIDE 21

Thanks!

Poster #63


Contact: Zhuohan Li (lizhuohan@pku.edu.cn), Tao Qin (taoqin@microsoft.com)


Zhuohan is applying for a Ph.D. in Fall 2018! Please contact him if you are interested!