

SLIDE 1

Towards Binary-Valued Gates for Robust LSTM Training

Zhuohan Li, Di He, Fei Tian, Wei Chen, Tao Qin, Liwei Wang, Tie-Yan Liu

Peking University & Microsoft Research Asia

ICML | 2018

2018/07/12 Towards Binary-Valued Gates for Robust LSTM Training 1

SLIDE 2

Long Short-Term Memory (LSTM) RNN

  • h_t, c_t = LSTM(h_{t-1}, c_{t-1}, x_t)
  • f_t = σ(W_f x_t + U_f h_{t-1} + b_f)
  • i_t = σ(W_i x_t + U_i h_{t-1} + b_i)
  • g_t = tanh(W_g x_t + U_g h_{t-1} + b_g)
  • o_t = σ(W_o x_t + U_o h_{t-1} + b_o)
  • c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t
  • h_t = o_t ⊙ tanh(c_t)

Figure credit: Christopher Olah, "Understanding LSTM Networks"

[Figure: LSTM cell diagram — σ gates f_t, i_t, o_t; tanh blocks for the candidate g_t and the cell output]
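The cell update above can be sketched in NumPy. This is a minimal illustration; the parameter layout (dicts `W`, `U`, `b` keyed by gate name) is a hypothetical choice, not the authors' implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(h_prev, c_prev, x, params):
    """One LSTM step following the slide's equations."""
    W, U, b = params  # dicts keyed by gate name: 'f', 'i', 'g', 'o'
    f = sigmoid(W['f'] @ x + U['f'] @ h_prev + b['f'])  # forget gate
    i = sigmoid(W['i'] @ x + U['i'] @ h_prev + b['i'])  # input gate
    g = np.tanh(W['g'] @ x + U['g'] @ h_prev + b['g'])  # candidate cell
    o = sigmoid(W['o'] @ x + U['o'] @ h_prev + b['o'])  # output gate
    c = f * c_prev + i * g   # new cell state (elementwise product)
    h = o * np.tanh(c)       # new hidden state
    return h, c

# Tiny usage example with random parameters
rng = np.random.default_rng(0)
d_h, d_x = 4, 3
mk = lambda shape: rng.standard_normal(shape) * 0.1
params = ({k: mk((d_h, d_x)) for k in 'figo'},
          {k: mk((d_h, d_h)) for k in 'figo'},
          {k: np.zeros(d_h) for k in 'figo'})
h, c = lstm_step(np.zeros(d_h), np.zeros(d_h), rng.standard_normal(d_x), params)
```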

SLIDE 3

Long Short-Term Memory (LSTM) RNN

(Same LSTM equations as Slide 2; the cell diagram here highlights the Forget Gates f_t.)

SLIDE 4

Long Short-Term Memory (LSTM) RNN

(Same LSTM equations as Slide 2; the cell diagram here highlights the Input Gates i_t.)

SLIDE 5

Long Short-Term Memory (LSTM) RNN

(Same LSTM equations as Slide 2; the cell diagram here highlights the Output Gates o_t.)
SLIDE 6

Example: Input Gates & Forget Gates

  • When the LSTM sees "France", the input gate will open and the LSTM will remember the information
  • At the subsequent timesteps, the forget gates will also be open (take value 1) to keep the information. Finally, the LSTM will use this information to predict the word "French"


[Figure: the sentence "I grew up in France … I speak fluent French" processed token by token by a chain of LSTM cells]

SLIDE 7

Example: Input Gates & Forget Gates

  • When the LSTM sees "but" and "back", the forget gates should be closed (take value 0) to forget the information of "left"


[Figure: the sentence "I once left, but now I am back" processed token by token by a chain of LSTM cells]

SLIDE 8

Histograms of Gate Distributions in LSTM


[Histograms: LSTM input gates (left) and LSTM forget gates (right); the gate values are spread over the whole range (0, 1)]

Based on the gate outputs of the first-layer LSTM in the decoder, computed over 10,000 sentence pairs from the IWSLT'14 German→English training set

SLIDE 9

Training LSTM Gates Towards Binary Values


Push the gate values to the boundary of the range (0, 1). This aligns well with the original purpose of gates: to let information in, or skip it, by "opening" or "closing".

  • Ready for further compression, by pushing the activation function to be binarized
  • Enables better generalization

SLIDE 10

Ready for Further Compression & Better Generalization


[Figure: the sigmoid function with its saturation areas marked]

If the gate output falls in the saturation area of the sigmoid, then when the parameters in the gates are perturbed, the change to the output of the gates will be small, and the change to the final loss will also be small.

  • Robust to model compression
  • Better test performance: a flat region of the loss generalizes better
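A quick numerical check of this argument (a sketch, not from the paper): the same perturbation of a pre-activation moves the gate output far less in the sigmoid's saturation area than near zero.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

eps = 0.1  # size of the parameter perturbation
# Deep in the saturation area (z = 6): the gate output barely moves
delta_sat = abs(sigmoid(6.0 + eps) - sigmoid(6.0))
# Near the middle (z = 0): the same perturbation moves the output much more
delta_mid = abs(sigmoid(0.0 + eps) - sigmoid(0.0))
print(delta_sat, delta_mid)  # delta_sat is orders of magnitude smaller
```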

SLIDE 11

Sharpened Sigmoid

  • Straightforward idea: sharpen the sigmoid function by using a smaller temperature τ < 1:

      σ_{W,b}(x) = σ((Wx + b)/τ) = σ((W/τ)x + b/τ)

  • This is equivalent to rescaling the weight initialization and the gradients
  • Harms the optimization process
  • Cannot guarantee that the outputs are close to the boundary
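The rescaling equivalence is easy to verify numerically (the concrete values for W, b, x, and τ below are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

tau, W, b, x = 0.25, 0.8, -0.3, 1.5
lhs = sigmoid((W * x + b) / tau)          # sharpened sigmoid
rhs = sigmoid((W / tau) * x + (b / tau))  # plain sigmoid with rescaled W, b
# The two are identical functions, which is why the derivative w.r.t. x
# (and hence the gradient scale) is simply multiplied by 1/tau.
print(lhs, rhs)
```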

SLIDE 12

Gumbel-Softmax Estimator

  • In our special case, we leverage the Gumbel-Softmax estimator to estimate the Bernoulli variable B_x ~ B(σ(x)), which takes value 1 with probability σ(x)
  • Define G(x, τ) = σ((x + log U − log(1 − U))/τ), where U ~ Uniform(0, 1). Then the following holds for any ε ∈ (0, 1/2):

      P(B_x = 1) − (τ/4) log(1/ε) ≤ P(G(x, τ) ≥ 1 − ε) ≤ P(B_x = 1)
      P(B_x = 0) − (τ/4) log(1/ε) ≤ P(G(x, τ) ≤ ε) ≤ P(B_x = 0)
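A minimal simulation of the estimator (assuming the definition of G above; function and variable names are mine): with a small temperature the samples concentrate near {0, 1}, and the fraction landing near 1 matches the Bernoulli probability σ(x).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gumbel_output(x, tau, rng, n=100_000):
    """Sample G(x, tau) = sigma((x + log U - log(1 - U)) / tau)."""
    u = rng.uniform(size=n)
    return sigmoid((x + np.log(u) - np.log(1.0 - u)) / tau)

rng = np.random.default_rng(0)
x = 0.7
g = gumbel_output(x, tau=0.1, rng=rng)
# Fraction of samples on the "1" side vs. the target Bernoulli probability
print(np.mean(g > 0.5), sigmoid(x))
```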


SLIDE 13

Gumbel-Softmax Estimator


[Figure: probability density functions of Gumbel-Softmax estimators with different temperatures τ ∈ {0, 1/2, 1, 2}; smaller temperatures concentrate the density near 0 and 1]

SLIDE 14

Gumbel-Gate LSTM (G2-LSTM)

  • h_t, c_t = LSTM(h_{t-1}, c_{t-1}, x_t)
  • f_t = G(W_f x_t + U_f h_{t-1} + b_f, τ)
  • i_t = G(W_i x_t + U_i h_{t-1} + b_i, τ)
  • g_t = tanh(W_g x_t + U_g h_{t-1} + b_g)
  • o_t = σ(W_o x_t + U_o h_{t-1} + b_o)
  • c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t
  • h_t = o_t ⊙ tanh(c_t)


  • In the forward pass during training, we independently sample all forget and input gates at each timestep and run the G2-LSTM with the sampled gate values
  • In the backward pass, we use standard gradient-based methods to update the model parameters, since all components are differentiable
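The forward pass can be sketched in NumPy, mirroring the equations above (the parameter layout is a hypothetical choice, not the authors' code). The sampled gates remain differentiable in their pre-activations, so the backward pass is ordinary backpropagation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gumbel_gate(z, tau, rng):
    """G(z, tau): stochastically pushes the gate toward {0, 1} while
    remaining differentiable in z (gradients flow through the sigmoid)."""
    u = rng.uniform(size=np.shape(z))
    return sigmoid((z + np.log(u) - np.log(1.0 - u)) / tau)

def g2_lstm_step(h_prev, c_prev, x, params, tau, rng):
    """One G2-LSTM forward step. Input/forget gates are sampled;
    the output gate keeps the plain sigmoid, as on the slide."""
    W, U, b = params  # dicts keyed by gate name: 'f', 'i', 'g', 'o'
    f = gumbel_gate(W['f'] @ x + U['f'] @ h_prev + b['f'], tau, rng)
    i = gumbel_gate(W['i'] @ x + U['i'] @ h_prev + b['i'], tau, rng)
    g = np.tanh(W['g'] @ x + U['g'] @ h_prev + b['g'])
    o = sigmoid(W['o'] @ x + U['o'] @ h_prev + b['o'])
    c = f * c_prev + i * g
    h = o * np.tanh(c)
    return h, c
```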

SLIDE 15

Experiments

  • Language Modeling
  • Penn Treebank
  • Machine Translation
  • IWSLT'14 German→English
  • WMT'14 English→German


SLIDE 16

Sensitivity Analysis

  • Compress the gate-related parameters to show the robustness of our learned models


  • Low-precision compression
    • Reduce the support set of the parameters by Ŵ = round(W/r) × r
    • Further clip the rounded values to a fixed range using Ŵ_clip = clip(Ŵ, −c, c)
  • Low-rank compression
    • Compress the parameter matrices by singular value decomposition (SVD)
    • Reduces the model size and leads to faster matrix multiplication
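Both compression schemes can be sketched as follows (function names are mine; `r`, `c`, and `k` denote the rounding step, clipping bound, and retained rank):

```python
import numpy as np

def round_compress(W, r):
    """Low-precision compression: snap each parameter to a grid of step r."""
    return np.round(W / r) * r

def round_clip_compress(W, r, c):
    """Round, then clip to [-c, c], so only a few distinct values remain."""
    return np.clip(round_compress(W, r), -c, c)

def svd_compress(W, k):
    """Low-rank compression: keep the top-k singular components of W."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]

# Usage: a rank-2 approximation of a random 8x8 matrix
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))
W2 = svd_compress(W, k=2)
assert np.linalg.matrix_rank(W2) <= 2
```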

SLIDE 17

Experimental Results


Penn Treebank (Perplexity, lower is better)

  Model               Original   Round          Round & Clip   SVD            SVD+
  Baseline            52.8       53.2 (+0.4)    53.6 (+0.8)    56.6 (+3.8)    65.5 (+12.7)
  Sharpened Sigmoid   53.2       53.5 (+0.3)    53.6 (+0.4)    54.6 (+1.4)    60.0 (+6.8)
  G2-LSTM             52.1       52.2 (+0.1)    52.8 (+0.7)    53.3 (+1.2)    56.0 (+3.9)

IWSLT'14 German→English (BLEU, higher is better)

  Model               Original   Round          Round & Clip   SVD            SVD+
  Baseline            31.00      28.65 (-2.35)  21.97 (-9.03)  30.52 (-0.48)  29.56 (-1.44)
  Sharpened Sigmoid   29.73      27.08 (-2.65)  25.14 (-4.59)  29.17 (-0.53)  28.82 (-0.91)
  G2-LSTM             31.95      31.44 (-0.51)  31.44 (-0.51)  31.62 (-0.33)  31.28 (-0.67)

WMT'14 English→German (BLEU, higher is better)

  Model               Original   Round          Round & Clip   SVD            SVD+
  Baseline            21.89      16.22 (-5.67)  16.03 (-5.86)  21.15 (-0.74)  19.99 (-1.90)
  Sharpened Sigmoid   21.64      16.85 (-4.79)  16.72 (-4.92)  20.98 (-0.66)  19.87 (-1.77)
  G2-LSTM             22.43      20.15 (-2.28)  20.29 (-2.14)  22.16 (-0.27)  21.84 (-0.51)

SLIDE 18

Histograms of Gate Distributions in G2-LSTM


[Histograms: G2-LSTM input gates (left) and G2-LSTM forget gates (right); the gate values now concentrate near 0 and 1]

Based on the gate outputs of the first-layer G2-LSTM in the decoder, computed over the same 10,000 sentence pairs from the IWSLT'14 German→English training set

SLIDE 19

Visualization of Average Gate Values


SLIDE 20

Summary

  • A new training algorithm for LSTM, leveraging the recently developed Gumbel-Softmax estimator
  • Pushes the values of the input and forget gates to 0 or 1, leading to robust LSTM models
  • Experiments on language modeling and machine translation demonstrate the effectiveness of the proposed training algorithm


SLIDE 21

Thanks!

Poster #63


Contact: Zhuohan Li (lizhuohan@pku.edu.cn), Tao Qin (taoqin@microsoft.com)


Zhuohan is applying for a Ph.D. in Fall 2018! Please contact him if you are interested!