SLIDE 1 2018 IEEE International Workshop on Machine Learning for Signal Processing (MLSP’18)
Recurrent Neural Networks with Flexible Gates using Kernel Activation Functions
Authors: S. Scardapane, S. Van Vaerenbergh, D. Comminiello, S. Totaro and A. Uncini
SLIDE 2
Contents
Introduction
  - Overview
Gated recurrent networks
  - Formulation
Proposed gate with flexible sigmoid
  - Kernel activation function
  - KAF generalization for gates
Experimental validation
  - Experimental setup
  - Results
Conclusion and future works
  - Summary and future outline
SLIDE 3
Content at a glance
Setting: Gated units have become an integral part of deep learning (e.g., LSTMs, highway networks, ...).
State-of-the-art: Only a small number of studies address how to design more flexible gate architectures (e.g., Gao and Glowacka, ACML 2016).
Objective: Design an enhanced gate, with a small number of additional adaptable parameters, to model a wider range of gating functions.
SLIDE 7
Gated unit: basic model
Definition: (vanilla) gated unit. For a generic input x we have:

g(x) = σ(Wx) ⊙ f(x),  (1)

where σ(·) is the sigmoid function, ⊙ is the element-wise multiplication, and f(x) is a generic network component.

Notable examples:
◮ LSTM networks (Hochreiter and Schmidhuber, 1997).
◮ Gated recurrent units (Cho et al., 2014).
◮ Highway networks (Srivastava et al., 2015).
◮ Neural arithmetic logic units (Trask et al., 2018).
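As a concrete illustration, Eq. (1) can be sketched in a few lines of plain Python (a minimal sketch with hand-rolled linear algebra; the names `sigmoid` and `gated_unit` are ours, not from the paper):

```python
import math

def sigmoid(s):
    # Logistic sigmoid, squashing any real value into (0, 1).
    return 1.0 / (1.0 + math.exp(-s))

def gated_unit(W, x, f_x):
    # g(x) = sigma(W x) ⊙ f(x): each component of f(x) is scaled
    # by a gate value in (0, 1) computed from the linear map W x.
    z = [sum(w * xj for w, xj in zip(row, x)) for row in W]
    return [sigmoid(zi) * fi for zi, fi in zip(z, f_x)]

# With W = 0 every gate equals sigmoid(0) = 0.5, so f(x) is halved.
out = gated_unit([[0.0, 0.0], [0.0, 0.0]], [1.0, 2.0], [2.0, 4.0])
```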
SLIDE 8 Gated recurrent unit (GRU)
At each time step t we receive xt ∈ R^d and update the internal state ht−1 as:

ut = σ(Wu xt + Vu ht−1 + bu),  (2)
rt = σ(Wr xt + Vr ht−1 + br),  (3)
ht = (1 − ut) ⊙ ht−1 + ut ⊙ tanh(Wh xt + Uh (rt ⊙ ht−1) + bh),  (4)

where (2)-(3) are the update gate and reset gate, respectively.
Cho, K. et al., 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. EMNLP 2014.
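A single GRU step, Eqs. (2)-(4), can be sketched in plain Python (a hedged sketch: the parameter names mirror the slide, the helper functions are ours, and each weight matrix is passed as a list of rows):

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def matvec(M, v):
    # Matrix-vector product for a matrix stored as a list of rows.
    return [sum(m * vj for m, vj in zip(row, v)) for row in M]

def add(*vs):
    # Element-wise sum of several vectors.
    return [sum(t) for t in zip(*vs)]

def had(a, b):
    # Element-wise (Hadamard) product.
    return [ai * bi for ai, bi in zip(a, b)]

def gru_step(x, h, Wu, Vu, bu, Wr, Vr, br, Wh, Uh, bh):
    # Eq. (2): update gate.
    u = [sigmoid(s) for s in add(matvec(Wu, x), matvec(Vu, h), bu)]
    # Eq. (3): reset gate.
    r = [sigmoid(s) for s in add(matvec(Wr, x), matvec(Vr, h), br)]
    # Eq. (4): candidate state gated by r, then a convex combination
    # of the old state and the candidate, weighted by u.
    cand = [math.tanh(s)
            for s in add(matvec(Wh, x), matvec(Uh, had(r, h)), bh)]
    return [(1 - ui) * hi + ui * ci for ui, hi, ci in zip(u, h, cand)]
```

With all weights at zero, both gates sit at σ(0) = 0.5 and the candidate is tanh(0) = 0, so the new state is simply half the old one.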
SLIDE 11 Training the network (classification)
We are given N sequences {xi(t)}_{i=1}^{N} with labels yi = 1, . . . , C; hi is the internal state of the GRU after processing the i-th sequence. This is fed through another layer with a softmax activation function for classification:

ŷi = softmax(Wo hi + bo),  (5)

where Wo and bo are the output layer parameters. We then minimize the average cross-entropy between the real classes and the predicted classes:

J(θ) = −(1/N) Σ_{i=1}^{N} Σ_{j=1}^{C} yij log(ŷij),  (6)

where yij = 1 if yi = j and 0 otherwise.
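Eqs. (5)-(6) amount to a softmax readout followed by an average cross-entropy loss; a minimal plain-Python sketch (the function names are ours):

```python
import math

def softmax(z):
    # Subtract the max before exponentiating, for numerical stability.
    m = max(z)
    e = [math.exp(zi - m) for zi in z]
    s = sum(e)
    return [ei / s for ei in e]

def cross_entropy(logits, labels):
    # Average of -log p(correct class); labels are integer class indices.
    n = len(logits)
    return -sum(math.log(softmax(z)[y])
                for z, y in zip(logits, labels)) / n
```

For uniform logits over C classes the loss is exactly log C, a handy sanity check for an untrained classifier.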
SLIDE 15 Summary of the proposal
Key items of our proposal:
1. Maintain the linear component, but replace the sigmoid element-wise operation with a generalized sigmoid function.
2. We extend the kernel activation function (KAF), a recently proposed non-parametric activation function.
3. We modify the KAF to ensure that it behaves correctly as a gating function.
SLIDE 18 Basic structure of the KAF
A KAF models each activation function in terms of a kernel expansion over D terms as:

KAF(s) = Σ_{i=1}^{D} αi κ(s, di),  (7)

where:
1. {αi}_{i=1}^{D} are the mixing coefficients;
2. {di}_{i=1}^{D} are the dictionary elements;
3. κ(·, ·) : R × R → R is a 1D kernel function.
Scardapane, S., Van Vaerenbergh, S., Totaro, S. and Uncini, A., 2017. Kafnets: kernel-based non-parametric activation functions for neural networks. arXiv preprint arXiv:1707.04035.
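Eq. (7) with a Gaussian kernel κ(s, d) = exp(−γ (s − d)²) can be sketched in plain Python (the function names are ours):

```python
import math

def gaussian_kernel(s, d, gamma=1.0):
    # 1D Gaussian kernel centred on the dictionary element d.
    return math.exp(-gamma * (s - d) ** 2)

def kaf(s, alpha, dictionary, gamma=1.0):
    # Kernel expansion over D terms: KAF(s) = sum_i alpha_i * kappa(s, d_i).
    return sum(a * gaussian_kernel(s, d, gamma)
               for a, d in zip(alpha, dictionary))
```

The shape of the resulting function is controlled entirely by the mixing coefficients, while the dictionary and the kernel bandwidth γ set where and how locally each coefficient acts.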
SLIDE 19 Extending KAFs for gated units
We cannot use a KAF straightforwardly because it is unbounded and potentially vanishing to zero (e.g., with the Gaussian kernel). We use the following modified formulation for the flexible gate:

σKAF(s) = σ( (1/2) KAF(s) + (1/2) s ),  (8)

As in the original KAF, the dictionary elements are fixed (by uniform sampling around 0), while we adapt everything else.
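The modified gate in Eq. (8) wraps the sigmoid around a residual mix of the KAF and its raw input; a self-contained plain-Python sketch (names are ours):

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def kaf(s, alpha, dictionary, gamma=1.0):
    # Gaussian-kernel expansion, Eq. (7).
    return sum(a * math.exp(-gamma * (s - d) ** 2)
               for a, d in zip(alpha, dictionary))

def sigma_kaf(s, alpha, dictionary, gamma=1.0):
    # Eq. (8): the residual term s/2 keeps the gate sigmoid-like
    # even in regions where the Gaussian expansion vanishes.
    return sigmoid(0.5 * kaf(s, alpha, dictionary, gamma) + 0.5 * s)
```

With all mixing coefficients at zero the gate degenerates to σ(s/2), so its output stays bounded in (0, 1), unlike a raw Gaussian KAF, which would push the gate towards σ(0) far from the dictionary.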
SLIDE 22 Visualizing the new gates
[Figure: value of the gate vs. activation over [−5, 5], three panels: (a) γ = 1.0, (b) γ = 0.5, (c) γ = 0.1.]
Figure 1: Random samples of the proposed flexible gates with Gaussian kernel and different hyperparameters.
SLIDE 23 Initializing the mixing coefficients
To simplify optimization we initialize the mixing coefficients to approximate the identity function:

α = (K + εI)⁻¹ d,  (9)

where K is the kernel matrix computed on the dictionary elements, d is the vector of dictionary elements, and ε > 0 is a small constant. We then use a different set of mixing coefficients for each reset gate and update gate.

[Figure: gate output vs. activation over [−2, 2] after initialization.]
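The initialization in Eq. (9) is a small ridge-regularized linear solve; a plain-Python sketch using Gaussian elimination (the solver and names are ours; in practice one would use a linear-algebra library):

```python
import math

def solve(A, b):
    # Gaussian elimination with partial pivoting for a small dense system.
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c]
                              for c in range(r + 1, n))) / M[r][r]
    return x

def init_alpha(dictionary, gamma=1.0, eps=1e-6):
    # alpha = (K + eps I)^(-1) d, with K_ij = kappa(d_i, d_j),
    # so that KAF(d_i) ~ d_i: the gate starts near the identity map.
    K = [[math.exp(-gamma * (di - dj) ** 2) + (eps if i == j else 0.0)
          for j, dj in enumerate(dictionary)]
         for i, di in enumerate(dictionary)]
    return solve(K, list(dictionary))
```

After this initialization the kernel expansion reproduces each dictionary element almost exactly, which is what makes the initial gate behave like the identity in the plotted range.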
SLIDE 24
Contents
Introduction Overview Gated recurrent networks Formulation Proposed gate with flexible sigmoid Kernel activation function KAF generalization for gates Experimental validation Experimental setup Results Conclusion and future works Summary and future outline
SLIDE 25 Sequential MNIST benchmark
◮ [Row-wise MNIST (R-MNIST)] Each image is processed sequentially, row-by-row, i.e., we have sequences of length 28, each element represented by the values of 28 pixels.
◮ [Pixel-wise MNIST (P-MNIST)] Each image is represented as a sequence of 784 pixels, read from left to right and from top to bottom of the original image.
◮ [Permuted P-MNIST (PP-MNIST)] Similar to P-MNIST, but the order of the pixels is shuffled using a (fixed) permutation matrix.
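The three variants can be sketched as simple reshapings of a 28×28 image (a plain-Python sketch; function names are ours):

```python
import random

def row_wise(img):
    # R-MNIST: 28 time steps, each a 28-dimensional row of pixels.
    return [list(row) for row in img]

def pixel_wise(img):
    # P-MNIST: 784 time steps, each a single pixel value,
    # scanned left-to-right, top-to-bottom.
    return [[p] for row in img for p in row]

def permuted_pixel_wise(img, perm):
    # PP-MNIST: the same pixels, reordered by a fixed permutation.
    seq = pixel_wise(img)
    return [seq[i] for i in perm]

img = [[0.0] * 28 for _ in range(28)]
perm = list(range(784))
random.Random(0).shuffle(perm)  # fixed seed -> the same permutation for every image
```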
SLIDE 26 Models and hyperparameters
1. We compare standard GRUs and GRUs with the proposed flexible gating function.
2. GRUs have 100 units and we include an additional batch normalization step to stabilize training.
3. We train with Adam on mini-batches of 32 elements, with an initial learning rate of 0.001, and we clip all gradient updates (in norm) to 1.0.
4. For the proposed gate, we use the Gaussian kernel and initialize the dictionary from 10 elements equispaced in [−4.0, 4.0].
5. We compute the average accuracy of the model every 25 iterations on the validation set, stopping whenever accuracy has not improved for at least 500 iterations.
SLIDE 31
Accuracy on the test set
Dataset     GRU (standard)    GRU (proposed)
R-MNIST     98.29 ± 0.01      98.67 ± 0.02
P-MNIST     89.50 ± 5.64      97.34 ± 0.61
PP-MNIST    86.41 ± 6.71      96.10 ± 0.93
Table 1: Average test accuracy obtained by a standard GRU compared with a GRU endowed with the proposed flexible gates (with standard deviation).
SLIDE 32 Evolution of the loss and validation accuracy
[Figure: (a) training loss vs. epoch for the standard GRU and the proposed GRU; (b) validation accuracy vs. epoch for the same two models.]
Figure 2: Convergence results on the P-MNIST dataset for a standard GRU and the proposed GRU.
SLIDE 33 Distribution of the kernel’s bandwidths
[Histogram: number of cells vs. value of γ, roughly in the range 0.05-0.30.]
Figure 3: Sample histogram of the values for the kernel’s hyperparameters, after training, for the reset gate of the GRU.
SLIDE 34 Ablation study
[Figure: test accuracy (range ≈ 0.96-0.99) for the configurations Normal, Rand, No-Residual, and Rand+No-Residual.]
Figure 4: Average results of an ablation study on the R-MNIST dataset. Rand: we initialize the mixing coefficients randomly. No-Residual: we remove the residual connection in (8). With a dashed red line we show the performance of a standard GRU.
SLIDE 36
Summary
◮ We proposed an extension of the standard gating component used in most gated RNNs.
◮ To this end, we extend the kernel activation function in order to make its shape always consistent with a sigmoid-like behavior.
◮ Experiments show that the proposed architecture achieves superior results (in terms of test accuracy), while at the same time converging faster (and more reliably).
◮ Need more experiments with other gated RNNs, applications, and interpretability of the resulting functions with respect to the task at hand.
SLIDE 40
Questions?