SLIDE 1

2018 IEEE International Workshop on Machine Learning for Signal Processing (MLSP’18)

Recurrent Neural Networks with Flexible Gates using Kernel Activation Functions

Authors: S. Scardapane, S. Van Vaerenbergh, D. Comminiello, S. Totaro and A. Uncini
SLIDE 2

Contents

  • Introduction: Overview
  • Gated recurrent networks: Formulation
  • Proposed gate with flexible sigmoid: Kernel activation function; KAF generalization for gates
  • Experimental validation: Experimental setup; Results
  • Conclusion and future works: Summary and future outline

SLIDE 3

Content at a glance

Setting: Gated units have become an integral part of deep learning (e.g., LSTMs, highway networks, ...).

State-of-the-art: Only a small number of studies address how to design more flexible gate architectures (e.g., Gao and Glowacka, ACML 2016).

Objective: Design an enhanced gate, with a small number of additional adaptable parameters, that can model a wider range of gating functions.

SLIDE 7

Gated unit: basic model

Definition: (vanilla) gated unit. For a generic input x we have:

    g(x) = σ(Wx) ⊙ f(x),                                         (1)

where σ(·) is the sigmoid function, ⊙ is the element-wise multiplication, and f(x) is a generic network component.

Notable examples:
  ◮ LSTM networks (Hochreiter and Schmidhuber, 1997).
  ◮ Gated recurrent units (Cho et al., 2014).
  ◮ Highway networks (Srivastava et al., 2015).
  ◮ Neural arithmetic logic unit (Trask et al., 2018).
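
To make the notation concrete, here is a minimal NumPy sketch of Eq. (1); the function names and the toy setup are ours, not part of the paper:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def gated_unit(x, W, f):
    """Vanilla gated unit of Eq. (1): g(x) = sigma(W x) * f(x), elementwise."""
    return sigmoid(W @ x) * f(x)

# Tiny usage example with illustrative random values: gate a small tanh layer.
rng = np.random.default_rng(0)
x = rng.standard_normal(4)
W = rng.standard_normal((4, 4))
V = rng.standard_normal((4, 4))
g = gated_unit(x, W, lambda v: np.tanh(V @ v))
```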

SLIDE 8

Gated recurrent unit (GRU)

At each time step t we receive x_t ∈ R^d and update the internal state h_{t-1} as:

    u_t = σ(W_u x_t + V_u h_{t-1} + b_u),                        (2)
    r_t = σ(W_r x_t + V_r h_{t-1} + b_r),                        (3)
    h_t = (1 − u_t) ◦ h_{t-1} + u_t ◦ tanh(W_h x_t + U_h (r_t ◦ h_{t-1}) + b_h),   (4)

where (2)-(3) are the update gate and reset gate.

Cho, K. et al., 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. EMNLP 2014.
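
A compact NumPy sketch of one GRU update following Eqs. (2)-(4); the parameter names `Wu`, `Vu`, `Uh`, etc. are our own labels for the weight matrices above:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def gru_step(x_t, h_prev, p):
    """One GRU time step, Eqs. (2)-(4); p is a dict with weight matrices and biases."""
    u = sigmoid(p["Wu"] @ x_t + p["Vu"] @ h_prev + p["bu"])              # update gate, Eq. (2)
    r = sigmoid(p["Wr"] @ x_t + p["Vr"] @ h_prev + p["br"])              # reset gate, Eq. (3)
    h_cand = np.tanh(p["Wh"] @ x_t + p["Uh"] @ (r * h_prev) + p["bh"])   # candidate state
    return (1.0 - u) * h_prev + u * h_cand                               # new state, Eq. (4)
```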

SLIDE 11

Training the network (classification)

We are given N sequences {x_t^i}_{i=1}^N with labels y_i ∈ {1, ..., C}. h_i is the internal state of the GRU after processing the i-th sequence. This is fed through another layer with a softmax activation function for classification:

    ŷ_i = softmax(A h_i + b),                                    (5)

We then minimize the average cross-entropy between the real classes and the predicted classes:

    J(θ) = − (1/N) Σ_{i=1}^N Σ_{c=1}^C 1[y_i = c] · log ŷ_{i,c},        (6)
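
A rough NumPy sketch of Eqs. (5)-(6); the array shapes and function names are our assumptions, not the authors' code:

```python
import numpy as np

def softmax(z):
    z = z - z.max()            # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def average_cross_entropy(H, y, A, b):
    """Eqs. (5)-(6): H has shape (N, hidden), y holds integer labels in {0, ..., C-1},
    A and b parameterize the output layer."""
    loss = 0.0
    for h_i, y_i in zip(H, y):
        probs = softmax(A @ h_i + b)        # Eq. (5)
        loss -= np.log(probs[y_i] + 1e-12)  # only the true-class term of Eq. (6) is non-zero
    return loss / len(y)
```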

SLIDE 15

Summary of the proposal

Key items of our proposal:

  1. Maintain the linear component, but replace the sigmoid element-wise operation with a generalized sigmoid function.

  2. We extend the kernel activation function (KAF), a recently proposed non-parametric activation function.

  3. We modify the KAF to ensure that it behaves correctly as a gating function.

SLIDE 18

Basic structure of the KAF

A KAF models each activation function in terms of a kernel expansion over D terms as:

    KAF(s) = Σ_{i=1}^D α_i κ(s, d_i),                            (7)

where:
  1. {α_i}_{i=1}^D are the mixing coefficients;
  2. {d_i}_{i=1}^D are the dictionary elements;
  3. κ(·, ·) : R × R → R is a 1D kernel function.

Scardapane, S., Van Vaerenbergh, S., Totaro, S. and Uncini, A., 2017. Kafnets: kernel-based non-parametric activation functions for neural networks. arXiv preprint arXiv:1707.04035.
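
A small NumPy sketch of Eq. (7) with a Gaussian kernel (the kernel choice and the bandwidth `gamma` follow the later slides; the function signature is our assumption):

```python
import numpy as np

def kaf(s, alpha, d, gamma=1.0):
    """Kernel activation function, Eq. (7), with a 1D Gaussian kernel.

    s     : scalar or 1D array of pre-activations
    alpha : mixing coefficients, shape (D,)
    d     : fixed dictionary elements, shape (D,)
    gamma : kernel bandwidth hyperparameter
    """
    s = np.atleast_1d(np.asarray(s, dtype=float))
    K = np.exp(-gamma * (s[:, None] - d[None, :]) ** 2)  # (n, D) kernel evaluations
    return K @ alpha                                     # kernel expansion
```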

SLIDE 19

Extending KAFs for gated units

We cannot use a KAF straightforwardly because it is unbounded and potentially vanishing to zero (e.g., with the Gaussian kernel). We use the following modified formulation for the flexible gate:

    σ_KAF(s) = σ( (1/2) · KAF(s) + (1/2) · s ),                  (8)

As in the original KAF, the dictionary elements are fixed (by uniform sampling around 0), while we adapt everything else.
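
Continuing the sketch above (reusing `np` and the `kaf` helper), Eq. (8) wraps the average of the KAF output and a residual identity term in a standard sigmoid, so the gate output always stays in (0, 1):

```python
def sigma_kaf(s, alpha, d, gamma=1.0):
    """Flexible gate of Eq. (8): sigmoid of the average of KAF(s) and the raw activation s."""
    z = 0.5 * kaf(s, alpha, d, gamma) + 0.5 * np.atleast_1d(np.asarray(s, dtype=float))
    return 1.0 / (1.0 + np.exp(-z))
```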

SLIDE 22

Visualizing the new gates

[Figure: three panels, value of the gate vs. activation, for (a) γ = 1.0, (b) γ = 0.5, (c) γ = 0.1]

Figure 1: Random samples of the proposed flexible gates with Gaussian kernel and different hyperparameters.

SLIDE 23

Initializing the mixing coefficients

To simplify optimization we initialize the mixing coefficients to approximate the identity function:

    α = (K + εI)⁻¹ d,                                            (9)

where ε > 0 is a small constant. We then use a different set of mixing coefficients for each forget gate and update gate.

[Figure: gate output vs. activation for the initialized gate]
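
A minimal NumPy sketch of the initialization in Eq. (9), assuming the Gaussian kernel; `eps` and the function name are our choices for illustration:

```python
import numpy as np

def init_mixing_coefficients(d, gamma=1.0, eps=1e-4):
    """Eq. (9): ridge-regression fit of the KAF to the identity on the dictionary,
    so that KAF(d_i) is approximately d_i at initialization."""
    K = np.exp(-gamma * (d[:, None] - d[None, :]) ** 2)   # Gram matrix over dictionary elements
    return np.linalg.solve(K + eps * np.eye(len(d)), d)   # alpha = (K + eps*I)^{-1} d

# Example: 10 dictionary elements equispaced in [-4, 4], as used later in the experiments.
d = np.linspace(-4.0, 4.0, 10)
alpha0 = init_mixing_coefficients(d, gamma=0.5)
```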

SLIDE 25

Sequential MNIST benchmark

◮ [Row-wise MNIST (R-MNIST)] Each image is processed sequentially, row-by-row, i.e., we have sequences of length 28, each element represented by the values of 28 pixels.

◮ [Pixel-wise MNIST (P-MNIST)] Each image is represented as a sequence of 784 pixels, read from left to right and from top to bottom from the original image.

◮ [Permuted P-MNIST (PP-MNIST)] Similar to P-MNIST, but the order of the pixels is shuffled using a (fixed) permutation matrix.
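
To illustrate the three sequence formats, here is a hypothetical preprocessing helper (our sketch, not taken from the paper's code):

```python
import numpy as np

def mnist_to_sequence(img, mode="rows", perm=None):
    """Turn a 28x28 MNIST image into a sequence of inputs for the GRU.

    mode="rows"   -> 28 steps of 28 pixels each (R-MNIST)
    mode="pixels" -> 784 steps of 1 pixel each  (P-MNIST)
    mode="perm"   -> like "pixels", but reordered by a fixed permutation (PP-MNIST)
    """
    if mode == "rows":
        return img.reshape(28, 28)
    seq = img.reshape(784, 1)
    if mode == "perm":
        seq = seq[perm]   # perm: a fixed permutation of np.arange(784)
    return seq
```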

SLIDE 26

Models and hyperparameters

  1. We compare standard GRUs and GRUs with the proposed flexible gating function.

  2. GRUs have 100 units and we include an additional batch normalization step to stabilize training.

  3. We train with Adam on mini-batches of 32 elements, with an initial learning rate of 0.001, and we clip all gradient updates (in norm) to 1.0 (see the optimization sketch after this list).

  4. For the proposed gate, we use the Gaussian kernel and initialize the dictionary from 10 elements equispaced in [−4.0, 4.0].

  5. We compute the average accuracy of the model on the validation set every 25 iterations, stopping whenever accuracy has not improved for at least 500 iterations.
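
A minimal PyTorch-style sketch of the optimization setup in item 3 (assuming PyTorch; this is our illustration of the stated hyperparameters, not the authors' code):

```python
import torch

def make_training_step(model):
    """Adam with lr=0.001 and gradient norms clipped to 1.0, as stated above."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    def training_step(loss):
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()

    return training_step
```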

SLIDE 31

Accuracy on the test set

Dataset     GRU (standard)    GRU (proposed)
R-MNIST     98.29 ± 0.01      98.67 ± 0.02
P-MNIST     89.50 ± 5.64      97.34 ± 0.61
PP-MNIST    86.41 ± 6.71      96.10 ± 0.93

Table 1: Average test accuracy obtained by a standard GRU compared with a GRU endowed with the proposed flexible gates (with standard deviation).

SLIDE 32

Evolution of the loss and validation accuracy

[Figure: (a) training loss vs. epoch and (b) validation accuracy vs. epoch, for the standard GRU and the proposed GRU]

Figure 2: Convergence results on the P-MNIST dataset for a standard GRU and the proposed GRU.

SLIDE 33

Distribution of the kernel’s bandwidths

[Figure: histogram of the number of cells vs. the value of γ]

Figure 3: Sample histogram of the values for the kernel’s hyperparameters, after training, for the reset gate of the GRU.

SLIDE 34

Ablation study

[Figure: test accuracy (roughly 0.96-0.99) for the variants Normal, Rand, No-Residual, and Rand+No-Residual]

Figure 4: Average results of an ablation study on the R-MNIST dataset. Rand: we initialize the mixing coefficients randomly. No-Residual: we remove the residual connection in (8). With a dashed red line we show the performance of a standard GRU.

SLIDE 36

Summary

◮ We proposed an extension of the standard gating component used in most gated RNNs.

◮ To this end, we extend the kernel activation function in order to make its shape always consistent with a sigmoid-like behavior.

◮ Experiments show that the proposed architecture achieves superior results (in terms of test accuracy), while at the same time converging faster (and more reliably).

◮ More experiments are needed with other gated RNNs, with further applications, and on the interpretability of the resulting functions with respect to the task at hand.

SLIDE 40

Questions?