SLIDE 1

Investigating Relational Recurrent Neural Networks with Variable Length Memory Pointer

Mahtab Ahmed and Robert E. Mercer
Department of Computer Science, University of Western Ontario, London, ON, Canada

SLIDE 2

Introduction

  • Memory-based Neural Networks can remember information longer while modelling temporal data.

  • Encode a Relational Memory Core (RMC) as the cell state inside an LSTM cell.

  • Uses standard Multi-head Self Attention.
  • Uses a variable length memory pointer.
  • Evaluate on four different tasks.
  • State of the art on one of them; on par with the other three.

SLIDE 3

Standard LSTM

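The slide presents the standard LSTM equations as an image. For reference, the usual formulation, with input x_t, previous hidden state h_{t-1}, and previous cell state c_{t-1}, is:

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{forget gate}\\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{input gate}\\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{output gate}\\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{candidate cell state}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{cell state}\\
h_t &= o_t \odot \tanh(c_t) && \text{hidden state}
\end{aligned}
```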
SLIDE 4

The model: Fixed Length Memory Pointer

  • Apply Multi-head Self Attention and create a weighted version, N (see the sketch below).
  • Add a residual connection.
  • Apply a Layer-Normalization block on top of N.
  • Maintain separate versions of the mean and variance projection matrices.
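A minimal PyTorch sketch of this attention step, assuming an RMC-style update in which the memory attends over itself concatenated with the current input (class and variable names are illustrative, not the authors' code):

```python
import torch
import torch.nn as nn

class MemoryAttention(nn.Module):
    """Weighted memory N via multi-head self-attention, residual connection,
    and layer normalization, as on this slide."""
    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, memory: torch.Tensor, x_t: torch.Tensor) -> torch.Tensor:
        # memory: (b, m, d); x_t: (b, 1, d)
        kv = torch.cat([memory, x_t], dim=1)     # attend over memory + input
        attended, _ = self.attn(memory, kv, kv)  # weighted version N
        return self.norm(memory + attended)      # residual + layer norm
```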


[Figure: the fixed-length memory update applied to a random input at time t.]

SLIDE 5

The model: Fixed Length Memory Pointer (contd.)

  • n non-linear projections of h_t are applied, followed by a residual connection (see the sketch below).
  • The resultant tensor Y (having shape 2 × b × d) is split on the cardinal dimension to extract the memory.
  • The LSTM's candidate cell state is changed accordingly (equation shown on the slide).
  • y_t is replaced with the projected input (= X y_t) in all LSTM equations.


Note: f = ReLU and h_t = N.
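A sketch of this projection step under the shapes stated above (n = 2 projections, f = ReLU; names are hypothetical):

```python
import torch
import torch.nn as nn

class MemoryProjection(nn.Module):
    """Sketch: two non-linear projections of h_t with a residual connection;
    the result Y (shape 2 x b x d) is split on the cardinal dimension to
    extract the memory."""
    def __init__(self, d_model: int):
        super().__init__()
        # one linear map per projection; f = ReLU as noted on the slide
        self.proj = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(2))
        self.f = nn.ReLU()

    def forward(self, h_t: torch.Tensor) -> torch.Tensor:
        # h_t: (b, d); stack the projected copies -> Y of shape (2, b, d)
        y = torch.stack([self.f(p(h_t)) for p in self.proj], dim=0)
        y = y + h_t.unsqueeze(0)              # residual connection
        memory, _ = torch.split(y, 1, dim=0)  # split on the cardinal dimension
        return memory.squeeze(0)              # extracted memory, shape (b, d)
```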

SLIDE 6

Variable Length Memory Pointer

  • Share W across all time steps.
  • Apply all the steps as before.
  • For Layer-Normalization, maintain just one version of the mean and variance projection matrices.
  • Memory is still at the cardinal dimension.
  • Rather than looking at everything before, track a fixed window of words (n-grams), mimicking the behavior of a convolution kernel (see the sketch below).
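A sketch of the windowed pointer, assuming the window is realized by slicing the last k inputs before attending (the function name and slicing are illustrative):

```python
import torch

def windowed_memory(inputs: torch.Tensor, t: int, k: int) -> torch.Tensor:
    """Variable-length memory pointer sketch: instead of attending to
    everything before step t, keep only a fixed window of the last k inputs,
    like a convolution kernel sliding over the sequence.

    inputs: (seq_len, b, d); returns (window, b, d) with window <= k.
    """
    start = max(0, t - k)
    return inputs[start:t]  # the n-gram window ending at t
```

Setting k to the sequence length recovers the earlier behaviour of looking at everything before t.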

SLIDE 7

Model Architecture


[Figure: the unrolled model architecture. Each time step applies the same block: Multi-Head Attention with Layer Normalization, the LSTM equations, Non-Linear Projections with Layer Normalization, and a final Linear Projection; the diagram repeats this block across three unrolled time steps.]
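A hedged sketch of how one unrolled step of the diagram might compose, reusing the MemoryAttention and MemoryProjection sketches above (the exact wiring, in particular feeding the attended memory in as the cell state, is an assumption based on the slides, not the authors' code):

```python
import torch
import torch.nn as nn

class RelationalLSTMCellSketch(nn.Module):
    """One unrolled step of the diagram: attention + layer norm over the
    memory, the LSTM equations, the projection block producing the next
    memory, and a final linear projection of the output."""
    def __init__(self, d_model: int):
        super().__init__()
        self.mem_attn = MemoryAttention(d_model)   # sketch from Slide 4
        self.cell = nn.LSTMCell(d_model, d_model)
        self.mem_proj = MemoryProjection(d_model)  # sketch from Slide 5
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x_t, h_c, memory):
        # x_t: (b, d); h_c: pair of (b, d) tensors; memory: (b, 1, d)
        n = self.mem_attn(memory, x_t.unsqueeze(1)).squeeze(1)
        # feeding N in as the cell state is an assumption from the slides
        h, c = self.cell(x_t, (h_c[0], n))
        new_memory = self.mem_proj(h).unsqueeze(1)  # next memory from h_t
        return self.out(h), (h, c), new_memory
```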

SLIDE 8

Sentence Pair Modelling


[Figure: sentence pair modelling architecture. Word representations of the left and right sentences feed two encoders; the resulting sentence representations are combined (⊕) and passed to a classifier that predicts the classes. Encoder setup follows InferSent: https://arxiv.org/abs/1705.02364]
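The ⊕ in the diagram, following InferSent, is typically the concatenation of the two sentence vectors with their absolute difference and element-wise product; a minimal sketch of that combination and the classifier head (names hypothetical, encoder details omitted):

```python
import torch
import torch.nn as nn

def combine(u: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """InferSent-style combination of two sentence representations:
    concatenation, absolute difference, and element-wise product."""
    return torch.cat([u, v, torch.abs(u - v), u * v], dim=-1)

class PairClassifier(nn.Module):
    """Sketch of the classifier head over the combined pair features."""
    def __init__(self, d_sent: int, n_classes: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(4 * d_sent, d_sent), nn.ReLU(),
            nn.Linear(d_sent, n_classes),
        )

    def forward(self, u: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        return self.mlp(combine(u, v))
```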

SLIDE 9

Hyperparameters


  • We tried a range of values for each hyperparameter. The ones that worked for us are bold-faced.
SLIDE 10

Experimental Results

  • Models marked with † are the ones that we implemented.

SLIDE 11

Attention Visualization


[Figure: attention weight visualization for a sentence pair. The extracted text is heavily garbled, but the pair appears to read: "He last worked in the Virginia state attorney general's office. Before that he held positions in Virginia, including deputy state attorney general."]

SLIDE 12

Conclusion

  • Extend the classical RMC with a variable length memory pointer.
  • Use a non-local context to compute an enhanced memory.
  • Design a sentence pair modelling architecture.
  • Evaluate on four different tasks.
  • On-par performance on most of the tasks and the best performance on one of them.
  • The model interprets the attention shifting very well.
  • The memory pointer length does not follow a uniform pattern across all datasets.

SLIDE 13

Thank you
