Gated Feedback Recurrent Neural Networks (arXiv1502) & ReNet (arXiv1505)

SLIDE 1

Outline

Gated Feedback Recurrent Neural Networks. arXiv1502.

  • Introduction: RNN & Gated RNN
  • Gated Feedback Recurrent Neural Networks (GF-RNN)
  • Experiments: Character-level Language Modeling & Python Program Evaluation

ReNet: A Recurrent Neural Network Based Alternative to Convolutional Networks. arXiv1505.

  • Introduction
  • ReNet: 4 RNNs that sweep over lower-layer features in 4 directions
  • Experiments: MNIST & CIFAR-10 & Street View House Numbers

LU Yangyang, luyy11@pku.edu.cn
May 2015 @ KERE Seminar

SLIDE 2

Authors

  • Gated Feedback Recurrent Neural Networks
    • arXiv.org. 9 Feb 2015 - 18 Feb 2015.
    • Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, Yoshua Bengio (University of Montreal)
  • ReNet: A Recurrent Neural Network Based Alternative to Convolutional Networks
    • arXiv.org. 3 May 2015.
    • Francesco Visin, Kyle Kastner, Kyunghyun Cho, Matteo Matteucci, Aaron Courville, Yoshua Bengio (University of Montreal)

SLIDE 3

Outline

Gated Feedback Recurrent Neural Networks. arXiv1502.

  • Introduction: RNN & Gated RNN
  • Gated Feedback Recurrent Neural Networks (GF-RNN)
  • Experiments: Character-level Language Modeling & Python Program Evaluation

ReNet: A Recurrent Neural Network Based Alternative to Convolutional Networks. arXiv1505.
SLIDE 4

Recurrent Neural Networks (RNN)

For Sequence Modeling

  • Can process a sequence of arbitrary length
  • Recursively applies a transition function to its internal hidden state for each symbol of the input sequence
  • Can theoretically capture any long-term dependency in an input sequence
  • In practice, it is difficult to train an RNN to actually do so

h_t = f(x_t, h_{t-1}) = \phi(W x_t + U h_{t-1})

p(x_1, x_2, \ldots, x_T) = p(x_1)\, p(x_2 \mid x_1) \cdots p(x_T \mid x_1, \ldots, x_{T-1})

p(x_{t+1} \mid x_1, \ldots, x_t) = g(h_t)

Figure: A single-layer RNN
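
To make the transition concrete, here is a minimal numpy sketch of the single-layer RNN above, assuming \phi = tanh and a softmax read-out for g(.); the sizes and the weight names W, U, V are illustrative choices, not taken from the paper.

import numpy as np

rng = np.random.default_rng(0)
vocab, hidden = 5, 8
W = rng.normal(size=(hidden, vocab))   # input-to-hidden weights
U = rng.normal(size=(hidden, hidden))  # hidden-to-hidden (recurrent) weights
V = rng.normal(size=(vocab, hidden))   # hidden-to-output weights for g(.)

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def step(x_t, h_prev):
    """One transition: consume one one-hot symbol, return (h_t, p(x_{t+1} | h_t))."""
    h_t = np.tanh(W @ x_t + U @ h_prev)   # phi = tanh
    return h_t, softmax(V @ h_t)          # g = softmax read-out

h = np.zeros(hidden)
for symbol in [0, 3, 1]:                  # a toy input sequence
    x = np.eye(vocab)[symbol]
    h, p_next = step(x, h)
print(p_next)                              # distribution over the next symbol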

SLIDE 5

Gated Recurrent Neural Networks [1]

LSTM & GRU

An LSTM unit:

  h_t^j = o_t^j \tanh(c_t^j)
  c_t^j = f_t^j c_{t-1}^j + i_t^j \tilde{c}_t^j
  \tilde{c}_t^j = \tanh(W_c x_t + U_c h_{t-1})^j
  f_t^j = \sigma(W_f x_t + U_f h_{t-1} + V_f c_t)^j
  i_t^j = \sigma(W_i x_t + U_i h_{t-1} + V_i c_t)^j
  o_t^j = \sigma(W_o x_t + U_o h_{t-1} + V_o c_t)^j

A GRU unit:

  h_t^j = (1 - z_t^j) h_{t-1}^j + z_t^j \tilde{h}_t^j
  z_t^j = \sigma(W_z x_t + U_z h_{t-1})^j
  \tilde{h}_t^j = \tanh(W x_t + U(r_t \odot h_{t-1}))^j
  r_t^j = \sigma(W_r x_t + U_r h_{t-1})^j

[1] Chung, J., et al. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv, 2014.
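
A minimal numpy sketch of the GRU equations above (a single layer, so the per-unit index j is just a vector component); the weight shapes and names are illustrative, not taken from the paper's code.

import numpy as np

rng = np.random.default_rng(0)
n_in, n_h = 4, 6
W,  U  = rng.normal(size=(n_h, n_in)), rng.normal(size=(n_h, n_h))
Wz, Uz = rng.normal(size=(n_h, n_in)), rng.normal(size=(n_h, n_h))
Wr, Ur = rng.normal(size=(n_h, n_in)), rng.normal(size=(n_h, n_h))

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev):
    z = sigmoid(Wz @ x_t + Uz @ h_prev)             # update gate z_t
    r = sigmoid(Wr @ x_t + Ur @ h_prev)             # reset gate r_t
    h_tilde = np.tanh(W @ x_t + U @ (r * h_prev))   # candidate state
    return (1.0 - z) * h_prev + z * h_tilde         # new hidden state h_t

h = np.zeros(n_h)
for x in rng.normal(size=(3, n_in)):   # a toy input sequence of length 3
    h = gru_step(x, h)
print(h.shape)                          # (6,)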

SLIDE 6

Gated Recurrent Neural Networks

Modifying the RNN architecture

  • Using a gated activation function:
    • the long short-term memory unit (LSTM): a memory cell, an input gate, a forget gate, and an output gate
    • the gated recurrent unit (GRU): a reset gate and an update gate
    • can contain both fast-changing and slow-changing components
  • Stacking multiple levels of recurrent layers
    • partitioned and grouped hidden units allow feedback information at multiple timescales
  • Achieved promising results in both classification and generation tasks

⇒ Gated-feedback RNN (GF-RNN): learning multiple adaptive timescales

SLIDE 7

GF-RNN: Overview

Figure: A Clockwork RNN

  • A sequence often consists of both slow-moving and fast-moving components:
    • slow-moving: long-term dependencies
    • fast-moving: short-term dependencies
  • El Hihi & Bengio (1995): an RNN can capture dependencies of different timescales more easily and efficiently when its hidden units are explicitly partitioned into groups that correspond to different timescales.
  • The clockwork RNN (CW-RNN) (Koutnik et al., 2014): update the i-th module only when t \bmod 2^{i-1} = 0
  ⇒ GF-RNN generalizes the CW-RNN by allowing the model to adaptively adjust the connectivity pattern between the hidden layers in consecutive time-steps.
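
A tiny sketch of the clockwork update schedule mentioned above, i.e. updating the i-th module only when t mod 2^(i-1) = 0; the function name and module indexing are illustrative, not the CW-RNN paper's exact formulation.

def modules_to_update(t, num_modules):
    # Module i ticks only at time-steps divisible by 2**(i-1).
    return [i for i in range(1, num_modules + 1) if t % (2 ** (i - 1)) == 0]

for t in range(1, 9):
    print(t, modules_to_update(t, num_modules=4))
# t=1 -> [1], t=2 -> [1, 2], t=4 -> [1, 2, 3], t=8 -> [1, 2, 3, 4]:
# low-numbered modules update every step (fast), high-numbered ones rarely (slow).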

SLIDE 8

GF-RNN: Overview (cont.)

  • Partition the hidden units into multiple modules: each module corresponds to a different layer in a stack of recurrent layers.
  • Compared to the CW-RNN: no explicit rate is set for each module; the modules are hierarchically stacked and thus operate at different timescales.
  • Each module is fully connected to all the other modules across the stack and to itself.
  • The global reset gate: gates the recurrent connection between two modules based on the current input and the previous states of the hidden layers.

SLIDE 9

GF-RNN: The global reset gate

  • h_t^i: the hidden state of the i-th layer at time-step t
  • w_g^{i→j}, u_g^{i→j}: weights for the input and for the hidden states of all layers at time-step t−1
  • g^{i→j}: controls the signal from the i-th layer h_{t−1}^i to the j-th layer h_t^j, based on the input and the previous hidden states

SLIDE 10

GF-RNN: The global reset gate

Information flows:

  • stacked RNN & GF-RNN: lower layers → upper layers
  • GF-RNN only: lower layers ← upper layers (finer timescale ← coarser timescale)

A gated-feedback RNN: a fully-connected recurrent transition and global reset gates.
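
To make the gate concrete, here is a reconstruction of how the global reset gate and the gated tanh-unit update are commonly written for the GF-RNN, in the notation above (not copied verbatim from the paper): h_t^{j−1} is the state of the layer below at time t (with h_t^0 = x_t), h*_{t−1} is the concatenation of all layers' hidden states at time t−1, and L is the number of layers.

g^{i \to j} = \sigma\left( w_g^{i \to j} h_t^{j-1} + u_g^{i \to j} h^{*}_{t-1} \right)

h_t^j = \tanh\left( W^{j-1 \to j} h_t^{j-1} + \sum_{i=1}^{L} g^{i \to j}\, U^{i \to j} h_{t-1}^{i} \right)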

SLIDE 11

GF-RNN: Different Units of Practical Implementation

  • tanh units
  • LSTM units & GRU units: only use the global reset gates when computing the new state

LSTM:

  h_t^j = o_t^j \tanh(c_t^j)
  c_t^j = f_t^j c_{t-1}^j + i_t^j \tilde{c}_t^j
  f_t^j = \sigma(W_f x_t + U_f h_{t-1} + V_f c_t)^j
  i_t^j = \sigma(W_i x_t + U_i h_{t-1} + V_i c_t)^j
  o_t^j = \sigma(W_o x_t + U_o h_{t-1} + V_o c_t)^j

GRU:

  h_t^j = (1 - z_t^j) h_{t-1}^j + z_t^j \tilde{h}_t^j
  z_t^j = \sigma(W_z x_t + U_z h_{t-1})^j
  r_t^j = \sigma(W_r x_t + U_r h_{t-1})^j
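
In line with the note that the global reset gates enter only the new-state computation, a reconstruction of the gated candidate states in the same notation (again not copied verbatim from the paper; W_c^{j−1→j} and U_c^{i→j} denote layer-wise input and feedback weight matrices, L the number of layers):

\tilde{c}_t^j = \tanh\left( W_c^{j-1 \to j} h_t^{j-1} + \sum_{i=1}^{L} g^{i \to j}\, U_c^{i \to j} h_{t-1}^{i} \right) \quad \text{(LSTM)}

\tilde{h}_t^j = \tanh\left( W^{j-1 \to j} h_t^{j-1} + r_t^j \odot \sum_{i=1}^{L} g^{i \to j}\, U^{i \to j} h_{t-1}^{i} \right) \quad \text{(GRU)}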

SLIDE 12

Experiment Tasks

BOTH: representative examples of discrete sequence modeling.

Objective function: minimize the negative log-likelihood of the training sequences.

  • Character-level language modeling:
    • English Wikipedia: 100MB of characters
    • Contents: Latin alphabets, non-Latin alphabets, XML markup and special characters
    • Vocabulary: 205 characters (plus one token for unknown characters)
    • Train/CV/Test: 90MB/5MB/5MB
    • Performance measure: the average number of bits-per-character (BPC, E[−log₂ P(x_{t+1} | h_t)]); a small sketch follows this list
  • Python program evaluation:
    • Goal: to generate or predict the correct return value of a given Python script
    • Input: Python scripts (including addition, multiplication, subtraction, for-loops, variable assignment, logical comparison and if-else statements)
    • Output: the predicted return value of the given Python script
    • Input/Output: 41/31 symbols
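
A minimal sketch of the BPC measure from the list above, assuming the trained model's probability for the actual next character at each time-step is available; the function and variable names are illustrative.

import numpy as np

def bits_per_character(probs_of_targets):
    """Average number of bits-per-character: E[-log2 P(x_{t+1} | h_t)].

    probs_of_targets: the model's probabilities assigned to the actual next
    character at each time-step of the test sequence.
    """
    probs = np.asarray(probs_of_targets, dtype=np.float64)
    return float(np.mean(-np.log2(probs)))

# Toy usage: a model that assigns probability 0.5 to every correct character
# scores exactly 1 bit per character.
print(bits_per_character([0.5, 0.5, 0.5]))  # -> 1.0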
SLIDE 13

Examples for Python Program Evaluation [2]

[2] Zaremba, Wojciech and Sutskever, Ilya. Learning to Execute. arXiv preprint arXiv:1410.4615, 2014.

SLIDE 14

Experiments: Character-level Language Modeling

  • The sizes of the models:
  • Optimization: RMSProp and momentum
  • Test-set BPC of models trained on the Hutter dataset for 100 epochs:
SLIDE 15

Experiments: Character-level Language Modeling (cont.)

Text Generation based on character-level language modeling:

  • Given the seed in the left-most column (bold-faced font), the models predict the next 200-300 characters.
  • Tabs, spaces and new-line characters are also generated by the models.
SLIDE 16

Experiments: Python Program Evaluation

Using an RNN encoder-decoder approach:

  • Python scripts → ENCODER (50 time-steps) → h_t → DECODER → character-level results

SLIDE 17

Outline

Gated Feedback Recurrent Neural Networks. arXiv1502.

ReNet: A Recurrent Neural Network Based Alternative to Convolutional Networks. arXiv1505.

  • Introduction
  • ReNet: 4 RNNs that sweep over lower-layer features in 4 directions
  • Experiments: MNIST & CIFAR-10 & Street View House Numbers

SLIDE 18

Introduction

Object Recognition:

  • Convolutional Neural Networks (CNN): LeNet-5
    • based on a local context window
  • Recurrent Neural Networks:
    • Graves and Schmidhuber (2009): a multi-dimensional RNN
    • ReNet: purely uni-dimensional RNNs that replace each convolutional layer (conv. + pooling) in the CNN with 4 RNNs sweeping over lower-layer features in 4 directions: ↑, ↓, ←, →
    • each feature activation is computed at a specific location with respect to the whole image

SLIDE 19

A one-layer ReNet

  • The input image: x ∈ R^{w×h×c} (width, height, feature dimensionality)
  • Given a patch size w_p × h_p, split the input image x into a set of I × J (non-overlapping) patches X = {x_{ij}}, x_{ij} ∈ R^{w_p×h_p×c}
  • 1. Sweep the image vertically with 2 RNNs (↑, ↓): each RNN takes as input one (flattened) patch at a time and updates its hidden state, working along each column j of the split input image X.
  • 2. Concatenate the intermediate hidden states z^F_{i,j}, z^R_{i,j} at each location (i, j) to get a composite feature map V = {z_{i,j}}_{i=1,...,I; j=1,...,J}, z_{i,j} ∈ R^{1×h_p×2d} (d: the number of recurrent units)
  • 3. Sweep V horizontally with 2 RNNs (←, →) in a similar manner. The resulting feature map H = {z′_{i,j}}, z′_{i,j} ∈ R^{1×1×2d}: the features of the original image patch x_{i,j} in the context of the whole image.

The deep ReNet: stack multiple φ's (φ: the function from X to H)
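
A minimal numpy sketch of one ReNet layer following the three steps above, using plain tanh RNNs in place of the LSTM/GRU units the paper uses; all names, sizes, random weights and the zero initial states are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

def rnn_sweep(seq, Wx, Wh):
    """Run a simple tanh RNN over a sequence of vectors; return all hidden states."""
    h = np.zeros(Wh.shape[0])
    states = []
    for x in seq:
        h = np.tanh(Wx @ x + Wh @ h)
        states.append(h)
    return np.stack(states)

def renet_layer(x, wp, hp, d):
    """x: (w, h, c) image; wp, hp: patch size; d: recurrent units per direction."""
    w, h, c = x.shape
    I, J = w // wp, h // hp
    # Split into non-overlapping, flattened patches: X[i, j] in R^{wp*hp*c}.
    X = np.array([[x[i*wp:(i+1)*wp, j*hp:(j+1)*hp, :].ravel()
                   for j in range(J)] for i in range(I)])
    p = wp * hp * c
    Wxf, Whf = rng.normal(size=(d, p)), rng.normal(size=(d, d))
    Wxr, Whr = rng.normal(size=(d, p)), rng.normal(size=(d, d))
    # Step 1: vertical sweep, two RNNs run down and up each column j.
    V = np.zeros((I, J, 2 * d))
    for j in range(J):
        col = X[:, j]
        zF = rnn_sweep(col, Wxf, Whf)               # top-to-bottom states
        zR = rnn_sweep(col[::-1], Wxr, Whr)[::-1]   # bottom-to-top states
        V[:, j] = np.concatenate([zF, zR], axis=1)  # step 2: concatenate states
    # Step 3: horizontal sweep over V, row by row, with two more RNNs.
    Wxf2, Whf2 = rng.normal(size=(d, 2 * d)), rng.normal(size=(d, d))
    Wxr2, Whr2 = rng.normal(size=(d, 2 * d)), rng.normal(size=(d, d))
    H = np.zeros((I, J, 2 * d))
    for i in range(I):
        row = V[i]
        zF = rnn_sweep(row, Wxf2, Whf2)
        zR = rnn_sweep(row[::-1], Wxr2, Whr2)[::-1]
        H[i] = np.concatenate([zF, zR], axis=1)
    return H  # H[i, j] describes patch (i, j) in the context of the whole image

# Toy usage: a 28x28 single-channel image, 2x2 patches, 8 units per direction.
H = renet_layer(rng.normal(size=(28, 28, 1)), wp=2, hp=2, d=8)
print(H.shape)  # (14, 14, 16)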

SLIDE 20

Differences between LeNet and ReNet

BOTH: apply the same set of filters to patches of the input image or to the feature map from the layer below.

LeNet:

  • Patches: overlap
  • Many levels of convolution+pooling layers: to detect redundant features from different parts of the image
  • Using max-pooling to achieve local translation invariance
  • Highly parallelizable, due to the independence of computing activations at each layer

SLIDE 21

Differences between LeNet and ReNet

ReNet:

  • Patches: do not overlap
  • The lateral connections: help extract a more compact feature representation of the input image at each layer
  • The lateral connections in ReNet can emulate the local competition among features induced by the max-pooling in LeNet
  • Not easily parallelizable, due to the sequential nature of RNNs
SLIDE 22

Datasets

  • MNIST:
    • 70K handwritten digits from 0 to 9
    • 28 × 28 pixels, each pixel: grayscale in [0, 255]
    • Train/Dev/Test: 50K/10K/10K
  • CIFAR-10:
    • 60K images: a curated subset of the 80M tiny images dataset
    • 10 categories: airplane, automobile, bird, cat, deer, dog, frog, horse, ship and truck
    • 32 × 32 pixels, each pixel: 3 color channels (red-green-blue)
    • Train/Dev/Test: 40K/10K/10K
  • Street View House Numbers (SVHN):
    • cropped images representing house numbers captured by Google Street View vehicles as part of the Google Maps mapping process
    • the images consist of digits 0 through 9 with pixel values in the range [0, 255]
    • 32 × 32 pixels, each pixel: grayscale in [0, 255]
    • Train/Dev/Test: 543,949/60,439/26,032

Data Augmentation:

  • Flippling: flipped each sample horizontally with 25% chance, flipped it vertically

with 25% chance

  • Shifting:
  • ← 2 pixels (25% chance), → 2 pixels (25% chance)
  • ↑ 2 pixels (25% chance), ↓ 2 pixels (25% chance)
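
A minimal numpy sketch of the flipping/shifting augmentation above; treating the shifts as zero-padded and mutually exclusive per axis is an assumption, as is the (height, width, channels) array layout.

import numpy as np

rng = np.random.default_rng(0)

def augment(img):
    out = img.copy()
    # Flipping: horizontal flip with 25% chance, vertical flip with 25% chance.
    if rng.random() < 0.25:
        out = out[:, ::-1]
    if rng.random() < 0.25:
        out = out[::-1, :]
    # Shifting: move 2 pixels left/right/up/down, each with 25% chance,
    # zero-padding the vacated border.
    if rng.random() < 0.25:
        out = np.roll(out, -2, axis=1); out[:, -2:] = 0   # left
    elif rng.random() < 0.25:
        out = np.roll(out, 2, axis=1);  out[:, :2] = 0    # right
    if rng.random() < 0.25:
        out = np.roll(out, -2, axis=0); out[-2:, :] = 0   # up
    elif rng.random() < 0.25:
        out = np.roll(out, 2, axis=0);  out[:2, :] = 0    # down
    return out

augmented = augment(rng.normal(size=(32, 32, 3)))
print(augmented.shape)  # (32, 32, 3)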
SLIDE 23

Model Architectures

  • Unit implementation:
    • GRU: MNIST, CIFAR-10
    • LSTM: SVHN
  • General architecture:
    • N_RE: the number of ReNet layers
    • d_RE: the feature dimensionality of the ReNet layers
    • N_FC: the number of fully-connected layers
    • d_FC: the feature dimensionality of the fully-connected layers
    • f_FC: the type of hidden units in the fully-connected layers
SLIDE 24

Experiment Results

SLIDE 25

Discussion

  • Choice of Recurrent Units
    • ReNet performs well independently of the specific implementation of the recurrent units (either LSTM or GRU).
    • Gated recurrent units, either the GRU or the LSTM, outperform a usual sigmoidal unit (an affine transformation followed by an element-wise sigmoid function).
  • Analysis of the Trained ReNet
    • ReNet performs comparably to deep convolutional neural networks, which are the de facto standard for object recognition.
    • ReNet does not outperform state-of-the-art convolutional neural networks on any of the three benchmark datasets.
    • The authors expect that the internal behavior of ReNet differs significantly from that of LeNet, which needs further investigation.
  • Computationally Efficient Implementation
    • ReNet is less parallelizable due to the sequential nature of RNNs.
    • ReNet allows the forward and backward RNNs of each sweep to be run independently of each other, which allows for parallel computation.

SLIDE 26

Summary

Gated Feedback Recurrent Neural Networks. arXiv1502.

  • Stacked RNN + fully-connected recurrent transitions + global reset gates
  • Three kinds of implementation units: tanh, LSTM, GRU
  • Experiments: Character-level Text Generation
    • English Wikipedia
    • Better than a conventional stacked RNN
  • Experiments: Python Program Evaluation
    • Predict return values of Python scripts
    • Using an RNN encoder-decoder approach

ReNet: A Recurrent Neural Network Based Alternative to Convolutional Networks. arXiv1505.

  • A one-layer ReNet: 4 RNNs sweeping the input image in 4 directions
  • Stacked ReNet layers → deep architectures
  • Two kinds of implementation units: LSTM, GRU
  • Experiments: MNIST & CIFAR-10 & Street View House Numbers
    • Using flipping and shifting as data augmentation
    • ReNet performs comparably to deep CNNs, but does not outperform state-of-the-art CNNs on any of the three tasks.