SLIDE 1

LSTMs Overview

Subhashini Venugopalan

SLIDE 2

Neural Networks

[Diagram: feed-forward network: Input → Hidden → Hidden → Output (zt).]

SLIDE 3

WHY RNNs/LSTMs?

  • Accepts only fixed-size input, e.g., 224×224 images.
  • Performs a fixed number of computations (#layers).
  • Outputs a fixed-size vector.

[Diagram: the same feed-forward network: Input → Hidden → Hidden → Output (zt).]

These are limitations of vanilla Neural Networks. Can we operate over sequences of inputs?
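To make the fixed-size limitation concrete, here is a tiny illustrative sketch (NumPy, toy dimensions of my own choosing, not any network from the slides): a dense layer's weight matrix hard-codes its input size, so variable-length sequences simply do not fit.

```python
import numpy as np

# A vanilla feed-forward layer hard-codes its input size: the weight
# matrix below maps exactly 4 inputs to 10 hidden units, and nothing
# else fits without padding or cropping.
rng = np.random.default_rng(0)
W = rng.standard_normal((10, 4))
b = np.zeros(10)

def feedforward(x):
    return np.tanh(W @ x + b)

h = feedforward(rng.standard_normal(4))
print(h.shape)  # (10,)
# A length-7 sequence of 4-dim inputs has no natural slot here; the
# network also always performs the same fixed number of computations.
```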

SLIDE 4

Recurrent Neural Networks

Image Credit: Chris Olah

They are networks with loops. [Elman ‘90]

SLIDE 5

Un-Roll The Loop

Image Credit: Chris Olah

Recurrent Neural Network “unrolled in time”

  • Each time step has a layer with the same weights.
  • The repeating layer/module is a sigmoid or a tanh.
  • Learns to model Pr(xt | x1, …, xt-1)
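The "unrolled in time" picture can be sketched as a loop that applies the same weights at every step (a toy NumPy illustration with made-up dimensions, not code from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid = 3, 5
Wxh = rng.standard_normal((n_hid, n_in)) * 0.1   # shared across all time steps
Whh = rng.standard_normal((n_hid, n_hid)) * 0.1
b = np.zeros(n_hid)

def unroll(xs):
    """Apply the same tanh layer at every time step (the unrolled loop)."""
    h = np.zeros(n_hid)
    states = []
    for x in xs:                 # one 'layer' per time step, same weights
        h = np.tanh(Wxh @ x + Whh @ h + b)
        states.append(h)
    return states

states = unroll([rng.standard_normal(n_in) for _ in range(4)])
print(len(states), states[-1].shape)  # 4 (5,)
```

Each `h` depends on the whole history of inputs so far, which is what lets the network summarize a sequence.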
SLIDE 6

Simple RNNs

Image Credit: Chris Olah

[Diagram: in a simple RNN the repeating module is a single layer, a sigmoid or a tanh.]

SLIDE 7

Problems with Simple RNNs

  • Can’t seem to handle “long-term dependencies” in practice.
  • Gradients shrink as they pass back through the many layers (Vanishing Gradients).

[Hochreiter ‘91] [Bengio et al. ‘94]

Image Credit: Chris Olah
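A quick numerical illustration of the vanishing-gradient effect (a toy NumPy sketch with invented dimensions, not from the slides): each step of backprop-through-time multiplies the backward signal by another small factor, so the signal reaching early time steps decays geometrically.

```python
import numpy as np

# Each backward step through a tanh RNN multiplies the signal by
# diag(1 - h**2) @ Whh.T; when these factors have magnitude below 1
# the product decays geometrically, starving early time steps.
rng = np.random.default_rng(0)
n = 5
Whh = rng.standard_normal((n, n)) * 0.2
h = np.zeros(n)
grad = np.ones(n)
norms = []
for t in range(50):
    h = np.tanh(Whh @ h + rng.standard_normal(n))
    grad = (1 - h**2) * (Whh.T @ grad)   # one backward step through tanh
    norms.append(np.linalg.norm(grad))
print(norms[0], norms[-1])   # the norm shrinks dramatically over 50 steps
```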

SLIDE 8

Long Short Term Memory (LSTMs)

Image Credit: Chris Olah

[Hochreiter and Schmidhuber ‘97]

SLIDE 9

LSTM Unit

[Diagram: LSTM unit: inputs xt and ht-1 feed the Input, Forget, Output and Input Modulation gates around a central Memory Cell, producing ht.]

Memory Cell: core of the LSTM unit; encodes all inputs observed.

[Hochreiter and Schmidhuber ‘97] [Graves ‘13]

SLIDE 10

LSTM Unit

[Diagram: the same LSTM unit, highlighting the gates.]

Memory Cell: core of the LSTM unit; encodes all inputs observed.
Gates: Input, Output and Forget; each a sigmoid with values in [0,1].

[Hochreiter and Schmidhuber ‘97] [Graves ‘13]

SLIDE 11

LSTM Unit

[Diagram: the same LSTM unit, highlighting the additive update of the cell state.]

Update the Cell state: learns long-term dependencies.

[Hochreiter and Schmidhuber ‘97] [Graves ‘13]
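The gate structure above can be written out as one step of an LSTM, following the standard equations from Hochreiter and Schmidhuber ‘97 / Graves ‘13 (a minimal NumPy sketch; the stacked-weight layout and the dimensions are my own illustrative choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step with input (i), forget (f), output (o) and input
    modulation (g) gates, as on the slide. W maps [x; h_prev] to the
    four gate pre-activations stacked together."""
    n = h_prev.size
    z = W @ np.concatenate([x, h_prev]) + b
    i = sigmoid(z[0*n:1*n])      # input gate:  what to write
    f = sigmoid(z[1*n:2*n])      # forget gate: what to keep in the cell
    o = sigmoid(z[2*n:3*n])      # output gate: what to expose
    g = np.tanh(z[3*n:4*n])      # input modulation: candidate values
    c = f * c_prev + i * g       # additive update of the memory cell
    h = o * np.tanh(c)           # hidden state read out of the cell
    return h, c

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
W = rng.standard_normal((4 * n_hid, n_in + n_hid)) * 0.1
b = np.zeros(4 * n_hid)
h = c = np.zeros(n_hid)
for t in range(5):
    h, c = lstm_step(rng.standard_normal(n_in), h, c, W, b)
print(h.shape, c.shape)  # (4,) (4,)
```

The additive `c = f * c_prev + i * g` update is the key: gradients flow through the cell state without repeatedly squashing, which is why long-term dependencies survive.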

SLIDE 12

Can Model Sequences

  • Can handle longer-term dependencies
  • Overcomes the Vanishing Gradients problem
  • GRUs (Gated Recurrent Units) are a much simpler variant that also overcomes these issues.

[Diagram: a chain of LSTM units unrolled over time.]

[Cho et al. ‘14]
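For comparison, a GRU step [Cho et al. ‘14] needs only two gates and no separate memory cell (again a minimal NumPy sketch with illustrative dimensions, not the reference implementation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, Wz, Wr, Wh):
    """One GRU step: two gates instead of three, no separate cell."""
    z = sigmoid(Wz @ np.concatenate([x, h_prev]))        # update gate
    r = sigmoid(Wr @ np.concatenate([x, h_prev]))        # reset gate
    h_tilde = np.tanh(Wh @ np.concatenate([x, r * h_prev]))
    return (1 - z) * h_prev + z * h_tilde                # interpolate old/new

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
shape = (n_hid, n_in + n_hid)
Wz, Wr, Wh = (rng.standard_normal(shape) * 0.1 for _ in range(3))
h = np.zeros(n_hid)
h = gru_step(rng.standard_normal(n_in), h, Wz, Wr, Wh)
print(h.shape)  # (4,)
```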

SLIDE 13

Putting Things Together

Image Credit: Sutskever et al.

Encode a sequence of inputs to a vector: ht summarizes x1, …, xt. Decode from the vector to a sequence of outputs: Pr(xt | x1, …, xt-1).
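A toy sketch of the encode-then-decode idea (plain tanh recurrences stand in for the LSTMs, with a made-up 6-symbol vocabulary; illustrative only, not the Sutskever et al. model):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
Wenc = rng.standard_normal((n, 2 * n)) * 0.1
Wdec = rng.standard_normal((n, 2 * n)) * 0.1
Wout = rng.standard_normal((6, n)) * 0.1   # toy 6-symbol vocabulary

def encode(xs):
    h = np.zeros(n)
    for x in xs:                                  # fold the input sequence
        h = np.tanh(Wenc @ np.concatenate([x, h]))
    return h                                      # one vector summarizes it

def decode(h, steps):
    y = np.zeros(n)
    out = []
    for _ in range(steps):                        # emit one symbol per step
        y = np.tanh(Wdec @ np.concatenate([y, h]))
        logits = Wout @ y
        p = np.exp(logits) / np.exp(logits).sum() # Pr(x_t | x_1..x_{t-1})
        out.append(int(np.argmax(p)))
    return out

v = encode([rng.standard_normal(n) for _ in range(5)])
symbols = decode(v, 3)
print(symbols)
```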

SLIDE 14

SOLVE A WIDER RANGE OF PROBLEMS

  • Image Captioning: Vinyals et al. ‘15, Donahue et al. ‘15
  • Activity Recognition: Donahue et al. ‘15
  • Machine Translation: Sutskever et al. ‘14, Cho et al. ‘14
  • Speech Recognition: Graves & Jaitly ‘14
  • Video Description: V. et al. ‘15, Li et al. ‘15
  • Sequence to Sequence
  • VQA, POS tagging, ...

Image Credit: Andrej Karpathy

3 of 4 papers to be discussed this class

SLIDE 15

Resources

  • Graves’ paper - LSTMs explanation. Generating sequences with recurrent neural networks. Applications to handwriting and speech recognition.
  • Chris Olah’s blog - LSTM unit explanation.
  • Karpathy’s blog - Applications.
  • Tensorflow and Caffe - Code examples.
SLIDE 16

Sequence to Sequence Video to Text

Subhashini Venugopalan, Marcus Rohrbach, Jeff Donahue, Raymond Mooney, Trevor Darrell, Kate Saenko

SLIDE 17

Objective

A monkey is pulling a dog’s tail and is chased by the dog.

SLIDE 18

Encode

Recurrent Neural Networks (RNNs) can map a vector to a sequence.

[Diagram: encoder-decoder variants:
English Sentence → RNN encoder → RNN decoder → French Sentence [Sutskever et al. NIPS’14]
Encode → RNN decoder → Sentence [Donahue et al. CVPR’15] [Vinyals et al. CVPR’15]
Encode → RNN decoder → Sentence [Venugopalan et al. NAACL’15]
RNN encoder → RNN decoder → Sentence [Venugopalan et al. ICCV’15] (this work)]

SLIDE 19

S2VT Overview

[Diagram: a two-layer LSTM stack unrolled over time. In the encoding stage CNN features of each frame are fed in; in the decoding stage the words “A man is talking ...” are emitted.]

Now decode it to a sentence!

Sequence to Sequence - Video to Text (S2VT)

  • S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, K. Saenko
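The two stages of S2VT can be caricatured as follows (a schematic NumPy sketch: a tanh layer stands in for each LSTM layer and random vectors stand in for CNN frame features; this is not the paper's implementation):

```python
import numpy as np

# Schematic of the S2VT loop: the same two-layer recurrent stack runs
# over every time step. During encoding it consumes frame features;
# during decoding it sees padding on the frame side and emits words.
rng = np.random.default_rng(0)
d = 8
W1 = rng.standard_normal((d, 2 * d)) * 0.1
W2 = rng.standard_normal((d, 2 * d)) * 0.1

def step(x, h, W):
    # stand-in tanh layer for one LSTM layer
    return np.tanh(W @ np.concatenate([x, h]))

frames = [rng.standard_normal(d) for _ in range(4)]   # CNN fc7 stand-ins
pad = np.zeros(d)
h1 = h2 = np.zeros(d)
words = []
for t in range(4 + 3):                  # 4 encode steps + 3 decode steps
    x = frames[t] if t < 4 else pad     # frames first, then padding
    h1 = step(x, h1, W1)
    h2 = step(h1, h2, W2)
    if t >= 4:                          # decoding stage: read out a word id
        words.append(int(np.argmax(h2)))
print(words)
```

The point of the shared stack is that one set of recurrent weights handles both watching the video and generating the sentence.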
SLIDE 20

Frames: RGB

CNN trained on 1000 categories. Forward propagate a frame; output: “fc7” features (activations before the classification layer).

fc7: 4096-dimension “feature vector”

  • 1. Train on Imagenet
  • 2. Take activations from the layer before classification
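The "take the layer before classification" recipe, in sketch form (a toy NumPy network with invented layer sizes; the real fc7 is 4096-dimensional in AlexNet/VGG):

```python
import numpy as np

# Toy stand-in for feature extraction: forward-propagate through the
# network but stop one layer short of the classifier, keeping the
# penultimate ('fc7'-style) activations as the feature vector.
rng = np.random.default_rng(0)
layers = [
    rng.standard_normal((16, 32)) * 0.1,  # earlier layers (toy sizes)
    rng.standard_normal((8, 16)) * 0.1,   # 'fc7' stand-in (4096-dim in VGG)
    rng.standard_normal((5, 8)) * 0.1,    # classifier (1000-way on ImageNet)
]

def fc7_features(x):
    for W in layers[:-1]:           # skip the final classification layer
        x = np.maximum(0.0, W @ x)  # ReLU
    return x

feat = fc7_features(rng.standard_normal(32))
print(feat.shape)  # (8,)
```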
SLIDE 21

Frames: Flow

CNN (modified AlexNet) trained on 101 Action Classes. Forward propagate; output: “fc7” features (activations before the classification layer).

fc7: 4096-dimension “feature vector”

  • 1. Train CNN on Activity classes (UCF 101)
  • 2. Use optical flow to extract flow images. [T. Brox et al. ECCV ‘04]
  • 3. Take activations from the layer before classification

SLIDE 22

Dataset: YouTube

  • A man is walking on a rope.
  • A man is walking across a rope.
  • A man is balancing on a rope.
  • A man is balancing on a rope at the beach.
  • A man walks on a tightrope at the beach.
  • A man is balancing on a volleyball net.
  • A man is walking on a rope held by poles
  • A man balanced on a wire.
  • The man is balancing on the wire.
  • A man is walking on a rope.
  • A man is standing in the sea shore.
  • ~2000 clips
  • Avg. length: 11s per clip
  • ~40 sentences per clip
  • ~81,000 sentences
SLIDE 23

Results (YouTube)

METEOR scores:

Mean-Pool (VGG): 27.7
S2VT (randomized): 28.2
S2VT (RGB): 29.2
S2VT (RGB+Flow): 29.8

METEOR: MT metric. Considers alignment, paraphrases and similarity.

SLIDE 24

SLIDE 25

Evaluation: Movie Corpora

MPII-MD
  • MPII, Germany
  • DVS alignment: semi-automated and crowdsourced
  • 94 movies
  • 68,000 clips
  • Avg. length: 3.9s per clip
  • ~1 sentence per clip
  • 68,375 sentences

M-VAD
  • Univ. of Montreal
  • DVS alignment: automated speech extraction
  • 92 movies
  • 46,009 clips
  • Avg. length: 6.2s per clip
  • 1-2 sentences per clip
  • 56,634 sentences
SLIDE 26

Movie Corpus - DVS

Processed: Looking troubled, someone descends the stairs. Someone rushes into the courtyard. She then puts a head scarf on ...

SLIDE 27

Results (MPII-MD Movie Corpus)

METEOR scores:

Best Prior Work [Rohrbach et al. CVPR’15]: 5.6
Mean-Pool: 6.7
S2VT (RGB): 7.1

SLIDE 28

Results (M-VAD Movie Corpus)

METEOR scores:

Best Prior Work [Yao et al. ICCV’15]: 4.3
Mean-Pool: 6.1
S2VT (RGB): 6.7

SLIDE 29

M-VAD: https://youtu.be/pER0mjzSYaM

SLIDE 30
Discussion

  • What are the advantages/drawbacks of this approach?
    ○ End-to-end, annotations
  • Detaching recognition and generation.
  • Why only METEOR (not BLEU or other metrics)?
  • Domain adaptation, re-use RNNs (YouTube -> movies, activity recognition)
  • Languages other than English.
  • Features apart from Optical Flow, RGB; temporal representation.
SLIDE 31

SLIDE 32

Code and more examples http://vsubhashini.github.io/s2vt.html
