LSTMs Overview
Subhashini Venugopalan
WHY RNNs/LSTMs?
Limitations of vanilla Neural Networks
[Figure: feed-forward network with layers Input, Hidden, Hidden, Output]
Accepts only fixed-size input, e.g. 224x224 images.
Performs a fixed number of computations (the number of layers).
Outputs a fixed-size vector.
Can we operate over sequences of inputs?
Image Credit: Chris Olah
They are networks with loops. [Elman ‘90]
Image Credit: Chris Olah
Recurrent Neural Network “unrolled in time”
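To make "networks with loops" and "unrolled in time" concrete, here is a minimal sketch of a plain recurrent step applied repeatedly with shared weights. It assumes a tanh nonlinearity; the names `rnn_step`, `run_rnn`, `W_xh`, `W_hh` and the toy weights are illustrative and do not come from the slides.

```python
import math

def rnn_step(x, h_prev, W_xh, W_hh, b):
    """One recurrent step: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b)."""
    h = []
    for i in range(len(b)):
        total = b[i]
        total += sum(W_xh[i][j] * x[j] for j in range(len(x)))
        total += sum(W_hh[i][j] * h_prev[j] for j in range(len(h_prev)))
        h.append(math.tanh(total))
    return h

def run_rnn(xs, h0, W_xh, W_hh, b):
    """Unroll the loop in time: the same weights are reused at every step."""
    h = h0
    states = []
    for x in xs:
        h = rnn_step(x, h, W_xh, W_hh, b)
        states.append(h)
    return states

# Toy weights (assumed values): 2-d inputs, 3-d hidden state.
W_xh = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]]
W_hh = [[0.1, 0.0, 0.2], [0.0, 0.1, -0.1], [0.3, -0.2, 0.1]]
b = [0.0, 0.1, -0.1]

# The sequence may have any length; here it has four steps.
states = run_rnn([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]],
                 [0.0, 0.0, 0.0], W_xh, W_hh, b)
print(len(states), len(states[-1]))  # prints: 4 3
```

Unlike the fixed-size network above, the loop accepts a sequence of any length and produces one hidden state per input.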
Image Credit: Chris Olah
[Figure labels: sigmoid, tanh; nonlinearity is sigmoid or tanh]
[Hochreiter ‘91] [Bengio et al. ‘94]
Image Credit: Chris Olah
[Hochreiter and Schmidhuber ‘97]
[Figure: LSTM unit. Inputs x_t and h_{t-1} feed the Input Gate, Forget Gate, Output Gate and Input Modulation Gate; the Memory Cell produces the output h_t.]
Memory Cell: core of the LSTM unit. Encodes all inputs observed.
Gates: Input, Output and Forget. Sigmoid, range [0,1].
Update the cell state. Learns long-term dependencies.
[Hochreiter and Schmidhuber ‘97] [Graves ‘13]
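The gate and memory-cell description above can be sketched as a single LSTM step. This assumes the common formulation without peephole connections, with each gate reading the concatenation of x_t and h_{t-1}; the parameter names (`W_i`, `b_f`, ...) and the helper `zero_params` are hypothetical and only serve the toy example.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def affine(W, z, b):
    """Return W z + b for a list-of-rows matrix W and list vectors z, b."""
    return [sum(w * zj for w, zj in zip(row, z)) + bi for row, bi in zip(W, b)]

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step (no peepholes). Returns the new (h_t, c_t)."""
    z = x + h_prev  # list concatenation of x_t and h_{t-1}
    i = [sigmoid(v) for v in affine(params["W_i"], z, params["b_i"])]    # input gate
    f = [sigmoid(v) for v in affine(params["W_f"], z, params["b_f"])]    # forget gate
    o = [sigmoid(v) for v in affine(params["W_o"], z, params["b_o"])]    # output gate
    g = [math.tanh(v) for v in affine(params["W_c"], z, params["b_c"])]  # input modulation gate
    # Memory cell update: keep part of the old state, add part of the new input.
    c = [ft * cp + it * gt for ft, cp, it, gt in zip(f, c_prev, i, g)]
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c

def zero_params(n_in, n_hid):
    """Hypothetical helper: all-zero weights and biases for a toy cell."""
    params = {}
    for gate in ("i", "f", "o", "c"):
        params["W_" + gate] = [[0.0] * (n_in + n_hid) for _ in range(n_hid)]
        params["b_" + gate] = [0.0] * n_hid
    return params

params = zero_params(n_in=1, n_hid=2)
params["b_f"] = [100.0, 100.0]    # forget gate close to 1: keep the old cell state
params["b_i"] = [-100.0, -100.0]  # input gate close to 0: ignore the new input
params["b_c"] = [1.0, 1.0]        # candidate update is non-zero but gated out
h, c = lstm_step([3.0], [0.0, 0.0], [2.0, -2.0], params)
print([round(v, 6) for v in c])  # prints: [2.0, -2.0]
```

With the forget gate open and the input gate closed, the cell state passes through the step unchanged, which illustrates how the memory cell can carry information across many time steps.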
[Figure: a chain of four LSTM units]
[Cho et al. ‘14]
Image Credit: Sutskever et al.
Encode a sequence of inputs to a vector: (h_t | x_1, …, x_{t-1})
Decode from the vector to a sequence of outputs: Pr(x_t | x_1, …, x_{t-1})
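A rough sketch of the encode-then-decode idea follows. It assumes a simple tanh recurrence standing in for the LSTM units, greedy decoding, and random untrained weights, so the decoded words are meaningless; every name here (`encode`, `decode`, `W_out`, the toy vocabulary) is illustrative rather than taken from the cited papers.

```python
import math
import random

def matvec(W, v):
    return [sum(w * vj for w, vj in zip(row, v)) for row in W]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def step(W_x, W_h, x, h):
    """Simple tanh recurrence, used here in place of an LSTM unit."""
    return [math.tanh(a + b) for a, b in zip(matvec(W_x, x), matvec(W_h, h))]

def encode(inputs, W_x, W_h, n_hid):
    """Read the whole input sequence; return the final hidden vector."""
    h = [0.0] * n_hid
    for x in inputs:
        h = step(W_x, W_h, x, h)
    return h

def one_hot(k, n):
    return [1.0 if j == k else 0.0 for j in range(n)]

def decode(h, W_y, W_h, W_out, vocab, max_len=10):
    """Greedy decoding: at each step take the most probable next word
    under Pr(next word | previous words, encoded vector)."""
    prev = one_hot(vocab.index("<BOS>"), len(vocab))
    words = []
    for _ in range(max_len):
        h = step(W_y, W_h, prev, h)
        probs = softmax(matvec(W_out, h))
        k = max(range(len(vocab)), key=lambda j: probs[j])
        if vocab[k] == "<EOS>":
            break
        words.append(vocab[k])
        prev = one_hot(k, len(vocab))
    return words

def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
vocab = ["<BOS>", "<EOS>", "a", "man", "is", "talking"]
n_in, n_hid = 4, 5
W_x = rand_matrix(rng, n_hid, n_in)           # encoder input weights
W_h = rand_matrix(rng, n_hid, n_hid)          # recurrent weights
W_y = rand_matrix(rng, n_hid, len(vocab))     # decoder input weights
W_out = rand_matrix(rng, len(vocab), n_hid)   # hidden state to word scores

short = [[rng.random() for _ in range(n_in)] for _ in range(3)]
long_ = [[rng.random() for _ in range(n_in)] for _ in range(7)]
v_short = encode(short, W_x, W_h, n_hid)
v_long = encode(long_, W_x, W_h, n_hid)
words = decode(v_long, W_y, W_h, W_out, vocab)
print(len(v_short), len(v_long))  # prints: 5 5
```

Input sequences of different lengths are summarized by vectors of the same size, and the decoder then emits a variable-length output one word at a time.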
Image Captioning: Vinyals et al. ‘15, Donahue et al. ‘15
Image Credit: Andrej Karpathy
Activity Recognition: Donahue et al. ‘15
Sequence to Sequence
Machine Translation: Sutskever et al. ‘14, Cho et al. ‘14
Speech Recognition: Graves & Jaitly ‘14
Video Description, VQA, POS tagging, ...
3 of 4 papers to be discussed this class
neural networks. Applications to handwriting and speech recognition.
Subhashini Venugopalan, Marcus Rohrbach, Jeff Donahue, Raymond Mooney, Trevor Darrell, Kate Saenko
Objective
A monkey is pulling a dog’s tail and is chased by the dog.
Encode
Recurrent Neural Networks (RNNs) can map a vector to a sequence.
English Sentence -> RNN encoder -> RNN decoder -> French Sentence [Sutskever et al. NIPS’14]
Encode -> RNN decoder -> Sentence [Donahue et al. CVPR’15] [Vinyals et al. CVPR’15]
Encode -> RNN decoder -> Sentence [Venugopalan et al. NAACL’15]
RNN encoder -> RNN decoder -> Sentence [Venugopalan et al. ICCV’15] (this work)
S2VT Overview
[Figure: S2VT architecture. A CNN processes each video frame and feeds a stack of LSTMs in the encoding stage; in the decoding stage the LSTMs output the sentence "A man is talking ..."]
Now decode it to a sentence!
Sequence to Sequence - Video to Text (S2VT)
Frames: RGB
CNN (1000 categories)
Forward propagate
Output: “fc7” features (activations before classification layer)
fc7: 4096-dimensional “feature vector”
Frames: Flow
CNN (modified AlexNet), 101 Action Classes
Forward propagate
Output: “fc7” features (activations before classification layer)
fc7: 4096-dimensional “feature vector”
Activity classes
layer before classification
extract flow images.
UCF 101
[T. Brox et al. ECCV ‘04]
Dataset: Youtube
Results (Youtube)
[Chart: scores shown are 27.7, 28.2, 29.2, 29.8; systems shown are Mean-Pool (VGG), S2VT (RGB+Flow), S2VT (randomized), S2VT (RGB)]
METEOR: a machine translation (MT) metric. Considers alignment, paraphrases and similarity.
Evaluation: Movie Corpora
MPII-MD
automated and crowdsourced
M-VAD
speech extraction
Movie Corpus - DVS
Processed: Looking troubled, someone descends the stairs. Someone rushes into the courtyard. She then puts a head scarf on ...
Results (MPII-MD Movie Corpus)
Best Prior Work [Rohrbach et al. CVPR’15]: 5.6
Mean-Pool: 6.7
S2VT (RGB): 7.1
Results (M-VAD Movie Corpus)
Best Prior Work [Yao et al. ICCV’15]: 4.3
Mean-Pool: 6.1
S2VT (RGB): 6.7
M-VAD: https://youtu.be/pER0mjzSYaM
○ End-to-end, annotations
Discussion
Sequence to Sequence - Video to Text (S2VT)
Code and more examples http://vsubhashini.github.io/s2vt.html