

SLIDE 1

Long-term Recurrent Convolutional Networks for Visual Recognition and Description

Donahue et al.

Berkan Demirel

SLIDE 2

Overview of LRCN

• LRCN is a class of models that is both spatially and temporally deep.

• It has the flexibility to be applied to a variety of vision tasks involving sequential inputs and outputs.

Image credit: main paper

SLIDE 3

Convolutional Neural Networks

Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet classification with deep convolutional neural networks." Advances in Neural Information Processing Systems. 2012.

SLIDE 4

Limitation 1

Fixed-size, static input: 224×224×3

SLIDE 5

Limitation 2

Output is a single choice from a list of options

SLIDE 6

Jeff Donahue, CVPR Caffe Tutorial, June 6, 2015

Background: Sequence Learning

SLIDE 7

Jeff Donahue, CVPR Caffe Tutorial, June 6, 2015

Background: Sequence Learning

SLIDE 8

Jeff Donahue, CVPR Caffe Tutorial, June 6, 2015

Background: Sequence Learning

SLIDE 9

Jeff Donahue, CVPR Caffe Tutorial, June 6, 2015

Background: Sequence Learning

SLIDE 10

Jeff Donahue, CVPR Caffe Tutorial, June 6, 2015

Background: Sequence Learning

SLIDE 11

Jeff Donahue, CVPR Caffe Tutorial, June 6, 2015

Background: Sequence Learning

SLIDE 12

Contributions

• Mapping variable-length inputs (e.g., video frames) to variable-length outputs (e.g., natural language text).

• LRCN is directly connected to modern visual convnet models.

• It is suitable for large-scale visual learning and is end-to-end trainable.

SLIDE 13

Sequential inputs/outputs

Image credit: main paper

SLIDE 14

LRCN Model

• The LRCN model works by passing each visual input (an image in isolation, or a frame from a video) through a feature transformation parametrized by V to produce a fixed-length vector representation.

• In its most general form, a sequence model parametrized by W maps an input x_t and the previous timestep's hidden state h_{t-1} to an output z_t and an updated hidden state h_t.

• The final step in predicting a distribution P(y_t) at timestep t is to take a softmax over the outputs z_t of the sequence model (see the sketch below).
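A minimal sketch of one LRCN timestep, assuming PyTorch; the class name, the dimensions, and the Linear layer standing in for the full CNN are illustrative assumptions, not the paper's Caffe implementation.

```python
import torch
import torch.nn as nn

class LRCNStep(nn.Module):
    """One timestep of the generic LRCN formulation (a sketch)."""
    def __init__(self, feat_dim=4096, hidden_dim=256, num_classes=101):
        super().__init__()
        # Feature transformation parametrized by V (a CNN in the paper;
        # a Linear layer keeps the sketch small).
        self.phi_V = nn.Linear(feat_dim, hidden_dim)
        # Sequence model parametrized by W: (x_t, h_{t-1}) -> (z_t, h_t).
        self.lstm_W = nn.LSTMCell(hidden_dim, hidden_dim)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, x_t, state=None):
        h, c = self.lstm_W(self.phi_V(x_t), state)  # updated hidden state h_t
        z_t = self.classifier(h)                    # per-timestep output z_t
        return torch.softmax(z_t, dim=-1), (h, c)   # P(y_t) via softmax
```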

SLIDE 15

LRCN Model - Activity Recognition

• Sequential input, fixed output: <x_1, x_2, x_3, ..., x_T> → y

• With sequential inputs and scalar outputs, we take a late-fusion approach, merging the per-timestep predictions into a single prediction for the full sequence (sketched below).
Image credit: main paper
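A few lines make the late-fusion step concrete: the per-timestep distributions are simply averaged. This is a sketch with made-up shapes (16 timesteps, 101 classes).

```python
import torch

per_step_probs = torch.rand(16, 101).softmax(dim=-1)  # P(y_t) at each timestep
clip_probs = per_step_probs.mean(dim=0)               # late fusion: average over t
label = clip_probs.argmax().item()                    # single prediction y
```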

SLIDE 16

LRCN Model – Image Description

• Fixed input, sequential outputs: x → <y_1, y_2, y_3, ..., y_T>

• With fixed-size inputs and sequential outputs, we simply duplicate the input x at all T timesteps (see the sketch below).
Image credit: main paper
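Duplicating the fixed input across timesteps is a one-liner; this sketch assumes PyTorch and an illustrative feature size.

```python
import torch

T, feat_dim = 20, 4096
x = torch.randn(feat_dim)                    # CNN features for one image
x_seq = x.unsqueeze(0).expand(T, feat_dim)   # the same x at all T timesteps
```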

SLIDE 17

LRCN Model – Video Description

• Sequential input, sequential outputs: <x_1, x_2, x_3, ..., x_T> → <y_1, y_2, y_3, ..., y_T'>

• For a sequence-to-sequence problem with (in general) different input and output lengths, we take an "encoder-decoder" approach.

• In this approach, one sequence model, the encoder, maps the input sequence to a fixed-length vector; another sequence model, the decoder, unrolls this vector into sequential outputs of arbitrary length (sketched below).
Image credit: main paper
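A minimal encoder-decoder sketch, assuming PyTorch; all sizes are illustrative, and the zero decoder inputs stand in for embedded previous words.

```python
import torch
import torch.nn as nn

enc = nn.LSTM(input_size=4096, hidden_size=512)  # encoder: frames -> summary
dec = nn.LSTM(input_size=512, hidden_size=512)   # decoder: summary -> outputs

frames = torch.randn(16, 1, 4096)                # (T, batch, features)
_, (h, c) = enc(frames)                          # fixed-length vector as (h, c)

T_out = 10                                       # output length T' != T is fine
dec_in = torch.zeros(T_out, 1, 512)              # placeholder previous-word inputs
outputs, _ = dec(dec_in, (h, c))                 # unrolled sequential outputs
```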

SLIDE 18

LRCN Model

Under the proposed system, the weights (V, W) of the model's visual and sequential components can be learned jointly by maximizing the likelihood of the ground-truth outputs, as sketched below.
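Maximizing the likelihood of the ground-truth outputs amounts to minimizing the per-timestep cross-entropy; a sketch with random tensors standing in for real network outputs and labels.

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()                  # negative log-likelihood

logits = torch.randn(10, 101, requires_grad=True)  # z_t for T=10 timesteps
targets = torch.randint(0, 101, (10,))             # ground-truth y_t
loss = criterion(logits, targets)                  # mean NLL over timesteps
loss.backward()                                    # gradients reach both V and W
```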
SLIDE 19

Activity Recognition

• T individual frames are input to T convolutional networks, which are then connected to a single-layer LSTM with 256 hidden units.

• The CNN base of the LRCN is a hybrid of the Caffe reference model (a minor variant of AlexNet) and the network used by Zeiler & Fergus, pre-trained on the 1.2M-image ILSVRC-2012 classification training subset of the ImageNet dataset (a stand-in is sketched below).
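The paper's exact CNN base is a Caffe model; as a rough stand-in, a pre-trained AlexNet from torchvision (recent versions) plays the same role of an ImageNet-pretrained feature extractor.

```python
import torch
import torchvision.models as models

cnn = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
cnn.classifier = cnn.classifier[:-1]         # drop the 1000-way ImageNet head
features = cnn(torch.randn(1, 3, 224, 224))  # 4096-d fc7-like features
```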

SLIDE 20

Activity Recognition

• Two variants of the LRCN architecture are used: one in which the LSTM is placed after the first fully connected layer of the CNN (LRCN-fc6) and another in which the LSTM is placed after the second fully connected layer of the CNN (LRCN-fc7).

• Networks are trained with video clips of 16 frames. The LRCN predicts the video class at each timestep, and these predictions are averaged for the final classification (see the sketch below).

• Both RGB and optical-flow inputs are considered.
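A sketch of the LRCN-fc6 variant under these settings: fc6 features from 16 frames feed a single-layer, 256-unit LSTM, and the per-timestep class scores are averaged as above. Shapes are assumptions.

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=4096, hidden_size=256)  # single layer, 256 hidden units
head = nn.Linear(256, 101)                        # 101 UCF-101 classes

fc6 = torch.randn(16, 1, 4096)                    # 16-frame clip, batch of 1
h_seq, _ = lstm(fc6)                              # hidden state at each timestep
per_step = head(h_seq).softmax(dim=-1)            # class prediction per timestep
clip_label = per_step.mean(dim=0).argmax(-1)      # average, then classify
```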
SLIDE 21

Activity Recognition

[Figure: each of four frames passes through a CNN and then an LSTM; the per-timestep predictions (running, sitting, jumping, jumping) are averaged into the final clip label: jumping.]

SLIDE 22

Evaluation

The architecture is evaluated on the UCF-101 dataset, which consists of over 12,000 videos categorized into 101 human action classes.

SLIDE 23

Image Description

• In contrast to activity recognition, the static image description task requires only a single convolutional network.

• At each timestep, both the image features and the previous word are provided as inputs to the sequential model, in this case a stack of LSTMs (each with 1000 hidden units).

• The stack is used to learn the dynamics of the time-varying output sequence, natural language (one decoding step is sketched below).
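One decoding step of the captioner, sketched in PyTorch: the image features and the embedded previous word are concatenated as LSTM input. Vocabulary size and embedding width are assumptions, and this shows the unfactored layout (in the factored variant, the image features enter the second layer instead).

```python
import torch
import torch.nn as nn

vocab, embed_dim, feat_dim, hidden = 10000, 1000, 4096, 1000
embed = nn.Embedding(vocab, embed_dim)
lstm = nn.LSTM(input_size=embed_dim + feat_dim,
               hidden_size=hidden, num_layers=2)   # stack of LSTMs
word_head = nn.Linear(hidden, vocab)

img_feat = torch.randn(1, feat_dim)                # CNN features for one image
prev_word = torch.tensor([2])                      # e.g. the <BOS> token id
step_in = torch.cat([embed(prev_word), img_feat], dim=-1).unsqueeze(0)
out, state = lstm(step_in)                         # one timestep of decoding
next_word = word_head(out[-1]).argmax(-1)          # greedy next-word choice
```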
SLIDE 24

Image Description

[Figure: a CNN feeds a single-layer LSTM that generates the caption "a dog is jumping" word by word, from <BOS> to <EOS>.]

SLIDE 25

Image Description

[Figure: the same caption generator with a deeper stack of LSTMs; CNN features and the stacked LSTMs produce "a dog is jumping" from <BOS> to <EOS>.]

SLIDE 26

Image Description

[Figure: the two-layered, factored variant of the caption generator; the CNN features and the LSTM stack again produce "a dog is jumping" from <BOS> to <EOS>.]

SLIDE 27

Evaluation – Image Retrieval

• The model is trained on the combined training sets of the Flickr30k (28,000 training images) and COCO 2014 (80,000 training images) datasets.

• Results are reported on Flickr30k (1,000 images each for test and validation).

SLIDE 28

Evaluation – Image Retrieval

• Image retrieval results are reported for variants of the LRCN architecture.

SLIDE 29

Evaluation – Sentence Generation

• The BLEU (bilingual evaluation understudy) metric is used (a scoring sketch follows).

• Additionally, the authors report results on the then-new COCO 2014 dataset, which has 80,000 training images and 40,000 validation images.

• The authors isolate 5,000 images from the validation set for testing purposes, and results are reported on this split.
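BLEU compares n-gram overlap between a generated sentence and reference captions; NLTK's implementation is one readily available way to compute it (an assumption; the slide does not name the authors' scoring script).

```python
from nltk.translate.bleu_score import sentence_bleu

references = [["a", "dog", "is", "jumping"]]  # tokenized reference captions
hypothesis = ["a", "dog", "jumps"]            # tokenized generated caption
score = sentence_bleu(references, hypothesis, weights=(0.5, 0.5))  # BLEU-2
print(round(score, 3))
```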

SLIDE 30

Evaluation – Human Evaluation Rankings

Human evaluator rankings from 1 to 6 (lower is better), averaged for each method and criterion.

SLIDE 31

Image Description Results

SLIDE 32

Image Description Results

SLIDE 33

Image Description Results

SLIDE 34

Video Description

• Due to limitations of available video description datasets, the authors take a different path.

• They rely on more "traditional" activity and video recognition processing for the input and use LSTMs to generate a sentence.

• They assume predictions of the objects, subjects, and verbs present in the video are available from a CRF operating on the full video input (a sketch of this interface follows).
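A speculative sketch of feeding CRF predictions to the decoder: the (subject, verb, object) labels are encoded as a fixed vector and repeated at every timestep. The label-space size and the multi-hot layout are assumptions, not the paper's exact encoding.

```python
import torch
import torch.nn as nn

num_labels, hidden = 300, 1000                       # assumed detector label space
decoder = nn.LSTM(input_size=num_labels, hidden_size=hidden)

svo = torch.tensor([12, 45, 230])                    # predicted subject, verb, object ids
enc = torch.zeros(num_labels).scatter_(0, svo, 1.0)  # multi-hot encoding
T = 10                                               # decode T output words
dec_in = enc.expand(T, 1, num_labels)                # same encoding at every timestep
outputs, _ = decoder(dec_in)                         # feeds a word-prediction head
```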

SLIDE 35

Video Description

[Figure: pre-trained detector (CRF) predictions feed a stacked LSTM decoder, which generates "a dog is jumping" from <BOS> to <EOS>.]

SLIDE 36

LSTM Encoder & Decoder

Figure credit: supplementary material

SLIDE 37

LSTM Decoder with CRF Max

Figure credit: supplementary material

SLIDE 38

LSTM Decoder with CRF Prob.

Figure credit: supplementary material

SLIDE 39

Evaluation – Video Description

The architecture is evaluated on the TACoS multilevel dataset, which has 44,762 video/sentence pairs (about 40,000 for training/validation).

SLIDE 40

Video Description

Figure credit: supplementary material

SLIDE 41

Video Description

Figure credit: supplementary material

SLIDE 42

Conclusion

• LRCN is a flexible framework for vision problems involving sequences.

• Able to handle:
✔ Sequences in the input (video)
✔ Sequences in the output (natural language description)

SLIDE 43

Future Directions

Image credit: Hu, Ronghang, Marcus Rohrbach, and Trevor Darrell. "Segmentation from Natural Language Expressions." arXiv preprint arXiv:1603.06180 (2016).

SLIDE 44

Thank You!