SLIDE 1

Unsupervised Learning of Video Representations using LSTMs

Srivastava et al. University of Toronto

Presented by Shyam Tailor

SLIDE 2

The Overall Idea

  • Take a sequence of images and encode it into a fixed-size latent representation
  • Decode the latent representation back into a target sequence (a minimal sketch follows below)
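
To make this concrete, here is a minimal PyTorch sketch of the encoder/decoder shape. The layer sizes, flattened 64×64 frames, and single-layer LSTMs are illustrative assumptions, not the paper's exact architecture.

    import torch
    import torch.nn as nn

    class Seq2SeqAutoencoder(nn.Module):
        """LSTM encoder-decoder: the final encoder state is the fixed-size latent."""

        def __init__(self, frame_dim=64 * 64, hidden_dim=2048):
            super().__init__()
            self.encoder = nn.LSTM(frame_dim, hidden_dim, batch_first=True)
            self.decoder = nn.LSTM(frame_dim, hidden_dim, batch_first=True)
            self.to_frame = nn.Linear(hidden_dim, frame_dim)

        def forward(self, frames, target_len):
            # frames: (batch, time, frame_dim); (h, c) is the fixed-size latent.
            _, (h, c) = self.encoder(frames)
            inp = torch.zeros(frames.size(0), 1, frames.size(2), device=frames.device)
            outputs = []
            for _ in range(target_len):
                out, (h, c) = self.decoder(inp, (h, c))
                outputs.append(self.to_frame(out))
            return torch.cat(outputs, dim=1)  # (batch, target_len, frame_dim)
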
SLIDE 3

What Should The Latent Representation Encode?

  • Significant redundancy between frames
  • Three things that seem reasonable to encode:
      • Background
      • Objects
      • Motion
SLIDE 4

The Target Sequence

  • Predicting the future
  • Reconstruction (in reverse!)

SLIDE 5

Why Reverse The Reconstruction?

  • Idea – latent representation is like a stack
  • Encoder pushes on and decoder pops off

[Diagram: encoder pushing frames onto the latent "stack"; decoder popping them off]

SLIDE 6

Future Prediction

  • To do well, the latent representation must encode the objects and how they're moving (see the sketch of the two targets below)
  • Note: this puts subtly different requirements on the encoder!
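
To show how the two target sequences relate to the input, here is a hedged sketch of target construction; the even split of the clip and the tensor shapes are assumptions for illustration, not the paper's exact sequence lengths.

    import torch

    def make_targets(clip):
        """clip: (batch, 2*T, frame_dim) -> encoder input plus the two targets."""
        T = clip.size(1) // 2
        inputs = clip[:, :T]                  # frames the encoder sees
        recon_target = inputs.flip(dims=[1])  # reconstruct the input, reversed
        future_target = clip[:, T:]           # predict the frames that follow
        return inputs, recon_target, future_target
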
SLIDE 7

Conditioning the Decoder

  • A small detail – the decoder can be conditioned on the previously generated frame (sketched below)
  • Not really important, but it improves results a little.
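
A sketch of what that conditioning looks like in the decoder unroll; this is a hedged variant of the decoder from the earlier sketch, where the `conditioned` flag (an illustrative name) toggles between feeding back the generated frame and feeding zeros.

    import torch
    import torch.nn as nn

    def decode(decoder: nn.LSTM, to_frame: nn.Linear, state, target_len, conditioned=True):
        """Unroll a decoder from the latent state (h, c), one frame per step."""
        h, c = state
        inp = torch.zeros(h.size(1), 1, to_frame.out_features, device=h.device)
        frames = []
        for _ in range(target_len):
            out, (h, c) = decoder(inp, (h, c))
            frame = to_frame(out)
            frames.append(frame)
            if conditioned:
                inp = frame  # the conditioning: the next step sees the frame just generated
        return torch.cat(frames, dim=1)
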
SLIDE 8

Combining the Tasks

  • The two tasks alone aren't good enough (the combined objective is sketched below)
  • Why?
      • Reconstruction requires memorisation, but doesn't require the encoding to be useful for predicting the future
      • Future prediction doesn't incentivise keeping frames from the past
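
A sketch of the composite objective, reusing make_targets and decode from the earlier sketches: one encoder, two decoders, losses summed. The function names, equal weighting, and plain squared-error loss are assumptions, not necessarily the paper's exact setup.

    import torch.nn.functional as F

    def composite_loss(encoder, recon_decoder, future_decoder, to_frame, clip):
        """One encoder, two decoders; both tasks trained jointly."""
        inputs, recon_target, future_target = make_targets(clip)
        _, state = encoder(inputs)  # the shared latent representation
        recon = decode(recon_decoder, to_frame, state, recon_target.size(1))
        future = decode(future_decoder, to_frame, state, future_target.size(1))
        return F.mse_loss(recon, recon_target) + F.mse_loss(future, future_target)
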
SLIDE 9

An Experiment with MNIST

SLIDE 10

Trying Natural Images

SLIDE 11

Zooming In

SLIDE 12

“Designing a loss function that respects our notion of visual similarity is a very hard problem”

True… Let’s return to this at the end.

SLIDE 13

Seeding a Classifier with the Encoder

  • Going to do human action recognition on some video datasets (UCF-101, HMDB-51)
  • Is initializing with the encoder weights better than starting from random? (the seeding step is sketched below)
  • What if the encoder is trained on unrelated videos?
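
A sketch of the "seeding" step, reusing the Seq2SeqAutoencoder sketch from earlier. The classifier architecture here is an assumption, chosen only to make the weight copy concrete; it is not the paper's classifier.

    import torch.nn as nn

    class ActionClassifier(nn.Module):
        """LSTM over frames, classifying from the final hidden state."""

        def __init__(self, frame_dim=64 * 64, hidden_dim=2048, num_classes=101):
            super().__init__()
            self.lstm = nn.LSTM(frame_dim, hidden_dim, batch_first=True)
            self.head = nn.Linear(hidden_dim, num_classes)

        def forward(self, frames):
            _, (h, _) = self.lstm(frames)
            return self.head(h[-1])  # logits over action classes

    classifier = ActionClassifier()
    pretrained = Seq2SeqAutoencoder()  # in practice, loaded from a checkpoint
    # Seed the classifier's LSTM with the unsupervised encoder's weights,
    # then fine-tune on the labelled action-recognition data.
    classifier.lstm.load_state_dict(pretrained.encoder.state_dict())
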

SLIDE 14

Results of Pretraining

  • Encoder features transfer well and yield accuracy improvements
  • Especially pronounced with a small dataset
  • Pretraining on random YouTube videos instead of related data doesn’t change the accuracy!
SLIDE 15

Does the Encoding Really Have a Concept of Motion?

  • Instead of RGB images, it’s possible to train on the optical flow vectors (see the sketch below)
  • Pretraining is significantly less effective in this regime.

[Image: optical flow field. Photo credit: MathWorks]
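
For concreteness, one common way to produce such flow inputs is OpenCV's Farneback dense optical flow; the slide doesn't specify a flow method, so treat this choice and its parameters as assumptions.

    import cv2  # OpenCV, for dense optical flow

    def flow_sequence(gray_frames):
        """gray_frames: list of HxW uint8 arrays -> list of HxWx2 flow fields."""
        flows = []
        for prev, nxt in zip(gray_frames, gray_frames[1:]):
            flow = cv2.calcOpticalFlowFarneback(
                prev, nxt, None,
                pyr_scale=0.5, levels=3, winsize=15,
                iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
            flows.append(flow)  # per-pixel (dx, dy) motion between the two frames
        return flows
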

SLIDE 16

Authors’ Conclusions

  • Great qualitative performance on the moving MNIST dataset – but falls over on natural images
  • Nevertheless, pretraining for natural images seems to have some effect
  • The encoding seems to capture a stronger notion of optical flow
SLIDE 17

Discussion: How do you make your frame predictions less blurry?

  • One idea is to use an adversarial loss (sketched below).
  • Liang et al. (2017) tried this; their embedding was also great for pretraining on UCF-101
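
A generic sketch of how an adversarial term is mixed into the prediction loss. This is a textbook GAN-style objective, not Liang et al.'s specific dual-motion architecture; `discriminator` and `adv_weight` are assumed names and values.

    import torch
    import torch.nn.functional as F

    def generator_loss(pred_frames, target_frames, discriminator, adv_weight=0.05):
        # The regression term keeps predictions close to the target...
        mse = F.mse_loss(pred_frames, target_frames)
        # ...while the adversarial term rewards predictions the discriminator
        # judges "real", which tends to sharpen otherwise blurry MSE output.
        logits = discriminator(pred_frames)
        adv = F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
        return mse + adv_weight * adv
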

SLIDE 18

Discussion: Interpreting the Encoding

  • Is there any form of interpretability?
  • Examples:
      • Are encodings of motion, objects and background merged together or distinct?
      • Is it possible to extract specific objects from the encoding?
SLIDE 19

Discussion: What About Regularisation?

  • The authors saw no difference between pretraining on YouTube videos and on the activity-recognition data itself – how much does domain matter?
  • Is it possible to use a VAE by reframing the problem? (a sketch of the idea follows below)
      • See “Learning to Decompose and Disentangle Representations for Video Prediction” by Hsieh et al.
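
A sketch of the minimal change a VAE reframing implies: replace the deterministic latent with a sampled one and regularise with a KL term. Sizes and placement are assumptions, and Hsieh et al.'s model is considerably more structured than this.

    import torch
    import torch.nn as nn

    class VariationalLatent(nn.Module):
        """Map a deterministic encoder state to a sampled latent plus a KL term."""

        def __init__(self, hidden_dim=2048, latent_dim=256):
            super().__init__()
            self.to_mu = nn.Linear(hidden_dim, latent_dim)
            self.to_logvar = nn.Linear(hidden_dim, latent_dim)

        def forward(self, h):
            mu, logvar = self.to_mu(h), self.to_logvar(h)
            z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterisation trick
            kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
            return z, kl  # add `kl` (weighted) to the reconstruction/prediction loss
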

SLIDE 20

References

  • 1. Liang, Xiaodan, et al. “Dual Motion GAN for Future-Flow Embedded Video Prediction.” Proceedings of the IEEE International Conference on Computer Vision, 2017.
  • 2. Hsieh, Jun-Ting, et al. “Learning to Decompose and Disentangle Representations for Video Prediction.” Advances in Neural Information Processing Systems, 2018.