A Comprehensive Survey
on Deep Future Frame
Video Prediction
Final Master Thesis, Master in Artificial Intelligence, by
Javier Selva Castelló
Supervised by Sergio Escalera Guerrero and Marc Oliu Simón
Future Frame Prediction
Given a video sequence, generate the next frames.
Input → Predictor Model → Output
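The task can be sketched as a function from observed frames to future frames. A minimal NumPy baseline (hypothetical, simply repeating the last frame) makes the interface concrete; learned models replace the body with a trained predictor:

```python
import numpy as np

def predict_next_frames(frames: np.ndarray, n_future: int) -> np.ndarray:
    """Trivial baseline predictor: repeat the last observed frame.

    frames: array of shape (t, h, w) holding the observed sequence.
    Returns an array of shape (n_future, h, w).
    """
    last = frames[-1]
    return np.repeat(last[None, ...], n_future, axis=0)
```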
2
Background
Unsupervised learning based on autoencoders:
Source: Understanding Autoencoders
3
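To illustrate the autoencoder idea, here is a minimal NumPy sketch of one forward pass through a linear encoder-decoder pair; the sizes and weight names are illustrative, not from the thesis:

```python
import numpy as np

rng = np.random.default_rng(0)

def autoencoder_forward(x, W_enc, W_dec):
    """One forward pass of a simple autoencoder.

    x:     (d,) input vector (e.g. a flattened frame).
    W_enc: (k, d) encoder weights, with k < d (the bottleneck).
    W_dec: (d, k) decoder weights.
    Returns (code, reconstruction).
    """
    code = np.tanh(W_enc @ x)   # compressed representation
    recon = W_dec @ code        # reconstruction of the input
    return code, recon

d, k = 16, 4                    # input size and bottleneck size
W_enc = rng.normal(scale=0.1, size=(k, d))
W_dec = rng.normal(scale=0.1, size=(d, k))
x = rng.normal(size=d)
code, recon = autoencoder_forward(x, W_enc, W_dec)
```

Training would minimize a reconstruction loss such as ‖x − recon‖²; the bottleneck forces the network to learn compact unsupervised features.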
Learning Unsupervised Features
4
Applications
Video understanding:
5
Structure of the Presentation
◉ Fundamentals.
◉ Training techniques.
◉ Loss functions.
◉ Measuring prediction error.
◉ Models and main trends.
◉ Experiments.
◉ Results.
◉ Discussion.
◉ Conclusions and future work.
6
Fundamentals (I)
Convolutional Neural Networks (CNN)
Convolution [1] Deconvolution [2] Convolutional Autoencoder with Pooling layers [3]
[1] Intel Labs, Bringing Parallelism to the Web with River Trail, http://intellabs.github.io/RiverTrail/tutorial/ [2] Vincent Dumoulin, Convolution arithmetic: https://github.com/vdumoulin/conv_arithmetic [3] H. Noh, S. Hong, and Bohyung Han. Learning deconvolution network for semantic segmentation. ICCV (2015).
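A minimal NumPy sketch of the convolution operation used in CNN layers ("valid" cross-correlation, stride 1, no padding; written with explicit loops for clarity rather than speed):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """'Valid' 2D cross-correlation, as computed by a CNN layer.

    image:  (ih, iw) single-channel input.
    kernel: (kh, kw) learned filter.
    Returns an (ih-kh+1, iw-kw+1) feature map.
    """
    ih, iw = image.shape
    kh, kw = kernel.shape
    oh, ow = ih - kh + 1, iw - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Elementwise product of the kernel with the local patch.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out
```

A deconvolution ("transposed convolution") layer reverses the shape change, mapping the small feature map back to the input resolution.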
7
Fundamentals (II)
Long Short-Term Memory (LSTM)
An LSTM cell [1] Unrolled LSTM Network [2]
[1] A. Graves, A. R. Mohamed, and G. Hinton. Wikimedia commons: Peephole long short-term memory, 2017. [2] C. Olah. Understanding LSTM networks, 2015.
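One step of a standard LSTM cell (without peepholes) can be sketched in NumPy as follows; stacking all four gates into one weight matrix is an illustrative layout choice:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One step of a standard LSTM cell.

    x:      (d,) input at the current time step.
    h_prev: (n,) previous hidden state; c_prev: (n,) previous cell state.
    W:      (4n, d+n) stacked gate weights; b: (4n,) stacked biases.
    Returns the new (h, c).
    """
    n = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b
    i = sigmoid(z[0 * n:1 * n])   # input gate
    f = sigmoid(z[1 * n:2 * n])   # forget gate
    o = sigmoid(z[2 * n:3 * n])   # output gate
    g = np.tanh(z[3 * n:4 * n])   # candidate cell update
    c = f * c_prev + i * g        # new cell state
    h = o * np.tanh(c)            # new hidden state
    return h, c
```

Unrolling this step over a frame sequence gives the recurrent networks used throughout the surveyed models.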
8
Fundamentals (III)
Generative Adversarial Networks (GAN)
[1] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
Source: Generative Adversarial Networks
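The adversarial objective from [1] can be illustrated numerically: given the discriminator's scores on real and generated frames, the two losses are (a sketch using the common non-saturating generator loss):

```python
import numpy as np

def gan_losses(d_real, d_fake, eps=1e-8):
    """GAN losses for discriminator outputs in (0, 1).

    d_real: D's scores on real frames; d_fake: D's scores on generated frames.
    Returns (discriminator_loss, generator_loss), both to be minimized.
    """
    d_real = np.asarray(d_real)
    d_fake = np.asarray(d_fake)
    # D wants real -> 1 and fake -> 0.
    d_loss = -np.mean(np.log(d_real + eps) + np.log(1.0 - d_fake + eps))
    # G wants fake -> 1 (non-saturating form of the minimax objective).
    g_loss = -np.mean(np.log(d_fake + eps))
    return d_loss, g_loss
```

Training alternates gradient steps on the two losses; for video prediction, D judges whether a predicted frame (or sequence) looks real.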
9
Source: HeuriTech Blog
Improved Training
10
Loss Functions
Distance losses (blurry). Other common losses:
◉ Gradient Difference Loss (GDL).
◉ Adversarial loss, to ensure sharp predictions.
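The GDL term can be sketched in NumPy as follows (a simplified single-image version; the α exponent and the summation over horizontal and vertical gradients follow the usual formulation):

```python
import numpy as np

def gdl(pred, target, alpha=1):
    """Gradient Difference Loss between two (h, w) images.

    Penalizes differences between the image gradients of prediction and
    target, which sharpens edges that plain distance losses tend to blur.
    """
    dy_p = np.abs(np.diff(pred, axis=0))    # vertical gradients
    dy_t = np.abs(np.diff(target, axis=0))
    dx_p = np.abs(np.diff(pred, axis=1))    # horizontal gradients
    dx_t = np.abs(np.diff(target, axis=1))
    return (np.sum(np.abs(dy_p - dy_t) ** alpha)
            + np.sum(np.abs(dx_p - dx_t) ** alpha))
```

A blurry prediction has weak gradients at object boundaries, so its GDL against a sharp target is large even when its pixelwise distance is small.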
11
Measuring Results
From a given sequence and correct movement dynamics, multiple futures are possible.
12
Two complementary goals: compare against the ground truth, and obtain realistic-looking sequences. Common measures:
◉ Inception metric.
◉ Human evaluation.
◉ Application to other tasks, e.g. playing Atari games. [1]
[1] J. Oh, X. Guo, H. Lee, R. L. Lewis, and S. Singh. Action-Conditional Video Prediction using Deep Networks in Atari Games. In NIPS, 2015.
Models
13
Models (I)
Simple non-recurrent proposals
[1] [2] [3]
[1] R. Goroshin, M. Mathieu, and Y. LeCun. Learning to linearize under uncertainty. NIPS 2015. [2] M. Zhao, C. Zhuang, Y. Wang, and T. Sing Lee. Predictive encoding of contextual relationships for perceptual inference, interpolation and prediction. In ICLR’15, 2014. [3] Y. Zhou and T. L. Berg. Learning temporal transformations from time-lapse videos. In ECCV, 2016.
14
Models (II)
Predict filter which is applied to last input frame(s)
[1] [2]
[1] Z. Liu, R. Yeh, X. Tang, Y. Liu, and A. Agarwala. Video frame synthesis using deep voxel flow. In ICCV, 2017. [2] C. Vondrick and A. Torralba. Generating the future with adversarial transformers, 2017.
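The filter-predicting idea behind these models can be illustrated with a NumPy sketch that applies a (hypothetically already predicted) per-pixel 3×3 kernel to the last input frame:

```python
import numpy as np

def apply_dynamic_filters(frame, filters):
    """Apply a predicted 3x3 filter at every pixel of the last input frame.

    frame:   (h, w) last observed frame.
    filters: (h, w, 3, 3) per-pixel kernels, as output by the
             filter-predicting network (often softmax-normalized).
    Returns the transformed (h, w) frame.
    """
    h, w = frame.shape
    padded = np.pad(frame, 1, mode="edge")
    out = np.zeros_like(frame, dtype=float)
    for i in range(h):
        for j in range(w):
            # Weighted sum of the local neighborhood with the local kernel.
            out[i, j] = np.sum(padded[i:i + 3, j:j + 3] * filters[i, j])
    return out
```

Because the output reuses input pixels instead of synthesizing them from scratch, such models tend to preserve appearance and mainly have to get the motion right.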
15
Models (III)
Predict filter which is applied to last input frame(s) (recurrent)
[1] B. De Brabandere, X. Jia, T. Tuytelaars, and L. Van Gool. Dynamic filter networks. In NIPS, 2016. [2] V. Pătrăucean, A. Handa, and R. Cipolla. Spatio-temporal video autoencoder with differentiable memory. ICLR Workshop, 2016.
16
Models (IV)
Predict at some feature level, then generate future frame.
[1] [2]
[1] R. Villegas, J. Yang, Y. Zou, S. Sohn, X. Lin, and H. Lee. Learning to generate long-term future via hierarchical prediction. 2017. [2] J. R. van Amersfoort, A. Kannan, M.'A. Ranzato, A. Szlam, D. Tran, and S. Chintala. Transformation-based models of video sequences. CoRR, 2017.
17
Models (V)
Explicit separation of content and motion.
[1] [2]
[1] X. Liang, L. Lee, W. Dai, and E. P. Xing. Dual motion gan for future-flow embedded video prediction. In ICCV, 2017. [2] E. L. Denton and V. Birodkar. Unsupervised learning of disentangled representations from video. In NIPS, 2017.
18
Models (VI)
Others.
[1] [2] [3]
[1] N. Kalchbrenner, A. van den Oord, K. Simonyan, I. Danihelka, O. Vinyals, A. Graves, and K. Kavukcuoglu. Video pixel networks. CoRR, 2016. [2] J. Oh, X. Guo, H. Lee, R. L. Lewis, and S. Singh. Action-Conditional Video Prediction using Deep Networks in Atari Games. In NIPS, 2015. [3] F. Cricri, X. Ni, M. Honkala, E. Aksu, and M. Gabbouj. Video ladder networks. CoRR, 2016.
19
Tested Models
The following models were evaluated on the proposed datasets.
20
Tested Model (I)
Srivastava
LSTM encoder-decoder; a composite variant both reconstructs the input and generates future predictions.
[1] N. Srivastava, E. Mansimov, and R. Salakhudinov. Unsupervised learning of video representations using LSTMs. In ICML, 2015.
21
Tested Model (II)
Mathieu
[1] M. Mathieu, C. Couprie, and Y. LeCun. Deep multi-scale video prediction beyond Mean Square Error. 2015.
22
Tested Model (III)
Finn
Predicts transforms at pixel level.
Compositing masks provide a soft object separation.
[1] C. Finn, I. Goodfellow, and S. Levine. Unsupervised learning for physical interaction through video prediction. In NIPS, 2016.
23
Tested Model (IV)
Lotter
[1] W. Lotter, G. Kreiman, and D. Cox. Deep predictive coding networks for video prediction and unsupervised learning. ICLR, 2016.
Predictive coding: each layer forwards its prediction mistakes.
Errors are used to update the predictor state.
Each layer keeps predictions, errors and targets.
24
Tested Model (V)
Villegas
Skip connections from encoder to decoder.
[1] R. Villegas, J. Yang, S. Hong, X. Lin, and H. Lee. Decomposing motion and content for natural video sequence prediction. 2017.
CNN + LSTM (Motion).
CNN (Content).
Adversarial and GDL.
25
Tested Model (VI)
Oliu
[1] M. Oliu, J. Selva, and S. Escalera. Folded recurrent neural networks for future video prediction, 2017.
Encoder and decoder with shared weights:
the same recurrent layers encode the input sequence
and decode the predicted sequence.
26
Experimental Setting
Sampling: 5 input frames, predict the next 10.
Metrics: PSNR and DSSIM.
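Both metrics can be sketched in NumPy. Note that the DSSIM below is a simplified global version (SSIM computed over the whole image rather than over local windows), so its values will differ from windowed implementations:

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak Signal-to-Noise Ratio in dB (higher is better)."""
    mse = np.mean((pred - target) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)

def dssim(pred, target, max_val=1.0):
    """Structural dissimilarity, DSSIM = (1 - SSIM) / 2 (lower is better)."""
    c1, c2 = (0.01 * max_val) ** 2, (0.03 * max_val) ** 2
    mu_p, mu_t = pred.mean(), target.mean()
    var_p, var_t = pred.var(), target.var()
    cov = ((pred - mu_p) * (target - mu_t)).mean()
    ssim = (((2 * mu_p * mu_t + c1) * (2 * cov + c2))
            / ((mu_p ** 2 + mu_t ** 2 + c1) * (var_p + var_t + c2)))
    return (1.0 - ssim) / 2.0
```

PSNR rewards low pixelwise error (and thus can favor blurry outputs), while DSSIM compares local structure, which is why both are reported.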
27
Datasets (I)
Moving MNIST
28
[1]
[1] N. Srivastava, E. Mansimov, and R. Salakhudinov. Unsupervised learning of video representations using LSTMs. In ICML, 2015.
Datasets (II)
KTH
4 different settings.
Resized (64 x 80) and grayscale.
Homogeneous background.
29
[1]
[1] C. Schuldt, I. Laptev, and B. Caputo. Recognizing human actions: A local SVM approach. In ICPR, 2004.
Datasets (III)
UCF101
Real videos with cluttered backgrounds and camera motion.
Test: 3361 seq.
30
[1]
[1] K. Soomro, A. R. Zamir, and M. Shah. UCF101: Action Recognition dataset, 2011.
Quantitative Results (I)
31
[Table: PSNR and DSSIM results on Moving MNIST, KTH and UCF101 for Srivastava, Mathieu, Lotter, Finn, Villegas and Oliu]
32
Qualitative MMNIST Results (I)
[Figure: 5 input frames, 10 predicted. Rows: Ground Truth, Srivastava, Mathieu, Finn, Oliu]
33
Qualitative MMNIST Results (II)
34
Qualitative KTH Results (I)
[Figure: 5 input frames, 10 predicted. Rows: Ground Truth, Srivastava, Mathieu, Finn, Oliu, Villegas, Lotter]
35
Qualitative KTH Results (II)
36
Qualitative UCF101 Results (I)
[Figure: 5 input frames, 10 predicted. Rows: Ground Truth, Srivastava, Mathieu, Finn, Oliu, Villegas, Lotter]
37
Qualitative UCF101 Results (II)
38
Discussion
39
Conclusion
Future work:
40
[1] M. Oliu, J. Selva, and S. Escalera. Folded recurrent neural networks for future video prediction, 2017.
Thank you!