Videos
Saurabh Gupta
CS 543 / ECE 549 Computer Vision, Spring 2020
Outline
- Optical Flow
- Tracking
- Correspondence
- Recognition in Videos
Optical Flow
- Data / Supervision
- Architecture
Datasets
- Traditional datasets: Yosemite, Middlebury
- KITTI: http://www.cvlibs.net/datasets/kitti/eval_scene_flow.php?benchmark=flow
- Sintel: http://sintel.is.tue.mpg.de/
- Synthetic Datasets
- Flying Chairs et al.: https://lmb.informatik.uni-freiburg.de/resources/datasets/FlyingChairs.en.html
- Supervision: from Simulation
- Metrics: End-point Error
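End-point error is the Euclidean distance between the predicted and ground-truth flow vectors, averaged over pixels. A minimal NumPy sketch (the array shapes are an assumption; benchmarks also report variants such as Fl-all, the percentage of outlier pixels):

```python
import numpy as np

def endpoint_error(flow_pred, flow_gt):
    """Average end-point error (AEPE): mean Euclidean distance between
    predicted and ground-truth flow vectors, in pixels.
    flow_pred, flow_gt: float arrays of shape (H, W, 2) holding (u, v)."""
    diff = flow_pred - flow_gt
    epe = np.sqrt((diff ** 2).sum(axis=-1))  # per-pixel L2 distance, shape (H, W)
    return epe.mean()
```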
“Classical Optical Flow Pipeline”
PWC-Net
$$\mathrm{cv}^l(\mathbf{x}_1, \mathbf{x}_2) = \frac{1}{N}\left(\mathbf{c}_1^l(\mathbf{x}_1)\right)^{\top} \mathbf{c}_w^l(\mathbf{x}_2),$$

where $N$ is the length of the column vector $\mathbf{c}_1^l(\mathbf{x}_1)$, $\mathbf{c}_1^l$ are the first image's features at level $l$, and $\mathbf{c}_w^l$ are the warped second-image features.
Models Matter, So Does Training: An Empirical Study of CNNs for Optical Flow Estimation. Deqing Sun, Xiaodong Yang, Ming-Yu Liu, and Jan Kautz. arXiv 2018.
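A minimal sketch of this partial cost volume in PyTorch, assuming `c1` and the warped second-image features `c2w` are (B, N, H, W) tensors and a small search range (PWC-Net uses a limited range at each pyramid level, which is what lets it handle large motions cheaply):

```python
import torch
import torch.nn.functional as F

def cost_volume(c1, c2w, max_disp=4):
    """Correlation cost volume, as in the equation above:
    cv(x1, x2) = (1/N) * c1(x1)^T c2w(x2) for every offset within max_disp.
    c1, c2w: (B, N, H, W) feature maps (c2w = warped second-image features).
    Returns (B, (2*max_disp+1)**2, H, W)."""
    B, N, H, W = c1.shape
    c2w = F.pad(c2w, [max_disp] * 4)  # pad H and W so shifted crops stay in bounds
    vols = []
    for dy in range(2 * max_disp + 1):
        for dx in range(2 * max_disp + 1):
            shifted = c2w[:, :, dy:dy + H, dx:dx + W]
            vols.append((c1 * shifted).sum(dim=1) / N)  # per-pixel dot product / N
    return torch.stack(vols, dim=1)
```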
PWC-Net
[Figure: Qualitative results on frames 5 and 6 of Sintel "Ambush 3" (test, final): W/o context, W/o DenseNet, PWC-Net, and PWC-Net-Sintel-ft.]
(b) Cost volume ablation. Removing the cost volume (max. disp. 0) results in moderate performance loss; PWC-Net can handle large motion using a small search range to compute the cost volume.

Max. Disp.     | Chairs | Sintel Clean | Sintel Final | KITTI 2012 AEPE | KITTI 2012 Fl-all | KITTI 2015 AEPE | KITTI 2015 Fl-all
0              | 2.13   | 3.66         | 5.09         | 5.25            | 29.82%            | 13.85           | 43.52%
2              | 2.09   | 3.30         | 4.50         | 5.26            | 25.99%            | 13.67           | 38.99%
Full model (4) | 2.00   | 3.33         | 4.59         | 5.14            | 28.67%            | 13.20           | 41.79%
6              | 1.97   | 3.31         | 4.60         | 4.96            | 27.05%            | 12.97           | 40.94%
Flying Chairs Dataset
[Figure: FlyingChairs generation pipeline. Random sampling of an object prototype and a background prototype; an initial object transform and an initial background transform; an object motion transform and a background motion transform. Outputs: first frame, second frame, optical flow.]
AEPE by training data (rows) and test data (columns):

Training data  | Sintel | KITTI2015 | FlyingChairs
Sintel         | 6.42   | 18.13     | 5.49
FlyingChairs   | 5.73   | 16.23     | 3.32
FlyingThings3D | 6.64   | 18.31     | 5.21
Monkaa         | 8.47   | 16.17     | 7.08
Driving        | 10.95  | 11.09     | 9.88
- "FlyingChairs" (synth.): Dosovitskiy et al. (2015)
- "FlyingThings3D" (synth.): Mayer et al. (2016)
- "Monkaa" (synth.): Mayer et al. (2016)
- "Virtual KITTI" (synth.): Gaidon et al. (2016)
Tracking
- Problem Statements
- Tracking by Detection
- General Object Tracking
Problem Statements
- Single Object Tracking (e.g., https://nanonets.com/blog/content/images/2019/07/messi_football_track.gif)
- Multi-object Tracking (e.g., https://motchallenge.net/vis/MOT20-02/gt/)
- Multi-object Tracking and Segmentation (e.g., https://www.youtube.com/watch?v=K38_pZw_P9s)
Tracking by Detection
[Figure 2.2: Tracking-by-detection paradigm. A video sequence is fed to a detector that produces detections per frame; a tracker then performs data association to yield final trajectories.] First, an independent detector is applied to all image frames to obtain likely pedestrian detections. Second, a tracker is run on the set of detections to perform data association, i.e., link the detections to obtain full trajectories.
Source: Laura Leal-Taixé
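The data-association step can be as simple as greedy IoU matching between the previous frame's tracks and the current frame's detections. A hypothetical sketch (real trackers add motion models, appearance cues, and track birth/death management):

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(tracks, detections, thresh=0.5):
    """Greedily link each existing track box to its best-overlapping detection."""
    matches, unmatched = [], set(range(len(detections)))
    for ti, t in enumerate(tracks):
        best = max(unmatched, key=lambda di: iou(t, detections[di]), default=None)
        if best is not None and iou(t, detections[best]) >= thresh:
            matches.append((ti, best))
            unmatched.remove(best)  # a detection can extend at most one track
    return matches, unmatched       # unmatched detections may start new tracks
```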
Tracking by Detection
Strike a Pose! Tracking People by Learning Their Appearance. D. Ramanan et al., PAMI 2007
General Object Tracking
Learning to Track at 100 FPS with Deep Regression Networks. D. Held et al., ECCV16.
[Figure: GOTURN architecture. "What to track" is cropped from the previous frame and a search region from the current frame; each crop passes through conv layers, and fully-connected layers regress the predicted location of the target within the search region.]
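A rough sketch of this regression design (layer sizes are placeholders, not the paper's CaffeNet trunk, and sharing the conv trunk between the two crops is a simplification here):

```python
import torch
import torch.nn as nn

class RegressionTracker(nn.Module):
    """GOTURN-style sketch: conv features of the previous-frame target crop and
    the current-frame search region are concatenated, then fully-connected
    layers regress the target's box within the search region."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(                    # simplified conv trunk
            nn.Conv2d(3, 64, 7, stride=2), nn.ReLU(),
            nn.Conv2d(64, 128, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(6),
        )
        self.fc = nn.Sequential(
            nn.Linear(2 * 128 * 6 * 6, 1024), nn.ReLU(),
            nn.Linear(1024, 4),                       # (x1, y1, x2, y2) in search-region coords
        )

    def forward(self, target_crop, search_region):
        f1 = self.conv(target_crop).flatten(1)
        f2 = self.conv(search_region).flatten(1)
        return self.fc(torch.cat([f1, f2], dim=1))
```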
Correspondence in Time
- Optical Flow (pixel-level, short-range): supervision from synthetic data
- Tracking (box-level, long-range): supervision from human annotations
- Middle Ground (mid-level, long-range): self-supervised / unsupervised learning
Source: Xiaolong Wang
Learning to Track
How to obtain supervision?
ℱ: a deep tracker
Source: Xiaolong Wang
Supervision: Cycle-Consistency in Time
Track backward; then track forward, back to the future.
Source: Xiaolong Wang
Backpropagation through time, along the cycle
Supervision: Cycle-Consistency in Time
Source: Xiaolong Wang
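A sketch of this training signal, assuming a hypothetical `tracker(patch, frame)` interface that returns the next patch and a differentiable box; the loss backpropagates through the whole cycle (backpropagation through time):

```python
import torch.nn.functional as F

def cycle_loss(tracker, patch, start_box, frames):
    """Cycle-consistency sketch: track the patch backward through `frames`,
    then forward again, and penalize drift from the starting box."""
    cur_patch, cur_box = patch, start_box
    for f in reversed(frames):              # track backward in time
        cur_patch, cur_box = tracker(cur_patch, f)
    for f in frames:                        # track forward, back to the future
        cur_patch, cur_box = tracker(cur_patch, f)
    return F.mse_loss(cur_box, start_box)   # the cycle should return home
```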
Multiple Cycles
Sub-cycles: a natural curriculum
Source: Xiaolong Wang
Multiple Cycles
Shorter cycles: a natural curriculum
Source: Xiaolong Wang
Tracker ℱ
Densely match features in a learned feature space φ.
[Figure: the patch at time t and the frame at time t+1 are encoded by a shared encoder φ; a correlation filter localizes the patch in the next frame, and a new patch is cropped there.]
Source: Xiaolong Wang
Visualization of Training
Source: Xiaolong Wang
Test Time: Nearest Neighbors in Feature Space
[Figure: features of frames u−1 and u are computed with the encoder φ; correspondences are found by nearest-neighbor lookup in feature space.]
Source: Xiaolong Wang
Evaluation: Label Propagation
Source: Xiaolong Wang
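Label propagation can be sketched as soft attention in the learned feature space; this assumes L2-normalized features and one-hot labels (actual implementations typically restrict matches spatially and propagate from several reference frames):

```python
import torch

def propagate_labels(feat_prev, feat_cur, labels_prev, temperature=0.07):
    """Propagate per-pixel labels from frame u-1 to frame u via soft
    nearest neighbors in feature space.
    feat_*: (C, H, W) L2-normalized features; labels_prev: (K, H, W) one-hot."""
    C, H, W = feat_prev.shape
    f1 = feat_prev.reshape(C, -1)                          # (C, HW_prev)
    f2 = feat_cur.reshape(C, -1)                           # (C, HW_cur)
    aff = torch.softmax(f2.t() @ f1 / temperature, dim=1)  # (HW_cur, HW_prev)
    lab = labels_prev.reshape(labels_prev.shape[0], -1)    # (K, HW_prev)
    return (aff @ lab.t()).t().reshape(-1, H, W)           # (K, H, W) propagated
```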
Instance Mask Tracking
DAVIS Dataset
DAVIS Dataset: Pont-Tuset et al. The 2017 DAVIS Challenge on Video Object Segmentation. 2017.
Source: Xiaolong Wang
Pose Keypoint Tracking
JHMDB Dataset
Source: Xiaolong Wang
Comparison
Our Correspondence vs. Optical Flow
Source: Xiaolong Wang
Texture Tracking
DAVIS Dataset
DAVIS Dataset: Pont-Tuset et al. The 2017 DAVIS Challenge on Video Object Segmentation. 2017.
Source: Xiaolong Wang
Semantic Mask Tracking
Video Instance Parsing Dataset
Zhou et al. Adaptive Temporal Encoding Network for Video Instance-level Human Parsing. ACM MM 2018.
Source: Xiaolong Wang
Outline
- Optical Flow
- Tracking
- Correspondence
- Recognition in Videos
- Tasks
- Datasets
- Models
- Applications
Recognition in Videos
- Tasks / Datasets
- Models
Tasks and Datasets
- Action Classification
- Kinetics Dataset: https://arxiv.org/pdf/1705.06950.pdf
- ActivityNet, Sports-1M, …
- Action “Detection”
- In space, in time. E.g., JHMDB, AVA
Tasks and Datasets
- Time scale
- Atomic Visual Actions (AVA) Dataset: https://research.google.com/ava/explore.html
- Bias
- Something-Something Dataset: https://20bn.com/datasets/something-something

We don't quite know how to define good, meaningful tasks for videos. More on this later.
Models
- Recurrent Neural Nets (see: https://colah.github.io/posts/2015-08-Understanding-LSTMs/)
- Simple Extensions of 2D CNNs
- 3D Convolution Networks
- Two-Stream Networks
- Inflated 3D Conv Nets
- Slow Fast Networks
- Non-local Networks
Recurrent Neural Networks
Source: https://colah.github.io/posts/2015-09-NN-Types-FP/
3D Convolutions
Karpathy et al. Large-scale Video Classification with Convolutional Neural Networks, CVPR 2014
3D Convolutions
[Figure: (a) 2D convolution: a k×k kernel on an H×W image gives a 2D output. (b) 2D convolution on multiple frames: a k×k kernel applied across L frames collapses time in one step, still giving a 2D output. (c) 3D convolution: a k×k×d kernel with d < L slides over time as well, giving a 3D output.]
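A quick shape check of (b) vs. (c) in PyTorch, assuming a 16-frame RGB clip:

```python
import torch
import torch.nn as nn

# 2D conv on a multi-frame input collapses time in one step; a 3D conv with
# kernel depth d < L keeps a temporal dimension in its output.
clip = torch.randn(1, 3, 16, 112, 112)       # (batch, channels, L frames, H, W)

conv2d = nn.Conv2d(3 * 16, 64, kernel_size=3, padding=1)     # frames stacked as channels
out2d = conv2d(clip.flatten(1, 2))           # -> (1, 64, 112, 112): time is gone

conv3d = nn.Conv3d(3, 64, kernel_size=(3, 3, 3), padding=1)  # d = 3 < L = 16
out3d = conv3d(clip)                         # -> (1, 64, 16, 112, 112): time preserved
print(out2d.shape, out3d.shape)
```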
Two Stream Networks
Simonyan and Zisserman, Two-Stream Convolutional Networks for Action Recognition in Videos, NIPS 2014
[Figure 1: Two-stream architecture for video classification. The spatial stream ConvNet takes a single frame of the input video; the temporal stream ConvNet takes multi-frame optical flow; the two sets of class scores are fused.]

Spatial stream: conv1 (7×7×96, stride 2, norm., pool 2×2), conv2 (5×5×256, stride 2, norm., pool 2×2), conv3 (3×3×512, stride 1), conv4 (3×3×512, stride 1), conv5 (3×3×512, stride 1, pool 2×2), full6 (4096, dropout), full7 (2048, dropout), softmax.
Temporal stream: identical, except conv2 has no normalization.
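Late fusion in its simplest form averages the softmax scores of the two streams (the paper also reports SVM-based fusion); `spatial_net` and `temporal_net` are placeholders for the two ConvNets:

```python
import torch

def two_stream_predict(spatial_net, temporal_net, frame, flow_stack):
    """Average the class scores of the spatial stream (single RGB frame)
    and the temporal stream (stacked optical flow)."""
    p_spatial = torch.softmax(spatial_net(frame), dim=1)
    p_temporal = torch.softmax(temporal_net(flow_stack), dim=1)
    return (p_spatial + p_temporal) / 2   # fused class scores
```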
Two Stream Networks
Simonyan and Zisserman, Two-Stream Convolutional Networks for Action Recognition in Videos, NIPS 2014
Table 1: Individual ConvNets accuracy on UCF-101 (split 1).

(a) Spatial ConvNet:

Training setting          | Dropout 0.5 | Dropout 0.9
From scratch              | 42.5%       | 52.3%
Pre-trained + fine-tuning | 70.8%       | 72.8%
Pre-trained + last layer  | 72.7%       | 59.9%

(b) Temporal ConvNet:

Input configuration                      | Mean sub. off | Mean sub. on
Single-frame optical flow (L = 1)        | –             | 73.9%
Optical flow stacking (1) (L = 5)        | –             | 80.4%
Optical flow stacking (1) (L = 10)       | 79.9%         | 81.0%
Trajectory stacking (2) (L = 10)         | 79.6%         | 80.2%
Optical flow stacking (1) (L = 10), bi-dir. | –          | 81.2%
Inflated 3D Convolutions
Joao Carreira and Andrew Zisserman, Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset, CVPR 2017
Inflated 3D Convolutions

Joao Carreira and Andrew Zisserman, Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset, CVPR 2017

Accuracy (%) by architecture and input modality (Both = RGB + Flow):

Architecture        | UCF-101: RGB / Flow / Both | HMDB-51: RGB / Flow / Both | Kinetics: RGB / Flow / Both
(a) LSTM            | 81.0 / – / –               | 36.0 / – / –               | 63.3 / – / –
(b) 3D-ConvNet      | 51.6 / – / –               | 24.3 / – / –               | 56.1 / – / –
(c) Two-Stream      | 83.6 / 85.6 / 91.2         | 43.2 / 56.3 / 58.3         | 62.2 / 52.4 / 65.6
(d) 3D-Fused        | 83.2 / 85.8 / 89.3         | 49.2 / 55.5 / 56.8         | – / – / 67.2
(e) Two-Stream I3D  | 84.5 / 90.6 / 93.4         | 49.8 / 61.9 / 66.4         | 71.1 / 63.4 / 74.2
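The "inflation" trick bootstraps 3D filters from 2D ImageNet-pretrained ones: repeat each 2D kernel along time and rescale, so a "boring" video of identical frames reproduces the 2D network's activations. A minimal sketch:

```python
import torch

def inflate_conv2d_to_3d(w2d, time_dim=3):
    """Inflate a pretrained 2D kernel (out, in, k, k) into a 3D kernel
    (out, in, t, k, k) by repeating over time and dividing by t, so a video
    of repeated frames yields the same response as the 2D net on one frame."""
    w3d = w2d.unsqueeze(2).repeat(1, 1, time_dim, 1, 1)
    return w3d / time_dim
```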
SlowFast Networks
Christoph Feichtenhofer et al., SlowFast Networks for Video Recognition, ICCV 2019
[Figure: SlowFast architecture. The Slow pathway processes T frames with C channels at a low frame rate; the Fast pathway processes αT frames at a high frame rate with βC channels; lateral connections fuse the pathways before the prediction head.]
SlowFast Networks
Christoph Feichtenhofer et al., SlowFast Networks for Video Recognition, ICCV 2019
Table 1: An example instantiation of the SlowFast network (backbone: ResNet-50). Kernel dimensions are {T×S², C} for temporal, spatial, and channel sizes; strides are {temporal stride, spatial stride²}. The speed ratio is α = 8, the channel ratio is β = 1/8, and τ = 16. Residual blocks are shown in brackets.

stage      | Slow pathway                          | Fast pathway                        | output sizes (T×S²)
raw clip   | –                                     | –                                   | 64×224²
data layer | stride 16, 1²                         | stride 2, 1²                        | Slow: 4×224², Fast: 32×224²
conv1      | 1×7², 64, stride 1, 2²                | 5×7², 8, stride 1, 2²               | Slow: 4×112², Fast: 32×112²
pool1      | 1×3² max, stride 1, 2²                | 1×3² max, stride 1, 2²              | Slow: 4×56², Fast: 32×56²
res2       | [1×1², 64; 1×3², 64; 1×1², 256]×3     | [3×1², 8; 1×3², 8; 1×1², 32]×3      | Slow: 4×56², Fast: 32×56²
res3       | [1×1², 128; 1×3², 128; 1×1², 512]×4   | [3×1², 16; 1×3², 16; 1×1², 64]×4    | Slow: 4×28², Fast: 32×28²
res4       | [3×1², 256; 1×3², 256; 1×1², 1024]×6  | [3×1², 32; 1×3², 32; 1×1², 128]×6   | Slow: 4×14², Fast: 32×14²
res5       | [3×1², 512; 1×3², 512; 1×1², 2048]×3  | [3×1², 64; 1×3², 64; 1×1², 256]×3   | Slow: 4×7², Fast: 32×7²
           | global average pool, concat, fc       |                                     | # classes
[Figure 2: Accuracy/complexity tradeoff on Kinetics-400 for SlowFast vs. Slow-only architectures (variants 2×32 R50, 4×16 R50, 8×8 R50, 4×16 R101, 8×8 R101, 16×8 R101; SlowFast gains range from +1.7 to +3.4 top-1 over the Slow-only counterparts).] SlowFast is consistently better than its Slow-only counterpart in all cases, and provides higher accuracy at lower cost than temporally heavy Slow-only variants. Complexity is for a single clip with 256² spatial size; accuracies are obtained by 30-view testing.
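The two pathway inputs are just different temporal samplings of the same clip; a sketch with the table's example numbers (τ = 16, α = 8, so 64 raw frames yield 4 Slow and 32 Fast frames):

```python
def slowfast_sample(clip, alpha=8, tau=16):
    """Sample the two pathways' inputs from a raw clip (B, C, T, H, W):
    the Slow pathway takes every tau-th frame, the Fast pathway takes
    every (tau/alpha)-th frame, i.e. alpha times more frames."""
    slow = clip[:, :, ::tau]            # e.g. 64 frames -> 4
    fast = clip[:, :, ::tau // alpha]   # e.g. 64 frames -> 32
    return slow, fast
```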
Non-local Networks
Xiaolong Wang et al., Non-local Neural Networks, CVPR 2018

[Figure 2: A spacetime non-local block. The input x (T×H×W×1024) is embedded by 1×1×1 convolutions θ, φ, and g into T×H×W×512 tensors; a THW×THW affinity matrix is computed and normalized with a softmax, then applied to g(x); a final 1×1×1 convolution restores 1024 channels, giving the output z.]

$$y_i = \frac{1}{\mathcal{C}(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j)$$
Non-local Networks
Xiaolong Wang et al., Non-local Neural Networks, CVPR 2018
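A compact sketch of the embedded-Gaussian non-local block from Figure 2 (channel sizes follow the figure; production versions typically add batch norm and zero-initialize the output projection so the block starts as identity):

```python
import torch
import torch.nn as nn

class NonLocalBlock(nn.Module):
    """Spacetime non-local block: y_i = (1/C(x)) sum_j f(x_i, x_j) g(x_j),
    with f computed from theta/phi embeddings and softmax normalization,
    followed by an output projection and a residual connection."""
    def __init__(self, channels=1024, inter=512):
        super().__init__()
        self.theta = nn.Conv3d(channels, inter, 1)
        self.phi = nn.Conv3d(channels, inter, 1)
        self.g = nn.Conv3d(channels, inter, 1)
        self.out = nn.Conv3d(inter, channels, 1)

    def forward(self, x):                      # x: (B, C, T, H, W)
        B, C, T, H, W = x.shape
        q = self.theta(x).flatten(2)           # (B, inter, THW)
        k = self.phi(x).flatten(2)
        v = self.g(x).flatten(2)
        aff = torch.softmax(q.transpose(1, 2) @ k, dim=-1)  # (B, THW, THW) affinity
        y = (aff @ v.transpose(1, 2)).transpose(1, 2)       # (B, inter, THW)
        y = self.out(y.reshape(B, -1, T, H, W))
        return x + y                           # residual connection: z = W_z y + x
```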