Towards web-scale video understanding

SLIDE 1

Towards web-scale video understanding

Olga Russakovsky

Serena Yeung (Stanford), Achal Dave (CMU)

SLIDE 2

400 hours of video are uploaded to YouTube every minute

70% of Internet traffic was video in 2016; it will be over 80% by 2020

1 http://
2 White paper: Cisco VNI Forecast and Methodology, 2015-2020

SLIDE 3

Videos → Knowledge of the dynamic visual world

SLIDE 4

Capture temporal cues

(while handling correlations)

SLIDE 5

Allocate computation

SLIDE 6

Forego expensive annotation

(while embracing ambiguity)

[Figure: a video timeline. Cat? Cat walking? The temporal boundaries are ambiguous]

Agreement over spatial boundaries in images: 96-98% above 0.5 IOU

[Papadopoulos et al. ICCV 2017]

Agreement over temporal boundaries in videos: 76% above 0.5 IOU

[Sigurdsson et al. ICCV 2017]
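
Agreement here is measured with intersection-over-union; for temporal boundaries that is interval IoU. A minimal sketch (the example boundaries below are made up, purely for illustration):

    def temporal_iou(a, b):
        """IoU of two temporal intervals a = (start, end), b = (start, end), in seconds."""
        inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
        union = (a[1] - a[0]) + (b[1] - b[0]) - inter
        return inter / union if union > 0 else 0.0

    # Two annotators marking the same action with different boundaries:
    print(temporal_iou((2.0, 9.0), (4.0, 10.0)))  # 0.625, i.e. agreement above 0.5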


SLIDE 7

Challenges of videos @ scale

Modeling: Capture temporal cues while handling correlations

Inference: Allocate computation to enable large-scale processing

Learning: Learn new concepts cheaply while embracing ambiguity

SLIDE 9

SLIDE 10

Some desired modeling properties

  • Capture temporal cues
  • Effectively handle correlated examples
  • Provide an interpretable notion of memory
  • Operate in an online manner
SLIDE 11

Current approaches

  • Two-stream networks [Simonyan et al. NIPS 2014]: incorporate motion through optical flow
      • Computationally intensive!
  • C3D [Tran et al. ICCV 2015]: operates via 3D convolutions on groups of video frames
      • Memory intensive
      • Tends to oversmooth
  • Recurrent networks, e.g., Clockwork RNNs [Koutnik et al. ICML 2014]: maintain memory of “entire” history of video
      • History not easily interpretable
      • Training requires SGD on correlated data
SLIDE 12

Predictive-corrective networks

  • Key idea: Inspired by Kalman Filtering
  • Suppose our images and action scores evolve smoothly, as with a linear dynamical system
  • Can then create improved estimates of action scores by alternating prediction and correction steps

[Figure: actions and frames over time, with prediction and correction steps]
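
For reference, the generic Kalman-style prediction and correction step for such a linear dynamical system (standard textbook form; the slide's exact notation is not shown here). With state estimate \hat{x}, dynamics A, observation y_t, observation matrix C, and gain K_t:

    \hat{x}_{t \mid t-1} = A \, \hat{x}_{t-1 \mid t-1}                                    % prediction
    \hat{x}_{t \mid t}   = \hat{x}_{t \mid t-1} + K_t \left( y_t - C \, \hat{x}_{t \mid t-1} \right)   % correction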

[Dave, Russakovsky, Ramanan. “Predictive-Corrective Networks for Action Detection.” CVPR 2017]

SLIDE 13

Predictive-corrective instantiation

[Figure: instantiation of the prediction and correction streams. The prediction comes from the previous frame's FC8 output; a correction computed from the incoming frames is added (+) to it]

[Dave, Russakovsky, Ramanan. “Predictive-Corrective Networks for Action Detection.” CVPR 2017]
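
A minimal sketch of the recurrence this instantiation suggests (it assumes a generic correction network g that maps a frame difference to a change in FC8 action scores; this is not the authors' released code):

    import torch

    def predictive_corrective(frames, g, init_scores):
        """Sketch: scores_t = scores_{t-1} + g(frame_t - frame_{t-1}).

        frames: (T, C, H, W) tensor; init_scores: action scores for frame 0;
        g: assumed correction network.
        """
        scores = [init_scores]
        for t in range(1, frames.shape[0]):
            correction = g(frames[t] - frames[t - 1])  # correction from what changed
            scores.append(scores[-1] + correction)     # prediction: carry scores forward
        return torch.stack(scores)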

SLIDE 14

De-correlate data (conv4-3 layer)

[Dave, Russakovsky, Ramanan. “Predictive-Corrective Networks for Action Detection.” CVPR 2017]
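
One way to see why differencing helps, with a toy 1-D signal standing in for a video (made-up data, only to illustrate that consecutive frames are far more correlated than consecutive frame differences):

    import numpy as np

    rng = np.random.default_rng(0)
    slow_signal = np.cumsum(rng.normal(size=1000) * 0.1)   # smoothly varying "video"
    frames = slow_signal + rng.normal(size=1000) * 0.01    # per-frame observation

    corr_frames = np.corrcoef(frames[:-1], frames[1:])[0, 1]
    diffs = np.diff(frames)
    corr_diffs = np.corrcoef(diffs[:-1], diffs[1:])[0, 1]
    print(corr_frames, corr_diffs)   # frames: near 1.0; differences: much smaller in magnitude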

SLIDE 15

Visualizing the corrections

[Dave, Russakovsky, Ramanan. “Predictive-Corrective Networks for Action Detection.” CVPR 2017]

SLIDE 16

To summarize

[Dave, Russakovsky, Ramanan. “Predictive-Corrective Networks for Action Detection.” CVPR 2017]

Observe t=0 → Predict t=1 → Observe t=1 → Correct

SLIDE 17

Results

[Dave, Russakovsky, Ramanan. “Predictive-Corrective Networks for Action Detection.” CVPR 2017]

                         THUMOS    MultiTHUMOS    Charades
Single-frame             34.7      25.4           7.9
Two-stream               36.2      27.6           8.9
LSTM (RGB)               39.3      28.1           7.7
Predictive-Corrective    38.9      29.7           8.9

Per-frame classification (mAP)


SLIDE 20

Challenges of videos @ scale

Modeling: Capture temporal cues using a Kalman filter
  • Competitive with two-stream without optical flow
  • Simplifies learning by decorrelating the input
  [Dave, Russakovsky, Ramanan. CVPR 2017]

Inference: Allocate computation to enable large-scale processing

Learning: Learn new concepts cheaply while embracing ambiguity


SLIDE 22

Back to predictive-corrective

[Dave, Russakovsky, Ramanan. “Predictive-Corrective Networks for Action Detection.” CVPR 2017]

[Figure: the FC8 prediction from previous frames plus (+) an FC8 correction]

  • Can save computation by ignoring a frame if its correction is too small (~2x savings), as sketched below
  • But we still need to look at every frame!
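
A minimal sketch of that gating idea (the split into a cheap shallow feature and an expensive deep head is a hypothetical simplification, not the paper's exact layer-wise scheme):

    import torch

    def gated_scores(frames, shallow, deep, init_scores, tau=0.1):
        """Sketch: every frame is still read, but the expensive deep
        computation is skipped when the correction is tiny."""
        scores, prev = [init_scores], shallow(frames[0])
        for t in range(1, frames.shape[0]):
            cur = shallow(frames[t])               # must look at every frame
            delta = cur - prev
            if delta.abs().max() > tau:            # correction large enough to matter
                scores.append(scores[-1] + deep(delta))
                prev = cur
            else:                                  # correction too small: reuse old scores
                scores.append(scores[-1])
        return torch.stack(scores)

Under this reading, the ~2x savings would come from how often the deep head can be skipped.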
SLIDE 23

Efficient video processing

t = 0 t = T

SLIDE 24

Efficient video processing

Talking

t = 0 t = T

Running


SLIDE 26

Efficient video processing

Talking

t = 0 t = T

Running [start … end] [start … end]

SLIDE 27

Efficient video processing

Talking

t = 0 t = T

Running [start … end] [start … end]

[N. I. Badler. “Temporal Scene Analysis…” 1975]

“Knowing the output or the final state… there is no need to explicitly store many previous states”

SLIDE 28

Efficient video processing

Talking

t = 0 t = T

Running [start … end] [start … end]

[N. I. Badler. “Temporal Scene Analysis…” 1975]

“Knowing the output or the final state… there is no need to explicitly store many previous states” “Time may be represented in several ways… The intervals between ‘pulses’ need not be equal.”

SLIDE 29

Our model for efficient action detection

[Figure: timeline from t = 0 to t = T; a frame model takes a frame as input and produces an output]

[Yeung, Russakovsky, Mori, Fei-Fei. “End-to-end learning of action detection from frame glimpses in videos.” CVPR 2016]

SLIDE 30

Our model for efficient action detection

[Figure: timeline from t = 0 to t = T]

Frame model. Input: a frame. Output: a detection instance [start, end] and the next frame to glimpse.

[Yeung, Russakovsky, Mori, Fei-Fei. “End-to-end learning of action detection from frame glimpses in videos.” CVPR 2016]

SLIDE 31

Our model for efficient action detection

[Figure: the frame model is applied again at the next glimpsed frame, producing another detection instance [start, end] and the next frame to glimpse]

[Yeung, Russakovsky, Mori, Fei-Fei. “End-to-end learning of action detection from frame glimpses in videos.” CVPR 2016]

SLIDE 32

Our model for efficient action detection

[Figure: the observe-and-output cycle repeats across a sequence of glimpsed frames between t = 0 and t = T, each glimpse producing a detection instance [start, end] and the next frame to glimpse]

[Yeung, Russakovsky, Mori, Fei-Fei. “End-to-end learning of action detection from frame glimpses in videos.” CVPR 2016]

SLIDE 33

Our model for efficient action detection

The frame model is a convolutional neural network (frame information) feeding a recurrent neural network (time information); at each glimpse it outputs a detection instance [start, end] and the next frame to glimpse

[Yeung, Russakovsky, Mori, Fei-Fei. “End-to-end learning of action detection from frame glimpses in videos.” CVPR 2016]

SLIDE 34

Our model for efficient action detection

At every glimpse the model outputs the next frame to glimpse; emitting a detection instance [start, end] is optional, so some glimpses produce no detection

(Convolutional neural network for frame information; recurrent neural network for time information)

[Yeung, Russakovsky, Mori, Fei-Fei. “End-to-end learning of action detection from frame glimpses in videos.” CVPR 2016]
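
A minimal sketch of this glimpse loop (cnn, rnn_cell, and heads are hypothetical modules, not the released implementation). The CNN encodes the observed frame, the RNN carries state across glimpses, and each step emits an optional detection plus the next frame index to observe:

    import torch

    def run_glimpses(video, cnn, rnn_cell, heads, n_glimpses=6):
        """video: (T, C, H, W). heads(h) is assumed to return
        (emit_prob, start, end, next_loc), with next_loc in [0, 1]
        as a fraction of the video length."""
        T = video.shape[0]
        t, h = 0, torch.zeros(1, rnn_cell.hidden_size)
        detections = []
        for _ in range(n_glimpses):
            feat = cnn(video[t].unsqueeze(0))          # observe one frame
            h = rnn_cell(feat, h)                      # update temporal state
            emit_prob, start, end, next_loc = heads(h) # candidate detection + where to look next
            if emit_prob > 0.5:                        # optionally emit a detection
                detections.append((float(start), float(end)))
            t = int(next_loc.clamp(0, 1) * (T - 1))    # jump to the next glimpse
        return detections

At test time the emission can simply be thresholded as above; during training it is treated as a stochastic decision (next slides).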

SLIDE 35

Training

The emitted detections are trained with supervised losses: an L2 distance localization loss on the [start, end] interval and a cross-entropy classification loss

[Yeung, Russakovsky, Mori, Fei-Fei. “End-to-end learning of action detection from frame glimpses in videos.” CVPR 2016]
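
In generic form (the slide does not show the exact weighting; lambda below is an assumed trade-off weight), for an emitted detection with predicted interval (s, e), ground-truth interval (s*, e*), and predicted probability p_{y*} of the true label:

    \mathcal{L} = \underbrace{\lVert (s, e) - (s^{*}, e^{*}) \rVert_{2}^{2}}_{\text{L2 localization loss}}
                + \lambda \, \underbrace{\bigl( -\log p_{y^{*}} \bigr)}_{\text{cross-entropy classification loss}}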

SLIDE 36

Training (continued)

The glimpse policy (where to look next and when to emit a detection) is non-differentiable, so it is trained using REINFORCE

[Yeung, Russakovsky, Mori, Fei-Fei. “End-to-end learning of action detection from frame glimpses in videos.” CVPR 2016]
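
This is the standard REINFORCE estimator (generic form: pi_theta is the glimpse policy, a_t its sampled actions in state s_t, R the reward for the resulting detections, and b a baseline):

    \nabla_{\theta} J \;\approx\; \sum_{t} \nabla_{\theta} \log \pi_{\theta}(a_{t} \mid s_{t}) \,\bigl(R - b\bigr)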

SLIDE 37

Accuracy · Efficiency · Interpretability

[Yeung, Russakovsky, Mori, Fei-Fei. “End-to-end learning of action detection from frame glimpses in videos.” CVPR 2016]

SLIDE 38

Accuracy

Detection AP at IOU 0.5
Dataset               State-of-the-art    Our result
THUMOS 2014           14.4                17.1
ActivityNet sports    33.2                36.7
ActivityNet work      31.1                39.9

Efficiency · Interpretability

[Yeung, Russakovsky, Mori, Fei-Fei. “End-to-end learning of action detection from frame glimpses in videos.” CVPR 2016]

SLIDE 39

Accuracy

Detection AP at IOU 0.5
Dataset               State-of-the-art    Our result
THUMOS 2014           14.4                17.1
ActivityNet sports    33.2                36.7
ActivityNet work      31.1                39.9

Efficiency: glimpse only 2% of video frames

Interpretability

[Yeung, Russakovsky, Mori, Fei-Fei. “End-to-end learning of action detection from frame glimpses in videos.” CVPR 2016]

SLIDE 40

Accuracy

Detection AP at IOU 0.5
Dataset               State-of-the-art    Our result
THUMOS 2014           14.4                17.1
ActivityNet sports    33.2                36.7
ActivityNet work      31.1                39.9

Efficiency: glimpse only 2% of video frames

Sampling              Detection AP at IOU 0.5
Uniform               9.3
Our glimpses          17.1

Interpretability

[Yeung, Russakovsky, Mori, Fei-Fei. “End-to-end learning of action detection from frame glimpses in videos.” CVPR 2016]

SLIDE 41

Accuracy

Detection AP at IOU 0.5
Dataset               State-of-the-art    Our result
THUMOS 2014           14.4                17.1
ActivityNet sports    33.2                36.7
ActivityNet work      31.1                39.9

Efficiency: glimpse only 2% of video frames

Sampling              Detection AP at IOU 0.5
Uniform               9.3
Our glimpses          17.1

Interpretability

[Figure: two “Javelin throw” examples showing frames, glimpses, detections, and ground truth along the video timeline]

[Yeung, Russakovsky, Mori, Fei-Fei. “End-to-end learning of action detection from frame glimpses in videos.” CVPR 2016]

SLIDE 42

Challenges of videos @ scale

Modeling: Capture temporal cues using a Kalman filter
  • Competitive with two-stream without optical flow
  • Simplifies learning by decorrelating the input
  [Dave, Russakovsky, Ramanan. CVPR 2017]

Inference: Focus computation on a small subset of key frames
  • Only looks at 2% of frames while maintaining accuracy
  • Uses RL to learn where to look and when to output
  [Yeung, Russakovsky, Mori, Fei-Fei. CVPR 2016]

Learning: Learn new concepts cheaply while embracing ambiguity


SLIDE 44

Labeling videos is expensive

  • Takes significantly longer to label a video than an image
  • Temporal bounds are even more expensive — and ambiguous
  • How can we practically learn about new concepts in video?

[Screenshot: crowdsourcing interface. Workers watch a short video and check which objects someone interacts with (cup/glass/bottle, laptop, doorknob, table, broom, picture) and how (drinking from, holding, pouring into, putting somewhere, taking, washing, other), with instructions to watch each video fully and carefully]

[Sigurdsson, Russakovsky, Farhadi, Laptev, Gupta. “Much Ado About Time: Exhaustive Annotation of Temporal Data.” HCOMP 2016]

SLIDE 45

Learning new concepts from image search

Reasonably clean

SLIDE 46

Learning new concepts from video search

Very very noisy

SLIDE 47

Balancing diversity vs. semantic drift

  • Want diverse training examples
  • But too much diversity can also lead to semantic drift

[Yeung, Ramanathan, Russakovsky, Shen, Mori, Fei-Fei. “Learning to learn from noisy web videos.” CVPR 2017]

SLIDE 48

Prior approaches

  • NEIL [Chen et al. 2013, Chen et al. 2015]: incorporates learned relationships between objects
  • OPTIMOL [Li et al. 2007]: uses rule-based heuristics (e.g., entropy)
  • Semi-supervised approaches (e.g., [Joachims et al. 1999], [Zhu et al. 2002], [Zhou et al. 2004]): optimize globally over a fixed-size dataset

[Yeung, Ramanathan, Russakovsky, Shen, Mori, Fei-Fei. “Learning to learn from noisy web videos.” CVPR 2017]

SLIDE 49

Overview of approach

[Figure: candidate web queries from autocomplete (“Boomerang”, “Boomerang on a beach”, “Boomerang music video”) feed an agent that selects new positives; these are added (+) to the current positive set, after which the classifier is updated and the agent's state is updated]

[Yeung, Ramanathan, Russakovsky, Shen, Mori, Fei-Fei. “Learning to learn from noisy web videos.” CVPR 2017]

SLIDE 50

Overview of approach

[Figure: as above, with the classifier, agent, candidate web queries, and current positive set; a fixed negative set is now also shown]

[Yeung, Ramanathan, Russakovsky, Shen, Mori, Fei-Fei. “Learning to learn from noisy web videos.” CVPR 2017]

SLIDE 51

Overview of approach

[Figure: the agent is a Q-learning agent; its training reward comes from evaluating the updated classifier on a held-out reward set]

[Yeung, Ramanathan, Russakovsky, Shen, Mori, Fei-Fei. “Learning to learn from noisy web videos.” CVPR 2017]
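
A minimal sketch of that loop (all helper names such as fetch_videos, train_classifier, eval_on_reward_set, and q_value are hypothetical stand-ins; the update of the Q-function itself is omitted):

    def grow_positive_set(candidate_queries, positives, negatives,
                          fetch_videos, train_classifier, eval_on_reward_set,
                          q_value, steps=10):
        """Sketch of the agent's selection loop; `positives` is a set of videos."""
        clf = train_classifier(positives, negatives)
        prev_acc = eval_on_reward_set(clf)
        for _ in range(steps):
            state = (positives, candidate_queries)
            query = max(candidate_queries, key=lambda q: q_value(state, q))  # agent picks a query
            positives = positives | set(fetch_videos(query))                 # select new positives
            clf = train_classifier(positives, negatives)                     # update classifier
            acc = eval_on_reward_set(clf)                                    # eval on reward set
            reward = acc - prev_acc   # training reward (Q-function update omitted in this sketch)
            prev_acc = acc
        return clf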

SLIDE 52

Reward incorporates classifier uncertainty

[Yeung, Ramanathan, Russakovsky, Shen, Mori, Fei-Fei. “Learning to learn from noisy web videos.” CVPR 2017]

SLIDE 53

Method                                                Accuracy
Seed                                                  64.3
Label Propagation [Zhu and Ghahramani, ICML 2002]     67.2
Label Spreading [Zhou et al., NIPS 2004]              67.3
TSVM [Joachims, ICML 1999]                            72.5
Greedy                                                74.7
Greedy w/ clusters [à la NEIL & OPTIMOL]              74.3
Greedy w/ KL-divergence                               74.7
Ours                                                  77.0

[Yeung, Ramanathan, Russakovsky, Shen, Mori, Fei-Fei. “Learning to learn from noisy web videos.” CVPR 2017]

Testing on Sports1M

Classes: 300 for training, 105 for testing
Videos: YouTube for training, Sports1M-test for testing

SLIDE 54

Greedy classifier vs. Ours

[Yeung, Ramanathan, Russakovsky, Shen, Mori, Fei-Fei. “Learning to learn from noisy web videos.” CVPR 2017]

Testing on Sports1M


SLIDE 56

Novel classes

Greedy classifier vs. Ours

[Yeung, Ramanathan, Russakovsky, Shen, Mori, Fei-Fei. “Learning to learn from noisy web videos.” CVPR 2017]

SLIDE 57

Challenges of videos @ scale

Modeling: Capture temporal cues using a Kalman filter
  • Competitive with two-stream without optical flow
  • Simplifies learning by decorrelating the input
  [Dave, Russakovsky, Ramanan. CVPR 2017]

Inference: Focus computation on a small subset of key frames
  • Only looks at 2% of frames while maintaining accuracy
  • Uses RL to learn where to look and when to output
  [Yeung, Russakovsky, Mori, Fei-Fei. CVPR 2016]

Learning: Use noisy web search results to learn new concepts
  • Determines how to select positive examples with RL
  • Avoids expensive annotation
  [Yeung, Ramanathan, Russakovsky, Shen, Mori, Fei-Fei. CVPR 2017]


SLIDE 59