Action recognition in videos
Cordelia Schmid

Action recognition - goal
• Short actions, e.g. answer phone, shake hands
[Figure: example frames labeled "hand shake" and "answer phone"]

Action recognition - goal
• Activities/events, e.g. making a sandwich, doing homework
[Figure: "Making sandwich" and "Doing homework" examples from the TrecVid Multimedia Event Detection dataset]

Action recognition - goal
• Activities/events, e.g. birthday party, parade
[Figure: "Birthday party" and "Parade" examples from the TrecVid Multimedia Event Detection dataset]

Action recognition - tasks
• Action classification: assigning an action label to a video clip (e.g. "Making sandwich": present; "Feeding animal": not present; …)
• Action localization: searching for the locations of an action in a video

State of the art in action recognition
• Motion history image [Bobick & Davis, 2001]
• Spatial motion descriptor [Efros et al., ICCV 2003]
• Learning dynamic prior [Blake et al., 1998]
• Sign language recognition [Zisserman et al., 2009]

Advantages/disadvantages
• Temporal templates: + simple, fast; - sensitive to segmentation errors
• Active shape models: + shape regularization; - sensitive to initialization and tracking failures
• Tracking with motion priors: + improved tracking and simultaneous action recognition; - sensitive to initialization and tracking failures
• Motion-based recognition: + generic descriptors, less dependent on appearance; - sensitive to localization/tracking errors

State of the art in action recognition
• Bag of space-time features [Laptev'03, Schuldt'04, Niebles'06, Zhang'07]
• Pipeline: extraction of space-time features → collection of space-time patches → HOG & HOF patch descriptors → histogram of visual words → SVM classifier

Space-time local features

Space-Time Interest Points: Detection
• What neighborhoods to consider? Distinctive neighborhoods → high image variation in space and time → look at the distribution of the gradient
• Definitions:
– Original image sequence f(x, y, t)
– Space-time Gaussian g with covariance Σ
– Gaussian derivative of f
– Space-time gradient ∇L
– Second-moment matrix μ

Space-Time Interest Points: Detection
• Properties of the second-moment matrix μ:
– μ defines a second-order approximation for the local distribution of the space-time gradient within a neighborhood
– rank 1: 1D space-time variation of the image sequence, e.g. moving bar
– rank 2: 2D space-time variation of the image sequence, e.g. moving ball
– rank 3: 3D space-time variation of the image sequence, e.g. jumping ball
• Large eigenvalues of μ can be detected by the local maxima of H over (x, y, t) (similar to the Harris operator [Harris and Stephens, 1988])
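The formulas on the two detection slides did not survive extraction. As a sketch only, the definitions can be written as follows, assuming the standard space-time interest point formulation from Laptev's papers. The symbols f, g, L, μ, H, the scale factor s, the constant k and the eigenvalues λ1, λ2, λ3 of μ are taken from that formulation and are not recoverable from the slide text itself.

```latex
\begin{align}
L(\cdot;\Sigma) &= f(\cdot) * g(\cdot;\Sigma)
  && \text{smoothed image sequence} \\
\nabla L &= (L_x,\, L_y,\, L_t)^{\top}
  && \text{space-time gradient} \\
\mu(\cdot;\Sigma) &= g(\cdot; s\Sigma) * \bigl(\nabla L\,(\nabla L)^{\top}\bigr)
  && \text{second-moment matrix} \\
H &= \det(\mu) - k\,\operatorname{trace}^{3}(\mu)
   = \lambda_1\lambda_2\lambda_3 - k\,(\lambda_1+\lambda_2+\lambda_3)^{3}
  && \text{interest point response}
\end{align}
```

Interest points are then the local maxima of H over (x, y, t), which is the space-time analogue of the Harris corner response.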

Space-time features
• Detector [Laptev'05]

Space-time features
• Descriptors: HOG / HOF
– Histogram of oriented spatial gradients (HOG): 3x3x2x4-bin descriptor
– Histogram of optical flow (HOF): 3x3x2x5-bin descriptor
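The grid sizes above fix the descriptor lengths: 3x3x2x4 = 72 values for HOG and 3x3x2x5 = 90 for HOF. The following is a minimal sketch of how samples inside a space-time patch could be accumulated into such a grid of orientation histograms. The function name `grid_histogram`, the sample format (normalized x, y, t in [0, 1), an angle in radians, a weight) and the uniform full-circle orientation binning are assumptions made for illustration, not details given on the slide.

```python
import math

def grid_histogram(samples, grid=(3, 3, 2), bins=4):
    """Accumulate weighted orientations into an nx * ny * nt grid of histograms.

    samples: iterable of (x, y, t, angle, weight) with x, y, t in [0, 1)
             relative to the patch (hypothetical input format).
    """
    nx, ny, nt = grid
    hist = [0.0] * (nx * ny * nt * bins)
    for x, y, t, angle, weight in samples:
        # Locate the spatio-temporal cell the sample falls into.
        cx = min(int(x * nx), nx - 1)
        cy = min(int(y * ny), ny - 1)
        ct = min(int(t * nt), nt - 1)
        # Quantize the orientation uniformly over the full circle (assumption).
        b = int((angle % (2 * math.pi)) / (2 * math.pi) * bins) % bins
        cell = (ct * ny + cy) * nx + cx
        hist[cell * bins + b] += weight
    return hist

hog_like = grid_histogram([], bins=4)  # 72 values
hof_like = grid_histogram([], bins=5)  # 90 values
```

Concatenating per-cell histograms keeps coarse position information inside the patch, which a single global histogram would discard.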

Visual Vocabulary: K-means clustering
• Group similar points in the space of image descriptors using K-means clustering
• Select significant clusters
[Figure: clustering into clusters c1, c2, c3, c4, followed by classification]

Local features: Matching
• Finds similar events in pairs of video sequences

Bag of features
• Cluster descriptors with k-means (~4000 clusters)
• Assign each descriptor to the closest center
• Measure the frequency of each codeword
[Figure: histogram of frequency over codewords]
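The three steps above can be sketched in a few lines of plain Python. This is an illustrative toy version, not the pipeline used for the reported results: the helper names (`kmeans`, `nearest`, `bof_histogram`), the random initialization and the fixed iteration count are assumptions, and a real system would use an optimized k-means over roughly 4000 clusters.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two descriptors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def nearest(centers, d):
    """Index of the cluster center closest to descriptor d."""
    return min(range(len(centers)), key=lambda i: sq_dist(centers[i], d))

def kmeans(descriptors, k, iters=20, seed=0):
    """Plain Lloyd iterations, initialized from k random descriptors."""
    rng = random.Random(seed)
    centers = [list(c) for c in rng.sample(descriptors, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for d in descriptors:
            groups[nearest(centers, d)].append(d)
        for i, g in enumerate(groups):
            if g:  # keep the previous center if a cluster is empty
                centers[i] = [sum(col) / len(g) for col in zip(*g)]
    return centers

def bof_histogram(centers, descriptors):
    """Normalized frequency of each codeword among a video's descriptors."""
    hist = [0.0] * len(centers)
    for d in descriptors:
        hist[nearest(centers, d)] += 1.0
    total = sum(hist)
    return [h / total for h in hist] if total else hist

train = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
vocabulary = kmeans(train, k=2)
video = [(0.0, 0.0), (10.0, 10.0), (10.0, 11.0), (10.0, 10.5)]
histogram = bof_histogram(vocabulary, video)
```

The resulting histogram has one entry per codeword and is the fixed-length video representation passed to the classifier.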

Action classification results
• KTH dataset
• Hollywood-2 dataset, e.g. AnswerPhone, GetOutCar, HandShake, StandUp, DriveCar, Kiss
[Laptev, Marszałek, Schmid, Rozenfeld 2008]

Action classification
• Test episodes from the movies "The Graduate", "It's a Wonderful Life", "Indiana Jones and the Last Crusade"

Improved descriptors: Dense trajectories
• Dense sampling improves results over sparse interest points for image classification [Fei-Fei'05, Nowak'06]
• Recent progress by using feature trajectories for action recognition [Messing'09, Sun'09]
• The 2D space domain and 1D time domain in videos have very different characteristics
• Dense trajectories: a combination of dense sampling with feature trajectories [Wang, Klaeser, Schmid & Liu, CVPR'11]

Approach
• Dense multi-scale sampling
• Feature tracking over L frames with optical flow
• Trajectory-aligned descriptors with a spatio-temporal grid

Approach
• Dense sampling
– remove untrackable points
– based on the eigenvalues of the auto-correlation matrix
• Feature tracking
– by median filtering in dense optical flow field
– length is limited to avoid drifting
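The tracking step can be illustrated with a small sketch: each point is moved by the median of the optical flow in its neighborhood, and the trajectory is cut after a fixed number of frames. The names `median_flow` and `track`, the 3x3 neighborhood, the nearest-pixel rounding and the default `max_len=15` are assumptions for illustration; the slide only states that the length is limited to L frames.

```python
from statistics import median

def median_flow(flow, x, y, radius=1):
    """Median of the flow vectors in a (2*radius+1)^2 window around (x, y).

    flow: 2D list indexed as flow[row][col], each entry a (u, v) tuple.
    """
    h, w = len(flow), len(flow[0])
    us, vs = [], []
    for j in range(max(0, y - radius), min(h, y + radius + 1)):
        for i in range(max(0, x - radius), min(w, x + radius + 1)):
            u, v = flow[j][i]
            us.append(u)
            vs.append(v)
    return median(us), median(vs)

def track(flows, start, max_len=15):
    """Follow one point through successive flow fields for at most max_len frames."""
    x, y = start
    trajectory = [(x, y)]
    for flow in flows[:max_len]:
        h, w = len(flow), len(flow[0])
        xi, yi = int(round(x)), int(round(y))
        if not (0 <= xi < w and 0 <= yi < h):
            break  # the point left the frame
        u, v = median_flow(flow, xi, yi)
        x, y = x + u, y + v
        trajectory.append((x, y))
    return trajectory

# Uniform motion to the right with a single outlier in the flow field.
flow = [[(1.0, 0.0) for _ in range(5)] for _ in range(5)]
flow[2][2] = (100.0, 100.0)
path = track([flow] * 3, (1, 2))
```

In this example the median ignores the outlier at the center of the field, which is the reason for filtering instead of reading the flow at a single pixel.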

Feature tracking
[Figure: comparison of SIFT tracks, KLT tracks and dense tracks]

Trajectory descriptors
• Motion boundary descriptor
– spatial derivatives are calculated separately for optical flow in x and y, then quantized into a histogram
– captures the relative dynamics of different regions
– suppresses constant motion, such as that caused by background camera motion
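A minimal sketch of the idea: take the spatial gradient of each flow component separately, then build a magnitude-weighted orientation histogram per component. The function names, the central-difference derivative and the choice of 8 orientation bins are assumptions for illustration and are not stated on the slide.

```python
import math

def mbh_histogram(component, bins=8):
    """Orientation histogram of the spatial gradient of ONE flow component.

    component: 2D list of floats (horizontal or vertical flow values).
    """
    h, w = len(component), len(component[0])
    hist = [0.0] * bins
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Central differences on interior pixels only.
            dx = (component[y][x + 1] - component[y][x - 1]) / 2.0
            dy = (component[y + 1][x] - component[y - 1][x]) / 2.0
            mag = math.hypot(dx, dy)
            if mag == 0.0:
                continue  # locally constant motion contributes nothing
            angle = math.atan2(dy, dx) % (2 * math.pi)
            hist[int(angle / (2 * math.pi) * bins) % bins] += mag
    return hist

def mbh(flow_u, flow_v, bins=8):
    """Concatenate the histograms of the x and y flow components."""
    return mbh_histogram(flow_u, bins) + mbh_histogram(flow_v, bins)

# A vertical motion boundary: the left part is static, the right part moves.
u = [[0.0 if x < 2 else 1.0 for x in range(5)] for _ in range(5)]
v = [[0.0] * 5 for _ in range(5)]
# The same scene with a constant camera pan added to the horizontal flow.
u_pan = [[value + 5.0 for value in row] for row in u]
```

Because only derivatives of the flow are used, `mbh(u, v)` and `mbh(u_pan, v)` are identical, which illustrates why constant camera motion is suppressed.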

Trajectory descriptors
• Trajectory shape described by normalized relative point coordinates
• HOG, HOF and MBH are encoded along each trajectory
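One way to read "normalized relative point coordinates" is as the sequence of frame-to-frame displacements divided by the total path length. The sketch below assumes that normalization; the slide does not spell out the normalization, and the function name `trajectory_shape` is hypothetical.

```python
import math

def trajectory_shape(points):
    """Displacement vectors of a trajectory, normalized by total path length.

    points: list of (x, y) positions, one per frame.
    Returns a flat list [dx1, dy1, dx2, dy2, ...].
    """
    displacements = [(x1 - x0, y1 - y0)
                     for (x0, y0), (x1, y1) in zip(points, points[1:])]
    norm = sum(math.hypot(dx, dy) for dx, dy in displacements)
    if norm == 0.0:
        return [0.0] * (2 * len(displacements))  # static point
    return [c / norm for d in displacements for c in d]

shape = trajectory_shape([(0, 0), (3, 4), (6, 8)])
```

With this normalization the descriptor does not change when the whole trajectory is scaled, for example when the same motion is seen at a different image scale.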

Experimental setup
• Bag-of-features with 4000 clusters obtained by k-means, classification by non-linear SVM with RBF + chi-square kernel
– also possible to use Fisher vector + linear SVM
• Descriptors are combined by addition of distances
• Evaluation on two datasets: UCF Sports (classification accuracy) and Hollywood2 (mean average precision)
• Two baseline trajectories: KLT and SIFT
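The combination of an RBF kernel with chi-square distances, summed over descriptor channels, can be sketched as below. The per-channel normalization by a mean distance (`mean_dist`) is an assumption borrowed from common multi-channel kernel practice; the slide only states that distances are added. Function names and the dictionary-based input format are hypothetical.

```python
import math

def chi2_distance(h1, h2):
    """Chi-square distance between two histograms of equal length."""
    return 0.5 * sum((a - b) ** 2 / (a + b)
                     for a, b in zip(h1, h2) if a + b > 0)

def multichannel_kernel(x, y, mean_dist):
    """exp(-sum_c D_c(x, y) / A_c) over channels such as HOG, HOF, MBH.

    x, y:      dicts mapping channel name -> bag-of-features histogram.
    mean_dist: dict mapping channel name -> normalization A_c, e.g. the
               mean chi-square distance on the training set (assumption).
    """
    total = sum(chi2_distance(x[c], y[c]) / mean_dist[c] for c in x)
    return math.exp(-total)

video_a = {"hog": [1.0, 0.0], "mbh": [1.0, 0.0]}
video_b = {"hog": [0.0, 1.0], "mbh": [0.0, 1.0]}
scale = {"hog": 1.0, "mbh": 2.0}
similarity = multichannel_kernel(video_a, video_b, scale)
```

A kernel of this form can be passed to an SVM implementation that accepts precomputed kernel matrices.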

UCF Sports
• 10 action classes, videos from TV broadcasts
[Figure: example classes Diving, Kicking, Skateboarding, High-Bar-Swinging]

Comparison of descriptors

Descriptor    Hollywood2    UCF Sports
Trajectory    47.8%         75.4%
HOG           41.2%         84.3%
HOF           50.3%         76.8%
MBH           55.1%         84.2%
Combined      58.2%         88.0%

• Trajectory descriptor performs well
• HOF >> HOG for Hollywood2: dynamic information is relevant
• HOG >> HOF for sports datasets: spatial context is relevant
• MBH consistently outperforms HOF: robust to camera motion

Comparison of trajectories

Trajectory + descriptor    Hollywood2    UCF Sports
Dense trajectory + MBH     55.1%         84.2%
KLT trajectory + MBH       48.6%         78.4%
SIFT trajectory + MBH      40.6%         72.1%

• Dense >> KLT >> SIFT trajectories

Improved trajectories (Wang & Schmid ICCV'13)
• Dense trajectories are impacted by camera motion
– Stabilize camera motion before computing optical flow
– Use a human detector and robust homography estimation
– Warp optical flow and remove background trajectories
(student presentation)
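The effect of camera-motion compensation can be illustrated with a simplified sketch: given a homography between consecutive frames, subtract the motion it predicts from the measured flow, and discard trajectories whose remaining motion is negligible. This is not the procedure of the cited paper, which warps the frame and recomputes the flow; the function names, the per-step input format and the `min_motion` threshold are all assumptions made for illustration.

```python
import math

def apply_homography(H, x, y):
    """Map point (x, y) through a 3x3 homography given as nested lists."""
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / w,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / w)

def residual_flow(H, x, y, flow):
    """Measured flow at (x, y) minus the displacement explained by the camera."""
    cx, cy = apply_homography(H, x, y)
    return flow[0] - (cx - x), flow[1] - (cy - y)

def is_background(steps, min_motion=1.0):
    """True if a trajectory is explained by camera motion alone.

    steps: list of (H, (x, y), (u, v)), one entry per frame pair, where H is
           the homography estimated for that frame pair.
    min_motion: hypothetical threshold on the summed residual displacement.
    """
    total = sum(math.hypot(*residual_flow(H, x, y, f)) for H, (x, y), f in steps)
    return total < min_motion

# Camera pans 2 pixels to the right per frame.
pan = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
static_point = [(pan, (5.0, 5.0), (2.0, 0.0))] * 3
moving_point = [(pan, (5.0, 5.0), (5.0, 1.0))] * 3
```

Here `is_background(static_point)` is true because the measured flow equals the camera-induced flow, while `moving_point` keeps a residual of (3, 1) per frame and is retained.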

Results

Excellent results in TrecVid MED'13
• Combination of MBH, SIFT, audio, text & speech recognition
• First in the known event challenge, first in the ad hoc event challenge
[Figure: "Making sandwich" results: Rank 1 (pos), Rank 20 (pos), Rank 21 (neg)]

Excellent results in TrecVid MED'13
[Figure: "FlashMob gathering" results: Rank 1 (pos), Rank 18 (pos), Rank 19 (neg)]

Impact of different channels
