Action recognition in videos


  1. Action recognition in videos
     Cordelia Schmid

  2. Action recognition - goal
     • Short actions, e.g. answer phone, shake hands
     Example frames: hand shake, answer phone

  3. Action recognition - goal
     • Activities/events, e.g. making a sandwich, doing homework
     Example frames: making sandwich, doing homework (TrecVid Multi-media event detection dataset)

  4. Action recognition - goal
     • Activities/events, e.g. birthday party, parade
     Example frames: birthday party, parade (TrecVid Multi-media event detection dataset)

  5. Action recognition - tasks
     • Action classification: assigning an action label to a video clip
       Example: Making sandwich: present; Feeding animal: not present; …

  6. Action recognition - tasks
     • Action classification: assigning an action label to a video clip
       Example: Making sandwich: present; Feeding animal: not present; …
     • Action localization: search locations of an action in a video

  7. State of the art in action recognition
     • Spatial motion descriptor [Efros et al. ICCV 2003]
     • Motion history image [Bobick & Davis, 2001]
     • Sign language recognition [Zisserman et al. 2009]
     • Learning dynamic prior [Blake et al. 1998]

  8. Advantages/disadvantages
     Temporal templates:
     + simple, fast
     - sensitive to segmentation errors
     Active shape models:
     + shape regularization
     - sensitive to initialization and tracking failures
     Tracking with motion priors:
     + improved tracking and simultaneous action recognition
     - sensitive to initialization and tracking failures
     Motion-based recognition:
     + generic descriptors; less dependent on appearance
     - sensitive to localization/tracking errors

  9. State of the art in action recognition
     • Bag of space-time features [Laptev’03, Schuldt’04, Niebles’06, Zhang’07]
     Pipeline stages shown: extraction of space-time features; collection of space-time patches; HOG & HOF patch descriptors; histogram of visual words; SVM classifier

  10. Space-time local features

  11. Space-Time Interest Points: Detection
      What neighborhoods to consider?
      Distinctive neighborhoods: high image variation in space and time; look at the distribution of the gradient
      Definitions: original image sequence; space-time Gaussian with a given covariance; Gaussian derivatives of the sequence; space-time gradient; second-moment matrix

  12. Space-Time Interest Points: Detection
      Properties of the second-moment matrix:
      • defines a second-order approximation for the local distribution of the space-time gradient within a neighborhood
      • 1D space-time variation of the sequence, e.g. moving bar
      • 2D space-time variation of the sequence, e.g. moving ball
      • 3D space-time variation of the sequence, e.g. jumping ball
      Large eigenvalues of the second-moment matrix can be detected by the local maxima of H over (x,y,t) (similar to the Harris operator [Harris and Stephens, 1988])
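The response function H on this slide can be illustrated with a minimal sketch. It assumes the form used in the space-time interest point literature [Laptev’05], H = det(μ) − k·trace³(μ), where μ is the 3x3 second-moment matrix. The constant k = 0.005 and the function names are assumptions made for illustration, not values taken from the slide.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def harris3d_response(mu, k=0.005):
    """H = det(mu) - k * trace(mu)**3 for a 3x3 second-moment matrix.

    k is an assumed constant; this is a sketch, not the authors' code.
    """
    trace = mu[0][0] + mu[1][1] + mu[2][2]
    return det3(mu) - k * trace ** 3


# Variation along one direction only (rank 1): the determinant vanishes,
# so the response is negative.
bar = [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
# Variation along x, y and t (full rank): the response is positive.
full = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

print(harris3d_response(bar) < 0, harris3d_response(full) > 0)
```

Only neighborhoods with significant variation along all three axes give a large positive H, which is why local maxima of H pick out events such as the jumping ball rather than the moving bar.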

  13. Space-time features
      • Detector [Laptev’05]

  14. Space-time features
      • Descriptors: HOG / HOF
        – Histogram of oriented spatial gradients (HOG): 3x3x2x4 bins
        – Histogram of optical flow (HOF): 3x3x2x5 bins
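The bin counts on this slide fix the descriptor lengths: a 3x3x2 grid of cells with 4 bins per cell gives 72 HOG values, and 5 bins per cell gives 90 HOF values. The sketch below checks that arithmetic and shows one plausible way to build a per-cell orientation histogram. The equal-sector binning, the magnitude weighting and the L1 normalization are assumptions for illustration; the slide does not specify them.

```python
import math

GRID = (3, 3, 2)            # cells in x, y and t, as on the slide
HOG_BINS, HOF_BINS = 4, 5   # bins per cell, as on the slide


def descriptor_length(grid, bins):
    """Number of values in a descriptor with one histogram per grid cell."""
    nx, ny, nt = grid
    return nx * ny * nt * bins


def orientation_bin(dx, dy, n_bins=4):
    """Quantize the direction of (dx, dy) into n_bins equal sectors (assumed layout)."""
    angle = math.atan2(dy, dx) % (2 * math.pi)
    return int(angle / (2 * math.pi / n_bins)) % n_bins


def cell_histogram(vectors, n_bins=4):
    """Magnitude-weighted orientation histogram for one cell, L1-normalized."""
    hist = [0.0] * n_bins
    for dx, dy in vectors:
        hist[orientation_bin(dx, dy, n_bins)] += math.hypot(dx, dy)
    total = sum(hist)
    return [h / total for h in hist] if total > 0 else hist


print(descriptor_length(GRID, HOG_BINS))  # 72
print(descriptor_length(GRID, HOF_BINS))  # 90
```

The same `cell_histogram` helper could be fed image gradients (for a HOG-like cell) or optical flow vectors (for a HOF-like cell); concatenating the histograms of all cells gives the full descriptor.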

  15. Visual Vocabulary: K-means clustering
      • Group similar points in the space of image descriptors using K-means clustering
      • Select significant clusters
      Figure: clustering into clusters c1, c2, c3, c4, followed by classification
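The clustering step can be sketched with Lloyd's algorithm, the standard K-means iteration. This is a minimal pure-Python illustration on toy 2D points; the function names, the fixed initial centers and the iteration count are assumptions, and a real vocabulary would be built from high-dimensional descriptors with an optimized library.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centers):
    """Index of the center closest to point."""
    return min(range(len(centers)), key=lambda i: squared_distance(point, centers[i]))


def kmeans(points, centers, n_iter=10):
    """Lloyd's algorithm: alternate nearest-center assignment and mean update."""
    centers = [list(c) for c in centers]
    for _ in range(n_iter):
        groups = [[] for _ in centers]
        for p in points:
            groups[nearest(p, centers)].append(p)
        for i, group in enumerate(groups):
            if group:  # keep the old center if a cluster received no points
                centers[i] = [sum(col) / len(group) for col in zip(*group)]
    return centers


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers = kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)])
print(centers)  # [[0.0, 0.5], [10.0, 10.5]]
```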

  16. Local features: Matching
      • Finds similar events in pairs of video sequences

  17. Bag of features
      • Cluster descriptors with k-means (~4000 clusters)
      • Assign each descriptor to the closest center
      • Measure frequency
      Figure: histogram of frequency over codewords
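The assignment and frequency steps can be sketched as follows, assuming the cluster centers are already available. The toy 2D centers and descriptors and the function names are hypothetical; the slide uses about 4000 clusters over HOG/HOF descriptors.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def bag_of_features(descriptors, centers):
    """Normalized histogram of nearest-center (codeword) assignments."""
    hist = [0] * len(centers)
    for d in descriptors:
        closest = min(range(len(centers)),
                      key=lambda i: squared_distance(d, centers[i]))
        hist[closest] += 1
    total = len(descriptors)
    return [h / total for h in hist] if total else [0.0] * len(centers)


centers = [(0.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
descriptors = [(1.0, 0.0), (0.5, 0.5), (9.0, 10.0), (0.0, 9.0)]
histogram = bag_of_features(descriptors, centers)
print(histogram)  # [0.5, 0.25, 0.25]
```

The resulting fixed-length histogram represents the whole video clip regardless of how many local features were detected, which is what makes it usable as input to a classifier.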

  18. Action classification results
      • KTH dataset
      • Hollywood-2 dataset, example classes: AnswerPhone, GetOutCar, HandShake, StandUp, DriveCar, Kiss
      [Laptev, Marszałek, Schmid, Rozenfeld 2008]

  19. Action classification
      Test episodes from movies “The Graduate”, “It’s a Wonderful Life”, “Indiana Jones and the Last Crusade”

  20. Improved descriptors: Dense trajectories
      • Dense sampling improves results over sparse interest points for image classification [Fei-Fei'05, Nowak'06]
      • Recent progress by using feature trajectories for action recognition [Messing'09, Sun'09]
      • The 2D space domain and 1D time domain in videos have very different characteristics
      • Dense trajectories: a combination of dense sampling with feature trajectories [Wang, Klaeser, Schmid & Liu, CVPR’11]

  21. Approach
      • Dense multi-scale sampling
      • Feature tracking over L frames with optical flow
      • Trajectory-aligned descriptors with a spatio-temporal grid

  22. Approach
      Dense sampling
      – remove untrackable points
      – based on the eigenvalues of the auto-correlation matrix
      Feature tracking
      – by median filtering in dense optical flow field
      – length is limited to avoid drifting
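The tracking step can be sketched as moving each point by the median of the flow vectors in a small window around it, and stopping after a fixed number of frames. This is a simplified illustration: the 3x3 window, the limit of 15 steps, the nested-list flow format and all names are assumptions, and the dense optical flow itself is taken as given.

```python
from statistics import median


def median_flow(flow, x, y, radius=1):
    """Median of each flow component in a window around integer pixel (x, y).

    flow[row][col] holds a (u, v) displacement; the window is clipped at borders.
    """
    h, w = len(flow), len(flow[0])
    window = [flow[j][i]
              for j in range(max(0, y - radius), min(h, y + radius + 1))
              for i in range(max(0, x - radius), min(w, x + radius + 1))]
    return median(u for u, _ in window), median(v for _, v in window)


def track(point, flows, max_length=15):
    """Follow a point through successive flow fields for at most max_length steps."""
    trajectory = [point]
    for flow in flows[:max_length]:
        x, y = trajectory[-1]
        xi, yi = int(round(x)), int(round(y))
        if not (0 <= yi < len(flow) and 0 <= xi < len(flow[0])):
            break  # the point left the frame
        u, v = median_flow(flow, xi, yi)
        trajectory.append((x + u, y + v))
    return trajectory


# Uniform motion of one pixel to the right, with a single outlier vector.
flow = [[(1.0, 0.0)] * 5 for _ in range(5)]
flow[2][2] = (9.0, 9.0)
print(track((2.0, 2.0), [flow, flow]))  # [(2.0, 2.0), (3.0, 2.0), (4.0, 2.0)]
```

The outlier at the starting pixel does not affect the track because the median ignores it, and capping the number of steps is one simple way to limit drift.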

  23. Feature tracking
      Figure panels: SIFT tracks, KLT tracks, dense tracks

  24. Trajectory descriptors
      • Motion boundary descriptor
        – spatial derivatives are calculated separately for optical flow in x and y, quantized into a histogram
        – captures relative dynamics of different regions
        – suppresses constant motion, such as that caused by background camera motion
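Why spatial derivatives of the flow suppress constant motion can be seen in a small sketch. It computes central-difference gradients of one flow component and sums their magnitudes; the histogram quantization mentioned on the slide is left out. The function names and the toy flow fields are assumptions for illustration.

```python
import math


def spatial_gradient(channel):
    """Central-difference (d/dx, d/dy) at the interior pixels of a 2D grid."""
    h, w = len(channel), len(channel[0])
    return [[((channel[y][x + 1] - channel[y][x - 1]) / 2.0,
              (channel[y + 1][x] - channel[y - 1][x]) / 2.0)
             for x in range(1, w - 1)]
            for y in range(1, h - 1)]


def boundary_energy(channel):
    """Sum of gradient magnitudes of one flow component."""
    return sum(math.hypot(gx, gy)
               for row in spatial_gradient(channel) for gx, gy in row)


# Horizontal flow: left half static, right half moving (a motion boundary).
object_flow = [[0.0, 0.0, 2.0, 2.0] for _ in range(4)]
# The same scene with a constant camera pan of 5 pixels added everywhere.
panned_flow = [[u + 5.0 for u in row] for row in object_flow]
# Pure camera pan with no moving object.
pan_only = [[5.0] * 4 for _ in range(4)]

print(boundary_energy(object_flow))  # 4.0
print(boundary_energy(panned_flow))  # 4.0
print(boundary_energy(pan_only))     # 0.0
```

Adding a constant to the flow leaves the derivatives unchanged, so only the boundary between regions that move differently contributes.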

  25. Trajectory descriptors
      • Trajectory shape described by normalized relative point coordinates
      • HOG, HOF and MBH are encoded along each trajectory
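One way to read "normalized relative point coordinates" is as the sequence of displacements between consecutive trajectory points, divided by the sum of their magnitudes. The sketch below follows that reading; the function name and the handling of a stationary trajectory are assumptions.

```python
import math


def trajectory_shape(points):
    """Consecutive displacements, divided by the sum of their magnitudes."""
    deltas = [(x1 - x0, y1 - y0)
              for (x0, y0), (x1, y1) in zip(points, points[1:])]
    norm = sum(math.hypot(dx, dy) for dx, dy in deltas)
    if norm == 0:
        return [0.0] * (2 * len(deltas))  # stationary trajectory
    return [c / norm for delta in deltas for c in delta]


track_a = [(0.0, 0.0), (3.0, 4.0), (3.0, 4.0), (6.0, 8.0)]
# Same motion, shifted by (10, 10) and doubled in size.
track_b = [(10.0 + 2 * x, 10.0 + 2 * y) for x, y in track_a]

print(trajectory_shape(track_a))
print(trajectory_shape(track_b))
```

Using displacements removes the dependence on where the trajectory starts, and the normalization removes the dependence on its overall scale, so both tracks above give the same descriptor up to rounding.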

  26. Experimental setup
      • Bag-of-features with 4000 clusters obtained by k-means, classification by non-linear SVM with RBF + chi-square kernel
        – also possible to use Fisher vector + linear SVM
      • Descriptors are combined by addition of distances
      • Evaluation on two datasets: UCF Sports (classification accuracy) and Hollywood2 (mean average precision)
      • Two baseline trajectories: KLT and SIFT
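Combining descriptors "by addition of distances" inside an RBF chi-square kernel can be sketched as K(x, y) = exp(−Σ_c D_c(x, y) / A_c), where D_c is the chi-square distance between the histograms of channel c and A_c is a per-channel normalizer, commonly the mean distance over the training set. The channel names, the toy histograms and the A_c values below are hypothetical.

```python
import math


def chi2_distance(p, q):
    """Chi-square distance between two histograms (empty bin pairs are skipped)."""
    return 0.5 * sum((a - b) ** 2 / (a + b) for a, b in zip(p, q) if a + b > 0)


def multichannel_kernel(x, y, mean_distances):
    """exp(-sum_c D_c(x, y) / A_c) over the channels listed in mean_distances.

    x, y: dicts mapping channel name -> histogram.
    mean_distances: dict mapping channel name -> normalizer A_c (assumed given).
    """
    total = sum(chi2_distance(x[c], y[c]) / mean_distances[c]
                for c in mean_distances)
    return math.exp(-total)


video_a = {"hog": [0.5, 0.5], "hof": [1.0, 0.0]}
video_b = {"hog": [0.5, 0.5], "hof": [0.0, 1.0]}
normalizers = {"hog": 1.0, "hof": 2.0}

print(multichannel_kernel(video_a, video_a, normalizers))  # 1.0
print(multichannel_kernel(video_a, video_b, normalizers))
```

A kernel matrix computed this way could be passed to an SVM that accepts precomputed kernels; dividing by A_c keeps one channel from dominating the sum only because its distances happen to be larger.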

  27. UCF Sports
      • 10 action classes, videos from TV broadcasts
      Example classes: Diving, Kicking, Skateboarding, High-Bar-Swinging

  28. Comparison of descriptors

      Descriptor    Hollywood2   UCF Sports
      Trajectory    47.8%        75.4%
      HOG           41.2%        84.3%
      HOF           50.3%        76.8%
      MBH           55.1%        84.2%
      Combined      58.2%        88.0%

      • Trajectory descriptor performs well
      • HOF >> HOG for Hollywood2, dynamic information is relevant
      • HOG >> HOF for sports datasets, spatial context is relevant
      • MBH consistently outperforms HOF, robust to camera motion

  29. Comparison of trajectories

      Trajectory + descriptor   Hollywood2   UCF Sports
      Dense trajectory + MBH    55.1%        84.2%
      KLT trajectory + MBH      48.6%        78.4%
      SIFT trajectory + MBH     40.6%        72.1%

      • Dense >> KLT >> SIFT trajectories

  30. Improved trajectories (Wang & Schmid ICCV’13)
      • Dense trajectories impacted by camera motion
        – Stabilize camera motion before computing optical flow
        – Use human detector and robust homography estimation
        – Warp optical flow and remove background trajectories
      (student presentation)
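The idea of removing background trajectories can be sketched in a simplified form: given a homography that describes the camera motion between consecutive frames, subtract the motion it predicts from each trajectory step and discard trajectories whose residual motion stays small. This is an illustration under stated assumptions, not the authors' pipeline: the homographies are taken as given rather than robustly estimated, the residual is computed per trajectory point instead of by warping the flow field, and the 1-pixel threshold and all names are hypothetical.

```python
import math


def apply_homography(H, point):
    """Map a 2D point through a 3x3 homography given as nested lists."""
    x, y = point
    xh = H[0][0] * x + H[0][1] * y + H[0][2]
    yh = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return xh / w, yh / w


def residual_displacements(trajectory, homographies):
    """Per-step displacement left after subtracting camera-induced motion.

    homographies[t] is assumed to map frame t coordinates to frame t+1.
    """
    residuals = []
    for t in range(len(trajectory) - 1):
        px, py = apply_homography(homographies[t], trajectory[t])
        ax, ay = trajectory[t + 1]
        residuals.append((ax - px, ay - py))
    return residuals


def is_background(trajectory, homographies, threshold=1.0):
    """True if every residual displacement is shorter than threshold pixels."""
    return all(math.hypot(dx, dy) < threshold
               for dx, dy in residual_displacements(trajectory, homographies))


pan = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # camera shifts image by (2, 0)
background = [(0.0, 0.0), (2.0, 0.0), (4.0, 0.0)]   # moves exactly with the camera
foreground = [(0.0, 0.0), (5.0, 1.0), (10.0, 2.0)]  # has its own motion

print(is_background(background, [pan, pan]))  # True
print(is_background(foreground, [pan, pan]))  # False
```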

  31. Results

  32. Results

  33. Excellent results in TrecVid MED’13
      • Combination of MBH, SIFT, audio, text & speech recognition
      • First in the known event challenge, first in the ad hoc event challenge
      Making sandwich – results: Rank 1 (pos), Rank 20 (pos), Rank 21 (neg)

  34. Excellent results in TrecVid MED’13
      FlashMob gathering – results: Rank 1 (pos), Rank 18 (pos), Rank 19 (neg)

  35. Impact of different channels
