Computer Vision by Learning: Motion in Action (Jan van Gemert, UvA)

SLIDE 1

Computer Vision by Learning: Motion in Action

Jan van Gemert, UvA

SLIDE 2

Motion and perceptual organization

  • Even “impoverished” motion data can evoke a strong percept

SLIDE 3

Motion and perceptual organization

  • Even “impoverished” motion data can evoke a strong percept

SLIDE 4

Uses of motion

  • Estimating 3D structure
  • Segmenting objects based on motion cues
  • Learning dynamical models
  • Improving video quality (motion stabilization)
  • Recognizing actions, activities, events
SLIDE 5

Action Recognition Pipeline

  • Spatio-temporal interest point detection
  • Space-time patch/trajectory extraction
  • Space-time descriptor computation

Similar setup as in static image classification, followed by Bag-of-Words/Fisher vector encoding and an SVM.

SLIDE 6

Measuring Motion

Two perspectives, borrowed from fluid dynamics: Lagrangian and Eulerian

SLIDE 7

1. Lagrangian Perspective: Optical Flow

  • Track each pixel as it moves through the video

SLIDE 8

Problem definition: optical flow

How to estimate pixel motion from image H to image I?

  • Solve pixel correspondence problem

– given a pixel in H, look for nearby pixels of the same color in I

Key assumptions

  • color constancy: a point in H looks the same in I

– For grayscale images, this is brightness constancy

  • small motion: points do not move very far

This is called the optical flow problem

[Lucas-Kanade, 1981]
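The two assumptions above (brightness constancy, small motion) lead directly to the Lucas-Kanade least-squares estimate. A minimal numpy sketch for a single patch; the function name and synthetic test pattern are illustrative, not the original 1981 code:

```python
import numpy as np

def lucas_kanade_patch(H, I):
    """Estimate one (vx, vy) for a small patch, assuming brightness
    constancy and small motion (hypothetical helper, not the 1981 code)."""
    Ix = np.gradient(H, axis=1)   # spatial derivatives of frame H
    Iy = np.gradient(H, axis=0)
    It = I - H                    # temporal derivative
    # normal equations of the least-squares problem Ix*vx + Iy*vy + It = 0
    A = np.array([[np.sum(Ix * Ix), np.sum(Ix * Iy)],
                  [np.sum(Ix * Iy), np.sum(Iy * Iy)]])
    b = -np.array([np.sum(Ix * It), np.sum(Iy * It)])
    return np.linalg.solve(A, b)

# synthetic test: shift a smooth pattern by one pixel in x
x, y = np.meshgrid(np.arange(32, dtype=float), np.arange(32, dtype=float))
H = np.sin(0.3 * x) + np.cos(0.2 * y)
I = np.sin(0.3 * (x - 1)) + np.cos(0.2 * y)   # pattern moved 1 px right
v = lucas_kanade_patch(H, I)                   # v[0] close to 1, v[1] close to 0
```

Note how the small-motion assumption enters: the one-pixel shift must be well approximated by the first-order Taylor expansion of H.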

SLIDE 9

Visualizing optical flow

Color Legend

(Hue is an angular color space)

Q: What do you notice?

  • Camera Motion
  • Parallax
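The hue legend above can be sketched in a few lines: flow direction maps to hue (an angular quantity), flow magnitude to saturation. This is a generic mapping in the spirit of the legend, not the exact colormap of any particular paper:

```python
import colorsys
import numpy as np

def flow_to_color(vx, vy):
    """Map one flow vector to RGB: direction -> hue, magnitude -> saturation."""
    angle = np.arctan2(vy, vx)               # flow direction in radians
    hue = (angle + np.pi) / (2 * np.pi)      # wrap direction to [0, 1]
    mag = min(np.hypot(vx, vy), 1.0)         # clip magnitude to [0, 1]
    return colorsys.hsv_to_rgb(hue, mag, 1.0)

# zero flow is white; opposite directions land on opposite hues
r_right = flow_to_color(1.0, 0.0)
r_left = flow_to_color(-1.0, 0.0)
```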
SLIDE 10

Optical flow and parallax

(figure: a 3D point P(t) with velocity V, and its image projection p(t) with apparent velocity v)

  • P(t) is a moving 3D point
  • Velocity of the scene point: V = dP/dt
  • p(t) = (x(t), y(t)) is the projection of P in the image
  • Apparent velocity v in the image: components vx = dx/dt and vy = dy/dt
  • Length of v is inversely proportional to the depth Z of the 3D point
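The inverse dependence on depth Z is easy to check numerically with a pinhole model: image position is f·X/Z, so a camera translation t_x shifts a point's projection by f·t_x/Z. All numbers below are toy values:

```python
def image_shift(X, Z, t_x, f=1.0):
    """Image displacement of a static 3D point at depth Z when the camera
    translates by t_x (pinhole model, focal length f; toy numbers)."""
    before = f * X / Z
    after = f * (X - t_x) / Z   # camera moving right = point moving left in camera frame
    return after - before        # equals -f * t_x / Z

near = image_shift(X=2.0, Z=5.0, t_x=0.5)    # nearby point
far = image_shift(X=2.0, Z=50.0, t_x=0.5)    # same camera motion, 10x the depth
# |near| is exactly 10x |far|: parallax is inversely proportional to depth
```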
SLIDE 11

Optical flow and camera motion

Q: Name the camera motion:

(flow fields shown: zoom out, zoom in, pan right to left)

SLIDE 12

Motion boundaries

[Dalal, eccv06]

(panels: video frame; optical flow (quivers); spatial gradient; optical flow (hue); horizontal and vertical motion boundaries)

Color legend

  • Motion boundaries are invariant to constant camera motion

Q: What do you notice?

SLIDE 13

Animated Example

(panels: video frame; optical flow (quivers); spatial gradient; optical flow (hue); horizontal and vertical motion boundaries)

Color legend

Q: What do you notice?

Motion boundaries are the spatial gradients of the x and y flow images

  • Similar properties as the spatial gradient
  • No motion: motion boundaries disappear
  • Parallax
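Since motion boundaries are the spatial gradients of the x and y flow images, they take two lines of numpy. The toy flow below also checks the invariance property from the previous slide: adding constant camera motion leaves the boundaries unchanged:

```python
import numpy as np

# Toy flow field: left half of the image static, right half moving with vx = 2
vx = np.zeros((20, 20))
vx[:, 10:] = 2.0

# motion boundaries = spatial gradients of the flow channel
gx_y, gx_x = np.gradient(vx)               # d(vx)/dy, d(vx)/dx
boundary_strength = np.hypot(gx_x, gx_y)   # peaks where the motion changes

# adding the same constant camera motion everywhere changes nothing
gx_y2, gx_x2 = np.gradient(vx + 5.0)
boundary_strength_cam = np.hypot(gx_x2, gx_y2)
```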
SLIDE 14

Modeling camera motion

[wang, iccv13] [jain, cvpr13]

(panels: video frame; optical flow; camera motion subtracted)

Remove global motion:
  • 1. Globally align the frames (assumes a homography)
  • 2. Compute optical flow on the aligned frames

(panels: video frame; camera motion subtracted; with human detector; camera motion subtracted)

Subtract background motion:
  • Assume the background is where the human is not
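A cheap variant of the global-motion removal above can be sketched by fitting a global affine motion model to the flow by least squares and subtracting it; what remains is independent motion. This is a simplification (the slide assumes a homography), and all values are synthetic:

```python
import numpy as np

h, w = 40, 40
ys, xs = np.mgrid[0:h, 0:w].astype(float)

# synthetic flow: camera pans (constant flow) plus one independently moving patch
vx = np.full((h, w), 3.0)
vy = np.full((h, w), -1.0)
vx[10:15, 10:15] += 4.0   # object moving relative to the background

# design matrix for a per-component affine model v = a0 + a1*x + a2*y
A = np.column_stack([np.ones(h * w), xs.ravel(), ys.ravel()])
px, _, _, _ = np.linalg.lstsq(A, vx.ravel(), rcond=None)
py, _, _, _ = np.linalg.lstsq(A, vy.ravel(), rcond=None)

# residual flow = original flow minus fitted global (camera) motion
residual_x = vx - (A @ px).reshape(h, w)
residual_y = vy - (A @ py).reshape(h, w)
# the residual is large only on the independently moving patch
```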

SLIDE 15

Flow trajectory descriptors

[Wang, ijcv13]

SLIDE 16

2. Eulerian Perspective: Stationary

  • Treat each pixel as a time series through the video

SLIDE 17

Spatio-Temporal Interest Points (STIP)

[Laptev, ijcv05]

Spatio-Temporal Harris Corners

SLIDE 18
  • Spatio-temporal extension of the Hessian blob detector
  • Strength S of the interest point computed with the determinant of the Hessian matrix H
  • Approximations with integral videos

Spatio-Temporal Blobs (Hes-STIP)

[Willems, eccv08]

SLIDE 19

Periodic Interest Points (Cuboids)

Response combines a 2D Gaussian smoothing kernel g in space with a quadrature pair of 1D Gabor filters h_ev, h_od applied temporally:

R = (I * g * h_ev)^2 + (I * g * h_od)^2

[Dollár, vspets05]

Beyond spatio-temporal corners
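The temporal part of this periodic detector can be sketched for a single pixel's time series: filter with an even/odd Gabor pair and sum the squared responses. The spatial Gaussian smoothing is omitted here, and the tau/omega values are illustrative, not those of the paper:

```python
import numpy as np

def gabor_pair(tau=1.5, omega=0.25, half=8):
    """Quadrature pair of 1D Gabor filters (illustrative parameters)."""
    t = np.arange(-half, half + 1, dtype=float)
    env = np.exp(-t**2 / tau**2)
    return np.cos(2 * np.pi * omega * t) * env, np.sin(2 * np.pi * omega * t) * env

def periodic_response(signal, tau=1.5, omega=0.25):
    """Detector response for one pixel's time series (spatial smoothing omitted)."""
    h_ev, h_od = gabor_pair(tau, omega)
    r_ev = np.convolve(signal, h_ev, mode="same")
    r_od = np.convolve(signal, h_od, mode="same")
    return r_ev**2 + r_od**2   # quadrature energy: high for periodic motion

t = np.arange(200, dtype=float)
periodic = np.sin(2 * np.pi * 0.25 * t)   # flickering/periodic intensity
constant = np.ones(200)                    # static pixel
# the periodic signal gets a much stronger response than the static one
```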

SLIDE 20

Dense Sampling

  • Motivation: dense sampling outperforms interest points for object recognition
  • Extract 3D cubes at regular positions (x, y, t) with varying scales
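Dense sampling itself is just a regular grid over the video volume. A minimal sketch (a single scale; patch size and stride are illustrative):

```python
import numpy as np

def dense_cubes(video, size=8, stride=4):
    """Extract 3D patches on a regular (t, y, x) grid from a video volume."""
    T, H, W = video.shape
    cubes = []
    for t in range(0, T - size + 1, stride):
        for y in range(0, H - size + 1, stride):
            for x in range(0, W - size + 1, stride):
                cubes.append(video[t:t + size, y:y + size, x:x + size])
    return np.stack(cubes)

video = np.random.rand(16, 24, 24)   # (time, height, width)
cubes = dense_cubes(video)           # 3 * 5 * 5 = 75 cubes of shape (8, 8, 8)
```

In practice the loop is repeated over several spatial and temporal scales.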

SLIDE 21

Spatio-Temporal Gradient Descriptor (HOG3D)

[Kläser, bmvc08]

  • 2D HOG/SIFT: gradient orientations quantized over a polygon
  • HOG3D: quantization of the 3D gradient over a polyhedron (a 3D extension of HOG/SIFT)
  • Extensions to color: concatenation, or integration via tensors [Everts, cvpr13]

SLIDE 22

3. Action Recognition

  • Automatically recognizing actions, activities, events
  • Learn from training data
  • Apply on unseen test data
SLIDE 23

Video Coding

  • Create a feature vocabulary
    – k-means (BoW)
    – GMM (Fisher)
  • Assign features to the vocabulary
    – Hard assignment (BoW)
    – Vector differences (Fisher)
  • Aggregate over the whole video
    – Spatio-temporal pyramid
  • Classifier (SVM)
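The bag-of-words branch of this coding step can be sketched in a few lines: hard-assign each local descriptor to its nearest vocabulary word and build a normalized histogram. The vocabulary below is a toy stand-in for real k-means centers:

```python
import numpy as np

def bow_encode(descriptors, vocabulary):
    """Hard-assignment bag-of-words: L1-normalized word histogram."""
    # pairwise distances, shape (n_descriptors, n_words)
    d = np.linalg.norm(descriptors[:, None, :] - vocabulary[None, :, :], axis=2)
    words = d.argmin(axis=1)                                  # hard assignment
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return hist / hist.sum()                                  # normalize per video

vocab = np.array([[0.0, 0.0], [10.0, 10.0]])                  # toy 2-word vocabulary
desc = np.array([[0.1, 0.2], [9.8, 10.1], [0.0, 0.3], [10.2, 9.9]])
h = bow_encode(desc, vocab)   # two descriptors per word: [0.5, 0.5]
```

The Fisher vector replaces the hard assignment with soft GMM posteriors and aggregates first- and second-order differences to the centers instead of counts.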

SLIDE 24

Detectors and Descriptors

Pipeline: interest point detection, space-time patch/trajectory, space-time descriptor.

Detectors: Dense, Harris3D STIP, Hessian STIP, Cuboids, KLTtraj, DenseTraj
Descriptors: HOG3D, HOG, HOF, MBH
Camera modeled: yes/no

SLIDE 25

Action Recognition Datasets

  • HMDB51, 51 classes, 6,766 vids: body motion, facial expressions, human interactions
  • UCF50, 50 classes, 6,618 vids: sports, daily exercises
  • Hollywood2, 12 classes, 1,707 vids: movie actions

SLIDE 26

Results Hollywood2

(mAP in %)

ref              Detector        HOG3D  HOG   HOF
[wang,bmvc09]    Harris3D STIP   43.7   32.8  43.3
[wang,bmvc09]    Cuboids         45.7   39.4  42.9
[wang,bmvc09]    Hessian STIP    41.3   36.2  43
[wang,bmvc09]    Dense           45.3   39.4  45.5
[klaser,bmvc08]  Cuboids         48.6   38.2  43.8

Take-away: motion (HOF) is important.

SLIDE 27

Results Hollywood2

(mAP in %)

ref              Detector        HOG3D  HOG   HOF   MBH
[wang,bmvc09]    Harris3D STIP   43.7   32.8  43.3  -
[wang,bmvc09]    Cuboids         45.7   39.4  42.9  -
[wang,bmvc09]    Hessian STIP    41.3   36.2  43    -
[wang,bmvc09]    Dense           45.3   39.4  45.5  -
[klaser,bmvc08]  Cuboids         48.6   38.2  43.8  -
[wang,ijcv13]    Dense           -      43.3  48    52.1
[wang,ijcv13]    KLTtraj         -      41    48.4  48.6
[wang,ijcv13]    DenseTraj       -      41.2  50.3  55.1
[wang,ijcv13]    Harris3D STIP   -      40.4  44.9  -
[jain,cvpr13]    DenseTrajCam    -      45.6  54.1  54.2

Take-aways: motion is important; camera motion invariance helps; dense trajectories.

SLIDE 28

Results Hollywood2

(mAP in %)

ref              Detector            HOG3D  HOG   HOF   MBH
[wang,bmvc09]    Harris3D STIP       43.7   32.8  43.3  -
[wang,bmvc09]    Cuboids             45.7   39.4  42.9  -
[wang,bmvc09]    Hessian STIP        41.3   36.2  43    -
[wang,bmvc09]    Dense               45.3   39.4  45.5  -
[klaser,bmvc08]  Cuboids             48.6   38.2  43.8  -
[wang,ijcv13]    Dense               -      43.3  48    52.1
[wang,ijcv13]    KLTtraj             -      41    48.4  48.6
[wang,ijcv13]    DenseTraj           -      41.2  50.3  55.1
[wang,ijcv13]    Harris3D STIP       -      40.4  44.9  -
[jain,cvpr13]    DenseTrajCam        -      45.6  54.1  54.2
[oneata,iccv13]  DenseTrajFisher     -      42.5  -     61.9
[wang,iccv13]    DenseTrajFisher     -      46.9  51.4  57.4
[wang,iccv13]    DenseTrajFisherCam  -      47.1  58.8  60.5

Take-aways: motion is important; camera motion invariance; Fisher vectors; dense trajectories.

SLIDE 29

Results UCF50 and HMDB51

UCF50 (accuracy in %):

ref              Detector            HOG3D  HOG   HOF   MBH
[everts,cvpr13]  Cuboids             68.3   -     -     -
[everts,cvpr13]  CuboidsColor        72.9   -     -     -
[wang,ijcv13]    Dense               -      64.4  65.9  78.3
[wang,ijcv13]    KLTtraj             -      57.4  57.9  71.1
[wang,ijcv13]    DenseTraj           -      68    68.2  82.2
[shi,cvpr13]     Dense10k            72.4   58.6  69.7  80.1
[oneata,iccv13]  DenseTrajFisher     -      76.3  -     87.8
[wang,iccv13]    DenseTrajFisher     -      81.8  74.3  86.5
[wang,iccv13]    DenseTrajFisherCam  -      82.6  85.1  88.9

HMDB51 (accuracy in %):

ref              Detector            HOG3D  HOG   HOF   MBH
[wang,ijcv13]    Dense               -      25.2  29.4  40.9
[wang,ijcv13]    KLTtraj             -      22.2  23.7  33.7
[wang,ijcv13]    DenseTraj           -      27.9  31.5  43.2
[shi,cvpr13]     Dense10k            34.7   21    33.5  43
[oneata,iccv13]  DenseTrajFisher     -      34.8  -     51.9
[jain,cvpr13]    DenseTrajCam        -      29.1  38.6  40.9
[wang,iccv13]    DenseTrajFisher     -      38.4  39.5  49.1
[wang,iccv13]    DenseTrajFisherCam  -      40.2  48.9  52.1

Take-aways for both datasets: dense trajectories, camera motion invariance, and Fisher vectors.

SLIDE 30

Reflection

  • Detector: dense (trajectories)
  • Descriptor: camera motion invariance
    – MBH descriptor
  • Fisher vector
  • Ignored:
    – Combinatorics of HOG+HOF+MBH (muddies the analysis)
    – Human pose modeling literature
    – Deep learning, which still performs below the state of the art
  • Gaps:
    – Fisher on HOG3D?
    – Camera motion invariance for Eulerian methods?
    – Parallax?

SLIDE 31

4. Action Localisation

Goal: finding actions in videos; where, when, and what is happening (a tube)
Challenges: exponential search space, occlusion, motion, non-rigid deformations
Applications: video indexing, security, sport statistics, animal monitoring, elderly safety, marketing research

SLIDE 32

Inspired by Object Localization in Static Images

  • Sliding Window: image [Rowley, pami98], video [Rodriguez, cvpr08]
  • Branch and Bound: image [Lampert, pami09], video [Yuan, pami11]
  • Deformable Parts: image [Felzenswalb, pami10], video [Tian, cvpr13]
  • Boosting Cascade: image [violaJones, ijcv04], video [ke, iccv05]

SLIDE 33

Selective Search for Static Image Object Localization

[Uijlings, ijcv13]

  • Object hypotheses based on hierarchical grouping of super-pixels
  • High recall with a modest number of object hypotheses
  • Can afford to train an expensive classifier per hypothesis

Q: How would you extend it to video?

SLIDE 34

Selective Search for Action Localisation in Video

[Xu, cvpr12]

  • 1. Super-voxel video segmentation with high boundary recall
  • 2. Tubelet hypothesis generation by merging independent cues
  • 3. Tubelet classification based on MBH and an SVM

[Jain, cvpr14]

SLIDE 35

Example of Super-Voxel Segmentation

SLIDE 36

Example of Super-Voxel Segmentation

SLIDE 37

Merging Super-Voxels

Tubelet hypothesis generation by merging independent cues:

– Color: HSV histograms – Texture: HOG – Motion: Independent motion, ML-estimate of affine camera deviation – Size: Smallest-first – Fill: Joined_cuboid – voxel_pair

37

SLIDE 38

Experiments: Datasets

  • UCF-Sports: 150 vids, 10 actions
  • MSR-II: 54 vids, 3 actions (boxing, hand-clapping, hand-waving), unconstrained temporal location

SLIDE 39

Experiment 1: Tubelet Recall

Overlap: Avg Best Overlap (ABO): MABO = mean ABO over all classes

39
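The overlap underlying ABO/MABO can be sketched as the average per-frame intersection-over-union between two tubes. This is a common definition, the exact formula in the slides may differ, and `box_iou`/`tube_overlap` are hypothetical helper names:

```python
import numpy as np

def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def tube_overlap(tube_a, tube_b):
    """Tube overlap = mean per-frame box IoU over corresponding frames."""
    return float(np.mean([box_iou(a, b) for a, b in zip(tube_a, tube_b)]))

t1 = [(0, 0, 10, 10), (1, 1, 11, 11)]       # 2-frame tube
t2 = [(0, 0, 10, 10), (1, 1, 11, 11)]       # identical tube
t3 = [(20, 20, 30, 30), (20, 20, 30, 30)]   # spatially disjoint tube
o_same = tube_overlap(t1, t2)   # 1.0
o_diff = tube_overlap(t1, t3)   # 0.0
```

ABO then takes, per ground-truth instance, the maximum of this score over all hypotheses, and MABO averages the per-class ABO values.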

SLIDE 40

Experiment 2: Action Localisation

(figure: precision/recall curves for UCF-Sports and MSR-II)

SLIDE 41

Example Detections

SLIDE 42

Summary

  • Lagrangian perspective: optical flow
    – Tracking points, dense trajectories
    – Camera motion (MBH)
  • Eulerian perspective: temporal derivatives
    – Dense sampling, STIPs, Cuboids, HOG3D
  • Action recognition
    – Bag of words / Fisher vectors
  • Action localisation
    – Temporal extensions of 2D localisation
    – Temporal selective search

SLIDE 43

Referred Literature

  • [dalal, eccv06]: Human detection using oriented histograms of flow and appearance
  • [dollár, vspets05]: Behavior recognition via sparse spatio-temporal features
  • [everts, cvpr13]: Evaluation of Color STIPs for Human Action Recognition
  • [Felzenswalb, pami10]: Object detection with discriminatively trained part-based models
  • [jain,cvpr13]: Better exploiting motion for better action recognition
  • [jain, CVPR14]: Action Localization by Tubelets from Motion
  • [kläser, bmvc08]: A spatio-temporal descriptor based on 3d-gradients
  • [ke, iccv05]: Efficient Visual Event Detection using Volumetric Features
  • [lampert, pami09]: Efficient subwindow search: a branch and bound framework for object localization
  • [laptev, ijcv05]: On space-time interest points
  • [lucas-kanade, 1981]: An iterative image registration technique with an application to stereo vision
  • [oneata,iccv13]: Action and Event Recognition with Fisher Vectors on a Compact Feature Set
  • [rodriguez, cvpr08]: Action mach: a spatio-temporal maximum average correlation height filter for action recognition
  • [rowley, pami98]: Neural network-based face detection
  • [shi,cvpr13]: Sampling strategies for real-time action recognition
  • [Tian, cvpr13]: Spatiotemporal deformable part models for action detection
  • [Uijlings, ijcv13]: Selective search for object recognition
  • [violaJones, ijcv04]: Robust real-time face detection
  • [wang,bmvc09]: Evaluation of local spatio-temporal features for action recognition
  • [wang,iccv13]: Action recognition with improved trajectories
  • [wang, ijcv13]: Dense trajectories and motion boundary descriptors for action recognition
  • [willems,eccv08]: An efficient dense and scale-invariant spatio-temporal interest point detector
  • [xu, cvpr12]: Evaluation of Super-Voxel Methods for Early Video Processing
  • [yuan, pami11]: Discriminative video pattern search for efficient action detection


SLIDE 44

Questions?