SLIDE 1

BIL 722: Advanced Topics in Computer Vision

Deep Structured Models For Group Activity Recognition

Deng et al., BMVC 2015. Simon Fraser University, SPORTLOGIQ, Canada

Mehmet Kerim Yücel

SLIDE 2

Brunel University London

Overview

25 April 2016


Deep Structured Models For Group Activity Recognition

  • Deng, Zhiwei et al., BMVC 2015
  • Simon Fraser University, SPORTLOGIQ, Canada

Useful links

  • Paper: http://arxiv.org/pdf/1506.04191.pdf

  • Code / BMVC presentation not available

SLIDE 3

Overview

  • Individual / group activity recognition problem
  • Combines atomic action information with the dependencies between actions
  • Uses deep CNNs to learn atomic actions / scene labels
  • Then refines these labels with a graphical model implemented as a neural network
  • State-of-the-art results achieved on the Collective Activity and Nursing Home datasets

SLIDE 4

Overview

  • Major contributions
  • First work to combine CNNs and graphical models for group activity recognition
  • Message passing implemented by neural networks
  • Results comparable to the state of the art

SLIDE 5

Literature Review

  • Event understanding is a notoriously hard problem
  • Requires spot-on information about atomic actions
  • Such actions include walking, running, waving, etc.
  • Hand-crafted features (HOG, MBH, improved dense trajectories) in the context of BoW [Wang, Heng, and Cordelia Schmid]
  • These are then fed into a discriminative or generative model
  • Such pipelines have been swept aside by deep learning approaches [Karpathy, Andrej et al.] [Simonyan, Karen, and Andrew Zisserman]

SLIDE 6

Literature Review

  • Action Recognition with Improved Trajectories [1]
  • Improves dense trajectories by estimating camera motion
  • Removes trajectories consistent with the estimated camera motion
  • Cancels out camera motion from the optical flow

SLIDE 7

Literature Review

  • Large-scale Video Classification with Convolutional Neural Networks [2]
  • CNN variants experimented with, taking the time domain into account

SLIDE 8

Literature Review

  • Two-stream Convolutional Networks for Action Recognition in Videos [3]
  • Spatial stream net trained on single frames
  • Temporal stream net trained on optical flow

SLIDE 9

Literature Review

  • Event understanding is a notoriously hard problem
  • We need interactions between individuals and higher-level information
  • Such interactions and high-level activities lend themselves to a hierarchical structure
  • Rich features to capture context and social cues [Lan, Tian, Leonid Sigal, and Greg Mori]
  • Hierarchical graphical models [Amer, Mohamed Rabie, Peng Lei, and Sinisa Todorovic]
  • Dynamic Bayesian networks [Zhu, Yingying, Nandita Nayak, and Amit Roy-Chowdhury]

SLIDE 10

Literature Review

  • Combination of convolutional neural nets with graphical models
  • Tompson, Jonathan J., et al.: one-step message passing implemented as a convolution operation, incorporating spatial relations between local responses for human body pose estimation
  • Deng, Jia, et al.: relations between predicted labels considered by training a GM on top of a neural net, with joint training

SLIDE 11

Problem Statement & Motivation

  • Motivation of this work
  • Further the state of the art in group activity recognition
  • Accurately detect atomic actions / scene labels
  • Incorporate dependencies between labels for actions / activities
  • Perform label refinement through a hierarchical structure incorporating said dependencies, using a CNN and a hierarchical graphical model based on a neural net that mimics message passing

SLIDE 12

Graphical Models in a Neural Network

  • Graphical models...
  • Define a joint distribution over the states of a set of nodes
  • Take a factor graph;
  • Inference is done by belief propagation
  • At each step of message passing, belief propagation collects relevant information from the nodes connected to a factor node, then passes these messages on to the variable nodes
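The message-passing scheme described above can be sketched on a toy two-variable factor graph (illustrative Python, not the paper's model; variable names and potentials are made up for the example):

```python
# Minimal sum-product message passing: variables A and B joined by one
# pairwise factor, each with a unary potential over its two states.

def marginal_b(unary_a, unary_b, pairwise):
    """Belief over B's states after one pass A -> factor -> B."""
    # Message from variable A to the factor is A's unary potential.
    msg_a_to_f = unary_a
    # Factor-to-variable message: sum out A for each state of B.
    msg_f_to_b = [
        sum(pairwise[a][b] * msg_a_to_f[a] for a in range(len(unary_a)))
        for b in range(len(unary_b))
    ]
    # Belief at B = unary potential * incoming message, normalized.
    belief = [u * m for u, m in zip(unary_b, msg_f_to_b)]
    z = sum(belief)
    return [v / z for v in belief]

print(marginal_b([0.9, 0.1], [0.5, 0.5], [[0.8, 0.2], [0.3, 0.7]]))
# -> [0.75, 0.25] (up to floating-point rounding)
```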

SLIDE 13

Graphical Models in a Neural Network

  • Key point: mimic message passing using a neural network!
  • Represent each combination of states as a neuron (factor neuron)
  • A factor neuron can learn dependencies between states and pass messages
  • Various neuron types can be adopted (linear, ReLU, TanH, etc.)
  • Parameter sharing due to GM integration into the NN; fewer free parameters
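A single factor neuron can be sketched as follows (illustrative only; the weights and the TanH choice are assumptions consistent with the neuron types listed above):

```python
import math

def factor_neuron(weights, state_scores):
    """One factor neuron: it is assigned to one specific combination of
    states (e.g. one scene label, one action label, one pose label) and
    scores that combination via a weighted sum and a TanH nonlinearity."""
    z = sum(w * s for w, s in zip(weights, state_scores))
    return math.tanh(z)

# Scores for one (scene, action, pose) state combination:
print(factor_neuron([0.5, 1.0, -0.3], [2.0, 1.5, 0.8]))
```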

SLIDE 14

Graphical Models in a Neural Network

(figure-only slide)

SLIDE 15

Message Passing CNN Architecture

  • Key point: two-stage architecture
  • First stage: fine-tuned CNNs that produce scene scores for a frame, and action and pose scores for each person in that frame
  • Second stage: a message passing neural network that captures label dependencies

SLIDE 16

Message Passing CNN Architecture

(figure-only slide)

SLIDE 17

Message Passing CNN Architecture

  • First stage: three separate CNNs, for scene, action and pose information
  • All are fine-tuned from an AlexNet architecture trained on ImageNet
  • Quite similar architectures, except pooling is done before normalization
  • Five convolutional layers, two FC layers with a softmax output
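As a reminder of what those softmax outputs look like, here is the standard conversion from raw final-layer scores to label probabilities (a generic sketch, not the authors' code; the example scores are made up):

```python
import math

def softmax(scores):
    """Convert raw FC-layer scores into class probabilities
    (numerically stabilized by subtracting the max score)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# e.g. raw scores for five labels:
probs = softmax([2.0, 1.0, 0.1, -1.0, 0.5])
print(probs)  # five probabilities summing to 1
```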

SLIDE 18

Message Passing CNN Architecture

  • Second stage: outputs of the first stage taken as input
  • Can contain several steps of message passing
  • In each step, two types of passes occur:
  • from the outputs of step k-1 to the factor layer
  • from the factor layer to the step-k outputs

SLIDE 19

Message Passing CNN Architecture

  • Second stage:
  • In the k-th message passing step, the first pass computes dependencies between the states
  • Inputs to this step:
  • The first term is the scene score of image I for label g
  • The second term is the action score of person Im for label h
  • The third term is the pose score of person Im for label z

SLIDE 20

Message Passing CNN Architecture

  • Second stage:
  • In the factor layer, interactions of pose, action and scene are calculated using a parameter template;
  • αg,h,z is the 3-D parameter template for the combination of scene g, action h and pose z
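A hedged sketch of that factor-layer computation (the function name, shapes, and all-ones template values are my assumptions; the paper defines the precise form):

```python
import math

def factor_layer(scene, action, pose, alpha):
    """phi[(g, h, z)] = TanH(alpha[g][h][z] applied to the triplet of
    scores for scene g, action h, pose z). Each alpha[g][h][z] is a
    3-vector of weights, one per input score."""
    phi = {}
    for g in range(len(scene)):
        for h in range(len(action)):
            for z in range(len(pose)):
                a = alpha[g][h][z]
                phi[(g, h, z)] = math.tanh(
                    a[0] * scene[g] + a[1] * action[h] + a[2] * pose[z])
    return phi

# Two states per label type, all-ones templates, for illustration:
scene, action, pose = [0.2, 0.8], [0.6, 0.4], [0.9, 0.1]
alpha = [[[[1.0, 1.0, 1.0] for _ in range(2)] for _ in range(2)]
         for _ in range(2)]
phi = factor_layer(scene, action, pose, alpha)
print(phi[(0, 0, 0)])
```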

SLIDE 21

Message Passing CNN Architecture

  • Second stage:
  • Pose scores for all people in the scene are calculated jointly;
  • r indexes the output nodes for all people; t is the factor neuron index for scene g
  • T latent neurons are used for a scene g
  • Parameters β and α are shared between factors with the same semantic meaning

SLIDE 22

Message Passing CNN Architecture

  • Second stage:
  • Output of the k-th message passing step: the score for scene label g combines the factor node connected with scene g in the scene-action-pose component and the pose-global factor node

SLIDE 23

Message Passing CNN Architecture

  • Second stage:
  • Output of the k-th message passing step also includes the action score and the pose score
  • Model parameters are the weights on the edges of the NN:
  • W is the concatenation of weights from the factor layer to the outputs (2nd pass)
  • β and α are the weights from the inputs to the factor layer (1st pass)
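The second pass can be sketched as a weighted sum of the connected factor-neuron values plus the unary score from the previous step (a simplification with assumed names and made-up numbers, not the paper's exact formula):

```python
def output_score(unary_prev, factor_values, w):
    """Second pass of one message-passing step: the refined score for a
    label is its score from step k-1 (unary component) plus a weighted
    sum over the factor neurons connected to that label."""
    return unary_prev + sum(wi * f for wi, f in zip(w, factor_values))

# Refined scene score from three connected factor neurons:
print(output_score(1.2, [0.9, -0.2, 0.5], [0.4, 0.3, 0.1]))
```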

SLIDE 24

Components in Factor Layers

  • Unary component
  • Group activity scores for an image I; action and pose scores for each person Im in frame I
  • Acquired from the previous message passing step and added to the output of the next step

SLIDE 25

Components in Factor Layers

  • Group activity-action-pose layer ϕ
  • Measures the compatibility between individuals and groups
  • Captures dependencies between a person's fine-grained action and the scene label

SLIDE 26

Components in Factor Layers

  • Poses-all factor layer Ψ
  • Global pose information captured for a scene
  • T factor nodes (T = 10) per scene label

SLIDE 27

Multi Step Message Passing CNN Training

  • Two message passing steps adopted
  • Multi-loss training:
  • Remove the loss layers for actions/poses and learn the activity parameters
  • Then fix the softmax layer for the scene and learn actions/poses
  • The trained model is then used in the next message passing step
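The staged multi-loss schedule can be sketched as toggling which loss terms contribute to the total (a schematic with assumed stage names and made-up loss values, not the actual Caffe setup):

```python
def total_loss(scene_loss, action_loss, pose_loss, stage):
    """Multi-loss training schedule: stage 'scene' removes the
    action/pose losses to learn the activity parameters; stage
    'actions_poses' fixes the scene softmax and learns the rest."""
    if stage == "scene":
        return scene_loss
    if stage == "actions_poses":
        return action_loss + pose_loss
    return scene_loss + action_loss + pose_loss  # joint fallback

print(total_loss(0.7, 0.5, 0.25, "scene"))          # 0.7
print(total_loss(0.7, 0.5, 0.25, "actions_poses"))  # 0.75
```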

SLIDE 28

Multi Step Message Passing CNN Training

  • Learning semantic features for group activity
  • In addition to learning features for the classification task, semantic high-level features are learned
  • Different layers' features are explored; the semantic features proved useful for scene understanding

SLIDE 29

Implementation Details

  • Not every frame has the same number of detections
  • But the NN structure must be fixed!
  • Solved by dummy-image padding up to a detection cap
  • Neurons related to these dummy components are deactivated
  • After the 1st message passing step, prior to the next step:
  • Softmax-normalize pose, action and scene scores to get probabilities
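The dummy-padding trick can be sketched as follows (the cap value, score layout, and mask mechanism are my assumptions for illustration):

```python
def pad_and_mask(person_scores, cap, score_dim):
    """Pad the per-person score vectors up to a fixed detection cap so
    the network structure stays fixed; the mask marks dummy entries,
    whose neurons are then deactivated."""
    padded = list(person_scores)
    mask = [1] * len(person_scores)
    while len(padded) < cap:
        padded.append([0.0] * score_dim)  # dummy person
        mask.append(0)                    # deactivate its neurons
    return padded, mask

padded, mask = pad_and_mask([[0.3, 0.7], [0.9, 0.1]], cap=4, score_dim=2)
print(mask)  # [1, 1, 0, 0]
```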

SLIDE 30

Experiments

  • Caffe framework used
  • Two types of sparsely connected, weight-shared input layers:
  • from variable nodes to factor nodes
  • the reverse direction
  • TanH neurons in these layers

SLIDE 31

Experiments

  • Two datasets for scene classification
  • Collective Activity [Gupta, Arpan, et al.]
  • Nursing Home [Lan, Tian et al.]
  • Features extracted from the GM layer after each step of message passing
  • Fed to an RBF-kernel SVM to predict scene labels for each frame
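For reference, the kernel of that RBF-SVM is the standard k(x, y) = exp(-γ‖x−y‖²); a minimal sketch (the γ value and example vectors are made up):

```python
import math

def rbf_kernel(x, y, gamma=0.5):
    """Standard RBF kernel: exp(-gamma * ||x - y||^2), the similarity
    measure used by the SVM that the extracted features are fed into."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

print(rbf_kernel([1.0, 0.0], [0.0, 1.0]))  # exp(-1) ≈ 0.3679
```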

SLIDE 32

Experiments

  • Collective Activity Dataset
  • 44 clips from handheld cameras
  • Five action labels (crossing, waiting, queuing, walking and talking)
  • Eight pose labels (right, front-right, front, front-left, left, back-left, back, back-right)
  • Five activity labels (crossing, waiting, queuing, walking, talking)
  • Activity category chosen by taking the majority of actions (ignoring poses)
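The majority-vote rule for the frame's activity label can be sketched as:

```python
from collections import Counter

def frame_activity(person_actions):
    """Pick the group activity as the most common individual action
    label in the frame (poses are ignored)."""
    return Counter(person_actions).most_common(1)[0][0]

print(frame_activity(["crossing", "crossing", "walking", "crossing"]))
# -> crossing
```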

SLIDE 33

Results

  • Collective Activity Dataset
  • Concatenate global features with AC descriptors built from HOG features
  • Average the AC descriptors over all people (not included in message passing; helps with the limited amount of data)

AlexNet-only classification achieves 48% accuracy.

SLIDE 34

Results

(figure-only slide)

SLIDE 35

Experiments

  • Nursing Home Dataset
  • 80 videos captured in a nursing home
  • Actions: standing, sitting, bending, squatting and falling
  • Each frame is given an activity label: fall or non-fall
  • Due to intra-class diversity, action-primitive-based detectors [Lan, Tian et al.] are used
  • No pose attribute; only one scene-action factor layer for message passing
  • The SVM classifier is fed DL features only

SLIDE 36

Results

  • Nursing Home Dataset
  • Smaller gain from additional message passing. Why? (hint: pose)

AlexNet baseline achieves 69% accuracy.

SLIDE 37

Wrap Up

(figure-only slide)

SLIDE 38

Wrap Up

  • Group activity recognition / scene understanding
  • Atomic actions captured by CNNs
  • Dependencies between scene labels and atomic actions formulated as a graphical model
  • The GM formulated as a NN (with several additional perks) that mimics message passing
  • Label refinement through dependencies achieved
  • Results comparable to the state of the art

Deep Structured Models For Group Activity Recognition

slide-39
SLIDE 39

Brunel University London

Strength and Weaknesses

20.04.16

39

  • NN as a GM is a good idea
  • Formulates the dependencies between various sources of information
  • No comparison for the Nursing Home Dataset
  • Models/code not available
  • Improved results by the authors: http://arxiv.org/pdf/1511.04196.pdf

SLIDE 40

Q & A

Thank you for listening
