

SLIDE 1

Learning from actions
Temporal structures for human action recognition

Hilde Kuehne
Computer Vision Group, Prof. Juergen Gall, Institute of Computer Science III
Deep Learning for Computer Vision – Dagstuhl Seminar 1739

SLIDE 2

Overview

  • Motivation: sequence models for activity recognition
  • Weak action learning
    – Weak learning of sequential data
    – Weak learning with CNNs/RNNs
  • Outlook: current projects
    – Learning of unordered action sets
    – Mining YouTube


SLIDE 3

Why activity recognition?

  • Human-machine interaction, e.g. robotics
  • Services, e.g. assisted living, entertainment, …
  • Video transcription, movie labeling and indexing
  • Surveillance: who does what, and when?
  • Scientific studies, e.g. behavior and motion analysis, sport science, …


Image examples: SFB 588 – Humanoid Robot Armar III; HMDB51 [Kuehne2011]; Project AutoTIP – GoHuman

SLIDE 4

Activity recognition - Problem Statement

Action recognition

  • … (usually) means one label per image or per clip

This doesn't work for complex activities:

  • One image may not be enough for reliable recognition
  • One label per video can be too coarse

→ Look for a representation that captures the structure of complex action sequences:
    – human actions as time series (robotics)
    – models of complex relations between entities (speech)

Problem: find representations that fit the structure of human actions.


Dataset examples: Weizmann [Blank2005], BKT [Kuehne2012], Pascal [Everingham2010]

SLIDE 5

Action primitives

Action primitives (units):

  • A motion that is performed continuously and without interruption.
  • The smallest entity whose order can be changed during execution.
  • Complex tasks, e.g. in the household domain, consist of concatenated action primitives.
  • An action primitive is usually made up of a set of motion phases:


Motion phases (from start position to end position):
start → preparation → main action → finalize → adjust
(adjust: energy compensation, preparation for the following action)

SLIDE 6

Action Grammar

  • All tasks, as long as they have a meaningful aim, are executed in a certain order.
  • The order in which the tasks are executed is not random.
  • It is possible to formulate a grammar which has to be followed.
  • The action grammar defines the action sequences, which are concatenations of action primitives that result in a meaningful task.
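As an illustration, such an action grammar can be written down as context-free production rules. The following is a minimal Python sketch for a hypothetical "making coffee" task; the production rules and action names are invented for the example:

```python
# Hypothetical context-free action grammar: non-terminals map to lists of
# alternative right-hand sides; lower-case symbols are action primitives.
grammar = {
    "ACTIVITY": [["take_cup", "POUR", "stir_coffee"]],
    "POUR":     [["pour_coffee"], ["pour_coffee", "pour_milk"]],
}

def expansions(symbol):
    """Enumerate every primitive sequence a symbol can generate."""
    if symbol not in grammar:                    # terminal: action primitive
        return [[symbol]]
    results = []
    for rhs in grammar[symbol]:
        seqs = [[]]
        for s in rhs:                            # concatenate expansions
            seqs = [a + b for a in seqs for b in expansions(s)]
        results.extend(seqs)
    return results

print(expansions("ACTIVITY"))
# [['take_cup', 'pour_coffee', 'stir_coffee'],
#  ['take_cup', 'pour_coffee', 'pour_milk', 'stir_coffee']]
```

Every sequence the grammar generates is a meaningful task; any other ordering of the same primitives is rejected.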


SLIDE 7

Time-based modeling

1. Action units: a linear n-state model, e.g. the action unit "picking a bowl" with states s1 (move hand towards the bowl), s2 (grasp the bowl) and s3 (take the bowl to the target position), connected by transition states [Gehrig2008]

2. Activity: a context-free grammar over action units, e.g. sequences such as idle_position, picking_bottle, idle_position, picking_bowl, idle_position, … [Gehrig2008]

SLIDE 8

Modelling of action units

The task of recognizing an action unit is defined by the best match of the input sequence x, with x_i representing the feature vector at frame i, to a set of action units u. This corresponds to maximizing the probability of an action unit u_i given the input sequence x:


The input sequence and the set of action units are

$$\mathbf{x} = x_1, x_2, x_3, \ldots, x_T \qquad \mathbf{u} = u_1, u_2, u_3, \ldots, u_N$$

and the recognized unit follows from Bayes' rule:

$$\arg\max_i P(u_i \mid \mathbf{x}) = \arg\max_i \frac{P(\mathbf{x} \mid u_i)\, P(u_i)}{P(\mathbf{x})}$$
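In code, this is simply an argmax over per-unit model scores, usually computed in the log domain. A minimal sketch, assuming each unit model exposes a log-likelihood function (this interface is an assumption, not part of the slides):

```python
import math

def recognize_unit(x, unit_models, priors):
    """Pick the action unit u_i maximizing P(x | u_i) * P(u_i).

    x           -- list of per-frame feature vectors x_1 ... x_T
    unit_models -- dict: unit name -> model with log_likelihood(x)
    priors      -- dict: unit name -> prior probability P(u_i)

    P(x) is constant over all units, so it drops out of the argmax.
    """
    return max(
        unit_models,
        key=lambda u: unit_models[u].log_likelihood(x) + math.log(priors[u]),
    )
```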

SLIDE 9

Modelling of action units

The joint probability of the model M_{u_i} moving through the state sequence S_x can be calculated as the product of transition probabilities and observation probabilities given the input x:
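The formula itself appeared as an image on the slide; in standard HMM notation, with transition probabilities $a_{s_{t-1}s_t}$ and observation probabilities $b_{s_t}(x_t)$, the product reads:

$$P(\mathbf{x}, S_x \mid M_{u_i}) = \prod_{t=1}^{T} a_{s_{t-1}s_t}\, b_{s_t}(x_t)$$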


SLIDE 10

Modelling of action sequences

Action sequences are realized as a concatenation of action units:

  • Computation of probabilities with a combination of Viterbi and pruning
  • Can include grammar specification


SLIDE 11

Practical realization

Recognition of action units:

  • on the level of action units: an n-state left-to-right HMM
  • per state: two equally likely transitions, one to the current state and one to the next state
  • number of states: adaptive to the mean unit length
  • initialization: equal distribution of samples

Recognition of sequences:

  • action sequences are defined by a context-free grammar
  • built by automatic parsing of labels, or defined by hand (see the sketch after this list)


(The example grammar describes stirring, mashing and pouring.)
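A minimal sketch of how such a left-to-right transition structure can be built. The slide only states that the number of states adapts to the mean unit length, so the fixed frames-per-state ratio below is an assumption:

```python
import numpy as np

FRAMES_PER_STATE = 10  # assumption: roughly one state per 10 frames

def left_to_right_transitions(mean_unit_length):
    """Transition matrix of an n-state left-to-right HMM where each state
    has two equally likely transitions: a self-loop and a step forward."""
    n = max(1, round(mean_unit_length / FRAMES_PER_STATE))
    A = np.zeros((n, n))
    for s in range(n - 1):
        A[s, s] = 0.5        # stay in the current state
        A[s, s + 1] = 0.5    # move on to the next state
    A[n - 1, n - 1] = 1.0    # final state loops until the unit ends
    return A

print(left_to_right_transitions(35))   # yields a 4-state unit model
```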

SLIDE 12

Properties

  • Implicit segmentation: the output sequence contains semantic and temporal information in addition to the overall label
  • Continuous recognition: hypotheses are based on beams of (theoretically) unlimited length
  • Temporal variations are handled by the HMMs:
    – temporal flexibility without the need for more training samples
    – only constrained by the number of states
    – handles large variations


Recognition result vs. ground truth (HTK master label files):

Result:
#!MLF!#
"bend.rec"
0 3700000 bend_down 45358.023438
3700000 6200000 bend_up 35816.691406
.
"jack.rec"
0 1700000 jack 6247.286621
1700000 2700000 jack -544.383606
2700000 5000000 jack 10465.790039
.
"pjump.rec"
0 1400000 pjump 11971.578125
1400000 2800000 pjump 15659.549805
2800000 4100000 pjump -25356.494141
.

Ground truth:
#!MLF!#
"bend.lab"
0 3800000 bend_down
3700000 6200000 bend_up
.
"jack.lab"
0 2800000 jack
2700000 5000000 jack
.
"pjump.lab"
0 2300000 pjump
2200000 4100000 pjump

SLIDE 13

Example


SLIDE 14

Example


SLIDE 15

Weak learning of sequential data

Idea: given

  • sequences of input data, and
  • transcripts, i.e. a list of the order in which the actions occur in the videos,

infer the scripted actions and train the related action models without any boundary information.

This setup is usually applied for training ASR systems, which have:

  • lots of training data (e.g. TIMIT: ~6300 sentences × ~8.2 words per sentence × ~3.9 phones per word ≈ 201,474 phone samples; Breakfast: ~11,000 samples)
  • a well-defined vocabulary
  • low signal variance


Figure: segmentation from video input plus transcripts, e.g. "pour milk, take cup, stir coffee, …" / "pour coffee, stir coffee, pour milk, …".

SLIDE 16

Segment Annotation vs. Transcript Annotation


Full segmented annotation requires the start and end frames for each action; transcript annotations contain only the actions within a video and the order in which they occur.

Cost of the two annotation techniques (annotators labeled 11 videos of making coffee, with 7 possible actions, in both styles):

  • Full segmented annotation: real-time factor 3.85 (= 3.85 × video duration)
  • Transcript annotation: real-time factor 1.36

→ about a third of the time compared to a full annotation (a sketch of the two formats follows below)
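Concretely, the two annotation styles might look like the following sketch (the frame numbers are invented for illustration):

```python
# Full segmented annotation: one (action, start frame, end frame) per segment.
segment_annotation = [
    ("take_cup",    0,   180),
    ("pour_coffee", 181, 520),
    ("pour_milk",   521, 700),
    ("stir_coffee", 701, 940),
]

# Transcript annotation: only the ordered list of actions, no boundaries.
transcript_annotation = ["take_cup", "pour_coffee", "pour_milk", "stir_coffee"]

# A transcript can always be derived from a full segmentation:
assert transcript_annotation == [action for action, _, _ in segment_annotation]
```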

SLIDE 17

Video Segmentation given the Action Transcripts

Given the action transcripts, a large sequence-HMM can be built as the concatenation of the HMMs of each action class, in the order in which they occur in the transcript of that sequence. Video segmentation then amounts to finding the best alignment of video frames to the sequence-HMM (e.g. with the Viterbi algorithm), as in the sketch below.
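A minimal NumPy sketch of this alignment step. It assumes per-frame log observation scores are already available for every action (e.g. from the GMMs) and that each unit HMM provides its log transition matrix; both interfaces are assumptions made for the example:

```python
import numpy as np

def align(frame_log_probs, transcript, unit_hmms):
    """Viterbi-align frames to the sequence-HMM built from a transcript.

    frame_log_probs -- dict: action -> (T x n_states) log observation scores
    transcript      -- ordered list of the actions occurring in the video
    unit_hmms       -- dict: action -> (n x n) log transition matrix
    Returns one action label per frame.
    """
    labels, obs, blocks = [], [], []
    for action in transcript:                    # concatenate the unit HMMs
        n = unit_hmms[action].shape[0]
        labels += [action] * n
        obs.append(frame_log_probs[action])
        blocks.append(unit_hmms[action])
    log_B = np.concatenate(obs, axis=1)          # T x S observation scores
    T, S = log_B.shape

    # Block-diagonal transitions plus a jump from each unit's last state
    # to the next unit's first state (left unnormalized in this sketch).
    log_A = np.full((S, S), -np.inf)
    off = 0
    for i, blk in enumerate(blocks):
        n = blk.shape[0]
        log_A[off:off + n, off:off + n] = blk
        if i + 1 < len(blocks):
            log_A[off + n - 1, off + n] = np.log(0.5)
        off += n

    # Standard Viterbi recursion with backtracking.
    delta = np.full((T, S), -np.inf)
    psi = np.zeros((T, S), dtype=int)
    delta[0, 0] = log_B[0, 0]                    # must start in first state
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A   # S x S candidate scores
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[t]

    state = S - 1                                # must end in last state
    path = [state]
    for t in range(T - 1, 0, -1):
        state = psi[t, state]
        path.append(state)
    return [labels[s] for s in reversed(path)]
```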


SLIDE 18

System overview


SLIDE 19

Example

Example of the segmentation across the training iterations:


SLIDE 20

Alignment vs. Segmentation

Alignment: video and transcript are given; the result is the segment boundaries.

Segmentation: only the video is given; the result is the action classes plus the segment boundaries.

SLIDE 21

Evaluation


Evaluation on four large-scale datasets:

  • Hollywood Extended [Bojanowski2014]: 937 clips extracted from Hollywood movies; 15 action classes with a mean of 2.5 segments per video.
  • Breakfast [Kuehne2014]: a large-scale database with 77 h of video (4 million frames); 10 breakfast-related activities with 48 action classes.
  • MPII Cooking [Rohrbach2012]: a large database of fine-grained cooking activities; 8 h of video data, 12 persons, 65 action classes.
  • CRIM13 [Burgos2012]: a large-scale mice-behavior dataset; 50 h of annotated mice activities, 13 different action classes.

SLIDE 22

Example – Segmentation


SLIDE 23

Example – Segmentation


SLIDE 24

Extension by CNN

Problem:

  • CNNs need to be fine-tuned for each dataset/split separately → unsuitable for iterative methods
  • Generic features are based on fixed algorithms and do not depend on any training information

The resulting segmentation of the training data can be used to train any other model in a fully supervised manner, e.g. to replace the low-level GMM-based observation probabilities by ones gained from other models such as CNNs:

  • The softmax layer of a CNN generates a posterior distribution over all output classes.
  • Given the sequence input x at frame t, the output for class s is p(s | x_t); to get conditional observation probabilities, we transform the softmax output using Bayes' rule (shown below):
    – p(s) is the class prior probability (the relative frequency of state s)
    – p(x_t) can be omitted (it does not affect the maximizing arguments)
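The transformation was shown as a formula on the slide; reconstructed from the surrounding definitions it reads:

$$p(x_t \mid s) = \frac{p(s \mid x_t)\, p(x_t)}{p(s)} \;\propto\; \frac{p(s \mid x_t)}{p(s)}$$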


SLIDE 25

Implementation

Two-stream architecture (VGG, pretrained on UCF101), with every state as one class:

  • Fine-tune with the respective state classes, e.g. 238 for Hollywood Extended and 1392 for Breakfast
  • Train for 30,000 iterations (batch size 50)

→ Evaluated on Hollywood Extended and Breakfast


SLIDE 26

Results – Alignment

  • Breakfast improves, Hollywood Extended gets worse
  • The mean over classes always improves with CNNs:
    → better at classifying underrepresented classes
    → less prone to background


SLIDE 27

Weak learning with RNNs

Same idea: replace the GMM observation probabilities by RNNs:

  • GMMs are just distributions, without temporal information
  • RNNs are good at capturing short temporal information (≈ 10-20 frames, roughly the span of an HMM state); train the RNNs with 20 chunks of 20 frames each → no need for more temporal context
  • HMMs can handle the long-term inference (see the sketch below)
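A minimal PyTorch sketch of such a chunk-based recurrent observation model. Layer sizes and the feature dimension are illustrative assumptions; only the 20-frame chunks and the number of state classes (1392 for Breakfast, from the CNN experiments) come from the slides:

```python
import torch
import torch.nn as nn

class ChunkRNN(nn.Module):
    """Recurrent model predicting an HMM-state posterior for every frame."""

    def __init__(self, feat_dim=64, hidden=128, n_states=1392):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_states)

    def forward(self, x):             # x: (batch, 20 frames, feat_dim)
        h, _ = self.gru(x)
        return self.out(h)            # (batch, 20, n_states) state logits

model = ChunkRNN()
chunks = torch.randn(20, 20, 64)      # 20 chunks of 20 frames each
state_logits = model(chunks)          # short-term temporal context only
```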


SLIDE 28

Weak learning with RNNs

Add-on: how to find the right number of states? The GMM-based modeling is not necessarily best for RNNs. Idea: re-estimate the number of states after each iteration:

1. Start with a uniform distribution
2. Train the RNN states
3. Align the transcripts to the training videos
4. Recompute the number of states
5. Split the new segments according to the number of states

Repeat steps 2-5 until less than 2% of all frames are assigned to a new class.
→ The re-estimation helps to approximate a length prior (see the sketch below).
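A sketch of this training loop. The recomputation rule in step 4 appeared as a formula on the slide, so the mean-segment-length heuristic below is an assumption, as are the callable interfaces:

```python
FRAMES_PER_STATE = 10   # assumption; the slide's exact formula is not shown

def reestimation_training(videos, transcripts, n_states, train_rnn, align):
    """Alternate RNN training and alignment, re-estimating states per action.

    n_states  -- dict: action -> current number of HMM states
    train_rnn -- trains the RNN on the current state alignment (step 2)
    align     -- aligns transcripts to videos, returns (action, start, end)
                 segments over all training videos (step 3)
    """
    while True:
        rnn = train_rnn(videos, transcripts, n_states)           # step 2
        segments = align(rnn, videos, transcripts)               # step 3

        lengths = {}
        for action, start, end in segments:
            lengths.setdefault(action, []).append(end - start)

        changed = 0
        total = sum(end - start for _, start, end in segments)
        for action, lens in lengths.items():                     # step 4
            new_n = max(1, round(sum(lens) / len(lens) / FRAMES_PER_STATE))
            if new_n != n_states[action]:
                changed += sum(lens)   # frames whose action was re-split
            n_states[action] = new_n                             # step 5

        if changed / total < 0.02:     # stop: <2% of frames re-assigned
            return rnn, n_states
```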


SLIDE 29

Overview training


SLIDE 30

Evaluation


SLIDE 31

Evaluation


SLIDE 32

Outlook

Temporal Action Labeling using Action Sets. Idea: forget the temporal ordering, just train the classes and learn a grammar.


Weak learning from action sets:

  • Each video is associated with a set of tags/classes
  • All classes are trained based on the tags
  • A grammar is generated based on the tag combinations

SLIDE 33

Conclusion

  • Weakly supervised learning of human actions from video transcriptions with structured models
  • Given sequences of input data and transcripts, infer the scripted actions and train the related action models without any boundary information
  • Address the field of action recognition by using less annotation but more data, for a better understanding of action concepts


SLIDE 34

Thank you for your attention.
