Deep Learning for Computer Vision
Dagstuhl Seminar 1739
Hilde Kuehne
Computer Vision Group, Prof. Juergen Gall, Institute of Computer Science III
Learning from actions: Temporal structures for human action recognition
17.09.2017 Institute of Computer Science III – Computer Vision Group
– e.g. robotics, services (e.g. assisted living), entertainment …
– Who does what when?
– e.g. behavior and motion analysis, sport science …
Figures: SFB 588 – Humanoid Robot Armar III; HMDB51 [Kuehne2011]; Project AutoTIP - GoHuman
Action recognition
Doesn't work for complex activities
action sequences:
– human actions as time series (robotics)
– models of complex relations between entities (speech)
Problem: Find representations that fit the structure of human actions
Weizmann [Blank2005]
BKT [Kuehne2012]
Pascal [Everingham2010]
(Energy compensation - Preparation for following action)
idle_position idle_position picking_bottle idle_position picking_bottle idle_position picking_bowl idle_position
[Gehrig2008]
Describes stirring, mashing and pouring.
Implicit segmentation
– Output sequence contains semantic and temporal information in addition to the general label
Continuous recognition
– Hypotheses are based on beams of (theoretically) unlimited length
Temporal variations are handled by HMMs:
– Temporal flexibility without need for more training samples
– Only constrained by number of states
– Handles large variations
Result:
#!MLF!#
"bend.rec"
0 3700000 bend_down 45358.023438
3700000 6200000 bend_up 35816.691406
.
"jack.rec"
0 1700000 jack 6247.286621
1700000 2700000 jack -544.383606
2700000 5000000 jack 10465.790039
.
"pjump.rec"
0 1400000 pjump 11971.578125
1400000 2800000 pjump 15659.549805
2800000 4100000 pjump -25356.494141
.

Ground truth:
#!MLF!#
"bend.lab"
0 3800000 bend_down
3700000 6200000 bend_up
.
"jack.lab"
0 2800000 jack
2700000 5000000 jack
.
"pjump.lab"
0 2300000 pjump
2200000 4100000 pjump
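The result and ground-truth listings appear to follow the HTK master label file layout (a `#!MLF!#` header, a quoted file name, one segment per line, and a lone `.` closing each entry). A minimal sketch of how such listings could be read and compared is given below. The function names `parse_mlf` and `overlap_ratio` and the overlap measure itself are illustrative assumptions, not part of the talk.

```python
from typing import Dict, List, Tuple

# (start, end, label); time stamps are kept in the integer units used by the file.
Segment = Tuple[int, int, str]


def parse_mlf(text: str) -> Dict[str, List[Segment]]:
    """Read an MLF-style listing into {file name: [segments]} (hypothetical helper)."""
    entries: Dict[str, List[Segment]] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line == "#!MLF!#":
            continue                      # skip blank lines and the header
        if line.startswith('"'):
            current = line.strip('"')     # a quoted name opens a new entry
            entries[current] = []
        elif line == ".":
            current = None                # a lone dot closes the entry
        elif current is not None:
            fields = line.split()
            # A trailing score column (present in the .rec listing) is ignored.
            entries[current].append((int(fields[0]), int(fields[1]), fields[2]))
    return entries


def overlap_ratio(result: List[Segment], truth: List[Segment]) -> float:
    """Share of the annotated time whose label agrees with the result (assumed measure)."""
    total = sum(end - start for start, end, _ in truth)
    matched = 0
    for t_start, t_end, t_label in truth:
        for r_start, r_end, r_label in result:
            if r_label == t_label:
                matched += max(0, min(t_end, r_end) - max(t_start, r_start))
    return matched / total


# The "bend" entries copied from the listings above.
SAMPLE_RESULT = '''#!MLF!#
"bend.rec"
0 3700000 bend_down 45358.023438
3700000 6200000 bend_up 35816.691406
.
'''
SAMPLE_TRUTH = '''#!MLF!#
"bend.lab"
0 3800000 bend_down
3700000 6200000 bend_up
.
'''

result = parse_mlf(SAMPLE_RESULT)["bend.rec"]
truth = parse_mlf(SAMPLE_TRUTH)["bend.lab"]
print(overlap_ratio(result, truth))
```

Because the ground-truth segments in the listing overlap slightly at their boundaries, the ratio is computed against the summed segment lengths rather than the total video duration.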
Given:
infer the scripted actions and train the related action models without any boundary information.
Usually applied for training of ASR systems (e.g. TIMIT: ~6300 sentences × ~8.2 words per sentence × ~3.9 phones per word ≈ 201474 phone samples; Breakfast: ~11000)
Segmentation from video input + transcripts
Pour milk, Take cup, Stir coffee, ….. Pour coffee Stir coffee Pour milk ….
Fully segmented annotation requires the start and end frames for each action; transcript annotations contain only the actions within a video and the order in which they occur.
Cost of the different annotation techniques: annotators labeled both types on 11 videos (making coffee) with 7 possible actions.
Transcript annotation takes about a third of the time compared to a full annotation.
Given the action transcripts, a large sequence-HMM can be built that is a concatenation of the HMMs for each action class, in the order in which they occur in the transcript for this sequence.
Video segmentation: finding the best alignment of video frames to the sequence-HMM (e.g. with the Viterbi algorithm)
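The alignment step can be illustrated with a small Viterbi sketch. It assumes a strictly left-to-right sequence-HMM in which each state either repeats or advances to the next one, fixed transition probabilities of 0.5, and frame-wise log-likelihoods supplied by the caller. The names `build_sequence_states` and `viterbi_align`, the toy scores, and the number of states per action are hypothetical choices for illustration, not the models used in the talk.

```python
import math
from typing import Dict, List, Sequence


def build_sequence_states(transcript: Sequence[str],
                          states_per_action: Dict[str, int]) -> List[str]:
    """Concatenate the states of each action HMM in transcript order."""
    states = []
    for action in transcript:
        for k in range(states_per_action[action]):
            states.append(f"{action}/{k}")
    return states


def viterbi_align(log_obs: Sequence[Sequence[float]],
                  log_stay: float = math.log(0.5),
                  log_move: float = math.log(0.5)) -> List[int]:
    """Best left-to-right path from the first to the last state.

    log_obs[t][j] is the log-likelihood of frame t under sequence state j.
    Returns one state index per frame.
    """
    num_frames, num_states = len(log_obs), len(log_obs[0])
    if num_frames < num_states:
        raise ValueError("need at least one frame per state")
    neg_inf = float("-inf")
    score = [[neg_inf] * num_states for _ in range(num_frames)]
    back = [[0] * num_states for _ in range(num_frames)]
    score[0][0] = log_obs[0][0]           # every path starts in the first state
    for t in range(1, num_frames):
        for j in range(num_states):
            stay = score[t - 1][j] + log_stay
            move = score[t - 1][j - 1] + log_move if j > 0 else neg_inf
            if move > stay:
                score[t][j], back[t][j] = move, j - 1
            else:
                score[t][j], back[t][j] = stay, j
            score[t][j] += log_obs[t][j]
    path = [num_states - 1]               # every path ends in the last state
    for t in range(num_frames - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


# Toy example: two actions with one state each, five frames.
states = build_sequence_states(["pour_milk", "stir_coffee"],
                               {"pour_milk": 1, "stir_coffee": 1})
toy_log_obs = [[-1, -5], [-1, -5], [-1, -5], [-5, -1], [-5, -1]]
alignment = viterbi_align(toy_log_obs)
print([states[j] for j in alignment])
```

The segment boundaries fall wherever the state index changes between consecutive frames, which is how the alignment yields a segmentation without any annotated boundaries.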
Evaluation on four large-scale datasets:
– Hollywood Extended [Bojanowski2014]: 937 clips extracted from Hollywood movies, 15 action classes with a mean of 2.5 segments per video.
– Breakfast [Kuehne2014]: a large-scale database with 77h of video, 4 million frames, 10 breakfast-related activities with 48 action classes.
– MPII Cooking [Rohrbach2012]: a large database for fine-grained cooking activities with 8h of video data, 12 persons, 65 action classes.
– CRIM13 [Burgos2012]: a large-scale mice behavior dataset, 50h of annotated mice activities, 13 different action classes.
Problem:
unsuitable for iterative methods
Generic features are based on fixed algorithms and do not depend on any training information.
The resulting segmentation of the training data can be used to train any other model in a fully supervised manner, e.g. to replace the low-level GMM-based observation probabilities with ones gained from other models such as CNNs.
The softmax layer of a CNN generates a posterior distribution over all output classes. Given the input x_t at frame t, the output for class s is p(s|x_t). To get the conditional probabilities p(x_t|s), we can transform the softmax-layer output using Bayes' rule: p(x_t|s) = p(s|x_t) p(x_t) / p(s)
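A minimal sketch of this transformation in the log domain follows. It assumes that the class priors p(s) are estimated as relative frequencies from the current frame labelling of the training data, which is an illustrative choice rather than a detail stated on the slide. The function names and the small `eps` floor are also hypothetical.

```python
import math
from typing import List, Sequence


def class_priors(frame_labels: Sequence[int], num_classes: int) -> List[float]:
    """Estimate p(s) as the relative frequency of each class (assumed estimator)."""
    counts = [0] * num_classes
    for s in frame_labels:
        counts[s] += 1
    return [c / len(frame_labels) for c in counts]


def scaled_log_likelihoods(posteriors: Sequence[Sequence[float]],
                           priors: Sequence[float],
                           eps: float = 1e-8) -> List[List[float]]:
    """Return log p(s|x_t) - log p(s) for every frame t and class s.

    By Bayes' rule this equals log p(x_t|s) - log p(x_t); the dropped term
    log p(x_t) is the same for all classes at frame t, so it does not change
    which alignment scores best.
    """
    return [
        [math.log(max(p, eps)) - math.log(max(q, eps)) for p, q in zip(frame, priors)]
        for frame in posteriors
    ]


# Toy example: class 0 covers 80% of the training frames, class 1 covers 20%.
priors = class_priors([0, 0, 0, 0, 1], num_classes=2)
scores = scaled_log_likelihoods([[0.5, 0.5]], priors)
print(scores)
```

In the toy example both classes receive the same posterior, yet the rarer class ends up with the higher scaled likelihood, since dividing by the prior removes the bias toward frequent classes.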