GPU Accelerated Sequence Learning for Action Recognition
Yemin Shi shiyemin@pku.edu.cn 2018-03
Background
Action Recognition vs. Object Recognition
Object Recognition (Image Classification) vs. Action Recognition (Video Classification)
Action recognition involves the temporal domain, long-term dependence, and high computational complexity. General-purpose methods are not good enough for action recognition, and existing methods are still far from practical use.
Dataset      | Year      | Actions | Videos | Annotations | Source        | Localization
HMDB51       | 2011      | 51      | 7K     | 7K          | YouTube/Movie | No
UCF101       | 2012      | 101     | 13K    | 13K         | YouTube       | No
Sports 1M    | 2014      | 487     | 1.1M   | 1.1M        | YouTube       | No
THUMOS 15    | 2014      | 101     | 24K    | 21K         | YouTube       | Yes
ActivityNet  | 2015      | 200     | 20K    | 23K         | YouTube       | Yes
Charades     | 2016/2017 | 157     | 10K    | 67K         | 267 Homes     | Yes
AVA          | 2017      | 80      | 214    | 197K        | Movie         | Yes
Kinetics     | 2017      | 400     | 305K   | 305K        | YouTube       | No
MIT          | 2017      | 339     | 1M     | 1M          | 10 sources    | No
SLAC         | 2017      | 200     | 520K   | 1.75M       | YouTube       | Yes
Modeling the temporal domain is one of the most important targets of action recognition. Shortcomings of existing methods:
Actions have long durations, and the LSTM, despite its high complexity, is not good enough.
Therefore, we need:
A more efficient sequence learning model to improve the ability to model temporal information.
Problems and corresponding methods:
Hand-crafted Features and Deep Features → Deep Trajectory Descriptor
Importance of Each Frame → Temporal Attentive Network
The Ability of Modeling Temporal Domain → shuttleNet
One-shot Action Recognition → Hierarchical Temporal Memory Enhanced One-shot Distance Learning
Open-set Action Recognition → Open Deep Network
Sequence learning for action recognition
Deep Trajectory Descriptor
Temporal Attentive Network
shuttleNet
Hierarchical Temporal Memory Enhanced One-shot Distance Learning
Open Deep Network
Problems and Solutions
Hand-crafted features can hardly describe the movement process, while CNNs are good at describing structure. Solution: integrate hand-crafted features with a CNN to improve performance.
Hand-crafted features: more statistics, less structure. CNN: structure is important.
Improve Dense Trajectories with Background Subtraction
Only extract trajectories and optical flow on the foreground (videos → masks → foreground), where S_fore is the sum of the foreground area.
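The foreground-only extraction can be sketched with simple frame differencing. This is a minimal stand-in for the slide's background-subtraction step (the threshold value and helper names are illustrative, not the paper's exact method); the mask's area corresponds to S_fore.

```python
import numpy as np

def foreground_mask(frame, background, thresh=25):
    """Frame-differencing background subtraction (illustrative threshold)."""
    diff = np.abs(frame.astype(np.int32) - background.astype(np.int32))
    return diff > thresh  # boolean mask: True = foreground

def keep_foreground_points(points, mask):
    """Keep only trajectory points that fall inside the foreground mask.
    `points` is an (N, 2) array of integer (row, col) coordinates."""
    rows, cols = points[:, 0], points[:, 1]
    return points[mask[rows, cols]]

# Toy example: a single bright moving blob on a dark background.
background = np.zeros((8, 8), dtype=np.uint8)
frame = background.copy()
frame[2:4, 2:4] = 200                      # the "moving" region
mask = foreground_mask(frame, background)
s_fore = mask.sum()                        # area of the foreground region
pts = np.array([[2, 2], [3, 3], [7, 7]])   # two foreground points, one background
fg_pts = keep_foreground_points(pts, mask)
```

Only `fg_pts` would then be used for trajectory and optical-flow extraction.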
Main Idea
Trajectory Texture Image: project trajectories onto a canvas; a CNN is employed for structural feature learning.
Pipeline: input video → dense trajectories → projection into 2D space over an adaptive duration → Trajectory Texture Image → CNN (Conv, Pooling, LRN, …, Conv, FC).
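The projection step can be sketched as accumulating trajectory points onto one 2D canvas. This is only an illustration of the Trajectory Texture Image idea; the paper's exact rasterization and normalization may differ.

```python
import numpy as np

def trajectory_texture_image(trajectories, shape=(64, 64)):
    """Project a set of trajectories onto a single 2D canvas by
    accumulating point visits, then normalize to [0, 1] for the CNN."""
    canvas = np.zeros(shape, dtype=np.float32)
    for traj in trajectories:              # traj: (T, 2) array of (row, col)
        for r, c in traj:
            canvas[int(r), int(c)] += 1.0  # overlapping trajectories add up
    if canvas.max() > 0:
        canvas /= canvas.max()
    return canvas

trajs = [np.array([[1, 1], [1, 2], [1, 3]]),   # two toy trajectories
         np.array([[5, 5], [6, 5]])]
tti = trajectory_texture_image(trajs, shape=(8, 8))
```

The resulting image is what the Conv/Pooling/LRN/FC stack above consumes.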
Improve trajectory projection method
DTD with LSTM
Treat each Trajectory Texture Image as the input of one time step, and use an LSTM to model the temporal domain. This improves the ability of DTD to model complex actions.
Our LSTM Model
x_t is the input at time t; h_t is the hidden state at time t; i_t, f_t, c_t, and o_t are the input gate, forget gate, memory cell, and output gate, respectively.
Learn long-term action description
A CNN for DTD feature learning; sequential DTDs for long-term action representation; an RNN (LSTM) for temporal-domain modeling.
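One step of a standard LSTM, in the notation above, can be sketched in NumPy. The stacked-parameter layout (`W`, `U`, `b` holding all four gates) is an implementation convenience, not the paper's specific code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step: i_t (input gate), f_t (forget gate),
    c_t (memory cell), o_t (output gate)."""
    d = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b          # (4d,) stacked pre-activations
    i_t = sigmoid(z[0:d])
    f_t = sigmoid(z[d:2 * d])
    g_t = np.tanh(z[2 * d:3 * d])         # candidate cell update
    o_t = sigmoid(z[3 * d:4 * d])
    c_t = f_t * c_prev + i_t * g_t        # memory cell
    h_t = o_t * np.tanh(c_t)              # hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
n_in, d = 3, 4
W = rng.normal(size=(4 * d, n_in))
U = rng.normal(size=(4 * d, d))
b = np.zeros(4 * d)
h, c = np.zeros(d), np.zeros(d)
for _ in range(5):                        # e.g. five DTD time steps
    h, c = lstm_step(rng.normal(size=n_in), h, c, W, U, b)
```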
Loss function: softmax loss + weight regularizer,

J(\theta) = -\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K} \mathbf{1}\{z_i = k\}\,\log\frac{e^{\theta_k^{\top} x_i}}{\sum_{m=1}^{K} e^{\theta_m^{\top} x_i}} + \frac{\lambda}{2}\sum_{j=1}^{K}\sum_{k=0}^{d}\theta_{jk}^2

where x_i is the i-th input feature, z_i its label, K the number of classes, d the feature dimension, and λ the weight-decay coefficient.
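The softmax loss with its weight regularizer can be sketched numerically (a minimal NumPy version; variable names follow the slide's loss, and the λ value is illustrative):

```python
import numpy as np

def softmax_loss(theta, X, z, lam=1e-3):
    """Softmax cross-entropy with an L2 weight regularizer.
    theta: (K, d) class weights, X: (n, d) features, z: (n,) labels."""
    logits = X @ theta.T                          # (n, K)
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    n = X.shape[0]
    data_term = -log_probs[np.arange(n), z].mean()
    reg_term = 0.5 * lam * np.sum(theta ** 2)
    return data_term + reg_term

theta = np.zeros((3, 4))       # zero weights -> uniform class probabilities
X = np.ones((5, 4))
z = np.array([0, 1, 2, 0, 1])
loss = softmax_loss(theta, X, z)   # equals log(3) for uniform predictions
```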
Three-stream Framework
Experiment results
Sequence learning for action recognition
Deep Trajectory Descriptor
Temporal Attentive Network
shuttleNet
Hierarchical Temporal Memory Enhanced One-shot Distance Learning
Open Deep Network
Problems and solutions
Not all postures contribute equally to the successful recognition of an action; texture and motion are not independent from each other.
The most important frames for RGB and optical flow may not correspond (they need not share the same frame id).
Attention mechanism
e_{ij} = v^{\top} \tanh(W_1 h_i + W_2 x_j)
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{l=1}^{T} \exp(e_{lj})}
o_j^{h} = \sum_{i=1}^{T} \alpha_{ij} h_i

f_{ji} = u^{\top} \tanh(W_3 x_j + W_4 h_i)
\beta_{ji} = \frac{\exp(f_{ji})}{\sum_{l=1}^{T} \exp(f_{jl})}
o_i^{x} = \sum_{j=1}^{T} \beta_{ji} x_j

where h_i and x_j are the features of the two streams at times i and j, and W_1–W_4, v, u are learned parameters.
In each domain (spatial and temporal), a weight is computed for each input, and a weighted sum is taken over all inputs.
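The "weight for each input, weighted sum over all inputs" pattern is additive (tanh) attention; a minimal NumPy sketch (shapes and names are illustrative, not the paper's code):

```python
import numpy as np

def additive_attention(H, q, W1, W2, v):
    """Score each input h_i against a query q, softmax the scores into
    weights, and return the weighted sum over all inputs.
    Shapes: H (T, d), q (d,), W1/W2 (a, d), v (a,)."""
    scores = np.tanh(H @ W1.T + q @ W2.T) @ v        # (T,) one score per input
    scores -= scores.max()                           # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()    # attention weights, sum to 1
    context = alpha @ H                              # (d,) weighted sum
    return alpha, context

rng = np.random.default_rng(1)
T, d, a = 6, 4, 3
H = rng.normal(size=(T, d))      # e.g. one stream's features over T steps
q = rng.normal(size=d)           # query from the other stream
W1, W2 = rng.normal(size=(a, d)), rng.normal(size=(a, d))
v = rng.normal(size=a)
alpha, context = additive_attention(H, q, W1, W2, v)
```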
Experiment results
Sequence learning for action recognition
Deep Trajectory Descriptor
Temporal Attentive Network
shuttleNet
Hierarchical Temporal Memory Enhanced One-shot Distance Learning
Open Deep Network
Problems and solutions
Most deep neural networks are built from only feed-forward connections, whereas the visual cortex also has abundant feedback connections. Existing RNNs are still not good enough in practice.
[Siegelbaum’00] Siegelbaum S A, Hudspeth A J. Principles of neural science[M]. New York: McGraw-hill, 2000.
Visual Cortical Pathways[Siegelbaum’00]
Blue arrow: feed-forward connection Red arrow: feed-back connection
IT V1 V2 V4 TEO
shuttleNet components: loop connections, input projection, output selection.
Experiment results
Comparing with existing RNNs
Comparing with other action recognition methods
Sequence learning for action recognition
Deep Trajectory Descriptor
Temporal Attentive Network
shuttleNet
Hierarchical Temporal Memory Enhanced One-shot Distance Learning
Open Deep Network
Videos are complicated because of temporal complexity and variation
Distance learning can decrease intra-class distance while increasing inter-class distance. Method: Triplet loss
Not all frames contribute equally to recognition:
The harder a frame is to predict, the more representative it is. Method: Hierarchical Temporal Memory (HTM)
Hawkins, Jeff (2004). On Intelligence (1st ed.). Times Books. p. 272. ISBN 0805074562.
Matching Network training
Sample a target video and a support set video from seen classes, maximize the probability of the class that the target video belongs to.
HTM training
Make the HTM accustomed to videos of the seen classes.
Triplet loss
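The triplet loss named above is the standard margin-based formulation: pull the anchor toward a same-class (positive) sample and push it away from a different-class (negative) sample. A minimal sketch (the margin value is illustrative):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.5):
    """max(0, d(a, p) - d(a, n) + margin) with squared Euclidean distance:
    decreases intra-class distance while increasing inter-class distance."""
    d_pos = np.sum((anchor - positive) ** 2)   # distance to same-class sample
    d_neg = np.sum((anchor - negative) ** 2)   # distance to other-class sample
    return max(0.0, d_pos - d_neg + margin)

a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])      # close: same class
n = np.array([2.0, 0.0])      # far: different class
loss_easy = triplet_loss(a, p, n)   # already separated -> zero loss
loss_hard = triplet_loss(a, n, p)   # violated triplet -> positive loss
```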
Sequence learning for action recognition
Deep Trajectory Descriptor
Temporal Attentive Network
shuttleNet
Hierarchical Temporal Memory Enhanced One-shot Distance Learning
Open Deep Network
Motivation
Action recognition in the real world is essentially an open-set problem:
It is impossible to know all action categories beforehand, and infeasible to prepare sufficient training samples for emerging categories.
Most recognition systems are designed for a static, closed world,
under the primary assumption that all categories are known a priori.
(Closed set: train and test both contain only known categories. Open set: training uses known categories, but unknown categories appear at test time.)
Multi-class unknown category detection
The multi-class triplet thresholding method
Consider the inter-class relation for unknown-category detection: accept the knowns and reject the unknowns. Train a triplet of thresholds [\eta_i, \mu_i, \delta_i] per category, then apply the triplet threshold to each sample during detection.

Accept threshold: \eta_i = \alpha \cdot \mathrm{mean}_j(f_{i,j})
Reject threshold: \mu_i = \beta \cdot \eta_i
Distance threshold: \delta_i = \sigma \cdot \mathrm{mean}_j(f_{i,j} - s_{i,j})

where f_{i,j} is the maximal score of the i-th category on the j-th sample, and s_{i,j} is the second-maximal score on the same sample.
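The per-category triplet thresholds can be computed directly from held-out scores. A sketch following the slide's definitions (the multiplier values α, β, σ here are illustrative, not the paper's tuned settings):

```python
import numpy as np

def triplet_thresholds(top_scores, second_scores, alpha=0.9, beta=0.5, sigma=1.0):
    """Per-category triplet [eta, mu, delta].
    top_scores[i, j]: maximal score of category i on its j-th sample;
    second_scores[i, j]: second-maximal score on the same sample."""
    eta = alpha * top_scores.mean(axis=1)                      # accept threshold
    mu = beta * eta                                            # reject threshold
    delta = sigma * (top_scores - second_scores).mean(axis=1)  # distance threshold
    return eta, mu, delta

top = np.array([[0.9, 0.8],     # category 0, two validation samples
                [0.7, 0.9]])    # category 1
second = np.array([[0.3, 0.2],
                   [0.1, 0.3]])
eta, mu, delta = triplet_thresholds(top, second)
```

At detection time, a sample whose top score exceeds η is accepted as known, one below μ is rejected as unknown, and δ gauges the gap to the runner-up class.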
Updating deep network
Reconstruct the classification layer:

w_{N+1} = w'_{N+1} + w''_{N+1} = \frac{1}{N}\sum_{n=1}^{N} w_n + \frac{1}{M}\sum_{m=1}^{M} w_m

① Known categories: take the mean of the N known-category weight columns as part of the new weights, so that the new category obeys the same distribution as the known categories.
(Weight matrix → updated weight matrix with a new weight column appended for the new category.)
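Appending the new category's column per the initialization above can be sketched in a few lines (the function name and toy matrix are illustrative):

```python
import numpy as np

def init_new_category(W, similar_idx):
    """Initialize the new category's classifier weights as the mean of all
    N known-category columns plus the mean of the M most similar columns,
    i.e. w_{N+1} = w'_{N+1} + w''_{N+1}.  W: (d, N) weight matrix."""
    w_known = W.mean(axis=1)                     # (1/N) sum_n w_n
    w_similar = W[:, similar_idx].mean(axis=1)   # (1/M) sum_m w_m
    w_new = w_known + w_similar
    return np.concatenate([W, w_new[:, None]], axis=1)  # append new column

W = np.array([[1.0, 3.0, 5.0],
              [2.0, 4.0, 6.0]])   # d=2 features, N=3 known categories
W_up = init_new_category(W, similar_idx=[1, 2])  # categories 1, 2 most similar
```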
② Similar categories: the M most similar categories should play a more critical role in the initialization.
Incremental training
Balanced training strategy: guarantee that each known category has the same number of samples as the new category during fine-tuning, to reduce jitter of the model.
Allometry training strategy: adopt a learning-rate decay matrix that differs between known and new categories, forcing the new categories to learn much faster than the known categories during fine-tuning.
Allometry training
Allometry training factor:

\alpha_i = \begin{cases} 0.1, & i \le N \\ 1, & i > N \end{cases}

Updating the weights:

W_{j,k} \leftarrow W_{j,k} - \alpha_j \eta \,\frac{\partial J(W, b)}{\partial W_{j,k}}
b_j \leftarrow b_j - \alpha_j \eta \,\frac{\partial J(W, b)}{\partial b_j}

where η is the learning rate.
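One allometry-scaled SGD step can be sketched as follows (per-row gradients stand in for the real backprop; the toy values are illustrative):

```python
import numpy as np

def allometry_step(W, grad, lr, n_known):
    """SGD step with the allometry factor: known-category rows
    (index < n_known) use 0.1x the learning rate, new-category rows use 1x,
    so new categories learn much faster during fine-tuning.
    W, grad: (K, d) weight matrix and its gradient."""
    K = W.shape[0]
    alpha = np.where(np.arange(K) < n_known, 0.1, 1.0)  # per-category factor
    return W - alpha[:, None] * lr * grad

W = np.zeros((3, 2))          # 2 known categories + 1 new category
grad = np.ones((3, 2))
W_next = allometry_step(W, grad, lr=1.0, n_known=2)
```

The new category's row moves ten times farther per step than the known rows.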
Comparing our initialization with stochastic initialization: ① accuracies under different openness; ② testing accuracies at each iteration; ③④ accuracies of known and unknown categories at each iteration.
Experiment results
ODN vs. closed-set action recognition:
5.39 samples vs. 94.4 samples, with comparable performance.
High computational cost of optical flow; high computational cost of deep learning models.
3D ConvNets are much faster, but still very heavy.
Optical flow method          | TVL1  | FB    | S2D   | PCA   | DIS   | Flownet2 | Brox  | LK
Accuracy on UCF101 split 1   | 72.98 | 70.58 | 65.34 | 69.6  | 63.28 | 57.71    | 72.35 | 37.51
Per-frame time (340×256), s  | 0.181 | 0.056 | 0.026 | 0.053 | 0.07  | 0.123    | 0.074 | 0.0034
FPS                          | 5.5   | 17.9  | 38.5  | 18.9  | 14.3  | 8.1      | 13.5  | 294.1
Speed and accuracy of several optical flow methods
Trend: Abandon optical flow
Year | Model    | Method                                                          | Input dimension
2012 | 3D CNN   | Replace 2D conv with 3D conv                                    | 7×33×60×40
2015 | C3D      | Deep 3D ConvNet                                                 | 16×3×112×112
2016 | ResNet3D | 3D ResNet-50                                                    | 80×6×80×80
2017 | I3D      | Inflate 2D kernels to 3D and copy weights along the 3rd dimension | 64×3×224×224
2017 | NL I3D   | Non-local operation                                             | 32×8×224×224
2017 | P3D      | Simulate a 3×3×3 kernel with 1×3×3 and 3×1×1 kernels            | 16×3×160×160

Complexity: I3D has 107.9B FLOPs; Google used 64 GPUs to train the I3D model.
I3D experiments based on TensorFlow
The train/test code is available at https://github.com/shiyemin/shuttleNet
Input size: 64×3×224×224. Batch Norm needs a big enough batch size, which consumes a lot of GPU memory.
107.9B FLOPs: sufficient computational power is needed to reduce training time.
I3D experiments based on TensorFlow
Note: we do not use the NVLink connections between GPUs. Because there is no further multi-GPU optimization, our results may be slower than others'.
GPU  | #GPU | Max batch size | Train FPS | Test FPS
K40  | 2    | 16             | 149.87    | 449.1
K80* | 2    | 16             | 148.65    | 411.3
K80* | 4    | 32             | 278.61    | 808.2
K80* | 8    | 64             | 562.46    | 1649.4
K80* | 16   | 112            | 1046.24   | 3298.8
P100 | 2    | 22             | 622.33    | 1692.5
P100 | 4    | 44             | 1212.01   | 3338
*Here, one K80 is one core of a K80 card (each K80 card has two cores).
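The K80 rows give a feel for the multi-GPU scaling; efficiency relative to ideal linear scaling (taking 2 GPUs as the baseline) can be computed directly from the table's Train FPS numbers:

```python
# Train FPS for the K80 rows of the table above, keyed by GPU count.
train_fps = {2: 148.65, 4: 278.61, 8: 562.46, 16: 1046.24}

def scaling_efficiency(fps, base=2):
    """efficiency(n) = (FPS_n / FPS_base) / (n / base):
    1.0 means perfect linear scaling from `base` GPUs to n GPUs."""
    return {n: (v / fps[base]) / (n / base) for n, v in fps.items()}

eff = scaling_efficiency(train_fps)
```

Scaling stays above roughly 85% efficiency out to 16 K80 cores even without NVLink or further multi-GPU optimization.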
References
Wang L, Xiong Y, Wang Z, et al. Temporal segment networks: towards good practices for deep action recognition[C]//European Conference on Computer Vision. Springer International Publishing, 2016: 20-36.
Donahue J, Anne Hendricks L, Guadarrama S, et al. Long-term recurrent convolutional networks for visual recognition and description[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015: 2625-2634.
Yue-Hei Ng J, Hausknecht M, Vijayanarasimhan S, et al. Beyond short snippets: Deep networks for video classification[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015: 4694-4702.
Wu Z, Wang X, Jiang Y G, et al. Modeling spatial-temporal clues in a hybrid deep learning framework for video classification[C]//Proceedings of the 23rd ACM International Conference on Multimedia. ACM, 2015: 461-470.
Sharma S, Kiros R, Salakhutdinov R. Action recognition using visual attention[J]. arXiv preprint arXiv:1511.04119, 2015.
Schuldt C, Laptev I, Caputo B. Recognizing human actions: a local SVM approach[C]//Proceedings of the 17th International Conference on Pattern Recognition (ICPR). IEEE, 2004, 3: 32-36.
Reddy K K, Shah M. Recognizing 50 human action categories of web videos[J]. Machine Vision and Applications, 2013, 24(5): 971-981.
Siegelbaum S A, Hudspeth A J. Principles of neural science[M]. New York: McGraw-Hill, 2000.
Wang L M, Qiao Y, Tang X. Motionlets: Mid-level 3D parts for human motion recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2013: 2674-2681.
Wang H, Kläser A, Schmid C, et al. Dense trajectories and motion boundary descriptors for action recognition[J]. International Journal of Computer Vision, 2013, 103(1): 60-79.
Cai Z, Wang L, Peng X, et al. Multi-view super vector for action recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014: 596-603.
Peng X, Wang L, Wang X, et al. Bag of visual words and fusion methods for action recognition: Comprehensive study and good practice[J]. Computer Vision and Image Understanding, 2016.
Liu L, Shao L, Rockett P. Boosted key-frame selection and correlated pyramidal motion-feature representation for human action recognition[J]. Pattern Recognition, 2013, 46(7): 1810-1818.
Wang L, Qiao Y, Tang X. Action recognition with trajectory-pooled deep-convolutional descriptors[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015: 4305-4314.
Wang L, Qiao Y, Tang X. MoFAP: A multi-level representation for action recognition[J]. International Journal of Computer Vision, 2015: 1-18.
Varol G, Laptev I, Schmid C. Long-term temporal convolutions for action recognition[J]. arXiv preprint arXiv:1604.04494, 2016.
Wang L, Xiong Y, Wang Z, et al. Towards good practices for very deep two-stream convnets[J]. arXiv preprint arXiv:1507.02159, 2015.