slide-1
SLIDE 1

GPU Accelerated Sequence Learning for Action Recognition

Yemin Shi shiyemin@pku.edu.cn 2018-03


slide-2
SLIDE 2

Background

Figure labels: Object Recognition (Image Classification); Action Recognition (Video Classification)

• Action Recognition vs. Object Recognition
  • Temporal domain, long-term dependence, high computational complexity.
  • General methods are not good enough for action recognition.
  • Existing methods are still far from practical use.

slide-3
SLIDE 3

Research Trends

Dataset      Year       Actions  Videos  Annotations  Source         Localization
HMDB51       2011       51       7K      7K           YouTube/Movie  No
UCF101       2012       101      13K     13K          YouTube        No
Sports 1M    2014       487      1.1M    1.1M         YouTube        No
THUMOS 15    2014       101      24K     21K          YouTube        Yes
ActivityNet  2015       200      20K     23K          YouTube        Yes
Charades     2016/2017  157      10K     67K          267 Homes      Yes
AVA          2017       80       214     197K         Movie          Yes
Kinetics     2017       400      305K    305K         YouTube        No
MIT          2017       339      1M      1M           10 sources     No
SLAC         2017       200      520K    1.75M        YouTube        Yes

slide-4
SLIDE 4

Action Recognition

• Modeling the temporal domain is one of the most important goals of action recognition.
• Shortcomings of existing methods:
  • Actions have long durations: high complexity.
  • LSTM is not good enough.
• Therefore, we need:
  • A more efficient sequence learning model to improve the ability to model temporal information.

slide-5
SLIDE 5

Overview

Figure labels (topics): Hand-crafted Features and Deep Features; Importance of Each Frame; The Ability of Modeling Temporal Domain; One-shot Action Recognition; Open-set Action Recognition

Figure labels (methods): Deep Trajectory Descriptor; Temporal Attentive Network; shuttleNet; Hierarchical Temporal Memory Enhanced One-shot Distance Learning; Open Deep Network

slide-6
SLIDE 6

Overview

• Sequence learning for action recognition
  • Deep Trajectory Descriptor
  • Temporal Attentive Network
  • shuttleNet
  • Hierarchical Temporal Memory Enhanced One-shot Distance Learning
  • Open Deep Network

slide-8
SLIDE 8

Deep Trajectory Descriptor

• Problems and Solutions
  • Hand-crafted features can hardly describe the movement process; CNNs are good at describing structure.
  • Integrate hand-crafted features and CNNs to improve performance.

Hand-crafted feature: More statistics, less structure. CNN: Structure is important.

slide-9
SLIDE 9

Deep Trajectory Descriptor

• Improve Dense Trajectory with Background Subtraction
  • Only extract trajectories and optical flow on the foreground.

Figure labels: Videos; Masks; Foreground

Where $S_{\mathrm{fore}}$ is the sum of the foreground square area, and $(i, j)$ index around the square area.
slide-10
SLIDE 10

Deep Trajectory Descriptor

• Main Idea
  • Trajectory Texture Image: project trajectories onto a canvas.
  • A CNN is employed for structural feature learning.

Pipeline (from the figure): Input video → Dense trajectories → Project into 2D space (projection over an adaptive duration) → Trajectory Texture Image → CNN (Conv, Pooling, LRN, …, Conv, FC)
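As a rough illustration of the projection step, the sketch below rasterizes dense trajectories onto a 2D canvas. It is not the talk's implementation: the function name `project_trajectories`, the brightness-by-time encoding, and the `steps` interpolation parameter are all assumptions made for this example.

```python
def project_trajectories(trajectories, width, height, steps=8):
    """Rasterize (x, y) point trajectories onto a height x width canvas.

    Hypothetical helper: later segments of a trajectory are drawn brighter,
    which is one possible way to keep temporal order visible in the image.
    """
    canvas = [[0.0] * width for _ in range(height)]
    for traj in trajectories:
        n = len(traj)
        for k in range(n - 1):
            (x0, y0), (x1, y1) = traj[k], traj[k + 1]
            value = (k + 1) / (n - 1)  # in (0, 1], grows along the trajectory
            # Sample points along the segment so the drawn line has no gaps.
            for s in range(steps + 1):
                t = s / steps
                col = int(round(x0 + t * (x1 - x0)))
                row = int(round(y0 + t * (y1 - y0)))
                if 0 <= row < height and 0 <= col < width:
                    canvas[row][col] = max(canvas[row][col], value)
    return canvas


# Usage: one horizontal trajectory on a 5 x 3 canvas.
canvas = project_trajectories([[(0, 0), (4, 0)]], width=5, height=3)
```

The resulting grid can then be treated like any single-channel image and passed to a CNN.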

slide-11
SLIDE 11

Deep Trajectory Descriptor

slide-12
SLIDE 12

Deep Trajectory Descriptor

• Improve the trajectory projection method

slide-13
SLIDE 13

Deep Trajectory Descriptor

• DTD with LSTM
  • Treat each Trajectory Texture Image as one time-step input; an LSTM is used to model the temporal domain.
  • Improve the ability of DTD to model complex actions.

Our LSTM Model

$x_t$ is the input at time $t$. $h_t$ is the hidden state at time $t$. $i_t$, $f_t$, $c_t$, $o_t$ are the input gate, forget gate, memory cell and output gate at time $t$.
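The equations of the LSTM model were an image and did not survive in this transcript. For orientation only, the widely used standard LSTM update is written below with the symbols defined above. This is an assumption: the talk's variant may differ, and the weight matrices $W_*$, $U_*$, biases $b_*$, the sigmoid $\sigma$ and the element-wise product $\odot$ are notation introduced here.

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```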
slide-14
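SLIDE 14

Deep Trajectory Descriptor

• Learn long-term action description
  • CNN for DTD feature learning;
  • Sequential DTD for long-term action representation;
  • RNN (LSTM) for temporal domain modeling.

Figure label: ApplyEyeMakeup

Loss function:

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{k} 1\{y^{(i)} = j\}\,\log\frac{e^{\theta_j^{T}x^{(i)}}}{\sum_{l=1}^{k}e^{\theta_l^{T}x^{(i)}}} + \frac{\lambda}{2}\sum_{i=1}^{k}\sum_{j=0}^{n}\theta_{ij}^{2}$$

The first term is the softmax loss; the second term is the weight regularizer.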
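The softmax loss with a weight regularizer can be evaluated directly, as in the minimal sketch below. The function name `softmax_loss` and the list-based data layout are assumptions for illustration; labels are 0-based here, whereas the slide's formula indexes classes from 1.

```python
import math


def softmax_loss(theta, xs, ys, lam):
    """Softmax loss plus (lam / 2) * sum of squared weights.

    theta: k weight rows (one per class); xs: m feature vectors;
    ys: m class labels in 0..k-1; lam: regularization strength.
    """
    m = len(xs)
    data_term = 0.0
    for x, y in zip(xs, ys):
        scores = [sum(w * v for w, v in zip(row, x)) for row in theta]
        peak = max(scores)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        data_term -= scores[y] - log_norm  # -log softmax probability of y
    reg_term = 0.5 * lam * sum(w * w for row in theta for w in row)
    return data_term / m + reg_term


# Usage: two classes, zero weights, so each class has probability 1/2.
loss = softmax_loss([[0.0, 0.0], [0.0, 0.0]], [[1.0, 2.0]], [0], lam=0.0)
```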

slide-15
SLIDE 15

Deep Trajectory Descriptor

• Three-stream Framework

slide-16
SLIDE 16

Deep Trajectory Descriptor

• Experiment results

slide-17
SLIDE 17

Overview

• Sequence learning for action recognition
  • Deep Trajectory Descriptor
  • Temporal Attentive Network
  • shuttleNet
  • Hierarchical Temporal Memory Enhanced One-shot Distance Learning
  • Open Deep Network

slide-18
SLIDE 18

Temporal Attentive Network

• Problems and solutions
  • Not all postures contribute equally to the successful recognition of an action.
  • Texture and motion are not independent of each other.
  • The most important frames for RGB and optical flow may not correspond (not the same frame ID).

slide-19
SLIDE 19

Temporal Attentive Network

• Attention mechanism

slide-20
SLIDE 20

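Temporal Attentive Network

$$e_{ij} = v^{T}\tanh\left(W'_1 h_i + W'_2 g_j\right)$$

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T}\exp(e_{kj})}$$

$$o_j^{h} = \sum_{i=1}^{T}\alpha_{ij}h_i$$

$$f_{ji} = u^{T}\tanh\left(W'_3 g_i + W'_4 h_j\right)$$

$$\beta_{ji} = \frac{\exp(f_{ji})}{\sum_{k=1}^{T}\exp(f_{jk})}$$

$$o_i^{g} = \sum_{j=1}^{T}\beta_{ji}g_j$$

Figure labels: Spatial domain; Temporal domain; Weight for each input; Weighted sum for all inputs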
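To make the first three attention equations concrete, the sketch below computes the weights $\alpha_{ij}$ and the weighted sum $o_j^h$ for a single query $g_j$. It is a plain-Python illustration under stated assumptions, not the talk's code: the names `matvec` and `attend` are hypothetical, and treating $h$ as spatial-stream features and $g$ as temporal-stream features is inferred from the figure labels.

```python
import math


def matvec(mat, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vec)) for row in mat]


def attend(h, g_j, W1, W2, v):
    """Attention over the sequence h for one query vector g_j.

    e_i = v^T tanh(W1 h_i + W2 g_j); alpha = softmax over i;
    o = sum_i alpha_i * h_i.
    """
    query = matvec(W2, g_j)
    scores = []
    for h_i in h:
        hidden = [math.tanh(a + b) for a, b in zip(matvec(W1, h_i), query)]
        scores.append(sum(vk * hk for vk, hk in zip(v, hidden)))
    peak = max(scores)  # stabilize the softmax
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]
    dim = len(h[0])
    o = [sum(alpha[i] * h[i][d] for i in range(len(h))) for d in range(dim)]
    return alpha, o


# Usage: with all-zero projections every frame gets the same weight.
zeros = [[0.0, 0.0], [0.0, 0.0]]
alpha, o = attend([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], zeros, zeros, [1.0, 1.0])
```

The second set of equations has the same form with the roles of $h$ and $g$ swapped, so the same function could be reused with different weight matrices.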

slide-21
SLIDE 21

Temporal Attentive Network

• Experiment results

slide-22
SLIDE 22

Overview

• Sequence learning for action recognition
  • Deep Trajectory Descriptor
  • Temporal Attentive Network
  • shuttleNet
  • Hierarchical Temporal Memory Enhanced One-shot Distance Learning
  • Open Deep Network

slide-23
SLIDE 23

shuttleNet

• Problems and solutions
  • Most deep neural networks are built with only feed-forward connections.
  • Existing RNNs are still not good enough in practice.

Figure: Visual Cortical Pathways [Siegelbaum'00], showing areas V1, V2, V4, TEO and IT. Blue arrows: feed-forward connections. Red arrows: feed-back connections.

[Siegelbaum'00] Siegelbaum S A, Hudspeth A J. Principles of neural science[M]. New York: McGraw-Hill, 2000.

slide-25
SLIDE 25

shuttleNet

Figure labels: Loop Connections; Input Projection; Output Selection

slide-26
SLIDE 26

shuttleNet

• Experiment results
  • Comparison with existing RNNs
  • Comparison with other action recognition methods

slide-27
SLIDE 27

Overview

• Sequence learning for action recognition
  • Deep Trajectory Descriptor
  • Temporal Attentive Network
  • shuttleNet
  • Hierarchical Temporal Memory Enhanced One-shot Distance Learning
  • Open Deep Network

slide-28
SLIDE 28

Motivation

• Videos are complicated because of temporal complexity and variation
  • Distance learning can decrease intra-class distance while increasing inter-class distance.
  • Method: Triplet loss
• Not all frames contribute equally to recognition
  • The harder a frame is to predict, the more representative it is.
  • Method: Hierarchical Temporal Memory (HTM)

Hawkins, Jeff (2004). On Intelligence (1st ed.). Times Books. p. 272. ISBN 0805074562.

slide-29
SLIDE 29

Framework

slide-30
SLIDE 30

Seen-class Stage

• Matching Network training
  • Sample a target video and a support-set video from seen classes, and maximize the probability of the class that the target video belongs to.
• HTM training
  • Make the HTM accustomed to seen-class videos.

slide-31
SLIDE 31

Unseen-class Stage

• Triplet loss
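The slide names the triplet loss without giving its formula. A minimal sketch of the commonly used hinge form is shown here; the squared Euclidean distance, the `margin` default and the function names are assumptions, and the talk may use a different distance or margin.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge triplet loss: max(0, d(a, p) - d(a, n) + margin).

    The loss is zero once the negative is farther from the anchor than the
    positive by at least the margin, which shrinks intra-class distance
    relative to inter-class distance.
    """
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    return max(0.0, gap + margin)


# Usage: a well-separated triplet and a violating one.
easy = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 0.0])
hard = triplet_loss([0.0, 0.0], [0.0, 1.0], [1.0, 0.0])
```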

slide-32
SLIDE 32

Experiments

slide-33
SLIDE 33

Overview

 Sequence learning for action recognition

 Deep Trajectory Descriptor  Temporal Attentive Network  shuttleNet  Hierarchical Temporal Memory Enhanced One- shot Distance Learning  Open Deep Network

slide-34
SLIDE 34

Open Deep Network

• Motivation
  • Action recognition in the real world is essentially an open-set problem
    • Impossible to know all action categories beforehand;
    • Infeasible to prepare sufficient training samples for emerging categories.
  • Most recognition systems are designed for a static closed world
    • Primary assumption: all categories are known a priori.

Figure labels: Known (Train/Test); Known, Unknown (Train, Test)

slide-35
SLIDE 35

Open Deep Network

• Multi-class unknown category detection
  • The multi-class triplet thresholding method
    • Consider the inter-class relation for unknown category detection: accept the knowns and reject the unknowns
    • Train a triplet threshold $[\eta_i, \mu_i, \delta_i]$ per category
    • Apply the triplet threshold to each sample during the detection process

Define: $[\eta_i, \mu_i, \delta_i]$

Accept threshold: $\eta_i = \alpha \cdot \mathrm{Mean}\left(\sum_{j=1}^{X} f_{i,j}\right)$

Reject threshold: $\mu_i = \beta \cdot \eta_i$

Distance threshold: $\delta_i = \sigma \cdot \mathrm{Mean}\left(\sum_{j=1}^{X} (f_{i,j} - s_{i,j})\right)$

where $f_{i,j}$ is the maximal score of the i-th category and $s_{i,j}$ is the second maximal score of the i-th category.
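The three thresholds can be computed from per-sample scores as sketched below. This rests on an interpretation that the slide does not spell out: "Mean" is read as the average over the X samples of category i, and the maximal and second maximal scores are taken per sample. The function name `triplet_threshold` is hypothetical, and the accept/reject decision rule itself is not given in the slides, so it is not implemented.

```python
def triplet_threshold(score_rows, alpha, beta, sigma):
    """Compute (eta, mu, delta) for one category.

    score_rows: one list of class scores per sample of this category
    (each list needs at least two scores).
    """
    top_scores = []
    top_gaps = []
    for scores in score_rows:
        ordered = sorted(scores, reverse=True)
        top_scores.append(ordered[0])            # maximal score
        top_gaps.append(ordered[0] - ordered[1])  # maximal minus second maximal
    eta = alpha * sum(top_scores) / len(top_scores)  # accept threshold
    mu = beta * eta                                  # reject threshold
    delta = sigma * sum(top_gaps) / len(top_gaps)    # distance threshold
    return eta, mu, delta


# Usage with two samples and illustrative coefficient values.
eta, mu, delta = triplet_threshold([[0.8, 0.2], [0.6, 0.4]], alpha=1.0, beta=0.5, sigma=1.0)
```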

slide-37
SLIDE 37

Open Deep Network

• Updating deep network
  • Reconstruct the classification layer

$$w_{N+1} = w'_{N+1} + w''_{N+1} = \frac{1}{N}\sum_{n=1}^{N} w_n + \frac{1}{M}\sum_{m=1}^{M} w_m$$

  • Transfer knowledge from the trained categories: calculate the mean value of the known categories as part of the weights, so that the new category obeys the same distribution as the known categories.
  • Transfer knowledge from the similar categories: the similar categories should play a more critical role in the initialization.

Figure labels: Weight matrix; Updated weight matrix; New weight column for new category; Similar categories
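The initialization of the new weight column can be sketched as the sum of two means. The code assumes that N counts the known categories and M the similar categories, which the formula suggests but the slide does not state; `mean_vector` and `init_new_class_weight` are hypothetical names.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def init_new_class_weight(known_weights, similar_weights):
    """w_{N+1} = mean of all known columns + mean of the similar columns."""
    w_known = mean_vector(known_weights)      # (1/N) * sum of w_n
    w_similar = mean_vector(similar_weights)  # (1/M) * sum of w_m
    return [a + b for a, b in zip(w_known, w_similar)]


# Usage: two known categories, one of which is similar to the new one.
known = [[1.0, 2.0], [3.0, 4.0]]
w_new = init_new_class_weight(known, similar_weights=[[3.0, 4.0]])
```

Because the similar categories appear in both means, they contribute more to the new column than the other known categories do.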

slide-38
SLIDE 38

Open Deep Network

• Incremental training
  • Balanced training strategy: guarantee that each known category has the same number of samples as the new category during fine-tuning, to reduce jitter of the model.
  • Allometry training strategy: adopt a learning-rate decay matrix that differs between known and new categories, forcing new categories to learn much faster than known categories during fine-tuning.
• Allometry training
  • Allometry training factor:

$$\alpha_i = \begin{cases} 0.1, & i \le N \\ 1, & i > N \end{cases}$$

  • Updating the weights:

$$W_{i,j} = W_{i,j} - \alpha_i \lambda \frac{\partial}{\partial W_{i,j}} J(W, b)$$

$$b_i = b_i - \alpha_i \lambda \frac{\partial}{\partial b_i} J(W, b)$$
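The allometry update amounts to a per-category learning-rate multiplier. The sketch below applies one such gradient step; the function names and the list-of-rows layout (one row of W per category) are assumptions for illustration, and the gradients are taken as given rather than computed.

```python
def allometry_factor(i, num_known, known_rate=0.1, new_rate=1.0):
    """alpha_i for the 1-based category index i: 0.1 if i <= N, else 1."""
    return known_rate if i <= num_known else new_rate


def allometry_step(W, b, grad_W, grad_b, lr, num_known):
    """One gradient step where known categories move 10x slower.

    W, grad_W: one row per category; b, grad_b: one entry per category;
    lr plays the role of lambda in the update formulas.
    """
    new_W, new_b = [], []
    for i, (row, grad_row) in enumerate(zip(W, grad_W), start=1):
        a = allometry_factor(i, num_known)
        new_W.append([w - a * lr * g for w, g in zip(row, grad_row)])
        new_b.append(b[i - 1] - a * lr * grad_b[i - 1])
    return new_W, new_b


# Usage: category 1 is known, category 2 is new; both see the same gradient.
new_W, new_b = allometry_step([[1.0], [1.0]], [0.0, 0.0],
                              [[1.0], [1.0]], [1.0, 1.0],
                              lr=1.0, num_known=1)
```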

slide-39
SLIDE 39

Open Deep Network

Figure panels:
① Comparing our initialization and stochastic initialization
② The accuracies under different openness
③ Testing accuracies at each iteration
④ The accuracies of known categories and unknown categories at each iteration

slide-40
SLIDE 40

Open Deep Network

• Experiment results
  • ODN vs. closed-set action recognition
    • 5.39 samples vs. 94.4 samples
    • Comparable performance

slide-41
SLIDE 41

Current Situation of Action Recognition

• High computation cost of optical flow;
• High computation cost of deep learning models.
  • 3D ConvNet is much faster, but still very heavy.

slide-42
SLIDE 42

Real-time capability of optical flows

Speed and accuracy of several optical flow methods

Optical flow  Accuracy on UCF101 split 1 (%)  Per-frame time, 340x256 (s)  fps
TVL1          72.98                           0.181                        5.5
FB            70.58                           0.056                        17.9
S2D           65.34                           0.026                        38.5
PCA           69.6                            0.053                        18.9
DIS           63.28                           0.07                         14.3
Flownet2      57.71                           0.123(+-)                    8.1
Brox          72.35                           0.074                        13.5
LK            37.51                           0.0034                       294.1
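The fps row follows from the per-frame time as fps = 1 / time, with time in seconds. The short check below recomputes it from the values in the table; the variable names are only for this example.

```python
# Per-frame time in seconds at 340x256, copied from the table.
per_frame_seconds = {
    "TVL1": 0.181, "FB": 0.056, "S2D": 0.026, "PCA": 0.053,
    "DIS": 0.07, "Flownet2": 0.123, "Brox": 0.074, "LK": 0.0034,
}

# Frames per second is the reciprocal of the per-frame time.
fps = {name: round(1.0 / seconds, 1) for name, seconds in per_frame_seconds.items()}

# Methods that reach at least 25 fps under this measurement.
real_time = sorted(name for name, value in fps.items() if value >= 25.0)
```

The 25 fps cut-off is an illustrative choice for "real-time", not a figure from the talk.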

Trend: Abandon optical flow

slide-43
SLIDE 43

3D CNN

Year  Model     Method                                                                     Input dimension
2012  3D CNN    Replace 2D conv with 3D conv                                               7x33x60x40
2015  C3D       Deep 3D ConvNet                                                            16x3x112x112
2016  ResNet3D  3D Resnet50                                                                80x6x80x80
2017  I3D       Inflate 2D kernels to 3D kernels and copy weights along the 3rd dimension  64x3x224x224
2017  NL I3D    Non-local operation                                                        32x8x224x224
2017  P3D       Simulate 3x3x3 kernel with 1x3x3 and 3x1x1 kernels                         16x3x160x160

Complexity: I3D has 107.9B FLOPs. Google uses 64 GPUs to train the I3D model.

slide-44
SLIDE 44

GPU selection?

• I3D experiments based on TensorFlow
  • The train/test code is available at https://github.com/shiyemin/shuttleNet
  • Input size: 64x3x224x224
  • Batch Norm needs a big enough batch size, which consumes a lot of GPU memory.
  • 107.9B FLOPs: sufficient computational power is needed to reduce training time.
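To get a feel for the memory pressure, the sketch below computes the size of the raw input tensor alone for one 64x3x224x224 clip. It assumes 4-byte float32 values, and the batch size of 16 is only an example value. Intermediate activations, which dominate real GPU memory use, are not counted.

```python
def tensor_bytes(shape, bytes_per_value=4):
    """Size in bytes of a dense tensor with the given shape (float32 by default)."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_value


clip_shape = (64, 3, 224, 224)                 # frames x channels x height x width
clip_mib = tensor_bytes(clip_shape) / 2 ** 20  # one clip, in MiB
batch_mib = 16 * clip_mib                      # an example batch of 16 clips
```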

slide-45
SLIDE 45

GPU selection?

• I3D experiments based on TensorFlow
  • Note: we do not use the NVLink connections between GPUs.
  • Because there is no further multi-GPU optimization, our results should be slower than others'.

GPU   #GPU  Max Batch Size  Train FPS  Test FPS
K40   2     16              149.87     449.1
K80*  2     16              148.65     411.3
K80*  4     32              278.61     808.2
K80*  8     64              562.46     1649.4
K80*  16    112             1046.24    3298.8
P100  2     22              622.33     1692.5
P100  4     44              1212.01    3338

*Here, one K80 means one core of a K80 card (each K80 card has two cores).
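One way to read the table is to divide training throughput by the number of GPUs and compare it with the smallest configuration of the same GPU type. The sketch below does this with the Train FPS values above; `scaling_efficiency` is a hypothetical helper, and using the 2-GPU row as the baseline is a choice made for this example.

```python
# (GPU type, number of GPUs, train FPS), copied from the table.
rows = [
    ("K40", 2, 149.87),
    ("K80", 2, 148.65), ("K80", 4, 278.61), ("K80", 8, 562.46), ("K80", 16, 1046.24),
    ("P100", 2, 622.33), ("P100", 4, 1212.01),
]

# Training throughput per GPU for each configuration.
per_gpu = {(gpu, n): fps / n for gpu, n, fps in rows}


def scaling_efficiency(gpu, n, base_n=2):
    """Per-GPU throughput at n GPUs relative to the base_n-GPU configuration."""
    return per_gpu[(gpu, n)] / per_gpu[(gpu, base_n)]


k80_16_efficiency = scaling_efficiency("K80", 16)
p100_4_efficiency = scaling_efficiency("P100", 4)
```

A value close to 1 means throughput grows almost linearly with the number of GPUs.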

slide-46
SLIDE 46

Reference

• Wang L, Xiong Y, Wang Z, et al. Temporal segment networks: towards good practices for deep action recognition[C]//European Conference on Computer Vision. Springer International Publishing, 2016: 20-36.
• Donahue J, Anne Hendricks L, Guadarrama S, et al. Long-term recurrent convolutional networks for visual recognition and description[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015: 2625-2634.
• Yue-Hei Ng J, Hausknecht M, Vijayanarasimhan S, et al. Beyond short snippets: Deep networks for video classification[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015: 4694-4702.
• Wu Z, Wang X, Jiang Y G, et al. Modeling spatial-temporal clues in a hybrid deep learning framework for video classification[C]//Proceedings of the 23rd ACM international conference on Multimedia. ACM, 2015: 461-470.
• Sharma S, Kiros R, Salakhutdinov R. Action recognition using visual attention[J]. arXiv preprint arXiv:1511.04119, 2015.
• Schuldt C, Laptev I, Caputo B. Recognizing human actions: a local SVM approach[C]//Pattern Recognition, 2004. ICPR 2004. Proceedings of the 17th International Conference on. IEEE, 2004, 3: 32-36.
• Reddy K K, Shah M. Recognizing 50 human action categories of web videos[J]. Machine Vision and Applications, 2013, 24(5): 971-981.

slide-47
SLIDE 47

Reference

• Siegelbaum S A, Hudspeth A J. Principles of neural science[M]. New York: McGraw-Hill, 2000.
• Wang L M, Qiao Y, Tang X. Motionlets: Mid-level 3d parts for human motion recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2013: 2674-2681.
• Wang H, Kläser A, Schmid C, et al. Dense trajectories and motion boundary descriptors for action recognition[J]. International journal of computer vision, 2013, 103(1): 60-79.
• Cai Z, Wang L, Peng X, et al. Multi-view super vector for action recognition[C]//Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. 2014: 596-603.
• Peng X, Wang L, Wang X, et al. Bag of visual words and fusion methods for action recognition: Comprehensive study and good practice[J]. Computer Vision and Image Understanding, 2016.
• Liu L, Shao L, Rockett P. Boosted key-frame selection and correlated pyramidal motion-feature representation for human action recognition[J]. Pattern recognition, 2013, 46(7): 1810-1818.
• Wang L, Qiao Y, Tang X. Action recognition with trajectory-pooled deep-convolutional descriptors[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015: 4305-4314.

slide-48
SLIDE 48

Reference

• Wang L, Qiao Y, Tang X. MoFAP: A multi-level representation for action recognition[J]. International Journal of Computer Vision, 2015: 1-18.
• Varol G, Laptev I, Schmid C. Long-term Temporal Convolutions for Action Recognition[J]. arXiv preprint arXiv:1604.04494, 2016.
• Wang L, Xiong Y, Wang Z, et al. Towards good practices for very deep two-stream convnets[J]. arXiv preprint arXiv:1507.02159, 2015.

slide-49
SLIDE 49

Thanks!