GPU Accelerated Sequence Learning for Action Recognition
Yemin Shi shiyemin@pku.edu.cn 2018-03
Background
Action Recognition vs. Object Recognition
Object Recognition (Image Classification) vs. Action Recognition (Video Classification)
Action recognition involves the temporal domain, long-term dependence, and high computational complexity. General-purpose methods are not good enough for action recognition, and existing methods are still far from practical use.
Dataset      | Year      | Actions | Videos | Annotations | Source        | Localization
HMDB51       | 2011      | 51      | 7K     | 7K          | YouTube/Movie | No
UCF101       | 2012      | 101     | 13K    | 13K         | YouTube       | No
Sports 1M    | 2014      | 487     | 1.1M   | 1.1M        | YouTube       | No
THUMOS 15    | 2014      | 101     | 24K    | 21K         | YouTube       | Yes
ActivityNet  | 2015      | 200     | 20K    | 23K         | YouTube       | Yes
Charades     | 2016/2017 | 157     | 10K    | 67K         | 267 Homes     | Yes
AVA          | 2017      | 80      | 214    | 197K        | Movie         | Yes
Kinetics     | 2017      | 400     | 305K   | 305K        | YouTube       | No
MIT          | 2017      | 339     | 1M     | 1M          | 10 sources    | No
SLAC         | 2017      | 200     | 520K   | 1.75M       | YouTube       | Yes
Modeling the temporal domain is one of the most important targets of action recognition. Shortcomings of existing methods:
Actions have long durations, and the LSTM, despite its high complexity, is not good enough.
Therefore, we need:
A more efficient sequence learning model to improve the ability to model temporal information.
Problems and corresponding methods:
Hand-crafted Features and Deep Features → Deep Trajectory Descriptor
Importance of Each Frame → Temporal Attentive Network
The Ability of Modeling Temporal Domain → shuttleNet
One-shot Action Recognition → Hierarchical Temporal Memory Enhanced One-shot Distance Learning
Open-set Action Recognition → Open Deep Network
Sequence learning for action recognition
Deep Trajectory Descriptor
Temporal Attentive Network
shuttleNet
Hierarchical Temporal Memory Enhanced One-shot Distance Learning
Open Deep Network
Problems and Solutions
Hand-crafted features can hardly describe the movement process, while CNNs are good at describing structure. Solution: integrate hand-crafted features with a CNN to improve performance.
Hand-crafted features: more statistics, less structure. CNN: structure is important.
Improve Dense Trajectories with Background Subtraction
Only extract trajectories and optical flow on the foreground (videos → masks → foreground), where S_fore is the sum of the foreground area.
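The foreground-only extraction can be sketched with simple frame differencing. This is a minimal stand-in for the slide's background-subtraction step (the threshold value and helper names are illustrative, not the paper's exact method); the mask's area corresponds to S_fore.

```python
import numpy as np

def foreground_mask(frame, background, thresh=25):
    """Frame-differencing background subtraction (illustrative threshold)."""
    diff = np.abs(frame.astype(np.int32) - background.astype(np.int32))
    return diff > thresh  # boolean mask: True = foreground

def keep_foreground_points(points, mask):
    """Keep only trajectory points that fall inside the foreground mask.
    `points` is an (N, 2) array of integer (row, col) coordinates."""
    rows, cols = points[:, 0], points[:, 1]
    return points[mask[rows, cols]]

# Toy example: a single bright moving blob on a dark background.
background = np.zeros((8, 8), dtype=np.uint8)
frame = background.copy()
frame[2:4, 2:4] = 200                      # the "moving" region
mask = foreground_mask(frame, background)
s_fore = mask.sum()                        # area of the foreground region
pts = np.array([[2, 2], [3, 3], [7, 7]])   # two foreground points, one background
fg_pts = keep_foreground_points(pts, mask)
```

Only `fg_pts` would then be used for trajectory and optical-flow extraction.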
Main Idea
Trajectory Texture Image: project trajectories onto a canvas; a CNN is employed for structural feature learning.
Pipeline: input video → dense trajectories → projection into 2D space over an adaptive duration → Trajectory Texture Image → CNN (Conv, Pooling, LRN, …, Conv, FC).
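The projection step can be sketched as accumulating trajectory points onto one 2D canvas. This is only an illustration of the Trajectory Texture Image idea; the paper's exact rasterization and normalization may differ.

```python
import numpy as np

def trajectory_texture_image(trajectories, shape=(64, 64)):
    """Project a set of trajectories onto a single 2D canvas by
    accumulating point visits, then normalize to [0, 1] for the CNN."""
    canvas = np.zeros(shape, dtype=np.float32)
    for traj in trajectories:              # traj: (T, 2) array of (row, col)
        for r, c in traj:
            canvas[int(r), int(c)] += 1.0  # overlapping trajectories add up
    if canvas.max() > 0:
        canvas /= canvas.max()
    return canvas

trajs = [np.array([[1, 1], [1, 2], [1, 3]]),   # two toy trajectories
         np.array([[5, 5], [6, 5]])]
tti = trajectory_texture_image(trajs, shape=(8, 8))
```

The resulting image is what the Conv/Pooling/LRN/FC stack above consumes.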
Improve trajectory projection method
DTD with LSTM
Treat each Trajectory Texture Image as the input of one time step, and use an LSTM to model the temporal domain. This improves the ability of DTD to model complex actions.
Our LSTM Model
x_t is the input at time t; h_t is the hidden state at time t; i_t, f_t, c_t, and o_t are the input gate, forget gate, memory cell, and output gate, respectively.
Learn long-term action description
A CNN for DTD feature learning; sequential DTDs for long-term action representation; an RNN (LSTM) for temporal-domain modeling.
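One step of a standard LSTM, in the notation above, can be sketched in NumPy. The stacked-parameter layout (`W`, `U`, `b` holding all four gates) is an implementation convenience, not the paper's specific code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step: i_t (input gate), f_t (forget gate),
    c_t (memory cell), o_t (output gate)."""
    d = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b          # (4d,) stacked pre-activations
    i_t = sigmoid(z[0:d])
    f_t = sigmoid(z[d:2 * d])
    g_t = np.tanh(z[2 * d:3 * d])         # candidate cell update
    o_t = sigmoid(z[3 * d:4 * d])
    c_t = f_t * c_prev + i_t * g_t        # memory cell
    h_t = o_t * np.tanh(c_t)              # hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
n_in, d = 3, 4
W = rng.normal(size=(4 * d, n_in))
U = rng.normal(size=(4 * d, d))
b = np.zeros(4 * d)
h, c = np.zeros(d), np.zeros(d)
for _ in range(5):                        # e.g. five DTD time steps
    h, c = lstm_step(rng.normal(size=n_in), h, c, W, U, b)
```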
Loss function: softmax loss + weight regularizer,

J(\theta) = -\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K} \mathbf{1}\{z_i = k\}\,\log\frac{e^{\theta_k^{\top} x_i}}{\sum_{m=1}^{K} e^{\theta_m^{\top} x_i}} + \frac{\lambda}{2}\sum_{j=1}^{K}\sum_{k=0}^{d}\theta_{jk}^2

where x_i is the i-th input feature, z_i its label, K the number of classes, d the feature dimension, and λ the weight-decay coefficient.
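The softmax loss with its weight regularizer can be sketched numerically (a minimal NumPy version; variable names follow the slide's loss, and the λ value is illustrative):

```python
import numpy as np

def softmax_loss(theta, X, z, lam=1e-3):
    """Softmax cross-entropy with an L2 weight regularizer.
    theta: (K, d) class weights, X: (n, d) features, z: (n,) labels."""
    logits = X @ theta.T                          # (n, K)
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    n = X.shape[0]
    data_term = -log_probs[np.arange(n), z].mean()
    reg_term = 0.5 * lam * np.sum(theta ** 2)
    return data_term + reg_term

theta = np.zeros((3, 4))       # zero weights -> uniform class probabilities
X = np.ones((5, 4))
z = np.array([0, 1, 2, 0, 1])
loss = softmax_loss(theta, X, z)   # equals log(3) for uniform predictions
```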
Three-stream Framework
Experiment results
Sequence learning for action recognition
Deep Trajectory Descriptor
Temporal Attentive Network
shuttleNet
Hierarchical Temporal Memory Enhanced One-shot Distance Learning
Open Deep Network
Problems and solutions
Not all postures contribute equally to the successful recognition of an action; texture and motion are not independent from each other.
The most important frames for RGB and optical flow may not correspond (they need not share the same frame id).
Attention mechanism
e_{ij} = v^{\top} \tanh(W_1 h_i + W_2 x_j)
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{l=1}^{T} \exp(e_{lj})}
o_j^{h} = \sum_{i=1}^{T} \alpha_{ij} h_i

f_{ji} = u^{\top} \tanh(W_3 x_j + W_4 h_i)
\beta_{ji} = \frac{\exp(f_{ji})}{\sum_{l=1}^{T} \exp(f_{jl})}
o_i^{x} = \sum_{j=1}^{T} \beta_{ji} x_j

where h_i and x_j are the features of the two streams at times i and j, and W_1–W_4, v, u are learned parameters.
In each domain (spatial and temporal), a weight is computed for each input, and a weighted sum is taken over all inputs.
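The "weight for each input, weighted sum over all inputs" pattern is additive (tanh) attention; a minimal NumPy sketch (shapes and names are illustrative, not the paper's code):

```python
import numpy as np

def additive_attention(H, q, W1, W2, v):
    """Score each input h_i against a query q, softmax the scores into
    weights, and return the weighted sum over all inputs.
    Shapes: H (T, d), q (d,), W1/W2 (a, d), v (a,)."""
    scores = np.tanh(H @ W1.T + q @ W2.T) @ v        # (T,) one score per input
    scores -= scores.max()                           # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()    # attention weights, sum to 1
    context = alpha @ H                              # (d,) weighted sum
    return alpha, context

rng = np.random.default_rng(1)
T, d, a = 6, 4, 3
H = rng.normal(size=(T, d))      # e.g. one stream's features over T steps
q = rng.normal(size=d)           # query from the other stream
W1, W2 = rng.normal(size=(a, d)), rng.normal(size=(a, d))
v = rng.normal(size=a)
alpha, context = additive_attention(H, q, W1, W2, v)
```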
Experiment results
Sequence learning for action recognition
Deep Trajectory Descriptor
Temporal Attentive Network
shuttleNet
Hierarchical Temporal Memory Enhanced One-shot Distance Learning
Open Deep Network
Problems and solutions
Most deep neural networks are built from only feed-forward connections, whereas the visual cortex also has abundant feedback connections. Existing RNNs are still not good enough in practice.
[Siegelbaum’00] Siegelbaum S A, Hudspeth A J. Principles of neural science[M]. New York: McGraw-hill, 2000.
Visual Cortical Pathways[Siegelbaum’00]
Blue arrow: feed-forward connection Red arrow: feed-back connection
IT V1 V2 V4 TEO
shuttleNet components: loop connections, input projection, output selection.
Experiment results
Comparing with existing RNNs
Comparing with other action recognition methods
Sequence learning for action recognition
Deep Trajectory Descriptor
Temporal Attentive Network
shuttleNet
Hierarchical Temporal Memory Enhanced One-shot Distance Learning
Open Deep Network
Videos are complicated because of temporal complexity and variation
Distance learning can decrease intra-class distance while increasing inter-class distance. Method: Triplet loss
Not all frames contribute equally to recognition:
The harder a frame is to predict, the more representative it is. Method: Hierarchical Temporal Memory (HTM)
Hawkins, Jeff (2004). On Intelligence (1st ed.). Times Books. p. 272. ISBN 0805074562.
Matching Network training
Sample a target video and a support set video from seen classes, maximize the probability of the class that the target video belongs to.
HTM training
Make the HTM accustomed to videos of the seen classes.
Triplet loss
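The triplet loss named above is the standard margin-based formulation: pull the anchor toward a same-class (positive) sample and push it away from a different-class (negative) sample. A minimal sketch (the margin value is illustrative):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.5):
    """max(0, d(a, p) - d(a, n) + margin) with squared Euclidean distance:
    decreases intra-class distance while increasing inter-class distance."""
    d_pos = np.sum((anchor - positive) ** 2)   # distance to same-class sample
    d_neg = np.sum((anchor - negative) ** 2)   # distance to other-class sample
    return max(0.0, d_pos - d_neg + margin)

a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])      # close: same class
n = np.array([2.0, 0.0])      # far: different class
loss_easy = triplet_loss(a, p, n)   # already separated -> zero loss
loss_hard = triplet_loss(a, n, p)   # violated triplet -> positive loss
```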
Sequence learning for action recognition
Deep Trajectory Descriptor
Temporal Attentive Network
shuttleNet
Hierarchical Temporal Memory Enhanced One-shot Distance Learning
Open Deep Network
Motivation
Action recognition in the real world is essentially an open-set problem:
It is impossible to know all action categories beforehand, and infeasible to prepare sufficient training samples for emerging categories.
Most recognition systems are designed for a static, closed world,
under the primary assumption that all categories are known a priori.
(Closed set: train and test both contain only known categories. Open set: training uses known categories, but unknown categories appear at test time.)
Multi-class unknown category detection
The multi-class triplet thresholding method
Consider the inter-class relation for unknown-category detection: accept the knowns and reject the unknowns. Train a triplet of thresholds [\eta_i, \mu_i, \delta_i] per category, then apply the triplet threshold to each sample during detection.

Accept threshold: \eta_i = \alpha \cdot \mathrm{mean}_j(f_{i,j})
Reject threshold: \mu_i = \beta \cdot \eta_i
Distance threshold: \delta_i = \sigma \cdot \mathrm{mean}_j(f_{i,j} - s_{i,j})

where f_{i,j} is the maximal score of the i-th category on the j-th sample, and s_{i,j} is the second-maximal score on the same sample.
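The per-category triplet thresholds can be computed directly from held-out scores. A sketch following the slide's definitions (the multiplier values α, β, σ here are illustrative, not the paper's tuned settings):

```python
import numpy as np

def triplet_thresholds(top_scores, second_scores, alpha=0.9, beta=0.5, sigma=1.0):
    """Per-category triplet [eta, mu, delta].
    top_scores[i, j]: maximal score of category i on its j-th sample;
    second_scores[i, j]: second-maximal score on the same sample."""
    eta = alpha * top_scores.mean(axis=1)                      # accept threshold
    mu = beta * eta                                            # reject threshold
    delta = sigma * (top_scores - second_scores).mean(axis=1)  # distance threshold
    return eta, mu, delta

top = np.array([[0.9, 0.8],     # category 0, two validation samples
                [0.7, 0.9]])    # category 1
second = np.array([[0.3, 0.2],
                   [0.1, 0.3]])
eta, mu, delta = triplet_thresholds(top, second)
```

At detection time, a sample whose top score exceeds η is accepted as known, one below μ is rejected as unknown, and δ gauges the gap to the runner-up class.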
Updating deep network
Reconstruct the classification layer:

w_{N+1} = w'_{N+1} + w''_{N+1} = \frac{1}{N}\sum_{n=1}^{N} w_n + \frac{1}{M}\sum_{m=1}^{M} w_m

① Known categories: take the mean of the N known-category weight columns as part of the new weights, so that the new category obeys the same distribution as the known categories.
(Weight matrix → updated weight matrix with a new weight column appended for the new category.)
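Appending the new category's column per the initialization above can be sketched in a few lines (the function name and toy matrix are illustrative):

```python
import numpy as np

def init_new_category(W, similar_idx):
    """Initialize the new category's classifier weights as the mean of all
    N known-category columns plus the mean of the M most similar columns,
    i.e. w_{N+1} = w'_{N+1} + w''_{N+1}.  W: (d, N) weight matrix."""
    w_known = W.mean(axis=1)                     # (1/N) sum_n w_n
    w_similar = W[:, similar_idx].mean(axis=1)   # (1/M) sum_m w_m
    w_new = w_known + w_similar
    return np.concatenate([W, w_new[:, None]], axis=1)  # append new column

W = np.array([[1.0, 3.0, 5.0],
              [2.0, 4.0, 6.0]])   # d=2 features, N=3 known categories
W_up = init_new_category(W, similar_idx=[1, 2])  # categories 1, 2 most similar
```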
② Similar categories: the M most similar categories should play a more critical role in the initialization.
Incremental training
Balanced training strategy: guarantee that each known category has the same number of samples as the new category during fine-tuning, to reduce jitter of the model.
Allometry training strategy: adopt a learning-rate decay matrix that differs between known and new categories, forcing the new categories to learn much faster than the known categories during fine-tuning.
Allometry training
Allometry training factor:

\alpha_i = \begin{cases} 0.1, & i \le N \\ 1, & i > N \end{cases}

Updating the weights:

W_{j,k} \leftarrow W_{j,k} - \alpha_j \eta \,\frac{\partial J(W, b)}{\partial W_{j,k}}
b_j \leftarrow b_j - \alpha_j \eta \,\frac{\partial J(W, b)}{\partial b_j}

where η is the learning rate.
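One allometry-scaled SGD step can be sketched as follows (per-row gradients stand in for the real backprop; the toy values are illustrative):

```python
import numpy as np

def allometry_step(W, grad, lr, n_known):
    """SGD step with the allometry factor: known-category rows
    (index < n_known) use 0.1x the learning rate, new-category rows use 1x,
    so new categories learn much faster during fine-tuning.
    W, grad: (K, d) weight matrix and its gradient."""
    K = W.shape[0]
    alpha = np.where(np.arange(K) < n_known, 0.1, 1.0)  # per-category factor
    return W - alpha[:, None] * lr * grad

W = np.zeros((3, 2))          # 2 known categories + 1 new category
grad = np.ones((3, 2))
W_next = allometry_step(W, grad, lr=1.0, n_known=2)
```

The new category's row moves ten times farther per step than the known rows.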
Comparing our initialization with stochastic initialization: ① accuracies under different openness; ② testing accuracies at each iteration; ③④ accuracies of known and unknown categories at each iteration.
Experiment results
ODN vs. closed-set action recognition:
5.39 samples vs. 94.4 samples, with comparable performance.
High computational cost of optical flow; high computational cost of deep learning models.
3D ConvNets are much faster, but still very heavy.
Optical flow method          | TVL1  | FB    | S2D   | PCA   | DIS   | Flownet2 | Brox  | LK
Accuracy on UCF101 split 1   | 72.98 | 70.58 | 65.34 | 69.6  | 63.28 | 57.71    | 72.35 | 37.51
Per-frame time (340×256), s  | 0.181 | 0.056 | 0.026 | 0.053 | 0.07  | 0.123    | 0.074 | 0.0034
FPS                          | 5.5   | 17.9  | 38.5  | 18.9  | 14.3  | 8.1      | 13.5  | 294.1
Speed and accuracy of several optical flow methods
Trend: Abandon optical flow
Year | Model    | Method                                                          | Input dimension
2012 | 3D CNN   | Replace 2D conv with 3D conv                                    | 7×33×60×40
2015 | C3D      | Deep 3D ConvNet                                                 | 16×3×112×112
2016 | ResNet3D | 3D ResNet-50                                                    | 80×6×80×80
2017 | I3D      | Inflate 2D kernels to 3D and copy weights along the 3rd dimension | 64×3×224×224
2017 | NL I3D   | Non-local operation                                             | 32×8×224×224
2017 | P3D      | Simulate a 3×3×3 kernel with 1×3×3 and 3×1×1 kernels            | 16×3×160×160

Complexity: I3D has 107.9B FLOPs; Google used 64 GPUs to train the I3D model.
I3D experiments based on TensorFlow
The train/test code is available at https://github.com/shiyemin/shuttleNet
Input size: 64×3×224×224. Batch Norm needs a big enough batch size, which consumes a lot of GPU memory.
107.9B FLOPs: sufficient computational power is needed to reduce training time.
I3D experiments based on TensorFlow
Note: we do not use the NVLink connections between GPUs. Because there is no further multi-GPU optimization, our results may be slower than others'.
GPU  | #GPU | Max batch size | Train FPS | Test FPS
K40  | 2    | 16             | 149.87    | 449.1
K80* | 2    | 16             | 148.65    | 411.3
K80* | 4    | 32             | 278.61    | 808.2
K80* | 8    | 64             | 562.46    | 1649.4
K80* | 16   | 112            | 1046.24   | 3298.8
P100 | 2    | 22             | 622.33    | 1692.5
P100 | 4    | 44             | 1212.01   | 3338
*Here, one K80 is one core of a K80 card (each K80 card has two cores).
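The K80 rows give a feel for the multi-GPU scaling; efficiency relative to ideal linear scaling (taking 2 GPUs as the baseline) can be computed directly from the table's Train FPS numbers:

```python
# Train FPS for the K80 rows of the table above, keyed by GPU count.
train_fps = {2: 148.65, 4: 278.61, 8: 562.46, 16: 1046.24}

def scaling_efficiency(fps, base=2):
    """efficiency(n) = (FPS_n / FPS_base) / (n / base):
    1.0 means perfect linear scaling from `base` GPUs to n GPUs."""
    return {n: (v / fps[base]) / (n / base) for n, v in fps.items()}

eff = scaling_efficiency(train_fps)
```

Scaling stays above roughly 85% efficiency out to 16 K80 cores even without NVLink or further multi-GPU optimization.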
References
Wang L, Xiong Y, Wang Z, et al. Temporal segment networks: towards good practices for deep action recognition[C]//European Conference on Computer Vision. Springer International Publishing, 2016: 20-36.
Donahue J, Anne Hendricks L, Guadarrama S, et al. Long-term recurrent convolutional networks for visual recognition and description[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015: 2625-2634.
Yue-Hei Ng J, Hausknecht M, Vijayanarasimhan S, et al. Beyond short snippets: Deep networks for video classification[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015: 4694-4702.
Wu Z, Wang X, Jiang Y G, et al. Modeling spatial-temporal clues in a hybrid deep learning framework for video classification[C]//Proceedings of the 23rd ACM International Conference on Multimedia. ACM, 2015: 461-470.
Sharma S, Kiros R, Salakhutdinov R. Action recognition using visual attention[J]. arXiv preprint arXiv:1511.04119, 2015.
Schuldt C, Laptev I, Caputo B. Recognizing human actions: a local SVM approach[C]//Proceedings of the 17th International Conference on Pattern Recognition (ICPR). IEEE, 2004, 3: 32-36.
Reddy K K, Shah M. Recognizing 50 human action categories of web videos[J]. Machine Vision and Applications, 2013, 24(5): 971-981.
Siegelbaum S A, Hudspeth A J. Principles of neural science[M]. New York: McGraw-Hill, 2000.
Wang L M, Qiao Y, Tang X. Motionlets: Mid-level 3D parts for human motion recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2013: 2674-2681.
Wang H, Kläser A, Schmid C, et al. Dense trajectories and motion boundary descriptors for action recognition[J]. International Journal of Computer Vision, 2013, 103(1): 60-79.
Cai Z, Wang L, Peng X, et al. Multi-view super vector for action recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014: 596-603.
Peng X, Wang L, Wang X, et al. Bag of visual words and fusion methods for action recognition: Comprehensive study and good practice[J]. Computer Vision and Image Understanding, 2016.
Liu L, Shao L, Rockett P. Boosted key-frame selection and correlated pyramidal motion-feature representation for human action recognition[J]. Pattern Recognition, 2013, 46(7): 1810-1818.
Wang L, Qiao Y, Tang X. Action recognition with trajectory-pooled deep-convolutional descriptors[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015: 4305-4314.
Wang L, Qiao Y, Tang X. MoFAP: A multi-level representation for action recognition[J]. International Journal of Computer Vision, 2015: 1-18.
Varol G, Laptev I, Schmid C. Long-term temporal convolutions for action recognition[J]. arXiv preprint arXiv:1604.04494, 2016.
Wang L, Xiong Y, Wang Z, et al. Towards good practices for very deep two-stream convnets[J]. arXiv preprint arXiv:1507.02159, 2015.