Action Recognition in Low Quality Videos by Jointly Using Shape, Motion and Texture Features
Saimunur Rahman, John See and Ho Chiung Ching
Center of Visual Computing, Multimedia University, Cyberjaya
IEEE ICSIPA '15 2
− Spatial downsampling
− Temporal downsampling
Shao’09]
$\text{LBP-TOP}_{P_{XY},\,P_{XT},\,P_{YT},\,R_X,\,R_Y,\,R_T}$, feature length $3 \cdot 2^P$
[Figure: LBP histograms from the XY, XT and YT planes, combined (+) into the final histogram]
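The three-plane idea can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes a grayscale volume of shape (T, H, W), a basic 8-neighbour LBP with radius 1 on every plane, and per-plane normalisation. The function names `lbp_codes` and `lbp_top` are hypothetical.

```python
import numpy as np

def lbp_codes(img):
    """Basic 8-neighbour LBP codes (radius 1) for the interior pixels of a 2-D array."""
    h, w = img.shape
    center = img[1:-1, 1:-1]
    # Neighbour offsets (dy, dx), one bit per neighbour.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros(center.shape, dtype=np.int64)
    for bit, (dy, dx) in enumerate(offsets):
        neigh = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        codes |= (neigh >= center).astype(np.int64) << bit
    return codes

def lbp_top(volume, bins=256):
    """Concatenate LBP histograms from the XY, XT and YT planes of a (T, H, W) volume."""
    hists = []
    # axis 0 fixed -> XY planes, axis 1 fixed -> XT planes, axis 2 fixed -> YT planes
    for axis in (0, 1, 2):
        hist = np.zeros(bins)
        for plane in np.moveaxis(volume, axis, 0):
            hist += np.bincount(lbp_codes(plane).ravel(), minlength=bins)
        hists.append(hist / hist.sum())  # normalise each plane's histogram (assumption)
    return np.concatenate(hists)         # length 3 * 2^P, with P = 8 here

# Usage on a random toy volume (assumed data, for illustration only)
volume = np.random.default_rng(0).random((6, 8, 10))
feature = lbp_top(volume)
```

With P = 8 neighbours the concatenated histogram has 3 · 2^8 = 768 bins.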
[Figure: framework pipeline. Input Video → Harris3D → HOG/HOF → Codebook (Bag-of-words), combined (+) with LBP-TOP → SVM. Stages: Feature Detection + Description, Feature Encoding, Classification]
[Figure: Input Video → Feature Detection (Interest Points, Textures) → Spatio-temporal Description (Shape - Motion, Dynamic Textures) → Feature Vector Representation]
Bag of space-time features + SVM with χ2 kernel [Vedaldi’08]
− Training feature vectors are clustered with k-means
− Each feature vector is assigned to its closest cluster center (visual word)
− An entire video sequence is represented as an occurrence histogram of visual words
− Classification with multi-class non-linear SVM and χ2 kernel
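The encoding steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' code: the plain Lloyd-style k-means, the L1 normalisation of the histogram, the toy data sizes, and the names `assign`, `kmeans` and `bow_histogram` are all hypothetical choices.

```python
import numpy as np

def assign(features, centers):
    """Index of the closest cluster center (visual word) for each feature vector."""
    d2 = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def kmeans(features, k, iters=20, seed=0):
    """Plain k-means on the training feature vectors; returns the codebook."""
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = assign(features, centers)
        for j in range(k):
            members = features[labels == j]
            if len(members):               # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return centers

def bow_histogram(video_features, centers):
    """Occurrence histogram of visual words for one video (L1-normalised)."""
    labels = assign(video_features, centers)
    hist = np.bincount(labels, minlength=len(centers)).astype(float)
    return hist / hist.sum()

# Usage with random stand-in descriptors (assumed data, for illustration only)
rng = np.random.default_rng(0)
codebook = kmeans(rng.random((200, 5)), k=8)
video_hist = bow_histogram(rng.random((30, 5)), codebook)
```

The slides report codebook sizes such as k=2000; the tiny k here only keeps the toy example fast.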
SD Factor   Description
SD1         Original resolution
SD2         1/2 resolution of original
SD3         1/3 resolution of original
SD4         1/4 resolution of original

TD Factor   Description
TD1         Original frame rate
TD2         1/2 frame rate of original
TD3         1/3 frame rate of original
TD4         1/4 frame rate of original
Fig: Temporal downsampling. (a) Original video (b) TD2 (c) TD3
Fig: Spatially downsampled videos. (a) SD1 (b) SD2 (c) SD3 (d) SD4
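A rough sketch of the two downsampling modes is given below. The slides do not state how the videos were resampled, so simple decimation (keeping every n-th pixel or frame) is an assumption here, as are the function names and the (T, H, W) array layout.

```python
import numpy as np

def spatial_downsample(video, factor):
    """Keep every `factor`-th pixel along height and width (SD factor)."""
    return video[:, ::factor, ::factor]

def temporal_downsample(video, factor):
    """Keep every `factor`-th frame (TD factor)."""
    return video[::factor]

# Usage on a blank stand-in clip of 12 frames at 120x160 (assumed size)
video = np.zeros((12, 120, 160), dtype=np.uint8)
sd4 = spatial_downsample(video, 4)   # a quarter of the resolution per axis
td3 = temporal_downsample(video, 3)  # a third of the frame rate
```

A factor of 1 returns the clip unchanged, which corresponds to SD1 and TD1 in the tables above.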
[Figure panels: Original Video; SD2, SD3, SD4; TD2, TD3, TD4]
− Combination I : (HOG + HOF) - linear kernel
− Combination II : (HOG + HOF) - χ2 kernel
− Combination III : (HOG + HOF + LBP-TOP) - linear kernel
− Combination IV : (HOG + HOF) + LBP-TOP - χ2 kernel
− Combination V : (HOG + HOF + LBP-TOP) - χ2 kernel
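To make the kernel choice concrete, here is a hedged sketch of the two kernels applied to concatenated feature histograms. The slides do not give the exact χ2 form or say how the grouping in Combination IV differs from Combination V, so the additive χ2 kernel and the plain concatenation below are assumptions, and `linear_kernel`, `chi2_kernel` and `combine` are hypothetical names.

```python
import numpy as np

def linear_kernel(A, B):
    """Gram matrix of inner products between the rows of A and B."""
    return A @ B.T

def chi2_kernel(A, B, eps=1e-12):
    """Additive chi-square kernel: sum_i 2 * a_i * b_i / (a_i + b_i)."""
    a = A[:, None, :]
    b = B[None, :, :]
    return (2.0 * a * b / (a + b + eps)).sum(axis=2)

def combine(*histograms):
    """Concatenate per-video feature histograms column-wise."""
    return np.concatenate(histograms, axis=1)

# Usage with random stand-in histograms for 4 videos (assumed data)
rng = np.random.default_rng(0)
hog_hof = rng.random((4, 10))
lbp_top_hist = rng.random((4, 6))
X = combine(hog_hof, lbp_top_hist)
X /= X.sum(axis=1, keepdims=True)    # L1-normalise each row
K_linear = linear_kernel(X, X)
K_chi2 = chi2_kernel(X, X)
```

Either Gram matrix could then be handed to an SVM that accepts precomputed kernels.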
KTH dataset HOGHOF vs. HOG+HOF
Spatial downsampling (k=2000) Temporal downsampling (k=2000)
result within each mode
are able to strengthen results
HOF features
combinations but performs better after combining with LBP-TOP
Spatial downsampling: SD2, SD3 (k=2000) & SD4 (k=1500)
Temporal downsampling: TD2 (k=2000), TD3 (k=400)
Recognition accuracy with and without χ2-kernel, on the original KTH videos.
recognition in low quality videos
benefitted from textural information with shape and motion.
cases.
10% when the video resolutions and frame rates deteriorate to a fourth of their original values.