DYNAMIC FACIAL ANALYSIS: FROM BAYESIAN FILTERING TO RNN
Jinwei Gu, with Xiaodong Yang, Shalini De Mello, and Jan Kautz
2017/4/18
FACIAL ANALYSIS IN VIDEOS
Exploit temporal coherence to track facial features in videos
[Figure: head/face tracking, performance capture, and 3D capture; prior work: HeadPoseFromDepth (2015), DeepHeadPose (2015), HyperFace (2016)]
CLASSICAL APPROACH: BAYESIAN FILTERING
It is challenging to design a Bayesian filter specific to each task!
Examples: particle filters for head pose tracking [2010]; tree-based DPM for face landmark tracking [ICCV 2015]; spatial-temporal RNN for face landmarks [ECCV 2016]
FROM BAYESIAN FILTERING TO RNN
Use an RNN to avoid tracker engineering
[Diagram: Bayesian filter (left) vs. unfolded RNN (right) — output/target x_{t-1}, x_t; hidden state h_{t-1}, h_t; input/measurement z_{t-1}, z_t]
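To make the unfolded RNN in the diagram concrete, here is a minimal sketch of a vanilla RNN forward pass over a measurement sequence. All sizes and weights below are hypothetical, chosen only for illustration; they do not come from the slides.

```python
import numpy as np

def rnn_forward(z_seq, U, W, V, b1, b2, h0):
    """Unfold a vanilla RNN over measurements z_1..z_T:
       h_t = tanh(U h_{t-1} + W z_t + b1)   (hidden state update)
       o_t = V h_t + b2                     (output / target estimate)
    """
    h = h0
    outputs = []
    for z in z_seq:
        h = np.tanh(U @ h + W @ z + b1)   # recurrent state update
        outputs.append(V @ h + b2)        # per-step output
    return np.stack(outputs), h

# Hypothetical toy sizes: 4-D hidden state, 2-D measurement, 2-D output.
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(4, 4))
W = rng.normal(scale=0.1, size=(4, 2))
V = rng.normal(scale=0.1, size=(2, 4))
b1, b2 = np.zeros(4), np.zeros(2)
z_seq = rng.normal(size=(6, 2))           # 6 noisy measurements
outs, h_T = rnn_forward(z_seq, U, W, V, b1, b2, np.zeros(4))
```

In practice the weights U, W, V are learned from data, which is exactly what replaces the hand-engineering of a task-specific tracker.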
AN EXAMPLE: KALMAN FILTERS VS. RNN

Linear Kalman Filter:
    x_t = A x_{t-1} + w_1    (process model; A: state transition, w_1: process noise)
    z_t = H x_t + w_2        (measurement model; z_t: noisy observation, w_2: measurement noise)

Simple RNN (i.e., vanilla RNN):
    h_t = f_1(U h_{t-1} + W z_t + b_1)   (hidden state; z_t: noisy input)
    o_t = f_2(V h_t + b_2)               (target output)
AN EXAMPLE: KALMAN FILTERS VS. RNN

Linear Kalman Filter (with Kalman gain K_t):
    x_t = A x_{t-1} + K_t (z_t - H A x_{t-1})
        = (A - K_t H A) x_{t-1} + K_t z_t    (noisy input z_t)
    o_t = H x_t                              (target output)

Simple RNN (i.e., vanilla RNN):
    h_t = f_1(U h_{t-1} + W z_t + b_1)       (noisy input z_t)
    o_t = f_2(V h_t + b_2)                   (target output)
AN EXAMPLE: KALMAN FILTERS VS. RNN

Linear Kalman Filter (with Kalman gain K_t):
    x_t = (A - K_t H A) x_{t-1} + K_t z_t    (noisy input z_t)
    o_t = H x_t                              (target output)

Simple RNN, assuming linear activation & no bias:
    h_t = U h_{t-1} + W z_t                  (noisy input z_t)
    o_t = V h_t                              (target output)
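This correspondence can be checked numerically: with linear activation and no bias, setting U = A − K H A, W = K, and V = H makes the RNN recurrence identical to the Kalman update. The sketch below uses a hypothetical 1-D constant state with a fixed (steady-state) gain for simplicity; the slide's K_t is in general time-varying.

```python
import numpy as np

# Hypothetical 1-D example (A, H, K chosen only for illustration).
A = np.array([[1.0]])   # state transition
H = np.array([[1.0]])   # measurement matrix
K = np.array([[0.5]])   # fixed steady-state Kalman gain (simplification)

# Linear RNN weights that reproduce the Kalman update exactly:
U = A - K @ H @ A       # recurrent weight
W = K                   # input weight
V = H                   # output weight

rng = np.random.default_rng(1)
z_seq = rng.normal(loc=2.0, size=(20, 1))  # noisy measurements of a constant

x_kf = np.zeros(1)      # Kalman state
h_rnn = np.zeros(1)     # RNN hidden state
for z in z_seq:
    # Kalman form: predict with A, correct with gain K
    x_kf = A @ x_kf + K @ (z - H @ A @ x_kf)
    # Equivalent linear RNN form (no nonlinearity, no bias)
    h_rnn = U @ h_rnn + W @ z

o_kf, o_rnn = H @ x_kf, V @ h_rnn          # target outputs agree
```

The difference in practice: the Kalman gain is derived from hand-specified noise covariances, whereas the RNN weights are learned end-to-end from data.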
A TOY EXAMPLE: TRACKING A MOVING CURSOR
Input: a noisy curve z(t); state: [x, x', x'']

Kalman Filter:
    x_t = (A - K_t H A) x_{t-1} + K_t z_t
    o_t = H x_t

LSTM:
    h_t = LSTM(h_{t-1}, z_t)
    o_t = V h_t
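The Kalman side of this toy can be sketched as a standard constant-acceleration filter over the state [x, x', x'']. The ground-truth curve, noise covariances Q and R, and time step below are assumptions for illustration, not the slide's actual setup.

```python
import numpy as np

dt = 1.0
# Constant-acceleration state [x, x', x''] from the slide.
A = np.array([[1.0, dt, 0.5 * dt**2],
              [0.0, 1.0, dt],
              [0.0, 0.0, 1.0]])
H = np.array([[1.0, 0.0, 0.0]])   # only the noisy position is observed
Q = 1e-4 * np.eye(3)              # process noise covariance (assumed)
R = np.array([[0.25]])            # measurement noise covariance (assumed)

x = np.zeros(3)                   # initial state estimate
P = np.eye(3)                     # initial state covariance
t = np.arange(100)
truth = 0.01 * t**2               # cursor moving with constant acceleration
rng = np.random.default_rng(2)
z = truth + rng.normal(scale=0.5, size=t.size)   # noisy input curve z(t)

est = []
for zt in z:
    # Predict step: propagate state and covariance through the model
    x = A @ x
    P = A @ P @ A.T + Q
    # Update step: compute the Kalman gain K_t and correct with z_t
    K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)
    x = x + K @ (np.atleast_1d(zt) - H @ x)
    P = (np.eye(3) - K @ H) @ P
    est.append((H @ x)[0])        # output o_t = H x_t
est = np.array(est)
```

Because the process model matches the true motion here, the filtered curve is much smoother than the raw measurements; an LSTM trained on such sequences can learn a comparable smoothing behavior without the model being specified by hand.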
FACIAL ANALYSIS IN VIDEOS WITH RNN
Variants of RNN: FC-RNN*, LSTM, GRU
HEAD POSE FROM VIDEOS
Results on the BIWI dataset
HEAD POSE FROM VIDEOS
[Video: Input | Per-Frame + KF | RNN (Ours)]
A LARGE SYNTHETIC DATASET MATTERS!
The SynHead Dataset:
- 10 high-quality 3D scans of head models
- 51,096 head poses from 70 motion tracks
- 510,960 RGB images in total
- Accurate head pose and landmark annotations (2D/3D)
Available at: https://research.nvidia.com
(For comparison, the BIWI dataset has 24 videos and 15,678 frames in total.)
A LARGE SYNTHETIC DATASET MATTERS!
[Figure: sample images from the SynHead dataset]
FACIAL LANDMARKS FROM VIDEO
[Video: HyperFace | Per-Frame | RNN (Ours); ground-truth vs. estimated landmarks]
MORE EXAMPLES
VARIANTS OF RNN FOR LANDMARK ESTIMATION (latest results)

Features  | FC-RNN       | FC-LSTM      | FC-GRU
--------- | ------------ | ------------ | ------------
fc6       | 0.7567, 0.10 | 0.7690, 0.13 | 0.7715, 0.15
fc7       | 0.7424, 0.06 | 0.7539, 0.06 | 0.7554, 0.36
fc6+fc7   | 0.7630, 0.28 | 0.7456, 0.27 | 0.7605, 0.19
CO-PILOT DEMO IN THE CES KEYNOTE
(together with GazeNet by Shalini De Mello et al.)
DYNAMIC FACIAL ANALYSIS: FROM BAYESIAN FILTERING TO RNN
- RNNs can be viewed as a variant of Bayesian filters
- A general framework to leverage temporal coherence in videos
- Large synthetic datasets improve performance
- The SynHead Dataset is available at: https://research.nvidia.com