April 4-7, 2016 | Silicon Valley
Pavlo Molchanov
Xiaodong Yang Shalini Gupta Kihwan Kim Stephen Tyree Jan Kautz
GESTURE RECOGNITION WITH 3D CNNS
4/6/2016
AGENDA
Motivation
Problem statement
Selecting the best classifier
Online gesture detection and classification
Demos
GESTURE IS A NATURAL FORM OF COMMUNICATION photo.elsoar.com
SAFE INTERFACES @ bmw.com
NEED FOR VIDEO RELAY SERVICES @ http://relayservice.gov.au/
GAMING @ leapmotion
Single commodity sensor:
Kinect v1, SoftKinetic
Hand model fitting and tracking
*http://www.virtualrealityreviewer.com/leap-motion-enters-vr-new-software-product-accessory-preview-what%C2%B9s-next/
Thumb up Wave hand
Classifier Classifier ??????
19 classes, 8 subjects
Driver and passenger
RGB + Depth from Microsoft Kinect
885 gestures in total
Gesture examples: Slide 2 fingers left, Zoom out, Rotate CCW
Input: RGB, Depth
3D convolution and max-pooling (4 layers)
ReLU, ReLU, Softmax
Prediction
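A minimal sketch of one "3D convolution and max-pooling" stage, in plain Python on nested lists indexed [time][height][width]. The toy clip, the kernel values and the function names are assumptions for illustration only; the slides do not give layer sizes or weights.

```python
def conv3d_valid(volume, kernel):
    """Valid (no padding, stride 1) 3D cross-correlation on nested lists [T][H][W]."""
    T, H, W = len(volume), len(volume[0]), len(volume[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - kt + 1):
        plane = []
        for y in range(H - kh + 1):
            row = []
            for x in range(W - kw + 1):
                acc = 0.0
                for dt in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += volume[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


def max_pool3d(volume, size=2):
    """Non-overlapping max-pooling with a cubic window; remainders are dropped."""
    T, H, W = len(volume), len(volume[0]), len(volume[0][0])
    return [
        [
            [
                max(
                    volume[t + dt][y + dy][x + dx]
                    for dt in range(size)
                    for dy in range(size)
                    for dx in range(size)
                )
                for x in range(0, W - size + 1, size)
            ]
            for y in range(0, H - size + 1, size)
        ]
        for t in range(0, T - size + 1, size)
    ]


# Hypothetical toy clip: 5 frames of 5x5 pixels, pixel value = frame index.
clip = [[[float(t) for _ in range(5)] for _ in range(5)] for t in range(5)]
# Hypothetical 2x2x2 kernel that responds to change between consecutive frames.
kernel = [[[-1.0, -1.0], [-1.0, -1.0]], [[1.0, 1.0], [1.0, 1.0]]]

features = conv3d_valid(clip, kernel)  # 4 x 4 x 4 feature volume
pooled = max_pool3d(features, size=2)  # 2 x 2 x 2 after pooling
```

Because the kernel spans time as well as space, each output value mixes information from neighbouring frames, which is what lets the network respond to motion rather than to single images.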
Training: RGB + Depth input, 3D CNN, back propagation error update
Classification accuracy (higher is better)

Method       Testing set
HON4D [1]    58.7%
HOG [2]      64.5%
3D-CNN       48.3%

3D-CNN on the training set: 99.9%

[1] Oreifej and Liu. HON4D: Histogram of Oriented 4D Normals for Activity Recognition from Depth Sequences, CVPR, 2013.
[2] Ohn-Bar and Trivedi, IEEE Trans. on Intelligent Transportation Systems, 2014.
Recent successes in deep learning have benefited from large datasets
Training with data augmentation: RGB + Depth input, data augmentation, 3D CNN, back propagation error update
Generating new training data
Spatial geometric transformations (e.g. flip)
Temporal augmentation
Figure: original vs. augmented examples
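A minimal sketch of the two augmentation families named above, on a clip stored as nested lists [time][height][width]. The specific temporal scheme (keep a random contiguous span, then stretch it back to the original length) and all function names are assumptions; the slides only name the categories.

```python
import random


def flip_horizontal(clip):
    """Spatial transformation: mirror every frame left-to-right."""
    return [[row[::-1] for row in frame] for frame in clip]


def resample_temporal(clip, num_frames):
    """Nearest-neighbour resampling of a clip to num_frames frames."""
    T = len(clip)
    if num_frames == 1:
        return [clip[0]]
    return [clip[round(i * (T - 1) / (num_frames - 1))] for i in range(num_frames)]


def augment(clip, rng):
    """Return a randomly flipped and temporally re-timed copy of the clip."""
    out = flip_horizontal(clip) if rng.random() < 0.5 else clip
    # Assumed temporal augmentation: keep a random span of at least half
    # the clip, then stretch it back to the original number of frames.
    T = len(out)
    span = rng.randint(max(1, T // 2), T)
    start = rng.randint(0, T - span)
    return resample_temporal(out[start:start + span], T)


# Hypothetical 8-frame clip of 2x3 frames.
original = [[[t, t + 1, t + 2], [t, t + 1, t + 2]] for t in range(8)]
augmented = augment(original, random.Random(0))
```

Note that flipping changes the meaning of direction-dependent gestures (a mirrored "slide left" looks like "slide right"), so labels for such classes would need to be swapped accordingly.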
Classification accuracy (%), higher is better
Harris-3.5D: 36.4
HOG3D: 44.6
Dense Trajectories: 54
HON4D: 58.7
HOG+HOG2: 64.5
NVIDIA (3D-CNN), no data augmentation: 48.3
NVIDIA (3D-CNN), with data augmentation: 77.5
FPS (higher is better)
Harris-3.5D: 0.2
HOG3D: 3
Dense Trajectories: 18
HON4D: 25
HOG+HOG2: 50
NVIDIA (3D-CNN): 110 on CPU, +250 with GPU, +400 with cuDNNv4
Gesture timeline: start of the gesture, end of the gesture, classification, decision
A decision made after the gesture ends introduces latency
A decision made before the gesture ends improves feedback and user experience
Diagram: video, 8-frame clips, 3D CNN (local motion descriptor), RNN (global motion descriptor), softmax per clip
Forward recurrence only
Detection and classification
109M parameters
Connectionist Temporal Classification (CTC) for training only
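A toy sketch of why forward-only recurrence allows online output: the stream is cut into 8-frame clips, and after each clip a class distribution is produced from past clips only. Everything except the clip length is an assumption: `clip_feature` is a hypothetical stand-in for the 3D CNN, the hidden state is a single number, the clips are taken as non-overlapping, and the weights are made up.

```python
import math

CLIP_LEN = 8  # clip length taken from the slide


def split_into_clips(frames, clip_len=CLIP_LEN):
    """Non-overlapping clips; an incomplete trailing clip is dropped."""
    return [frames[i:i + clip_len]
            for i in range(0, len(frames) - clip_len + 1, clip_len)]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def run_online(frames, clip_feature, w_in, w_rec, w_out):
    """Emit one class distribution per clip using only current and past clips."""
    h = 0.0  # scalar hidden state (toy simplification)
    outputs = []
    for clip in split_into_clips(frames):
        f = clip_feature(clip)               # stand-in for the 3D CNN descriptor
        h = math.tanh(w_in * f + w_rec * h)  # forward recurrence only
        outputs.append(softmax([w * h for w in w_out]))
    return outputs


# Hypothetical stream of 20 scalar "frames" and made-up weights for 3 classes.
stream = list(range(20))
predictions = run_online(stream, lambda c: sum(c) / len(c) / 10.0,
                         w_in=0.5, w_rec=0.5, w_out=[1.0, -1.0, 0.0])
```

Since no future clip is needed to compute an output, a decision can be taken as soon as one class probability is high enough, before the gesture has ended.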
Labeling dynamic gestures is difficult
Labeling per frame is ambiguous
Input and per-frame labels (figure)
Loss function: per-frame negative log likelihood
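For reference, a minimal sketch of a per-frame negative log likelihood. Averaging over frames (rather than summing) and the function name are assumptions; the slide only names the loss.

```python
import math


def per_frame_nll(frame_probs, frame_labels):
    """Mean negative log likelihood of the labelled class over all frames.

    frame_probs[t][k] is the predicted probability of class k at frame t;
    frame_labels[t] is the class index assigned to frame t.
    """
    if len(frame_probs) != len(frame_labels):
        raise ValueError("one label is required per frame")
    total = 0.0
    for probs, label in zip(frame_probs, frame_labels):
        total += -math.log(probs[label])
    return total / len(frame_labels)


# Hypothetical two-frame example with two classes.
loss = per_frame_nll([[0.5, 0.5], [0.25, 0.75]], [0, 1])
```

This loss needs a label for every frame, which is exactly where the ambiguity arises: the frames around the start and end of a gesture have no clear correct label.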
Sequence-based training is the solution
Input: (video)
Sequence: nothing – slide right – nothing – slide left – nothing
Loss function: Connectionist Temporal Classification (CTC) by A. Graves et al.
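A minimal sketch of CTC in plain Python: the collapse rule that maps a per-step path to a label sequence, and the forward algorithm that sums the probability of all paths collapsing to a target. Treating "nothing" as the CTC blank (index 0) is an assumption, and this version works in probability space, so it is only suitable for short sequences.

```python
import math


def ctc_collapse(path, blank=0):
    """Merge repeated labels, then drop blanks: [0,1,1,0,2] -> [1,2]."""
    out, prev = [], None
    for label in path:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out


def ctc_neg_log_likelihood(probs, target, blank=0):
    """CTC loss for one sequence.

    probs[t][k]: probability of class k at step t.
    target: label sequence without blanks.
    """
    # Extended target: blank, l1, blank, l2, ..., blank
    ext = [blank]
    for label in target:
        ext += [label, blank]
    S, T = len(ext), len(probs)

    alpha = [0.0] * S
    alpha[0] = probs[0][ext[0]]
    if S > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, T):
        new = [0.0] * S
        for s in range(S):
            a = alpha[s]
            if s >= 1:
                a += alpha[s - 1]
            # The blank between two different labels may be skipped.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[s - 2]
            new[s] = a * probs[t][ext[s]]
        alpha = new

    total = alpha[S - 1] + (alpha[S - 2] if S > 1 else 0.0)
    return -math.log(total) if total > 0 else math.inf


# Hypothetical example: 2 steps, classes {0: nothing, 1: gesture}, uniform outputs.
uniform = [[0.5, 0.5], [0.5, 0.5]]
loss = ctc_neg_log_likelihood(uniform, [1])
```

In the example, three of the four possible paths ((1,1), (1,0) and (0,1)) collapse to the target, so the summed probability is 0.75. Only the order of gestures is supervised, not their frame boundaries.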
ChaLearn 2014 challenge
RGBD videos of 20 Italian sign language gestures
13K gestures
20 subjects
Classification accuracy (%)
Pigou et al.*: 97.2
3D-CNN: 97.4
3D-CNN CTC: 98.2
Improvement in accuracy
By seeing only
*L. Pigou et al. Beyond temporal pooling: Recurrence and temporal convolutions for gesture recognition in video
No pre- or post-processing
In-house database
Media player, navigation, phone
20 subjects, 25 gestures
More information at CVPR2016
Results on the in-house database
HOG+HOG2: 37
Two stream CNN: 66
SNV: 71
iDT: 73
C3D: 79
Ours: 84
Human: 88
Suitability of hardware for inference:
Image classification: GPU vs. CPU
Video classification: GPU vs. CPU
NVIDIA TX1 for embedded solutions
A credit-card-sized GPU in your pocket
Our R3DCNN takes only 30% of the GPU
Data augmentation helps deep learning a lot
R3DCNNs are the best for sign language and gesture recognition
CTC helps a lot with video sequence learning
Scalable enough to run on NVIDIA TX1
CTC Deep Learning Data Augmentation