GESTURE RECOGNITION WITH 3D CNNS, Pavlo Molchanov, 4/6/2016

SLIDE 1

April 4-7, 2016 | Silicon Valley

Pavlo Molchanov

Xiaodong Yang Shalini Gupta Kihwan Kim Stephen Tyree Jan Kautz

GESTURE RECOGNITION WITH 3D CNNS

4/6/2016

SLIDE 2

AGENDA

  • Motivation
  • Problem statement
  • Selecting the best classifier
  • Online gesture detection and classification
  • Demos

SLIDE 3

MOTIVATION

SLIDE 4

GESTURE IS A NATURAL FORM OF COMMUNICATION @ photo.elsoar.com

SLIDE 5

SAFE INTERFACES @ bmw.com

SLIDE 6

NEEDED FOR VIDEO RELAY SERVICES @ http://relayservice.gov.au/

SLIDE 7

GAMING @ leapmotion

SLIDE 8

PROBLEM STATEMENT

SLIDE 9

PROBLEM STATEMENT

Single commodity sensor:

  • Gesture recognition
  • Skeleton tracking
  • Gaze estimation
  • Head tracking

No special devices

Kinect v1, SoftKinetic

SLIDE 10

PROBLEM STATEMENT

We don’t: hand model fitting and tracking*

We do: understanding gesture concepts with a classifier (thumb up, wave hand)

*http://www.virtualrealityreviewer.com/leap-motion-enters-vr-new-software-product-accessory-preview-what%C2%B9s-next/

SLIDE 11

PROBLEM STATEMENT

We do: understanding gesture concepts (thumb up, wave hand)

Classifier: ?????? (which classifier to use?)

SLIDE 12

SELECTING THE BEST CLASSIFIER

SLIDE 13

SELECTING THE BEST CLASSIFIER

VIVA CHALLENGE 2015 organized by UCSD

  • 19 classes, 8 subjects
  • Driver and passenger
  • RGB + Depth from Microsoft Kinect
  • 885 gestures in total

SLIDE 14

SELECTING THE BEST CLASSIFIER

Gesture example: Slide 2 fingers left

SLIDE 15

SELECTING THE BEST CLASSIFIER

Gesture example: Zoom out

SLIDE 16

SELECTING THE BEST CLASSIFIER

Gesture example: Rotate CCW

SLIDE 17

SELECTING THE BEST CLASSIFIER

3D Convolutional Neural Network

RGB + Depth → 4 × (3D convolution and max-pooling) → ReLU → ReLU → Softmax → Prediction
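The building block named on this slide, 3D convolution followed by max-pooling, can be sketched in plain Python. This is a minimal illustration and not the network from the talk: it handles a single channel rather than RGB + Depth, and the function names (`conv3d_valid`, `relu3d`, `max_pool3d`), the toy clip and the hand-written kernel are all assumptions made for the example.

```python
def conv3d_valid(volume, kernel):
    """Valid (no padding, stride 1) 3D cross-correlation on nested lists.

    volume is indexed [t][y][x]; kernel is a smaller [t][y][x] block.
    """
    T, H, W = len(volume), len(volume[0]), len(volume[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - kt + 1):
        plane = []
        for y in range(H - kh + 1):
            row = []
            for x in range(W - kw + 1):
                acc = 0.0
                for dt in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += volume[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


def relu3d(volume):
    """Element-wise ReLU."""
    return [[[max(0.0, v) for v in row] for row in plane] for plane in volume]


def max_pool3d(volume, size=2):
    """Non-overlapping max pooling with a cubic window; any remainder is dropped."""
    T, H, W = len(volume), len(volume[0]), len(volume[0][0])
    out = []
    for t in range(0, T - size + 1, size):
        plane = []
        for y in range(0, H - size + 1, size):
            row = []
            for x in range(0, W - size + 1, size):
                row.append(max(volume[t + dt][y + dy][x + dx]
                               for dt in range(size)
                               for dy in range(size)
                               for dx in range(size)))
            plane.append(row)
        out.append(plane)
    return out


# Toy clip (hypothetical data): 5 frames of 6x6 pixels, brightness grows with time.
clip = [[[float(t) for _ in range(6)] for _ in range(6)] for t in range(5)]
# Hand-written 2x2x2 kernel that responds to change between consecutive frames.
kernel = [[[-1.0, -1.0], [-1.0, -1.0]], [[1.0, 1.0], [1.0, 1.0]]]

features = max_pool3d(relu3d(conv3d_valid(clip, kernel)))
shape = (len(features), len(features[0]), len(features[0][0]))
```

Because the kernel spans time as well as space, it reacts to motion between frames, which a per-frame 2D convolution cannot see.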

SLIDE 18

SEGMENTED GESTURE CLASSIFICATION

Training

RGB + Depth → 3D CNN → error → back-propagation update

SLIDE 19

SELECTING THE BEST CLASSIFIER

First result

Classification accuracy, higher is better

Method       Testing set   Training set
HON4D [1]    58.7%         -
HOG [2]      64.5%         -
3D-CNN       48.3%         99.9%

[1] Oreifej and Liu. HON4D: Histogram of Oriented 4D Normals for Activity Recognition from Depth Sequences, CVPR, 2013.
[2] Ohn-Bar and Trivedi, IEEE Trans. on Intelligent Transportation Systems, 2014.

SLIDE 20

SELECTING THE BEST CLASSIFIER

ImageNet: 1.5 M examples
VIVA: 885 examples

Recent success in deep learning benefited from large data

SLIDE 21

SELECTING THE BEST CLASSIFIER

Training

RGB + Depth → 3D CNN → error → back-propagation update

SLIDE 22

SELECTING THE BEST CLASSIFIER

Training

RGB + Depth → data augmentation → 3D CNN → error → back-propagation update

SLIDE 23

SELECTING THE BEST CLASSIFIER

Data augmentation

  • Spatial geometric transformations
  • Temporal augmentation
  • Generating new training data

[Figure: original vs. augmented]
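The two augmentation families listed above can be illustrated with a few list operations on a clip stored as frames of pixel rows. This is only a sketch under assumptions: the specific transforms (horizontal shift, duration scaling, mirror plus time reversal), their parameter ranges, and the names `shift_frames`, `scale_duration`, `mirror_and_reverse` and `augment` are hypothetical choices for the example, not the ones used in the talk.

```python
import random


def shift_frames(clip, dx, fill=0):
    """Spatial: translate every frame horizontally by dx pixels, padding with fill."""
    def shift_row(row):
        w = len(row)
        if dx >= 0:
            return [fill] * min(dx, w) + row[:max(0, w - dx)]
        return row[-dx:] + [fill] * min(-dx, w)
    return [[shift_row(row) for row in frame] for frame in clip]


def scale_duration(clip, factor):
    """Temporal: resample the clip by nearest-neighbour frame selection."""
    n_out = max(1, round(len(clip) * factor))
    last = len(clip) - 1
    return [clip[min(last, int(i / factor))] for i in range(n_out)]


def mirror_and_reverse(clip):
    """Mirror each frame left-right and play the clip backwards.

    For a left/right swipe the two operations cancel, so the label is unchanged.
    """
    return [[row[::-1] for row in frame] for frame in reversed(clip)]


def augment(clip, rng):
    """Draw one random augmented copy of a clip (ranges are illustrative)."""
    out = scale_duration(clip, rng.uniform(0.8, 1.2))
    out = shift_frames(out, rng.randint(-1, 1))
    if rng.random() < 0.5:
        out = mirror_and_reverse(out)
    return out


# Toy clip (hypothetical data): 3 frames, each 2 rows of 3 pixels.
clip = [[[t * 10 + x for x in range(3)] for _ in range(2)] for t in range(3)]
augmented = augment(clip, random.Random(0))
```

Calling `augment` repeatedly with different random draws yields many distinct training clips from one recorded gesture.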

SLIDE 30

SELECTING THE BEST CLASSIFIER

Data augmentation

  • Spatial geometric transformations
  • Temporal augmentation
  • Generating new training data

flip

SLIDE 31

SELECTING THE BEST CLASSIFIER

VIVA: 885 examples
Augmented: 0.3 M examples

SLIDE 32

SELECTING THE BEST CLASSIFIER

Official challenge results

Classification accuracy (%), higher is better

Method                                    Accuracy
Harris-3.5D                               36.4
HOG3D                                     44.6
Dense Trajectories                        54
HON4D                                     58.7
HOG+HOG2                                  64.5
NVIDIA (3D-CNN), no data augmentation     48.3

SLIDE 33

SELECTING THE BEST CLASSIFIER

Official challenge results

Classification accuracy (%), higher is better

Method                                    Accuracy
Harris-3.5D                               36.4
HOG3D                                     44.6
Dense Trajectories                        54
HON4D                                     58.7
HOG+HOG2                                  64.5
NVIDIA (3D-CNN)                           48.3
NVIDIA (3D-CNN) with data augmentation    77.5

SLIDE 34

SELECTING THE BEST CLASSIFIER

Speed

FPS, higher is better

Method                FPS
Harris-3.5D           0.2
HOG3D                 3
Dense Trajectories    18
HON4D                 25
HOG+HOG2              50
NVIDIA (3D-CNN)       110 on CPU, +250 with GPU, +400 with cuDNNv4

SLIDE 35

SEGMENTED GESTURE CLASSIFICATION

Timeline (gesture over time): start of the gesture → end of the gesture → classification → decision

A decision made after the gesture ends introduces latency.

SLIDE 36

ONLINE GESTURE DETECTION AND CLASSIFICATION

SLIDE 37

ONLINE GESTURE CLASSIFICATION

Timeline (gesture over time): start of the gesture → classification → decision → end of the gesture

A decision made before the gesture ends improves feedback and user experience.

SLIDE 38

ONLINE GESTURE CLASSIFICATION

R3DCNN

Video server → 3D CNN (local motion descriptor, 8 frames per clip) → RNN (global motion descriptor) → softmax

  • Forward recurrence only
  • Detection and classification
  • 109M parameters
  • Connectionist Temporal Classification (CTC) for training only
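The data flow on this slide (8-frame clips, a per-clip descriptor, a forward-only recurrence, and a softmax after every clip) can be sketched as follows. Everything concrete here is an assumption for illustration: `clip_features` is a trivial stand-in for the 3D CNN, the RNN is a plain tanh cell, and the weights, sizes and toy frames are made up and bear no relation to the 109M-parameter model.

```python
import math

CLIP_LEN = 8  # frames per clip, as on the slide


def split_into_clips(frames, clip_len=CLIP_LEN):
    """Group frames into consecutive non-overlapping clips; a short tail is dropped."""
    return [frames[i:i + clip_len]
            for i in range(0, len(frames) - clip_len + 1, clip_len)]


def clip_features(clip):
    """Stand-in for the 3D CNN: [mean intensity, mean absolute frame difference]."""
    n_pixels = len(clip[0])
    mean_intensity = sum(sum(frame) for frame in clip) / (len(clip) * n_pixels)
    motion = sum(abs(a - b)
                 for prev, cur in zip(clip, clip[1:])
                 for a, b in zip(cur, prev)) / ((len(clip) - 1) * n_pixels)
    return [mean_intensity, motion]


def rnn_step(hidden, features, w_in, w_rec):
    """One forward step of a tanh RNN: h' = tanh(W_in f + W_rec h)."""
    return [math.tanh(sum(w * f for w, f in zip(w_in[j], features)) +
                      sum(w * h for w, h in zip(w_rec[j], hidden)))
            for j in range(len(hidden))]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def run_online(frames, w_in, w_rec, w_out):
    """Emit one class distribution per clip, using only past and current clips."""
    hidden = [0.0] * len(w_rec)
    predictions = []
    for clip in split_into_clips(frames):
        hidden = rnn_step(hidden, clip_features(clip), w_in, w_rec)
        scores = [sum(w * h for w, h in zip(row, hidden)) for row in w_out]
        predictions.append(softmax(scores))
    return predictions


# Hypothetical toy input: 26 frames of 4 pixels each, and hand-picked weights.
frames = [[0.1 * (t % 5)] * 4 for t in range(26)]
w_in = [[0.5, 1.0], [-0.5, 0.3]]                 # 2 hidden units x 2 features
w_rec = [[0.1, 0.0], [0.0, 0.1]]                 # 2 hidden units x 2 hidden units
w_out = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]   # 3 classes x 2 hidden units
predictions = run_online(frames, w_in, w_rec, w_out)
```

Because the recurrence only runs forward, a prediction is available after each clip, which is what allows a decision before the gesture has ended.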

SLIDE 39

ONLINE GESTURE CLASSIFICATION

Training loss function

  • Labeling dynamic gestures is difficult
  • Labeling per frame is ambiguous

Input: [figure]
Labels: [figure]
Loss function: per-frame negative log likelihood
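A small sketch of the per-frame negative log likelihood shows why ambiguous frame labels matter: moving a gesture boundary by one frame changes the loss even though the video is the same. The class indices, probabilities and both labelings below are invented for illustration.

```python
import math


def per_frame_nll(frame_probs, frame_labels):
    """Mean negative log likelihood of the labelled class over all frames."""
    losses = [-math.log(probs[label])
              for probs, label in zip(frame_probs, frame_labels)]
    return sum(losses) / len(losses)


# Hypothetical softmax outputs for 4 frames; class 0 = nothing, 1 = slide right.
frame_probs = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8], [0.9, 0.1]]

# Two annotators disagree on where the gesture starts (frame 1 vs. frame 2).
loss_early_start = per_frame_nll(frame_probs, [0, 1, 1, 0])
loss_late_start = per_frame_nll(frame_probs, [0, 0, 1, 0])
```

The same network output is penalised differently depending on an arbitrary annotation choice at the boundary frame.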

SLIDE 40

ONLINE GESTURE CLASSIFICATION

Training loss function

Sequence based training is the solution

Input: [figure]
Sequence: nothing – slide right – nothing – slide left – nothing
Loss function: Connectionist Temporal Classification (CTC) by A. Graves et al.
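The CTC loss sums the probability of every frame-level alignment that collapses to the target sequence, so no per-frame labels are needed. Below is a minimal sketch of the standard forward recursion in plain probabilities (fine for short toy inputs; practical implementations work in log space). Treating "nothing" as the CTC blank, along with the class indices and probabilities, is an assumption made for this example.

```python
import math


def ctc_loss(probs, target, blank=0):
    """Negative log probability of `target`, summed over all CTC alignments.

    probs[t][k] is the softmax output for class k at time step t.
    """
    # Interleave blanks: target [a, b] becomes [blank, a, blank, b, blank].
    ext = [blank]
    for label in target:
        ext += [label, blank]
    S, T = len(ext), len(probs)

    # alpha[s]: probability of all partial alignments ending in ext[s] at time t.
    alpha = [0.0] * S
    alpha[0] = probs[0][ext[0]]
    if S > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, T):
        new = [0.0] * S
        for s in range(S):
            total = alpha[s]
            if s >= 1:
                total += alpha[s - 1]
            # A blank may be skipped only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * probs[t][ext[s]]
        alpha = new

    # Valid alignments end on the last label or on the trailing blank.
    likelihood = alpha[-1] + (alpha[-2] if S > 1 else 0.0)
    return -math.log(likelihood) if likelihood > 0.0 else math.inf


# Hypothetical outputs for 5 time steps; 0 = nothing (blank), 1 = slide right, 2 = slide left.
probs = [[0.8, 0.1, 0.1],
         [0.2, 0.7, 0.1],
         [0.7, 0.2, 0.1],
         [0.1, 0.1, 0.8],
         [0.8, 0.1, 0.1]]
loss = ctc_loss(probs, [1, 2])  # target: slide right, then slide left
```

Only the order of gestures in the sequence is supervised; where each gesture starts and ends is left for the network to work out.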

SLIDE 41

ONLINE GESTURE CLASSIFICATION

Italian sign language recognition

  • ChaLearn 2014 challenge
  • RGBD videos of 20 Italian sign language gestures
  • 13K gestures
  • 20 subjects

SLIDE 42

ONLINE GESTURE CLASSIFICATION

Italian sign language recognition

Classification accuracy (%)

Method           Accuracy
Pigou et al.*    97.2
3D-CNN           97.4
3D-CNN + CTC     98.2

Improvement in accuracy: 35%
By seeing only 41% of the gesture

*L. Pigou et al. Beyond temporal pooling: Recurrence and temporal convolutions for gesture recognition in video
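The slide does not say how the 35% figure is computed. One reading that is consistent with the accuracies above is a relative reduction in error rate from 97.2% to 98.2%. This is an assumption, and the helper name below is hypothetical.

```python
def relative_error_reduction(baseline_acc, new_acc):
    """Fraction of the baseline's errors that are removed (accuracies in percent)."""
    baseline_err = 100.0 - baseline_acc
    new_err = 100.0 - new_acc
    return (baseline_err - new_err) / baseline_err


# Error drops from 2.8% to 1.8%, a relative reduction of roughly 0.357.
reduction = relative_error_reduction(97.2, 98.2)
```

Under this reading the slide's 35% would be the error rate falling from 2.8% to 1.8%, rather than a 35-point gain in accuracy.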

SLIDE 43

ONLINE GESTURE CLASSIFICATION

Italian sign language recognition

Improvement in accuracy: 35%
By seeing only 41% of the gesture

No pre- or post-processing

SLIDE 44

ONLINE GESTURE CLASSIFICATION

Car interfaces

  • In-house database
  • Media player, navigation, phone
  • 20 subjects, 25 gestures
  • More information at CVPR2016

SLIDE 45

ONLINE GESTURE CLASSIFICATION

Car interfaces

Method            Chart value
HOG+HOG2          37
Two stream CNN    66
SNV               71
iDT               73
C3D               79
Ours              84
Human             88

SLIDE 46

ONLINE GESTURE CLASSIFICATION

Suitability of hardware for inference: latency is critical

[Charts: image classification, GPU vs. CPU; video classification, GPU vs. CPU]

SLIDE 47

ONLINE GESTURE CLASSIFICATION

Scalability

  • NVIDIA TX1 for embedded solutions
  • Credit-card-sized GPU in your pocket
  • Our R3DCNN takes only 30% of the GPU

SLIDE 48

CONTRIBUTIONS

  • Data augmentation helps deep learning a lot
  • R3DCNN gave the best results for sign language and gesture recognition
  • CTC helps a lot with video sequence learning
  • Scalable enough to run on NVIDIA TX1

SLIDE 49

CTC, Deep Learning, Data Augmentation

SLIDE 50

THANK YOU

JOIN THE NVIDIA DEVELOPER PROGRAM AT developer.nvidia.com/join