Structured Deep Learning of Human Motion (PowerPoint PPT Presentation)


SLIDE 1

Structured Deep Learning of Human Motion

Christian Wolf Fabien Baradel Natalia Neverova Julien Mille Graham W. Taylor Greg Mori

SLIDE 2

Gesture recognition

Deep Learning of Human Motion

Pose estimation. Recognition of individual activities & interactions. Recognition of group activities.

SLIDE 3

[Neverova, Wolf, Taylor, Nebout, CVIU 2017]

SLIDE 4

Combining real and simulated data

Joint positions (NYU Dataset) Synthetic data (part segmentation)

Graham W. Taylor, University of Guelph, Canada · Natalia Neverova, PhD @ LIRIS, now at Facebook · Florian Nebout, Awabot · Christian Wolf, LIRIS, INSA-Lyon

SLIDE 5

Semantic Segmentation with Grid Networks

[Fourure, Emonet, Fromont, Muselet, Trémeau, Wolf, BMVC 2017]

Damien Fourure, E. Fromont, R. Emonet, A. Trémeau, D. Muselet, C. Wolf
SLIDE 6

Activity recognition

Unconstrained internet/YouTube videos: no acquisition. E.g. YouTube-8M dataset: 7M videos, 4716 classes, ~3.4 labels per video, >1 PB of data.

Videos with human activities, from YouTube: no acquisition. E.g. ActivityNet/Kinetics datasets: ~300k videos, 400 classes.

Human activities shot with depth sensors: acquisition is time consuming! E.g. NTU RGB-D dataset, MSR dataset, ChaLearn/Montalbano dataset, etc.

SLIDE 7

Deep Learning is mostly based on global models.

Deep Learning (Global)

(Mostly after 2012)

[Baccouche, Mamalet, Wolf, Garcia, Baskurt, HBU 2011] [Baccouche, Mamalet, Wolf, Garcia, Baskurt, BMVC 2012]

[Carreira and Zisserman, CVPR 2017] [Ji et al., ICML 2010]

SLIDE 8

The role of articulated pose

Reading Writing

SLIDE 9

The role of articulated pose

Appearance is helpful Reading Writing

[Neverova, Wolf, Taylor, Nebout, PAMI 2016] [Baradel, Wolf, Mille, Taylor, BMVC 2018]

SLIDE 10

Context

We need to attend to places which are not always determined by pose

SLIDE 11

Context

We need to attend to places which are not always determined by pose

SLIDE 12

Context

Frame from the NTU RGB-D Dataset

SLIDE 13

Local representations

(Before 2012)

Images, objects and activities have often been represented as collections of local features, e.g. through DPMs (Deformable Part Models).

[Felzenszwalb et al., PAMI 2010]

Local appearance + deformation

SLIDE 14

Visual recognition (activities, gestures, objects)

Deep Learning + structured and semi-structured models → structured deep learning (representation learning)

Local context, global context, complex relationships

[Diagram: structured deep learning combining local features l1, l2, l4 with factors F, F2, F4]

SLIDE 15

Human attention: gaze patterns

[Johansson, Holsanova, Dewhurst, Holmqvist, 2012]

SLIDE 16

[Song et al., AAAI 2016]

Attention on joints

[Sharma et al., ICLR 2016] [Mnih et al., NIPS 2015]

Hard attention Soft attention in feature maps
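The hard/soft distinction can be illustrated with a minimal numpy sketch (the function names, joint features, and scores below are made up for the example, not the cited models): soft attention takes a differentiable softmax-weighted sum over per-joint features, while hard attention selects a single joint and is not differentiable.

```python
import numpy as np

def soft_attention(features, scores):
    """Soft attention: differentiable convex combination of joint features."""
    w = np.exp(scores - scores.max())
    w /= w.sum()                       # softmax weights over joints
    return w @ features                # weighted sum, shape (d,)

def hard_attention(features, scores):
    """Hard attention: pick the single highest-scoring joint (non-differentiable)."""
    return features[np.argmax(scores)]

# Toy example: 4 joints with 3-dimensional features each (values are illustrative).
feats = np.array([[1., 0., 0.],
                  [0., 1., 0.],
                  [0., 0., 1.],
                  [1., 1., 1.]])
scores = np.array([0.1, 2.0, 0.1, 0.1])   # joint 1 scores highest

soft = soft_attention(feats, scores)   # blend dominated by joint 1
hard = hard_attention(feats, scores)   # exactly joint 1's feature
```

With uniform scores, soft attention degenerates to average pooling over the joints, which is why it is often seen as a learnable generalization of global pooling.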

Timeline: local representations (before 2012) → deep learning, global (mostly after 2012) → deep learning, attention maps (~2016) → deep learning, local representations

SLIDE 17


Objective: fully trainable high-capacity local representations

1. Learn where to attend
2. Learn how to track attended points
3. Learn how to recognize from a local distributed representation

[Baradel, Wolf, Mille, Taylor, CVPR 2018]

Timeline: local representations (before 2012) → deep learning, global (mostly after 2012) → deep learning, attention maps (~2016) → deep learning, local representations

SLIDE 18

Attention in feature space

[Diagram: RGB input video over time → feature space; 3D global model: Inflated ResNet-50]

[Baradel, Wolf, Mille, Taylor, CVPR 2018]
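Attention in feature space can be sketched as softmax-weighted pooling over the spatial positions of a (C, H, W) feature tensor. This is a hedged toy version, not the CVPR 2018 model itself; the function name and all shapes are illustrative.

```python
import numpy as np

def attend(feature_map, attn_logits):
    """Pool a (C, H, W) feature map with a spatial soft-attention map."""
    a = np.exp(attn_logits - attn_logits.max())
    a /= a.sum()                                # (H, W) weights summing to 1
    return (feature_map * a).sum(axis=(1, 2))   # attended feature, shape (C,)

# Toy 2-channel, 3x3 feature map (values are illustrative).
fm = np.arange(2 * 3 * 3, dtype=float).reshape(2, 3, 3)
logits = np.zeros((3, 3))   # uniform attention == global average pooling
v = attend(fm, logits)
```

Uniform logits recover global average pooling; a sharply peaked map instead reads out the feature vector at a single spatial position, so the network can interpolate between global and local evidence.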

SLIDE 19

Unconstrained differentiable attention

[Baradel, Wolf, Mille, Taylor, CVPR 2018]

Frame context; hidden state from recurrent recognizers (workers); "differentiable crop" (Spatial Transformer Network)

Time
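The "differentiable crop" idea can be sketched with bilinear sampling on a regular grid, as in Spatial Transformer Networks: each output pixel is a smooth function of the crop's center and scale, so gradients can flow back into the attention process. The function name and parameters below are assumptions for illustration; the actual model operates on feature maps, not raw images.

```python
import numpy as np

def differentiable_crop(img, center_y, center_x, scale, out=5):
    """Bilinearly sample an out x out glimpse around (center_y, center_x)."""
    h, w = img.shape
    # Sampling grid in image coordinates, centred on the attended point.
    ys = np.clip(center_y + scale * np.linspace(-1, 1, out) * (h - 1) / 2, 0, h - 1)
    xs = np.clip(center_x + scale * np.linspace(-1, 1, out) * (w - 1) / 2, 0, w - 1)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None]   # fractional parts = bilinear weights
    wx = (xs - x0)[None, :]
    return ((1 - wy) * (1 - wx) * img[np.ix_(y0, x0)]
            + (1 - wy) * wx * img[np.ix_(y0, x1)]
            + wy * (1 - wx) * img[np.ix_(y1, x0)]
            + wy * wx * img[np.ix_(y1, x1)])

img = np.arange(25, dtype=float).reshape(5, 5)
glimpse = differentiable_crop(img, center_y=2, center_x=2, scale=1.0, out=5)
```

At scale 1 centred on the image, the sampling grid hits the pixel centres exactly and the crop reproduces the input; shrinking the scale zooms the glimpse in on the attended point.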

SLIDE 20

Distributed recognition

[Diagram: RGB input video over time → unconstrained attention in feature space (spatial attention process) + 3D global model (Inflated ResNet-50) → distributed tracking/recognition by workers, whose outputs r1, r2, r3 are fused into a final hypothesis h]
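The worker scheme can be sketched as several simple recurrent accumulators, each integrating its own glimpse sequence, with the per-worker class distributions fused by averaging. The toy recurrence and the shared linear classifier are stand-ins for the recurrent recognizers of the actual model; all names and shapes are made up for the sketch.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def run_workers(glimpse_seqs, weights):
    """Fuse class predictions from several recurrent 'workers'.

    glimpse_seqs: (n_workers, T, d) glimpse features, one sequence per worker
    weights:      (d, n_classes) shared linear classifier
    """
    preds = []
    for seq in glimpse_seqs:
        h = np.zeros(seq.shape[1])
        for x in seq:                 # toy recurrence: leaky integration
            h = 0.5 * h + 0.5 * x
        preds.append(softmax(h @ weights))
    return np.mean(preds, axis=0)     # fuse workers by averaging

rng = np.random.default_rng(0)
seqs = rng.normal(size=(3, 4, 8))     # 3 workers, 4 time steps, 8-dim glimpses
W = rng.normal(size=(8, 5))           # 5 activity classes
p = run_workers(seqs, W)              # fused class distribution
```

Averaging distributions keeps the fused output a valid distribution while letting each worker specialize on a different part of the glimpse cloud.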

SLIDE 21

Results

SLIDE 22

Dynamic visual attention

[Baradel, Wolf, Mille, Taylor, under review]

[Diagram: CNN features over time → recognition from an unstructured glimpse cloud]

State-of-the-art comparison

SOTA results on two datasets (NTU and N-UCLA); larger difference between glimpse clouds and the global model on N-UCLA.

[Baradel, Wolf, Mille, Taylor, CVPR 2018]

SLIDE 23

Results

[Baradel, Wolf, Mille, Taylor, under review]

Ablation study

[Baradel, Wolf, Mille, Taylor, CVPR 2018]

SLIDE 24

Pose conditioned attention

[Baradel, Wolf, Mille, Taylor, BMVC 2018]

SLIDE 25

AI vs. NI

2014 Nobel Prize in Medicine

Border cells; head-direction cells

SLIDE 26

AI vs. NI

2014 Nobel Prize in Medicine

SLIDE 27

AI vs. NI

2018: discovery of the same cell types in neural networks trained on similar tasks.

[Cueva, Wei, ICLR 2018]

SLIDE 28

AI vs. NI

Emergence of the different types of cells in the same order.

[Cueva, Wei, ICLR 2018]

SLIDE 29

Reasoning: what happened?

SLIDE 30

Human psychology

  • Daniel Kahneman (Nobel Prize in 2002)
  • Book: "Thinking, Fast and Slow"
SLIDE 31

Cognitive tasks

24*17 = ?

SLIDE 32

Two systems

System 1

  • Continuously monitors the environment (and the mind)
  • No specific attention
  • Continuously generates assessments / judgments without effort, even with little data; jumps to conclusions
  • Prone to errors; no capability for statistics

System 2

  • Receives questions or generates them
  • Directs attention and searches memory to find answers
  • Requires (possibly a lot of) effort
  • More reliable
SLIDE 33

Where is ML today?

Claim: AI requires a combination of

  • Extraction of high-level information from high-dimensional input (visual, audio, language): machine learning
  • High-level reasoning: compare, assess, focus attention, perform logical deductions

Roadmap:

Estimating semantics from low-level information (Vision & Learning) → estimating causal relationships from data → reasoning: logic + statistics

SLIDE 34

Object level Visual Reasoning

Fabien Baradel, PhD @ LIRIS, INSA-Lyon · Christian Wolf, INRIA Chroma · Julien Mille, LI, INSA VdL · Natalia Neverova, Facebook AI Research, Paris · Greg Mori, Simon Fraser University, Canada

[Baradel, Neverova, Wolf, Mille, Mori, ECCV 2018]

SLIDE 35

Object level Visual Reasoning

[Baradel, Neverova, Wolf, Mille, Mori, ECCV 2018]

SLIDE 36

Object level Visual Reasoning

[Baradel, Neverova, Wolf, Mille, Mori, ECCV 2018]

SLIDE 37

Learned interactions

Class: person-book interaction

SLIDE 38

Failure cases

SLIDE 39

Results

Something-Something dataset · VLOG dataset · EPIC-Kitchens dataset

SLIDE 40

Conclusion

  • We propose models which recognize activities from
    – a cloud of unconstrained feature points
    – interactions between spatially well-defined objects
  • Visual spatial attention is useful and competitive compared to pose
  • State-of-the-art performance on 5 datasets (NTU RGB-D, Northwestern-UCLA, VLOG, Something-Something, EPIC-Kitchens)
  • Reasoning is a key component of human cognition, and also important for AI systems